Url Scrapper Connector
Connector Configuration
The Url Scrapper connector synchronizes content from specific web pages to Wikit Semantics. It retrieves the HTML content of each specified page, converts it to Markdown, and indexes the result in your knowledge base.
| Field name | Format / Type | Required | Comment |
|---|---|---|---|
| URLs list | List of texts (URLs) | ✅ | List of URLs of the web pages to synchronize. If a URL is invalid, it will not be added to the list. |
| Service URL | URL | ✅ | URL of the Scrappy service used for content extraction. Select the corresponding environment (Development, Preproduction, or Production). |
| Page options | Configuration object | ➖ | Options to customize the extraction of page content. |
| Only the main content | Yes / No | ➖ | If enabled, only the main content of the page will be extracted, excluding peripheral elements (menu, footer, etc.). |
| CSS selectors to exclude | List of texts | ➖ | Allows you to exclude certain parts of the content via CSS selectors (example: .nav, #footer, .sidebar). |
| Headers | Key/value object | ➖ | Custom headers added to the HTTP requests sent to the pages (for example, for authentication or custom parameters). |
💡 Fields marked ✅ are required for the connector to work.
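The table states that invalid URLs are not added to the list, but it does not say what counts as invalid. As a rough way to pre-check a list before entering it, the sketch below assumes that a valid entry is an absolute `http` or `https` URL with a host. The function names `is_valid_url` and `filter_urls` are hypothetical and are not part of Wikit Connect.

```python
from urllib.parse import urlparse


def is_valid_url(candidate: str) -> bool:
    """Assumed rule: an absolute http(s) URL that includes a host."""
    try:
        parsed = urlparse(candidate.strip())
    except ValueError:
        # urlparse raises ValueError on some malformed inputs (e.g. bad IPv6 brackets).
        return False
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def filter_urls(candidates: list[str]) -> list[str]:
    """Keep only the entries that pass the assumed validity rule."""
    return [c.strip() for c in candidates if is_valid_url(c)]


if __name__ == "__main__":
    urls = ["https://example.com/docs", "not a url", "ftp://example.com"]
    print(filter_urls(urls))
```

The connector's own validation may be stricter or looser than this rule, so treat the result as a preliminary check only.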
How the Connector Works
The Url Scrapper connector operates in smart synchronization mode:
- On-demand synchronization: The connector is triggered manually or according to the schedule configured in Wikit Connect.
- Content extraction: For each configured URL, the connector retrieves the HTML content of the page, then converts it to markdown format via the Scrappy service.
- Extracted metadata: The connector automatically retrieves page metadata (title, description, keywords, Open Graph, language, etc.).
- Change management: The connector compares the extracted documents with those already present in Semantics to determine whether to insert, update, or delete documents.
- Automatic deletion: Pages that are no longer present in the URLs list are automatically removed from the knowledge base.
The connector processes URLs sequentially and applies page options (main content extraction, CSS selector exclusion) uniformly to all URLs.
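The change-management step can be pictured with a small sketch. The documentation does not describe how documents are compared, so the version below assumes a comparison by content hash. `SyncPlan`, `fingerprint`, and `plan_sync` are hypothetical names used only for illustration.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class SyncPlan:
    insert: list = field(default_factory=list)
    update: list = field(default_factory=list)
    delete: list = field(default_factory=list)
    unchanged: list = field(default_factory=list)


def fingerprint(markdown: str) -> str:
    """Content hash used here as the (assumed) change-detection criterion."""
    return hashlib.sha256(markdown.encode("utf-8")).hexdigest()


def plan_sync(configured_urls, extracted, indexed) -> SyncPlan:
    """
    configured_urls: URLs currently in the connector's URLs list.
    extracted: url -> Markdown for pages extracted successfully in this run.
    indexed: url -> fingerprint of the documents already in the knowledge base.
    """
    plan = SyncPlan()
    for url in configured_urls:
        if url not in extracted:
            # Extraction failed for this URL: any existing document is left as-is.
            continue
        if url not in indexed:
            plan.insert.append(url)
        elif indexed[url] != fingerprint(extracted[url]):
            plan.update.append(url)
        else:
            plan.unchanged.append(url)
    # Only URLs removed from the configured list are scheduled for deletion.
    configured = set(configured_urls)
    plan.delete = [url for url in indexed if url not in configured]
    return plan
```

In this sketch, a URL that is still configured but could not be extracted appears in none of the four lists, which matches the documented behavior that failing URLs do not cause deletions.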
Prerequisites
Before configuring the connector in the Wikit Connect console:
- Network access: Ensure that the Wikit Connect server has network access to the URLs you want to synchronize.
- Public or accessible URLs: URLs must be publicly accessible or accessible via configured HTTP headers (basic authentication, tokens, etc.).
- Scrappy service: You must have a functional Scrappy service URL (provided by Wikit according to your environment).
- Valid HTML content: Web pages must return structured HTML content for optimal extraction.
FAQ
What is the difference between "Only the main content" and "CSS selectors to exclude"?
The "Only the main content" option activates an automatic algorithm that attempts to identify and extract only the main content of the page (article, body text), excluding peripheral elements such as menus, footers, and sidebars.
The "CSS selectors to exclude" option allows you to manually specify precise CSS selectors to exclude specific elements (for example .advertising, #comments, .related-articles). This option is more precise and gives you full control over what should be excluded.
You can combine both options for optimal results.
How do I synchronize web pages protected by authentication?
To synchronize web pages requiring authentication, use the "Headers" field to add the necessary authentication information:
- HTTP basic authentication: Add an `Authorization` header with the value `Basic [base64(username:password)]`
- Authentication token: Add an `Authorization` header with the value `Bearer [your-token]`
- Session cookie: If your pages require session cookie authentication, use the Url Scrapper Cookie Auth connector instead, which automatically handles login and cookie retrieval.
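The header values can be computed ahead of time and pasted into the "Headers" field. The sketch below uses only the Python standard library. The helper names and the sample credentials are placeholders, and how the key/value pair is entered in the console is not shown here.

```python
import base64


def basic_auth_header(username: str, password: str) -> dict[str, str]:
    """Build the Authorization header for HTTP basic authentication."""
    raw = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return {"Authorization": f"Basic {token}"}


def bearer_auth_header(token: str) -> dict[str, str]:
    """Build the Authorization header for token-based authentication."""
    return {"Authorization": f"Bearer {token}"}


if __name__ == "__main__":
    # Placeholder credentials, for illustration only.
    print(basic_auth_header("user", "pass"))
    print(bearer_auth_header("your-token"))
```

Base64 is an encoding, not encryption, so a basic-authentication value should be treated as a secret just like the password itself.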
What happens if a URL becomes inaccessible or returns an error?
If a URL returns an error (404, 500, timeout, etc.) during synchronization:
- The connector records the error in the synchronization logs
- Other URLs continue to be processed normally
- If a document already existed for this URL in Semantics, it remains unchanged (it is not deleted)
- You can view the errors in the connector's synchronization history in the Wikit Connect console
It is recommended to regularly check the synchronization logs to identify and correct failing URLs.
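The error handling described above amounts to isolating failures per URL. The sketch below illustrates the idea under assumptions: `fetch` stands in for whatever call retrieves and converts a page, and `extract_all` is a hypothetical name, not the connector's actual code.

```python
from typing import Callable


def extract_all(urls: list[str], fetch: Callable[[str], str]):
    """Process URLs one by one; a failure on one URL does not stop the others."""
    documents: dict[str, str] = {}
    errors: dict[str, str] = {}
    for url in urls:
        try:
            documents[url] = fetch(url)
        except Exception as exc:
            # Record the failure (as a sync log would) and move on to the next URL.
            errors[url] = f"{type(exc).__name__}: {exc}"
    return documents, errors


if __name__ == "__main__":
    def fake_fetch(url: str) -> str:
        # Stand-in for the real extraction call; fails for one URL on purpose.
        if "broken" in url:
            raise TimeoutError("no response")
        return f"# Content of {url}"

    docs, errs = extract_all(["https://example.com/a", "https://example.com/broken"], fake_fetch)
    print(sorted(docs), errs)
```

URLs that end up in `errors` would simply be skipped for this run, leaving any previously indexed document in place.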
How do I identify the right CSS selectors to exclude?
To identify CSS selectors to exclude:
- Open the web page in your browser
- Use the developer tools (right-click > "Inspect element" or F12)
- Identify the HTML elements you want to exclude (navigation menu, ads, comments, etc.)
- Note the CSS classes (`.class-name`) or IDs (`#identifier`) of these elements
- Add these selectors to the "CSS selectors to exclude" field
Common examples: .header, .footer, .sidebar, .nav, .advertisement, #comments, .related-posts
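To get a rough idea of what a set of selectors would remove before adding it to the connector, a local preview can help. The sketch below is not how the Scrappy service works. It is a simplified stand-in that understands only plain `.class` and `#id` selectors, assumes reasonably well-formed HTML, and uses the hypothetical names `ExcludingExtractor` and `visible_text`.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ExcludingExtractor(HTMLParser):
    """Collect page text while skipping elements matched by simple selectors."""

    def __init__(self, selectors):
        super().__init__(convert_charrefs=True)
        self.classes = {s[1:] for s in selectors if s.startswith(".")}
        self.ids = {s[1:] for s in selectors if s.startswith("#")}
        self.skip_depth = 0  # greater than 0 while inside an excluded element
        self.chunks = []

    def _is_excluded(self, attrs):
        attr_map = dict(attrs)
        element_classes = set((attr_map.get("class") or "").split())
        return bool(element_classes & self.classes) or attr_map.get("id") in self.ids

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.skip_depth or self._is_excluded(attrs):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html: str, selectors: list[str]) -> str:
    """Return the text that remains once the excluded elements are dropped."""
    parser = ExcludingExtractor(selectors)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

For instance, calling `visible_text(page_html, [".nav", "#footer"])` on a saved copy of a page shows whether the navigation and footer text disappear while the article body stays. Compound selectors such as `div.sidebar > ul` are outside the scope of this sketch.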