Url Scrapper Connector

Connector Configuration

The Url Scrapper connector synchronizes content from specific web pages to Wikit Semantics. It retrieves the HTML content of the specified pages, converts it to markdown, and indexes it in your knowledge base.

| Field name | Format / Type | Required | Comment |
| --- | --- | --- | --- |
| URLs list | List of texts (URLs) | ✅ | List of URLs of the web pages to synchronize. If a URL is invalid, it will not be added to the list. |
| Service URL | URL | ✅ | URL of the Scrappy service used for content extraction. Select the corresponding environment (Development, Preproduction, or Production). |
| Page options | Configuration object | | Options to customize the extraction of page content. |
| Only the main content | Yes / No | | If enabled, only the main content of the page is extracted, excluding peripheral elements (menu, footer, etc.). |
| CSS selectors to exclude | List of texts | | Lets you exclude certain parts of the content via CSS selectors (for example .nav, #footer, .sidebar). |
| Headers | Key/value object | | Lets you add information to the headers of HTTP requests (for example, for authentication or custom parameters). |

💡 Fields marked ✅ are required for the connector to work.
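As noted above, invalid URLs are silently dropped from the list. The check can be sketched as basic syntactic validation, for instance with Python's standard library (the function name is illustrative, not the connector's actual code):

```python
from urllib.parse import urlparse

def filter_valid_urls(urls):
    """Keep only syntactically valid http(s) URLs; other entries are dropped,
    mirroring the 'URLs list' behaviour described in the table above."""
    valid = []
    for url in urls:
        parts = urlparse(url)
        # a usable URL needs an http/https scheme and a host
        if parts.scheme in ("http", "https") and parts.netloc:
            valid.append(url)
    return valid
```

For example, `filter_valid_urls(["https://example.com/docs", "not a url"])` keeps only the first entry.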

How the Connector Works

The Url Scrapper connector operates in smart synchronization mode:

  • On-demand synchronization: The connector is triggered manually or according to the schedule configured in Wikit Connect.
  • Content extraction: For each configured URL, the connector retrieves the HTML content of the page, then converts it to markdown format via the Scrappy service.
  • Extracted metadata: The connector automatically retrieves page metadata (title, description, keywords, Open Graph, language, etc.).
  • Change management: The connector compares the extracted documents with those already present in Semantics to determine whether to insert, update, or delete documents.
  • Automatic deletion: Pages that are no longer present in the URLs list are automatically removed from the knowledge base.

The connector processes URLs sequentially and applies page options (main content extraction, CSS selector exclusion) uniformly to all URLs.
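The insert/update/delete decision described above can be sketched as a comparison between the freshly extracted documents and those already indexed, assuming documents are keyed by URL and compared by content hash (a simplified illustration, not the connector's actual implementation):

```python
import hashlib

def plan_sync(extracted: dict[str, str], indexed: dict[str, str]) -> dict[str, list[str]]:
    """Decide what to do with each URL.

    extracted: url -> markdown content produced by this run
    indexed:   url -> content hash already stored in Semantics
    """
    digest = lambda text: hashlib.sha256(text.encode("utf-8")).hexdigest()
    plan = {"insert": [], "update": [], "delete": []}
    for url, content in extracted.items():
        if url not in indexed:
            plan["insert"].append(url)      # new page, not yet in the knowledge base
        elif indexed[url] != digest(content):
            plan["update"].append(url)      # content changed since the last sync
    # pages removed from the URLs list are deleted from the knowledge base
    plan["delete"] = [url for url in indexed if url not in extracted]
    return plan
```

Unchanged pages (same URL, same hash) fall into none of the three buckets and are left as-is.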

Prerequisites

Before configuring the connector in the Wikit Connect console:

  1. Network access: Ensure that the Wikit Connect server has network access to the URLs you want to synchronize.
  2. Public or accessible URLs: URLs must be publicly accessible or accessible via configured HTTP headers (basic authentication, tokens, etc.).
  3. Scrappy service: You must have a functional Scrappy service URL (provided by Wikit according to your environment).
  4. Valid HTML content: Web pages must return structured HTML content for optimal extraction.

FAQ

What is the difference between "Only the main content" and "CSS selectors to exclude"?

The "Only the main content" option activates an automatic algorithm that attempts to identify and extract only the main content of the page (article, body text), excluding peripheral elements such as menus, footers, and sidebars.

The "CSS selectors to exclude" option allows you to manually specify precise CSS selectors to exclude specific elements (for example .advertising, #comments, .related-articles). This option is more precise and gives you full control over what should be excluded.

You can combine both options for optimal results.
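To make the exclusion mechanism concrete, here is a minimal sketch of dropping elements that match simple class (`.name`) or id (`#name`) selectors, using only Python's standard library. Real scrapers use a full CSS engine; this toy parser only supports the two selector forms shown above:

```python
from html.parser import HTMLParser

class SelectorExcluder(HTMLParser):
    """Collect page text while skipping elements matched by .class / #id selectors."""

    def __init__(self, selectors):
        super().__init__()
        self.classes = {s[1:] for s in selectors if s.startswith(".")}
        self.ids = {s[1:] for s in selectors if s.startswith("#")}
        self.out = []
        self.skip_depth = 0  # > 0 while inside an excluded subtree

    def _excluded(self, attrs):
        attrs = dict(attrs)
        if attrs.get("id") in self.ids:
            return True
        return bool(self.classes & set((attrs.get("class") or "").split()))

    def handle_starttag(self, tag, attrs):
        # once inside an excluded element, every nested tag deepens the skip
        if self.skip_depth or self._excluded(attrs):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(data)
```

Feeding `<nav class="nav">menu</nav><p>body text</p>` with selectors `[".nav"]` keeps only "body text".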

How do I synchronize web pages protected by authentication?

To synchronize web pages requiring authentication, use the "Headers" field to add the necessary authentication information:

  • HTTP basic authentication: Add an Authorization header with the value Basic [base64(username:password)]
  • Authentication token: Add an Authorization header with the value Bearer [your-token]
  • Session cookie: If your pages require session cookie authentication, use the Url Scrapper Cookie Auth connector instead, which automatically handles login and cookie retrieval.
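The first two header values can be built as follows (a small helper sketch; the function names are illustrative):

```python
import base64

def basic_auth_header(username: str, password: str) -> dict[str, str]:
    """Authorization header for HTTP basic authentication: Basic base64(username:password)."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return {"Authorization": f"Basic {token}"}

def bearer_auth_header(token: str) -> dict[str, str]:
    """Authorization header for token-based authentication: Bearer <your-token>."""
    return {"Authorization": f"Bearer {token}"}
```

For example, `basic_auth_header("user", "pass")` yields `{"Authorization": "Basic dXNlcjpwYXNz"}`; paste the resulting key and value into the connector's "Headers" field.
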

What happens if a URL becomes inaccessible or returns an error?

If a URL returns an error (404, 500, timeout, etc.) during synchronization:

  • The connector records the error in the synchronization logs
  • Other URLs continue to be processed normally
  • If a document already existed for this URL in Semantics, it remains unchanged (it is not deleted)
  • You can view the errors in the connector's synchronization history in the Wikit Connect console

It is recommended to regularly check the synchronization logs to identify and correct failing URLs.
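The error-isolation behaviour described above amounts to a per-URL try/except around the fetch step, so one failing page never aborts the run. A minimal sketch (assuming a `fetch` callable supplied by the caller; not the connector's actual code):

```python
def sync_urls(urls, fetch):
    """Process URLs sequentially; a failure on one URL is recorded and does
    not stop the others. Existing documents for failed URLs are left untouched."""
    results, errors = {}, {}
    for url in urls:
        try:
            results[url] = fetch(url)
        except Exception as exc:      # 404, 500, timeout, ...
            errors[url] = str(exc)    # would be written to the sync logs
    return results, errors
```

Inspecting the returned `errors` mapping corresponds to reviewing the synchronization history in the Wikit Connect console.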

How do I identify the right CSS selectors to exclude?

To identify CSS selectors to exclude:

  1. Open the web page in your browser
  2. Use the developer tools (right-click > "Inspect element" or F12)
  3. Identify the HTML elements you want to exclude (navigation menu, ads, comments, etc.)
  4. Note the CSS classes (.class-name) or IDs (#identifier) of these elements
  5. Add these selectors to the "CSS selectors to exclude" field

Common examples: .header, .footer, .sidebar, .nav, .advertisement, #comments, .related-posts