Url Scrapper Cookie Auth Connector

Connector Configuration

The Url Scrapper Cookie Auth connector synchronizes content from authentication-protected web pages to Wikit Semantics. It automatically submits the login form to obtain the necessary session cookies, then retrieves the content of the protected pages.

Field name	Format / Type	Required	Comment
URLs list	List of texts (URLs)	✅	List of URLs of the web pages to synchronize. If a URL is invalid, it will not be added to the list.
Service URL	URL	✅	URL of the Scrappy service used for content extraction. Select the corresponding environment (Development, Preproduction, or Production).
Page options	Configuration object	➖	Options to customize the extraction of page content.
Only the main content	Yes / No	➖	If enabled, only the main content of the page will be extracted, excluding peripheral elements (menu, footer, etc.).
CSS selectors to exclude	List of texts	➖	Allows you to exclude certain parts of the content via CSS selectors (example: `.nav`, `#footer`, `.sidebar`).
Headers	Key/value object	➖	Adds custom headers to the HTTP requests (for example, to pass additional custom parameters).
Authentication	Configuration object	✅	Authentication configuration to access protected pages.
Login page URL	URL	✅	Complete URL of the login page where the authentication form is located.
Username	Free text	✅	Login identifier to access protected pages.
Password	Password / Token	✅	Password associated with the username.
Hidden fields	Key/value object	➖	Additional hidden fields from the login form (for example: `csrf_token`, `redirect_url`, `domain`, etc.).

💡 Fields marked ✅ are required for the connector to work.
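
To make the shape of this configuration concrete, here is a sketch of it as a Python dictionary, with a small helper that reports missing required fields. The key names (`urls`, `service_url`, `page_options`, `auth`, and so on), the helper `missing_required`, and every URL and value are hypothetical; the Wikit Connect console may use different identifiers. Only the meaning of each field comes from the table above.

```python
# Hypothetical key names and placeholder values, for illustration only.
config = {
    "urls": ["https://intranet.example.com/docs/page-1"],
    "service_url": "https://scrappy.example.com",
    "page_options": {
        "only_main_content": True,
        "exclude_selectors": [".nav", "#footer", ".sidebar"],
    },
    "headers": {"X-Custom-Param": "value"},
    "auth": {
        "login_url": "https://intranet.example.com/login",
        "username": "svc-sync",
        "password": "change-me",
        "hidden_fields": {"domain": "corp"},
    },
}

# Paths of the fields marked as required in the table.
REQUIRED = [
    ("urls",),
    ("service_url",),
    ("auth",),
    ("auth", "login_url"),
    ("auth", "username"),
    ("auth", "password"),
]


def missing_required(cfg):
    """Return dotted paths of required fields that are absent or empty."""
    missing = []
    for path in REQUIRED:
        node = cfg
        for key in path:
            node = node.get(key) if isinstance(node, dict) else None
        if not node:
            missing.append(".".join(path))
    return missing
```

Calling `missing_required(config)` on the dictionary above returns an empty list; removing the password would return `["auth.password"]`.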

How the Connector Works

The Url Scrapper Cookie Auth connector runs in smart synchronization mode and manages authentication automatically.

Synchronization Process

Automatic login: At the beginning of each synchronization, the connector accesses the login page and automatically detects the authentication form.
Cookie retrieval: The connector fills in the form with the provided credentials (username, password, and any hidden fields), submits it, then retrieves the generated session cookies.
Content extraction: For each configured URL, the connector uses the session cookies to access protected pages, retrieves their HTML content, then converts it to markdown via the Scrappy service.
Extracted metadata: The connector automatically retrieves metadata from each page (title, description, keywords, Open Graph, language, etc.).
Change management: The connector compares the extracted documents with those already present in Semantics to determine whether to insert, update, or delete documents.
Automatic deletion: Pages that are no longer present in the URLs list are automatically removed from the knowledge base.
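
The change-management and deletion steps amount to a three-way comparison. The sketch below assumes each document is keyed by its URL and that changes are detected by comparing a content hash; the criteria Semantics actually uses are not described here, and `content_hash` and `plan_changes` are hypothetical names.

```python
import hashlib


def content_hash(markdown: str) -> str:
    """Fingerprint of a page's markdown content (assumed comparison key)."""
    return hashlib.sha256(markdown.encode("utf-8")).hexdigest()


def plan_changes(extracted: dict, existing: dict) -> dict:
    """Decide what to insert, update, or delete.

    extracted: url -> markdown produced by the current synchronization
    existing:  url -> content hash already stored in the knowledge base
    """
    to_insert = sorted(url for url in extracted if url not in existing)
    to_update = sorted(
        url
        for url in extracted
        if url in existing and content_hash(extracted[url]) != existing[url]
    )
    # URLs that were synchronized before but are no longer in the URLs list.
    to_delete = sorted(url for url in existing if url not in extracted)
    return {"insert": to_insert, "update": to_update, "delete": to_delete}
```

A page whose hash is unchanged appears in none of the three lists, so it is left untouched.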

Session cookies are valid only for the duration of the synchronization and are regenerated with each new connector execution.
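
The login, cookie retrieval, and page retrieval steps can be sketched with the Python standard library. This is an illustration under stated assumptions, not the connector's implementation: the function names are hypothetical, the form field names default to `username` and `password` (the connector detects them automatically), the form is assumed to post to the login URL itself, and the markdown conversion done by the Scrappy service is left out.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def build_login_request(login_url, username, password, hidden_fields=None,
                        username_field="username", password_field="password"):
    """Build the POST request that submits the login form."""
    fields = dict(hidden_fields or {})
    fields[username_field] = username
    fields[password_field] = password
    data = urllib.parse.urlencode(fields).encode("utf-8")
    return urllib.request.Request(login_url, data=data, method="POST")


def new_session():
    """Create an opener with an empty cookie jar, one per synchronization."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def fetch_protected_pages(login_request, urls, timeout=30):
    """Log in, then fetch each URL with the session cookies from the login."""
    opener, jar = new_session()
    with opener.open(login_request, timeout=timeout):
        pass  # the response body is not needed, only the cookies it sets
    if len(jar) == 0:
        raise RuntimeError("Login returned no cookies; check the credentials")
    pages = {}
    for url in urls:
        with opener.open(url, timeout=timeout) as response:
            pages[url] = response.read().decode("utf-8", errors="replace")
    return pages
```

Because `new_session` is called inside `fetch_protected_pages`, every run starts with an empty cookie jar, which mirrors the behaviour described above: cookies live only for the duration of one synchronization.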

Prerequisites

Before configuring the connector in the Wikit Connect console:

Service account: Create a user account dedicated to synchronization with the necessary access rights to the pages to be synchronized.
Accessible login page: The login page must be accessible from the Wikit Connect server and contain a standard HTML form.
Compatible authentication form: The login form must be a standard HTML form that does not rely on JavaScript to be submitted. Pages using OAuth, SAML authentication, or other complex mechanisms are not supported by this connector.
Hidden fields identification: If the login form contains hidden fields (CSRF tokens, redirect URL, etc.), you must identify them beforehand (via browser developer tools) and configure them in the "Hidden fields" field.
Scrappy service: You must have a functional Scrappy service URL (provided by Wikit according to your environment).
URLs accessible after authentication: The URLs to synchronize must be accessible once authenticated with the configured service account.
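
Before configuring the connector, you can roughly check whether a login page meets the "standard HTML form" prerequisite. The sketch below parses a page's HTML and looks for a `<form>` that contains a password input. `LoginFormFinder` and `find_login_form` are hypothetical names, and the heuristic is an assumption made for illustration: a form rendered by JavaScript after page load will not appear in the raw HTML, so a `None` result suggests the page may be incompatible.

```python
from html.parser import HTMLParser


class LoginFormFinder(HTMLParser):
    """Collect every <form> in the page together with its <input> elements."""

    def __init__(self):
        super().__init__()
        self.forms = []
        self._current = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = {
                "action": attrs.get("action") or "",
                "method": (attrs.get("method") or "get").lower(),
                "inputs": [],
            }
        elif tag == "input" and self._current is not None:
            self._current["inputs"].append({
                "name": attrs.get("name"),
                "type": (attrs.get("type") or "text").lower(),
            })

    def handle_endtag(self, tag):
        if tag == "form" and self._current is not None:
            self.forms.append(self._current)
            self._current = None


def find_login_form(html):
    """Return the first form that has a password input, or None."""
    finder = LoginFormFinder()
    finder.feed(html)
    for form in finder.forms:
        if any(field["type"] == "password" for field in form["inputs"]):
            return form
    return None
```

The returned dictionary also lists the input names, which helps when checking whether the form uses unusual names for the username and password fields.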

FAQ

How do I identify hidden fields in the login form?

To identify hidden fields in the form:

Access the login page in your browser
Open developer tools (F12 or right-click > "Inspect")
Go to the "Elements" or "Inspector" tab
Locate the `<form>` tag of the login form
Search for `<input type="hidden">` tags inside the form
Note the `name` and `value` attributes of these hidden fields

Example: If you find `<input type="hidden" name="csrf_token" value="abc123">`, add the following to the "Hidden fields" field: `{"csrf_token": "abc123"}`

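Important: If a hidden field's value changes with each page load (for example, a dynamic CSRF token), the static value configured in "Hidden fields" may no longer be valid when the connector submits the form. In that case, you may need to use another type of connector or contact Wikit support.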
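
As an alternative to reading the markup by hand, the hidden fields can be listed with a short script. This is a sketch using only the Python standard library; `HiddenFieldCollector` and `hidden_fields` are hypothetical names, and the sample HTML is made up. It collects hidden inputs from every form on the page, so on a page with several forms you still need to keep only those belonging to the login form.

```python
import json
from html.parser import HTMLParser


class HiddenFieldCollector(HTMLParser):
    """Collect name/value pairs of hidden inputs located inside <form> tags."""

    def __init__(self):
        super().__init__()
        self.in_form = False
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self.in_form = True
        elif (tag == "input" and self.in_form
              and (attrs.get("type") or "").lower() == "hidden"
              and attrs.get("name")):
            self.fields[attrs["name"]] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form":
            self.in_form = False


def hidden_fields(html):
    """Return the hidden fields found in the page as a dictionary."""
    collector = HiddenFieldCollector()
    collector.feed(html)
    return collector.fields


login_html = (
    '<form method="post">'
    '<input type="hidden" name="csrf_token" value="abc123">'
    '<input name="user"><input type="password" name="pass">'
    '</form>'
)
print(json.dumps(hidden_fields(login_html)))  # prints {"csrf_token": "abc123"}
```

Running the script on two separate downloads of the same login page and comparing the output is a quick way to spot values that change on every load.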

The connector fails with an authentication error. What should I do?

If the connector returns an authentication error, check the following:

Correct credentials: Verify that the username and password are correct by logging in manually on the login page.
Correct login URL: Ensure that the authentication page URL is accurate and accessible.
Hidden fields: Verify that all required hidden fields are correctly configured with their values.
Field names: The connector automatically detects form fields for username and password. If your form uses non-standard names, authentication may fail.
Complex authentication: If your page uses OAuth, SAML authentication, two-factor authentication (2FA), or a JavaScript form, this connector is not compatible. Contact Wikit support for alternative solutions.

Check the synchronization logs in the Wikit Connect console for more details about the error.

Do session cookies expire between synchronizations?

Yes. Session cookies are not retained between synchronizations. Each connector execution performs the complete authentication process:

New connection to the login page
Form submission with credentials
Retrieval of new session cookies
Use of these cookies to access protected pages

This approach ensures that the connector always uses valid cookies, even if session lifetime is short on your target system.

How does it differ from the standard Url Scrapper connector?

Criterion	Url Scrapper	Url Scrapper Cookie Auth
Accessible pages	Public pages, or pages with HTTP header authentication	Pages protected by a login form
Authentication	Via HTTP headers (Basic Auth, Bearer Token, etc.)	Via HTML form with session cookies
Configuration	Simpler (no session management)	Requires login credentials configuration
Use cases	Public pages, APIs with tokens	Intranets, member areas, platforms with login

Use the Url Scrapper Cookie Auth connector if your pages require login via a standard HTML form. Use the standard Url Scrapper connector for public pages and for pages that authenticate via HTTP headers.
