Url Scrapper Cookie Auth Connector
Connector Configuration
The Url Scrapper Cookie Auth connector synchronizes content from authentication-protected web pages to Wikit Semantics. It automatically logs in through a login form to obtain the necessary session cookies, then retrieves the content of the protected pages.
| Field name | Format / Type | Required | Comment |
|---|---|---|---|
| URLs list | List of texts (URLs) | ✅ | List of URLs of the web pages to synchronize. If a URL is invalid, it will not be added to the list. |
| Service URL | URL | ✅ | URL of the Scrappy service used for content extraction. Select the corresponding environment (Development, Preproduction, or Production). |
| Page options | Configuration object | ➖ | Options to customize the extraction of page content. |
| Only the main content | Yes / No | ➖ | If enabled, only the main content of the page will be extracted, excluding peripheral elements (menu, footer, etc.). |
| CSS selectors to exclude | List of texts | ➖ | Allows you to exclude certain parts of the content via CSS selectors (example: .nav, #footer, .sidebar). |
| Headers | Key/value object | ➖ | Allows you to add custom headers to the HTTP requests (for example, to pass additional custom parameters). |
| Authentication | Configuration object | ✅ | Authentication configuration to access protected pages. |
| Login page URL | URL | ✅ | Complete URL of the login page where the authentication form is located. |
| Username | Free text | ✅ | Login identifier to access protected pages. |
| Password | Password / Token | ✅ | Password associated with the username. |
| Hidden fields | Key/value object | ➖ | Additional hidden fields from the login form (for example: csrf_token, redirect_url, domain, etc.). |
💡 Fields marked ✅ are required for the connector to work.
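As an illustration, the table above can be pictured as a nested structure with a check on the required fields. This is only a sketch: the key names (`urls`, `service_url`, `page_options`, `authentication`, and so on), the example URLs, and the `missing_required` helper are hypothetical and chosen for readability. The Wikit Connect console exposes these settings as form fields, and its internal identifiers may differ.

```python
# Hypothetical key names mirroring the field labels of the table above.
# They are NOT the identifiers used by Wikit Connect.
REQUIRED_FIELDS = [
    ("urls",),
    ("service_url",),
    ("authentication", "login_url"),
    ("authentication", "username"),
    ("authentication", "password"),
]


def missing_required(config: dict) -> list[str]:
    """Return the dotted paths of required fields that are absent or empty."""
    missing = []
    for path in REQUIRED_FIELDS:
        node = config
        for key in path:
            # Walk down the nested dictionaries; stop at None if a level is missing.
            node = node.get(key) if isinstance(node, dict) else None
        if not node:
            missing.append(".".join(path))
    return missing


# Example values are placeholders, not real endpoints or credentials.
example_config = {
    "urls": ["https://intranet.example.com/docs/page-1"],
    "service_url": "https://scrappy.example.com",
    "page_options": {
        "only_main_content": True,
        "exclude_selectors": [".nav", "#footer", ".sidebar"],
    },
    "headers": {"X-Custom-Param": "value"},
    "authentication": {
        "login_url": "https://intranet.example.com/login",
        "username": "svc-sync",
        "password": "change-me",
        "hidden_fields": {"redirect_url": "/home"},
    },
}

print(missing_required(example_config))
```

The optional fields (page options, headers, hidden fields) are deliberately left out of the check, matching the ➖ markers in the table.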
How the Connector Works
The Url Scrapper Cookie Auth connector operates in smart synchronization mode with automatic authentication management:
Synchronization Process
Automatic login: At the beginning of each synchronization, the connector accesses the login page and automatically detects the authentication form.
Cookie retrieval: The connector fills in the form with the provided credentials (username, password, and any hidden fields), submits it, then retrieves the generated session cookies.
Content extraction: For each configured URL, the connector uses the session cookies to access protected pages, retrieves their HTML content, then converts it to markdown via the Scrappy service.
Extracted metadata: The connector automatically retrieves metadata from each page (title, description, keywords, Open Graph, language, etc.).
Change management: The connector compares the extracted documents with those already present in Semantics to determine whether to insert, update, or delete documents.
Automatic deletion: Pages that are no longer present in the URLs list are automatically removed from the knowledge base.
Session cookies are valid only for the duration of the synchronization and are regenerated with each new connector execution.
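The change management and automatic deletion steps described above amount to comparing two sets of documents. The sketch below shows one way such a comparison could work. It assumes documents are keyed by URL and compared by a hash of their markdown content; the documentation does not state how the connector actually detects changes, and the names `content_hash` and `plan_sync` are hypothetical.

```python
import hashlib


def content_hash(text: str) -> str:
    """Fingerprint a document so that two versions can be compared cheaply."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_sync(extracted: dict[str, str], existing: dict[str, str]) -> dict[str, list[str]]:
    """Decide which URLs to insert, update, or delete.

    extracted: URL -> markdown produced during the current synchronization.
    existing:  URL -> content hash of the document already stored.
    (Both shapes are assumptions made for this sketch.)
    """
    plan = {"insert": [], "update": [], "delete": []}
    for url, markdown in extracted.items():
        if url not in existing:
            plan["insert"].append(url)   # new page
        elif existing[url] != content_hash(markdown):
            plan["update"].append(url)   # content changed
    # Pages stored earlier but no longer extracted are removed.
    plan["delete"] = [url for url in existing if url not in extracted]
    return plan


stored = {
    "https://intranet.example.com/a": content_hash("A"),
    "https://intranet.example.com/b": content_hash("old text"),
    "https://intranet.example.com/c": content_hash("C"),
}
current = {
    "https://intranet.example.com/a": "A",
    "https://intranet.example.com/b": "new text",
    "https://intranet.example.com/d": "D",
}
print(plan_sync(current, stored))
```

In this example, page `a` is unchanged and left alone, `b` is updated, `d` is inserted, and `c` is deleted because it no longer appears among the extracted pages.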
Prerequisites
Before configuring the connector in the Wikit Connect console:
Service account: Create a user account dedicated to synchronization with the necessary access rights to the pages to be synchronized.
Accessible login page: The login page must be accessible from the Wikit Connect server and contain a standard HTML form.
Compatible authentication form: The login form must be a standard HTML form that can be submitted without JavaScript. Pages using OAuth, SAML authentication, or other complex mechanisms are not supported by this connector.
Hidden fields identification: If the login form contains hidden fields (CSRF tokens, redirect URL, etc.), you must identify them beforehand (via browser developer tools) and configure them in the "Hidden fields" field.
Scrappy service: You must have a functional Scrappy service URL (provided by Wikit according to your environment).
URLs accessible after authentication: The URLs to synchronize must be accessible once authenticated with the configured service account.
FAQ
How do I identify hidden fields in the login form?
To identify hidden fields in the form:
- Access the login page in your browser
- Open developer tools (F12 or right-click > "Inspect")
- Go to the "Elements" or "Inspector" tab
- Locate the `<form>` tag of the login form
- Search for `<input type="hidden">` tags inside the form
- Note the `name` and `value` attributes of these hidden fields
Example: If you find `<input type="hidden" name="csrf_token" value="abc123">`, you must add to the "Hidden fields" field: `{"csrf_token": "abc123"}`
Important: If a hidden field's value changes with each page load (like a dynamic CSRF token), you may need to use another type of connector or contact Wikit support.
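If you prefer not to read the markup by hand, the same inspection can be done with a short script on the saved HTML of the login page. The sketch below uses only Python's standard `html.parser` module; the `HiddenFieldParser` class and `extract_hidden_fields` function are hypothetical helpers and are not part of the connector. For simplicity it collects hidden inputs from every form on the page, so on a page with several forms you still need to check which ones belong to the login form.

```python
from html.parser import HTMLParser


class HiddenFieldParser(HTMLParser):
    """Collect name/value pairs of hidden inputs found inside <form> elements."""

    def __init__(self):
        super().__init__()
        self.form_depth = 0        # greater than 0 while inside a <form>
        self.hidden_fields = {}

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "form":
            self.form_depth += 1
        elif tag == "input" and self.form_depth > 0:
            is_hidden = (attributes.get("type") or "").lower() == "hidden"
            if is_hidden and attributes.get("name"):
                # A missing value attribute is stored as an empty string.
                self.hidden_fields[attributes["name"]] = attributes.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form" and self.form_depth > 0:
            self.form_depth -= 1


def extract_hidden_fields(html: str) -> dict:
    """Return the hidden fields of the forms in the given HTML as a dict."""
    parser = HiddenFieldParser()
    parser.feed(html)
    parser.close()
    return parser.hidden_fields


sample = """
<form action="/login" method="post">
  <input type="text" name="username">
  <input type="password" name="password">
  <input type="hidden" name="csrf_token" value="abc123">
</form>
"""
print(extract_hidden_fields(sample))  # {'csrf_token': 'abc123'}
```

Running the script on two separate downloads of the login page is a quick way to spot values that change on each load, which is the situation described in the note above.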
The connector fails with an authentication error. What should I do?
If the connector returns an authentication error, check the following:
Correct credentials: Verify that the username and password are correct by logging in manually on the login page.
Correct login URL: Ensure that the authentication page URL is accurate and accessible.
Hidden fields: Verify that all required hidden fields are correctly configured with their values.
Field names: The connector automatically detects form fields for username and password. If your form uses non-standard names, authentication may fail.
Complex authentication: If your page uses OAuth, SAML authentication, two-factor authentication (2FA), or a JavaScript form, this connector will not be compatible. Contact Wikit support for alternative solutions.
Check the synchronization logs in the Wikit Connect console for more details about the error.
Do session cookies expire between synchronizations?
Yes, session cookies are not retained between synchronizations. With each connector execution, the complete authentication process is performed:
- New login to the login page
- Form submission with credentials
- Retrieval of new session cookies
- Use of these cookies to access protected pages
This approach ensures that the connector always uses valid cookies, even if session lifetime is short on your target system.
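The four steps above can be sketched with Python's standard library. This is a simplified illustration and not the connector's implementation: it assumes the form fields are named `username` and `password` and that the form posts back to the login page URL, whereas the connector detects the form and its fields automatically. The function names `build_login_body` and `fetch_protected_pages` are hypothetical.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def build_login_body(username, password, hidden_fields=None,
                     username_field="username", password_field="password") -> bytes:
    """URL-encode the form fields the way a browser submits a standard HTML form.

    The default field names are assumptions; real forms may use other names.
    """
    fields = dict(hidden_fields or {})
    fields[username_field] = username
    fields[password_field] = password
    return urllib.parse.urlencode(fields).encode("utf-8")


def fetch_protected_pages(login_url, body, page_urls, action_url=None):
    """Log in, keep the session cookies in memory, then fetch each protected page."""
    # A fresh cookie jar per call: nothing is kept between two runs.
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

    # 1. Open the login page (this may already set pre-login cookies).
    with opener.open(login_url, timeout=30) as response:
        response.read()

    # 2. Submit the form; passing data makes this a POST request.
    #    Assumption: the form action is the login URL unless action_url is given.
    with opener.open(action_url or login_url, data=body, timeout=30) as response:
        response.read()

    # 3. The session cookies returned by the server are now stored in `jar`.
    # 4. Reuse them to request each protected page.
    pages = {}
    for url in page_urls:
        with opener.open(url, timeout=30) as response:
            pages[url] = response.read().decode("utf-8", errors="replace")
    return pages


body = build_login_body("svc-sync", "p@ss", {"csrf_token": "abc123"})
print(body)
```

Because the cookie jar is created inside `fetch_protected_pages`, every call starts without cookies and performs a full login, which mirrors the behavior described in this answer.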
How does this connector differ from the standard Url Scrapper connector?
| Criterion | Url Scrapper | Url Scrapper Cookie Auth |
|---|---|---|
| Accessible pages | Public pages or pages with HTTP header authentication | Pages protected by a login form |
| Authentication | Via HTTP headers (Basic Auth, Bearer Token, etc.) | Via HTML form with session cookies |
| Configuration | Simpler (no session management) | Requires login credentials configuration |
| Use cases | Public pages, APIs with tokens | Intranets, member areas, platforms with login |
Use the Url Scrapper Cookie Auth connector if your pages require login via a standard HTML form. Use the standard Url Scrapper connector for all other situations.