URL Crawler Connector

Configuration

Field	Description	Example / Note
`URLs`	The URL of the site to index or a list of topics to retrieve	`https://wikit.ai` or `https://wikit.ai/blog` to index only blog posts
`Use sitemap`	Allows automatic retrieval of the site's sitemap	Recognized sitemaps: `sitemap.xml`, `sitemap_index.xml`, `sitemap`
`Custom sitemap`	To be used if the sitemap URL does not match recognized formats	Manually enter the sitemap URL
`Path to exclude`	Allows excluding one or more specific paths from the site	Example: `/private`, `/admin`
`Exclude file URLs`	Automatically excludes URLs pointing to files (PDF, images, etc.)	Useful to avoid indexing documents
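
As an illustration of the `Use sitemap` option, the sketch below builds the sitemap URLs that could be probed for a site. It assumes the recognized file names are looked up at the root of the site's origin, which the table does not state; the function name `candidate_sitemap_urls` is hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit

# File names listed as recognized in the table above.
RECOGNIZED_SITEMAPS = ("sitemap.xml", "sitemap_index.xml", "sitemap")


def candidate_sitemap_urls(site_url: str) -> list[str]:
    """Return the sitemap URLs to probe (assumption: they sit at the origin root)."""
    parts = urlsplit(site_url)
    return [
        urlunsplit((parts.scheme, parts.netloc, f"/{name}", "", ""))
        for name in RECOGNIZED_SITEMAPS
    ]


print(candidate_sitemap_urls("https://wikit.ai/blog"))
```

If the sitemap lives anywhere else, `Custom sitemap` is the field to use.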

Understanding Parameters

Conditions for a URL to be crawled

A URL is crawled ONLY IF all of the following conditions are true:

✅ Same origin AND same sub-path
✅ No anchor (#section)
✅ Not in excluded paths
✅ Not already visited
✅ Not a social media URL
✅ Not a file (if excludeFileUrls = true)
✅ Does not end with '#'
✅ Parameters allowed or no parameters

Detailed Examples

Example Base

baseUrl : https://example.com/docs
pathsToExclude : ["/admin", "/private"]
excludeFileUrls : true
includeUrlWithParam : false
visitedUrl : Set(["https://example.com/docs/intro"])

✅ URLs that WILL be crawled

URL	Reason
`https://example.com/docs/guide`	✅ All conditions met
`https://example.com/docs/api/v1`	✅ Same origin, valid sub-path
`https://example.com/docs/tutorial.html`	✅ HTML file allowed even if excludeFileUrls=true
`https://example.com/docs`	✅ Valid root path

❌ URLs that WILL NOT be crawled

1. Different origin

URL	Problem
`https://other-site.com/docs`	❌ Different origin
`http://example.com/docs`	❌ Different protocol
`https://subdomain.example.com/docs`	❌ Different subdomain

2. Invalid path

URL	Problem
`https://example.com/blog`	❌ Does not start with `/docs`
`https://example.com/`	❌ Not the same sub-path
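
The origin and sub-path checks from sections 1 and 2 could be sketched as follows. This is a minimal illustration, not the connector's code: the function names are hypothetical, and the sub-path test is assumed to be a plain prefix comparison on the pathname.

```python
from urllib.parse import urlsplit


def same_origin(url: str, base_url: str) -> bool:
    """Compare scheme, host, and port, which together make up the origin."""
    u, b = urlsplit(url), urlsplit(base_url)
    return (u.scheme, u.hostname, u.port) == (b.scheme, b.hostname, b.port)


def same_sub_path(url: str, base_url: str) -> bool:
    """True when the URL path starts with the base URL path (plain prefix test)."""
    base_path = urlsplit(base_url).path or "/"
    return (urlsplit(url).path or "/").startswith(base_path)


BASE = "https://example.com/docs"
for candidate in ("https://example.com/docs/api/v1",
                  "http://example.com/docs",
                  "https://example.com/blog"):
    print(candidate, same_origin(candidate, BASE), same_sub_path(candidate, BASE))
```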

3. Presence of anchor

URL	Problem
`https://example.com/docs/guide#section1`	❌ Contains an anchor
`https://example.com/docs#top`	❌ Anchor present

4. Excluded paths

URL	Problem
`https://example.com/admin/users`	❌ Starts with `/admin`
`https://example.com/private/data`	❌ Starts with `/private`
`https://example.com/docs/admin`	❌ Sub-path excluded

5. Already visited URLs

URL	Problem
`https://example.com/docs/intro`	❌ Already in visitedUrl

6. Social Media

URL	Problem
`https://twitter.com/example`	❌ Social media URL
`https://facebook.com/page`	❌ Social media URL
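
A host-based check is one way such URLs could be detected. The list of social media hosts used by the connector is not given here, so `SOCIAL_MEDIA_HOSTS` below is a placeholder containing only the two hosts from the examples, and `is_social_media` is a hypothetical name.

```python
from urllib.parse import urlsplit

# Placeholder list: only the two hosts shown in the examples above.
# The connector's actual list is not documented here.
SOCIAL_MEDIA_HOSTS = {"twitter.com", "facebook.com"}


def is_social_media(url: str) -> bool:
    """True when the host is a listed social network or one of its subdomains."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == h or host.endswith("." + h) for h in SOCIAL_MEDIA_HOSTS)


print(is_social_media("https://twitter.com/example"))
```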

7. Files (if excludeFileUrls = true)

URL	Problem
`https://example.com/docs/file.pdf`	❌ PDF file
`https://example.com/docs/image.jpg`	❌ Image file
`https://example.com/docs/doc.docx`	❌ Word document

Exception: .html and .htm files are allowed even if excludeFileUrls = true
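
The file rule and its exception can be sketched as an extension test. This sketch assumes that any extension other than .html or .htm marks a file; the connector may instead rely on a fixed list of known extensions. The function name is hypothetical.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit

# Extensions that stay crawlable even when file URLs are excluded.
ALLOWED_EXTENSIONS = {".html", ".htm"}


def is_excluded_file(url: str, exclude_file_urls: bool) -> bool:
    """True when the URL points to a file and file URLs are excluded."""
    if not exclude_file_urls:
        return False
    suffix = PurePosixPath(urlsplit(url).path).suffix.lower()
    # Assumption: every non-HTML extension counts as a file.
    return suffix != "" and suffix not in ALLOWED_EXTENSIONS


print(is_excluded_file("https://example.com/docs/file.pdf", True))
print(is_excluded_file("https://example.com/docs/tutorial.html", True))
```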

8. URLs ending with '#'

URL	Problem
`https://example.com/docs/guide#`	❌ Ends with '#'
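
The anchor rule (section 3) and the trailing '#' rule can be illustrated together. Python's `urlsplit` returns an empty fragment for a URL that ends with a bare '#', so this sketch tests that case explicitly. The function name `has_anchor` is hypothetical.

```python
from urllib.parse import urlsplit


def has_anchor(url: str) -> bool:
    """True for a non-empty fragment (#section) or a bare trailing '#'."""
    return urlsplit(url).fragment != "" or url.endswith("#")


for candidate in ("https://example.com/docs/guide#section1",
                  "https://example.com/docs/guide#",
                  "https://example.com/docs/guide"):
    print(candidate, has_anchor(candidate))
```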

9. Unauthorized parameters

If includeUrlWithParam = false:

URL	Problem
`https://example.com/docs?search=test`	❌ Contains parameters
`https://example.com/docs/guide?page=1`	❌ Unauthorized parameters

If includeUrlWithParam = true:

URL	Status
`https://example.com/docs?search=test`	✅ Parameters allowed
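
The parameter rule reduces to a check on the query string. The sketch below is illustrative only; `params_allowed` is a hypothetical name, and "parameters" is assumed to mean a non-empty query string.

```python
from urllib.parse import urlsplit


def params_allowed(url: str, include_url_with_param: bool) -> bool:
    """A URL passes if it has no query string or if parameters are allowed."""
    has_params = urlsplit(url).query != ""
    return include_url_with_param or not has_params


print(params_allowed("https://example.com/docs?search=test", False))
print(params_allowed("https://example.com/docs?search=test", True))
```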

Decision Tree

URL to test
│
├─ Same origin? ─── NO ──► ❌ Rejected
│   │
│   YES
│   │
├─ Same sub-path? ─── NO ──► ❌ Rejected
│   │
│   YES
│   │
├─ Contains an anchor? ─── YES ──► ❌ Rejected
│   │
│   NO
│   │
├─ Excluded path? ─── YES ──► ❌ Rejected
│   │
│   NO
│   │
├─ Already visited? ─── YES ──► ❌ Rejected
│   │
│   NO
│   │
├─ Social media URL? ─── YES ──► ❌ Rejected
│   │
│   NO
│   │
├─ File + excludeFileUrls? ─── YES ──► ❌ Rejected
│   │
│   NO
│   │
├─ Ends with '#'? ─── YES ──► ❌ Rejected
│   │
│   NO
│   │
└─ Parameters allowed? ─── NO ──► ❌ Rejected
    │
    YES
    │
    ✅ URL crawled
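
The tree can be read as a sequence of checks in which the first failure decides the outcome. The sketch below follows that order and returns the reason for rejection. It is not the connector's implementation: `CrawlConfig`, `rejection_reason`, and the `social_hosts` list are hypothetical, and excluded entries are assumed to be appended to the base URL's path, which matches the `/docs/admin` example.

```python
from dataclasses import dataclass, field
from pathlib import PurePosixPath
from typing import Optional
from urllib.parse import urlsplit


@dataclass
class CrawlConfig:
    base_url: str
    paths_to_exclude: list[str] = field(default_factory=list)
    exclude_file_urls: bool = True
    include_url_with_param: bool = False
    # Placeholder list; the real set of social hosts is not documented.
    social_hosts: frozenset = frozenset({"twitter.com", "facebook.com"})


def rejection_reason(url: str, cfg: CrawlConfig, visited: set[str]) -> Optional[str]:
    """Return the first failing check in decision-tree order, or None if crawlable."""
    u, b = urlsplit(url), urlsplit(cfg.base_url)
    base_path = b.path.rstrip("/")
    if (u.scheme, u.hostname, u.port) != (b.scheme, b.hostname, b.port):
        return "different origin"
    if not u.path.startswith(base_path):
        return "different sub-path"
    if u.fragment:
        return "contains an anchor"
    # Assumption: each excluded entry is appended to the base path.
    if any(u.path.startswith(base_path + p) for p in cfg.paths_to_exclude):
        return "excluded path"
    if url in visited:
        return "already visited"
    host = u.hostname or ""
    if any(host == h or host.endswith("." + h) for h in cfg.social_hosts):
        return "social media URL"
    suffix = PurePosixPath(u.path).suffix.lower()
    if cfg.exclude_file_urls and suffix and suffix not in {".html", ".htm"}:
        return "file URL"
    if url.endswith("#"):
        return "ends with '#'"
    if u.query and not cfg.include_url_with_param:
        return "parameters not allowed"
    return None


# Values taken from the Example Base section.
cfg = CrawlConfig("https://example.com/docs", ["/admin", "/private"])
visited = {"https://example.com/docs/intro"}
for candidate in ("https://example.com/docs/guide",
                  "https://example.com/docs/file.pdf",
                  "https://example.com/docs?search=test"):
    print(candidate, "->", rejection_reason(candidate, cfg, visited) or "crawled")
```

Returning the reason rather than a plain boolean makes it easier to see why a given page was skipped.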

Special Cases

HTML Files

file.html and file.htm are always allowed even if excludeFileUrls = true
Other extensions (.pdf, .jpg, etc.) are blocked if excludeFileUrls = true

URL Parameters

If includeUrlWithParam = false: any URL with a query string (?param=value) is rejected
If includeUrlWithParam = true: URLs with parameters are allowed

Excluded Paths

Excluded paths are converted to absolute pathnames, resolved against the first baseUrl, before comparison:

An entry such as "/admin" is expanded to its full pathname
The check uses startsWith(), so /admin/users is excluded whenever /admin is in the list
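
The resolution and prefix check could look like the sketch below. It assumes each entry is appended to the pathname of the base URL, so that "/admin" under `https://example.com/docs` becomes `/docs/admin`; the exact resolution used by the connector may differ, and both function names are hypothetical. URLs outside the base path are already rejected by the sub-path condition.

```python
from urllib.parse import urlsplit


def resolve_excluded_paths(base_url: str, paths_to_exclude: list[str]) -> list[str]:
    """Turn entries such as "/admin" into full pathnames under the base URL path."""
    base_path = urlsplit(base_url).path.rstrip("/")
    # Assumption: entries are relative to the base path, not the site root.
    return [base_path + "/" + p.strip("/") for p in paths_to_exclude]


def is_excluded(url: str, excluded: list[str]) -> bool:
    """Prefix test, mirroring the startsWith() check described above."""
    path = urlsplit(url).path
    return any(path.startswith(prefix) for prefix in excluded)


excluded = resolve_excluded_paths("https://example.com/docs", ["/admin", "/private"])
print(excluded)
print(is_excluded("https://example.com/docs/admin/users", excluded))
```

Because the test is a plain prefix match, an entry such as "/admin" would also match a sibling path like `/docs/administrators` in this sketch.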
