URL Crawler Connector
Configuration
| Field | Description | Example / Note |
|---|---|---|
| URLs | The URL of the site to index or a list of topics to retrieve | https://wikit.ai or https://wikit.ai/blog to index only blog posts |
| Use sitemap | Allows automatic retrieval of the site's sitemap | Recognized sitemaps: sitemap.xml, sitemap_index.xml, sitemap |
| Custom sitemap | To be used if the sitemap URL does not match recognized formats | Manually enter the sitemap URL |
| Path to exclude | Allows excluding one or more specific paths from the site | Example: /private, /admin |
| Exclude file URLs | Automatically excludes URLs pointing to files (PDF, images, etc.) | Useful to avoid indexing documents |
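
As a rough illustration of how the two sitemap fields might interact, the sketch below builds the list of sitemap URLs to try. The helper name candidateSitemapUrls is hypothetical, and two behaviours are assumptions rather than documented facts: that recognized sitemaps are looked up at the root of the site's origin, and that a custom sitemap replaces the automatic candidates.

```typescript
// File names recognized automatically, as listed in the table above.
const RECOGNIZED_SITEMAPS = ["sitemap.xml", "sitemap_index.xml", "sitemap"];

// Hypothetical helper, not the connector's actual API.
// Assumption: recognized sitemaps live at the root of the site's origin.
// Assumption: a custom sitemap URL, when given, is used instead of the defaults.
function candidateSitemapUrls(siteUrl: string, customSitemap?: string): string[] {
  if (customSitemap) {
    return [customSitemap];
  }
  const origin = new URL(siteUrl).origin;
  return RECOGNIZED_SITEMAPS.map((name) => `${origin}/${name}`);
}

console.log(candidateSitemapUrls("https://wikit.ai/blog"));
```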
Understanding Parameters
Conditions for a URL to be crawled
A URL is crawled ONLY IF all of the following conditions are true:
✅ Same origin AND same sub-path
✅ No anchor (#section)
✅ Not in excluded paths
✅ Not already visited
✅ Not a social media URL
✅ Not a file (if excludeFileUrls = true)
✅ Does not end with '#'
✅ Parameters allowed or no parameters
Detailed Examples
Example Base
- baseUrl: https://example.com/docs
- pathsToExclude: ["/admin", "/private"]
- excludeFileUrls: true
- includeUrlWithParam: false
- visitedUrl: Set(["https://example.com/docs/intro"])
✅ URLs that WILL be crawled
| URL | Reason |
|---|---|
| https://example.com/docs/guide | ✅ All conditions met |
| https://example.com/docs/api/v1 | ✅ Same origin, valid sub-path |
| https://example.com/docs/tutorial.html | ✅ HTML file allowed even if excludeFileUrls=true |
| https://example.com/docs | ✅ Valid root path |
❌ URLs that WILL NOT be crawled
1. Different origin
| URL | Problem |
|---|---|
| https://other-site.com/docs | ❌ Different origin |
| http://example.com/docs | ❌ Different protocol |
| https://subdomain.example.com/docs | ❌ Different subdomain |
2. Invalid path
| URL | Problem |
|---|---|
| https://example.com/blog | ❌ Does not start with /docs |
| https://example.com/ | ❌ Not the same sub-path |
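
The origin and sub-path rejections above can be sketched with the standard URL class. The function names sameOrigin and sameSubPath are hypothetical. Whether the connector matches the sub-path on a segment boundary (so that /docs-old would not count as being under /docs) is not stated in this document; the stricter boundary check below is an assumption.

```typescript
// Origin = protocol + host + port, so http vs https and subdomains all differ.
function sameOrigin(candidate: string, baseUrl: string): boolean {
  return new URL(candidate).origin === new URL(baseUrl).origin;
}

// Assumption: the candidate path must equal the base path or sit below it
// on a "/" boundary. The real connector may use a looser prefix check.
function sameSubPath(candidate: string, baseUrl: string): boolean {
  const basePath = new URL(baseUrl).pathname.replace(/\/+$/, "");
  const path = new URL(candidate).pathname;
  return path === basePath || path.startsWith(basePath + "/");
}

const base = "https://example.com/docs";
console.log(sameOrigin("http://example.com/docs", base));       // protocol differs
console.log(sameSubPath("https://example.com/blog", base));     // outside /docs
console.log(sameSubPath("https://example.com/docs/api/v1", base));
```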
3. Presence of anchor
| URL | Problem |
|---|---|
| https://example.com/docs/guide#section1 | ❌ Contains an anchor |
| https://example.com/docs#top | ❌ Anchor present |
4. Excluded paths
| URL | Problem |
|---|---|
| https://example.com/admin/users | ❌ Starts with /admin |
| https://example.com/private/data | ❌ Starts with /private |
| https://example.com/docs/admin | ❌ Sub-path excluded |
5. Already visited URLs
| URL | Problem |
|---|---|
| https://example.com/docs/intro | ❌ Already in visitedUrl |
6. Social Media
| URL | Problem |
|---|---|
| https://twitter.com/example | ❌ Social media URL |
| https://facebook.com/page | ❌ Social media URL |
7. Files (if excludeFileUrls = true)
| URL | Problem |
|---|---|
| https://example.com/docs/file.pdf | ❌ PDF file |
| https://example.com/docs/image.jpg | ❌ Image file |
| https://example.com/docs/doc.docx | ❌ Word document |
Exception: .html and .htm files are allowed even if excludeFileUrls = true
8. URLs ending with ‘#’
| URL | Problem |
|---|---|
| https://example.com/docs/guide# | ❌ Ends with ‘#’ |
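
One plausible reason the trailing ‘#’ is listed as its own condition: with the standard URL class, a URL that ends in a bare ‘#’ reports an empty hash, so an anchor check based on hash alone would let it through. The sketch below combines both checks; the function name hasAnchorOrTrailingHash is hypothetical, and this is not necessarily how the connector implements it.

```typescript
// Rejects URLs with a fragment ("#section1") as well as URLs that merely
// end with "#", which the URL class reports as having an empty hash.
function hasAnchorOrTrailingHash(rawUrl: string): boolean {
  const url = new URL(rawUrl);
  return url.hash !== "" || rawUrl.endsWith("#");
}

console.log(new URL("https://example.com/docs/guide#").hash === ""); // true
console.log(hasAnchorOrTrailingHash("https://example.com/docs/guide#"));
console.log(hasAnchorOrTrailingHash("https://example.com/docs/guide#section1"));
```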
9. Unauthorized parameters
If includeUrlWithParam = false:
| URL | Problem |
|---|---|
| https://example.com/docs?search=test | ❌ Contains parameters |
| https://example.com/docs/guide?page=1 | ❌ Unauthorized parameters |
If includeUrlWithParam = true:
| URL | Status |
|---|---|
| https://example.com/docs?search=test | ✅ Parameters allowed |
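
Taken together, the conditions and examples above can be sketched as a single predicate. This is an illustration under stated assumptions, not the connector's source code: the names CrawlConfig and shouldCrawl are hypothetical, the social media host list contains only the two hosts shown in the examples, a "file" is taken to be any last path segment with an extension other than .html or .htm, and an excluded path is matched both from the site root and under the base path so that every row of the excluded-paths table is rejected.

```typescript
interface CrawlConfig {
  baseUrl: string;
  pathsToExclude: string[];
  excludeFileUrls: boolean;
  includeUrlWithParam: boolean;
}

// Assumption: only the hosts shown in the examples; the real list is not documented here.
const SOCIAL_HOSTS = ["twitter.com", "facebook.com"];

// Assumption: a "file" is any last segment with an extension other than html/htm.
function isFile(pathname: string): boolean {
  const segment = pathname.split("/").pop() ?? "";
  const dot = segment.lastIndexOf(".");
  if (dot <= 0 || dot === segment.length - 1) return false;
  const ext = segment.slice(dot + 1).toLowerCase();
  return ext !== "html" && ext !== "htm";
}

function shouldCrawl(rawUrl: string, config: CrawlConfig, visitedUrl: Set<string>): boolean {
  const url = new URL(rawUrl);
  const base = new URL(config.baseUrl);
  const basePath = base.pathname.replace(/\/+$/, "");

  // Same origin and same sub-path.
  if (url.origin !== base.origin) return false;
  if (url.pathname !== basePath && !url.pathname.startsWith(basePath + "/")) return false;
  // Anchor present.
  if (url.hash !== "") return false;
  // Excluded paths, matched from the root or under the base path (assumption).
  const excluded = config.pathsToExclude.some(
    (p) => url.pathname.startsWith(p) || url.pathname.startsWith(basePath + p)
  );
  if (excluded) return false;
  // Already visited.
  if (visitedUrl.has(rawUrl)) return false;
  // Social media.
  if (SOCIAL_HOSTS.some((h) => url.hostname === h || url.hostname.endsWith("." + h))) return false;
  // Files, unless HTML.
  if (config.excludeFileUrls && isFile(url.pathname)) return false;
  // Bare trailing '#'.
  if (rawUrl.endsWith("#")) return false;
  // Query parameters.
  if (url.search !== "" && !config.includeUrlWithParam) return false;
  return true;
}

const config: CrawlConfig = {
  baseUrl: "https://example.com/docs",
  pathsToExclude: ["/admin", "/private"],
  excludeFileUrls: true,
  includeUrlWithParam: false,
};
const visitedUrl = new Set(["https://example.com/docs/intro"]);
console.log(shouldCrawl("https://example.com/docs/guide", config, visitedUrl));
console.log(shouldCrawl("https://example.com/docs/file.pdf", config, visitedUrl));
```

The checks run in the same order as the decision tree, so the first failing condition is the one that rejects the URL.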
Decision Tree
URL to test
│
├─ Same origin? ─── NO ──► ❌ Rejected
│ │
│ YES
│ │
├─ Same sub-path? ─── NO ──► ❌ Rejected
│ │
│ YES
│ │
├─ Contains an anchor? ─── YES ──► ❌ Rejected
│ │
│ NO
│ │
├─ Excluded path? ─── YES ──► ❌ Rejected
│ │
│ NO
│ │
├─ Already visited? ─── YES ──► ❌ Rejected
│ │
│ NO
│ │
├─ Social media URL? ─── YES ──► ❌ Rejected
│ │
│ NO
│ │
├─ File + excludeFileUrls? ─── YES ──► ❌ Rejected
│ │
│ NO
│ │
├─ Ends with '#'? ─── YES ──► ❌ Rejected
│ │
│ NO
│ │
└─ Parameters allowed? ─── NO ──► ❌ Rejected
│
YES
│
✅ URL crawled
Special Cases
HTML Files
- file.html and file.htm are always allowed, even if excludeFileUrls = true
- Other extensions (.pdf, .jpg, etc.) are blocked if excludeFileUrls = true
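
A minimal sketch of the file rule, assuming a file is detected from the extension of the last path segment. The function name isBlockedFileUrl is hypothetical, and the connector may instead rely on a fixed list of extensions, which this document does not specify.

```typescript
// Extensions that are always treated as pages, per the rule above.
const PAGE_EXTENSIONS = new Set(["html", "htm"]);

// Assumption: any other extension on the last path segment counts as a file.
function isBlockedFileUrl(rawUrl: string, excludeFileUrls: boolean): boolean {
  if (!excludeFileUrls) return false;
  const segment = new URL(rawUrl).pathname.split("/").pop() ?? "";
  const dot = segment.lastIndexOf(".");
  // No extension (or a leading/trailing dot): treat as a regular page.
  if (dot <= 0 || dot === segment.length - 1) return false;
  const ext = segment.slice(dot + 1).toLowerCase();
  return !PAGE_EXTENSIONS.has(ext);
}

console.log(isBlockedFileUrl("https://example.com/docs/file.pdf", true));
console.log(isBlockedFileUrl("https://example.com/docs/tutorial.html", true));
```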
URL Parameters
- If includeUrlWithParam = false: any URL with ?param=value is rejected
- If includeUrlWithParam = true: parameters are allowed
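
The parameter rule reduces to a check on the query string. In this sketch the function name passesParamRule is hypothetical; it relies on the standard URL class, whose search property is empty when the URL has no query.

```typescript
// Returns true when the URL is acceptable with respect to query parameters.
function passesParamRule(rawUrl: string, includeUrlWithParam: boolean): boolean {
  const hasParams = new URL(rawUrl).search !== "";
  return includeUrlWithParam || !hasParams;
}

console.log(passesParamRule("https://example.com/docs?search=test", false));
console.log(passesParamRule("https://example.com/docs?search=test", true));
```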
Excluded Paths
Excluded paths are converted to absolute paths based on the first baseUrl:
- "/admin" becomes the full pathname for comparison
- The check uses startsWith(), so /admin/users is excluded if /admin is in the list
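
The sketch below illustrates the prefix check. How an excluded path is resolved against the baseUrl is not fully specified here: the examples reject both /admin/users and /docs/admin, so this version tests the path resolved from the site root and the path appended to the base path. That double check is an assumption, and the function name isExcludedPath is hypothetical.

```typescript
function isExcludedPath(rawUrl: string, baseUrl: string, pathsToExclude: string[]): boolean {
  const pathname = new URL(rawUrl).pathname;
  const basePath = new URL(baseUrl).pathname.replace(/\/+$/, "");
  return pathsToExclude.some((excluded) => {
    // Resolved against the base URL: "/admin" stays "/admin".
    const fromRoot = new URL(excluded, baseUrl).pathname;
    // Appended to the base path (assumption): "/admin" becomes "/docs/admin".
    const underBase = basePath + excluded;
    return pathname.startsWith(fromRoot) || pathname.startsWith(underBase);
  });
}

const baseUrl = "https://example.com/docs";
const pathsToExclude = ["/admin", "/private"];
console.log(isExcludedPath("https://example.com/admin/users", baseUrl, pathsToExclude));
console.log(isExcludedPath("https://example.com/docs/guide", baseUrl, pathsToExclude));
```

Because startsWith() is a plain string prefix test, a path such as /administrator would also match an excluded /admin.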