
URL Crawler Connector

Configuration

| Field | Description | Example / Note |
|---|---|---|
| URLs | The URL of the site to index or a list of topics to retrieve | https://wikit.ai, or https://wikit.ai/blog to index only blog posts |
| Use sitemap | Allows automatic retrieval of the site's sitemap | Recognized sitemaps: sitemap.xml, sitemap_index.xml, sitemap |
| Custom sitemap | To be used if the sitemap URL does not match the recognized formats | Manually enter the sitemap URL |
| Path to exclude | Allows excluding one or more specific paths from the site | Example: /private, /admin |
| Exclude file URLs | Automatically excludes URLs pointing to files (PDF, images, etc.) | Useful to avoid indexing documents |
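
As an illustration of the two sitemap options, here is a minimal sketch of how candidate sitemap URLs could be derived. The function name, the parameter names, and the choice to look for the recognized file names at the site root are assumptions made for this example, not the connector's documented behavior.

```typescript
// Recognized sitemap names, as listed in the table above.
const RECOGNIZED_SITEMAPS = ["sitemap.xml", "sitemap_index.xml", "sitemap"];

// Hypothetical helper: returns the sitemap URLs that would be tried.
function candidateSitemapUrls(siteUrl: string, customSitemap?: string): string[] {
  // A manually entered sitemap URL is used as-is.
  if (customSitemap) return [customSitemap];
  // Assumption: recognized sitemaps are looked up at the root of the origin.
  const origin = new URL(siteUrl).origin;
  return RECOGNIZED_SITEMAPS.map((name) => `${origin}/${name}`);
}

console.log(candidateSitemapUrls("https://wikit.ai/blog"));
```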

Understanding Parameters

Conditions for a URL to be crawled

A URL is crawled ONLY IF all of the following conditions are true:

✅ Same origin AND same sub-path
✅ No anchor (#section)
✅ Not in excluded paths
✅ Not already visited
✅ Not a social media URL
✅ Not a file (if excludeFileUrls = true)
✅ Does not end with '#'
✅ Parameters allowed or no parameters

Detailed Examples

Base Example Configuration

  • baseUrl: https://example.com/docs
  • pathsToExclude: ["/admin", "/private"]
  • excludeFileUrls: true
  • includeUrlWithParam: false
  • visitedUrl: Set(["https://example.com/docs/intro"])

✅ URLs that WILL be crawled

| URL | Reason |
|---|---|
| https://example.com/docs/guide | ✅ All conditions met |
| https://example.com/docs/api/v1 | ✅ Same origin, valid sub-path |
| https://example.com/docs/tutorial.html | ✅ HTML file allowed even if excludeFileUrls = true |
| https://example.com/docs | ✅ Valid root path |

❌ URLs that WILL NOT be crawled

1. Different origin

| URL | Problem |
|---|---|
| https://other-site.com/docs | ❌ Different origin |
| http://example.com/docs | ❌ Different protocol |
| https://subdomain.example.com/docs | ❌ Different subdomain |
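
The three rejections above all follow from comparing origins (protocol, host, and port together). A minimal sketch using the standard URL class; the function name sameOrigin is hypothetical:

```typescript
// Two URLs share an origin only if protocol, host, and port all match.
function sameOrigin(candidate: string, baseUrl: string): boolean {
  return new URL(candidate).origin === new URL(baseUrl).origin;
}

const baseUrl = "https://example.com/docs";
console.log(sameOrigin("https://example.com/docs/guide", baseUrl));
console.log(sameOrigin("http://example.com/docs", baseUrl));
```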

2. Invalid path

| URL | Problem |
|---|---|
| https://example.com/blog | ❌ Does not start with /docs |
| https://example.com/ | ❌ Not the same sub-path |
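
A sub-path check along these lines could look as follows. The function name is hypothetical, and matching on a whole path segment (so that /docs-old would not count as being under /docs) is an assumption; the connector may use a plain prefix test instead.

```typescript
// True when the candidate's path is the base path itself or lies beneath it.
function sameSubPath(candidate: string, baseUrl: string): boolean {
  // Drop trailing slashes so "/docs/" and "/docs" behave the same.
  const basePath = new URL(baseUrl).pathname.replace(/\/+$/, "");
  const path = new URL(candidate).pathname;
  return path === basePath || path.startsWith(basePath + "/");
}

console.log(sameSubPath("https://example.com/docs/api/v1", "https://example.com/docs"));
console.log(sameSubPath("https://example.com/blog", "https://example.com/docs"));
```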

3. Presence of anchor

| URL | Problem |
|---|---|
| https://example.com/docs/guide#section1 | ❌ Contains an anchor |
| https://example.com/docs#top | ❌ Anchor present |

4. Excluded paths

| URL | Problem |
|---|---|
| https://example.com/admin/users | ❌ Starts with /admin |
| https://example.com/private/data | ❌ Starts with /private |
| https://example.com/docs/admin | ❌ Sub-path excluded |

5. Already visited URLs

| URL | Problem |
|---|---|
| https://example.com/docs/intro | ❌ Already in visitedUrl |

6. Social Media

| URL | Problem |
|---|---|
| https://twitter.com/example | ❌ Social media URL |
| https://facebook.com/page | ❌ Social media URL |
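
A social media filter can be sketched as a host lookup. The host list below contains only the two sites named in the examples; the connector's actual list is not documented here, so both the list and the function name are assumptions.

```typescript
// Assumption: only the hosts that appear in the examples above.
const SOCIAL_HOSTS = ["twitter.com", "facebook.com"];

function isSocialMediaUrl(rawUrl: string): boolean {
  // Ignore a leading "www." so both host forms are treated alike.
  const host = new URL(rawUrl).hostname.replace(/^www\./, "");
  return SOCIAL_HOSTS.some((h) => host === h || host.endsWith("." + h));
}

console.log(isSocialMediaUrl("https://twitter.com/example"));
console.log(isSocialMediaUrl("https://example.com/docs"));
```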

7. Files (if excludeFileUrls = true)

| URL | Problem |
|---|---|
| https://example.com/docs/file.pdf | ❌ PDF file |
| https://example.com/docs/image.jpg | ❌ Image file |
| https://example.com/docs/doc.docx | ❌ Word document |

Exception: .html and .htm files are allowed even if excludeFileUrls = true
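
One way to express the file rule and its HTML exception is to look at the extension of the last path segment. This is a sketch under an assumption: any extension other than .html or .htm is treated as a file, whereas the connector may rely on a fixed list of extensions. The names are hypothetical.

```typescript
// Extensions that still count as regular pages.
const PAGE_EXTENSIONS = new Set(["html", "htm"]);

function isFileUrl(rawUrl: string): boolean {
  const lastSegment = new URL(rawUrl).pathname.split("/").pop() ?? "";
  const dot = lastSegment.lastIndexOf(".");
  // No dot, or a leading dot only: no extension to inspect.
  if (dot <= 0) return false;
  const ext = lastSegment.slice(dot + 1).toLowerCase();
  return ext !== "" && !PAGE_EXTENSIONS.has(ext);
}

console.log(isFileUrl("https://example.com/docs/file.pdf"));
console.log(isFileUrl("https://example.com/docs/tutorial.html"));
```

Note that with this heuristic a segment such as v1.2 would also be classified as a file, which is one reason a real implementation might prefer an explicit extension list.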

8. URLs ending with ‘#’

| URL | Problem |
|---|---|
| https://example.com/docs/guide# | ❌ Ends with '#' |
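
The anchor rule and the trailing '#' rule are listed separately, and a sketch with the standard URL class suggests why: the hash property is an empty string when the URL ends with a bare '#', so that case needs its own check on the raw string. The function names are hypothetical.

```typescript
// True for URLs such as .../guide#section1 (non-empty fragment).
function hasAnchor(rawUrl: string): boolean {
  return new URL(rawUrl).hash !== "";
}

// True for URLs such as .../guide# (empty fragment, invisible to `hash`).
function endsWithHash(rawUrl: string): boolean {
  return rawUrl.trim().endsWith("#");
}

console.log(hasAnchor("https://example.com/docs/guide#section1"));
console.log(hasAnchor("https://example.com/docs/guide#"));
console.log(endsWithHash("https://example.com/docs/guide#"));
```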

9. Unauthorized parameters

If includeUrlWithParam = false:

| URL | Problem |
|---|---|
| https://example.com/docs?search=test | ❌ Contains parameters |
| https://example.com/docs/guide?page=1 | ❌ Unauthorized parameters |

If includeUrlWithParam = true:

| URL | Status |
|---|---|
| https://example.com/docs?search=test | ✅ Parameters allowed |
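
The parameter rule reduces to a single boolean expression: a URL passes if it has no query string, or if includeUrlWithParam is enabled. A minimal sketch, with a hypothetical function name:

```typescript
function parametersAllowed(rawUrl: string, includeUrlWithParam: boolean): boolean {
  // `search` is "" when the URL carries no query string.
  const hasParams = new URL(rawUrl).search !== "";
  return !hasParams || includeUrlWithParam;
}

console.log(parametersAllowed("https://example.com/docs?search=test", false));
console.log(parametersAllowed("https://example.com/docs?search=test", true));
```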

Decision Tree

URL to test

├─ Same origin? ─── NO ──► ❌ Rejected
│   │
│   YES
│   │
├─ Same sub-path? ─── NO ──► ❌ Rejected
│   │
│   YES
│   │
├─ Contains an anchor? ─── YES ──► ❌ Rejected
│   │
│   NO
│   │
├─ Excluded path? ─── YES ──► ❌ Rejected
│   │
│   NO
│   │
├─ Already visited? ─── YES ──► ❌ Rejected
│   │
│   NO
│   │
├─ Social media URL? ─── YES ──► ❌ Rejected
│   │
│   NO
│   │
├─ File + excludeFileUrls? ─── YES ──► ❌ Rejected
│   │
│   NO
│   │
├─ Ends with '#'? ─── YES ──► ❌ Rejected
│   │
│   NO
│   │
└─ No parameters, or parameters allowed? ─── NO ──► ❌ Rejected
    │
    YES
    │
    ✅ URL crawled
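
The decision tree can be read as a chain of early returns. The sketch below follows the same order and returns the first rejection reason, or null when the URL would be crawled. It is an illustration, not the connector's source: the CrawlConfig shape, the function name, the social host list, the extension heuristic, and the way excluded paths are matched (both from the site root and relative to the base path) are all assumptions.

```typescript
// Hypothetical shape; field names follow the base example configuration.
interface CrawlConfig {
  baseUrl: string;
  pathsToExclude: string[];
  excludeFileUrls: boolean;
  includeUrlWithParam: boolean;
  visitedUrl: Set<string>;
}

// Assumption: only the two hosts named in the examples.
const SOCIAL_HOSTS = ["twitter.com", "facebook.com"];

// Returns null when the URL would be crawled, otherwise the first rejection reason.
function rejectionReason(rawUrl: string, cfg: CrawlConfig): string | null {
  const url = new URL(rawUrl);
  const base = new URL(cfg.baseUrl);
  const basePath = base.pathname.replace(/\/+$/, "");
  const path = url.pathname;
  const host = url.hostname.replace(/^www\./, "");
  const lastSegment = path.split("/").pop() ?? "";
  const dot = lastSegment.lastIndexOf(".");
  const ext = dot > 0 ? lastSegment.slice(dot + 1).toLowerCase() : "";

  if (url.origin !== base.origin) return "different origin";
  if (path !== basePath && !path.startsWith(basePath + "/")) return "different sub-path";
  if (url.hash !== "") return "contains an anchor";
  if (cfg.pathsToExclude.some((p) => path.startsWith(p) || path.startsWith(basePath + p))) {
    return "excluded path";
  }
  if (cfg.visitedUrl.has(rawUrl)) return "already visited";
  if (SOCIAL_HOSTS.some((h) => host === h || host.endsWith("." + h))) return "social media URL";
  if (cfg.excludeFileUrls && ext !== "" && ext !== "html" && ext !== "htm") return "file URL";
  if (rawUrl.endsWith("#")) return "ends with '#'";
  if (url.search !== "" && !cfg.includeUrlWithParam) return "parameters not allowed";
  return null;
}

const cfg: CrawlConfig = {
  baseUrl: "https://example.com/docs",
  pathsToExclude: ["/admin", "/private"],
  excludeFileUrls: true,
  includeUrlWithParam: false,
  visitedUrl: new Set(["https://example.com/docs/intro"]),
};

console.log(rejectionReason("https://example.com/docs/guide", cfg));
console.log(rejectionReason("https://example.com/docs/file.pdf", cfg));
```

Because the checks run in tree order, a URL such as https://example.com/admin/users is already rejected at the sub-path step before the excluded-path step is reached.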

Special Cases

HTML Files

  • file.html and file.htm are always allowed even if excludeFileUrls = true
  • Other extensions (.pdf, .jpg, etc.) are blocked if excludeFileUrls = true

URL Parameters

  • If includeUrlWithParam = false: any URL with ?param=value is rejected
  • If includeUrlWithParam = true: parameters are allowed

Excluded Paths

Excluded paths are converted to absolute paths based on the first baseUrl:

  • "/admin" is expanded to its full pathname before comparison
  • The check uses startsWith(), so /admin/users is excluded when /admin is in the list
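
The exact resolution step is not spelled out here, so the following is only a sketch. To reproduce both documented outcomes (/admin/users and /docs/admin are rejected for a baseUrl of https://example.com/docs), it assumes each excluded path is compared both from the site root and relative to the base path. The function name is hypothetical.

```typescript
function isExcluded(rawUrl: string, pathsToExclude: string[], baseUrl: string): boolean {
  const basePath = new URL(baseUrl).pathname.replace(/\/+$/, "");
  const pathname = new URL(rawUrl).pathname;
  // Assumption: match the excluded path at the root and under the base path.
  return pathsToExclude.some(
    (p) => pathname.startsWith(p) || pathname.startsWith(basePath + p)
  );
}

const excluded = ["/admin", "/private"];
console.log(isExcluded("https://example.com/admin/users", excluded, "https://example.com/docs"));
console.log(isExcluded("https://example.com/docs/guide", excluded, "https://example.com/docs"));
```

Since startsWith() is a plain prefix test, an entry such as /admin also matches /administrator; an entry can end with a slash if only the directory should be excluded.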