Crawl has the following default behavior:
- If a seed URL contains a non-www subdomain (https://docs.diffbot.com), crawling will be limited to the specified subdomain.
- If a seed URL lacks a subdomain or uses “www” (http://www.diffbot.com), crawling will extend to the entire domain.
For example, if you enter a seed of http://blog.diffbot.com, only URLs from http://blog.diffbot.com will be crawled. If you enter a seed of http://www.diffbot.com, URLs from across the entire domain, including https://docs.diffbot.com, will be crawled.
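The scoping rule above can be sketched as a small helper. This is an illustration of the stated behavior, not Diffbot's actual implementation; it assumes a simple two-label registrable domain (e.g., diffbot.com), whereas a real crawler would consult a public-suffix list.

```python
from urllib.parse import urlparse

def crawl_scope(seed_url: str) -> str:
    """Illustrate Crawl's default scoping rule for a seed URL.

    Returns "subdomain" when the seed uses a non-www subdomain
    (crawling stays on that subdomain) and "domain" when the seed
    has no subdomain or uses "www" (crawling covers the whole domain).
    """
    host = urlparse(seed_url).hostname or ""
    labels = host.split(".")
    # Bare domain (two labels) or a "www" prefix -> whole domain.
    if len(labels) <= 2 or labels[0] == "www":
        return "domain"
    return "subdomain"

print(crawl_scope("https://docs.diffbot.com"))  # subdomain
print(crawl_scope("http://www.diffbot.com"))    # domain
```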
To have Crawl visit other subdomains on the same domain as well, disable the "Restrict Subdomains" toggle.
Crawl offers limited support for processing pages on other domains.
If you need to process pages on other domains or subdomains (e.g., a blog home page presents all its links as shortened URLs), you may do so by disabling “Restrict Domain” functionality in the Crawl Dashboard UI (or via the restrictDomain parameter in the Crawl API).
Doing so will enable Crawl to spider all links regardless of domain, up to one “hop” from your seed URLs. (A “hop” is one link-depth from your seed.)
To prevent over-spidering, Crawl cannot exhaustively spider multiple domains from a limited set of seed URLs. If you wish to include multiple domains in your crawl, provide a seed URL for each domain.
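A crawl request with domain restriction disabled might be constructed as follows. This is a minimal sketch assuming the v3 Crawl endpoint and the parameter names referenced in these docs (token, name, seeds, apiUrl, restrictDomain); the token value is a placeholder, and you should confirm parameter details against the Crawl API reference.

```python
import urllib.parse

# Sketch of a Crawl API request URL that disables domain restriction,
# so the crawler may follow off-domain links up to one hop from the seeds.
params = {
    "token": "YOUR_DIFFBOT_TOKEN",       # placeholder credential
    "name": "multi-domain-crawl",
    "seeds": "http://blog.diffbot.com",  # space-separated seed URLs
    "apiUrl": "https://api.diffbot.com/v3/analyze",
    "restrictDomain": 0,                 # 0 = allow off-domain links (one hop)
}
url = "https://api.diffbot.com/v3/crawl?" + urllib.parse.urlencode(params)
print(url)
```

To crawl several domains exhaustively rather than relying on the one-hop allowance, list a seed URL from each domain in the `seeds` value instead.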