Crawlbot has the following default behavior:
- If a seed URL contains a non-www subdomain (https://docs.diffbot.com), crawling will be limited to the specified subdomain.
- If a seed URL lacks a subdomain or uses “www” (http://www.diffbot.com), crawling will extend to the entire domain.
If you enter a seed of http://blog.diffbot.com, only URLs from http://blog.diffbot.com will be crawled. If you enter a seed of http://www.diffbot.com, URLs from http://www.diffbot.com, http://blog.diffbot.com, https://docs.diffbot.com, etc. will be crawled.
To have Crawlbot visit other subdomains of the same domain even when your seed uses a non-www subdomain, turn off the "Restrict Subdomains" toggle.
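The default rule described above can be sketched as a small predicate. This is an illustration only, not Crawlbot's implementation: the function names are hypothetical, the registrable domain is guessed naively as the last two host labels (which is wrong for suffixes such as co.uk), and a subdomain seed is assumed to match only that exact host.

```python
from urllib.parse import urlsplit


def registrable_domain(host: str) -> str:
    """Naive assumption: the domain is the last two labels (e.g. diffbot.com)."""
    return ".".join(host.lower().split(".")[-2:])


def in_default_scope(seed_url: str, candidate_url: str) -> bool:
    """Return True if candidate_url would be in scope for seed_url
    under the default behavior described in the text (hypothetical helper)."""
    seed_host = (urlsplit(seed_url).hostname or "").lower()
    cand_host = (urlsplit(candidate_url).hostname or "").lower()
    domain = registrable_domain(seed_host)

    # Seed has no subdomain, or only "www": the entire domain is in scope.
    if seed_host in (domain, "www." + domain):
        return cand_host == domain or cand_host.endswith("." + domain)

    # Seed names a non-www subdomain: only that host is in scope (assumption).
    return cand_host == seed_host
```

With this sketch, a seed of http://blog.diffbot.com accepts http://blog.diffbot.com/post but rejects https://docs.diffbot.com/, while a seed of http://www.diffbot.com accepts both.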
Processing Pages From Other Domains
Crawlbot offers limited support for processing pages on other domains.
If you need to process pages on other domains or subdomains (e.g., a blog home page presents all its links as shortened URLs), you may do so by disabling the “Restrict Domain” setting in the Crawlbot UI (or the restrictDomain parameter in the Crawlbot API). Doing so will enable Crawlbot to spider all links regardless of domain, up to one “hop” from your seed URLs. (A “hop” is one link-depth from your seed. Read more on hops.)
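To make the one-hop limit concrete, here is a minimal sketch over a toy link graph. It is one possible reading of the rule, not Crawlbot's actual algorithm: links on a seed's host are followed at any depth, while links to other hosts are followed only if they are within one hop of a seed. The function name, the `links` mapping, and the example URLs are all hypothetical.

```python
from collections import deque
from urllib.parse import urlsplit


def crawl_within_hops(seeds, links, max_cross_domain_hops=1):
    """Breadth-first walk over a toy link graph.

    links maps a URL to the list of URLs it links to (made-up data).
    Returns a dict of visited URL -> hop count from the nearest seed.
    """
    seed_hosts = {urlsplit(s).hostname for s in seeds}
    hops = {s: 0 for s in seeds}
    queue = deque(seeds)
    while queue:
        url = queue.popleft()
        for target in links.get(url, []):
            if target in hops:
                continue  # already visited at an equal or smaller depth
            depth = hops[url] + 1
            on_seed_host = urlsplit(target).hostname in seed_hosts
            # Off-host pages are only allowed within the hop limit.
            if not on_seed_host and depth > max_cross_domain_hops:
                continue
            hops[target] = depth
            queue.append(target)
    return hops


links = {
    "http://blog.example.com/": ["http://short.example/a", "http://blog.example.com/p1"],
    "http://short.example/a": ["http://other.example.org/deep"],
    "http://blog.example.com/p1": ["http://short.example/b"],
}
visited = crawl_within_hops(["http://blog.example.com/"], links)
```

Here the shortened link on the seed page is visited because it is one hop away, but off-host links discovered two hops from the seed are skipped.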
To prevent over-spidering, Crawlbot cannot exhaustively spider multiple domains from a limited set of seeds. If you wish to include multiple domains in your crawl, please provide a seed URL for each of those domains.
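Before starting a multi-domain crawl, it can help to check that every domain you care about has at least one seed. The helper below is a hypothetical convenience, not part of Crawlbot; it assumes a seed covers a domain when the seed's host equals that domain or is one of its subdomains.

```python
from urllib.parse import urlsplit


def domains_without_seed(seeds, wanted_domains):
    """Return the wanted domains that no seed URL belongs to."""
    seed_hosts = [(urlsplit(s).hostname or "").lower() for s in seeds]
    missing = []
    for domain in wanted_domains:
        d = domain.lower()
        # A seed covers the domain if its host is the domain or a subdomain of it.
        covered = any(h == d or h.endswith("." + d) for h in seed_hosts)
        if not covered:
            missing.append(domain)
    return missing


missing = domains_without_seed(
    ["http://www.diffbot.com", "http://blog.example.com"],
    ["diffbot.com", "example.com", "example.org"],
)
```

Any domain returned in `missing` would need its own seed URL added to the crawl.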