Does Diffbot respect robots.txt?
Yes. By default, Diffbot's web crawls adhere to a site's robots.txt instructions, including the disallow and crawl-delay directives.
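To illustrate how these directives behave, here is a short sketch using Python's standard-library `urllib.robotparser`. The robots.txt content and the example.com URLs are hypothetical; this is not Diffbot's internal implementation, just a demonstration of how disallow and crawl-delay rules are evaluated for a given user-agent.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: Diffbot may crawl everything except /private/,
# with a 5-second crawl delay; all other bots are blocked entirely.
ROBOTS_TXT = """\
User-agent: Diffbot
Disallow: /private/
Crawl-delay: 5

User-agent: *
Disallow: /
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# A compliant crawler checks each URL before fetching it.
print(rp.can_fetch("Diffbot", "https://example.com/articles/1"))  # True
print(rp.can_fetch("Diffbot", "https://example.com/private/x"))   # False

# ...and waits at least this many seconds between requests.
print(rp.crawl_delay("Diffbot"))  # 5
```

Any crawler that honors robots.txt performs an equivalent check and delay before each request.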
In specific cases, typically when you have a partnership or agreement with the site to be crawled, the robots.txt instructions can be ignored or overridden. This is often faster than waiting for the third-party site to update its robots.txt file.
A User-Agent is a string sent by HTTP clients to identify to the server the software making the request. To whitelist Diffbot for a site, specify the appropriate Diffbot user-agents in the site's robots.txt.
| User-Agent | Description & Purpose | Similar To |
|---|---|---|
| Diffbot | General, proactive web crawling for building a general search engine. This allows websites to be discovered and cited from the Diffbot Knowledge Graph and web search services in response to keyword queries. It is not used for AI training. | Googlebot, Bing |
| Diffbot-User | Used by requests made on behalf of a human user who is browsing a URL with Diffbot software, in response to their input. | Mozilla, AppleWebKit |
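As a concrete example, a site that wants to allow both Diffbot user-agents while blocking all other crawlers could use a robots.txt along these lines (an illustrative sketch, not a recommendation for any particular site):

```
# Allow Diffbot's crawlers
User-agent: Diffbot
Allow: /

User-agent: Diffbot-User
Allow: /

# Block everything else
User-agent: *
Disallow: /
```

More permissive sites would typically omit the final block and instead use targeted Disallow rules.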
Note that Diffbot does not crawl pages to train generative AI foundation models, and its crawling uses best practices such as caching, compression, conditional GETs, and predictive scheduling to minimize load on web servers.
Note on Extract and Crawlbot
When you use the Extract or Crawlbot APIs, you define your own crawl parameters using Diffbot's hosted software. Best practice is to set a User-Agent for your crawl that represents your organization and to leave the robots.txt adherence feature in Crawlbot enabled, which it is by default.