Crawlbot works hand-in-hand with a Diffbot API (either automatic or custom). It quickly spiders a site for appropriate links and hands these links to a Diffbot API for processing. All structured page results are then compiled into a single "collection," which can be downloaded in full or searched using the Search API.
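As a rough illustration, a crawl job is created by calling Diffbot's Crawl API with a token, a job name, one or more seed URLs, and the Diffbot API to process matched pages with. The sketch below only builds the request URL (the token and job name are placeholders; parameter names follow Diffbot's v3 Crawl API):

```python
from urllib.parse import urlencode

# Sketch: construct a request that creates a crawl job whose matched
# pages are handed to the Article API for processing.
# "YOUR_TOKEN" and "example-crawl" are placeholders.
DIFFBOT_CRAWL = "https://api.diffbot.com/v3/crawl"

params = {
    "token": "YOUR_TOKEN",
    "name": "example-crawl",
    "seeds": "https://example.com",
    # The Diffbot API used to process each matched page:
    "apiUrl": "https://api.diffbot.com/v3/article",
}

request_url = DIFFBOT_CRAWL + "?" + urlencode(params)
print(request_url)
```

Sending this URL as a GET or POST request (e.g. with `urllib.request` or `curl`) would start the crawl; the resulting collection can then be downloaded or queried via the Search API.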
Crawlbot is available on Extraction API Plus plans and above, and is accessible from the Developer Dashboard. Note that a single token is limited to 1,000 active crawls.
By default Crawlbot adheres to a site’s robots.txt instructions, including the disallow and crawl-delay directives.
In specific cases, typically because of a partnership or agreement you have with the site to be crawled, the robots.txt instructions can be overridden. This is often faster than waiting for the third-party site to update its robots.txt file.
To whitelist Crawlbot for a site, specify the “Diffbot” user-agent in the site’s robots.txt:
User-agent: Diffbot
Disallow:
Note that Crawlbot does not adhere to the
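To sanity-check a whitelist entry before a crawl, the standard-library `urllib.robotparser` can evaluate a robots.txt against the "Diffbot" user-agent. The robots.txt content below is illustrative: it allows Diffbot while blocking all other crawlers.

```python
from urllib import robotparser

# Illustrative robots.txt: an empty Disallow whitelists Diffbot,
# while the wildcard entry blocks every other user-agent.
ROBOTS_TXT = """\
User-agent: Diffbot
Disallow:

User-agent: *
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

print(rp.can_fetch("Diffbot", "https://example.com/page"))   # Diffbot is allowed
print(rp.can_fetch("OtherBot", "https://example.com/page"))  # other agents are not
```

This mirrors how a crawler decides, per user-agent, whether a URL may be fetched; an empty `Disallow:` value under a specific user-agent permits that agent everywhere.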
Depending on your Diffbot Plan, inactive crawls will be removed from your account either 14 or 30 days after completion.
This includes the extracted data as well as the job meta information (name, settings, etc.).
“Active” crawls are those that are recurring/repeating and not in a permanently “paused” state. Currently active jobs will not be deleted or removed from your account. After a recurring crawl completes its final round, it becomes subject to the regular deletion policies.
- Crawlbot Basic Walkthrough
- Crawlbot Video Tutorials
- Crawlbot API
- Crawling vs Processing
- How does Diffbot handle duplicate pages/content while crawling?
- How long does it take to crawl a site?
- When is crawl or bulk job data deleted?
- Crawlbot URL Report
- Do Diffbot APIs Follow Redirects?
- Does Crawlbot process hashtag-links?
- Using Diffbot Proxy Servers / Proxy IPs
- Sending Custom Headers during jobs