Crawl works hand-in-hand with Extract API (either automatic or custom). It quickly spiders a site for appropriate links and hands these links to an Extract API for processing. All structured page results are then compiled into a single "collection," which can be downloaded in full or searched using the Search API.
Note: If you have a complete list of all the URLs you wish to extract, you might be looking for Bulk Extract instead.
For documentation on how to use Crawl via API, check out Introduction to Crawl API.
Access to Crawl API is limited to Plus plans and up.
A Crawl job requires just 2 inputs (apart from auth and a name) to work:
- A seed URL
- A choice of Extract API to process URLs
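As a rough sketch, the two required inputs (plus the token and job name) could be assembled into a request body like the one below. The endpoint URL and the field names `token`, `name`, `seeds`, and `apiUrl` are assumptions here, as are the hypothetical helper `build_crawl_job` and the placeholder token; check the Crawl API documentation for the authoritative names. The snippet only builds the payload and sends nothing.

```python
from urllib.parse import urlencode

# Assumed endpoint; verify against the Crawl API reference before use.
CRAWL_ENDPOINT = "https://api.diffbot.com/v3/crawl"


def build_crawl_job(token, name, seed_url, extract_api_url):
    """Return the form fields for a minimal crawl job (hypothetical helper)."""
    return {
        "token": token,             # auth
        "name": name,               # job name
        "seeds": seed_url,          # input 1: the seed URL
        "apiUrl": extract_api_url,  # input 2: the Extract API that processes URLs
    }


job = build_crawl_job(
    token="YOUR_TOKEN",  # placeholder, not a real token
    name="diffbot-site-crawl",
    seed_url="https://www.diffbot.com",
    extract_api_url="https://api.diffbot.com/v3/analyze",
)

# URL-encode the fields as they might be sent in a POST body.
body = urlencode(job)
print(body)
```

Keeping payload construction separate from the HTTP call makes it easy to inspect exactly what a job would submit before creating it.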
A crawl job given a seed URL of https://www.diffbot.com and Analyze API will spider for every URL under the www.diffbot.com domain and process all of them with Analyze API.
The result is effectively a list of every page on diffbot.com, the page type classification of each, and the extracted data in the schema of its classified page type.
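To illustrate what staying "under the www.diffbot.com domain" could mean in practice, here is a minimal sketch of a same-host link filter. It is not Crawl's actual implementation: the helper `links_in_scope` is hypothetical, and treating "under the domain" as an exact hostname match is an assumption made for this example.

```python
from urllib.parse import urljoin, urlparse


def links_in_scope(seed_url, page_url, hrefs):
    """Resolve hrefs against the page URL and keep those on the seed's host."""
    seed_host = urlparse(seed_url).hostname
    in_scope = []
    for href in hrefs:
        absolute = urljoin(page_url, href)  # handles relative links
        if urlparse(absolute).hostname == seed_host:
            in_scope.append(absolute)
    return in_scope


found = links_in_scope(
    "https://www.diffbot.com",
    "https://www.diffbot.com/products/",
    [
        "/pricing",                       # relative to the site root
        "crawl/",                         # relative to the current page
        "https://blog.diffbot.com/post",  # different host (assumed out of scope)
        "https://example.com/",           # external site
    ],
)
print(found)
```

In this sketch only the first two links survive, since the other two resolve to different hostnames.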
Advanced crawl jobs add filtering logic at each step of this process to improve speed and reduce noise in the output data.
For example, a crawl job can be set up to extract only the products in a single product category from an e-commerce website.
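The idea of filtering at each step can be sketched as two independent checks: one deciding which pages to spider for further links, and one deciding which pages to hand to the Extract API. Everything below is illustrative: the helper names, the substring-matching rule, and the `shop.example.com` URLs are assumptions made for this example, not Crawl's own parameters or behavior.

```python
def matches_any(url, patterns):
    """True if no patterns are given, or the URL contains at least one of them."""
    return not patterns or any(p in url for p in patterns)


def plan_url(url, crawl_patterns, process_patterns):
    """Decide separately whether to spider a URL and whether to extract it."""
    return {
        "follow": matches_any(url, crawl_patterns),
        "process": matches_any(url, process_patterns),
    }


urls = [
    "https://shop.example.com/category/shoes",
    "https://shop.example.com/category/shoes/product/123",
    "https://shop.example.com/category/hats/product/9",
    "https://shop.example.com/about",
]
crawl_patterns = ["/category/shoes"]             # only spider the shoes category
process_patterns = ["/category/shoes/product/"]  # only extract its product pages

plans = {u: plan_url(u, crawl_patterns, process_patterns) for u in urls}
to_follow = [u for u in urls if plans[u]["follow"]]
to_process = [u for u in urls if plans[u]["process"]]
print(to_follow)
print(to_process)
```

Separating the two decisions means the category listing page is still spidered for product links even though it is never sent for extraction.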
- Plus plans may have up to 25 active crawls at a time.
- Enterprise plans may have more than 100 active crawls simultaneously.
- All plans are limited to 1000 crawls per token.