This tutorial will guide you through creating a simple crawl job.
Use the maxHops parameter to specify the depth of your crawl from your seed URL(s). A value of 0 limits crawling to your seed URL(s) only; maxHops=1 will spider all links found on your seed URL(s); maxHops=2 will also follow the links on those pages; and so on. A maxHops value of -1 (the default) will spider all links at any depth. This can be used in conjunction with URL crawl patterns to fine-tune your crawl further.
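As a sketch of how a crawl might be configured with this parameter, the snippet below builds the querystring for a hypothetical crawl-creation request; the token, crawl name, and seed URL are placeholders, and only the request parameters are assembled (nothing is sent over the network):

```python
from urllib.parse import urlencode

# Placeholder values -- substitute your own token, crawl name, and seed(s).
params = {
    "token": "YOUR_DIFFBOT_TOKEN",   # hypothetical token
    "name": "sampleCrawl",           # hypothetical crawl name
    "seeds": "https://example.com",  # seed URL(s)
    "maxHops": 1,                    # follow links one hop from the seeds
}

# Build the request querystring locally, without making any API call.
query = urlencode(params)
print(query)
```

With maxHops set to 1, the crawl would process the seed page plus every page it links to, and stop there.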
Crawlbot serves as a controller for sending pages to the appropriate Diffbot API for processing/extraction. By default, these are generic requests to that API, and they return the API's default fields.
For example, Crawlbot URLs handed to the Article API will be processed as standard Article API calls, with no additional parameters.
You can adjust individual API fields returned or the parameters of extraction API requests via the Crawlbot querystring field.
For example, to specify certain fields and adjust the timeout value in your Article API requests, enter timeout=10000&fields=title,text,meta in the querystring field. This will pass &timeout=10000&fields=title,text,meta in each Article API request.
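To illustrate, here is a minimal sketch of how a querystring value like the one above could be appended to an individual per-page Article API call; the endpoint path, token, and page URL are assumptions for the example, not values from this tutorial:

```python
from urllib.parse import urlencode

def build_article_request(page_url, token, querystring=""):
    """Append a crawl's querystring field to a per-page Article API call."""
    base = "https://api.diffbot.com/v3/article"  # assumed Article API endpoint
    query = urlencode({"token": token, "url": page_url})
    if querystring:
        # e.g. "timeout=10000&fields=title,text,meta" from the querystring field
        query += "&" + querystring
    return base + "?" + query

# Hypothetical page and token, with the querystring from the example above.
url = build_article_request(
    "https://example.com/post",
    "YOUR_TOKEN",
    "timeout=10000&fields=title,text,meta",
)
print(url)
```

Every page the crawl hands to the Article API would carry the same extra parameters, so the fields and timeout apply uniformly across the crawl.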