Create and start a job to spider a site and extract its pages.
To create a crawl, make a POST request to this endpoint with Content-Type set to application/x-www-form-urlencoded and include at minimum the required settings described below. Please note that this API does not accept JSON payloads.
Creating a crawl job instructs Diffbot to immediately start spidering the provided seed URLs for links and processing the pages it finds with the specified Extract API.
Additional settings are available to restrict crawling to links that match a certain URL pattern, or to process only a subset of the crawled pages.
Quickstart Examples
These request examples initiate a crawl on https://example.com using the Analyze API. Because it is a one-page site, the crawl should complete within seconds.
import requests
import json

# Replace <YOUR DIFFBOT TOKEN> with your Diffbot API token
url = "https://api.diffbot.com/v3/crawl?token=<YOUR DIFFBOT TOKEN>"

# Minimum settings for a simple crawl job, sent as form-encoded fields (not JSON)
payload = {
    "name": "test-crawl",
    "seeds": "https://example.com",
    "apiUrl": "https://api.diffbot.com/v3/analyze",
    "maxToCrawl": 100
}
headers = {
    "Content-Type": "application/x-www-form-urlencoded"
}

response = requests.post(url, headers=headers, data=payload)
print(json.dumps(response.json(), indent=4))
const headers = new Headers();
headers.append("Content-Type", "application/x-www-form-urlencoded");

// Minimum settings for a simple crawl job, sent as form-encoded fields (not JSON)
const payload = new URLSearchParams();
payload.append("name", "test-crawl");
payload.append("seeds", "https://example.com");
payload.append("apiUrl", "https://api.diffbot.com/v3/analyze");
payload.append("maxToCrawl", "100");

const requestOptions = {
  method: "POST",
  headers: headers,
  body: payload
};

// Replace <YOUR DIFFBOT TOKEN> with your Diffbot API token
fetch("https://api.diffbot.com/v3/crawl?token=<YOUR DIFFBOT TOKEN>", requestOptions)
  .then((response) => response.json())
  .then((result) => console.log(result))
  .catch((error) => console.error(error));
curl --location 'https://api.diffbot.com/v3/crawl?token=<YOUR DIFFBOT TOKEN>' \
--header 'Content-Type: application/x-www-form-urlencoded' \
--data-urlencode 'name=test-crawl' \
--data-urlencode 'seeds=https://example.com' \
--data-urlencode 'apiUrl=https://api.diffbot.com/v3/analyze' \
--data-urlencode 'maxToCrawl=100'
Required Settings
Due to the breadth of available settings and options for Crawl, the built-in API tester lists only the required settings for a simple Crawl job. Additional settings are described in the sections below.
Additional Settings
Optional parameters that apply to both the crawling and processing components of a Crawl job. A combined example follows the table.
Argument | Description |
---|---|
customHeaders | Set multiple custom headers to be used while crawling and processing pages sent to Diffbot APIs. Each header should be sent in its own customHeaders argument, with a colon delimiting the header name and value, and should be URL-encoded. For example, &customHeaders=Accept-Language%3Aen-us. See more on using custom headers. |
useCanonical | Pass useCanonical=0 to disable deduplication of pages based on a canonical link definition. See more. |
obeyRobots | Pass obeyRobots=0 to ignore a site's robots.txt instructions. |
restrictDomain | Pass restrictDomain=0 to allow limited crawling across subdomains/domains. See more. |
useProxies | Set value to 1 to force the use of proxy IPs for the crawl. This will utilize proxy servers for both crawling and processing of pages. |
maxHops | Specify the depth of your crawl. maxHops=0 limits processing to the seed URL(s) only; no other links will be processed. maxHops=1 processes all (otherwise matching) pages whose links appear on the seed URL(s); maxHops=2 processes pages whose links appear on those pages; and so on. By default (maxHops=-1), Crawl will crawl and process links at any depth. |
notifyEmail | Send a message to this email address when the crawl hits the maxToCrawl or maxToProcess limit, or when the crawl completes. |
notifyWebhook | Pass a URL to be notified when the crawl hits the maxToCrawl or maxToProcess limit, or when the crawl completes. You will receive a POST with X-Crawl-Name and X-Crawl-Status in the headers, and the job's JSON metadata in the POST body. Note that in webhook POSTs the parent jobs will not be sent—only the individual job object will be returned. |
repeat | Specify the number of days between repeat crawls as a floating-point value (e.g., repeat=7.0). By default crawls will not be repeated. |
seedRecrawlFrequency | Specify a frequency, in days, at which to recrawl seed URLs, independent of the overall recrawl frequency given by repeat. Defaults to seedRecrawlFrequency=-1 (use the default frequency). |
maxRounds | Specify the maximum number of crawl repeats. By default (maxRounds=0) repeating crawls will continue indefinitely. |
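As a rough sketch building on the Quickstart request above, the optional settings in this table are sent as additional form fields alongside the required ones. The notification address and the specific values below are illustrative assumptions, not defaults.

```python
import requests

# Same endpoint and token placeholder as the Quickstart example
url = "https://api.diffbot.com/v3/crawl?token=<YOUR DIFFBOT TOKEN>"

payload = {
    # Required settings
    "name": "test-crawl",
    "seeds": "https://example.com",
    "apiUrl": "https://api.diffbot.com/v3/analyze",
    # Optional settings from the table above (illustrative values)
    "maxHops": 1,                               # only process pages linked from the seed URL(s)
    "customHeaders": "Accept-Language:en-us",   # one header per customHeaders field; requests URL-encodes form values
    "notifyEmail": "you@example.com",           # hypothetical notification address
    "repeat": 7.0,                              # recrawl every 7 days
    "maxRounds": 4,                             # stop after 4 rounds
}

response = requests.post(url, data=payload)
print(response.json())
```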
Additional Crawl Settings
Optional parameters that apply only to the crawling/spidering component of a crawl job (see The Difference Between Crawling and Processing). A sketch combining several of these settings follows the table.
Argument | Description |
---|---|
urlCrawlPattern | Specify ||-separated strings to limit pages crawled to those whose URLs contain any of the content strings. You can use the exclamation point to specify a negative string, e.g. !product to exclude URLs containing the string "product," and the ^ and $ characters to limit matches to the beginning or end of the URL. The use of a urlCrawlPattern will allow Crawl to spider outside of the seed domain; it will follow all matching URLs regardless of domain. |
urlCrawlRegEx | Specify a regular expression to limit pages crawled to those URLs that contain a match to your expression. This will override any urlCrawlPattern value. |
maxToCrawl | Specify max pages to spider. Default: 100,000. |
maxToCrawlPerSubdomain | Specify max pages to spider per subdomain. Default: no limit (-1) |
crawlDelay | Wait this many seconds between each URL crawled from a single IP address. Specify the number of seconds as an integer or floating-point number (e.g., crawlDelay=0.25 ). |
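For illustration only (the /blog/ and /tag/ path segments are assumed target-site structure, not values from this page), the settings above can limit spidering to one section of a site:

```python
import requests

url = "https://api.diffbot.com/v3/crawl?token=<YOUR DIFFBOT TOKEN>"

payload = {
    "name": "blog-crawl",                       # hypothetical job name
    "seeds": "https://example.com",
    "apiUrl": "https://api.diffbot.com/v3/analyze",
    # Spider only URLs containing "/blog/", but skip any containing "/tag/"
    "urlCrawlPattern": "/blog/||!/tag/",
    "maxToCrawl": 10000,                        # stop spidering after 10,000 pages
    "crawlDelay": 0.25,                         # wait 0.25 seconds between requests from a single IP
}

response = requests.post(url, data=payload)
print(response.json())
```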
Additional Processing Settings
Optional parameters that apply only to the processing/extraction component of a crawl job (see The Difference Between Crawling and Processing). A sketch follows the table.
Argument | Description |
---|---|
urlProcessPattern | Specify ||-separated strings to limit pages processed to those whose URLs contain any of the content strings. You can use the exclamation point to specify a negative string, e.g. !/category to exclude URLs containing the string "/category," and the ^ and $ characters to limit matches to the beginning or end of the URL. |
urlProcessRegEx | Specify a regular expression to limit pages processed to those URLs that contain a match to your expression. This will override any urlProcessPattern value. |
pageProcessPattern | Specify ||-separated strings to limit pages processed to those whose HTML contains any of the content strings. |
maxToProcess | Specify max pages to process through Diffbot APIs. Default: 100,000. |
maxToProcessPerSubdomain | Specify max pages to process per subdomain. Default: no limit (-1) |
onlyProcessIfNew | By default repeat crawls will only process new (previously unprocessed) pages. Set to 0 (onlyProcessIfNew=0 ) to process all content on repeat crawls. |
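Continuing the hypothetical blog crawl above, the processing filters can be narrower than the crawl filters, so the job spiders the whole blog section but only sends individual posts to the Extract API. The /blog/posts/ path and the <article pattern are assumptions about the target site, not Diffbot defaults.

```python
import requests

url = "https://api.diffbot.com/v3/crawl?token=<YOUR DIFFBOT TOKEN>"

payload = {
    "name": "blog-crawl",                       # hypothetical job name
    "seeds": "https://example.com",
    "apiUrl": "https://api.diffbot.com/v3/analyze",
    # Spider every URL under /blog/ ...
    "urlCrawlPattern": "/blog/",
    # ... but only process URLs that look like individual posts,
    # and only if the fetched HTML contains an <article> element
    "urlProcessPattern": "/blog/posts/",
    "pageProcessPattern": "<article",
    "maxToProcess": 5000,                       # cap the number of pages sent to the Extract API
}

response = requests.post(url, data=payload)
print(response.json())
```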
Response
Upon adding a new crawl, you will receive a success message in the JSON response, in addition to full crawl details:
{
"response": "Successfully added urls for spidering.",
"jobs": [
{
"jobStatus": {
"message": "Job is initializing.",
"status": 0
},
"maxHops": -1,
"downloadJson": "...json",
"urlProcessPattern": "",
"jobCompletionTimeUTC": 0,
"maxRounds": -1,
"type": "bulk",
"pageCrawlSuccessesThisRound": 0,
"urlCrawlRegEx": "",
"pageProcessPattern": "",
"apiUrl": "https://api.diffbot.com/v3/analyze",
"useCanonical": 1,
"jobCreationTimeUTC": 1649950325,
"repeat": 0,
"downloadUrls": "...csv",
"obeyRobots": 1,
"roundsCompleted": 0,
"pageCrawlAttempts": 0,
"notifyWebhook": "",
"pageProcessSuccessesThisRound": 0,
"customHeaders": {},
"objectsFound": 0,
"roundStartTime": 0,
"urlCrawlPattern": "",
"seedRecrawlFrequency": -1,
"urlProcessRegEx": "",
"pageProcessSuccesses": 0,
"urlsHarvested": 0,
"crawlDelay": -1,
"currentTime": 1649950325,
"useProxies": 0,
"sentJobDoneNotification": 0,
"currentTimeUTC": 1649950325,
"name": "bulkTest",
"notifyEmail": "",
"pageCrawlSuccesses": 0,
"pageProcessAttempts": 0
}
]
}
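A minimal sketch of reading this response in Python, assuming response is the requests.Response object from the Quickstart example; it only touches fields shown in the sample above.

```python
# Assumes `response` is the requests.Response from the Quickstart example
data = response.json()

if "jobs" in data:
    job = data["jobs"][0]
    print(f'Created job "{job["name"]}": {job["jobStatus"]["message"]} '
          f'(status code {job["jobStatus"]["status"]})')
else:
    # Inspect the raw response for errors such as "Too Many Collections"
    print("Crawl was not created:", data)
```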
Please note that if you receive a "Too Many Collections" error, you have reached the limit of 1,000 crawl jobs.