Create and start a job to spider a site and extract its pages.
To create a crawl, make a POST request to this endpoint with Content-Type set to application/x-www-form-urlencoded and include the minimum settings specified below. Please note that this API does not accept JSON payloads.
Creating a crawl job instructs Diffbot to immediately begin spidering the provided seed URLs for links and processing the pages it finds with the specified Extract API.
Additional settings are available to crawl only links matching a certain URL pattern, or to process only a subset of the links crawled.
Quickstart Examples
These request examples initiate a crawl on https://example.com using the Analyze API. Since this is a one-page site, crawling should complete within seconds.
```python
import requests
import json

url = "https://api.diffbot.com/v3/crawl?token=<YOUR DIFFBOT TOKEN>"

payload = {
    "name": "test-crawl",
    "seeds": "https://example.com",
    "apiUrl": "https://api.diffbot.com/v3/analyze",
    "maxToCrawl": 100
}
headers = {
    "Content-Type": "application/x-www-form-urlencoded"
}

response = requests.request("POST", url, headers=headers, data=payload)

print(json.dumps(response.json(), indent=4))
```
```javascript
const headers = new Headers();
headers.append("Content-Type", "application/x-www-form-urlencoded");
const payload = new URLSearchParams();
payload.append("name", "test-crawl");
payload.append("seeds", "https://example.com");
payload.append("apiUrl", "https://api.diffbot.com/v3/analyze");
payload.append("maxToCrawl", 100);
const requestOptions = {
method: "POST",
headers: headers,
body: payload
};
fetch("https://api.diffbot.com/v3/crawl?token=<YOUR DIFFBOT TOKEN>", requestOptions)
.then((response) => response.json())
.then((result) => console.log(result))
  .catch((error) => console.error(error));
```

```shell
curl --location 'https://api.diffbot.com/v3/crawl?token=<YOUR DIFFBOT TOKEN>' \
--header 'Content-Type: application/x-www-form-urlencoded' \
--data-urlencode 'name=test-crawl' \
--data-urlencode 'seeds=https://example.com' \
--data-urlencode 'apiUrl=https://api.diffbot.com/v3/analyze' \
--data-urlencode 'maxToCrawl=100'
```

Required Settings
Due to the breadth of available settings and options for Crawl, the built-in API tester lists only the required settings for a simple crawl job. The additional settings available are described in the sections that follow.
Additional Settings
Optional parameters that apply to both the crawling and processing components of a Crawl job.
| Argument | Description |
|---|---|
| customHeaders | Set multiple custom headers to be used while crawling and processing pages sent to Diffbot APIs. Each header should be sent in its own `customHeaders` argument. |
| useCanonical | Pass `useCanonical=0` to disable deduplication of pages based on a canonical link definition. |
| obeyRobots | Pass `obeyRobots=0` to ignore a site's robots.txt instructions. |
| restrictDomain | Pass `restrictDomain=0` to allow limited crawling across subdomains/domains of the seed URLs. |
| useProxies | Set value to `1` to force the use of proxy IPs for the crawl. |
| maxHops | Specify the depth of your crawl. A `maxHops=0` will limit processing to the seed URL(s) only; `maxHops=1` will process all (otherwise matching) pages whose referring URLs are seed URLs; and so on. By default (`maxHops=-1`) there is no depth limit. |
| notifyEmail | Send a message to this email address when the crawl hits the `maxToCrawl` or `maxToProcess` limit, or when the crawl completes. |
| notifyWebhook | Pass a URL to be notified when the crawl hits the `maxToCrawl` or `maxToProcess` limit, or when the crawl completes. You will receive a POST with `X-Crawl-Name` and `X-Crawl-Status` in the headers, and the job's full JSON response in the POST body. |
| repeat | Specify the number of days as a floating-point (e.g. `repeat=7.0`) to repeat this crawl. By default crawls will not be repeated. |
| seedRecrawlFrequency | Useful for specifying a frequency, in number of days, to recrawl seed urls, which is independent of the overall recrawl frequency given by `repeat`. Seed recrawl frequency should be higher (that is, a lower number of days) than the overall recrawl frequency. |
| maxRounds | Specify the maximum number of crawl repeats. By default (`maxRounds=-1`) crawls will continue indefinitely. |
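As a sketch, these optional settings go into the same form-encoded payload as the required ones. The snippet below extends the Python quickstart with a few of them; the webhook URL, repeat interval, and other values are illustrative choices, not recommendations.

```python
import requests

# Illustrative payload combining the required settings with optional ones from
# the table above. The webhook URL and all numeric values are placeholders.
payload = {
    "name": "test-crawl",
    "seeds": "https://example.com",
    "apiUrl": "https://api.diffbot.com/v3/analyze",
    "maxHops": 2,            # follow links at most two hops from the seeds
    "repeat": 7.0,           # recrawl every 7 days
    "maxRounds": 4,          # stop after four rounds instead of repeating forever
    "notifyWebhook": "https://yourserver.example/diffbot-hook",  # illustrative URL
}

# requests form-encodes a dict passed via `data` and sends it as
# application/x-www-form-urlencoded, which is what this endpoint expects.
response = requests.post(
    "https://api.diffbot.com/v3/crawl",
    params={"token": "<YOUR DIFFBOT TOKEN>"},
    data=payload,
)
print(response.json())
```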
Additional Crawl Settings
Optional parameters that apply to just the crawling/spidering component of a crawl job. (See The Difference Between Crawling and Processing.)
| Argument | Description |
|---|---|
| urlCrawlPattern | Specify `\|\|`-separated strings to limit pages crawled to those whose URLs contain any of the content strings. You can use the exclamation point to specify a negative string, e.g. `!product` to exclude URLs containing the string "product," and the `^` and `$` characters to limit matches to the beginning or end of the URL. |
| urlCrawlRegEx | Specify a regular expression to limit pages crawled to those URLs that contain a match to your expression. This will override any `urlCrawlPattern` value. |
| maxToCrawl | Specify max pages to spider. Default: 100,000. |
| maxToCrawlPerSubdomain | Specify max pages to spider per subdomain. Default: no limit (-1). |
| crawlDelay | Wait this many seconds between each URL crawled from a single IP address. Specify the number of seconds as an integer or floating-point number (e.g., `crawlDelay=0.25`). |
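For example, a crawl can be narrowed to one section of a site with `urlCrawlPattern` and throttled with `crawlDelay`. This sketch assumes a hypothetical site layout with content under /blog/; the job name, pattern, and limits are illustrative.

```python
import requests

# Illustrative payload: spider only /blog/ URLs, skip tag pages, and throttle.
# The job name, pattern, and limits are hypothetical demonstration values.
payload = {
    "name": "blog-crawl",
    "seeds": "https://example.com",
    "apiUrl": "https://api.diffbot.com/v3/analyze",
    # "||" separates alternative strings; "!" marks a negative match
    "urlCrawlPattern": "/blog/||!/blog/tag/",
    "maxToCrawl": 5000,      # stop spidering after 5,000 pages
    "crawlDelay": 0.5,       # wait half a second between fetches from one IP
}

response = requests.post(
    "https://api.diffbot.com/v3/crawl",
    params={"token": "<YOUR DIFFBOT TOKEN>"},
    data=payload,
)
print(response.json())
```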
Additional Processing Settings
Optional parameters that apply to just the processing/extraction component of a crawl job. (See The Difference Between Crawling and Processing.)
| Argument | Description |
|---|---|
| urlProcessPattern | Specify `\|\|`-separated strings to limit pages processed to those whose URLs contain any of the content strings. You can use the exclamation point to specify a negative string, and the `^` and `$` characters to limit matches to the beginning or end of the URL. |
| urlProcessRegEx | Specify a regular expression to limit pages processed to those URLs that contain a match to your expression. This will override any `urlProcessPattern` value. |
| pageProcessPattern | Specify `\|\|`-separated strings to limit pages processed to those whose HTML contains any of the content strings. |
| maxToProcess | Specify max pages to process through Diffbot APIs. Default: 100,000. |
| maxToProcessPerSubdomain | Specify max pages to process per subdomain. Default: no limit (-1). |
| onlyProcessIfNew | By default repeat crawls will only process new (previously unprocessed) pages. Set to 0 (`onlyProcessIfNew=0`) to process all content on repeat crawls. |
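These settings let a job spider widely while processing selectively. The sketch below crawls a whole site but sends only date-stamped URLs to the Article Extract API; the regular expression and limits are illustrative assumptions about the site's URL scheme.

```python
import requests

# Illustrative payload: crawl broadly, but only process URLs containing a
# year-like path segment (e.g. /2023/). Regex and limits are hypothetical.
payload = {
    "name": "article-crawl",
    "seeds": "https://example.com",
    "apiUrl": "https://api.diffbot.com/v3/article",
    "urlProcessRegEx": r"/20\d\d/",   # overrides any urlProcessPattern value
    "maxToProcess": 10000,            # cap Extract API calls at 10,000 pages
    "onlyProcessIfNew": 0,            # reprocess every page on repeat rounds
}

response = requests.post(
    "https://api.diffbot.com/v3/crawl",
    params={"token": "<YOUR DIFFBOT TOKEN>"},
    data=payload,
)
print(response.json())
```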
Response
Upon adding a new crawl, you will receive a success message in the JSON response, in addition to full crawl details:
```json
{
"response": "Successfully added urls for spidering.",
"jobs": [
{
"jobStatus": {
"message": "Job is initializing.",
"status": 0
},
"maxHops": -1,
"downloadJson": "...json",
"urlProcessPattern": "",
"jobCompletionTimeUTC": 0,
"maxRounds": -1,
"type": "bulk",
"pageCrawlSuccessesThisRound": 0,
"urlCrawlRegEx": "",
"pageProcessPattern": "",
"apiUrl": "https://api.diffbot.com/v3/analyze",
"useCanonical": 1,
"jobCreationTimeUTC": 1649950325,
"repeat": 0,
"downloadUrls": "...csv",
"obeyRobots": 1,
"roundsCompleted": 0,
"pageCrawlAttempts": 0,
"notifyWebhook": "",
"pageProcessSuccessesThisRound": 0,
"customHeaders": {},
"objectsFound": 0,
"roundStartTime": 0,
"urlCrawlPattern": "",
"seedRecrawlFrequency": -1,
"urlProcessRegEx": "",
"pageProcessSuccesses": 0,
"urlsHarvested": 0,
"crawlDelay": -1,
"currentTime": 1649950325,
"useProxies": 0,
"sentJobDoneNotification": 0,
"currentTimeUTC": 1649950325,
"name": "bulkTest",
"notifyEmail": "",
"pageCrawlSuccesses": 0,
"pageProcessAttempts": 0
}
]
}
```

Please note that if you get a "Too Many Collections" error, you have hit the 1,000-crawl limit.
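The `jobs` array carries the fields shown above for each job. As a minimal sketch, assuming `response` is the object returned by the Python quickstart, the job's status can be read like this:

```python
# Pull the status block out of the creation response shown above.
job = response.json()["jobs"][0]
status = job["jobStatus"]
print(f"{job['name']}: {status['message']} (status code {status['status']})")
```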