Create and start a job to spider a site and extract its pages.
To create a crawl, make a POST request to this endpoint with Content-Type set to application/x-www-form-urlencoded and include at minimum the required settings described below. Please note that this API does not accept JSON payloads.
Creating a crawl job instructs Diffbot to immediately start spidering the provided seed URLs for links and processing the pages it finds with the specified Extract API.
Additional settings are available to restrict crawling to links that match a certain URL pattern, or to process only a subset of the crawled pages.
Quickstart Examples
These request examples initiate a crawl on https://example.com using the Analyze API. Because it is a one-page site, the crawl should complete within seconds.
import requests
import json

# Replace <YOUR DIFFBOT TOKEN> with your Diffbot API token
url = "https://api.diffbot.com/v3/crawl?token=<YOUR DIFFBOT TOKEN>"

# Minimum settings for a simple crawl job, sent as form-encoded fields (not JSON)
payload = {
    "name": "test-crawl",
    "seeds": "https://example.com",
    "apiUrl": "https://api.diffbot.com/v3/analyze",
    "maxToCrawl": 100
}
headers = {
    "Content-Type": "application/x-www-form-urlencoded"
}

response = requests.post(url, headers=headers, data=payload)
print(json.dumps(response.json(), indent=4))
const headers = new Headers();
headers.append("Content-Type", "application/x-www-form-urlencoded");

// Minimum settings for a simple crawl job, sent as form-encoded fields (not JSON)
const payload = new URLSearchParams();
payload.append("name", "test-crawl");
payload.append("seeds", "https://example.com");
payload.append("apiUrl", "https://api.diffbot.com/v3/analyze");
payload.append("maxToCrawl", "100");

const requestOptions = {
  method: "POST",
  headers: headers,
  body: payload
};

// Replace <YOUR DIFFBOT TOKEN> with your Diffbot API token
fetch("https://api.diffbot.com/v3/crawl?token=<YOUR DIFFBOT TOKEN>", requestOptions)
  .then((response) => response.json())
  .then((result) => console.log(result))
  .catch((error) => console.error(error));
curl --location 'https://api.diffbot.com/v3/crawl?token=<YOUR DIFFBOT TOKEN>' \
--header 'Content-Type: application/x-www-form-urlencoded' \
--data-urlencode 'name=test-crawl' \
--data-urlencode 'seeds=https://example.com' \
--data-urlencode 'apiUrl=https://api.diffbot.com/v3/analyze' \
--data-urlencode 'maxToCrawl=100'
Required Settings
Due to the breadth of available settings and options for Crawl, the built-in API tester lists only the required settings for a simple Crawl job. Additional settings are described in the sections below.
Additional Settings
Optional parameters that apply to both the crawling and processing components of a Crawl job. A combined example follows the table.
Argument | Description |
---|---|
customHeaders | Set multiple custom headers to be used while crawling and processing pages sent to Diffbot APIs. Each header should be sent in its own customHeaders argument, with a colon delimiting the header name and value, and should be URL-encoded. For example, &customHeaders=Accept-Language%3Aen-us. See more on using custom headers. |
useCanonical | Pass useCanonical=0 to disable deduplication of pages based on a canonical link definition. See more. |
obeyRobots | Pass obeyRobots=0 to ignore a site's robots.txt instructions. |
restrictDomain | Pass restrictDomain=0 to allow limited crawling across subdomains/domains. See more. |
useProxies | Set value to 1 to force the use of proxy IPs for the crawl. This will utilize proxy servers for both crawling and processing of pages. |
maxHops | Specify the depth of your crawl. maxHops=0 limits processing to the seed URL(s) only; no other links will be processed. maxHops=1 processes all (otherwise matching) pages whose links appear on the seed URL(s); maxHops=2 processes pages whose links appear on those pages; and so on. By default (maxHops=-1), Crawl will crawl and process links at any depth. |
notifyEmail | Send a message to this email address when the crawl hits the maxToCrawl or maxToProcess limit, or when the crawl completes. |
notifyWebhook | Pass a URL to be notified when the crawl hits the maxToCrawl or maxToProcess limit, or when the crawl completes. You will receive a POST with X-Crawl-Name and X-Crawl-Status in the headers, and the job's JSON metadata in the POST body. Note that in webhook POSTs the parent jobs will not be sent—only the individual job object will be returned. |
repeat | Specify the number of days between repeat crawls as a floating-point value (e.g., repeat=7.0). By default crawls will not be repeated. |
seedRecrawlFrequency | Specify a frequency, in days, at which to recrawl seed URLs, independent of the overall recrawl frequency given by repeat. Defaults to seedRecrawlFrequency=-1 (use the default frequency). |
maxRounds | Specify the maximum number of crawl repeats. By default (maxRounds=0) repeating crawls will continue indefinitely. |
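As a rough sketch building on the Quickstart request above, the optional settings in this table are sent as additional form fields alongside the required ones. The notification address and the specific values below are illustrative assumptions, not defaults.

```python
import requests

# Same endpoint and token placeholder as the Quickstart example
url = "https://api.diffbot.com/v3/crawl?token=<YOUR DIFFBOT TOKEN>"

payload = {
    # Required settings
    "name": "test-crawl",
    "seeds": "https://example.com",
    "apiUrl": "https://api.diffbot.com/v3/analyze",
    # Optional settings from the table above (illustrative values)
    "maxHops": 1,                               # only process pages linked from the seed URL(s)
    "customHeaders": "Accept-Language:en-us",   # one header per customHeaders field; requests URL-encodes form values
    "notifyEmail": "you@example.com",           # hypothetical notification address
    "repeat": 7.0,                              # recrawl every 7 days
    "maxRounds": 4,                             # stop after 4 rounds
}

response = requests.post(url, data=payload)
print(response.json())
```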
Additional Crawl Settings
Optional parameters that apply only to the crawling/spidering component of a crawl job (see The Difference Between Crawling and Processing). A sketch combining several of these settings follows the table.
Argument | Description |
---|---|
urlCrawlPattern | Specify ||-separated strings to limit pages crawled to those whose URLs contain any of the content strings. You can use the exclamation point to specify a negative string, e.g. !product to exclude URLs containing the string "product," and the ^ and $ characters to limit matches to the beginning or end of the URL. The use of a urlCrawlPattern will allow Crawl to spider outside of the seed domain; it will follow all matching URLs regardless of domain. |
urlCrawlRegEx | Specify a regular expression to limit pages crawled to those URLs that contain a match to your expression. This will override any urlCrawlPattern value. |
maxToCrawl | Specify max pages to spider. Default: 100,000. |
maxToCrawlPerSubdomain | Specify max pages to spider per subdomain. Default: no limit (-1) |
crawlDelay | Wait this many seconds between each URL crawled from a single IP address. Specify the number of seconds as an integer or floating-point number (e.g., crawlDelay=0.25 ). |
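For illustration only (the /blog/ and /tag/ path segments are assumed target-site structure, not values from this page), the settings above can limit spidering to one section of a site:

```python
import requests

url = "https://api.diffbot.com/v3/crawl?token=<YOUR DIFFBOT TOKEN>"

payload = {
    "name": "blog-crawl",                       # hypothetical job name
    "seeds": "https://example.com",
    "apiUrl": "https://api.diffbot.com/v3/analyze",
    # Spider only URLs containing "/blog/", but skip any containing "/tag/"
    "urlCrawlPattern": "/blog/||!/tag/",
    "maxToCrawl": 10000,                        # stop spidering after 10,000 pages
    "crawlDelay": 0.25,                         # wait 0.25 seconds between requests from a single IP
}

response = requests.post(url, data=payload)
print(response.json())
```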
Additional Processing Settings
Optional parameters that apply only to the processing/extraction component of a crawl job (see The Difference Between Crawling and Processing). A sketch follows the table.
Argument | Description |
---|---|
urlProcessPattern | Specify ||-separated strings to limit pages processed to those whose URLs contain any of the content strings. You can use the exclamation point to specify a negative string, e.g. !/category to exclude URLs containing the string "/category," and the ^ and $ characters to limit matches to the beginning or end of the URL. |
urlProcessRegEx | Specify a regular expression to limit pages processed to those URLs that contain a match to your expression. This will override any urlProcessPattern value. |
pageProcessPattern | Specify ||-separated strings to limit pages processed to those whose HTML contains any of the content strings. |
maxToProcess | Specify max pages to process through Diffbot APIs. Default: 100,000. |
maxToProcessPerSubdomain | Specify max pages to process per subdomain. Default: no limit (-1) |
onlyProcessIfNew | By default repeat crawls will only process new (previously unprocessed) pages. Set to 0 (onlyProcessIfNew=0 ) to process all content on repeat crawls. |
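Continuing the hypothetical blog crawl above, the processing filters can be narrower than the crawl filters, so the job spiders the whole blog section but only sends individual posts to the Extract API. The /blog/posts/ path and the <article pattern are assumptions about the target site, not Diffbot defaults.

```python
import requests

url = "https://api.diffbot.com/v3/crawl?token=<YOUR DIFFBOT TOKEN>"

payload = {
    "name": "blog-crawl",                       # hypothetical job name
    "seeds": "https://example.com",
    "apiUrl": "https://api.diffbot.com/v3/analyze",
    # Spider every URL under /blog/ ...
    "urlCrawlPattern": "/blog/",
    # ... but only process URLs that look like individual posts,
    # and only if the fetched HTML contains an <article> element
    "urlProcessPattern": "/blog/posts/",
    "pageProcessPattern": "<article",
    "maxToProcess": 5000,                       # cap the number of pages sent to the Extract API
}

response = requests.post(url, data=payload)
print(response.json())
```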
Response
Upon adding a new crawl, you will receive a success message in the JSON response, in addition to full crawl details:
{
"response": "Successfully added urls for spidering.",
"jobs": [
{
"jobStatus": {
"message": "Job is initializing.",
"status": 0
},
"maxHops": -1,
"downloadJson": "...json",
"urlProcessPattern": "",
"jobCompletionTimeUTC": 0,
"maxRounds": -1,
"type": "bulk",
"pageCrawlSuccessesThisRound": 0,
"urlCrawlRegEx": "",
"pageProcessPattern": "",
"apiUrl": "https://api.diffbot.com/v3/analyze",
"useCanonical": 1,
"jobCreationTimeUTC": 1649950325,
"repeat": 0,
"downloadUrls": "...csv",
"obeyRobots": 1,
"roundsCompleted": 0,
"pageCrawlAttempts": 0,
"notifyWebhook": "",
"pageProcessSuccessesThisRound": 0,
"customHeaders": {},
"objectsFound": 0,
"roundStartTime": 0,
"urlCrawlPattern": "",
"seedRecrawlFrequency": -1,
"urlProcessRegEx": "",
"pageProcessSuccesses": 0,
"urlsHarvested": 0,
"crawlDelay": -1,
"currentTime": 1649950325,
"useProxies": 0,
"sentJobDoneNotification": 0,
"currentTimeUTC": 1649950325,
"name": "bulkTest",
"notifyEmail": "",
"pageCrawlSuccesses": 0,
"pageProcessAttempts": 0
}
]
}
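A minimal sketch of reading this response in Python, assuming response is the requests.Response object from the Quickstart example; it only touches fields shown in the sample above.

```python
# Assumes `response` is the requests.Response from the Quickstart example
data = response.json()

if "jobs" in data:
    job = data["jobs"][0]
    print(f'Created job "{job["name"]}": {job["jobStatus"]["message"]} '
          f'(status code {job["jobStatus"]["status"]})')
else:
    # Inspect the raw response for errors such as "Too Many Collections"
    print("Crawl was not created:", data)
```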
Please note that if you receive a "Too Many Collections" error, you have reached the limit of 1,000 crawl jobs.