Create a Crawl

Create and start a job that spiders a site and extracts its pages.

To create a crawl, make a POST request to this endpoint with Content-Type set to application/x-www-form-urlencoded and include the minimum settings specified below. Please note that this API does not accept JSON payloads.

Creating a crawl job will instruct Diffbot to immediately start spidering through the provided seed URLs for links and process them with a specified Extract API.

Additional settings are available to crawl only links that match a certain URL pattern, or extract only some crawled links.

Quickstart Examples

These request examples will initiate a crawl on https://example.com using the Analyze API. This is a one-page site, so crawling should complete within seconds.

import requests
import json

url = "https://api.diffbot.com/v3/crawl?token=<YOUR DIFFBOT TOKEN>"

payload = {
  "name": "test-crawl",
  "seeds": "https://example.com",
  "apiUrl": "https://api.diffbot.com/v3/analyze",
  "maxToCrawl": 100
}

headers = {
  'Content-Type': 'application/x-www-form-urlencoded'
}

response = requests.request("POST", url, headers=headers, data=payload)

print(json.dumps(response.json(), indent=4))

const headers = new Headers();
headers.append("Content-Type", "application/x-www-form-urlencoded");

const payload = new URLSearchParams();
payload.append("name", "test-crawl");
payload.append("seeds", "https://example.com");
payload.append("apiUrl", "https://api.diffbot.com/v3/analyze");
payload.append("maxToCrawl", 100);

const requestOptions = {
  method: "POST",
  headers: headers,
  body: payload
};

fetch("https://api.diffbot.com/v3/crawl?token=<YOUR DIFFBOT TOKEN>", requestOptions)
  .then((response) => response.json())
  .then((result) => console.log(result))
  .catch((error) => console.error(error));

curl --location 'https://api.diffbot.com/v3/crawl?token=<YOUR DIFFBOT TOKEN>' \
--header 'Content-Type: application/x-www-form-urlencoded' \
--data-urlencode 'name=test-crawl' \
--data-urlencode 'seeds=https://example.com' \
--data-urlencode 'apiUrl=https://api.diffbot.com/v3/analyze' \
--data-urlencode 'maxToCrawl=100'

Required Settings

Due to the breadth of available settings and options for Crawl, the built-in API tester below lists only the settings required for a simple Crawl job. The remaining, optional settings are described in the sections that follow.

Additional Settings

Optional parameters that apply to both the crawling and processing components of a Crawl job.

customHeaders: Set multiple custom headers to be used while crawling and processing pages sent to Diffbot APIs. Each header should be sent in its own customHeaders argument, with a colon delimiting the header name and value, and should be URL-encoded. For example, &customHeaders=Accept-Language%3Aen-us. See more on using custom headers. (An example request follows this list.)
useCanonical: Pass useCanonical=0 to disable deduplication of pages based on a canonical link definition. See more.
obeyRobots: Pass obeyRobots=0 to ignore a site's robots.txt instructions.
restrictDomain: Pass restrictDomain=0 to allow limited crawling across subdomains/domains. See more.
useProxies: Set the value to 1 to force the use of proxy IPs for the crawl. This will utilize proxy servers for both crawling and processing of pages.
maxHops: Specify the depth of your crawl. maxHops=0 limits processing to the seed URL(s) only; no other links will be processed. maxHops=1 processes all (otherwise matching) pages whose links appear on the seed URL(s); maxHops=2 processes pages whose links appear on those pages; and so on. By default (maxHops=-1), Crawl will crawl and process links at any depth.
notifyEmail: Send a message to this email address when the crawl hits the maxToCrawl or maxToProcess limit, or when the crawl completes.
notifyWebhook: Pass a URL to be notified when the crawl hits the maxToCrawl or maxToProcess limit, or when the crawl completes. You will receive a POST with X-Crawl-Name and X-Crawl-Status in the headers, and the job's JSON metadata in the POST body. Note that in webhook POSTs the parent jobs will not be sent; only the individual job object will be returned. (A minimal receiver sketch also follows this list.)
repeat: Specify the number of days as a floating-point value (e.g. repeat=7.0) at which to repeat this crawl. By default crawls will not be repeated.
seedRecrawlFrequency: Specify a frequency, in number of days, at which to recrawl seed URLs; this is independent of the overall recrawl frequency given by repeat. Defaults to seedRecrawlFrequency=-1 to use the default frequency.
maxRounds: Specify the maximum number of crawl repeats. By default (maxRounds=0) repeating crawls will continue indefinitely.
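
As an illustrative sketch (not part of the reference above), the request below extends the quickstart payload with a few of these optional settings. The crawl name, header values, email address, and webhook URL are placeholder assumptions; everything else follows the quickstart examples.

import requests

url = "https://api.diffbot.com/v3/crawl?token=<YOUR DIFFBOT TOKEN>"

payload = {
  "name": "docs-crawl",
  "seeds": "https://example.com",
  "apiUrl": "https://api.diffbot.com/v3/analyze",
  # Repeated form field: each custom header goes in its own customHeaders argument.
  "customHeaders": ["Accept-Language:en-us", "Cache-Control:no-cache"],
  "maxHops": 2,                          # follow links at most two hops from the seeds
  "repeat": 7.0,                         # re-run the crawl every 7 days
  "notifyEmail": "crawls@example.com",   # placeholder address
  "notifyWebhook": "https://example.com/crawl-webhook"  # placeholder URL
}

headers = {
  'Content-Type': 'application/x-www-form-urlencoded'
}

# requests URL-encodes the form fields, including the colon in each customHeaders value.
response = requests.post(url, headers=headers, data=payload)

print(response.json())

If you register a notifyWebhook, Diffbot will POST to it with X-Crawl-Name and X-Crawl-Status headers and the job's JSON metadata in the body, as described above. A minimal receiver sketch, using only the Python standard library and a hypothetical port, might look like this:

import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class CrawlWebhook(BaseHTTPRequestHandler):
    def do_POST(self):
        # Crawl name and status arrive in the request headers.
        name = self.headers.get("X-Crawl-Name")
        status = self.headers.get("X-Crawl-Status")
        # The job's JSON metadata arrives in the POST body.
        length = int(self.headers.get("Content-Length", 0))
        job = json.loads(self.rfile.read(length) or b"{}")
        print(f"Crawl {name} reported status {status}: {job.get('jobStatus')}")
        self.send_response(200)
        self.end_headers()

if __name__ == "__main__":
    # The public URL of this server is what you would pass as notifyWebhook.
    HTTPServer(("", 8000), CrawlWebhook).serve_forever()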

Additional Crawl Settings

Optional parameters that apply to just the crawling/spidering component of a crawl job. (See The Difference Between Crawling and Processing.)

urlCrawlPattern: Specify ||-separated strings to limit pages crawled to those whose URLs contain any of the content strings. You can use the exclamation point to specify a negative string, e.g. !product to exclude URLs containing the string "product", and the ^ and $ characters to limit matches to the beginning or end of the URL. The use of a urlCrawlPattern will allow Crawl to spider outside of the seed domain; it will follow all matching URLs regardless of domain. (See the example after this list.)
urlCrawlRegEx: Specify a regular expression to limit pages crawled to those URLs that contain a match to your expression. This will override any urlCrawlPattern value.
maxToCrawl: Specify the maximum number of pages to spider. Default: 100,000.
maxToCrawlPerSubdomain: Specify the maximum number of pages to spider per subdomain. Default: no limit (-1).
crawlDelay: Wait this many seconds between each URL crawled from a single IP address. Specify the number of seconds as an integer or floating-point number (e.g. crawlDelay=0.25).
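
For instance, a crawl-side sketch (the pattern and limits here are hypothetical): the payload below restricts spidering to a site's blog section, excludes tag pages, and throttles the crawler. It is posted to the same endpoint, with the same headers, as the quickstart examples.

import requests

payload = {
  "name": "blog-crawl",
  "seeds": "https://example.com",
  "apiUrl": "https://api.diffbot.com/v3/analyze",
  "urlCrawlPattern": "/blog/||!/tag/",  # crawl URLs containing "/blog/" but not "/tag/"
  "maxToCrawl": 5000,                   # stop spidering after 5,000 pages
  "crawlDelay": 0.5                     # wait half a second between URLs from a single IP
}

response = requests.post(
  "https://api.diffbot.com/v3/crawl?token=<YOUR DIFFBOT TOKEN>",
  headers={'Content-Type': 'application/x-www-form-urlencoded'},
  data=payload
)

print(response.json())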

Additional Processing Settings

Optional parameters that apply to just the processing/extraction component of a crawl job. (See The Difference Between Crawling and Processing.)

urlProcessPattern: Specify ||-separated strings to limit pages processed to those whose URLs contain any of the content strings. You can use the exclamation point to specify a negative string, e.g. !/category to exclude URLs containing the string "/category", and the ^ and $ characters to limit matches to the beginning or end of the URL. (See the example after this list.)
urlProcessRegEx: Specify a regular expression to limit pages processed to those URLs that contain a match to your expression. This will override any urlProcessPattern value.
pageProcessPattern: Specify ||-separated strings to limit pages processed to those whose HTML contains any of the content strings.
maxToProcess: Specify the maximum number of pages to process through Diffbot APIs. Default: 100,000.
maxToProcessPerSubdomain: Specify the maximum number of pages to process per subdomain. Default: no limit (-1).
onlyProcessIfNew: By default, repeat crawls will only process new (previously unprocessed) pages. Set to 0 (onlyProcessIfNew=0) to process all content on repeat crawls.
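
Likewise, a processing-side sketch with hypothetical patterns: the payload below spiders the whole site but only sends pages that look like product pages to the Extract API, and caps how many are processed.

import requests

payload = {
  "name": "products-crawl",
  "seeds": "https://example.com",
  "apiUrl": "https://api.diffbot.com/v3/analyze",
  "urlProcessPattern": "/product/",     # process only URLs containing "/product/"
  "pageProcessPattern": "add-to-cart",  # ...and only pages whose HTML contains this string
  "maxToProcess": 10000                 # cap pages sent to the Extract API
}

response = requests.post(
  "https://api.diffbot.com/v3/crawl?token=<YOUR DIFFBOT TOKEN>",
  headers={'Content-Type': 'application/x-www-form-urlencoded'},
  data=payload
)

print(response.json())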

Response

Upon adding a new crawl, you will receive a success message in the JSON response, in addition to full crawl details:

{
    "response": "Successfully added urls for spidering.",
    "jobs": [
        {
            "jobStatus": {
                "message": "Job is initializing.",
                "status": 0
            },
            "maxHops": -1,
            "downloadJson": "...json",
            "urlProcessPattern": "",
            "jobCompletionTimeUTC": 0,
            "maxRounds": -1,
            "type": "bulk",
            "pageCrawlSuccessesThisRound": 0,
            "urlCrawlRegEx": "",
            "pageProcessPattern": "",
            "apiUrl": "https://api.diffbot.com/v3/analyze",
            "useCanonical": 1,
            "jobCreationTimeUTC": 1649950325,
            "repeat": 0,
            "downloadUrls": "...csv",
            "obeyRobots": 1,
            "roundsCompleted": 0,
            "pageCrawlAttempts": 0,
            "notifyWebhook": "",
            "pageProcessSuccessesThisRound": 0,
            "customHeaders": {},
            "objectsFound": 0,
            "roundStartTime": 0,
            "urlCrawlPattern": "",
            "seedRecrawlFrequency": -1,
            "urlProcessRegEx": "",
            "pageProcessSuccesses": 0,
            "urlsHarvested": 0,
            "crawlDelay": -1,
            "currentTime": 1649950325,
            "useProxies": 0,
            "sentJobDoneNotification": 0,
            "currentTimeUTC": 1649950325,
            "name": "bulkTest",
            "notifyEmail": "",
            "pageCrawlSuccesses": 0,
            "pageProcessAttempts": 0
        }
    ]
}
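
Since the new job's details are returned under the jobs array, a quick way to confirm the crawl was accepted is to read the first job's status. This sketch assumes response is the requests response object from the Python quickstart example above.

result = response.json()
job = result["jobs"][0]

print(result["response"])                         # "Successfully added urls for spidering."
print(job["name"], "-", job["jobStatus"]["message"])  # e.g. test-crawl - Job is initializing.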

Please note that if you receive a "Too Many Collections" error, you have hit the 1,000-crawl limit.
