Manage a Crawl Job

Pause, delete, restart, or view the status of a crawl job.

A single endpoint handles both control and status requests for any of the active crawl jobs associated with a given token.

View the Status of Crawl Jobs

Your token's active crawl jobs (along with any active bulk jobs) are returned in a jobs object when you make a GET request to this endpoint supplying only a token parameter.

Note that when only a token is supplied, this endpoint returns exactly the same output as its Bulk Job equivalent.

curl --request GET \
     --url 'https://api.diffbot.com/v3/crawl?token=YOURDIFFBOTTOKEN' \
     --header 'Accept: application/json'

To retrieve a single crawl job's details, provide the job's name in addition to your token in your request.
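For example, to retrieve the status of a job named crawlTest (the same sample job name used in the requests below):

curl --request GET \
     --url 'https://api.diffbot.com/v3/crawl?token=YOURDIFFBOTTOKEN&name=crawlTest' \
     --header 'Accept: application/json'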

Pause a Crawl Job

To pause a crawl job, send a GET request to this endpoint supplying your token, the name of the crawl job to pause, and the pause parameter set to 1.

curl --request GET \
     --url 'https://api.diffbot.com/v3/crawl?token=YOURDIFFBOTTOKEN&name=crawlTest&pause=1' \
     --header 'Accept: application/json'

To resume a paused crawl job, pass pause=0 in the same GET request.
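For example:

curl --request GET \
     --url 'https://api.diffbot.com/v3/crawl?token=YOURDIFFBOTTOKEN&name=crawlTest&pause=0' \
     --header 'Accept: application/json'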

Delete a Crawl Job

To delete a crawl job, send a GET request to this endpoint supplying your token, the name of the crawl job to delete, and the delete parameter set to 1. Job deletions are irreversible.

curl --request GET \
     --url 'https://api.diffbot.com/v3/crawl?token=YOURDIFFBOTTOKEN&name=crawlTest&delete=1' \
     --header 'Accept: application/json'

Restart a Crawl Job

To restart a crawl job, send a GET request to this endpoint supplying your token, the name of the crawl job to restart, and the restart parameter set to 1. This will erase all previously processed data and re-process all of the submitted URLs.

curl --request GET \
     --url 'https://api.diffbot.com/v3/crawl?token=YOURDIFFBOTTOKEN&name=crawlTest&restart=1' \
     --header 'Accept: application/json'

Response

All requests will return a JSON response. The following is a sample response.

{
  "jobs": [
    {
      "name": "crawlJob",
      "type": "crawl",
      "jobCreationTimeUTC": 1427410692,
      "jobCompletionTimeUTC": 1427410798,
      "jobStatus": {
        "status": 9,
        "message": "Job has completed and no repeat is scheduled."
      },
      "sentJobDoneNotification": 1,
      "objectsFound": 177,
      "urlsHarvested": 2152,
      "pageCrawlAttempts": 367,
      "pageCrawlSuccesses": 365,
      "pageCrawlSuccessesThisRound": 365,
      "pageProcessAttempts": 210,
      "pageProcessSuccesses": 210,
      "pageProcessSuccessesThisRound": 210,
      "maxRounds": 0,
      "repeat": 0.0,
      "crawlDelay": 0.25,
      "obeyRobots": 1,
      "maxToCrawl": 100000,
      "maxToProcess": 100000,
      "onlyProcessIfNew": 1,
      "seeds": "http://docs.diffbot.com",
      "roundsCompleted": 0,
      "roundStartTime": 0,
      "currentTime": 1443822683,
      "currentTimeUTC": 1443822683,
      "apiUrl": "https://api.diffbot.com/v3/analyze",
      "urlCrawlPattern": "",
      "urlProcessPattern": "",
      "pageProcessPattern": "",
      "urlCrawlRegEx": "",
      "urlProcessRegEx": "",
      "maxHops": -1,
      "downloadJson": "http://api.diffbot.com/v3/crawl/download/sampletoken-crawlJob_data.json",
      "downloadUrls": "http://api.diffbot.com/v3/crawl/download/sampletoken-crawlJob_urls.csv",
      "notifyEmail": "[email protected]",
      "notifyWebhook": "http://www.diffbot.com"
    }
  ]
}
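The downloadJson and downloadUrls fields contain links to the job's output. As a minimal sketch, the full JSON output could be fetched with curl and saved locally (the download URL below is the sample value from the response above, not a real job):

curl --request GET \
     --url 'http://api.diffbot.com/v3/crawl/download/sampletoken-crawlJob_data.json' \
     --output crawlJob_data.json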

Status Codes

The jobStatus object will return the following status codes and associated messages:

Status  Message
0       Job is initializing
1       Job has reached maxRounds limit
2       Job has reached maxToCrawl limit
3       Job has reached maxToProcess limit
4       Next round to start in _ seconds
5       No URLs were added to the crawl
6       Job paused
7       Job in progress
8       All crawling temporarily paused by root administrator for maintenance
9       Job has completed and no repeat is scheduled
10      Failed to crawl any seed (indicates a problem retrieving links from the seed URL(s))
11      Job automatically paused because crawl is inefficient: 10,000+ consecutive pages successfully downloaded without a single successfully processed page