The Difference Between Crawling and Extraction

Crawl serves as both a web spider (following and collecting links) and as the controller of Diffbot’s Extract APIs. These two functions correspond to the crawling and extraction (processing) settings of the Crawl service.

Page Crawling

A page is said to be crawled when it is evaluated for additional links to follow, or for links to be processed. A crawled page is not automatically processed by a Diffbot Extract API unless the page also matches the processing criteria. (If there are no crawling or processing restrictions, then all pages will be both crawled and processed.)

Crawled pages can be limited based on the following:

Crawl Patterns: only pages whose URLs match one of the pattern strings will be spidered for additional links.
Crawl Regular Expression: only pages whose URLs match the regular expression will be crawled for additional links.

The above options can be combined; however, the Crawl Regular Expression takes precedence over Crawl Patterns in the event of conflicting rules.
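The crawl restrictions above can be sketched as a small decision function. This is an illustration only, not Diffbot's implementation: the function name should_crawl and its parameters are hypothetical, pattern matching is assumed to be a plain substring test, and the regular expression is assumed to match anywhere in the URL.

```python
import re
from typing import Optional, Sequence


def should_crawl(
    url: str,
    crawl_patterns: Sequence[str] = (),
    crawl_regex: Optional[str] = None,
) -> bool:
    """Decide whether a page is spidered for additional links (hypothetical helper)."""
    # A Crawl Regular Expression takes precedence over Crawl Patterns.
    if crawl_regex:
        # Assumption: the expression may match anywhere in the URL.
        return re.search(crawl_regex, url) is not None
    # Otherwise the URL must match at least one pattern string.
    if crawl_patterns:
        # Assumption: "match" means the pattern occurs as a substring.
        return any(pattern in url for pattern in crawl_patterns)
    # No crawl restriction: every page is crawled.
    return True
```

For example, with crawl_patterns=["/blog/"] a URL under /shop/ would be skipped, while adding crawl_regex=r"/shop/" would let it through because the regular expression replaces the pattern check in this sketch.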

Page Extraction (Processing)

A web page is said to be extracted if it is submitted to an Extract API. This can either be an Automatic Extract API (Article, Product, Image, etc.) or a Custom API.

Crawl determines which URLs to hand off for extraction based on the following rules:

By default, if there is no extraction restriction, every spidered/crawled page will also be extracted by your specified extraction API.
If you enter a Crawl Pattern (or Crawl Regular Expression) and no Processing Pattern (or Processing Regular Expression), your Crawl Pattern (or Crawl Regular Expression) will apply to both crawling and extraction.
If you enter one or more Processing Patterns, only pages whose URLs match (at least one of) your patterns will be processed.
If you enter a Processing Regular Expression, only pages whose URLs match the regular expression will be processed.
If you enter one or more HTML Processing Patterns, only pages whose markup contains (one of) your patterns will be processed.

HTML Processing Patterns can be combined with Processing Patterns or Regular Expressions; however, a Processing Regular Expression will override a Processing Pattern.
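The processing rules above can be read as the following sketch. It is one interpretation under stated assumptions, not Diffbot's actual logic: the names url_rule_matches and should_process are hypothetical, patterns are assumed to be substring tests, and "combined" is assumed to mean that a page must satisfy both the URL rule and the HTML rule.

```python
import re
from typing import Optional, Sequence


def url_rule_matches(url: str, patterns: Sequence[str], regex: Optional[str]) -> bool:
    """Apply a URL rule in which a regular expression overrides patterns."""
    if regex:
        return re.search(regex, url) is not None
    if patterns:
        return any(pattern in url for pattern in patterns)
    return True  # no rule configured means no restriction


def should_process(
    url: str,
    html: str,
    *,
    crawl_patterns: Sequence[str] = (),
    crawl_regex: Optional[str] = None,
    process_patterns: Sequence[str] = (),
    process_regex: Optional[str] = None,
    html_patterns: Sequence[str] = (),
) -> bool:
    """Decide whether a page is handed to an Extract API (hypothetical helper)."""
    if process_patterns or process_regex:
        # Explicit processing rules; the regular expression overrides patterns.
        url_ok = url_rule_matches(url, process_patterns, process_regex)
    else:
        # No URL processing rule: the crawl rule acts for extraction too.
        url_ok = url_rule_matches(url, crawl_patterns, crawl_regex)
    # Assumption: HTML patterns are an additional condition on the markup.
    html_ok = not html_patterns or any(p in html for p in html_patterns)
    return url_ok and html_ok
```

A typical setup would crawl category pages but process only product pages, for instance crawl_patterns=["/category/"] together with process_patterns=["/product/"].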

A page does not have to be crawled to be extracted, but its URL must at least appear on a crawled page, and it must match a processing pattern, in order to be discovered and handed off to an Extract API.
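To make the distinction between discovery and crawling concrete, here is a toy simulation over an in-memory link graph. Everything in it is hypothetical (the run_crawl function, the example.com URLs, and the substring-based patterns); it only illustrates that a page can be processed without being crawled, and that links on an uncrawled page are never discovered.

```python
from collections import deque
from typing import Dict, List, Tuple


def run_crawl(
    seed: str,
    links: Dict[str, List[str]],
    crawl_pattern: str,
    process_pattern: str,
) -> Tuple[List[str], List[str]]:
    """Walk a toy site; `links` maps each URL to the URLs found on that page."""
    crawled: List[str] = []
    processed: List[str] = []
    seen = {seed}
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        # A discovered URL is processed when it matches the processing pattern.
        if process_pattern in url:
            processed.append(url)
        # Only pages matching the crawl pattern are evaluated for more links.
        if crawl_pattern not in url:
            continue
        crawled.append(url)
        for link in links.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return crawled, processed


# Hypothetical site: category pages link to products; one product links to another.
SITE = {
    "https://example.com/category/": [
        "https://example.com/category/shoes",
        "https://example.com/about",
    ],
    "https://example.com/category/shoes": [
        "https://example.com/product/1",
        "https://example.com/product/2",
    ],
    "https://example.com/product/1": ["https://example.com/product/3"],
}

crawled, processed = run_crawl(
    "https://example.com/category/", SITE, "/category/", "/product/"
)
```

In this run, product/1 and product/2 are processed because they appear on a crawled category page, while product/3 is never found: it is linked only from product/1, which is processed but not crawled.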

Crawl will restrict its crawl to the same subdomain(s) as your seed URL(s), unless your seed lacks a subdomain (e.g. diffbot.com) or uses the ‘www’ subdomain, in which case the entire domain will be spidered. See Restricting Crawls to Domains and Subdomains for more information.
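The subdomain rule can be approximated as follows. This is a simplified sketch, not Diffbot's implementation: in_crawl_scope is a hypothetical name, and a host is assumed to "lack a subdomain" when it has exactly two labels, which is wrong for multi-part suffixes such as example.co.uk.

```python
from urllib.parse import urlsplit


def in_crawl_scope(seed_url: str, candidate_url: str) -> bool:
    """Check whether a candidate URL falls within the seed's crawl scope."""
    seed_host = (urlsplit(seed_url).hostname or "").lower()
    cand_host = (urlsplit(candidate_url).hostname or "").lower()

    if seed_host.startswith("www."):
        # A 'www' seed opens up the entire domain.
        seed_host = seed_host[len("www."):]
        whole_domain = True
    else:
        # Naive assumption: two labels (e.g. diffbot.com) means no subdomain.
        whole_domain = seed_host.count(".") == 1

    if whole_domain:
        # The bare domain and any of its subdomains are in scope.
        return cand_host == seed_host or cand_host.endswith("." + seed_host)
    # Otherwise only the seed's own subdomain is in scope.
    return cand_host == seed_host
```

Under these assumptions, a seed of https://blog.diffbot.com keeps the crawl on blog.diffbot.com, whereas a seed of https://diffbot.com or https://www.diffbot.com also admits other subdomains of diffbot.com.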