Crawl and Processing Patterns and Regexes

Diffbot Crawl offers many ways to manually narrow or refine the pages crawled or processed by Diffbot Extract APIs.

Patterns ("Crawl" and "Processing")

Patterns allow you to quickly and easily restrict pages crawled or processed based on simple URL string matches.

For example, if a web site organizes its pages under categories — e.g., http://www.example.com/sports/heres-a-sports-article.html — I can instruct Crawl to only crawl pages within the "sports" category by specifying a crawl pattern of /sports/. (Including the slashes is even more precise and makes sure not to match a "sports" string elsewhere in the URL.)

I can also use a crawl pattern if I want to limit crawling to a particular subdomain. For instance, on a crawl starting at https://docs.diffbot.com, I can enter a crawl pattern of docs.diffbot.com to keep Crawl from following links to http://www.diffbot.com and http://blog.diffbot.com.

You can enter multiple patterns to match multiple strings. For instance, to crawl both https://docs.diffbot.com and http://blog.diffbot.com (but not http://www.diffbot.com), I would enter a crawl pattern of:

docs.diffbot.com
blog.diffbot.com

In the Crawl interface, place each individual pattern on a new line. Via the API, separate patterns with a ||.
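As a minimal sketch of the API form, the snippet below joins individual patterns with || and URL-encodes the result. The helper name join_patterns is hypothetical, and the parameter name urlCrawlPattern is an assumption used for illustration; check the Crawl API reference for the exact name before relying on it.

```python
from urllib.parse import urlencode


def join_patterns(patterns):
    """Join individual patterns into the '||'-separated form used via the API."""
    # Drop blank entries, e.g. empty lines copied from the Crawl interface.
    cleaned = [p.strip() for p in patterns if p.strip()]
    return "||".join(cleaned)


crawl_pattern = join_patterns(["docs.diffbot.com", "blog.diffbot.com"])

# "urlCrawlPattern" is an assumed parameter name, shown for illustration only.
query = urlencode({"urlCrawlPattern": crawl_pattern})
print(crawl_pattern)
print(query)
```

URL-encoding matters here because the pipe character is not safe to send unescaped in a query string.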

Limiting Matches to the Beginning of URLs

You can use the caret character (^) to limit pattern matches only to the beginning of a URL. For instance, a processing pattern of:

^https://docs.diffbot.com

...will limit processing only to pages whose URLs begin with https://docs.diffbot.com. This will prevent processing of URLs like http://www.twitter.com/share?tweet=https://docs.diffbot.com.

Negative-Match Patterns

Use an exclamation point (!) to specify a "negative match" when you want to explicitly exclude pages from being crawled or processed. For instance, to crawl all pages except those containing "sports" in the URL, I would enter a crawl pattern of !sports.

When entering multiple patterns, negative matches will override other crawl patterns. That is, a URL with a negative match will be fully ignored, even if another (positive) crawl pattern is also a match.
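The pattern rules described so far (plain substring matches, the caret for beginning-of-URL matches, and negative matches that override positive ones) can be modeled in a few lines. This is only a sketch of the behavior as described here, not Crawl's actual implementation. The function names are hypothetical, and two details are assumptions: matching is treated as case-sensitive, and a caret is also honored inside a negative pattern.

```python
def _hit(url, pattern):
    """Return True if a single (non-negated) pattern matches the URL."""
    if pattern.startswith("^"):
        # Caret: only match at the beginning of the URL.
        return url.startswith(pattern[1:])
    # Default: simple substring match anywhere in the URL.
    return pattern in url


def url_allowed(url, patterns):
    """Model of pattern evaluation: negative matches override positive ones."""
    positives = [p for p in patterns if not p.startswith("!")]
    negatives = [p[1:] for p in patterns if p.startswith("!")]

    # A negative match excludes the URL even if a positive pattern also matches.
    if any(_hit(url, n) for n in negatives):
        return False
    # With only negative patterns, everything not excluded is allowed.
    if not positives:
        return True
    return any(_hit(url, p) for p in positives)


print(url_allowed("http://www.example.com/sports/heres-a-sports-article.html", ["/sports/"]))
print(url_allowed("http://www.twitter.com/share?tweet=https://docs.diffbot.com",
                  ["^https://docs.diffbot.com"]))
print(url_allowed("https://docs.diffbot.com/sports", ["docs.diffbot.com", "!sports"]))
```

The first call allows the URL, while the second and third reject it: the Twitter URL does not begin with the pattern, and the last URL hits the negative match.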

Note that negative matches also work for HTML Processing Patterns.

Regular Expressions (Crawl and Processing Regexes)

If you want complete control over your crawling or processing URL matches, you can write a regular expression to only crawl or process URLs that contain a match to your expression.

For example, to only process pages at https://docs.diffbot.com/ under the "/crawl" path and containing "regex", you could enter a processing regex of:

\/crawl.*?regex

Crawlbot does not rely on any particular existing regular expression implementation. Instead, it uses a custom regular expression engine to ensure the best possible performance while evaluating pages.
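To sanity-check an expression before entering it, you can try it locally against a few sample URLs. The sketch below uses Python's re module, which is not the engine used at crawl time, so treat the result as an approximation. The sample URLs and the function name are hypothetical.

```python
import re

# The processing regex from the example above.
PROCESS_REGEX = re.compile(r"\/crawl.*?regex")


def matches_processing_regex(url):
    """Return True if the URL contains a match for the expression."""
    # search() looks for a match anywhere in the URL, not only at the start.
    return PROCESS_REGEX.search(url) is not None


# Hypothetical sample URLs, for illustration only.
samples = [
    "https://docs.diffbot.com/crawl/regex-guide",
    "https://docs.diffbot.com/crawl/patterns",
    "https://docs.diffbot.com/extract/regex",
]
for url in samples:
    print(url, matches_processing_regex(url))
```

Only the first sample contains both "/crawl" and a later "regex", so it is the only one that matches.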

For character class syntax (the regex feature most commonly used in Crawlbot expressions), Crawlbot supports all of the ASCII character classes in the following table, and most of the Perl/Tcl shortcuts:

Perl/Tcl	ASCII	Description
	[A-Za-z0-9]	Alphanumeric characters
\w	[A-Za-z0-9_]	Alphanumeric characters plus “_”
\W	[^A-Za-z0-9_]	Non-word characters
	[A-Za-z]	Alphabetic characters
	[ \t]	Space and tab
\b	(?<=\W)(?=\w)|(?<=\w)(?=\W)	Word boundaries
	[\x00-\x1F\x7F]	Control characters
\d	[0-9]	Digits
\D	[^0-9]	Non-digits
	[\x21-\x7E]	Visible characters
	[a-z]	Lowercase letters
	[\x20-\x7E]	Visible characters and the space character
	[][!"#$%&'()*+,./:;<=>?@\^_`{|}~-]	Punctuation characters
\s	[ \t\r\n\v\f]	Whitespace characters
\S	[^ \t\r\n\v\f]	Non-whitespace characters
	[A-Z]	Uppercase letters
	[A-Fa-f0-9]	Hexadecimal digits
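The shortcut and ASCII columns are meant to be interchangeable. As a rough illustration, the sketch below checks that each shortcut and its standard ASCII character class (the whitespace class includes the space character) match exactly the same ASCII characters. It runs on Python's re module in ASCII mode, which is an assumption made for demonstration and not the engine used at crawl time.

```python
import re

# Shortcut -> equivalent ASCII character class.
EQUIVALENTS = {
    r"\w": r"[A-Za-z0-9_]",
    r"\W": r"[^A-Za-z0-9_]",
    r"\d": r"[0-9]",
    r"\D": r"[^0-9]",
    r"\s": r"[ \t\r\n\v\f]",
    r"\S": r"[^ \t\r\n\v\f]",
}

ASCII_CHARS = [chr(i) for i in range(128)]


def matched_chars(pattern):
    """Return the set of single ASCII characters that the pattern matches."""
    # re.ASCII restricts \w, \d and \s to their ASCII-only definitions.
    compiled = re.compile(pattern, re.ASCII)
    return {c for c in ASCII_CHARS if compiled.fullmatch(c)}


# Collect any shortcut whose matches differ from its ASCII class.
mismatches = {
    shortcut: ascii_class
    for shortcut, ascii_class in EQUIVALENTS.items()
    if matched_chars(shortcut) != matched_chars(ascii_class)
}
print(mismatches)
```

An empty result means each pair agrees in this engine, so either form can be used when drafting an expression locally.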

Note that crawling and processing regular expressions cannot be used simultaneously with crawling/processing patterns. If both are provided, the crawling/processing patterns will be ignored (regex will take precedence).

HTML Processing Patterns

Crawl offers one more option for limiting pages processed. If you enter an HTML Processing Pattern, only pages whose HTML source contains the exact string will be processed.

Note that Crawl only examines the raw source, and does not execute JavaScript/Ajax at crawl-time.
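The check amounts to an exact substring test against the raw HTML. The sketch below models it under stated assumptions: the function name and sample markup are hypothetical, a single pattern is evaluated, and a leading exclamation point is treated as a negative match, as with URL patterns.

```python
def should_process(raw_html, pattern):
    """Model of an HTML Processing Pattern: exact substring test on raw source."""
    if pattern.startswith("!"):
        # Negative match: process only pages that do NOT contain the string.
        return pattern[1:] not in raw_html
    return pattern in raw_html


# Hypothetical pages: one ships the markup directly, one builds it with JavaScript.
static_page = '<html><body><div class="article-body">Text</div></body></html>'
js_page = '<html><body><div id="root"></div><script src="/app.js"></script></body></html>'

pattern = 'class="article-body"'
print(should_process(static_page, pattern))
print(should_process(js_page, pattern))
```

The second page is skipped even if its script would eventually render the same markup in a browser, because only the raw source is examined.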

Implications for Crawl Performance

While there are many factors that will influence how long it takes to crawl a site, one of the best ways to speed up your crawl is to use crawling and processing patterns or regular expressions to limit Crawl just to the pages you are interested in.