Diffbot Crawl offers many ways to manually narrow or refine the pages crawled or processed by Diffbot Extract APIs.
Patterns allow you to quickly and easily restrict pages crawled or processed based on simple URL string matches.
For example, if a website organizes its pages under categories (e.g., http://www.example.com/sports/heres-a-sports-article.html), I can instruct Crawl to only crawl pages within the "sports" category by specifying a crawl pattern of /sports/. (Including the slashes is more precise: it avoids matching a "sports" string elsewhere in the URL.)
I can also use a crawl pattern if I want to limit crawling to a particular subdomain. For instance, on a crawl starting at https://docs.diffbot.com, I can enter a crawl pattern of docs.diffbot.com to keep Crawl from following links to http://www.diffbot.com and http://blog.diffbot.com.
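The two examples above can be sketched as a plain substring test. This is only an illustration, assuming a pattern matches whenever it appears anywhere in the URL; the function name `matches_pattern` and the second sample URL are hypothetical, and the code is not Diffbot's implementation.

```python
def matches_pattern(url: str, pattern: str) -> bool:
    """Return True when the pattern occurs anywhere in the URL (plain substring test)."""
    return pattern in url


urls = [
    "http://www.example.com/sports/heres-a-sports-article.html",
    "http://www.example.com/news/local-sports-roundup.html",  # hypothetical URL
    "https://docs.diffbot.com/",
    "http://blog.diffbot.com/",
]

# "/sports/" only matches the URL that has a "sports" path segment.
sports_only = [u for u in urls if matches_pattern(u, "/sports/")]

# The bare word "sports" also matches the news article, which is why the slashes help.
sports_loose = [u for u in urls if matches_pattern(u, "sports")]

# A host name used as a pattern keeps the crawl on that subdomain.
docs_only = [u for u in urls if matches_pattern(u, "docs.diffbot.com")]
```

Comparing `sports_only` with `sports_loose` shows the effect of including the slashes in the pattern.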
You can enter multiple patterns to match multiple strings. For instance, to crawl both https://docs.diffbot.com and http://blog.diffbot.com (but not http://www.diffbot.com), I would enter a crawl pattern of:
In the Crawl interface, place each individual pattern on a new line. Via the API, separate patterns with a
You can use the caret character (^) to limit pattern matches to the beginning of a URL. For instance, a processing pattern of:
...will limit processing only to pages whose URLs begin with https://docs.diffbot.com. This will prevent processing of URLs like http://www.twitter.com/share?tweet=https://docs.diffbot.com.
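A rough sketch of the caret behavior, assuming the pattern is written as a caret followed directly by the URL prefix. The `PATTERN` value and the function name `matches_pattern` are assumptions made for illustration, not values confirmed by Diffbot.

```python
def matches_pattern(url: str, pattern: str) -> bool:
    """Substring match, or prefix match when the pattern starts with a caret."""
    if pattern.startswith("^"):
        # Anchored: the rest of the pattern must appear at the start of the URL.
        return url.startswith(pattern[1:])
    return pattern in url


PATTERN = "^https://docs.diffbot.com"  # assumed spelling of an anchored pattern

docs_url = "https://docs.diffbot.com/"
share_url = "http://www.twitter.com/share?tweet=https://docs.diffbot.com"

anchored_docs = matches_pattern(docs_url, PATTERN)
anchored_share = matches_pattern(share_url, PATTERN)

# Without the caret, the share link would match because it contains the string.
unanchored_share = matches_pattern(share_url, "https://docs.diffbot.com")
```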
Use the exclamation point (!) to specify a "negative match" if you want to explicitly exclude pages from being crawled or processed. For instance, to process all pages except those containing "sports" in the URL, I would enter a crawl pattern of
When entering multiple patterns, negative matches will override other crawl patterns. That is, a URL with a negative match will be fully ignored, even if another (positive) crawl pattern is also a match.
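The override rule can be sketched as follows. This is a simplified model, assuming negative patterns are written with a leading "!" and that a list containing only negative patterns allows every other URL; the function name `is_allowed` and the sample URLs are hypothetical, and caret anchoring is left out for brevity.

```python
from typing import Iterable


def is_allowed(url: str, patterns: Iterable[str]) -> bool:
    """Apply positive and negative substring patterns to a URL."""
    patterns = list(patterns)
    negatives = [p[1:] for p in patterns if p.startswith("!")]
    positives = [p for p in patterns if not p.startswith("!")]

    # A negative match always wins, even if a positive pattern also matches.
    if any(n in url for n in negatives):
        return False

    # With no positive patterns, everything not excluded is allowed.
    return not positives or any(p in url for p in positives)


patterns = ["www.example.com", "!sports"]

news_ok = is_allowed("http://www.example.com/news/story.html", patterns)
sports_ok = is_allowed("http://www.example.com/sports/story.html", patterns)
other_ok = is_allowed("http://other.example.org/news/story.html", patterns)
only_negative_ok = is_allowed("http://other.example.org/news/story.html", ["!sports"])
```

Checking the negative patterns first is what makes them override any positive match.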
If you want complete control over your crawling or processing URL matches, you can write a regular expression to only crawl or process URLs that contain a match to your expression.
For example, to only process pages at https://docs.diffbot.com/ under the "/crawl" path and containing "regex", you could enter a processing regex of:
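As an illustration of "contains a match" semantics, here is a sketch using Python's `re` module as a stand-in. The expression `EXAMPLE_REGEX` and the sample URLs are hypothetical and are not taken from Diffbot's documentation; Crawlbot uses its own regex engine, so details may differ.

```python
import re

# Hypothetical expression: URL starts at docs.diffbot.com under /crawl
# and contains "regex" somewhere later in the URL.
EXAMPLE_REGEX = r"^https://docs\.diffbot\.com/crawl.*regex"


def url_matches(url: str, regex: str = EXAMPLE_REGEX) -> bool:
    """Return True when the expression matches anywhere in the URL."""
    return re.search(regex, url) is not None


# Sample URLs are made up for this sketch.
hit = url_matches("https://docs.diffbot.com/crawl/using-regex")
wrong_path = url_matches("https://docs.diffbot.com/extract/using-regex")
wrong_host = url_matches("http://blog.diffbot.com/crawl/using-regex")
```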
Crawlbot does not use an existing regular expression implementation. Instead, it uses a custom regular expression engine to ensure the best possible performance while evaluating pages.
In terms of character class syntax (the most common regex concept used in Crawlbot parsing), Crawlbot supports the ASCII character classes in the following table, and most Perl/Tcl shortcuts:
| Perl/Tcl shortcut | ASCII | Description |
|---|---|---|
| \w | [A-Za-z0-9_] | Alphanumeric characters plus "_" |
| | [ \t] | Space and tab |
| | [\x20-\x7E] | Visible characters and the space character |
| \S | [^ \t\r\n\v\f] | Non-whitespace characters |
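The equivalences in the table can be checked with a small sketch that uses Python's `re` module in ASCII mode as a stand-in. Crawlbot's custom engine may behave differently, and the helper name `matched_chars` is hypothetical.

```python
import re

ASCII_CHARS = [chr(code) for code in range(128)]


def matched_chars(char_class: str) -> set:
    """Return the set of single ASCII characters that the class matches."""
    compiled = re.compile(char_class, re.ASCII)
    return {ch for ch in ASCII_CHARS if compiled.fullmatch(ch)}


# Shortcut versus explicit ASCII class, as listed in the table.
word_shortcut = matched_chars(r"\w")
word_explicit = matched_chars(r"[A-Za-z0-9_]")

non_space_shortcut = matched_chars(r"\S")
non_space_explicit = matched_chars(r"[^ \t\r\n\v\f]")

# Classes listed without a shortcut.
blank = matched_chars(r"[ \t]")
visible_and_space = matched_chars(r"[\x20-\x7E]")
```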
Note that crawling and processing regular expressions cannot be used simultaneously with crawling/processing patterns. If both are provided, the crawling/processing patterns will be ignored (regex will take precedence).
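The precedence rule can be pictured with a short sketch. The function name `should_process` and its parameters are hypothetical, and Python's `re` module stands in for Crawlbot's own engine.

```python
import re
from typing import Optional, Sequence


def should_process(url: str,
                   patterns: Sequence[str] = (),
                   regex: Optional[str] = None) -> bool:
    """Decide whether a URL passes the filter; a regex overrides patterns."""
    if regex:
        # When a regex is supplied, the patterns are ignored entirely.
        return re.search(regex, url) is not None
    if not patterns:
        # No restrictions at all: every URL passes.
        return True
    return any(p in url for p in patterns)


url = "http://www.example.com/sports/story.html"

by_pattern = should_process(url, patterns=["/sports/"])
by_both = should_process(url, patterns=["/sports/"], regex=r"/news/")
```

Here `by_both` is false even though the pattern matches, because the regex is the only filter consulted.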
Crawl offers one more option for limiting pages processed. If you enter an HTML Processing Pattern, only pages whose HTML source contains the exact string will be processed.
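A minimal sketch of this check, assuming the comparison is a case-sensitive exact-substring test on the raw HTML. The function name `html_matches` and the sample markup are hypothetical.

```python
from typing import Optional


def html_matches(html: str, html_pattern: Optional[str] = None) -> bool:
    """Return True when no pattern is set or the exact string occurs in the HTML."""
    return html_pattern is None or html_pattern in html


article_html = '<html><body><div class="article-body">Story text</div></body></html>'
listing_html = '<html><body><ul class="index"><li>Link</li></ul></body></html>'

# Only pages whose source contains the exact string would be processed.
process_article = html_matches(article_html, 'class="article-body"')
process_listing = html_matches(listing_html, 'class="article-body"')
```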
While there are many factors that will influence how long it takes to crawl a site, one of the best ways to speed up your crawl is to use crawling and processing patterns or regular expressions to limit Crawl just to the pages you are interested in.