How can I crawl (news) sites and monitor/extract only recent content?

There are a few ways to use Crawlbot to extract only the most recent content from a site, and/or to do so on a regular basis. Here is a description of some options:

Best Option: Crawl the Entire Site

Proper date identification is difficult — that’s why the normalized date field is one of the key components of our automatic Article API. Ultimately the most thorough approach to extracting the most recent content from a site is to completely crawl and process the entire site, so that you have a complete catalog of the content, including date stamps.

This can be done using either the Analyze API (which will automatically identify and process any article pages), or the Article API directly along with processing patterns or regular expressions to limit the URLs processed. This will provide you with the entire site’s articles in an easily searchable or parseable form: you can then either filter by date while searching, or simply restrict your data processing to those articles with a date in the range you’re seeking.
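
For example, a whole-site job can be created with a single call to the Crawl API. The sketch below (Python with the requests library) assumes the v3 endpoint and parameter names (token, name, seeds, apiUrl); the token and job name are placeholders, and the parameter names are worth verifying against the current Crawlbot documentation.

    import requests

    TOKEN = "YOUR_DIFFBOT_TOKEN"  # placeholder

    # Create a crawl job that crawls the whole site and runs the Analyze API
    # against every page, extracting any pages it identifies as articles.
    # (Swap apiUrl for https://api.diffbot.com/v3/article to use the Article
    # API directly, typically together with a processing pattern.)
    resp = requests.post(
        "https://api.diffbot.com/v3/crawl",
        data={
            "token": TOKEN,
            "name": "newssite-full",               # job name of your choice
            "seeds": "http://news.diffbot.com/",   # seed URL(s)
            "apiUrl": "https://api.diffbot.com/v3/analyze",
        },
    )
    print(resp.json())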

For a recurring crawl, you’ll also want to make sure that “Only Process New Pages” is set to “on.” This ensures that on repeated crawl rounds only newly discovered URLs are processed, so each subsequent round contains only the new articles.
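
In API terms (again assuming the v3 parameter names), this corresponds to setting onlyProcessIfNew along with a repeat interval on the job:

    import requests

    # Make the job above recur: repeat roughly daily and only process URLs
    # that have not been seen in a previous round.
    requests.post(
        "https://api.diffbot.com/v3/crawl",
        data={
            "token": "YOUR_DIFFBOT_TOKEN",
            "name": "newssite-full",    # posting to the same name updates the job
            "repeat": 1.0,              # repeat every 1 day
            "onlyProcessIfNew": 1,      # skip already-processed URLs
        },
    )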

Narrow the Crawl Job Using Crawl and Processing Patterns

Depending on the site you are crawling, you may be able to use information in the URL to your advantage. For instance, if a site’s article URLs look something like:

http://news.diffbot.com/2014/10/01/diffbot-releases-discussion-api

…you can enter a processing regular expression or processing pattern (e.g., news.diffbot.com/2014/) to limit the pages processed to 2014 articles, i.e., those whose URL contains that fragment.
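
If you create the job through the API rather than the UI, the equivalent parameters are (assuming the v3 names) urlProcessPattern for a substring match and urlProcessRegEx for a full regular expression; a sketch:

    import requests

    # Restrict processing to URLs containing "news.diffbot.com/2014/".
    # urlProcessPattern is a plain substring match; for a regular expression,
    # use urlProcessRegEx instead.
    requests.post(
        "https://api.diffbot.com/v3/crawl",
        data={
            "token": "YOUR_DIFFBOT_TOKEN",
            "name": "newssite-2014",
            "seeds": "http://news.diffbot.com/",
            "apiUrl": "https://api.diffbot.com/v3/article",
            "urlProcessPattern": "news.diffbot.com/2014/",
        },
    )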

You can also restrict processing based on page markup using an HTML Processing Pattern. This limits processing to pages whose markup contains that exact string. For instance, if articles on a site contain certain metadata or other date-specific markup, you can limit processing to pages containing a string like:

itemprop="datePublished" content="2014

or even:

2014</div>

(presuming the year, 2014, is always at the end of a specific element)

Note: if you do this, you will need to update your pattern(s) or regular expressions each year to ensure that content from subsequent years (2015, 2016, etc.) is included.
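
Via the API, the HTML processing pattern is passed (assuming the v3 parameter name) as pageProcessPattern, matched against the page’s source; for example:

    import requests

    # Only process pages whose HTML source contains the 2014 datePublished markup.
    requests.post(
        "https://api.diffbot.com/v3/crawl",
        data={
            "token": "YOUR_DIFFBOT_TOKEN",
            "name": "newssite-2014-markup",
            "seeds": "http://news.diffbot.com/",
            "apiUrl": "https://api.diffbot.com/v3/article",
            "pageProcessPattern": 'itemprop="datePublished" content="2014',
        },
    )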

Do Shallow Crawl Rounds (from a Few Seeds)

To minimize pages processed and/or time spent crawling, you can limit the number of pages processed in a crawl using the “Max Pages to Process” setting (and the number of pages crawled using “Max Pages to Crawl”). Because a site’s newest content is typically linked directly from its homepage and section pages, a shallow crawl will usually capture most of it. If you include multiple seed URLs (for example, the main pages of a news site’s subsections), the crawl will be even more focused on new content from the various areas of a larger site.
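
A sketch of such a shallow, multi-seed job (v3 parameter names assumed; the subsection URLs are hypothetical):

    import requests

    # Shallow crawl seeded from several section pages, capped per round
    # by maxToCrawl / maxToProcess.
    requests.post(
        "https://api.diffbot.com/v3/crawl",
        data={
            "token": "YOUR_DIFFBOT_TOKEN",
            "name": "newssite-shallow",
            # multiple seeds are whitespace-separated
            "seeds": "http://news.diffbot.com/world/ http://news.diffbot.com/tech/",
            "apiUrl": "https://api.diffbot.com/v3/analyze",
            "maxToCrawl": 2000,     # "Max Pages to Crawl"
            "maxToProcess": 500,    # "Max Pages to Process"
        },
    )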

If you are only interested in getting new content on a regular basis, one Crawl trick is to do the following:

  1. Initially, set your crawl up to process a medium number of pages using the “Max Pages to Process” value, e.g., 1000.
  2. Set your crawl to repeat daily, and to “only process new pages.”
  3. After the initial round completes, change “Max Pages to Process” to a smaller number, based on the amount of content you expect from the site each day. For a site that publishes roughly 20 articles a day, set it to 25; for a site that publishes 200, set it to 250.
  4. From then on, Crawl will look only for new pages (up to your smaller limit) each day, and you can use the DQL Search API or date filters to ensure that the content you use is the most recent (a sketch of these calls follows below).
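
A sketch of those four steps as API calls (v3 endpoint and parameter names assumed; the job name, limits, and output handling are illustrative):

    import requests

    API = "https://api.diffbot.com/v3/crawl"
    TOKEN = "YOUR_DIFFBOT_TOKEN"
    NAME = "newssite-daily"

    # Steps 1-2: the first round processes up to 1000 pages, repeats daily,
    # and only processes URLs it has not seen before.
    requests.post(API, data={
        "token": TOKEN,
        "name": NAME,
        "seeds": "http://news.diffbot.com/",
        "apiUrl": "https://api.diffbot.com/v3/article",
        "maxToProcess": 1000,
        "repeat": 1.0,
        "onlyProcessIfNew": 1,
    })

    # Step 3: once the first round has completed, lower the per-round limit
    # (posting to the same job name updates the existing job).
    requests.post(API, data={
        "token": TOKEN,
        "name": NAME,
        "maxToProcess": 25,    # for a site publishing ~20 articles/day
    })

    # Step 4: download the extracted results and keep only recent items.
    # The /data endpoint and the per-item "date" field are assumptions about
    # the output format; verify against the current documentation.
    data = requests.get(API + "/data", params={"token": TOKEN, "name": NAME}).json()
    recent = [item for item in data if "2014" in str(item.get("date", ""))]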

Use the maxHops Parameter to Limit the Depth of Your Crawl

The maxHops parameter controls the depth of your crawl. For instance, a value of 1 will extract all outlinks from your seed URL(s), but will prevent Crawl from following any links beyond those original outlinks. A value of 2 will extract the outlinks from the seed URL(s), spider the pages they point to, and extract their links as well. A value of 0 will process only the seed URL(s).

On a news site with numerous category or section pages, you can make those category/section pages your seed URLs and set maxHops to 1 to ensure that only pages linked directly from these seeds will be analyzed/processed. By repeating your crawl regularly (and setting it to “Only Process New Pages”), you can ensure that only the latest content linked from these section pages is extracted.
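
Putting this together (v3 parameter names assumed; the section URLs are hypothetical):

    import requests

    # Seed from section pages; maxHops=1 restricts the crawl to pages linked
    # directly from those seeds. Repeating daily with onlyProcessIfNew means
    # each round yields just the newly linked articles.
    requests.post(
        "https://api.diffbot.com/v3/crawl",
        data={
            "token": "YOUR_DIFFBOT_TOKEN",
            "name": "newssite-sections",
            "seeds": "http://news.diffbot.com/politics/ http://news.diffbot.com/business/",
            "apiUrl": "https://api.diffbot.com/v3/analyze",
            "maxHops": 1,
            "repeat": 1.0,
            "onlyProcessIfNew": 1,
        },
    )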