How to find and access JavaScript-generated links while crawling

Diffbot Extract and Custom APIs automatically execute JavaScript, but Crawl does not execute JavaScript by default while spidering a site for links. Doing so is slow and usually redundant, as most sites’ links are available in the raw HTML source.

Some sites, however, deliver the majority of their content via JavaScript. If you need to render JavaScript in order to discover a site’s links, you can do so as follows:

  1. Add &links to the Querystring parameter when creating your Crawl job, or to your apiUrl parameter if creating the job via the API (yes, we're adding a parameter to a parameter). See the sketch after this list.

    The &links argument enables JavaScript to run while the crawler is looking for links on a page, so that JavaScript-generated links have time to load before the crawler tries to extract them. Note that any page Crawled (downloaded to search for links) with &links is also Processed (by a JavaScript-enabled browser), incurring a credit. A Seed URL is always Processed when &links is applied.

  2. Make sure that any URLs requiring JavaScript are not excluded by your Processing Pattern(s) or Processing Regular Expression.

    To find Ajax-generated links, your Seed URL(s) (and, commonly, other “listing” pages) need to be Processed. Seed URLs are Processed automatically when &links is used, but if you are using Processing Patterns, make sure they also allow any other pages you want to extract links from (assuming those pages likewise require JavaScript to generate their links) to be Processed.
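
Below is a minimal sketch of creating such a Crawl job via the API in Python. The endpoint and parameter names (token, name, seeds, apiUrl, urlProcessPattern) are based on Diffbot's v3 Crawl API and should be verified against the Crawl API reference; the token, crawl name, seed URL, and pattern are placeholders.

```python
import requests

# Placeholder values -- substitute your own token, crawl name, seed, and pattern.
DIFFBOT_TOKEN = "YOUR_DIFFBOT_TOKEN"

# The Diffbot API that will Process each page. Appending &links here is the
# "parameter to a parameter": it rides along inside apiUrl and tells Crawl to
# render JavaScript while it looks for links on each Crawled page.
api_url = "https://api.diffbot.com/v3/analyze?mode=auto&links"

params = {
    "token": DIFFBOT_TOKEN,
    "name": "js-links-crawl",              # hypothetical crawl name
    "seeds": "https://www.example.com/",   # hypothetical seed URL
    # requests form-encodes this value, so the embedded &links stays part of
    # apiUrl rather than becoming a separate top-level parameter of the request.
    "apiUrl": api_url,
    # Assumed parameter name for a Processing Pattern; make sure it also matches
    # the JavaScript-powered "listing" pages you want links extracted from.
    "urlProcessPattern": "/products/",
}

resp = requests.post("https://api.diffbot.com/v3/crawl", data=params)
resp.raise_for_status()
print(resp.json())
```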

A note on deduplication

When using &links in a crawl, Crawl's default duplicate page detection will be disabled. This is because Ajax-powered sites can have identical HTML source code for multiple pages, even though the actual on-page content (when JavaScript is fully executed) is quite different.

Additional note for recurring crawls: Do not “Only Process New Pages”

If “Only Process New Pages” is set to “on,” only brand-new URLs will be Processed in subsequent crawl rounds, with the exception of Seed URLs when &links is present. If you expect to extract JavaScript/Ajax-generated links each round from pages other than the Seed URLs, this setting must be disabled so that those pages can be re-Processed in every round.
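
If you manage recurring crawls through the API, the sketch below shows where that setting would be turned off at creation time. The onlyProcessIfNew and repeat parameter names are assumptions for the “Only Process New Pages” toggle and the recurrence interval; confirm them against the Crawl API reference.

```python
import requests

DIFFBOT_TOKEN = "YOUR_DIFFBOT_TOKEN"  # placeholder

params = {
    "token": DIFFBOT_TOKEN,
    "name": "js-links-recurring-crawl",    # hypothetical crawl name
    "seeds": "https://www.example.com/",   # hypothetical seed URL
    "apiUrl": "https://api.diffbot.com/v3/analyze?mode=auto&links",
    "repeat": 7.0,          # assumed: re-run the crawl every 7 days
    "onlyProcessIfNew": 0,  # assumed API-side name for "Only Process New Pages";
                            # 0 keeps it off so JavaScript "listing" pages are
                            # re-Processed (and their links re-discovered) each round
}

resp = requests.post("https://api.diffbot.com/v3/crawl", data=params)
resp.raise_for_status()
print(resp.json())
```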