How to find and access JavaScript-generated links while crawling
Diffbot Extract and Custom APIs automatically execute JavaScript, but Crawl does not execute JavaScript by default while spidering a site for links. Doing so is slow and usually redundant, as most sites' links are available in the raw HTML source.
Some sites, however, deliver the majority of their content via JavaScript. If you need to access JavaScript-rendered pages to discover a site's links, you can do so as follows:
- Add `&links` to the Querystring parameter in your Crawl job, or to the `apiUrl` parameter if creating the job via the API (yes, we're adding a parameter to a parameter), when creating a Crawl. Adding the `&links` argument enables JavaScript to run while the crawler is looking for links on a page, so that JavaScript-generated links have time to load before the crawler tries to extract them. Note that any page Crawled (downloaded to search for links) with `&links` is also Processed (by a JavaScript-enabled browser), incurring a credit. A Seed URL is always Processed when `&links` is applied (see the API sketch after this list).
- Make sure that any URLs requiring JavaScript are not excluded by your Processing Pattern(s) or Processing Regular Expression. To find Ajax-generated links, your Seed URL(s) (and, commonly, other "listing" pages) need to be Processed. Seed URLs are Processed automatically when `&links` is used, but if you are using Processing Patterns, make sure they also allow any other pages you wish to extract links from (assuming those pages also require JavaScript to generate their links) to be Processed; the sketch below includes an example pattern.
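For illustration only, here is a minimal sketch of creating such a crawl via the Crawl API. It assumes the v3 crawl endpoint and the `urlProcessPattern` parameter; the token, crawl name, seed URL, and pattern values are placeholders, so check the current Crawl API documentation before relying on the exact names.

```python
import requests

TOKEN = "YOUR_DIFFBOT_TOKEN"  # placeholder token

# Analyze API URL used to process matching pages; appending &links tells Crawl
# to execute JavaScript when it downloads a page to look for links.
api_url = "https://api.diffbot.com/v3/analyze?mode=auto&links"

response = requests.post(
    "https://api.diffbot.com/v3/crawl",       # Crawl API endpoint (assumed v3)
    data={
        "token": TOKEN,
        "name": "js-links-crawl",              # example crawl name
        "seeds": "https://www.example.com/",   # listing page whose links load via JS
        "apiUrl": api_url,
        # Hypothetical pattern so JS-dependent listing pages are Processed too,
        # not just the Seed URL. Adjust to match your site's structure.
        "urlProcessPattern": "/category/",
    },
    timeout=30,
)
print(response.json())
```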
A note on deduplication
When using `&links` in a crawl, Crawl's default duplicate page detection is disabled. This is because Ajax-powered sites can serve identical HTML source for multiple pages, even though the actual on-page content (once JavaScript is fully executed) is quite different.
Additional note for recurring crawls: do not enable "Only Process New Pages"
If "Only Process New Pages" is enabled, only brand-new URLs are Processed in subsequent crawl rounds, with the exception of Seed URLs when `&links` is present. If you expect to extract JavaScript/Ajax-generated links each round from pages other than the Seed URL, disable this setting so those pages can be re-Processed in each round.
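As a hedged sketch, a recurring crawl configured to re-Process JS-dependent pages each round might look like the following; `repeat` and `onlyProcessIfNew` are the parameter names I would expect for round scheduling and "Only Process New Pages", but verify them against the current Crawl API documentation.

```python
import requests

TOKEN = "YOUR_DIFFBOT_TOKEN"  # placeholder token

response = requests.post(
    "https://api.diffbot.com/v3/crawl",  # Crawl API endpoint (assumed v3)
    data={
        "token": TOKEN,
        "name": "js-links-recurring-crawl",
        "seeds": "https://www.example.com/",
        "apiUrl": "https://api.diffbot.com/v3/analyze?mode=auto&links",
        "repeat": 7.0,          # assumed: days between crawl rounds
        "onlyProcessIfNew": 0,  # assumed name for "Only Process New Pages";
                                # 0 = off, so non-seed JS pages are re-Processed
    },
    timeout=30,
)
print(response.json())
```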