1. Use the Analyze API
Accessing a site’s Ajax-delivered links requires Diffbot’s Analyze API. The Analyze API automatically identifies a page’s type and processes those pages that are supported by Diffbot’s extraction APIs.
2. Add &links as a Diffbot Querystring Argument
Adding the argument &links uses Diffbot core API link-extracting functionality to return all links found on a page. Crawlbot will use these additional links, found within the rendered page, to augment those found in the raw source.
If you are using the Crawlbot API, simply append &links to your
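As a rough illustration of step 2, the sketch below builds an Analyze API URL with the links argument and places it in the parameters for a Crawlbot API request. The endpoint URLs, the parameter names (`token`, `name`, `seeds`, `apiUrl`), and the helper functions are assumptions made for this example, so check them against the current Crawlbot API documentation before relying on them. Nothing is sent over the network here.

```python
from urllib.parse import urlencode

# Assumed endpoints, used for illustration only.
ANALYZE_ENDPOINT = "https://api.diffbot.com/v3/analyze"
CRAWL_ENDPOINT = "https://api.diffbot.com/v3/crawl"


def build_api_url(extra_args=()):
    """Return an Analyze API URL with the links argument appended."""
    args = ["links", *extra_args]
    return ANALYZE_ENDPOINT + "?" + "&".join(args)


def build_crawl_params(token, name, seed):
    """Assemble request parameters for a crawl job (parameter names are assumptions)."""
    return {
        "token": token,
        "name": name,
        "seeds": seed,
        "apiUrl": build_api_url(),
    }


# Hypothetical token, job name, and seed URL.
params = build_crawl_params("YOUR_TOKEN", "ajax-site", "https://example.com/products")
body = urlencode(params)  # form-encoded body that could be POSTed to CRAWL_ENDPOINT
print(params["apiUrl"])
print(body)
```

Because the Analyze API URL is itself passed as a parameter value, it is URL-encoded along with the other fields, which keeps its own query string from being confused with the outer request's.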
3. Include your seed page (and any other JS-requiring pages) in your processing pattern(s) or regular expression.
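Make sure your processing patterns or processing regular expression are broad enough to match your seed page and any other pages that require JavaScript, or remove them entirely. Only pages that are processed pass through the Analyze API, so only those pages contribute links found within the rendered page.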
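Before starting a crawl, it can help to check locally whether your seed page would pass your processing filters. The sketch below is a simplified stand-in, not Crawlbot's actual matching logic: it assumes that a processing pattern matches when it appears anywhere in the URL, that a regular expression is applied with a search rather than a full match, and that having no filters means every page is processed. The function name `would_process` and the example URLs are hypothetical.

```python
import re


def would_process(url, patterns=None, regex=None):
    """Return True if url passes the given processing filters.

    Assumed semantics: no filters means every page is processed; a pattern
    matches as a plain substring of the URL; a regex is applied with re.search.
    """
    if not patterns and not regex:
        return True
    if patterns and any(p in url for p in patterns):
        return True
    if regex and re.search(regex, url):
        return True
    return False


seed = "https://example.com/category/shoes"

# A pattern aimed only at product pages leaves the seed page unprocessed.
print(would_process(seed, patterns=["/product/"]))

# Broadening the patterns, or removing them, brings the seed page back in.
print(would_process(seed, patterns=["/product/", "/category/"]))
print(would_process(seed))
```

If the first check fails for your seed URL, the Ajax-delivered links on that page would not be discovered under these assumptions, which is the situation step 3 is meant to avoid.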
A note on “deduplication”
Additional note for recurring crawls: Do not “Only Process New Pages”
If “Only Process New Pages” is set to “on,” only brand-new URLs will be processed in subsequent crawl rounds. To find Ajax-generated links with the above solution, however, pages must be re-processed each crawl round so that newly added links are discovered.
If you are crawling an Ajax-heavy site regularly using the above method (e.g., for new products or new articles), please make sure you process all pages each round in order to find new URLs.
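If you manage crawl settings in code, a small sanity check can catch this combination before a recurring crawl is scheduled. The sketch below assumes Crawlbot API parameters named `repeat` and `onlyProcessIfNew`, and assumes that “Only Process New Pages” is on when `onlyProcessIfNew` is left unset; both the parameter names and that default are assumptions for this example. The helper `ajax_crawl_warnings` is hypothetical, and its test for the links argument is a crude substring check.

```python
def ajax_crawl_warnings(params):
    """Return warnings for recurring crawls that rely on rendered-page links."""
    warnings = []
    uses_links = "links" in params.get("apiUrl", "")  # crude substring check
    is_recurring = bool(params.get("repeat"))
    only_new = params.get("onlyProcessIfNew", 1) != 0  # assumed default: on
    if uses_links and is_recurring and only_new:
        warnings.append(
            "Only Process New Pages is on: pages will not be re-processed, "
            "so new Ajax-generated links will be missed."
        )
    return warnings


# Hypothetical settings for a crawl that repeats every 7 days.
risky = {"apiUrl": "https://api.diffbot.com/v3/analyze?links", "repeat": 7}
safe = dict(risky, onlyProcessIfNew=0)

print(ajax_crawl_warnings(risky))
print(ajax_crawl_warnings(safe))
```

A one-off crawl, or a recurring crawl that does not use the links argument, produces no warning in this sketch.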