apiUrl parameter (yes, we're adding a parameter to a parameter) when creating a Crawl.
Adding the argument
&links uses Diffbot core API link-extracting functionality to return all links found on a page. Crawl will use these additional links, found within the rendered page, to augment those found in the raw source.
Include your seed page (and any other JS-requiring pages) in your processing pattern(s) or regular expression.
Make sure you broaden your processing patterns or processing regular expression, or remove them entirely.
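The steps above can be sketched as a crawl-creation request. This is a minimal illustration, not official client code: the token, crawl name, and seed URL are placeholders, and the exact set of parameters should be checked against the current Crawl API documentation. The key point is that &links is appended to the Diffbot API URL passed as the apiUrl parameter, and that the whole apiUrl value is percent-encoded so the embedded & survives inside the outer query string.

```python
from urllib.parse import urlencode

# Hypothetical placeholder values; substitute your own.
TOKEN = "YOUR_DIFFBOT_TOKEN"
SEED = "https://example.com/"

# The Diffbot API that Crawl will call for each page, with &links
# appended so link extraction runs on the rendered page.
api_url = "https://api.diffbot.com/v3/analyze?mode=auto&links"

params = {
    "token": TOKEN,
    "name": "ajax-links-demo",  # hypothetical crawl name
    "seeds": SEED,
    "apiUrl": api_url,  # urlencode percent-encodes the embedded &links
}

# Build the crawl-creation URL; issue an HTTP request to it to start the crawl.
crawl_request = "https://api.diffbot.com/v3/crawl?" + urlencode(params)
print(crawl_request)
```

Note that in the printed URL the embedded &links appears as %26links: it belongs to the inner apiUrl value, not to the outer crawl request.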
A note on deduplication
Additional note for recurring crawls: Do not “Only Process New Pages”
If “Only Process New Pages” is set to “on,” only brand-new URLs will be processed in subsequent crawl rounds (except for seed URLs, which are always processed when &links is present). But to find Ajax-generated links per the above solution, pages must be re-processed each crawl round so that newly appearing links can be discovered.
Therefore, if you are regularly crawling an Ajax-heavy site using the above method (e.g., for new products or new articles) and your “Max Hops” setting is greater than 1, make sure you process all pages each round in order to find new URLs.
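For a recurring crawl, the same request-building approach can be used to keep re-processing enabled. This sketch assumes the setting is exposed as an onlyProcessIfNew parameter (with 0 meaning "process all pages each round") and that repeat controls the interval between rounds in days; verify both names against the current Crawl API documentation before relying on them.

```python
from urllib.parse import urlencode

# Hypothetical placeholder token; substitute your own.
TOKEN = "YOUR_DIFFBOT_TOKEN"

# onlyProcessIfNew=0 asks Crawl to re-process previously seen pages each
# round, so newly appearing Ajax-generated links are discovered.
params = {
    "token": TOKEN,
    "name": "ajax-links-demo",  # name of the recurring crawl
    "onlyProcessIfNew": 0,  # assumed parameter name; check current docs
    "repeat": 1.0,  # assumed: repeat the crawl roughly daily
}

update_request = "https://api.diffbot.com/v3/crawl?" + urlencode(params)
print(update_request)
```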