Crawlbot will often encounter duplicate pages (with different URLs) while canvassing a site. There are a handful of ways Diffbot helps you handle these duplicates:
Pages with duplicate HTML sources will be ignored while crawling.
While crawling (spidering for links), and before sending a URL to be processed, Crawlbot examines the raw HTML source of each page and compares it to the source HTML of all previously-spidered pages. Any exact matches to previously-seen pages will be flagged as duplicates and ignored.
The Crawlbot URL Report — available from each crawl’s status page, or via the Crawlbot API — will note each duplicate URL, and the document ID (docId) of the page it duplicates.
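The exact-source comparison described above can be sketched as hashing each page's raw HTML and checking the hash against all previously-spidered pages. This is only an illustration of the idea; Diffbot's internal comparison method is not documented, and the hashing approach here is an assumption.

```python
import hashlib

def is_duplicate(html: str, seen_hashes: set) -> bool:
    """Return True if this exact HTML source was seen on a previous page."""
    digest = hashlib.sha256(html.encode("utf-8")).hexdigest()
    if digest in seen_hashes:
        return True
    seen_hashes.add(digest)
    return False

seen = set()
is_duplicate("<html><body>Page A</body></html>", seen)  # first sighting: not a duplicate
is_duplicate("<html><body>Page A</body></html>", seen)  # exact match: flagged as duplicate
is_duplicate("<html><body>Page B</body></html>", seen)  # different source: not a duplicate
```

Note that only byte-for-byte identical sources match; pages that differ by even one character (a timestamp, a session token) would not be caught by this check.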
Pages with a different canonical link definition will be ignored
Note: This behavior can be disabled on a per-crawl basis via the `useCanonical` argument in the Crawlbot API.
Two things will happen when a page contains a canonical link element different from its own URL:
- The current page will be skipped/ignored as a duplicate.
- The canonical URL will be automatically added to the Crawlbot queue (if it is not already in the queue).
As above, these duplicate pages will be identified in the Crawlbot URL Report.
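The canonical-link handling above can be sketched as follows. This is a simplified illustration, not Diffbot's actual implementation; the parser class and queue here are hypothetical.

```python
from html.parser import HTMLParser

class CanonicalFinder(HTMLParser):
    """Collect the href of a <link rel="canonical"> element, if present."""
    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "link" and a.get("rel") == "canonical":
            self.canonical = a.get("href")

def canonical_url(html: str):
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical

page_url = "https://example.com/article?utm_source=feed"
html = '<html><head><link rel="canonical" href="https://example.com/article"></head></html>'

queue = []
canon = canonical_url(html)
if canon and canon != page_url:
    # The page is skipped as a duplicate, and its canonical URL
    # is queued for crawling (if not already queued).
    if canon not in queue:
        queue.append(canon)
```

Tracking-parameter URLs like the one above are a common case: many distinct URLs declare the same canonical URL, so only the canonical version is processed.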
Duplicated extractions will have the same `diffbotUri`
Each Diffbot JSON object contains a `diffbotUri` field. Its value is calculated from a subset of the extracted fields and uniquely identifies the extracted content, so duplicate extractions will share the same `diffbotUri`.
For URLs that are not exact-source duplicates (and are therefore not ignored while crawling) but that produce identical extracted output, the `diffbotUri` values will be the same. When processing your crawl data, filter out objects with a duplicate `diffbotUri` to retain only one example of each entity.
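In practice, this post-processing step can be as simple as keeping the first object seen for each `diffbotUri`. The sketch below assumes crawl output loaded as a list of JSON objects; the `diffbotUri` values shown are illustrative, not real identifiers.

```python
def dedupe_by_diffbot_uri(objects):
    """Keep only the first object encountered for each distinct diffbotUri."""
    seen = set()
    unique = []
    for obj in objects:
        uri = obj.get("diffbotUri")
        if uri in seen:
            continue
        seen.add(uri)
        unique.append(obj)
    return unique

# Illustrative crawl output: the first two objects extract to the same entity
crawl_output = [
    {"pageUrl": "https://example.com/a", "diffbotUri": "article|3|123"},
    {"pageUrl": "https://example.com/a?ref=home", "diffbotUri": "article|3|123"},
    {"pageUrl": "https://example.com/b", "diffbotUri": "article|3|456"},
]
unique = dedupe_by_diffbot_uri(crawl_output)  # two objects remain
```

Keeping the first occurrence is an arbitrary but deterministic choice; any single object per `diffbotUri` preserves one example of each extracted entity.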