How does Diffbot handle duplicate pages/content while crawling?
Crawl will often encounter duplicate pages (with different URLs) while canvassing a site. There are a handful of ways Diffbot helps you handle these duplicates:
Pages with duplicate HTML sources will be ignored while crawling
While crawling (spidering for links), and before sending a URL to be processed, Crawl examines the raw HTML source of each page and compares it to the source HTML of all previously-spidered pages. Any exact matches to previously-seen pages will be flagged as duplicates and ignored.
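This exact-match comparison can be illustrated with a simple fingerprinting sketch. Note this is an assumption-laden illustration of the general technique, not Diffbot's actual implementation: it hashes each page's raw HTML and treats any repeated hash as a duplicate.

```python
import hashlib

# fingerprint -> first URL seen with that exact HTML source
seen: dict[str, str] = {}

def html_fingerprint(html: str) -> str:
    # Hash the raw HTML source so byte-for-byte duplicates collide.
    return hashlib.sha256(html.encode("utf-8")).hexdigest()

def is_duplicate(url: str, html: str) -> bool:
    """Return True if this page's HTML exactly matches a previously spidered page."""
    fp = html_fingerprint(html)
    if fp in seen:
        return True  # flagged as a duplicate; would be ignored
    seen[fp] = url
    return False
```

Because the comparison is against the raw source, two pages that differ by even one byte (a timestamp, a session token) would not match under this scheme.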
Duplicate URLs are noted in the URL Report
The URL Report — available from each crawl’s status page, or via the Crawl API — will note each duplicate URL, and the document ID (docId) of the page it duplicates.
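As a rough sketch of how you might consume such a report, the snippet below filters a CSV for rows that reference the docId of the page they duplicate. The column names (`url`, `docId`, `duplicateOf`) and the inline sample data are hypothetical; check your actual report's header row for the real field names.

```python
import csv
import io

# Hypothetical report contents; a real URL Report is downloaded from the
# crawl's status page or via the Crawl API, and its columns may differ.
report_csv = """url,docId,duplicateOf
https://example.com/a,111,
https://example.com/a?ref=nav,222,111
"""

# Keep only rows flagged as duplicating another page's docId.
duplicates = [
    row for row in csv.DictReader(io.StringIO(report_csv))
    if row["duplicateOf"]
]

for row in duplicates:
    print(f'{row["url"]} duplicates docId {row["duplicateOf"]}')
```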
Pages with a different canonical link definition will be ignored
Two things will happen when a page contains a canonical link element that points to a URL other than its own:
- The current page will be skipped/ignored as a duplicate.
- The canonical URL will be automatically added to the Crawl queue (if not already in the queue).
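The two steps above can be sketched as follows. This is an illustrative stand-in using Python's standard-library `HTMLParser`, not Diffbot's code; the function names and the returned action labels are made up for the example.

```python
from html.parser import HTMLParser

class CanonicalFinder(HTMLParser):
    """Extract the href of <link rel="canonical" href="..."> if present."""

    def __init__(self) -> None:
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "link" and a.get("rel") == "canonical":
            self.canonical = a.get("href")

def canonical_action(page_url: str, html: str):
    """Return ('process', url) to process the page as-is, or
    ('skip_and_queue', canonical_url) to skip it as a duplicate
    and enqueue its canonical URL instead."""
    parser = CanonicalFinder()
    parser.feed(html)
    if parser.canonical and parser.canonical != page_url:
        return ("skip_and_queue", parser.canonical)
    return ("process", page_url)
```

For example, a page fetched at `https://example.com/a?utm=1` whose head declares `<link rel="canonical" href="https://example.com/a">` would be skipped, and `https://example.com/a` would be queued instead.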