How does Diffbot handle duplicate pages/content while crawling?
Crawl will often encounter duplicate pages (with different URLs) while canvassing a site. There are a handful of ways Diffbot helps you handle these duplicates:
Pages with duplicate HTML sources will be ignored while crawling
While crawling (spidering for links), and before sending a URL to be processed, Crawl examines the raw HTML source of each page and compares it to the source HTML of all previously-spidered pages. Any exact matches to previously-seen pages will be flagged as duplicates and ignored.
The duplicate comparison is made on the raw HTML source only; JavaScript is executed only when a page is processed.
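To make the exact-match behavior concrete, here is a minimal Python sketch of duplicate detection keyed on the raw HTML source. The hashing approach and the `DuplicateDetector` name are illustrative assumptions, not Diffbot's actual implementation; the point is only that the comparison happens on the unrendered source, before any JavaScript runs.

```python
import hashlib

class DuplicateDetector:
    """Flags pages whose raw HTML source exactly matches a page already seen."""

    def __init__(self):
        self._seen = {}  # SHA-256 digest of raw source -> docId of first page seen

    def check(self, doc_id: str, raw_html: bytes):
        """Return the docId this page duplicates, or None if the source is new."""
        digest = hashlib.sha256(raw_html).hexdigest()
        if digest in self._seen:
            return self._seen[digest]  # exact source match: flag as duplicate
        self._seen[digest] = doc_id
        return None
```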
Duplicate URLs are noted in the URL Report
The URL Report, available from each crawl's status page or via the Crawl API, will note each duplicate URL and the document ID (docId) of the page it duplicates.
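As a sketch, the report can also be pulled and filtered programmatically. The endpoint path, the `type=urls` parameter, and the `docIdDuplicateOf` column name below are assumptions modeled on the v3 Crawl API; check the current API reference for the exact names.

```python
import csv
import io
import requests

TOKEN = "YOUR_DIFFBOT_TOKEN"  # replace with your token
CRAWL_NAME = "mycrawl"        # replace with your crawl's name

# Download the URL Report as CSV via the Crawl API's data endpoint.
resp = requests.get(
    "https://api.diffbot.com/v3/crawl/data",
    params={"token": TOKEN, "name": CRAWL_NAME, "type": "urls", "format": "csv"},
    timeout=30,
)
resp.raise_for_status()

# Print each URL flagged as a duplicate and the docId it duplicates.
for row in csv.DictReader(io.StringIO(resp.text)):
    duplicate_of = row.get("docIdDuplicateOf")  # column name is an assumption
    if duplicate_of:
        print(f"{row['url']} duplicates docId {duplicate_of}")
```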
Note: If your crawl takes advantage of the Analyze API's ability to execute JavaScript to find Ajax-delivered links, Crawl's duplicate detection will be disabled. This is because Ajax-powered sites can serve identical HTML source for multiple pages even though the actual on-page content (once JavaScript is fully executed) is quite different.
Pages with a different canonical link definition will be ignored
Two things will happen when a page contains a canonical link element different from its own URL:
- The current page will be skipped/ignored as a duplicate.
- The canonical URL will be automatically added to the Crawl queue (if not already in the queue), as sketched below.
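Both steps can be sketched in a few lines; `CanonicalFinder`, `handle_page`, and the `queue` set are illustrative stand-ins for Crawl's internal logic, not its actual code.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class CanonicalFinder(HTMLParser):
    """Extracts the href of a <link rel="canonical"> element, if present."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "link" and (a.get("rel") or "").lower() == "canonical":
            self.canonical = a.get("href")

def handle_page(url: str, raw_html: str, queue: set) -> str:
    """Skip a page whose canonical URL differs from its own, queueing the canonical."""
    finder = CanonicalFinder()
    finder.feed(raw_html)
    if finder.canonical:
        canonical = urljoin(url, finder.canonical)  # resolve relative hrefs
        if canonical != url:
            queue.add(canonical)  # step 2: enqueue the canonical URL if new
            return "skipped"      # step 1: treat this page as a duplicate
    return "processed"
```

For example, given a page at https://example.com/article?ref=nav whose canonical element points to https://example.com/article, the sketch skips the former and queues the latter.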