Does Crawl follow “hashtag” links / internal links / fragment identifiers?
No. Diffbot's crawler, like most crawlers, does not follow internal (intra-page) links.
Internal links (also known as hashtag links, intra-page links, bookmark links, or, officially, links containing fragment identifiers) indicate a subordinate resource or section of a primary resource. In most cases, this means a specific location on a web page. Were crawlers to follow these links, they would visit individual pages many more times than is necessary (or, in the case of most Wikipedia pages, dozens or hundreds of times).
Increasingly, sites are using the # convention to load unique resources via JavaScript. While Diffbot Extract APIs do execute JavaScript, for the purposes of crawling these individual resources do not represent valid uses of the fragment identifier syntax. Thus, only the base/primary resource (the part of the URL preceding the # sign) will be spidered.
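To illustrate the effect, here is a minimal Python sketch of fragment stripping using the standard library's `urllib.parse.urldefrag`. It is not Diffbot's actual implementation; the `crawl_target` helper and the sample URLs are hypothetical and only show how several fragment links collapse to a single crawlable URL.

```python
from urllib.parse import urldefrag


def crawl_target(url: str) -> str:
    """Return the part of the URL preceding the '#' (the base resource)."""
    base, _fragment = urldefrag(url)
    return base


# Hypothetical links discovered on a page: three point into the same
# document, and one uses a JavaScript-style "#/route" convention.
links = [
    "https://en.wikipedia.org/wiki/Web_crawler",
    "https://en.wikipedia.org/wiki/Web_crawler#History",
    "https://en.wikipedia.org/wiki/Web_crawler#Crawling_policy",
    "https://example.com/app#/products/42",
]

# dict.fromkeys drops duplicates while keeping first-seen order.
unique_targets = list(dict.fromkeys(crawl_target(u) for u in links))
print(unique_targets)
```

Under these assumptions, the four links reduce to two URLs to visit: the Wikipedia article once, and `https://example.com/app` without its `#/products/42` route.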