Normalized HTML Fields for Article API

The html field of Diffbot's Article API returns normalized HTML that maintains the structure and layout of the source article while standardizing its elements and attributes for reliable parsing and processing.

Content will be normalized into the following elements and attributes:

| Element | Attributes | Description |
|---|---|---|
| * | data-* | As of January 2017, normalized HTML will retain and return data-* attributes. |
| Block elements | | |
| p | -- | Unless returned within a more specific element below, all text will be returned within p elements at the top level of the HTML response. |
| h1 - h5 | -- | Headers will be maintained if originally provided. |
| aside | -- | Returned at the top level of the HTML response. |
| blockquote | -- | Returned at the top level of the HTML response. |
| code, pre | -- | Returned at the top level of the HTML response. |
| ul, ol | start | Returned at the top level of the HTML response. |
| li | -- | |
| table | -- | Original content within table elements will be largely retained, including images and other media items. |
| tbody | -- | |
| th | valign, colspan, rowspan | |
| tr | -- | |
| td | valign, colspan, rowspan | |
| dl | -- | Returned at the top level of the HTML response. |
| dt | -- | |
| dd | -- | |
| Inline elements | | |
| br | -- | Single line breaks will be maintained in markup and returned as <br>. Double line breaks will be removed, and the surrounding content will be returned within p block elements. |
| b, strong | -- | Inline emphasis tags will be retained inside of other elements. |
| i, em | -- | |
| u | -- | |
| sup | -- | |
| sub | -- | |
| a | href | Anchor tags and their href values will be retained. |
| Media | | |
| figure | -- | Media elements will be returned at the top level of the HTML content and contained within figure tags. |
| img | src, alt, srcset, sizes | Image layout specifics (floats, etc.) and CSS-specified widths/heights will be discarded. |
| picture | -- | The img elements underneath this will be returned and normalized as described above. |
| video/audio | src | The child source elements within video and audio elements will be retained along with the type attribute, if provided. |
| source | src, type, srcset, sizes | |
| figcaption | -- | If present, media captions will be returned as figcaption elements within the figure container. |
| iframe | src, frameborder, width, height | |
| embed, object | src, type | |
| svg | all (returned as is) | If present, svg elements will be returned as is. That is, every element under an svg will be returned with all of its original attributes, because the rendering of charts is very sensitive to those attributes. |
| tweets | all (returned as is) | If a div or similar element is identified as a tweet, the content of that node and all elements in its subtree will be returned as is, with all attributes preserved. |
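
One way to put the table above to work is to encode it as an allow-list and flag anything in a response that falls outside it. The following is a minimal sketch using only Python's standard html.parser module; the names ALLOWED and find_unexpected are hypothetical and not part of any Diffbot client library. It assumes that data-* attributes are acceptable on every element and that svg subtrees should be skipped because they are returned as is. It has no way to recognize tweets, so a tweet container such as a div would be reported as unexpected.

```python
from html.parser import HTMLParser

NO_ATTRS = frozenset()
CELL_ATTRS = frozenset({"valign", "colspan", "rowspan"})
SOURCE_ATTRS = frozenset({"src", "type", "srcset", "sizes"})

# Allow-list transcribed by hand from the table: element -> permitted attributes.
ALLOWED = {
    **{tag: NO_ATTRS for tag in (
        "p", "h1", "h2", "h3", "h4", "h5", "aside", "blockquote", "code", "pre",
        "li", "table", "tbody", "tr", "dl", "dt", "dd",
        "br", "b", "strong", "i", "em", "u", "sup", "sub",
        "figure", "figcaption", "picture",
    )},
    "ul": frozenset({"start"}),
    "ol": frozenset({"start"}),
    "th": CELL_ATTRS,
    "td": CELL_ATTRS,
    "a": frozenset({"href"}),
    "img": frozenset({"src", "alt", "srcset", "sizes"}),
    "video": frozenset({"src"}),
    "audio": frozenset({"src"}),
    "source": SOURCE_ATTRS,
    "iframe": frozenset({"src", "frameborder", "width", "height"}),
    "embed": frozenset({"src", "type"}),
    "object": frozenset({"src", "type"}),
}


class _Checker(HTMLParser):
    def __init__(self):
        super().__init__()
        self.problems = []   # (tag, attribute) pairs; attribute is None for an unlisted tag
        self._svg_depth = 0  # > 0 while inside an svg subtree, which is not checked

    def handle_starttag(self, tag, attrs):
        if tag == "svg":
            self._svg_depth += 1
            return
        if self._svg_depth:
            return
        if tag not in ALLOWED:
            self.problems.append((tag, None))
            return
        for name, _value in attrs:
            if name.startswith("data-"):
                continue  # assumed to be allowed on any element
            if name not in ALLOWED[tag]:
                self.problems.append((tag, name))

    def handle_endtag(self, tag):
        if tag == "svg" and self._svg_depth:
            self._svg_depth -= 1


def find_unexpected(html):
    """Return (tag, attribute) pairs that are not covered by the allow-list."""
    checker = _Checker()
    checker.feed(html)
    checker.close()
    return checker.problems


print(find_unexpected('<p>Hi <a href="x" class="y">link</a></p>'))  # [('a', 'class')]
```

An empty result only means the markup matches this hand-transcribed list; it is a sanity check for downstream parsers, not a statement about how Diffbot performs normalization.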

Example HTML Response

<p>Diffbot's human wranglers are proud today to announce the release of our newest product: an API for... products!</p>

<p>The <a href="https://www.diffbot.com/data/product/">Product API</a> can be used for extracting clean, structured data from any e-commerce product page. It automatically makes available all the product data you'd expect: price, discount/savings amount, shipping cost, product description, any relevant product images, SKU and/or other product IDs.</p>

<p>Even cooler: pair the Product API with <a href="https://www.diffbot.com/products/crawl/">Crawlbot</a>, our intelligent site-spidering tool, and let Diffbot determine which pages are products, then automatically structure the entire catalog. Here's a quick demonstration of Crawlbot at work:</p>

<figure>
  <iframe frameborder="0" src="http://www.youtube.com/embed/lfcri5ungRo?feature=oembed"></iframe>
</figure>

<p>We've developed the Product API over the course of two years, building upon our core vision technology that's extracted structured data from billions of web pages, and training our machine learning systems using data from tens of thousands of unique shopping sites. We can't wait for you to try it out.</p>

<p>What are you waiting for? Check out the <a href="https://www.diffbot.com/data/product/">Product API</a> and dive on in! <a href="http://app.diffbot.com/get-started">Get a trial token here.</a></p>

<p>Questions? Hit us up at <a href="mailto:[email protected]">[email protected]</a>.</p>
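
Because the response uses a small, predictable set of elements, a short parser is usually enough to pull out paragraph text, link targets, and embedded media. The sketch below again relies only on Python's standard html.parser module; ArticleOutline, summarize, and MEDIA_TAGS are hypothetical names chosen for illustration, and the sample string is a shortened excerpt in the style of the response above.

```python
from html.parser import HTMLParser

# Elements from the Media section of the table that can carry a src attribute.
MEDIA_TAGS = {"img", "iframe", "video", "audio", "source", "embed", "object"}


class ArticleOutline(HTMLParser):
    """Collects paragraph text, link targets, and media sources."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []  # plain text of each p element
        self.links = []       # href values of a elements
        self.media = []       # (tag, src) pairs
        self._buf = None      # text chunks of the p element currently open

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "p":
            self._buf = []
        elif tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])
        elif tag in MEDIA_TAGS and attrs.get("src"):
            self.media.append((tag, attrs["src"]))

    def handle_data(self, data):
        if self._buf is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._buf is not None:
            self.paragraphs.append("".join(self._buf).strip())
            self._buf = None


def summarize(html):
    outline = ArticleOutline()
    outline.feed(html)
    outline.close()
    return outline


sample = (
    '<p>Check out the <a href="https://www.diffbot.com/data/product/">Product API</a>.</p>'
    '<figure><iframe frameborder="0" '
    'src="http://www.youtube.com/embed/lfcri5ungRo?feature=oembed"></iframe></figure>'
)
outline = summarize(sample)
print(outline.paragraphs)  # ['Check out the Product API.']
print(outline.links)       # ['https://www.diffbot.com/data/product/']
print(outline.media)       # [('iframe', 'http://www.youtube.com/embed/lfcri5ungRo?feature=oembed')]
```

Inline tags such as a, b, and em contribute their text to the enclosing paragraph, which matches the table's note that inline elements are retained inside other elements.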