Introduction to Extract API

Extract uses computer vision and natural language processing to automatically categorize web pages and extract their contents into clean, structured JSON.

The following Extract APIs are available with any valid Diffbot token.

Automatic APIs

  • Analyze API. If you aren't sure what type of content is at your URL, start with the Analyze API. It uses machine learning to classify your URL automatically and route it to the matching page-type extraction API.

Page Type APIs

If you know what type of content is at your URL, or want to force extraction as a specific type of content, use one of the page-type-specific APIs below.

  • Article API allows you to extract information about news articles, blog posts, and other written content. Diffbot can recognize authors and their profile images and links, dates and locations of publication, sentiment, tags based on content, images in the article, comments, language the content is written in, and more.
  • Product API allows you to extract information about products, including specifications, colors, availability, price, discount offers, shipping, description, reviews, and more.
  • Image API allows you to extract detailed information about images, from dimensions and download URLs to what's in the image through image recognition.
  • Video API works like the Image API, but for videos.
  • Discussion API is used for extracting threads of content. This can be a review section of a product (indeed, Product API uses the Discussion API internally when extracting comments to include them in the output), a forum or Reddit thread, or a comment section in a blog.
  • Event API (BETA) is used for extracting online and in-person event details for standalone events that occur within a single day. Support for multi-day, multi-track events, i.e. full conferences and festivals, is planned but not yet scheduled.
  • List API (BETA) is used for extracting data from any single listings page, such as news index pages, product listings pages, and search engine results pages.
  • Job API (BETA) is used for extracting data from a single job listing page.
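
The choice between the Analyze API and a page-type API can be sketched as a small URL-building helper. This is a minimal sketch, not a definitive client: the base URL `https://api.diffbot.com/v3`, the lowercase endpoint names, and the `token` and `url` query parameters are assumptions that are not stated on this page, and `build_extract_url` is a hypothetical helper name. Check them against the API reference before relying on them.

```python
from typing import Optional
from urllib.parse import urlencode

# ASSUMPTION: base URL and endpoint names are not given on this page.
BASE_URL = "https://api.diffbot.com/v3"

# Page-type APIs listed above, written as assumed lowercase endpoint names.
PAGE_TYPE_ENDPOINTS = {
    "article", "product", "image", "video",
    "discussion", "event", "list", "job",
}


def build_extract_url(token: str, page_url: str,
                      page_type: Optional[str] = None) -> str:
    """Build a request URL for an Extract API (hypothetical helper).

    With no page_type, the Analyze API is used so the page is classified
    automatically. With a page_type, extraction is forced as that type.
    """
    endpoint = "analyze" if page_type is None else page_type.lower()
    if endpoint != "analyze" and endpoint not in PAGE_TYPE_ENDPOINTS:
        raise ValueError(f"unknown page type: {page_type!r}")
    # ASSUMPTION: the token and target URL are passed as query parameters.
    # urlencode percent-encodes the target URL so it survives as one value.
    query = urlencode({"token": token, "url": page_url})
    return f"{BASE_URL}/{endpoint}?{query}"


if __name__ == "__main__":
    # Unknown content type: let the Analyze API classify the page.
    print(build_extract_url("YOUR_TOKEN", "https://example.com/post?id=1"))
    # Known content type: force extraction with the Article API.
    print(build_extract_url("YOUR_TOKEN", "https://example.com/post?id=1", "article"))
```

The helper only builds the URL. Sending the request and reading the JSON response is left to whichever HTTP client you already use.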

Custom API

  • Custom API can be used either to correct and augment automatically extracted output or to create entirely new custom extractions by defining rules. A point-and-click interface lets you build CSS-based selectors, regular expressions, and attribute filters, or you can use the Custom API programmatically.