2017-04-21

by Jerome Choo
  • The beta category field has been added to the Product API. See documentation.
  • All extraction APIs now support sending fully custom headers using the X-Forward- prefix. Previously, only four predefined headers were supported.
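
As a rough illustration of the X-Forward- convention, the sketch below prefixes a set of custom headers before they are attached to an API request. The helper name `build_forward_headers` and the sample header names are hypothetical, and the exact forwarding behavior should be confirmed against the documentation.

```python
def build_forward_headers(custom_headers):
    """Prefix each custom header name with "X-Forward-".

    Hypothetical helper: it only renames the headers and sends nothing.
    """
    prefix = "X-Forward-"
    return {prefix + name: value for name, value in custom_headers.items()}


# Sample headers chosen purely for illustration.
headers = build_forward_headers({
    "X-Api-Client": "my-app",
    "Accept-Language": "de-DE",
})
print(headers)
```

The resulting dictionary can be passed as the request headers by whichever HTTP client is in use.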

2017-04-10

by Jerome Choo
  • In the Article and Discussion APIs' tags element, DBpedia URI values are now properly URL-encoded.
  • Fixed an issue when sorting by date in the Search API.
  • Various improvements and fixes to the Global Index.
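
To show what URL-encoding a DBpedia URI looks like in practice, here is a minimal sketch using Python's standard library. The function name `encode_uri` and the sample URI are illustrative assumptions, not the service's actual implementation, and the input is assumed not to be percent-encoded already.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def encode_uri(uri):
    """Percent-encode non-ASCII and reserved characters in the URI path.

    Assumes the input is not already encoded; an existing "%" would be
    encoded again.
    """
    parts = urlsplit(uri)
    encoded_path = quote(parts.path, safe="/")
    return urlunsplit(
        (parts.scheme, parts.netloc, encoded_path, parts.query, parts.fragment)
    )


encoded = encode_uri("http://dbpedia.org/resource/Beyoncé")
print(encoded)  # http://dbpedia.org/resource/Beyonc%C3%A9
```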

2017-01-12

by Jerome Choo
  • The Account API now tracks Global Index search calls/requests.
  • Improved SKU detection and extraction in the Product API.
  • Article API: Added support for the start attribute (ol elements) and data- attributes in normalized HTML.
  • In the Article API, identified image captions will no longer be returned in the text field content.
  • Various improvements to replacement rule regular expressions in Custom APIs.
  • PDF processing improvements.
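
For consumers of the normalized HTML, the sketch below shows one way to read the start attribute of ol elements and any data- attributes using Python's built-in parser. The class name `ListAttributeParser` and the sample markup are hypothetical; the markup is not actual API output.

```python
from html.parser import HTMLParser


class ListAttributeParser(HTMLParser):
    """Collect ol start values and data- attributes from an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.list_starts = []
        self.data_attributes = {}

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "ol":
            # An ordered list without a start attribute begins at 1.
            self.list_starts.append(int(attr_map.get("start") or 1))
        for name, value in attrs:
            if name.startswith("data-"):
                self.data_attributes[name] = value


# Illustrative fragment only.
sample_html = '<ol start="4" data-list-id="steps"><li>Fourth</li><li>Fifth</li></ol>'
parser = ListAttributeParser()
parser.feed(sample_html)
print(parser.list_starts, parser.data_attributes)
```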

2016-12-09

by Jerome Choo
  • Product API: overriding the sku, mpn or related fields using custom rules will now affect the productId field as well.
  • Crawls using the Analyze API will now correctly index video pages.
  • Improved the reliability of the fields=links argument in all Automatic APIs.
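
A request that uses the fields=links argument might be assembled as in the sketch below. The endpoint URL, the token placeholder, and the helper name `build_request_url` are assumptions made for illustration and should be checked against the API documentation.

```python
from urllib.parse import urlencode

# Assumed endpoint, used here for illustration only.
API_ENDPOINT = "https://api.diffbot.com/v3/article"


def build_request_url(token, page_url, fields):
    """Build a GET URL that asks for optional fields such as "links"."""
    query = urlencode({
        "token": token,
        "url": page_url,
        "fields": ",".join(fields),
    })
    return f"{API_ENDPOINT}?{query}"


request_url = build_request_url("YOUR_TOKEN", "https://example.com/post", ["links"])
print(request_url)
```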

2016-12-01

by Jerome Choo
  • Updated our rendering engine to properly support more Unicode scripts.

2016-11-30

by Jerome Choo
  • Updates to our status page for improved coverage and reliability.
  • Crawlbot crawls can now have repeat settings adjusted or added after a crawl completes.
  • Fixed a Crawlbot issue wherein users could completely erase the seeds field.

2016-11-13

by Jerome Choo
  • POSTing to our APIs is speedier, particularly when content includes slow-loading third-party assets.
  • Crawlbot now has limited support for crawling/processing content across multiple domains.

2016-11-06

by Jerome Choo
  • Improvements to background image detection and extraction across all Automatic APIs. This resolves many issues with proper extraction from sites that use background CSS properties for image delivery.
  • Improvements to specification extraction in the Product API.
  • Improvements to HTML <figure> parsing in the Article API.

2016-10-24

by Jerome Choo
  • The diffbotUri value for Custom APIs is now calculated from the entirety of custom content (and can be used to detect changes between extractions).
  • Various improvements to caption detection and parsing in the Article API.
  • Crawlbot now honors robots.txt directives addressed to the "Diffbot" user-agent, so that our crawler can be whitelisted when crawling partner or other sites.
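
As an illustration of user-agent whitelisting, the sketch below parses a sample robots.txt that blocks all crawlers except "Diffbot", using Python's standard `urllib.robotparser`. The robots.txt content and the example.com URL are made up for this example, and the parser shown is Python's, not Crawlbot's own.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block everyone, but allow the "Diffbot" user-agent.
ROBOTS_TXT = """\
User-agent: *
Disallow: /

User-agent: Diffbot
Allow: /
"""

robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())

page = "https://example.com/products/1"
diffbot_allowed = robots.can_fetch("Diffbot", page)
other_allowed = robots.can_fetch("OtherBot", page)
print(diffbot_allowed, other_allowed)  # True False
```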

2016-10-04

by Jerome Choo
  • Increased the size limits for content POSTed to Diffbot APIs.
  • Bulk Service jobs now require a minimum of 50 URLs for Startup plan customers.
  • Bulk Service and Crawlbot jobs now automatically retry failing URLs.