Improvements to background image detection and extraction across all Automatic APIs. This resolves many extraction failures on sites that deliver images through CSS background properties.
Improvements to specification extraction in the Product API.
Improvements to HTML <figure> parsing in the Article API.
The diffbotUri value for Custom APIs is now calculated from the entirety of custom content (and can be used to detect changes between extractions).
Various improvements to caption detection and parsing in the Article API.
Crawlbot now honors robots.txt directives addressed to the "Diffbot" user-agent, so that site owners (partners or otherwise) can whitelist our crawler.
Numerous improvements to normalizedSpecs in the Product API.
Diffbot Automatic APIs now process PDFs. A PDF at a submitted URL is converted to HTML and then analyzed for extractable content. PDFs are not currently supported while crawling.
Crawlbot fixes to reduce DNS errors when starting new crawls or crawl rounds.
Crawlbot and Bulk Processing: deletion of a nonexistent job will no longer return a "success" message.
Improved handling of UTF-8 encoded characters within Crawlbot.
Fixed an issue where large Crawlbot and Bulk job downloads would prematurely terminate.
Added beta support for executing custom JavaScript on a page before it is processed by an extraction API. See Analyze API example (works with all Automatic and Custom APIs).
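For the robots.txt entry above, the following is a minimal sketch of how a site owner might check that a robots.txt file admits the "Diffbot" user-agent while blocking other crawlers. It uses Python's standard-library parser, which is not necessarily how Crawlbot evaluates directives; the robots.txt content, the `is_allowed` helper, the URL, and the "SomeOtherBot" name are illustrative assumptions.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: admit the "Diffbot" user-agent, block everyone else.
ROBOTS_TXT = """\
User-agent: Diffbot
Allow: /

User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given user-agent may fetch the URL under robots_txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Illustrative URL; any path on the site behaves the same under these rules.
diffbot_ok = is_allowed(ROBOTS_TXT, "Diffbot", "https://example.com/products/1")
other_ok = is_allowed(ROBOTS_TXT, "SomeOtherBot", "https://example.com/products/1")
print(diffbot_ok, other_ok)
```

Under the standard parser, the more specific "Diffbot" group takes precedence over the wildcard group, which is the behavior a whitelisting rule relies on.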
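For the diffbotUri entry above, change detection between two Custom API extractions of the same page could be sketched as a simple comparison of the two values. The `changed_since` helper and the sample objects are hypothetical; the only thing assumed from the API is that each extracted object carries a `diffbotUri` field, and the values shown are placeholders rather than real output.

```python
def changed_since(previous: dict, current: dict) -> bool:
    """Return True when two extractions of the same page have different diffbotUri values.

    Raises KeyError if either object lacks a diffbotUri, rather than guessing.
    """
    return previous["diffbotUri"] != current["diffbotUri"]


# Placeholder objects standing in for two Custom API results of one URL.
first_run = {"pageUrl": "https://example.com/item", "diffbotUri": "placeholder-uri-a"}
second_run = {"pageUrl": "https://example.com/item", "diffbotUri": "placeholder-uri-a"}
third_run = {"pageUrl": "https://example.com/item", "diffbotUri": "placeholder-uri-b"}

unchanged = changed_since(first_run, second_run)
changed = changed_since(first_run, third_run)
print(unchanged, changed)
```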