Can Extract APIs Extract Content from PDFs or Other Documents?

Yes, for PDFs, but only in direct Extract API calls and Bulk Extract jobs, not in Crawl.

As of September 2016, Diffbot’s Automatic Extract APIs are able to structure content from PDF files.

PDF extraction is available only in direct API calls; it is not currently possible to process PDFs while using Crawl. (PDF URLs will, however, be processed successfully in Bulk Extract jobs.)

Quality of PDF extraction varies and depends significantly on the underlying structure of the document itself.
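
As a rough illustration of a direct call, the sketch below builds a request that passes a PDF URL to an Extract API. It assumes the v3 endpoint pattern (https://api.diffbot.com/v3/article with token and url query parameters); the helper names build_extract_url and extract_document, the timeout value, and the example PDF URL are hypothetical, so check the current API reference before relying on any of it.

```python
import json
import urllib.parse
import urllib.request

# Assumed base URL for the v3 Extract APIs; verify against the current docs.
DIFFBOT_BASE = "https://api.diffbot.com/v3"


def build_extract_url(token: str, document_url: str, api: str = "article") -> str:
    """Build a direct Extract API request URL for a document such as a PDF.

    `api` is the Extract API name (assumed here to be e.g. "article").
    The document URL is percent-encoded so it survives as a query parameter.
    """
    query = urllib.parse.urlencode({"token": token, "url": document_url})
    return f"{DIFFBOT_BASE}/{api}?{query}"


def extract_document(token: str, document_url: str, api: str = "article",
                     timeout: float = 60.0) -> dict:
    """Send the request and return the parsed JSON response.

    The shape of the response is not assumed here; inspect it before use,
    since PDF extraction quality depends on the document's structure.
    """
    request_url = build_extract_url(token, document_url, api)
    with urllib.request.urlopen(request_url, timeout=timeout) as response:
        return json.load(response)


# Example usage (requires a valid token and network access):
#   result = extract_document("YOUR_TOKEN", "https://example.com/report.pdf")
#   print(json.dumps(result, indent=2)[:500])
```

Because results vary from one PDF to another, it is worth inspecting the returned JSON for a few representative documents before building on it.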

Updated 4 months ago
