Diffbot Extract is a popular solution for replacing high-volume web scraping pipelines, as rule-based web scraping tends to become costly and frustrating to maintain at scale.
Instead of a set of rules, Diffbot Extract uses computer vision to "read" a web page, categorize it into a standard page type, and extract its contents based on a standard schema.
If your use case involves scraping potentially thousands of pages across several different sites, you could define rules for each individual page, or you could just use Diffbot Extract. You can test drive Diffbot Extract for your use case (no sign-up required) on diffbot.com/testdrive.
Instead of site-specific rules, Diffbot Extract relies on a standard ontology that describes most page types on the web. It can classify any page on the web into one of these standard page types, and then "read" the page using pre-trained ML models to look for standard fields like offerPrice for product pages and author for article pages.
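The classify-then-extract flow described above can be sketched in Python. This is a minimal illustration, not the official client: the endpoint URL, the `token` and `url` query parameters, and the `objects`/`type` response keys are assumptions to verify against the Extract API reference, and the helper names are hypothetical. Only `offerPrice` and `author` come from the text above.

```python
from urllib.parse import urlencode

# Assumed endpoint for automatic page classification; confirm it in the API reference.
ANALYZE_ENDPOINT = "https://api.diffbot.com/v3/analyze"

# Standard field to read for each page type (field names taken from the text above).
FIELDS_BY_TYPE = {"product": "offerPrice", "article": "author"}


def build_request_url(page_url: str, token: str) -> str:
    """Build a GET URL that asks the service to classify and extract one page.

    The 'token' and 'url' parameter names are assumptions.
    """
    query = urlencode({"token": token, "url": page_url})
    return f"{ANALYZE_ENDPOINT}?{query}"


def read_standard_field(response: dict):
    """Return (page_type, field_value) for the first extracted object.

    Assumes the parsed JSON response has an 'objects' list whose items
    carry a 'type' key. Returns (None, None) when nothing was extracted.
    """
    objects = response.get("objects", [])
    if not objects:
        return None, None
    first = objects[0]
    page_type = first.get("type")
    field_name = FIELDS_BY_TYPE.get(page_type)
    value = first.get(field_name) if field_name else None
    return page_type, value
```

The same code path handles a product page on one site and an article on another, which is the point of a shared schema: no per-site selectors appear anywhere in the sketch.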
Some Extract APIs, like List API, may have a few standard fields, but are designed to be as adaptable as possible to any kind of list on any website.
Others, like Product API, feature more opinionated ontologies that make it easy to integrate with an existing product database.
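As a rough sketch of what integrating with an existing product database might look like, the snippet below maps one extracted product object onto a local record. The `ProductRow` shape and the helper functions are hypothetical, and the `pageUrl` and `title` keys are assumed field names; only `offerPrice` is named in the text above. The price parser is deliberately naive.

```python
import re
from dataclasses import dataclass
from decimal import Decimal, InvalidOperation
from typing import Optional


@dataclass
class ProductRow:
    """Hypothetical row shape for an existing product table."""
    source_url: str
    name: str
    price: Optional[Decimal]


def parse_price(raw) -> Optional[Decimal]:
    """Turn a price string such as '$1,299.00' into a Decimal.

    Naive: strips everything except digits and dots, so it does not
    handle formats that use a comma as the decimal separator.
    """
    if raw is None:
        return None
    cleaned = re.sub(r"[^\d.]", "", str(raw))
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None


def to_row(product: dict) -> ProductRow:
    """Map one extracted product object to a ProductRow.

    'pageUrl' and 'title' are assumed keys; 'offerPrice' is from the text.
    """
    return ProductRow(
        source_url=product.get("pageUrl", ""),
        name=product.get("title", ""),
        price=parse_price(product.get("offerPrice")),
    )
```

Because the product fields are the same for every site, a single mapping function like `to_row` can serve the whole pipeline instead of one adapter per retailer.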
A full list of Extract APIs is available here.
If you are less technical, you might find the pre-crawled and extracted data in the Diffbot Knowledge Graph more accessible.
If none of the above methods apply to you, consider rule-based web scraping solutions. These are often a bit simpler to understand and implement. Here are a few options (no affiliation):