Tutorial: How to use Prefilters to Ignore Website Elements

Ads, popups, or other modals getting in the way of a clean extraction? Prefilters to the rescue! (⏲️ 10 Minutes)

Prefilters let you make the Diffbot Extract API completely ignore certain elements on a website. This is very useful for ads, popups, and other pesky UI elements that might confuse Diffbot while it extracts information.

There are two ways to add prefilters to your rules: via the dashboard, or programmatically via API.

Add Prefilters on the Dashboard

Let's assume we want to process this URL as an Article. The result we get from Diffbot is:

The text is missing. Let's look at the website and see what might be going on.

The website has a rather invasive overlay and popup. This is what's probably interfering with Diffbot's extraction. Prefilters to the rescue!

Finding the selectors

Prefilters block page elements by targeting their CSS selectors. We can find the selectors with a browser's dev tools: right-click the element we want to remove and select "Inspect element". The dev tools will highlight that element in the page's HTML, but that's usually not enough; often we need to target the parent of that element, a level or two above. For best results, delete candidate elements in the DOM one at a time until the undesired elements disappear from the page. In this case, the selectors were .modalOverlay and .modal.
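Before saving a prefilter, it can help to double-check that the class names you picked really occur in the page markup. The snippet below is a minimal sketch of such a check using only Python's standard library. The sample_html string and the helper names are hypothetical stand-ins, and the check only covers simple class selectors such as .modal. Keep in mind that popups are often injected by JavaScript, so a selector can be valid even if it is absent from the static HTML source.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Collects every class name that appears in an HTML document."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                self.classes.update(value.split())


def missing_class_selectors(html, selectors):
    """Return the simple class selectors (".name") not found in the HTML."""
    collector = ClassCollector()
    collector.feed(html)
    return [s for s in selectors if s.lstrip(".") not in collector.classes]


# Hypothetical markup standing in for the real page source.
sample_html = """
<body>
  <div class="modalOverlay"></div>
  <div class="modal newsletter">Subscribe!</div>
  <article class="content">Article text</article>
</body>
"""

# The last selector is deliberately misspelled to show what a miss looks like.
print(missing_class_selectors(sample_html, [".modal", ".modalOverlay", ".modelOverlay"]))
```

An empty list means every selector matched at least one element in the markup you supplied.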

Blocking the elements

To block the elements in the UI, we'll head into the Dashboard. Once in, open the Custom API menu option in the left sidebar. Create a new Custom Article API (select Article in the API menu), enter the URL above, and hit 'Create'.

Once loaded, open the "Other Settings" tab and scroll down to Prefilters. In that text area, enter .modal, .modalOverlay. Scroll to the bottom and press Save.

Once the preview reloads, you should already see results in the text and html fields. And indeed, if you use the top-right option to make an "API Call", you will notice the article is now extracted properly.

Add Prefilters Programmatically via API

Prefilters can also be added via the API. As with the Dashboard process, we will create a Custom Article API that extends the default Article API.

To do this, we will be using the endpoint that Updates a Custom API.

We need to first construct a ruleset to update our Custom API with. This is what it will look like:

{
    "prefilters": [
        ".modal",
        ".modalOverlay"
    ],
    "api": "/api/article",
    "urlPattern": "(http(s)?://)?(.*\\.)?www.thirdsector.co.uk.*",
    "testUrl": "https://www.thirdsector.co.uk/rspca-union-unable-agree-date-talks-avert-possible-industrial-action/management/article/1669860"
}
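If you would rather generate the ruleset than write it by hand, the following sketch assembles the same fields in Python and lets json.dumps take care of quoting and backslash escaping. The helper name build_prefilter_rule is hypothetical, and the local regex check assumes Python's re module behaves closely enough to whatever engine Diffbot uses to evaluate urlPattern.

```python
import json
import re


def build_prefilter_rule(selectors, url_pattern, test_url, api="/api/article"):
    """Assemble a ruleset dict using the field names shown in this tutorial."""
    return {
        "prefilters": list(selectors),
        "api": api,
        "urlPattern": url_pattern,
        "testUrl": test_url,
    }


rule = build_prefilter_rule(
    selectors=[".modal", ".modalOverlay"],
    # Raw string: the backslash is part of the regex, not a Python escape.
    url_pattern=r"(http(s)?://)?(.*\.)?www.thirdsector.co.uk.*",
    test_url=(
        "https://www.thirdsector.co.uk/rspca-union-unable-agree-date-talks-"
        "avert-possible-industrial-action/management/article/1669860"
    ),
)

# Local sanity check (assumption: Python's `re` approximates Diffbot's matcher).
assert re.fullmatch(rule["urlPattern"], rule["testUrl"])

# json.dumps quotes the keys and doubles the backslash so the body is valid JSON.
body = json.dumps(rule, indent=4)
print(body)
```

Serializing from a dict avoids two easy mistakes in hand-written bodies: unquoted keys and a single backslash inside a JSON string.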

Finally, we will send this ruleset in the body of a POST request to https://api.diffbot.com/v3/custom?token=YOURTOKEN. Here's what it looks like as a cURL command:

curl --location --request POST 'https://api.diffbot.com/v3/custom?token=YOURTOKEN' \
--header 'Content-Type: application/json' \
--data-raw '{
    "prefilters": [
        ".modal",
        ".modalOverlay"
    ],
    "api": "/api/article",
    "urlPattern": "(http(s)?://)?(.*\\.)?www.thirdsector.co.uk.*",
    "testUrl": "https://www.thirdsector.co.uk/rspca-union-unable-agree-date-talks-avert-possible-industrial-action/management/article/1669860"
}'

The response to this request will contain a hash that can be used to update this rule directly with the same endpoint.

Keep in mind that if an existing Custom API has already been created with the same urlPattern and api values, this request will completely override your existing Custom API. Otherwise, it will simply create a new Custom API.
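For scripts that manage rules outside the shell, the same POST request can be sketched with Python's standard library. This is an illustration rather than an official client: build_custom_api_request is a hypothetical helper, YOURTOKEN is a placeholder, and the network call is left commented out because sending it may override an existing Custom API as described above. The response format is not assumed here, so the sketch would simply print the raw body.

```python
import json
import urllib.parse
import urllib.request

# Endpoint taken from this tutorial; the token is passed as a query parameter.
CUSTOM_API_ENDPOINT = "https://api.diffbot.com/v3/custom"


def build_custom_api_request(rule, token):
    """Build (but do not send) the POST request carrying the ruleset."""
    url = CUSTOM_API_ENDPOINT + "?" + urllib.parse.urlencode({"token": token})
    return urllib.request.Request(
        url,
        data=json.dumps(rule).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


rule = {
    "prefilters": [".modal", ".modalOverlay"],
    "api": "/api/article",
    "urlPattern": r"(http(s)?://)?(.*\.)?www.thirdsector.co.uk.*",
    "testUrl": (
        "https://www.thirdsector.co.uk/rspca-union-unable-agree-date-talks-"
        "avert-possible-industrial-action/management/article/1669860"
    ),
}

request = build_custom_api_request(rule, token="YOURTOKEN")
print(request.get_method(), request.full_url)

# Uncomment to send. This replaces any existing Custom API that has the
# same urlPattern and api values.
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```

Building the request separately from sending it makes it easy to inspect the URL and body before touching a live rule.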