Getting Started with Custom API

Our extraction APIs usually provide a clean extraction of a page, but in some cases you may encounter issues with specific fields, such as:

  • a field is missing from the default API result (because our AI could not locate it on the page)
  • a field contains incorrect data

In some cases you may also want the API to return a custom field containing data from the page that you specify.

All of these cases can be handled with the Custom API tool (https://app.diffbot.com/custom/), which allows you to set custom Selectors (https://docs.diffbot.com/reference/custom-api-selectors) to define the data that will be extracted into each field.

A simple example

Suppose you want to extract the page https://wiki.polkadot.network/docs/en/learn-staking, which has a clearly defined author at the bottom of the article. However, the default API call does not return the author.

First, browse to the Custom API section of the Dashboard, select the API you want to process the page with, and enter the problematic URL. Then click "Create".

You will then see a list of fields that are returned by the API for this page. Click "Edit" next to the Author field.

You now have a clickable page preview and a form. You can either manually enter a Selector, or point-and-click to choose the correct element. A preview of the output will be displayed at the top of the screen.

Click the author in the preview window, and the selector .theme-last-updated b is filled in (see Selectors). However, the preview result shows that the selector picks up not only the author but also the date, which is not wanted.

To correct this, click Filters, then select replace as the filter type in the drop-down box. This filter performs a regular-expression replace. In the Value field, enter \d+/\d+/\d+\s+(.*), and in the Replace with field, enter $1. The pattern matches the date and the whitespace after it and captures everything that follows, so replacing the whole match with $1 leaves only the author name.
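The effect of the replace filter can be reproduced locally to check a pattern before saving it. The following is a minimal sketch in Python, not the tool's own implementation: the sample text is a made-up value standing in for what the selector captures, and Python writes the replacement as \1 where the Custom API form uses $1.

```python
import re

# Hypothetical text captured by the selector `.theme-last-updated b`
# (a date followed by the author name); the real page value may differ.
raw = "12/3/2020 Jane Doe"

# Same pattern as the Value field: a date, whitespace, then the rest.
pattern = r"\d+/\d+/\d+\s+(.*)"

# Python's equivalent of entering $1 in the "Replace with" field.
author = re.sub(pattern, r"\1", raw)

print(author)
```

If the captured text does not contain a date in this form, the pattern does not match and the text is returned unchanged.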

Click Save to save and apply your rule.

Once saved, the rule takes effect immediately for Article API calls that target pages on the wiki.polkadot.network domain. Matching the whole domain is the default; the match is a regular expression that you can modify.
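To illustrate how a domain-level regular expression decides which pages a rule covers, here is a short sketch. The pattern below is an assumption written for this example; the default expression that the Custom API tool generates may look different, and rule_applies is a hypothetical helper, not part of any Diffbot library.

```python
import re

# Hypothetical domain pattern: optional scheme, optional subdomain,
# the target host, then any path. The tool's real default may differ.
DOMAIN_PATTERN = re.compile(
    r"(https?://)?([^/]*\.)?wiki\.polkadot\.network(/.*)?"
)


def rule_applies(url: str) -> bool:
    """Return True if the whole URL matches the domain pattern."""
    return DOMAIN_PATTERN.fullmatch(url) is not None


print(rule_applies("https://wiki.polkadot.network/docs/en/learn-staking"))
print(rule_applies("https://polkadot.network/"))
```

Narrowing the expression, for example by requiring a /docs/ path prefix, would restrict the rule to a subset of pages on the domain.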

The API call now returns the intended result: the Author field contains only the author's name.
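The result can also be checked from a script. The sketch below assumes the v3 Article API endpoint with token and url query parameters, and a JSON response whose objects list carries an author field; confirm these against the API reference before relying on them. The function names are hypothetical helpers written for this example, and YOUR_TOKEN is a placeholder.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint for the Article API; check the API reference.
ARTICLE_ENDPOINT = "https://api.diffbot.com/v3/article"


def build_request_url(token: str, page_url: str) -> str:
    """Build the request URL, percent-encoding the target page URL."""
    return ARTICLE_ENDPOINT + "?" + urlencode({"token": token, "url": page_url})


def extract_author(response: dict):
    """Return the author of the first extracted object, or None."""
    objects = response.get("objects", [])
    return objects[0].get("author") if objects else None


def fetch_author(token: str, page_url: str):
    """Call the API over the network and return the extracted author."""
    with urlopen(build_request_url(token, page_url)) as resp:
        return extract_author(json.load(resp))


# Usage (requires a valid token and network access):
# fetch_author("YOUR_TOKEN", "https://wiki.polkadot.network/docs/en/learn-staking")
```

If the custom rule is applied, the value returned for this page should be the author name without the date.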
