Our extraction APIs usually provide a clean extraction of a page, but in some cases you may encounter issues with specific fields, such as:
- a field is missing from the default API result (because our AI could not locate it on the page)
- a field contains incorrect data
You may also want the API to return a custom field containing data from the page that you define.
All of these cases can be handled with the Custom API tool (https://app.diffbot.com/custom/), which allows you to set custom Selectors (https://docs.diffbot.com/reference/custom-api-selectors) to define the data that will be extracted into each field.
Suppose you want to extract the page https://wiki.polkadot.network/docs/en/learn-staking, which has a clearly defined author at the bottom of the article. However, the default API call does not return the author.
First, browse to the Custom API section of the Dashboard (https://app.diffbot.com/custom/), select the API you want to process the page with, and enter the problematic URL. Then click "Create".
You will then see a list of fields that are returned by the API for this page. Click "Edit" next to the Author field.
You now have a clickable page preview and a form. You can either manually enter a Selector, or point-and-click to choose the correct element. A preview of the output will be displayed at the top of the screen.
We click the author in the preview window, and the selector .theme-last-updated b is filled in (see Selectors). However, the preview result shows that the selector picks up not only the author but also the date, which we do not want.
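To see why a selector like .theme-last-updated b can match more than the author, here is a minimal Python sketch. The HTML fragment, the names in it, and the way the matches are joined are all assumptions made for illustration; they are not taken from the actual page or from how the Custom API combines matches.

```python
from html.parser import HTMLParser

# Hypothetical markup: both the date and the author sit in <b> tags
# inside the element with class "theme-last-updated".
SAMPLE_HTML = (
    '<span class="theme-last-updated">Last updated on '
    "<b>6/21/2021</b> by <b>Jane Doe</b></span>"
)


class LastUpdatedBoldText(HTMLParser):
    """Collects the text of every <b> nested under .theme-last-updated.

    Simplified: assumes well-formed markup without void elements.
    """

    def __init__(self):
        super().__init__()
        self.container_depth = 0  # > 0 while inside the container element
        self.bold_depth = 0       # > 0 while inside a matching <b>
        self.matches = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.container_depth > 0:
            self.container_depth += 1
            if tag == "b":
                self.bold_depth += 1
                self.matches.append("")
        elif "theme-last-updated" in classes:
            self.container_depth = 1

    def handle_endtag(self, tag):
        if self.container_depth > 0:
            if tag == "b" and self.bold_depth > 0:
                self.bold_depth -= 1
            self.container_depth -= 1

    def handle_data(self, data):
        if self.bold_depth > 0:
            self.matches[-1] += data


parser = LastUpdatedBoldText()
parser.feed(SAMPLE_HTML)
selected_text = " ".join(parser.matches)
print(selected_text)
```

Under these assumptions the selector matches two elements, so the extracted value contains the date followed by the author name.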
To correct this, we click Filters, then select replace as the filter type in the drop-down box. This filter performs a regular-expression replace. In the Value field we enter \d+/\d+/\d+\s+(.*), and in the Replace with field we enter $1, which leaves only the text captured by (.*): the author name.
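The effect of this replace filter can be tried out locally. The following is a rough Python equivalent, not the Custom API's own implementation: the sample input string is hypothetical, and Python writes the backreference as \1 where the Replace with field uses $1.

```python
import re

# Same pattern as entered in the Value field.
DATE_THEN_REST = r"\d+/\d+/\d+\s+(.*)"


def strip_leading_date(raw: str) -> str:
    """Replace 'date + whitespace + rest' with just the captured rest."""
    return re.sub(DATE_THEN_REST, r"\1", raw)


# Hypothetical selector output: a date followed by the author name.
sample = "6/21/2021 Jane Doe"
cleaned = strip_leading_date(sample)
print(cleaned)  # Jane Doe
```

If the input does not start with a date in this form, the pattern does not match and the text is returned unchanged.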
Click Save to save and apply your rule.
Once saved, your rule takes effect immediately for Article API calls targeting pages on the wiki.polkadot.network domain. The domain is only the default scope: it is a regular expression that can be modified.
Now we can see that the API Call gives the intended result.
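To check the result outside the Dashboard, you can call the Article API and read the author field from the response. The sketch below only builds the request URL and parses a response. It assumes the v3 Article API endpoint and a response whose objects list carries the author field; YOUR_TOKEN is a placeholder and the sample response is invented for illustration.

```python
from urllib.parse import urlencode

ARTICLE_API = "https://api.diffbot.com/v3/article"  # assumed v3 endpoint


def build_article_request_url(token: str, page_url: str) -> str:
    """Build the GET URL for an Article API call on page_url."""
    return f"{ARTICLE_API}?{urlencode({'token': token, 'url': page_url})}"


def extract_author(response: dict):
    """Return the author of the first extracted object, or None if absent."""
    objects = response.get("objects") or []
    return objects[0].get("author") if objects else None


request_url = build_article_request_url(
    "YOUR_TOKEN",  # placeholder, replace with your own token
    "https://wiki.polkadot.network/docs/en/learn-staking",
)

# Invented response body, shaped like the assumed JSON structure.
sample_response = {"objects": [{"type": "article", "author": "Jane Doe"}]}
author = extract_author(sample_response)
print(author)  # Jane Doe
```

Fetching request_url with any HTTP client and passing the decoded JSON to extract_author should show whether the custom rule is being applied.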