Custom API Rulesets

A set of rules and parameters defining what a Custom API actually extracts.

Every instance of a Custom API is defined by a JSON ruleset object. It includes a rules array of rule objects, the api the ruleset applies to, and a urlPattern that matches the URLs to be extracted with this API.

A simple ruleset object looks like this:

{
  "rules": [
    {
      "name": "Description",
      "selector": ".entry-content p"
    }
  ],
  "api": "/api/list",
  "urlPattern": "(http(s)?://)?(.*\\.)?blog.diffbot.com.*",
  "testUrl": "https://blog.diffbot.com/knowledge-graph-glossary/"
}

In this ruleset, the List API is extended to also extract a Description field for URLs matching the urlPattern.
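Because the ruleset is JSON, the backslash in urlPattern is written twice ("\\.") and decodes to the single regex token \. when the ruleset is parsed. The following is a minimal sketch for checking a pattern against a URL locally before saving a ruleset. It makes two assumptions that the text above does not confirm: that the pattern must match the entire URL, and that Python's re module behaves like Diffbot's own regex engine for a pattern this simple. The helper name url_matches is hypothetical.

```python
import json
import re

# The ruleset from the example above, as it would be stored as JSON text.
RULESET_JSON = r'''
{
  "rules": [{"name": "Description", "selector": ".entry-content p"}],
  "api": "/api/list",
  "urlPattern": "(http(s)?://)?(.*\\.)?blog.diffbot.com.*",
  "testUrl": "https://blog.diffbot.com/knowledge-graph-glossary/"
}
'''

ruleset = json.loads(RULESET_JSON)

# JSON decoding turns the escaped "\\." into the regex token "\.".
pattern = re.compile(ruleset["urlPattern"])


def url_matches(url: str) -> bool:
    """Return True if the whole URL matches urlPattern (assumed semantics)."""
    return pattern.fullmatch(url) is not None


print(url_matches(ruleset["testUrl"]))             # True
print(url_matches("https://www.example.com/post"))  # False
```

A quick check like this can catch a missing escape or an overly narrow pattern before the ruleset is previewed against testUrl.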

A complete Custom API ruleset contains, at minimum, the urlPattern, api, and rules fields. Each rule object inside rules is described by the name, selector, value, and filters fields.

| Field | Description |
|---|---|
| urlPattern | Regular expression used to match URLs to the appropriate rule. |
| api | Diffbot API against which the ruleset should be applied. The api value should include the /api/ string, e.g. /api/article. |
| rules | An array of rules applying to individual fields of the Diffbot API. The rules array can be empty ("rules": []). See Defining a Rule below. |
| name | Field to correct (e.g., title) or add (e.g., customField). |
| selector | CSS selector to find the appropriate content on the page. |
| value | Optional: a specific value to hard-code, in lieu of a selector. |
| filters | Optional: additional options to replace content, ignore selectors, or extract HTML attribute values. See Using Rule Filters below. |
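As an illustration of these field requirements, here is a small local sanity check written in Python. It is only a sketch of the rules stated above (three top-level fields, an api value containing /api/, and a name plus either a selector or a value for each rule); it is not Diffbot's own validation, and the function name ruleset_problems and the selector h1.entry-title are hypothetical.

```python
def ruleset_problems(ruleset: dict) -> list:
    """Return human-readable problems found in a ruleset dict (empty if none)."""
    problems = []
    # Top-level fields listed as the minimum for a complete ruleset.
    for field in ("urlPattern", "api", "rules"):
        if field not in ruleset:
            problems.append(f"missing top-level field: {field}")
    api = ruleset.get("api")
    if isinstance(api, str) and not api.startswith("/api/"):
        problems.append("api should include the /api/ string, e.g. /api/article")
    # An empty rules array is allowed, so only inspect rules that are present.
    for index, rule in enumerate(ruleset.get("rules", [])):
        if "name" not in rule:
            problems.append(f"rules[{index}] has no name")
        if "selector" not in rule and "value" not in rule:
            problems.append(f"rules[{index}] needs a selector or a hard-coded value")
    return problems


good = {
    "urlPattern": "(http(s)?://)?(.*\\.)?blog.diffbot.com.*",
    "api": "/api/article",
    "rules": [{"name": "title", "selector": "h1.entry-title"}],
}
bad = {"api": "article", "rules": [{"selector": "h1"}]}

print(ruleset_problems(good))  # []
for problem in ruleset_problems(bad):
    print(problem)
```

The second ruleset reports three problems: the missing urlPattern, the api value without /api/, and the rule without a name.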

Custom API rulesets may also include these optional parameters.

| Field | Description |
|---|---|
| testUrl | Optional: A sample URL used to preview your rule within the Custom API Toolkit in the Dashboard. |
| prefilters | Optional: An array of selectors that should be completely dropped from the DOM. These selectors will be fully ignored by all Diffbot processing. |
| renderOptions | Optional: Query string arguments to be passed to the Diffbot rendering engine, e.g. wait=5000. See the renderOptions documentation for details. |
| xForwardHeaders | Optional: An object containing any custom headers to be passed along in all requests to URLs matching the urlPattern. Header values can either be a single string, or an array of strings (from which one will be selected at request-time). |

Custom headers in xForwardHeaders can include:

| Header | Description |
|---|---|
| User-Agent | Optional: User agent to use in place of Diffbot default. |
| Referrer | Optional: Custom referrer to use in place of Diffbot default. |
| Cookie | Optional: Custom cookie content to be sent with all requests. |
| Accept-Language | Optional: Custom accept-language header to be sent. |
| X-Evaluate | Optional: Custom JavaScript to be executed at render-time. |
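To make the shape of these optional parameters concrete, the sketch below builds a ruleset with prefilters and xForwardHeaders as a Python dict and prints it as JSON. The selectors and user agent strings are made up for illustration. The pick_headers helper is also hypothetical: it only mimics the "one value is selected at request-time" behavior described above, and choosing at random is an assumption about how that selection works.

```python
import json
import random

ruleset = {
    "urlPattern": "(http(s)?://)?(.*\\.)?blog.diffbot.com.*",
    "api": "/api/article",
    "rules": [],
    "testUrl": "https://blog.diffbot.com/knowledge-graph-glossary/",
    # Hypothetical selectors to drop from the DOM before any processing.
    "prefilters": [".advertisement", "#comments"],
    "xForwardHeaders": {
        # A single string is always sent as-is.
        "Accept-Language": "en-US",
        # An array offers several candidates; one is chosen per request.
        "User-Agent": ["ExampleBot/1.0", "ExampleBot/2.0"],
    },
}


def pick_headers(x_forward_headers: dict, rng: random.Random) -> dict:
    """Resolve each header to one concrete value (random choice is an assumption)."""
    return {
        name: rng.choice(value) if isinstance(value, list) else value
        for name, value in x_forward_headers.items()
    }


print(json.dumps(ruleset, indent=2))
print(pick_headers(ruleset["xForwardHeaders"], random.Random(0)))
```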


Defining a Rule

To recap: a single Custom API instance is defined by a JSON ruleset object. This ruleset object contains an array of rule objects as well as the parameters listed above.

In this section, we look at what defines a single rule object inside the rules array of a complete Custom API ruleset.

Here's an example of a simple rule object:

{
    "selector": ".entry-content p",
    "name": "text"
}

A Custom API with this rule will:

  1. Look for a DOM element corresponding to the CSS selector .entry-content p
  2. Extract the text content of that element
  3. Return it in the response of the Custom API under the field named text

Note: Custom API rules can be used to "correct" individual fields of an Extract API. To correct a field that isn't extracting automatically, define a custom rule using the same name as the incorrectly extracted field.

Experience with CSS selectors will be very helpful in defining Custom API rules. A reference of all supported selectors and operators is available in the Diffbot documentation.

Should multiple elements match a selector, the text contents of all matched elements are concatenated into a single string in the output value.

A rule may also extract the value of an attribute on the selected element. To do this, we can use a rule filter.

Using Rule Filters

Filters may be used in a Custom API rule to get an attribute value of an element, replace extracted content, or exclude certain sections of content.

Here's an example of a rule filter that extracts the src value of all img elements.

{
  "selector": "img",
  "name": "url",
  "filters": [
    {
      "args": [
        "src"
      ],
      "type": "attribute"
    }
  ]
}

A filter object is constructed with an args and a type field.

  • type specifies a filter type to be used (attribute, exclude, or replace)
  • args is an array of arguments to be provided to the filter

A rule may contain multiple filters, which is why filters is a JSON array.

More details on the use of each available filter are given below.

Filter Type: attribute

Retrieves the value of the attribute named in args from the selected element.

For example, to extract the link http://www.blog.diffbot.com from the anchor tag <a href="http://www.blog.diffbot.com" class="outbound">, we may use the following rule:

{
  "selector": "a.outbound",
  "name": "link",
  "filters": [
    {
      "args": [
        "href"
      ],
      "type": "attribute"
    }
  ]
}

Filter Type: exclude

Ignores elements matching the selectors supplied in args (and all of their descendants) when they are found within the element matched by the rule's own selector.
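No sample is given for this filter type, so here is a hedged sketch of what an exclude rule might look like, following the same args and type structure as the attribute filter. It is built in Python and printed as JSON. The selectors .entry-content and .related-posts are hypothetical, and make_filter is an illustrative helper, not part of any Diffbot library.

```python
import json

# The three filter types named in this document.
FILTER_TYPES = {"attribute", "exclude", "replace"}


def make_filter(filter_type: str, *args: str) -> dict:
    """Build a filter object with the args/type structure described above."""
    if filter_type not in FILTER_TYPES:
        raise ValueError(f"unknown filter type: {filter_type}")
    return {"args": list(args), "type": filter_type}


# Keep the article body text but skip a hypothetical "related posts" box in it.
rule = {
    "selector": ".entry-content",
    "name": "text",
    "filters": [make_filter("exclude", ".related-posts")],
}

print(json.dumps(rule, indent=2))
```

With this rule, text inside .related-posts (and anything nested in it) would be left out of the text field, while the rest of .entry-content is extracted as usual.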

Filter Type: replace

Use regular expression syntax to keep only specific sections of text from the original extraction output. Supply your regular expression as the first element of args and the replacement, typically a reference to a capture group such as $1, as the second.

For example, this is how you would extract just the numerical price (12.99) from a pricing element (.offerPrice) that extracts as "$12.99" by default.

{
  "selector": ".offerPrice",
  "name": "price",
  "filters": [
    {
      "args": [
        "^\\$(.*)$",
        "$1"
      ],
      "type": "replace"
    }
  ]
}

Back references are also supported. For example, you can prepend text by using the regular expression (^.*$) with the replacement "prefix: $1".

Diffbot uses a Java implementation for its regular expression parsing. Regular-Expressions.info offers an excellent overview of language-specific distinctions.
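The two replace examples above can be approximated locally. The sketch below is written in Python, not Java, so it first converts Java-style $1 group references into Python's \g<1> form; that conversion and the apply_replace helper are illustrative assumptions, and the sketch ignores edge cases such as literal backslashes in the replacement. Note that the backslash before the dollar sign is doubled in the JSON text and decodes to a single backslash.

```python
import json
import re

# The replace filter as JSON text; "\\$" decodes to the regex token "\$".
FILTER_JSON = r'{"type": "replace", "args": ["^\\$(.*)$", "$1"]}'


def apply_replace(text: str, args: list) -> str:
    """Approximate a replace filter: args is [regular expression, replacement]."""
    pattern, replacement = args
    # Java writes group references as $1; Python's re.sub expects \g<1>.
    python_replacement = re.sub(r"\$(\d+)", r"\\g<\1>", replacement)
    return re.sub(pattern, python_replacement, text)


price_filter = json.loads(FILTER_JSON)
print(apply_replace("$12.99", price_filter["args"]))     # 12.99
print(apply_replace("12.99", ["(^.*$)", "prefix: $1"]))  # prefix: 12.99
```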

Extracting Multiple Elements into a List

If a CSS selector matches multiple elements on a page, the text values of all the matched elements will be concatenated into a single output value for the field.

To structure the output into an array instead, we can nest rules within rules. We call this a collection.

Here is an example HTML structure, followed by a collection that extracts it.

<div class="img-thumbnail">
  <img src="img-1.png" />
  <span class="img-caption">Image #1's caption.</span>
</div>
<div class="img-thumbnail">
  <img src="img-2.png" />
  <span class="img-caption">Image #2's caption.</span>
</div>

{
  "selector": ".img-thumbnail",
  "name": "images",
  "rules": [
    {
      "selector": "img",
      "name": "url",
      "filters": [
        {
          "args": [
            "src"
          ],
          "type": "attribute"
        }
      ]
    }
  ]
}

We start by defining the largest parent element enclosing the repeating elements (.img-thumbnail). We then define a nested rules array that extracts the src attribute of every img element inside the repeating parent element.

Notice that each .img-thumbnail element also encloses a caption. We can extract that caption alongside the src of each image by adding another rule at the same nesting level as the src extraction rule.

{
  "selector": ".img-thumbnail",
  "name": "images",
  "rules": [
    {
      "selector": "img",
      "name": "url",
      "filters": [
        {
          "args": [
            "src"
          ],
          "type": "attribute"
        }
      ]
    },
    {
      "selector": "span.img-caption",
      "name": "caption"
    }
  ]
}
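To show how a collection groups values per repeating element, the sketch below emulates the rule above on the sample HTML. It is only an approximation: it uses Python's XML parser and XPath-style lookups instead of a CSS selector engine, the sample is wrapped in a made-up root element, and the exact shape of Diffbot's response is an assumption, not something stated here.

```python
import json
import xml.etree.ElementTree as ET

# The sample markup, wrapped in a root element so it parses as XML.
HTML = """
<root>
  <div class="img-thumbnail">
    <img src="img-1.png" />
    <span class="img-caption">Image #1's caption.</span>
  </div>
  <div class="img-thumbnail">
    <img src="img-2.png" />
    <span class="img-caption">Image #2's caption.</span>
  </div>
</root>
"""


def extract_images(markup: str) -> list:
    """Return one dict per .img-thumbnail element, mirroring the nested rules."""
    root = ET.fromstring(markup.strip())
    images = []
    # Outer rule: one output object per element matching .img-thumbnail.
    for thumbnail in root.findall(".//div[@class='img-thumbnail']"):
        img = thumbnail.find(".//img")
        caption = thumbnail.find(".//span[@class='img-caption']")
        images.append({
            # Nested rule "url": attribute filter on src.
            "url": img.get("src") if img is not None else None,
            # Nested rule "caption": text content of the span.
            "caption": caption.text if caption is not None else None,
        })
    return images


print(json.dumps({"images": extract_images(HTML)}, indent=2))
```

Each thumbnail yields its own object with a url and a caption, which is the point of a collection: values stay paired with the element they came from instead of being concatenated into one string.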