Docs Suite


Custom API Selectors and Filters

The API Toolkit uses advanced CSS selector logic to override the output of default Diffbot fields (in an Automatic API) or to create entirely new fields. When editing your rules, you can use the following selectors and logic to populate your output.

Basic Selectors

| Pattern | Matches | Example |
|---|---|---|
| * | any element | * |
| tagname | elements with the given tag name | div, p |
| namespace|type | elements of type "type" in the namespace "namespace" | fb|name finds <fb:name> elements |
| #id | elements with an ID attribute of "id" | div#container, #header |
| .class | elements with a class name of "class" | div.left, .post-body |
| element[attr] or [attr] | elements with an attribute named "attr" (with any value) | a[href], [title] |
| element[attr=val] or [attr=val] | elements with an attribute named "attr" and value equal to "val" | img[width=500], a[rel=nofollow] |
| [^attrPrefix] | elements with an attribute name starting with "attrPrefix". Use to find elements with HTML5 datasets | [^data-], div[^data-] |
| [attr^=valPrefix] | elements with an attribute named "attr" and value starting with "valPrefix" | a[href^=http:] |
| [attr$=valSuffix] | elements with an attribute named "attr" and value ending with "valSuffix" | img[src$=.png] |
| [attr*=valContaining] | elements with an attribute named "attr" and value containing "valContaining" | a[href*=/search/] |
| [attr~=regex] | elements with an attribute named "attr" and value matching the regular expression | img[src~=(?i)\.(png|jpe?g)] |

The above may be combined in any order, such as div.header[title]
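
To make the attribute operators concrete, here is a minimal Python sketch that models their semantics on a plain dictionary of attributes. It is an illustration only and not Diffbot's implementation: the helper names `attr_matches` and `has_attr_prefix` are hypothetical, and the sketch assumes case-sensitive string comparison and a "match anywhere" regular expression search, either of which the actual engine may handle differently.

```python
import re


def attr_matches(attrs, name, op=None, value=None):
    """Model [name], [name=value], [name^=value], [name$=value],
    [name*=value] and [name~=regex] against a dict of attributes."""
    if name not in attrs:
        return False
    actual = attrs[name]
    if op is None:          # [attr]: attribute present with any value
        return True
    if op == "=":           # exact value
        return actual == value
    if op == "^=":          # value prefix
        return actual.startswith(value)
    if op == "$=":          # value suffix
        return actual.endswith(value)
    if op == "*=":          # value substring
        return value in actual
    if op == "~=":          # regex, assumed to match anywhere in the value
        return re.search(value, actual) is not None
    raise ValueError(f"unknown operator: {op}")


def has_attr_prefix(attrs, prefix):
    """Model [^attrPrefix]: any attribute *name* starting with the prefix."""
    return any(name.startswith(prefix) for name in attrs)


# Sample attribute sets (made up for this example).
img = {"src": "/media/Logo.PNG", "width": "500", "data-id": "7"}
link = {"href": "http://example.com/search/?q=1", "rel": "nofollow"}

print(attr_matches(link, "href", "^=", "http:"))             # a[href^=http:]
print(attr_matches(link, "href", "*=", "/search/"))          # a[href*=/search/]
print(attr_matches(img, "src", "~=", r"(?i)\.(png|jpe?g)"))  # img[src~=...]
print(has_attr_prefix(img, "data-"))                         # [^data-]
```

Note the difference between the last two suffix-style checks: in this sketch `$=.png` would not match `Logo.PNG` because the comparison is case-sensitive, while the regular expression with the `(?i)` flag does.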

Combinators

The following can be used to specify certain elements based on their relation to other elements on the page (parents, children, siblings, etc.).

| Pattern | Matches | Example |
|---|---|---|
| E F | an F element descended from an E element | div a, .logo h1 |
| E > F | an F direct child of E | ol > li |
| E + F | an F element immediately preceded by sibling E | li + li, div.head + div |
| E ~ F | an F element preceded by sibling E | h1 ~ p |
| E, F, G | all matching elements E, F, or G | a[href], div, h3 |
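
The difference between the four relational combinators is easiest to see on a small tree. The following Python sketch re-implements them by tag name only, using the standard library's `xml.etree.ElementTree`. The sample markup and the function names are invented for this illustration, and the code is a simplified model of the matching rules rather than the selector engine itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample markup.
doc = ET.fromstring(
    "<div>"
    "<h1>Title</h1>"
    "<p>intro</p>"
    "<ul><li>one</li><li>two</li></ul>"
    "<p>outro</p>"
    "</div>"
)


def descendants(root, e_tag, f_tag):
    """E F: every F nested at any depth inside an E."""
    found = []
    for e in root.iter(e_tag):
        for f in e.iter(f_tag):
            if f is not e and f not in found:
                found.append(f)
    return found


def children(root, e_tag, f_tag):
    """E > F: F elements whose parent is an E."""
    return [f for e in root.iter(e_tag) for f in e if f.tag == f_tag]


def adjacent_siblings(root, e_tag, f_tag):
    """E + F: F elements immediately preceded by a sibling E."""
    found = []
    for parent in root.iter():
        kids = list(parent)
        for prev, cur in zip(kids, kids[1:]):
            if prev.tag == e_tag and cur.tag == f_tag:
                found.append(cur)
    return found


def general_siblings(root, e_tag, f_tag):
    """E ~ F: F elements preceded, at any distance, by a sibling E."""
    found = []
    for parent in root.iter():
        seen_e = False
        for kid in parent:
            if kid.tag == f_tag and seen_e:
                found.append(kid)
            if kid.tag == e_tag:
                seen_e = True
    return found


def texts(elements):
    return [el.text for el in elements]


print(texts(descendants(doc, "div", "li")))       # div li   -> ['one', 'two']
print(texts(children(doc, "div", "li")))          # div > li -> []
print(texts(adjacent_siblings(doc, "h1", "p")))   # h1 + p   -> ['intro']
print(texts(general_siblings(doc, "h1", "p")))    # h1 ~ p   -> ['intro', 'outro']
```

In this tree the list items are descendants of the div but not its direct children, which is why `div li` matches them and `div > li` does not.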

Pseudo Selectors

The following advanced selectors are also available.

| Pattern | Matches | Example |
|---|---|---|
| :first-child | elements that are the first child of some other element | div > p:first-child finds each p that is the first child element of a div |
| :last-child | elements that are the last child of some other element | ul > li:last-child finds the last list item in each unordered list |
| :only-child | elements that are the only child of a parent element | p:only-child finds paragraphs without sibling elements |
| :first-of-type | elements that are the first sibling of their type in the list of children of their parent element | div > p:first-of-type finds the first p element of each div |
| :last-of-type | elements that are the last sibling of their type in the list of children of their parent element | div > span:last-of-type finds the last span element within div elements |
| :only-of-type | elements that have a parent element and whose parent element has no other element children with the same expanded element name | p:only-of-type finds paragraphs without sibling p elements |
| :empty | elements that have no children at all | p:empty finds paragraphs without children |
| :nth-child(an+b) | elements that have an+b-1 siblings before them in the document tree, for any positive integer or zero value of n, and have a parent element. Can also take 'odd' and 'even' as arguments. | tr:nth-child(2n+1) finds every odd row of a table |
| :nth-last-child(an+b) | elements that have an+b-1 siblings after them in the document tree | tr:nth-last-child(-n+2) finds the last two rows of a table |
| :nth-of-type(an+b) | elements that have an+b-1 siblings with the same expanded element name before them in the document tree, for any zero or positive integer value of n, and have a parent element | img:nth-of-type(2n+1) |
| :nth-last-of-type(an+b) | elements that have an+b-1 siblings with the same expanded element name after them in the document tree, for any zero or positive integer value of n, and have a parent element | img:nth-last-of-type(2n+1) |
| :lt(n) | elements whose sibling index (counted from zero) is less than n | td:lt(3) finds the first 3 cells of each row |
| :gt(n) | elements whose sibling index (counted from zero) is greater than n | td:gt(1) finds cells after skipping the first two |
| :eq(n) | elements whose sibling index (counted from zero) is equal to n | td:eq(0) finds the first cell of each row |
| :has(selector) | elements that contain at least one element matching the selector | div:has(p) finds divs that contain p elements |
| :not(selector) | elements that do not match the selector | div:not(.logo) finds all divs that do not have the "logo" class. div:not(:has(div)) finds divs that do not contain divs. |
| :contains(text) | elements that contain the specified text. The search is case insensitive. The text may appear in the found element, or any of its descendants. | p:contains(jsoup) finds p elements containing the text "jsoup". |
| :matches(regex) | elements whose text matches the specified regular expression. The text may appear in the found element, or any of its descendants. | td:matches(\\d+) finds table cells containing digits. div:matches((?i)login) finds divs containing the text "login", case insensitively. |
| :containsOwn(text) | elements that directly contain the specified text. The search is case insensitive. The text must appear in the found element, not any of its descendants. | p:containsOwn(jsoup) finds p elements with own text "jsoup". |
| :matchesOwn(regex) | elements whose own text matches the specified regular expression. The text must appear in the found element, not any of its descendants. | td:matchesOwn(\\d+) finds table cells directly containing digits. div:matchesOwn((?i)login) finds divs whose own text contains "login", case insensitively. |

The above may be combined in any order and with other selectors, such as .light:contains(name):eq(0)
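
The an+b arithmetic and the index-based selectors can be checked with a few lines of Python. This is a sketch of the counting rules only, not of a selector engine: the function names are hypothetical, positions for the nth-* family are 1-based as in CSS, and the sibling index used by :lt, :gt and :eq is assumed to start at zero, which is what the td:eq(0) example implies.

```python
def nth_child_positions(a, b, sibling_count):
    """1-based positions p in 1..sibling_count with p == a*n + b
    for some integer n >= 0, as in :nth-child(an+b)."""
    positions = []
    for p in range(1, sibling_count + 1):
        if a == 0:
            if p == b:
                positions.append(p)
        else:
            n, remainder = divmod(p - b, a)
            if remainder == 0 and n >= 0:
                positions.append(p)
    return positions


def nth_last_child_positions(a, b, sibling_count):
    """Same rule counted from the end, as in :nth-last-child(an+b).
    Returned positions are 1-based from the start."""
    from_end = nth_child_positions(a, b, sibling_count)
    return sorted(sibling_count - p + 1 for p in from_end)


def lt(items, n):
    """:lt(n) keeps items whose zero-based index is below n."""
    return [x for i, x in enumerate(items) if i < n]


def gt(items, n):
    """:gt(n) keeps items whose zero-based index is above n."""
    return [x for i, x in enumerate(items) if i > n]


def eq(items, n):
    """:eq(n) keeps the item whose zero-based index equals n."""
    return [x for i, x in enumerate(items) if i == n]


# tr:nth-child(2n+1) on a six-row table: rows 1, 3 and 5.
print(nth_child_positions(2, 1, 6))        # [1, 3, 5]
# tr:nth-last-child(-n+2) on a five-row table: the last two rows.
print(nth_last_child_positions(-1, 2, 5))  # [4, 5]

cells = ["a", "b", "c", "d"]
print(lt(cells, 3))  # ['a', 'b', 'c']
print(gt(cells, 1))  # ['c', 'd']
print(eq(cells, 0))  # ['a']
```

With a negative coefficient such as -n+2, n can only take the values 0 and 1 before the result drops below 1, which is why the expression selects exactly two positions.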

Output Filters

When creating a new rule, the following filters can be applied to the default selector output to further refine returned data.

| Filter | Description |
|---|---|
| Attribute | Retrieves the specified attribute value of an element. For example, to extract the link (http://www.blog.diffbot.com) from the anchor tag <a href="http://www.blog.diffbot.com" class="outbound">, enter href as your attribute filter. Only one attribute filter can be used per rule. |
| Ignore | Ignores the specified selectors (and all of their descendants) if they are found within the primary CSS selector. You may use any of the selector formats described on this page. |
| Replace | Lets you specify match and replace regular expressions to alter the output returned by the Diffbot API. To remove matching content, leave the "replace with" field blank. Backreferences are also supported. For example, to prepend text, use the match expression (^.*$) and the replacement "prefix: $1". |

Diffbot uses a Java implementation for its regular expression parsing. Regular-Expressions.info offers an excellent overview of language-specific distinctions. For more details on Diffbot's regular expression implementation, please see this Support article.
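
As a rough illustration of the Replace filter's match-and-replace behavior, the following Python sketch applies the prepend example and a removal with a blank replacement. Python's `re` module writes backreferences as `\g<1>` where Java writes `$1`, so the hypothetical helper `java_style_replace` translates single-digit `$n` references before substituting. Because Diffbot's engine is Java-based, treat this as an approximation under that assumption; edge cases may differ between the two regex flavors.

```python
import re


def java_style_replace(pattern, replacement, text):
    """Apply a match/replace pair written with Java-style $1..$9
    backreferences, using Python's re module (an approximation)."""
    # Turn "$1" into Python's "\g<1>" group reference.
    py_replacement = re.sub(r"\$(\d)", r"\\g<\1>", replacement)
    return re.sub(pattern, py_replacement, text)


# Prepend text: match the whole value and re-emit it after a prefix.
print(java_style_replace(r"(^.*$)", "prefix: $1", "Hello world"))
# prints: prefix: Hello world

# Remove matching content: a blank replacement deletes every match.
print(java_style_replace(r"\s*\(sponsored\)", "", "Great deal (sponsored)"))
# prints: Great deal
```

The sample strings and the `(sponsored)` pattern are invented for this example.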
Last updated by Dan Urman
Copyright © 2021 Diffbot.com