Filtering Fields

Filter the results of DQL or Bulk Enhance API requests to just the fields you need.

You can specify the filter parameter with the DQL Search and Bulk Enhance APIs to return a subset of fields in the JSON response.

Note that this is not the same as DQL filters, which allow you to search and filter results in the Knowledge Graph. Instead, the filter parameter constrains which JSON fields are returned for each entity record in a DQL or Bulk Enhance API response.

Basic Mode

The easiest way to filter the entity JSON is to provide a space-delimited list of the fields you want in the filter parameter, for example &filter=name description. The response looks like this:

{
  "name": "Diffbot",
  "description": "Diffbot is a developer of machine learning and computer vision algorithms and public APIs for extracting data from web pages / web scraping to create a knowledge base."
}
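
The effect of basic mode can be illustrated locally. The sketch below is not Diffbot's implementation: basic_filter is a hypothetical helper, and it assumes basic mode simply keeps the named top-level fields of each entity. The record reuses a few values from the IBM sample shown later on this page.

```python
def basic_filter(entity: dict, fields: str) -> dict:
    """Keep only the top-level keys named in a space-delimited field list.

    Hypothetical helper for illustration; the real filtering happens server-side.
    """
    wanted = fields.split()
    return {key: entity[key] for key in wanted if key in entity}


entity = {
    "name": "IBM",
    "homepageUri": "ibm.com",
    "nbEmployees": 345000,
}

filtered = basic_filter(entity, "name homepageUri")
print(filtered)
```

Fields that are not present on the entity are skipped rather than raising an error, which is an assumption of this sketch, not documented API behavior.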

Advanced Mode using JSONPath

For more advanced cases, such as specifying how many of a list of industries to return, or the ideal employment record, you may use JsonPath. We've implemented a variant of the original JsonPath specification for our use case, though most of the language from the original spec will be followed in the guide below.

Basic Structure of Path

JsonPath expressions always refer to a JSON structure in the same way that XPath expressions are used in combination with an XML document. The "root member object" in JsonPath is always referred to as $, regardless of whether it is an object or an array.

JSONPath expressions can use:

  1. Dot-notation, when the path segment matches the [a-zA-Z0-9_]* pattern
    • Example: $.location.country.name gets only the country name from the primary location.
  2. Bracket-notation
    • Example: $['locations']['country']['name'] gets only the country name from all locations.
  3. The wildcard operator, to match a single node
    • Example: $.locations.*.name
    • Example: $['locations'][*]['name']
  4. The recursive-descent operator, to match any number of intervening nodes (from E4X)
    • Example: $.locations..name
    • Example: $['locations']..['name']
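
For example, the following request passes the URL-encoded path $.location.country.name as the filter:

TOKEN=YOURDIFFBOTTOKEN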
curl --request GET \
     --url "https://kg.diffbot.com/kg/v3/dql?token=${TOKEN}&size=5&query=type%3AOrganization&filter=%24.location.country.name" \
     --header 'Accept: application/json'
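
JsonPath expressions contain characters such as $ that must be percent-encoded in a URL. The following is a minimal sketch of building the same request URL in Python with the standard library. The endpoint and parameter names are taken from the curl example above; build_dql_url is a hypothetical helper, and no request is actually sent.

```python
from urllib.parse import quote, urlencode

# Endpoint copied from the curl example above.
DQL_ENDPOINT = "https://kg.diffbot.com/kg/v3/dql"


def build_dql_url(token: str, query: str, json_path: str, size: int = 5) -> str:
    """Build a DQL request URL with a percent-encoded filter (hypothetical helper)."""
    params = {"token": token, "size": size, "query": query, "filter": json_path}
    # quote_via=quote encodes spaces as %20 instead of '+'.
    return f"{DQL_ENDPOINT}?{urlencode(params, quote_via=quote)}"


url = build_dql_url("YOURDIFFBOTTOKEN", "type:Organization", "$.location.country.name")
print(url)
```

The resulting URL can be passed to any HTTP client together with the Accept: application/json header.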

Operators

Operator                   Description
$                          The root element to query. This starts all path expressions.
@                          The current node being processed by a filter predicate.
*                          Wildcard. Available anywhere a name or number is required.
..                         Deep scan. Available anywhere a name is required.
.<name>                    Dot-notated child
['<name>' (, '<name>')]    Bracket-notated child or children
[<number> (, <number>)]    Array index or indexes
[start:end]                Array slice operator (from ECMA 2022 Language Specification)
[?(<expression>)]          Filter expression. Expression must evaluate to a boolean value.

Filter Operators

Filters are logical expressions used to filter arrays.

  • A typical filter would be [?(@.year > 2018)] where @ represents the current item being processed.
  • More complex filters can be created with the logical operators && and ||. String literals must be enclosed in single or double quotes ([?(@.name == 'Spain')] or [?(@.name == "France")]).
  • You can use ! to negate a predicate: [?(!(@.year < 2018 || @.year > 2020))] keeps only items whose year is between 2018 and 2020, inclusive.

Operator    Description
==          left is equal to right (note that 1 is not equal to '1')
!=          left is not equal to right
<           left is less than right
<=          left is less than or equal to right
>           left is greater than right
>=          left is greater than or equal to right
=~          left matches regular expression [?(@.name =~ /foo.*?/i)]
in          left exists in right [?(@.name in ['S', 'M'])]
nin         left does not exist in right
subsetof    left is a subset of right [?(@.sizes subsetof ['S', 'M', 'L'])]
anyof       left has an intersection with right [?(@.sizes anyof ['M', 'L'])]
noneof      left has no intersection with right [?(@.sizes noneof ['M', 'L'])]
size        size of left (array or string) should match right
empty       left (array or string) should be empty
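
The membership and set operators are easiest to read next to their plain-Python equivalents. The sketch below only mirrors the descriptions in the table above; the op_* function names are hypothetical and this is not how the API evaluates filters internally.

```python
def op_in(left, right) -> bool:
    """in: left exists in right."""
    return left in right


def op_nin(left, right) -> bool:
    """nin: left does not exist in right."""
    return left not in right


def op_subsetof(left, right) -> bool:
    """subsetof: every element of left is also in right."""
    return set(left) <= set(right)


def op_anyof(left, right) -> bool:
    """anyof: left and right share at least one element."""
    return bool(set(left) & set(right))


def op_noneof(left, right) -> bool:
    """noneof: left and right share no elements."""
    return not (set(left) & set(right))


sizes = ["S", "M"]  # illustrative value for the sizes field used in the table examples

print(op_subsetof(sizes, ["S", "M", "L"]))
print(op_anyof(sizes, ["M", "L"]))
print(op_noneof(sizes, ["M", "L"]))
print(1 == "1")  # like the == operator above, Python does not equate 1 and '1'
```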

Path Examples

The examples will refer to this partial Diffbot Organization entity sample:

{
  "type": "Corporation",
  "name": "IBM",
  "homepageUri": "ibm.com",
  "nbEmployees": 345000,
  "yearlyRevenues": [
    {
      "revenue": {
        "value": 7.362E+10
      },
      "isCurrent": false,
      "year": 2020
    },
    {
      "revenue": {
        "value": 7.9591E+10
      },
      "isCurrent": false,
      "year": 2018
    }
  ],
  "capitalization": {
    "currency": "USD",
    "value": 1.12935797E+11
  },
  "categories": [
    {
      "name": "Computer Hardware Companies"
    },
    {
      "name": "Cloud Computing Companies"
    },
    {
      "name": "Software Consulting Firms"
    }
  ],
  "locations": [
    {
      "country": {
        "summary": "Sovereign state in North America",
        "name": "United States of America"
      },
      "isCurrent": true,
      "address": "1 New Orchard Road, Armonk, 10504-1722, New York, United States"
    },
    {
      "country": {
        "summary": "Sovereign state in Southern Africa",
        "name": "South Africa"
      },
      "isCurrent": false,
      "address": "90 Grayston Dr, Sandton, Gauteng Province, South Africa"
    }
  ]
}

JsonPath                                                                                Result
$.name                                                                                  The name of the entity
$.locations[?(@.country.name=='United States of America')]                              All locations in US
$.locations[?(@.country.name=='United States of America')].['address', 'isCurrent']     address and isCurrent for all locations in US
$.locations[*].address                                                                  The address of all locations
$.locations[0]                                                                          The first location
$.locations[-2]                                                                         The second to last location
$.locations[0,1]                                                                        The first two locations
$.locations[:2]                                                                         All locations from index 0 (inclusive) until index 2 (exclusive)
$.locations[1:2]                                                                        All locations from index 1 (inclusive) until index 2 (exclusive)
$.locations[-2:]                                                                        Last two locations
$.locations[?(@.isCurrent)]                                                             All locations which are current
$.yearlyRevenues[?(@.year > 2018)]                                                      yearlyRevenues for years > 2018
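
As described in the table, the index and slice forms behave like Python list indexing (negative indexes count from the end, slice ends are exclusive), so the selections can be previewed locally. This is a rough sketch on a trimmed copy of the sample entity. It shows only which elements each path selects; it does not reproduce the structure-preserving response format of the API.

```python
# Trimmed copy of the sample entity above (only the fields used here).
locations = [
    {"country": {"name": "United States of America"}, "isCurrent": True,
     "address": "1 New Orchard Road, Armonk, 10504-1722, New York, United States"},
    {"country": {"name": "South Africa"}, "isCurrent": False,
     "address": "90 Grayston Dr, Sandton, Gauteng Province, South Africa"},
]
yearly_revenues = [
    {"year": 2020, "isCurrent": False},
    {"year": 2018, "isCurrent": False},
]

first = locations[0]             # $.locations[0]
second_to_last = locations[-2]   # $.locations[-2]
first_two = locations[:2]        # $.locations[:2]
index_one_only = locations[1:2]  # $.locations[1:2]
last_two = locations[-2:]        # $.locations[-2:]

# $.locations[?(@.isCurrent)]
current = [loc for loc in locations if loc["isCurrent"]]
# $.yearlyRevenues[?(@.year > 2018)]
recent = [rev for rev in yearly_revenues if rev["year"] > 2018]
# $.locations[*].address
addresses = [loc["address"] for loc in locations]

print([loc["country"]["name"] for loc in current])  # ['United States of America']
print([rev["year"] for rev in recent])              # [2020]
```

Because the sample has only two locations, $.locations[-2] and $.locations[0] select the same element here.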

Specifying multiple paths

Separate multiple paths with a semicolon (;).

Example: $.name;$.homepageUri;$.yearlyRevenues[?(@.year > 2018)] specifies 3 elements to be returned (separated by ;):

  • $.name: entity name
  • $.homepageUri: entity homepageUri
  • $.yearlyRevenues[?(@.year > 2018)]: yearlyRevenues for years > 2018
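
A multi-path filter value can be assembled by joining the individual paths with ; and percent-encoding the result before placing it in the query string. A minimal sketch using only the Python standard library follows; the variable names are illustrative.

```python
from urllib.parse import quote, unquote

paths = [
    "$.name",
    "$.homepageUri",
    "$.yearlyRevenues[?(@.year > 2018)]",
]

# Join the paths with ';' as described above, then percent-encode every
# reserved character (safe="" also encodes ';', '$', '[', '?', and spaces).
filter_value = ";".join(paths)
encoded_filter = quote(filter_value, safe="")

print(filter_value)
print(encoded_filter)
```

Decoding encoded_filter with unquote returns the original ;-separated string, which is a quick way to check the value before sending a request.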

mostRelevant() Function

The mostRelevant() function selects the element that is most relevant to the DQL or Enhance query.

For example, locations[mostRelevant()].['country', 'address'] selects the country and address fields of the location that best matches a DQL query such as:

type:Organization locations.country.name:"United States of America"

or an Enhance query such as:

{
  "type": "Organization",
  "name": "Diffbot",
  "location": "United States of America"
}

With the example mostRelevant() filter, either of the above queries would return this response:

{  
  "locations": [
    {
      "country": {
        "summary": "Sovereign state in North America",
        "name": "United States of America"
      },
      "address": "1 New Orchard Road, Armonk, 10504-1722, New York, United States"
    }
  ]
}

Why a variant of JsonPath?

  • JsonPath does not support applying multiple paths to a single JSON document, which is reflected in its recommended output structure. With this variant, multiple elements can be selected from a single JSON document by specifying multiple paths.
  • The output format specified by JsonPath is simplistic and loses the original structure of the document.

Compatibility Notes

  1. JsonPath supports [start:end:step] from ECMAScript 4, but we do not support step, in order to stay compatible with ECMA 2022. The Jayway Java implementation does not appear to support it either.
  2. JsonPath uses .length to refer to the length of an array. This is ambiguous when an element is actually named length, so we don't support it.
  3. We don't support any of the aggregation functions provided by the Jayway Java implementation, as the goal of this implementation is to filter JSON.
  4. We don't support referencing an absolute path inside an expression, such as $..book[?(@.price <= $['expensive'])].
  5. The original spec supports unions but not multiple paths. It's easier to use multiple paths when the filtered nodes are disjoint.

References for JsonPath