Tutorial: How to Search a Crawl/Bulk job using DQL

Query your crawl or bulk job collections for data with DQL. (⏲️ 10 Minutes)

DQL allows you to search the extracted content of your crawl or bulk jobs (also known as collections) and query for a subset of this data.

What is DQL?

DQL is short for Diffbot Query Language. It is a structured query language custom-built by Diffbot to query data from graph-structured databases like the Diffbot Knowledge Graph or crawl jobs (which are, in effect, small Knowledge Graphs). The syntax is designed to be minimal and resembles JSON.

The links below are additional helpful references for learning DQL.

Using DQL to search over crawl collections does not consume any credits.

Getting Started

In order to search a collection, you must first create a Crawl or Bulk Job. A collection can be searched before a crawl or bulk job is finished.

To search a crawled collection with DQL, specify type=crawl and list one or more collections in the col parameter. The parameter col=all searches all your custom crawl collections. You can then query the collection using DQL.

An example API request looks like this:

Python:

import requests

url = "https://kg.diffbot.com/kg/v3/dql?token=YOUR_DIFFBOT_TOKEN"

params = {
  "type": "crawl",
  "query": "title:'Riesling'",
  "col": "winemore,bevmo",
  "size": "-1",
}

headers = {
  'Content-Type': 'application/json',
  'Accept': 'application/json'
}

response = requests.get(url, params=params, headers=headers)

print(response.text)
JavaScript:

const headers = new Headers();
headers.append("Content-Type", "application/json");
headers.append("Accept", "application/json");

const params = {
  "type": "crawl",
  "query": "title:'Riesling'",
  "col": "winemore,bevmo",
  "size": "-1",
}
const queryString = new URLSearchParams(params).toString();

const requestOptions = {
  method: "GET",
  headers: headers
};

fetch(`https://kg.diffbot.com/kg/v3/dql?token=YOUR_DIFFBOT_TOKEN&${queryString}`, requestOptions)
  .then((response) => response.text())
  .then((result) => console.log(result))
  .catch((error) => console.error(error));
cURL:

curl --request GET \
     --url 'https://kg.diffbot.com/kg/v3/dql?token=YOUR_DIFFBOT_TOKEN&type=crawl&query=title%3A%27Riesling%27&col=winemore%2Cbevmo&from=0&size=-1' \
     --header 'accept: application/json'

Let's break down the parameters:

| Parameter | Value | Description |
|---|---|---|
| token | YOUR_DIFFBOT_TOKEN | The Diffbot token that you used to create the custom crawl |
| type | crawl | Specify type=crawl when searching a crawl collection with DQL |
| col | winemore,bevmo | A comma-delimited list of collections to search. The parameter col=all searches all your custom crawl collections. |
| query | title:'Riesling' | A DQL query that looks for records that contain "Riesling" in the title property. See Search (DQL) to learn how to write DQL queries. |
| size | -1 | -1 will return all records that match the query. DQL defaults to 50. |

These are the minimum required parameters to query crawl or bulk jobs.

Free-text search

Although DQL syntax is generally preferred for querying collections, free-text search is also supported. Here are some examples of how free-text search works:

| query= | Returns... |
|---|---|
| computer vision | All objects containing "computer" and "vision" anywhere in all Diffbot-extracted fields. |
| "web page analysis" | All objects containing the phrase "web page analysis" anywhere in all Diffbot-extracted fields. |

Helpful Field Guide

Fields in crawl and bulk jobs are queried similarly to fields in the Diffbot Knowledge Graph. Correspondingly, certain specialized fields will also include specialized querying capability. Here are some of those fields.

🚧

Field names for custom collections

Because custom crawl collections are user defined, there is no ontology against which DQL can validate field names. This means that DQL may return empty results (as opposed to field validation errors) if the wrong field name, or a field name with a typo, is used in the DQL query.

site

site is a special field that searches for webpages crawled from a given website. It searches directly on the index rather than on a field.

site:'nature.com'

title:'Kotlin' facet:site
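As a sketch, a site query plugs into the same request parameters shown earlier (col=all and the query string here are illustrative; no request is actually sent):

```python
from urllib.parse import urlencode

# Facet Kotlin-titled pages by the site they were crawled from.
params = {
    "type": "crawl",
    "col": "all",  # search every custom crawl collection
    "query": "title:'Kotlin' facet:site",
    "size": "-1",
}

query_string = urlencode(params)
print("https://kg.diffbot.com/kg/v3/dql?token=YOUR_DIFFBOT_TOKEN&" + query_string)
```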

date

date fields (and their sibling date.str) have special handling for easier querying. Dates can be queried in epoch time (the number of seconds or milliseconds since 00:00:00 UTC on January 1, 1970) or as date literals. The usual relational operators >, >=, <, <=, and : (equal to) are supported in either case.

min:date:1502734806 // Equal to or later than Monday, August 14, 2017, 6:20:06 PM GMT

date>'2017-08-14' // Later than Monday, August 14, 2017

date.str>='08-14-2017' // Equal to or later than Monday, August 14, 2017

min:date.str:'2017-08-14' // Equal to or later than Monday, August 14, 2017
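The epoch value in the first example above can be reproduced with Python's standard library (a sketch, assuming epoch seconds rather than milliseconds):

```python
from datetime import datetime, timezone

# Convert a calendar date/time to epoch seconds for use in a DQL date query.
dt = datetime(2017, 8, 14, 18, 20, 6, tzinfo=timezone.utc)
epoch_seconds = int(dt.timestamp())
print(epoch_seconds)                 # 1502734806
print(f"min:date:{epoch_seconds}")   # min:date:1502734806
```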

date fields also support special time period literals. A time period literal consists of a numeric value and a unit specifier. For example, the time period literal "4h" represents 4 hours.

type:Article date>=4h // Articles published at least 4 hours ago

type:Article lastCrawlTime<=4h // Articles that were last crawled within the last 4 hours

For more details on querying for dates and time period literals, see Dates and Timestamps.

facet

facet is actually a function, not a field. Faceting allows you to summarize large datasets directly while querying for that dataset. For more details, see Facet Queries.

You can facet on custom crawl collections just like in other DQL queries.

title:'iPhone' facet:regularPriceDetails.amount

title:'Earthquakes' facet[-0.5:0,0:0.5]:posts.sentiment

title:'COVID' facet[day]:date

Because custom crawls do not have an ontology for DQL to use, facet fields are interpreted as strings by default. You can instead specify a numeric (integer or decimal) type for facet queries. Examples:

facet[int]:date

facet[float]:price
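As a final sketch, a typed facet query is sent like any other DQL query. Combining facet[float] with the regularPriceDetails.amount field from the earlier example is an illustrative assumption, and no request is actually made here:

```python
from urllib.parse import urlencode

# Facet iPhone-titled records by price, treating the field as a decimal
# (the field name is hypothetical, carried over from the facet examples above).
params = {
    "type": "crawl",
    "col": "winemore,bevmo",
    "query": "title:'iPhone' facet[float]:regularPriceDetails.amount",
    "size": "-1",
}

query_string = urlencode(params)
print("https://kg.diffbot.com/kg/v3/dql?token=YOUR_DIFFBOT_TOKEN&" + query_string)
```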