Tutorial: How to Search a Crawl/Bulk job using DQL
Query your crawl or bulk job collections for data with DQL. (⏲️ 10 Minutes)
DQL allows you to search the extracted content of your crawl or bulk jobs (also known as collections) and query for a subset of this data.
What is DQL?
DQL is short for Diffbot Query Language. It is a structured query language custom-built by Diffbot to query data from graph-structured databases such as the Diffbot Knowledge Graph or crawl jobs (which behave like small Knowledge Graphs). The syntax is designed to be minimal and resembles JSON.
The links below are additional helpful references for learning DQL.
Using DQL to search over crawl collections does not consume any credits.
Getting Started
In order to search a collection, you must first create a Crawl or Bulk Job. A collection can be searched before a crawl or bulk job is finished.
To search a crawled collection with DQL, specify `type=crawl` and one or more collections in the `col` parameter. The parameter `col=all` searches all your custom crawl collections. You can then query the collection using DQL.
An example API request looks like this:
```python
import requests

url = "https://kg.diffbot.com/kg/v3/dql?token=YOUR_DIFFBOT_TOKEN"

params = {
    "type": "crawl",
    "query": "title:'Riesling'",
    "col": "winemore,bevmo",
    "size": "-1",
}

headers = {
    "Content-Type": "application/json",
    "Accept": "application/json",
}

response = requests.get(url, params=params, headers=headers)
print(response.text)
```
```javascript
const headers = new Headers();
headers.append("Content-Type", "application/json");
headers.append("Accept", "application/json");

const params = {
  "type": "crawl",
  "query": "title:'Riesling'",
  "col": "winemore,bevmo",
  "size": "-1",
};

const queryString = new URLSearchParams(params).toString();

const requestOptions = {
  method: "GET",
  headers: headers,
};

fetch(`https://kg.diffbot.com/kg/v3/dql?token=YOUR_DIFFBOT_TOKEN&${queryString}`, requestOptions)
  .then((response) => response.text())
  .then((result) => console.log(result))
  .catch((error) => console.error(error));
```
```shell
curl --request GET \
  --url 'https://kg.diffbot.com/kg/v3/dql?token=YOUR_DIFFBOT_TOKEN&type=crawl&query=title%3A%27Riesling%27&col=winemore%2Cbevmo&from=0&size=-1' \
  --header 'accept: application/json'
```
Let's break down the parameters:

| Parameter | Value | Description |
|---|---|---|
| token | YOUR_DIFFBOT_TOKEN | The Diffbot token that you used to create the custom crawl. |
| type | crawl | Specify type=crawl when searching a crawl collection with DQL. |
| col | winemore,bevmo | A comma-delimited list of collections to search. The parameter col=all searches all your custom crawl collections. |
| query | title:'Riesling' | A DQL query that looks for records containing "Riesling" in the title property. See Search (DQL) to learn how to write DQL queries. |
| size | -1 | -1 returns all records that match the query. DQL defaults to 50. |
These are the minimum required parameters to query crawl or bulk jobs.
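The minimum required parameters above can be bundled into a small helper. This is a sketch for illustration only; `build_crawl_search_url` is a made-up name, and the endpoint and parameters are taken from the example request above:

```python
from urllib.parse import urlencode

DQL_ENDPOINT = "https://kg.diffbot.com/kg/v3/dql"

def build_crawl_search_url(token, query, col="all", size=-1):
    """Assemble a DQL crawl-search URL from the minimum required parameters."""
    params = {
        "token": token,
        "type": "crawl",   # required when searching crawl collections
        "col": col,        # comma-delimited collection names, or "all"
        "query": query,
        "size": size,      # -1 returns every matching record
    }
    return f"{DQL_ENDPOINT}?{urlencode(params)}"

url = build_crawl_search_url("YOUR_DIFFBOT_TOKEN", "title:'Riesling'", col="winemore,bevmo")
print(url)
```

`urlencode` takes care of percent-encoding the quotes and commas, matching the encoded URL shown in the curl example.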
Free-text search
Although DQL syntax is generally preferred for querying collections, free-text search is also supported. Here are some examples of how free-text search works:
| query= | Returns... |
|---|---|
| computer vision | All objects containing "computer" and "vision" anywhere in all Diffbot-extracted fields. |
| "web page analysis" | All objects containing the phrase "web page analysis" anywhere in all Diffbot-extracted fields. |
Helpful Field Guide
Fields in crawl and bulk jobs are queried much like fields in the Diffbot Knowledge Graph, and certain specialized fields likewise come with specialized querying capabilities. Here are some of those fields.
Field names for custom collections
Because custom crawl collections are user-defined, there is no ontology against which DQL can validate field names. This means that DQL may return empty results (rather than field validation errors) if the wrong field name, or a field name with a typo, is used in the DQL query.
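Since a misspelled field name fails silently with an empty result set, it can help to check for that case explicitly. A minimal sketch, assuming the response JSON carries matching records under a `data` key (that key name is an assumption; verify it against your actual responses):

```python
def check_results(response_json, query):
    """Warn when a DQL query over a custom collection returns nothing,
    which may indicate a misspelled field name rather than no matches."""
    records = response_json.get("data", [])  # "data" key is an assumption
    if not records:
        print(f"No results for {query!r} -- double-check field names; "
              "custom collections have no ontology to validate against.")
    return records

# Example with a stubbed response (note the deliberate typo in the field name):
empty = check_results({"data": []}, "titel:'Riesling'")
```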
site

`site` is a special field which searches for webpages crawled from a website. It searches directly on the index and not the field.
```
site:'nature.com'
title:'Kotlin' facet:site
```
date

`date` fields (and the sibling `date.str`) have special handling for easier querying. Dates can be queried in epoch time (the number of seconds or milliseconds since 00:00:00 UTC on January 1, 1970) or as date literals. The usual relational operators `>`, `>=`, `<`, `<=`, and `:` (equal to) are supported in either case.
```
min:date:1502734806       // Equal to or later than Monday, August 14, 2017, 6:20:06 PM GMT
date>'2017-08-14'         // Later than August 14, 2017
date.str>='08-14-2017'    // Equal to or later than August 14, 2017
min:date.str:'2017-08-14' // Equal to or later than August 14, 2017
```
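Epoch timestamps like the one above are easy to produce from a calendar date. A quick Python sketch converting a UTC datetime into the epoch seconds used in a `min:date:` clause:

```python
from datetime import datetime, timezone

# The GMT timestamp from the example above, expressed as epoch seconds
dt = datetime(2017, 8, 14, 18, 20, 6, tzinfo=timezone.utc)
epoch = int(dt.timestamp())

print(epoch)                # 1502734806
print(f"min:date:{epoch}")  # min:date:1502734806
```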
`date` fields also support special time period literals. A time period literal consists of a numeric value and a unit specifier; for example, the time period literal "4h" represents 4 hours.

```
type:Article date>=4h          // Articles published at least 4 hours ago
type:Article lastCrawlTime<=4h // Articles that were last crawled within the last 4 hours
```

For more details on querying for dates and time period literals, see Dates and Timestamps.
facet

`facet` is actually a function, not a field. Faceting allows you to summarize large datasets directly while querying for them. For more details, see Facet Queries.

You can facet on custom crawl collections just like in other DQL queries.
```
title:'iPhone' facet:regularPriceDetails.amount
title:'Earthquakes' facet[-0.5:0,0:0.5]:posts.sentiment
title:'COVID' facet[day]:date
```
Because custom crawls do not have an ontology for DQL to use, facet fields are interpreted as strings by default. You can specify a numeric (integer or decimal) type for facet queries. Examples:
```
facet[int]:date
facet[float]:price
```
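Since custom-collection facet fields default to string, it can be convenient to build the type-annotated clause programmatically. A small sketch (the helper is hypothetical, built from the syntax shown above):

```python
def facet_clause(field, facet_type=None):
    """Build a facet clause, optionally annotated with a numeric type.

    Custom crawl collections treat facet fields as strings unless a
    type such as "int" or "float" is given in square brackets.
    """
    if facet_type:
        return f"facet[{facet_type}]:{field}"
    return f"facet:{field}"

print(facet_clause("date", "int"))     # facet[int]:date
print(facet_clause("price", "float"))  # facet[float]:price
print(facet_clause("site"))            # facet:site
```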