Search a Crawl/Bulk job using DQL

DQL allows you to search the extracted content of your Diffbot collections. A collection is a discrete Crawl or Bulk job output, and includes all of the web pages processed within that job.

In order to search a collection, you must first create that collection using either Crawl or the Bulk API. A collection can be searched before a crawl or bulk job is finished.

To search crawled collections, specify type=crawl and list one or more collections in the col parameter. The parameter col=all searches all of your custom crawl collections. You can then query the collections using DQL.

An example API request looks like this:

https://kg.diffbot.com/kg/v3/dql?token=<DIFFBOT-TOKEN>&type=crawl&col=winemore,bevmo&query=title:'Riesling'

The above API request has the following parameters:

| Parameter | Value | Description |
|---|---|---|
| token | DIFFBOT-TOKEN | The Diffbot token that you used to create the custom crawl |
| type | crawl | Specify type=crawl when searching a crawl collection |
| col | winemore,bevmo | A comma-delimited list of collections to search. The parameter col=all searches all your custom crawl collections. |
| query | title:'Riesling' | DQL query. See Search (DQL) to learn how to write DQL queries. |
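The request above can be assembled programmatically. The following is a minimal sketch that uses only the Python standard library; the helper name build_crawl_search_url is hypothetical, and the sketch only builds the URL with the four parameters described above, without sending a request.

```python
from urllib.parse import urlencode

# Endpoint taken from the example request in this document.
DQL_ENDPOINT = "https://kg.diffbot.com/kg/v3/dql"


def build_crawl_search_url(token, collections, query):
    """Return a DQL request URL that searches crawl collections.

    `collections` is a list of collection names, or the string "all".
    This helper is illustrative and not part of any Diffbot client library.
    """
    col = collections if isinstance(collections, str) else ",".join(collections)
    params = {
        "token": token,
        "type": "crawl",  # required when searching a crawl collection
        "col": col,       # comma-delimited list of collections, or "all"
        "query": query,   # the DQL query itself
    }
    # urlencode percent-encodes reserved characters such as quotes and colons.
    return f"{DQL_ENDPOINT}?{urlencode(params)}"


url = build_crawl_search_url("<DIFFBOT-TOKEN>", ["winemore", "bevmo"], "title:'Riesling'")
print(url)
```

Percent-encoding the query value keeps quotes, colons, and spaces in a DQL query from being misread as URL syntax.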

Field names for custom collections

Because custom crawl collections are user defined, there is no ontology against which DQL can validate field names. This means that DQL may return empty results if the wrong field name or a field name with a typo is used in the DQL query.

site field

site is a special field that searches for web pages crawled from a given website. It searches directly on the index and not the field. Examples:

site:'nature.com'

title:'Kotlin' facet:site

Date handling

Fields named date and date.str are interpreted as date fields and have special handling. Their values can be given as epoch time (the number of seconds or milliseconds since 00:00:00 UTC on January 1, 1970) or as date literals. Examples:

min:date:1502734806

date>'2022-03-01'

date.str>='01-20-2018'

min:date.str:'2018-01-20'

Comparison operators >, >=, <, <=, and : (equal to) are supported.
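As an illustration of the epoch form, the sketch below converts a timezone-aware Python datetime to epoch seconds and formats a date clause with one of the supported operators. The helper names to_epoch_seconds and date_clause are hypothetical and exist only for this example.

```python
from datetime import datetime, timezone

# Operators listed as supported for date fields.
SUPPORTED_OPERATORS = {">", ">=", "<", "<=", ":"}


def to_epoch_seconds(moment):
    """Convert a timezone-aware datetime to whole seconds since the Unix epoch."""
    if moment.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    return int(moment.timestamp())


def date_clause(field, operator, moment):
    """Build a clause such as date>=1502734806 (illustrative helper)."""
    if operator not in SUPPORTED_OPERATORS:
        raise ValueError(f"unsupported operator: {operator}")
    return f"{field}{operator}{to_epoch_seconds(moment)}"


moment = datetime(2017, 8, 14, 18, 20, 6, tzinfo=timezone.utc)
print(date_clause("date", ">=", moment))  # date>=1502734806
```

Requiring a timezone-aware datetime avoids silently interpreting a naive value in the local time zone of the machine running the code.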

Special Time Period Literals

For a date field, we can use a Time Period Literal to represent some length of time. A Time Period Literal consists of a numeric value and a unit specifier. For example, the Time Period Literal "4h" represents 4 hours. These are the time units that are supported:

  • s: seconds
  • m: minutes
  • h: hours
  • d: days
  • w: weeks
  • y: years (one year is treated as 365 days)

Examples:

date>=4d

date>=5d date<=10d

type:Article date>=4h

Comparison operators >, >=, <, <= are supported.

Example: To find all Articles for which any origin was crawled within the last 4 hours:

type:Article lastCrawlTime<=4h

Example: To find all Organizations for which any origin was crawled within the last 365 days:

type:Organization crawlTimestamp<=1y
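To make the unit table concrete, here is a small sketch that converts a Time Period Literal into a Python timedelta using the units listed above. It is a client-side illustration only: the function name parse_time_period is hypothetical, and the sketch assumes whole-number values, since the text does not say whether decimals are accepted.

```python
import re
from datetime import timedelta

# Seconds per unit, following the units listed above (y is 365 days).
UNIT_SECONDS = {
    "s": 1,
    "m": 60,
    "h": 60 * 60,
    "d": 24 * 60 * 60,
    "w": 7 * 24 * 60 * 60,
    "y": 365 * 24 * 60 * 60,
}

# Assumption: a literal is a whole number followed by a single unit letter.
_LITERAL = re.compile(r"(\d+)([smhdwy])")


def parse_time_period(literal):
    """Return the length of time a literal such as "4h" represents."""
    match = _LITERAL.fullmatch(literal)
    if match is None:
        raise ValueError(f"not a time period literal: {literal!r}")
    value, unit = match.groups()
    return timedelta(seconds=int(value) * UNIT_SECONDS[unit])


print(parse_time_period("4h"))
print(parse_time_period("1y") == timedelta(days=365))
```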

Faceting

You can facet on custom crawl collections just as you would in other DQL queries. See Facet Queries. Examples:

title:'iPhone' facet:regularPriceDetails.amount

title:'Earthquakes' facet[-0.5:0,0:0.5]:posts.sentiment

title:'COVID' facet[day]:date

Because custom crawls do not have an ontology for DQL to use, facet fields are interpreted as strings by default. You can specify a numeric (integer or decimal) type for facet queries. Examples:

facet[int]:date

facet[float]:price
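The facet forms shown above share one shape: an optional bracketed modifier (a type, an interval, or a list of ranges) followed by the field name. A minimal sketch of composing them as strings follows; the helper names facet_clause and with_facet are hypothetical, and the sketch does not check that a modifier is one DQL accepts.

```python
def facet_clause(field, modifier=None):
    """Build facet:<field>, or facet[<modifier>]:<field> when a modifier is given."""
    if modifier is None:
        return f"facet:{field}"
    return f"facet[{modifier}]:{field}"


def with_facet(filter_query, field, modifier=None):
    """Append a facet clause to an existing DQL filter (illustrative helper)."""
    return f"{filter_query} {facet_clause(field, modifier)}"


print(facet_clause("price", "float"))
print(with_facet("title:'COVID'", "date", "day"))
print(with_facet("title:'iPhone'", "regularPriceDetails.amount"))
```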

Free-text search

You can use standard DQL syntax to query the collections. You can also use free-text queries, as follows:

| query= | Returns... |
|---|---|
| computer vision | All objects containing "computer" and "vision" anywhere in all Diffbot-extracted fields. |
| "web page analysis" | All objects containing the phrase "web page analysis" anywhere in all Diffbot-extracted fields. |