Search a Crawl/Bulk job using DQL
DQL allows you to search the extracted content of your Diffbot collections. A collection is a discrete Crawl or Bulk job output, and includes all of the web pages processed within that job.
In order to search a collection, you must first create that collection using either Crawl or the Bulk API. A collection can be searched before a crawl or bulk job is finished.
To search a crawl collection, specify type=crawl and list one or more collections in the col parameter. The parameter col=all searches all your custom crawl collections. You can then query the collection using DQL.
An example API request looks like this:
https://kg.diffbot.com/kg/v3/dql?token=<DIFFBOT-TOKEN>&type=crawl&col=winemore,bevmo&query=title:'Riesling'
The above API request has the following parameters:
Parameter | Value | Description |
---|---|---|
token | DIFFBOT-TOKEN | The Diffbot token that you used to create the custom crawl |
type | crawl | Specify type=crawl when searching a crawl collection |
col | winemore,bevmo | A comma-delimited list of collections to search. The parameter col=all searches all your custom crawl collections. |
query | title:'Riesling' | DQL query. See Search(DQL) to learn how to write DQL queries. |
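The request above can also be assembled programmatically so that the query is URL-encoded correctly. The following is a minimal sketch using only the Python standard library; build_crawl_search_url and the placeholder token are hypothetical names introduced for illustration and are not part of any Diffbot SDK.

```python
from urllib.parse import urlencode

# Endpoint copied from the example request above.
DQL_ENDPOINT = "https://kg.diffbot.com/kg/v3/dql"


def build_crawl_search_url(token, collections, query):
    """Return a DQL search URL for one or more crawl collections.

    Hypothetical helper: it only builds the URL and sends no request.
    """
    params = {
        "token": token,
        "type": "crawl",               # required when searching crawl collections
        "col": ",".join(collections),  # comma-delimited list, or ["all"]
        "query": query,                # DQL query, percent-encoded by urlencode
    }
    return f"{DQL_ENDPOINT}?{urlencode(params)}"


# Placeholder token; substitute the token used to create the crawl.
url = build_crawl_search_url(
    "YOUR_DIFFBOT_TOKEN", ["winemore", "bevmo"], "title:'Riesling'"
)
print(url)
```

The resulting URL can be passed to any HTTP client. Encoding the query this way avoids problems with quotes, colons, and spaces in DQL expressions.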
Field names for custom collections
Because custom crawl collections are user defined, there is no ontology against which DQL can validate field names. This means that DQL may return empty results if a wrong or misspelled field name is used in the DQL query.
site field
site is a special field which searches for web pages crawled from a website. It searches directly on the index and not the field. Examples:
site:'nature.com'
title:'Kotlin' facet:site
Date handling
Fields named date and date.str are interpreted as date fields and receive special handling. They can be expressed as epoch time (the number of seconds or milliseconds since 00:00:00 UTC on January 1, 1970) or as date literals. Examples:
min:date:1502734806
date>'2022-03-01'
date.str>='01-20-2018'
min:date.str:'2018-01-20'
The comparison operators >, >=, <, <=, and : (equal to) are supported.
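To relate the two date forms, the sketch below converts between a UTC calendar date and epoch seconds using the Python standard library, then formats a clause in the min:date: form shown in the examples. The helper name to_epoch_seconds is hypothetical, and the sketch assumes the dates are meant as UTC.

```python
from datetime import datetime, timezone


def to_epoch_seconds(year, month, day):
    """Convert midnight UTC on the given date to epoch seconds."""
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp())


# Epoch form of the date literal '2022-03-01' (assumed to be UTC).
epoch = to_epoch_seconds(2022, 3, 1)
epoch_clause = f"min:date:{epoch}"

# Reverse direction: read the epoch value from the first example above.
readable = datetime.fromtimestamp(1502734806, tz=timezone.utc).isoformat()

print(epoch_clause)
print(readable)
```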
Special Time Period Literals
For a date field, we can use a Time Period Literal to represent some length of time. A Time Period Literal consists of a numeric value and a unit specifier. For example, the Time Period Literal "4h" represents 4 hours. These are the time units that are supported:
- s: seconds
- m: minutes
- h: hours
- d: days
- w: weeks
- y: 365 days
Examples:
date>=4d
date>=5d date<=10d
The comparison operators >, >=, <, and <= are supported.
Example: To find all Articles for which any origin was crawled within the last 4 hours:
type:Article lastCrawlTime<=4h
Example: To find all Organizations for which any origin was crawled within the last 365 days:
type:Organization crawlTimestamp<=1y
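To make the unit table concrete, the following sketch converts a Time Period Literal into a number of seconds. It is an illustration of the units listed above, not Diffbot's own parser; period_to_seconds is a hypothetical helper, and it assumes the numeric value is a non-negative integer.

```python
import re

# Seconds per unit, following the list above (y is 365 days).
UNIT_SECONDS = {
    "s": 1,
    "m": 60,
    "h": 60 * 60,
    "d": 24 * 60 * 60,
    "w": 7 * 24 * 60 * 60,
    "y": 365 * 24 * 60 * 60,
}

# Assumption: an integer value followed by exactly one unit letter.
_LITERAL = re.compile(r"(\d+)([smhdwy])")


def period_to_seconds(literal):
    """Return the length in seconds of a literal such as '4h' or '1y'."""
    match = _LITERAL.fullmatch(literal)
    if match is None:
        raise ValueError(f"not a time period literal: {literal!r}")
    value, unit = match.groups()
    return int(value) * UNIT_SECONDS[unit]


print(period_to_seconds("4h"))
print(period_to_seconds("1y"))
```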
Faceting
You can facet on custom crawl collections just as you can in other DQL queries. See Facet Queries. Examples:
title:'iPhone' facet:regularPriceDetails.amount
title:'Earthquakes' facet[-0.5:0,0:0.5]:posts.sentiment
title:'COVID' facet[day]:date
Because custom crawls do not have an ontology for DQL to use, facet fields are interpreted as strings by default. You can specify a numeric (integer or decimal) type for facet queries. Examples:
facet[int]:date
facet[float]:price
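When queries are generated in code, the facet forms above can be produced by a small formatting helper. This is a sketch only: facet_clause is a hypothetical name, and it simply reproduces the bracket syntax seen in the examples without validating the modifier.

```python
def facet_clause(field, modifier=None):
    """Format a facet clause, optionally with a bracketed modifier.

    The modifier is whatever goes inside the brackets in the examples
    above, such as 'int', 'float', 'day', or a list of ranges.
    """
    if modifier is None:
        return f"facet:{field}"
    return f"facet[{modifier}]:{field}"


# Combine a filter with a typed facet, separated by a space.
query = " ".join(["title:'iPhone'", facet_clause("price", "float")])
print(query)
```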
Free-text search
You can use standard DQL syntax to query the collections. Aside from that, you can also use free-text queries as follows:
query= | Returns... |
---|---|
computer vision | All objects containing "computer" and "vision" anywhere in all Diffbot-extracted fields. |
"web page analysis" | All objects containing the phrase "web page analysis" anywhere in all Diffbot-extracted fields. |