DQL allows you to search the extracted content of your Diffbot collections. A collection is a discrete Crawl or Bulk job output, and includes all of the web pages processed within that job.
In order to search a collection, you must first create that collection using either Crawl or the Bulk API. A collection can be searched before a crawl or bulk job is finished.
To search crawled collections, specify type=crawl and list one or more collections in the col parameter. The parameter col=all searches all your custom crawl collections. You can then query the collection using DQL.
An example API request looks like this:
The above API request has the following parameters:
|token||The Diffbot token that you used to create the custom crawl|
|col||A comma-delimited list of collections to search. The parameter col=all searches all your custom crawl collections.|
|query||DQL query. See Search(DQL) to learn how to write DQL queries.|
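As a rough sketch of how these parameters combine into a request URL, the snippet below assembles one with Python's standard library. The endpoint `https://kg.diffbot.com/kg/v3/dql`, the helper name `build_dql_url`, the placeholder token, and the collection names are all assumptions made for illustration and are not taken from this page; confirm the endpoint against your account's API reference before relying on it.

```python
from urllib.parse import urlencode

# Assumed DQL endpoint; this page does not state it.
DQL_ENDPOINT = "https://kg.diffbot.com/kg/v3/dql"


def build_dql_url(token, collections, query, endpoint=DQL_ENDPOINT):
    """Hypothetical helper: build a DQL request URL for custom crawl collections."""
    params = {
        "token": token,                # token used to create the custom crawl
        "type": "crawl",               # search crawled collections
        "col": ",".join(collections),  # comma-delimited list, or ["all"]
        "query": query,                # the DQL query itself
    }
    # urlencode percent-encodes quotes, colons and commas in the values.
    return endpoint + "?" + urlencode(params)


url = build_dql_url("YOUR_TOKEN", ["myCrawl", "myBulkJob"], "title:'Kotlin'")
print(url)
```

Passing `["all"]` as the collection list would produce `col=all`, which searches every custom crawl collection.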
Field names for custom collections
Because custom crawl collections are user defined, there is no ontology against which DQL can validate field names. This means that DQL may return empty results if the wrong field name, or a field name with a typo, is used in the DQL query.
site is a special field that searches for web pages crawled from a given website. It searches directly on the index rather than on a field. Example:
site:'nature.com' title:'Kotlin' facet:site
Fields named date and date.str are interpreted as date fields and have special handling. Their values can be written as epoch time (the number of seconds or milliseconds since 00:00:00 UTC on January 1, 1970) or as date literals. Examples:
min:date:1502734806
date>'2022-03-01'
date.str>='01-20-2018'
min:date.str:'2018-01-20'
Comparison operators, including : (equal to), are supported.
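To make the two date forms concrete, here is a minimal sketch that renders the same cutoff as an epoch value and as a date literal. The helper name `to_epoch_seconds` is hypothetical, and combining the `>` operator with an epoch value is an assumption based on the examples above; how Diffbot handles time zones for date literals is not stated here, so the two clauses may not match exactly.

```python
from datetime import datetime, timezone


def to_epoch_seconds(dt):
    """Return whole seconds since 1970-01-01 00:00:00 UTC for an aware datetime."""
    return int(dt.timestamp())


# A timezone-aware cutoff, so the epoch value does not depend on the local zone.
cutoff = datetime(2022, 3, 1, tzinfo=timezone.utc)

epoch_clause = f"date>{to_epoch_seconds(cutoff)}"
literal_clause = f"date>'{cutoff:%Y-%m-%d}'"

print(epoch_clause)    # date>1646092800
print(literal_clause)  # date>'2022-03-01'
```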
For a date field, we can use a Time Period Literal to represent some length of time. A Time Period Literal consists of a numeric value and a unit specifier. For example, the Time Period Literal "4h" represents 4 hours. These are the time units that are supported:
- s: seconds
- m: minutes
- h: hours
- d: days
- w: weeks
- y: 365 days
date>=4d date>=5d date<=10d
Comparison operators such as >= and <= are supported.
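The unit table above can be sketched as a small client-side parser, which is handy for validating a literal before it goes into a query. This is only an illustration: `parse_time_period` and `UNIT_SECONDS` are hypothetical names, and the sketch assumes the numeric value is a non-negative integer, which this page does not specify.

```python
import re
from datetime import timedelta

# Seconds per unit, following the unit list above (y is exactly 365 days).
UNIT_SECONDS = {
    "s": 1,
    "m": 60,
    "h": 60 * 60,
    "d": 24 * 60 * 60,
    "w": 7 * 24 * 60 * 60,
    "y": 365 * 24 * 60 * 60,
}


def parse_time_period(literal):
    """Convert a literal such as '4h' into a timedelta (integer values assumed)."""
    match = re.fullmatch(r"(\d+)([smhdwy])", literal)
    if match is None:
        raise ValueError(f"not a valid Time Period Literal: {literal!r}")
    value, unit = match.groups()
    return timedelta(seconds=int(value) * UNIT_SECONDS[unit])


print(parse_time_period("4h"))  # 4:00:00
print(parse_time_period("1y"))  # 365 days, 0:00:00
```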
Example: To find all Articles for which any origin was crawled within the last 4 hours:
Example: To find all Organizations for which any origin was crawled within the last 365 days:
You can facet on custom crawl collections just as in other DQL queries. See Facet Queries. Examples:
title:'iPhone' facet:regularPriceDetails.amount
title:'Earthquakes' facet[-0.5:0,0:0.5]:posts.sentiment
title:'COVID' facet[day]:date
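The three facet forms in these examples (plain, ranged, and interval) can be generated with a small string builder. The sketch below is only a convenience for composing query text; `facet_clause` is a hypothetical helper, and it reproduces only the syntax visible in the examples, not the full facet grammar.

```python
def facet_clause(field, ranges=None, interval=None):
    """Hypothetical helper: format a facet clause like those in the examples."""
    if ranges is not None and interval is not None:
        raise ValueError("pass either ranges or interval, not both")
    if ranges is not None:
        # Each range is a (low, high) pair rendered as low:high.
        spec = ",".join(f"{low}:{high}" for low, high in ranges)
        return f"facet[{spec}]:{field}"
    if interval is not None:
        return f"facet[{interval}]:{field}"
    return f"facet:{field}"


queries = [
    "title:'iPhone' " + facet_clause("regularPriceDetails.amount"),
    "title:'Earthquakes' "
    + facet_clause("posts.sentiment", ranges=[(-0.5, 0), (0, 0.5)]),
    "title:'COVID' " + facet_clause("date", interval="day"),
]
for q in queries:
    print(q)
```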
Because custom crawls do not have an ontology for DQL to use, facet fields are interpreted as strings by default. You can specify a numeric (integer or decimal) type for facet queries. Examples:
You can use standard DQL syntax to query the collections. Aside from that, you can also use free-text queries as follows:
|All objects containing "computer" and "vision" anywhere in all Diffbot-extracted fields.|
|All objects containing the phrase "web page analysis" anywhere in all Diffbot-extracted fields.|