Docs Suite

Product Extraction API

The Product API automatically extracts complete data from any shopping or e-commerce product page. Retrieve full pricing information, product IDs (SKU, UPC, MPN), images, product specifications, brand and more.

Request

To use the Product API, perform an HTTP GET request on the following endpoint:

https://api.diffbot.com/v3/product

Provide the following arguments:

Argument | Description
token | Developer token
url | Web page URL of the product to process (URL encoded)
Optional arguments:
fields | Used to specify optional fields to be returned by the Product API. See the Fields section below.
discussion | Pass discussion=false to disable automatic extraction of product reviews. See below.
timeout | Sets a value in milliseconds to wait for the retrieval/fetch of content from the requested URL. The default timeout for the third-party response is 30 seconds (30000).
callback | Use for jsonp requests. Needed for cross-domain ajax.
proxy | Used to specify the IP address of a custom proxy that will be used to fetch the target page, instead of Diffbot's default IPs/proxies. (Ex: &proxy=168.212.226.204)
proxyAuth | Used to specify the authentication parameters that will be used with the proxy specified in the &proxy parameter. (Ex: &proxyAuth=username:password)
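
A request URL can be assembled from these arguments with any HTTP library. The following is a minimal Python sketch rather than an official client: the helper name build_product_request and the placeholder token are assumptions made for illustration; only the endpoint and the argument names come from the table above.

```python
from urllib.parse import urlencode

PRODUCT_ENDPOINT = "https://api.diffbot.com/v3/product"


def build_product_request(token, page_url, **optional):
    """Return a GET URL for the Product API (hypothetical helper)."""
    params = {"token": token, "url": page_url}
    # Optional arguments (timeout, discussion, fields, ...) are passed through,
    # skipping any that were left as None.
    params.update({k: v for k, v in optional.items() if v is not None})
    # urlencode percent-encodes the target page URL, as the url argument requires.
    return PRODUCT_ENDPOINT + "?" + urlencode(params)


request_url = build_product_request(
    "YOUR_TOKEN",  # placeholder, not a real token
    "http://store.livrada.com/collections/all/products/before-i-go-to-sleep",
    timeout=15000,        # milliseconds
    discussion="false",   # disable review extraction
)
print(request_url)
```

The resulting string can be passed to whichever HTTP client you prefer; the sketch deliberately stops short of sending the request.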

The fields argument

Use the fields argument to return optional fields in the JSON response. The default fields will always be returned. For nested arrays, use parentheses to retrieve specific fields, or * to return all sub-fields.

For example, to return links and meta (in addition to the default fields), your &fields argument would be:

&fields=links,meta
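
As a rough illustration of composing a fields value, the Python sketch below joins top-level names with commas and wraps sub-field selections in parentheses. The helper build_fields is hypothetical, and the exact nested form it emits, such as images(url,title) or images(*), is an assumption based on the description above; check it against a live response before relying on it.

```python
def build_fields(*names, **nested):
    """Compose a value for the fields argument (hypothetical helper)."""
    parts = list(names)
    for name, subfields in nested.items():
        # "*" requests every sub-field; otherwise list the chosen sub-fields.
        inner = "*" if subfields == "*" else ",".join(subfields)
        parts.append(f"{name}({inner})")
    return ",".join(parts)


plain = build_fields("links", "meta")
selected = build_fields("links", images=["url", "title"])
everything = build_fields(images="*")
print(plain)
print(selected)
print(everything)
```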

Response

The Product API returns data in JSON format.

Each V3 response includes a request object (which returns request-specific metadata), and an objects array, which will include the extracted information for all objects on a submitted page.

Objects in the Product API's objects array will include the following fields:

Field | Description
type | Type of object (always product).
pageUrl | URL of submitted page / page from which the product is extracted.
resolvedPageUrl | Returned if the pageUrl redirects to another URL.
title | Title of the product.
text | Text description, if available, of the product.
brand | Item's brand name.
offerPrice | Offer or actual/final price of the product.
regularPrice | Regular or original price of the product, if available.
shippingAmount | Shipping price.
saveAmount | Discount or amount saved off the regular price.
offerPriceDetails | offerPrice separated into its constituent parts: amount, symbol, and full text.
regularPriceDetails | regularPrice separated into its constituent parts: amount, symbol, and full text.
saveAmountDetails | saveAmount separated into its constituent parts: amount, symbol, full text, and whether or not it is a percentage value.
productId | Diffbot-determined unique product ID. If upc, isbn, mpn or sku are identified on the page, productId will select from these values in the above order.
upc | Universal Product Code (UPC/EAN), if available.
sku | Stock Keeping Unit -- store/vendor inventory number or identifier.
mpn | Manufacturer's Product Number.
isbn | International Standard Book Number (ISBN), if available.
specs | If a specifications table or similar data is available on the product page, individual specifications will be returned in the specs object as name/value pairs. Names will be normalized to lowercase with spaces replaced by underscores, e.g. display_resolution.
images | Array of images, if present within the product.
↳url | Fully resolved link to image. If the image SRC is encoded as base64 data, the complete data URI will be returned.
↳title | Description or caption of the image.
↳height | Height of image as (re-)sized via browser/CSS.
↳width | Width of image as (re-)sized via browser/CSS.
↳naturalHeight | Raw image height, in pixels.
↳naturalWidth | Raw image width, in pixels.
↳primary | Returns true if image is identified as primary based on visual analysis.
↳xpath | XPath expression identifying the image node.
↳diffbotUri | Internal ID used for indexing.
discussion | Product reviews, as extracted by the Diffbot Discussion API. See below.
prefixCode | Country of origin as identified by UPC/ISBN.
productOrigin | If available, two-character ISO country code where the product was produced.
humanLanguage | Returns the (spoken/human) language of the submitted page, using two-letter ISO 639-1 nomenclature.
diffbotUri | Unique object ID. The diffbotUri is generated from the values of various Product fields and uniquely identifies the object. This can be used for deduplication.

Optional fields, available using fields= argument:

Field | Description
links | Returns a top-level object (links) containing all hyperlinks found on the page.
meta | Returns a top-level object (meta) containing the full contents of page meta tags, including sub-arrays for OpenGraph tags, Twitter Card metadata, schema.org microdata, and -- if available -- oEmbed metadata.
querystring | Returns any key/value pairs present in the URL querystring. Items without a discrete value will be returned as true.
breadcrumb | Returns a top-level array (breadcrumb) of URLs and link text from page breadcrumbs.

The following fields are in an early beta stage:

Field | Description
availability | Item's availability, either true or false.
category | Returns an inferred category from Diffbot's product categorization taxonomy.
colors | Returns array of product color options.
normalizedSpecs | Returns normalized specifications if a specifications table (or similar element) is found on the product page. See Normalized Specs below for details on normalization.
multipleProducts | Returns true if multiple products are distinctly available on the product page.
priceRange | If the product is available in a range of prices, the minimum and maximum values will be returned. The lowest price will also be returned as the offerPrice.
↳minPrice | The minimum price for the offered item.
↳maxPrice | The maximum price for the offered item.
quantityPrices | If the product is available with quantity-based discounts, all identifiable price points will be returned. The lowest price will also be returned as the offerPrice.
↳minQuantity | The minimum quantity required to purchase for the associated price.
↳price | Price of the specific quantity level.
size | Size(s) available, if identified on the page.

Review Extraction

By default the Product API will attempt to extract user reviews from product pages, using integrated functionality from the Diffbot Discussion API. Review data will be returned in the discussion object (nested within the primary product object). The full syntax for discussion data is available in the Discussion API documentation.

Discussion extraction can be disabled using the argument discussion=false. Note that if a page has recently been processed by Diffbot, cached reviews may be returned even if discussion=false is passed.

Normalized Specs

The normalizedSpecs field returns a product's automatically standardized/sanitized specifications, if a specs table and/or similar elements are detected on a page. Numeric values for many specifications are normalized into standard units.

Data Returned

Each key returns an array of values; a single-value specification is returned as a single-element array. Each value may include the following fields:

Field | Description
cleanLiteral | A sanitized version of the text string.
unit | Normalized output unit, if applicable, per below table.
value | Normalized output value, if applicable, according to the unit.

Example Response

"normalizedSpecs_beta": {
  "color": [
    {
      "unit": "rgbHex",
      "cleanLiteral": "Fluorescent Pink",
      "value": "FF1493"
    },
    {
      "unit": "rgbHex",
      "cleanLiteral": "Soft White",
      "value": "E0E4DF"
    },
    {
      "unit": "rgbHex",
      "cleanLiteral": "Diffbot Blue",
      "value": "112532"
    }
  ],
  "dataCapacity": [
    {
      "unit": "KILOBYTE",
      "cleanLiteral": "1.0 TB",
      "value": 1073741824
    }
  ],
  "minOperatingTemperature": [
    {
      "unit": "CELSIUS",
      "cleanLiteral": "32.0 F",
      "value": -0.00000799999999756551
    }
  ],
  "shippingDepth": [
    {
      "unit": "METER",
      "cleanLiteral": "5.6 in",
      "value": 0.1422
    }
  ],
  "shippingWeight": [
    {
      "unit": "KILOGRAM",
      "cleanLiteral": "0.3 lb",
      "value": 0.1361
    }
  ],
  "sku": [
    {
      "cleanLiteral": "A8237"
    }
  ]
}
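
Because every key maps to an array and the unit and value fields are optional, client code typically has to handle missing fields. The Python sketch below shows one way to walk such a structure; the helper flatten_specs and the trimmed sample data are illustrative assumptions, with the sample values borrowed from the example above.

```python
import json

# Trimmed sample in the shape of the example response above.
sample = json.loads("""
{
  "normalizedSpecs_beta": {
    "shippingWeight": [
      {"unit": "KILOGRAM", "cleanLiteral": "0.3 lb", "value": 0.1361}
    ],
    "sku": [
      {"cleanLiteral": "A8237"}
    ]
  }
}
""")


def flatten_specs(specs):
    """Yield (key, cleanLiteral, value, unit) for every spec value."""
    for key, values in specs.items():
        for entry in values:
            # unit and value are absent for plain string specs such as sku,
            # so .get() returns None instead of raising KeyError.
            yield key, entry.get("cleanLiteral"), entry.get("value"), entry.get("unit")


rows = list(flatten_specs(sample["normalizedSpecs_beta"]))
for row in rows:
    print(row)
```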

List of Normalized Keys

Normalized Key | Type | Normalized Value Unit
armLength | numeric | meter
audioJackDiameter | numeric | meter
batteryCapacity | numeric | coulomb
bookFormat | string | n/a
brand | string | n/a
busClockFrequency | numeric | hertz
bust | numeric | meter
dataCapacity | numeric | kilobyte
chest | numeric | meter
circumference | numeric | meter
clockFrequency | numeric | hertz
color | string | rgb hex value
condition | string | n/a
copyingSpeed | numeric | pageperminute
cordLength | numeric | meter
countryOfOrigin | string | n/a
dataReadSpeed | numeric | kilobytepersecond
dataTransmissionSpeed | numeric | kilobytepersecond
dataWriteSpeed | numeric | kilobytepersecond
depth | numeric | meter
diameter | numeric | meter
fileSize | numeric | kilobyte
focalLength | numeric | meter
fuelConsumptionCity | numeric | literperkilometer
fuelConsumptionCombined | numeric | literperkilometer
fuelConsumptionHighway | numeric | literperkilometer
gender | string | n/a
genre | string | n/a
gpuFrequencyClock | numeric | hertz
heel | numeric | meter
height | numeric | meter
hips | numeric | meter
impedance | numeric | OHM
inkColor | string | n/a
innerDiameter | numeric | meter
inputVoltage | numeric | volt
inseam | numeric | meter
language | string | n/a
length | numeric | meter
lensDiameter | numeric | meter
lensWidth | numeric | meter
material | string | n/a
maxFocalLength | numeric | meter
maxFrequencyResponse | numeric | hertz
maxWeight | numeric | kilogram
maxWeightCapacity | numeric | kilogram
maxOperatingTemperature | numeric | celsius
maxStorageTemperature | numeric | celsius
memoryClockFrequency | numeric | hertz
mileage | numeric | meter
minFocalLength | numeric | meter
minFrequencyResponse | numeric | hertz
minWeight | numeric | kilogram
minWeightCapacity | numeric | kilogram
minOperatingTemperature | numeric | celsius
minStorageTemperature | numeric | celsius
mpn | string | n/a
neck | numeric | meter
operating_temperature | numeric | celsius
opticalWaveLength | numeric | meter
outerDiameter | numeric | meter
outputVoltage | numeric | volt
power | numeric | watt
powerConsumption | numeric | watt
powerConsumptionIdle | numeric | watt
powerDeveloped | numeric | watt
powerRMS | numeric | watt
printSpeedBlack | numeric | pageperminute
printSpeedColor | numeric | pageperminute
printSpeedCombined | numeric | pageperminute
processorCache | numeric | kilobyte
processorClockFrequency | numeric | hertz
publisher | string | n/a
ramSize | numeric | kilobyte
refreshRate | numeric | hertz
resolutionX | numeric | n/a
resolutionY | numeric | n/a
screenDiagonal | numeric | meter
shippingDepth | numeric | meter
shippingHeight | numeric | meter
shippingLength | numeric | meter
shippingWeight | numeric | kilogram
shippingWidth | numeric | meter
shoulders | numeric | meter
sku | string | n/a
sleeveLength | numeric | meter
style | string | n/a
subtitlesLanguage | string | n/a
supportedRamSize | numeric | kilobyte
thermalDesignPower | numeric | watt
waist | numeric | meter
warrantyDuration | numeric | second
waterResistance | numeric | meter
weight | numeric | kilogram
weightCapacity | numeric | kilogram
wheelDiameter | numeric | meter
width | numeric | meter

Example Response

{
  "request": {
    "pageUrl": "http://store.livrada.com/collections/all/products/before-i-go-to-sleep",
    "api": "product",
    "options": [],
    "fields": "title,text,offerPrice,regularPrice,saveAmount,pageUrl,images",
    "version": 3
  },
  "objects": [
    {
      "type": "product",
      "title": "Before I Go To Sleep",
      "text": "Memories define us. So what if you lost yours every time you went to sleep? Your name, your identity, your past, even the people you love -- all forgotten overnight. And the one person you trust may be telling you only half the story. Before I Go To Sleep is a disturbing psychological thriller in which an amnesiac desperately tries to uncover the truth about who she is and who she can trust.",
      "offerPrice": "$7.99",
      "regularPrice": "$9.99",
      "saveAmount": "$2.00",
      "pageUrl": "http://store.livrada.com/collections/all/products/before-i-go-to-sleep",
      "images": [
        {
          "title": "Before I Go to Sleep cover",
          "url": "http://cdn.shopify.com/s/files/1/0184/6296/products/BeforeIGoToSleep_large.png?946",
          "xpath": "/HTML[@class='no-js']/BODY[@id='page-product']/DIV[@class='content-frame']/DIV[@class='content']/DIV[@class='content-shop']/DIV[@class='row']/DIV[@class='span5']/DIV[@class='product-thumbs']/UL/LI[@class='first-image']/A[@class='single_image']/IMG",
          "diffbotUri": "image|1|768070723"
        }
      ],
      "diffbotUri": "product|1|937176621"
    }
  ]
}
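
To illustrate how a client might consume a response of this shape, the Python sketch below walks the objects array and pulls out a few fields. The helper summarize_products and the abbreviated payload are assumptions for illustration; fields such as regularPrice may be absent, so the code reads them defensively.

```python
import json

# Abbreviated copy of the example response above.
response_text = """
{
  "request": {"api": "product", "version": 3},
  "objects": [
    {
      "type": "product",
      "title": "Before I Go To Sleep",
      "offerPrice": "$7.99",
      "regularPrice": "$9.99",
      "images": [{"title": "Before I Go to Sleep cover"}]
    }
  ]
}
"""


def summarize_products(payload):
    """Return one small dict per product object (hypothetical helper)."""
    summaries = []
    for obj in payload.get("objects", []):
        if obj.get("type") != "product":
            continue
        summaries.append({
            "title": obj.get("title"),
            "offerPrice": obj.get("offerPrice"),
            "regularPrice": obj.get("regularPrice"),  # None when not extracted
            "imageCount": len(obj.get("images", [])),
        })
    return summaries


summaries = summarize_products(json.loads(response_text))
print(summaries)
```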

Authentication

You can supply Diffbot with basic authentication credentials or custom HTTP headers (see below) to access intranet pages or other sites that require a login.

Basic Authentication

To access pages that require a login/password (using basic access authentication), include the username and password in your url parameter, e.g.: url=http%3A%2F%2FUSERNAME:PASSWORD@www.diffbot.com.
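
One way to build such a url value programmatically is sketched below in Python. The helper with_basic_auth and the placeholder token are assumptions for illustration. Note that urlencode percent-encodes the whole value, including the ":" and "@" around the credentials, which is stricter than the example above; this sketch assumes the API decodes the parameter in the standard way.

```python
from urllib.parse import quote, urlencode


def with_basic_auth(page_url, username, password):
    """Insert credentials into a page URL (hypothetical helper)."""
    scheme, rest = page_url.split("://", 1)
    # Percent-encode the credentials so characters such as "@" or ":" inside
    # them cannot be mistaken for URL delimiters.
    user = quote(username, safe="")
    secret = quote(password, safe="")
    return f"{scheme}://{user}:{secret}@{rest}"


target = with_basic_auth("http://www.diffbot.com", "USERNAME", "PASSWORD")
query = urlencode({"token": "YOUR_TOKEN", "url": target})
print(target)
print(query)
```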

Custom HTTP Headers and JavaScript

A full guide on using custom headers in direct API calls is available elsewhere in the Diffbot documentation.

Custom headers

You can supply Diffbot APIs with custom HTTP headers that will be passed along when making requests to third-party sites. These can be used to define specific Referer, User-Agent, Cookie or any other values.

Custom headers should be sent as HTTP headers in your request to https://api.diffbot.com, and prepended with X-Forward-.

For instance, to send custom User-Agent, Referer and My-Custom-Header header values:

Desired Header | Send to api.diffbot.com
User-Agent:Diffbot | X-Forward-User-Agent:Diffbot
Referer:diffbot.com | X-Forward-Referer:diffbot.com
My-Custom-Header:CustomValue | X-Forward-My-Custom-Header:CustomValue
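
The prefixing rule is mechanical, so it can be applied in code. A minimal Python sketch follows; the helper name forward_headers is hypothetical, and the resulting dictionary would be passed as request headers to whichever HTTP client you use.

```python
def forward_headers(desired):
    """Prefix each header name with X-Forward- (hypothetical helper)."""
    return {f"X-Forward-{name}": value for name, value in desired.items()}


headers = forward_headers({
    "User-Agent": "Diffbot",
    "Referer": "diffbot.com",
    "My-Custom-Header": "CustomValue",
})
print(headers)
```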

Custom JavaScript

This functionality is currently in beta.

Using the X-Evaluate custom header (sent as X-Forward-X-Evaluate), you can inject your own JavaScript code into web pages. Custom JavaScript will be executed once the DOM has loaded.

Custom JavaScript should be provided as a text string and contained in its own function. You must also include the special functions start() and end() to indicate the beginning and end of your custom script. Once end() fires, the updated document will be processed by your chosen extraction API.

Delay the call to end() with setTimeout (see JavaScript Timing Events) so that your primary function has time to finish processing. If your custom JavaScript needs Ajax-delivered content, you may also need to wrap the entire function body in setTimeout to delay the start of processing.

The following sample X-Evaluate header waits one-half second after the DOM has loaded, clicks the a.loadMore element, then waits 800 milliseconds before calling end():

function() {
    start();
    setTimeout(function() {
        var loadMoreNode = document.querySelector('a.loadMore');
        if (loadMoreNode != null) {
            loadMoreNode.click();
            setTimeout(function() {
                end();
            }, 800);
        } else {
            end();
        }
    }, 500);
}

Delivered as a string value in a custom header:

"X-Forward-X-Evaluate": "function() {start();setTimeout(function(){var loadMoreNode=document.querySelector('a.loadMore');if (loadMoreNode != null) {loadMoreNode.click();setTimeout(function(){end();}, 800);} else {end();}},500);}"

Note: X-Evaluate will only be executed if called from the API the rule resides in. If you have an X-Evaluate script in your Article API rule and make a request with the Analyze API that identifies the page as an article, the X-Evaluate will not be executed.
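
HTTP header values cannot span multiple lines, so a readable script has to be collapsed onto one line before it is sent as X-Forward-X-Evaluate. The Python sketch below does this by squeezing every run of whitespace into a single space. This is an illustrative approach, not a Diffbot-provided tool, and it assumes the script contains no // line comments and no string literals whose whitespace matters.

```python
import re

# The sample X-Evaluate script from above, kept readable in source form.
script = """
function() {
    start();
    setTimeout(function() {
        var loadMoreNode = document.querySelector('a.loadMore');
        if (loadMoreNode != null) {
            loadMoreNode.click();
            setTimeout(function() {
                end();
            }, 800);
        } else {
            end();
        }
    }, 500);
}
"""

# Collapse newlines and indentation so the value fits on a single header line.
header_value = re.sub(r"\s+", " ", script.strip())
evaluate_headers = {"X-Forward-X-Evaluate": header_value}
print(evaluate_headers["X-Forward-X-Evaluate"])
```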

Posting Content

If your content is not publicly available (e.g., behind a firewall), you can POST markup directly to the Product API endpoint for analysis:

https://api.diffbot.com/v3/product?token=...&url=...

Please note that the url argument is still required, and will be used to resolve any relative links contained in the markup.

Provide the content to analyze as your POST body, and specify the Content-Type header as text/html.

HTML Post Sample

curl -H "Content-Type: text/html" -d '<html><head><title>Something to Buy</title></head><body><h2>A Pair of Jeans</h2><div>Price: $31.99</div></body></html>' 'https://api.diffbot.com/v3/product?token=...&url=http%3A%2F%2Fstore.diffbot.com'
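
The same POST can be prepared from Python using only the standard library. The sketch below is an assumption-laden equivalent of the curl command: the helper build_post_request and the placeholder token are hypothetical, and the request is built but not sent so that the example runs offline.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_post_request(token, page_url, html):
    """Prepare a POST of raw markup to the Product API (hypothetical helper)."""
    # The url argument is still required; it resolves relative links in the markup.
    query = urlencode({"token": token, "url": page_url})
    return Request(
        "https://api.diffbot.com/v3/product?" + query,
        data=html.encode("utf-8"),             # markup goes in the POST body
        headers={"Content-Type": "text/html"},
        method="POST",
    )


html = (
    "<html><head><title>Something to Buy</title></head>"
    "<body><h2>A Pair of Jeans</h2><div>Price: $31.99</div></body></html>"
)
req = build_post_request("YOUR_TOKEN", "http://store.diffbot.com", html)
# Passing req to urllib.request.urlopen would send it; omitted here on purpose.
print(req.get_method(), req.full_url)
```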
Last updated by dioro
Copyright © 2021 Diffbot.com