Article

get

https://api.diffbot.com/v3/article

Automatically extract clean article text and other data from news articles, blog posts and other text-heavy pages.

The Diffbot Article API retrieves the full text, cleaned and normalized HTML, related images and videos, author, date, and tags automatically, from any article on any site.

Test drive Article API without a token at diffbot.com/testdrive.
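As a minimal sketch of what a request to this endpoint looks like, the snippet below builds the GET URL in Python without sending it. The `token` parameter name and the `YOUR_TOKEN` placeholder are assumptions not stated on this page; check your account settings for the actual credential.

```python
from urllib.parse import urlencode

ARTICLE_ENDPOINT = "https://api.diffbot.com/v3/article"


def build_article_request(page_url: str, token: str) -> str:
    """Return a GET URL for the Article API with the target page URL-encoded."""
    # `token` is an assumed parameter name; `url` is the documented target URL.
    query = urlencode({"token": token, "url": page_url})
    return f"{ARTICLE_ENDPOINT}?{query}"


request_url = build_article_request(
    "https://www.technologyreview.com/2020/09/04/1008156/"
    "knowledge-graph-ai-reads-web-machine-learning-natural-language-processing/",
    token="YOUR_TOKEN",  # placeholder, not a real credential
)
print(request_url)
```

The target URL has to be percent-encoded because it carries its own slashes and colons; `urlencode` handles that, so the page address can be passed in as-is.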

Response

The Article API returns data in JSON format.

Each response includes a request object (which returns request-specific metadata), and an objects array, which will include the extracted information for all objects on a submitted page.

At the moment, only a single object will be returned for Article API requests.

Objects in the Article API's objects array will include the following fields:

Field	Description
`type`	Type of object (always `article`).
`title`	Title of the article.
`text`	Full text of the article.
`html`	Diffbot-normalized HTML of the extracted article. Please see Normalized HTML Fields for a breakdown of elements and attributes returned.
`date`	Date of extracted article, normalized in most cases to RFC 1123 (HTTP/1.1).
`estimatedDate`	If an article's date is ambiguous, Diffbot will attempt to estimate a more specific timestamp using various factors. This will not be generated for articles older than two days, or articles without an identified `date`.
`author`	Article author.
`authorUrl`	URL of the author profile page, if available.
`discussion`	Article comments, as extracted by the Diffbot Discussion API. See Extracting Comments.
`humanLanguage`	Returns the (spoken/human) language of the submitted page, using two-letter ISO 639-1 nomenclature.
`numPages`	Number of pages automatically concatenated to form the `text` or `html` response. By default, Diffbot will automatically concatenate up to 20 pages of an article. More on automatic concatenation.
`nextPages`	Array of all page URLs concatenated in a multipage article. More on automatic concatenation.
`siteName`	The plain-text name of the site (e.g. `The New York Times` or `Diffbot`). If no site name is automatically determined, the root domain (`diffbot.com`) will be returned.
`publisherRegion`	If known, the region of the article publication.
`publisherCountry`	If known, the country of the article publication.
`location`	Location mentioned at the beginning of the article.
`pageUrl`	URL of submitted page / page from which the article is extracted.
`resolvedPageUrl`	Returned if the `pageUrl` redirects to another URL.
`tags`	Array of tags/entities, generated from analysis of the extracted `title` and `text` fields. Tags are extracted by the Diffbot Natural Language API and linked to the Diffbot Knowledge Graph. Tags will be returned if the text is in one of the following languages: English (en), French (fr), Spanish (es), Chinese (zh), German (de), Russian (ru), Japanese (ja), Dutch (nl), Polish (pl), Norwegian (no), Danish (da), Swedish (sv), Italian (it).
↳`label`	Name of the entity or tag.
↳`count`	Number of appearances the entity makes within the text content.
↳`score`	Rating of the entity's relevance to the overall text content (range of 0 to 1) based on various factors.
↳`rdfTypes`	If the entity can be represented by multiple resources, all of the possible URIs will be returned.
↳`type`	This legacy field is a simplified precursor to `rdfTypes`, and will return either `organization` or `person` if the entity is either of those types.
↳`uri`	Link to the primary Diffbot entity for this tag in the Diffbot Knowledge Graph.
`categories`	Array of categories, generated from analysis of the extracted `title` and `text` fields. This field is available for over 100 languages. The complete list of categories can be found at this link.
↳`name`	Name of the category.
↳`score`	Score of how relevant this category is for the article.
↳`id`	Id of the category.
`images`	Array of images, if present within the article body.
↳`url`	Fully resolved link to image. If the image `SRC` is encoded as base64 data, the complete data URI will be returned.
↳`title`	Description or caption of the image.
↳`height`	Height of image as (re-)sized via browser/CSS.
↳`width`	Width of image as (re-)sized via browser/CSS.
↳`naturalHeight`	Raw image height, in pixels.
↳`naturalWidth`	Raw image width, in pixels.
↳`primary`	Returns `true` if image is identified as primary based on visual analysis.
↳`diffbotUri`	Internal ID used for indexing.
`videos`	Array of videos, if present within the article body.
↳`url`	Fully resolved link to source video content.
↳`naturalHeight`	Source video height, in pixels, if available.
↳`naturalWidth`	Source video width, in pixels, if available.
↳`primary`	Returns `true` if video is identified as primary based on visual analysis.
↳`diffbotUri`	Internal ID used for indexing.
`breadcrumb`	Returns a top-level array (`breadcrumb`) of URLs and link text from page breadcrumbs.
`diffbotUri`	Unique object ID. The `diffbotUri` is generated from the values of various Article fields and uniquely identifies the object. This can be used for deduplication.
`sentiment`	Returns the sentiment score of the analyzed article text, a value ranging from -1.0 (very negative) to 1.0 (very positive).
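Because `date` and `estimatedDate` are normalized to RFC 1123 only "in most cases", client code probably should not assume every value parses. The following is one possible way to read these fields in Python using the standard library; the helper name `parse_article_date` is hypothetical.

```python
from datetime import datetime
from email.utils import parsedate_to_datetime
from typing import Optional


def parse_article_date(value: str) -> Optional[datetime]:
    """Parse an RFC 1123 date string; return None if it is not in that format."""
    try:
        return parsedate_to_datetime(value)
    except (TypeError, ValueError):
        # Older Python versions raise TypeError, newer ones ValueError.
        return None


published = parse_article_date("Fri, 04 Sep 2020 00:00:00 GMT")
if published is not None:
    print(published.isoformat())
```

Returning `None` for unparseable values leaves the decision about fallbacks to the caller.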

The following is an example response from a successful extraction of an article on technologyreview.com.

{
  "request": {
    "pageUrl": "https://www.technologyreview.com/2020/09/04/1008156/knowledge-graph-ai-reads-web-machine-learning-natural-language-processing/",
    "api": "article",
    "version": 3
  },
  "humanLanguage": "en",
  "objects": [
    {
      "date": "Fri, 04 Sep 2020 00:00:00 GMT",
      "sentiment": 0.153,
      "images": [
        {
          "naturalHeight": 869,
          "width": 654,
          "diffbotUri": "image|3|1663647584",
          "url": "https://wp.technologyreview.com/wp-content/uploads/2022/03/Flower-Trip-style.jpeg?resize=1006,640",
          "naturalWidth": 1366,
          "height": 418
        },
        {
          "naturalHeight": 1900,
          "width": 460,
          "diffbotUri": "image|3|683243517",
          "url": "https://wp.technologyreview.com/wp-content/uploads/2022/02/MA22_Demis-Hassabis-99-v1.jpg?resize=1006,1400",
          "naturalWidth": 1366,
          "height": 294
        }
      ],
      "author": "Will Douglas Heaven",
      "estimatedDate": "Fri, 04 Sep 2020 00:00:00 GMT",
      "publisherRegion": "North America",
      "icon": "https://www.technologyreview.com/static/media/favicon.1cfcdb44.ico",
      "diffbotUri": "article|3|973247980",
      "siteName": "MIT Technology Review",
      "type": "article",
      "title": "This know-it-all AI learns by reading the entire web nonstop",
      "tags": [
        {
          "score": 0.998680055141449,
          "sentiment": 0,
          "count": 10,
          "label": "artificial intelligence",
          "uri": "https://diffbot.com/entity/E_lYDrjmAMlKKwXaDf958zg",
          "rdfTypes": [
            "http://dbpedia.org/ontology/Skill",
            "http://dbpedia.org/ontology/Activity"
          ]
        },
        {
          "score": 0.9686350226402283,
          "sentiment": 0.889,
          "count": 7,
          "label": "Diffbot",
          "uri": "https://diffbot.com/entity/EYX1i02YVPsuT7fPLUYgRhQ",
          "rdfTypes": [
            "http://dbpedia.org/ontology/Organisation"
          ]
        },
        {
          "score": 0.9306924939155579,
          "sentiment": 0,
          "count": 2,
          "label": "Michigan",
          "uri": "https://diffbot.com/entity/E2eIrTt0jPUmGmuV6N2O3KQ",
          "rdfTypes": [
            "http://dbpedia.org/ontology/Place",
            "http://dbpedia.org/ontology/PopulatedPlace",
            "http://dbpedia.org/ontology/State"
          ]
        },
        {
          "score": 0.9025880098342896,
          "sentiment": 0,
          "count": 1,
          "label": "Paul Katsen",
          "uri": "https://diffbot.com/entity/EqUim_ci0ObmrK2gZM3UfNA",
          "rdfTypes": [
            "http://dbpedia.org/ontology/Person"
          ]
        },
        {
          "score": 0.8933213353157043,
          "sentiment": 0.48,
          "count": 4,
          "label": "Katy Perry",
          "uri": "https://diffbot.com/entity/E_6rhi_PEOD6vGencwOxd2A",
          "rdfTypes": [
            "http://dbpedia.org/ontology/Person"
          ]
        },
        {
          "score": 0.8848651051521301,
          "sentiment": 0,
          "count": 4,
          "label": "Mike Tung",
          "uri": "https://diffbot.com/entity/ESGMaGV9uP0SuTmfPTtNEoA",
          "rdfTypes": [
            "http://dbpedia.org/ontology/Person"
          ]
        },
        {
          "score": 0.8562507629394531,
          "sentiment": 0,
          "count": 4,
          "label": "Google",
          "uri": "https://diffbot.com/entity/EUFq-3WlpNsq0pvfUYWXOEA",
          "rdfTypes": [
            "http://dbpedia.org/ontology/Organisation"
          ]
        },
        {
          "score": 0.7750672101974487,
          "sentiment": 0,
          "count": 2,
          "label": "Alaska",
          "uri": "https://diffbot.com/entity/E4odwkG_xMNeZTbHrnNrojA",
          "rdfTypes": [
            "http://dbpedia.org/ontology/Place",
            "http://dbpedia.org/ontology/PopulatedPlace",
            "http://dbpedia.org/ontology/State"
          ]
        },
        {
          "score": 0.7653270959854126,
          "sentiment": 0,
          "count": 1,
          "label": "Zola",
          "uri": "https://diffbot.com/entity/E0qGTA2o5NjaeezggjMsoVw",
          "rdfTypes": [
            "http://dbpedia.org/ontology/Organisation"
          ]
        },
        {
          "score": 0.7643865942955017,
          "sentiment": 0.75,
          "count": 3,
          "label": "GUID Partition Table",
          "uri": "https://diffbot.com/entity/EReKbXuSJMYmoM8lawtgEsA",
          "rdfTypes": [
            "http://dbpedia.org/ontology/Skill",
            "http://dbpedia.org/ontology/Activity"
          ]
        }
      ],
      "publisherCountry": "United States",
      "humanLanguage": "en",
      "authorUrl": "https://www.technologyreview.com/author/will-douglas-heaven/",
      "pageUrl": "https://www.technologyreview.com/2020/09/04/1008156/knowledge-graph-ai-reads-web-machine-learning-natural-language-processing/",
      "html": "<figure><img alt=\"knowledge graph illustration\" sizes=\"(max-width: 32rem) 472px,(max-width: 48rem) 728px,(max-width: 64rem) 808px,(max-width: 80rem) 1064px,(max-width: 90rem) 1126px,1080px\" src=\"https://wp.technologyreview.com/wp-content/uploads/2020/09/knowledge-graph2_web.jpg?fit=2252,1266\" srcset=\"https://wp.technologyreview.com/wp-content/uploads/2020/09/knowledge-graph2_web.jpg?fit=944,530 944w,https://wp.technologyreview.com/wp-content/uploads/2020/09/knowledge-graph2_web.jpg?fit=472,265 472w,https://wp.technologyreview.com/wp-content/uploads/2020/09/knowledge-graph2_web.jpg?fit=1456,818 1456w,https://wp.technologyreview.com/wp-content/uploads/2020/09/knowledge-graph2_web.jpg?fit=728,409 728w,https://wp.technologyreview.com/wp-content/uploads/2020/09/knowledge-graph2_web.jpg?fit=1616,908 1616w,https://wp.technologyreview.com/wp-content/uploads/2020/09/knowledge-graph2_web.jpg?fit=808,454 808w,https://wp.technologyreview.com/wp-content/uploads/2020/09/knowledge-graph2_web.jpg?fit=2128,1196 2128w,https://wp.technologyreview.com/wp-content/uploads/2020/09/knowledge-graph2_web.jpg?fit=1064,598 1064w,https://wp.technologyreview.com/wp-content/uploads/2020/09/knowledge-graph2_web.jpg?fit=2252,1266 2252w,https://wp.technologyreview.com/wp-content/uploads/2020/09/knowledge-graph2_web.jpg?fit=1126,633 1126w,https://wp.technologyreview.com/wp-content/uploads/2020/09/knowledge-graph2_web.jpg?fit=2160,1214 2160w,https://wp.technologyreview.com/wp-content/uploads/2020/09/knowledge-graph2_web.jpg?fit=1080,607 1080w\"></img></figure>\n<p>Back in July, OpenAI&rsquo;s <a href=\"https://www.technologyreview.com/2020/07/20/1005454/openai-machine-learning-language-generator-gpt-3-nlp/\">latest language model, GPT-3</a>, dazzled with its ability to churn out paragraphs that look as if they could have been written by a human. 
People started showing off how GPT-3 could also autocomplete code or fill in blanks in spreadsheets.</p>\n<p>In one example, Twitter employee Paul Katsen tweeted &ldquo;the spreadsheet function to rule them all,&rdquo; in which<a href=\"https://twitter.com/pavtalk/status/1285410751092416513\"> GPT-3 fills out columns</a> by itself, pulling in data for US states: the population of Michigan is 10.3 million, Alaska became a state in 1906, and so on.</p>\n<p>Except that GPT-3 can be a bit of a bullshitter. The population of Michigan has never been 10.3 million, and Alaska became a state in 1959.</p>\n<p>Language models like GPT-3 are <a href=\"https://www.technologyreview.com/2020/07/31/1005876/natural-language-processing-evaluation-ai-opinion/\">amazing mimics</a>, but they have little sense of what they&rsquo;re actually saying. &ldquo;They&rsquo;re really good at generating stories about unicorns,&rdquo; says Mike Tung, CEO of Stanford startup Diffbot. &ldquo;But they&rsquo;re not trained to be factual.&rdquo;</p>\n<p>This is a problem if we want <a href=\"https://forms.technologyreview.com/in-machines-we-trust/\">AIs to be trustworthy</a>. That&rsquo;s why Diffbot takes a different approach. It is building an AI that reads every page on the entire public web, in multiple languages, and extracts as many facts from those pages as it can.</p>\n<p>Like GPT-3, Diffbot&rsquo;s system learns by vacuuming up vast amounts of human-written text found online. But instead of using that data to train a language model, Diffbot turns what it reads into a series of three-part factoids that relate one thing to another: subject, verb, object.</p>\n<p>Pointed at <a href=\"https://www.technologyreview.com/author/will-douglas-heaven/\">my bio</a>, for example, Diffbot learns that Will Douglas Heaven is a journalist; Will Douglas Heaven works at MIT Technology Review; MIT Technology Review is a media company; and so on. 
Each of these factoids gets joined up with billions of others in a sprawling, interconnected network of facts. This is known as a knowledge graph.</p>\n<p>Knowledge graphs are not new. They have been around for decades, and were a fundamental concept in early AI research. But constructing and maintaining knowledge graphs has typically been done by hand, which is hard. This also stopped Tim Berners-Lee from realizing what he called the semantic web, which would have included information for machines as well as humans, so that bots could book our flights, do our shopping, or give smarter answers to questions than search engines.</p>\n<p>A few years ago, Google started using knowledge graphs too. Search for &ldquo;Katy Perry&rdquo; and you will get a box next to the main search results telling you that Katy Perry is an American singer-songwriter with music available on YouTube, Spotify, and Deezer. You can see at a glance that she is married to Orlando Bloom, she&rsquo;s 35 and worth $125 million, and so on. Instead of giving you a list of links to pages about Katy Perry, Google gives you a set of facts about her drawn from its knowledge graph.</p>\n<p>But Google only does this for its most popular search terms. Diffbot wants to do it for everything. By fully automating the construction process, Diffbot has been able to build what may be the largest knowledge graph ever.</p>\n<p>Alongside Google and Microsoft, it is one of only three US companies that crawl the entire public web. &ldquo;It definitely makes sense to crawl the web,&rdquo; says Victoria Lin, a research scientist at Salesforce who works on natural-language processing and knowledge representation. 
&ldquo;A lot of human effort can otherwise go into making a large knowledge base.&rdquo; Heiko Paulheim at the University of Mannheim in Germany agrees: &ldquo;Automation is the only way to build large-scale knowledge graphs.&rdquo;</p>\n<h3>Super surfer</h3>\n<p>To collect its facts, Diffbot&rsquo;s AI reads the web as a human would&mdash;but much faster. Using a super-charged version of the Chrome browser, the AI views the raw pixels of a web page and uses image-recognition algorithms to categorize the page as one of 20 different types, including video, image, article, event, and discussion thread. It then identifies key elements on the page, such as headline, author, product description, or price, and uses NLP to extract facts from any text.</p>\n<p>Every three-part factoid gets added to the knowledge graph. Diffbot extracts facts from pages written in any language, which means that it can answer queries about Katy Perry, say, using facts taken from articles in Chinese or Arabic even if they do not contain the term &ldquo;Katy Perry.&rdquo;</p>\n<p>Browsing the web like a human lets the AI see the same facts that we see. It also means it has had to learn to navigate the web like us. The AI must scroll down, switch between tabs, and click away pop-ups. &ldquo;The AI has to play the web like a video game just to experience the pages,&rdquo; says Tung.</p>\n<p>Diffbot crawls the web nonstop and rebuilds its knowledge graph every four to five days. According to Tung, the AI adds 100 million to 150 million entities each month as new people pop up online, companies are created, and products are launched. It uses more machine-learning algorithms to fuse new facts with old, creating new connections or overwriting out-of-date ones. Diffbot has to add new hardware to its data center as the knowledge graph grows.</p>\n<p>Researchers can access Diffbot&rsquo;s knowledge graph for free. But Diffbot also has around 400 paying customers. 
The search engine DuckDuckGo uses it to generate its own Google-like boxes. Snapchat uses it to extract highlights from news pages. The popular wedding-planner app Zola uses it to help people make wedding lists, pulling in images and prices. NASDAQ, which provides information about the stock market, uses it for financial research.</p>\n<h3>Fake shoes</h3>\n<p>Adidas and Nike even use it to search the web for counterfeit shoes. A search engine will return a long list of sites that mention Nike trainers. But Diffbot lets these companies look for sites that are actually selling their shoes, rather just talking about them.</p>\n<p>For now, these companies must interact with Diffbot using code. But Tung plans to add a natural-language interface. Ultimately, he wants to build what he calls a &ldquo;universal factoid question answering system&rdquo;: an AI that could answer almost anything you asked it, with sources to back up its response.</p>\n<p>Tung and Lin agree that this kind of AI cannot be built with language models alone. But better yet would be to combine the technologies, using a language model like GPT-3 to craft a human-like front end for a know-it-all bot.</p>\n<p>Still, even an AI that has its facts straight is not necessarily smart. &ldquo;We&rsquo;re not trying to define what intelligence is, or anything like that,&rdquo; says Tung. 
&ldquo;We&rsquo;re just trying to build something useful.&rdquo;</p>\n<figure><img alt=\"NLP maps hallucinogenic experience\" sizes=\"(max-width: 32rem) 287px,(max-width: 48rem) 503px,100vw\" src=\"https://wp.technologyreview.com/wp-content/uploads/2022/03/Flower-Trip-style.jpeg?resize=1006,640\" srcset=\"https://wp.technologyreview.com/wp-content/uploads/2022/03/Flower-Trip-style.jpeg?resize=574,574 574w,https://wp.technologyreview.com/wp-content/uploads/2022/03/Flower-Trip-style.jpeg?resize=287,287 287w,https://wp.technologyreview.com/wp-content/uploads/2022/03/Flower-Trip-style.jpeg?resize=1006,640 1006w,https://wp.technologyreview.com/wp-content/uploads/2022/03/Flower-Trip-style.jpeg?resize=503,320 503w\"></img></figure>\n<figure><img alt=\"Demis Hassabis\" sizes=\"(max-width: 32rem) 287px,(max-width: 48rem) 503px,100vw\" src=\"https://wp.technologyreview.com/wp-content/uploads/2022/02/MA22_Demis-Hassabis-99-v1.jpg?resize=1006,1400\" srcset=\"https://wp.technologyreview.com/wp-content/uploads/2022/02/MA22_Demis-Hassabis-99-v1.jpg?resize=574,574 574w,https://wp.technologyreview.com/wp-content/uploads/2022/02/MA22_Demis-Hassabis-99-v1.jpg?resize=287,287 287w,https://wp.technologyreview.com/wp-content/uploads/2022/02/MA22_Demis-Hassabis-99-v1.jpg?resize=1006,1400 1006w,https://wp.technologyreview.com/wp-content/uploads/2022/02/MA22_Demis-Hassabis-99-v1.jpg?resize=503,700 503w\"></img></figure>",
      "categories": [
        {
          "score": 0.962,
          "name": "Technology & Computing",
          "id": "iabv2-596"
        },
        {
          "score": 0.962,
          "name": "Artificial Intelligence",
          "id": "iabv2-597"
        }
      ],
      "text": "Back in July, OpenAI’s latest language model, GPT-3, dazzled with its ability to churn out paragraphs that look as if they could have been written by a human. People started showing off how GPT-3 could also autocomplete code or fill in blanks in spreadsheets.\nIn one example, Twitter employee Paul Katsen tweeted “the spreadsheet function to rule them all,” in which GPT-3 fills out columns by itself, pulling in data for US states: the population of Michigan is 10.3 million, Alaska became a state in 1906, and so on.\nExcept that GPT-3 can be a bit of a bullshitter. The population of Michigan has never been 10.3 million, and Alaska became a state in 1959.\nLanguage models like GPT-3 are amazing mimics, but they have little sense of what they’re actually saying. “They’re really good at generating stories about unicorns,” says Mike Tung, CEO of Stanford startup Diffbot. “But they’re not trained to be factual.”\nThis is a problem if we want AIs to be trustworthy. That’s why Diffbot takes a different approach. It is building an AI that reads every page on the entire public web, in multiple languages, and extracts as many facts from those pages as it can.\nLike GPT-3, Diffbot’s system learns by vacuuming up vast amounts of human-written text found online. But instead of using that data to train a language model, Diffbot turns what it reads into a series of three-part factoids that relate one thing to another: subject, verb, object.\nPointed at my bio, for example, Diffbot learns that Will Douglas Heaven is a journalist; Will Douglas Heaven works at MIT Technology Review; MIT Technology Review is a media company; and so on. Each of these factoids gets joined up with billions of others in a sprawling, interconnected network of facts. This is known as a knowledge graph.\nKnowledge graphs are not new. They have been around for decades, and were a fundamental concept in early AI research. 
But constructing and maintaining knowledge graphs has typically been done by hand, which is hard. This also stopped Tim Berners-Lee from realizing what he called the semantic web, which would have included information for machines as well as humans, so that bots could book our flights, do our shopping, or give smarter answers to questions than search engines.\nA few years ago, Google started using knowledge graphs too. Search for “Katy Perry” and you will get a box next to the main search results telling you that Katy Perry is an American singer-songwriter with music available on YouTube, Spotify, and Deezer. You can see at a glance that she is married to Orlando Bloom, she’s 35 and worth $125 million, and so on. Instead of giving you a list of links to pages about Katy Perry, Google gives you a set of facts about her drawn from its knowledge graph.\nBut Google only does this for its most popular search terms. Diffbot wants to do it for everything. By fully automating the construction process, Diffbot has been able to build what may be the largest knowledge graph ever.\nAlongside Google and Microsoft, it is one of only three US companies that crawl the entire public web. “It definitely makes sense to crawl the web,” says Victoria Lin, a research scientist at Salesforce who works on natural-language processing and knowledge representation. “A lot of human effort can otherwise go into making a large knowledge base.” Heiko Paulheim at the University of Mannheim in Germany agrees: “Automation is the only way to build large-scale knowledge graphs.”\nSuper surfer\nTo collect its facts, Diffbot’s AI reads the web as a human would—but much faster. Using a super-charged version of the Chrome browser, the AI views the raw pixels of a web page and uses image-recognition algorithms to categorize the page as one of 20 different types, including video, image, article, event, and discussion thread. 
It then identifies key elements on the page, such as headline, author, product description, or price, and uses NLP to extract facts from any text.\nEvery three-part factoid gets added to the knowledge graph. Diffbot extracts facts from pages written in any language, which means that it can answer queries about Katy Perry, say, using facts taken from articles in Chinese or Arabic even if they do not contain the term “Katy Perry.”\nBrowsing the web like a human lets the AI see the same facts that we see. It also means it has had to learn to navigate the web like us. The AI must scroll down, switch between tabs, and click away pop-ups. “The AI has to play the web like a video game just to experience the pages,” says Tung.\nDiffbot crawls the web nonstop and rebuilds its knowledge graph every four to five days. According to Tung, the AI adds 100 million to 150 million entities each month as new people pop up online, companies are created, and products are launched. It uses more machine-learning algorithms to fuse new facts with old, creating new connections or overwriting out-of-date ones. Diffbot has to add new hardware to its data center as the knowledge graph grows.\nResearchers can access Diffbot’s knowledge graph for free. But Diffbot also has around 400 paying customers. The search engine DuckDuckGo uses it to generate its own Google-like boxes. Snapchat uses it to extract highlights from news pages. The popular wedding-planner app Zola uses it to help people make wedding lists, pulling in images and prices. NASDAQ, which provides information about the stock market, uses it for financial research.\nFake shoes\nAdidas and Nike even use it to search the web for counterfeit shoes. A search engine will return a long list of sites that mention Nike trainers. But Diffbot lets these companies look for sites that are actually selling their shoes, rather just talking about them.\nFor now, these companies must interact with Diffbot using code. 
But Tung plans to add a natural-language interface. Ultimately, he wants to build what he calls a “universal factoid question answering system”: an AI that could answer almost anything you asked it, with sources to back up its response.\nTung and Lin agree that this kind of AI cannot be built with language models alone. But better yet would be to combine the technologies, using a language model like GPT-3 to craft a human-like front end for a know-it-all bot.\nStill, even an AI that has its facts straight is not necessarily smart. “We’re not trying to define what intelligence is, or anything like that,” says Tung. “We’re just trying to build something useful.”",
      "authors": [
        {
          "name": "Will Douglas Heavenarchive page",
          "link": "technologyreview.com/author/will-douglas-heaven"
        }
      ]
    }
  ],
  "type": "article",
  "title": "This know-it-all AI learns by reading the entire web nonstop | MIT Technology Review"
}
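To illustrate how the structure above might be consumed, here is a short Python sketch that picks the single article out of the `objects` array and lists its tag labels by relevance. The `response` dict is an abbreviated copy of the example (most fields omitted), and the helper names `first_article` and `tag_labels` are hypothetical.

```python
from typing import List, Optional

# Abbreviated copy of the example response above; most fields are omitted.
response = {
    "request": {"api": "article", "version": 3},
    "objects": [
        {
            "type": "article",
            "title": "This know-it-all AI learns by reading the entire web nonstop",
            "author": "Will Douglas Heaven",
            "tags": [
                {"label": "artificial intelligence", "score": 0.998680055141449, "count": 10},
                {"label": "Diffbot", "score": 0.9686350226402283, "count": 7},
                {"label": "Michigan", "score": 0.9306924939155579, "count": 2},
            ],
        }
    ],
}


def first_article(payload: dict) -> Optional[dict]:
    """Return the first entry of the objects array, or None if it is empty."""
    objects = payload.get("objects") or []
    return objects[0] if objects else None


def tag_labels(article: dict, min_score: float = 0.5) -> List[str]:
    """Return tag labels with score >= min_score, most relevant first."""
    tags = [t for t in article.get("tags", []) if t.get("score", 0.0) >= min_score]
    tags.sort(key=lambda t: t.get("score", 0.0), reverse=True)
    return [t["label"] for t in tags]


article = first_article(response)
if article is not None:
    print(article["title"], "by", article.get("author"))
    print(tag_labels(article, min_score=0.95))
```

Filtering by `score` on the client is a complement to the `tagConfidence` argument, which applies a similar threshold on the server side.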

Optional Fields

Specify each field desired (comma delimited) in the &fields= argument. In addition to the fields listed below, more fields are available with all Extract APIs.

Field	Description
`quotes`	Returns quotes found in the article text and who said them. For English-language text only.
`naturalLanguage`	Runs extracted text and title through the Diffbot Natural Language API. Example: &naturalLanguage=entities,facts,categories,sentiment.
`summaryNumSentences`	Sets the maximum number of sentences for summary generation when using naturalLanguage=summary (Default: 3).
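The comma-delimited format can be assembled programmatically. The sketch below, in Python, joins a list of optional fields and adds the `naturalLanguage` and `summaryNumSentences` arguments; the helper name `optional_field_params` is hypothetical, and `links` is borrowed from the `fields=querystring,links` example in the query parameter reference.

```python
from typing import Dict, Iterable, Optional
from urllib.parse import urlencode


def optional_field_params(
    fields: Iterable[str],
    natural_language: Optional[Iterable[str]] = None,
    summary_sentences: Optional[int] = None,
) -> Dict[str, str]:
    """Build the query arguments that request optional fields."""
    params = {"fields": ",".join(fields)}
    if natural_language is not None:
        params["naturalLanguage"] = ",".join(natural_language)
    if summary_sentences is not None:
        params["summaryNumSentences"] = str(summary_sentences)
    return params


params = optional_field_params(
    ["quotes", "links"],
    natural_language=["entities", "summary"],
    summary_sentences=2,
)
# safe="," keeps the commas readable instead of encoding them as %2C.
query_string = urlencode(params, safe=",")
print(query_string)
```

The resulting string would be appended to the request URL alongside the required `url` argument.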

Already have the source HTML? POST it to Article API.

Article API supports a POST option that allows you to upload HTML or plain text for extraction. See Extract Content Not Available Online.
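A rough sketch of such a POST in Python follows. It only constructs the request object and does not send it. Several details here are assumptions rather than facts stated on this page: the `token` parameter name, passing `url` in the query string alongside the uploaded body, and the `text/html` content type. The `https://example.com/draft-article` address is a hypothetical placeholder. Consult Extract Content Not Available Online for the authoritative requirements.

```python
from urllib.parse import urlencode
from urllib.request import Request

ARTICLE_ENDPOINT = "https://api.diffbot.com/v3/article"


def build_article_post(page_url: str, token: str, html: str) -> Request:
    """Build (but do not send) a POST request that uploads HTML for extraction."""
    # Assumption: token and url travel in the query string, markup in the body.
    query = urlencode({"token": token, "url": page_url})
    return Request(
        f"{ARTICLE_ENDPOINT}?{query}",
        data=html.encode("utf-8"),
        headers={"Content-Type": "text/html"},  # assumed; plain text would differ
        method="POST",
    )


post_request = build_article_post(
    "https://example.com/draft-article",  # hypothetical page URL
    token="YOUR_TOKEN",  # placeholder, not a real credential
    html="<html><body><h1>Draft</h1><p>Body text.</p></body></html>",
)
print(post_request.get_method(), post_request.full_url)
```

Sending it would be a matter of passing `post_request` to `urllib.request.urlopen` and decoding the JSON body.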

Extracting Comments

Article API will attempt to extract comments from article pages by default. Using integrated functionality from the Discussion API, comment data will be returned in the discussion object (nested within the primary article object). The full syntax for discussion data is available in the Discussion API documentation.

Comment extraction can be disabled using the argument discussion=false. Note that if a page has recently been processed by Diffbot, cached comments may be returned even if discussion=false is passed.

Query Params

url

string

required

Defaults to https://www.technologyreview.com/2020/09/04/1008156/knowledge-graph-ai-reads-web-machine-learning-natural-language-processing/

Target URL to extract

fields

string

enum

Specify optional fields to be returned from any fully-extracted pages (e.g. fields=querystring,links)

timeout

int32

Sets a value in milliseconds to wait for the retrieval/fetch of content from the requested URL. The default timeout for the third-party response is 30 seconds (30000).

callback

string

Use for JSONP requests. Needed for cross-domain AJAX.

proxy

string

Specify an IP address of a custom proxy that will be used to fetch the target page. (Ex: &proxy or &proxy=0.0.0.0)

proxyAuth

string

Used to specify the authentication parameters that will be used with a custom proxy specified in the &proxy parameter. (Ex: proxyAuth=username:password)

useProxy

string

Pass useProxy=default to use Diffbot's datacenter proxy for this request. Pass useProxy=none to instruct Extract not to use proxies, even if proxies have been enabled for this particular URL globally.

paging

boolean

Pass paging=false to disable automatic concatenation of multiple-page articles.

maxTags

int32

Set the maximum number of automatically-generated tags to return. (Default: 10)

tagConfidence

float

Set the minimum relevance score of tags to return, between 0.0 and 1.0. By default only tags with a score equal to or above 0.5 will be returned.

categoryConfidence

float

Set the minimum relevance score of categories to return, between 0.0 and 1.0. By default only categories with a score equal to or above 0.5 will be returned.

discussion

boolean

Pass discussion=false to disable automatic extraction of article comments.

naturalLanguage

string

enum

Run extracted text and title through the Diffbot Natural Language API. Example: &naturalLanguage=entities,facts,categories,sentiment.

summaryNumSentences

int32

Sets the maximum number of sentences for summary generation when using naturalLanguage=summary (Default: 3).

renderDelay

integer

≤ 180000

Adds extra time, in milliseconds, for rendering before the page is closed and the DOM is extracted. This can cause page timeouts, so the timeout parameter may need to be raised as well. Note that the renderer closes automatically at 180 seconds (180000).

scroll

string

enum

Direct the browser to scroll down the page, to trigger lazy-loaded content.

Responses

200 Successful API Response

500 Internal Server Error
