Automatically structure and extract entire threads of reviews/comments from articles, product pages, and forum threads.
The Discussion API automatically structures and extracts entire threads or lists of reviews/comments from most discussion pages, forums, and similarly structured web pages.
Test drive Discussion API without a trial token at diffbot.com/testdrive.
Response
The Discussion API returns data in JSON format.
Each response includes a request object (which returns request-specific metadata) and an objects array, which includes the extracted information for all objects on a submitted page.
The Discussion API also comes bundled with the Article and Product APIs (to extract comments or review data when available). Discussion data in those APIs is returned within a nested discussion object instead of an objects array.
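The two response shapes described above can be handled with one helper. The following is a minimal sketch, not an official client: it assumes an Article or Product response carries the nested discussion object inside each item of its own objects array, and the function name collect_discussions is hypothetical.

```python
from typing import Any, Dict, List


def collect_discussions(response: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return every discussion object found in a parsed JSON response."""
    found = []
    for obj in response.get("objects", []):
        if obj.get("type") == "discussion":
            # Discussion API: the discussion is a direct member of "objects".
            found.append(obj)
        elif isinstance(obj.get("discussion"), dict):
            # Assumed Article/Product shape: discussion nested in the object.
            found.append(obj["discussion"])
    return found
```

Calling collect_discussions on either kind of response yields a plain list, so downstream code does not need to know which API produced the data.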
Objects in the Discussion API's objects array / discussion object will include the following fields:
Field | Description |
---|---|
type | Type of object (always discussion). |
pageUrl | URL of submitted page / page from which the discussion is extracted. |
resolvedPageUrl | Returned if the pageUrl redirects to another URL. |
title | Title of the discussion. |
numPosts | Number of individual posts in the thread. |
posts | Array of individual posts. |
↳type | Type of element (always post). |
↳id | ID of the individual post. The first post of a thread will have an ID of 0. |
↳parentId | ID of the parent, if the post is a reply or response. |
↳text | Full text of the extracted post. |
↳html | Diffbot-normalized HTML of the extracted post. Please see Normalized HTML Fields for a breakdown of elements and attributes returned. |
↳tags | If the post is long enough, an array of tags generated from its specific content. |
↳humanLanguage | Spoken/human language of the post, using two-letter ISO 639-1 nomenclature. |
↳images | If any images are detected within post content, they will be returned in a separate array. Individual array fields are the same as the Article API's images array. |
↳date | Date of post, normalized in most cases to RFC 1123 (HTTP/1.1). |
↳author | Name/username of the post author. |
↳authorUrl | URL of the author profile page, if available. |
↳pageUrl | URL of the page on which the post was found. |
↳diffbotUri | Internal ID used for indexing. |
tags | Array of tags/entities as generated from analysis of all extracted posts and cross-referenced with DBpedia and other data sources. |
participants | Number of unique participants in the discussion thread or comments. |
numPages | Number of pages in the thread concatenated to form the posts response. Use maxPages to define how many pages to concatenate. More on automatic concatenation. |
nextPage | If discussion spans multiple pages, nextPage will return the subsequent page URL. |
nextPages | Array of all page URLs concatenated in a multipage discussion. More on automatic concatenation. |
provider | Discussion service provider (e.g., Disqus, Facebook), if known. |
humanLanguage | Spoken/human language of the discussion / comment thread, using two-letter ISO 639-1 nomenclature. |
rssUrl | URL of the discussion's RSS feed, if available. |
diffbotUri | Unique object ID. The diffbotUri is generated from the values of various Discussion fields and uniquely identifies the object. This can be used for deduplication. |
Optional fields, available using fields= argument | |
sentiment | Returns a sentiment score of each individual post, a value ranging from -1.0 (very negative) to 1.0 (very positive). |
links | Returns a top-level object (links) containing all hyperlinks found on the page. |
meta | Returns a top-level object (meta) containing the full contents of page meta tags, including sub-arrays for OpenGraph tags, Twitter Card metadata, schema.org microdata, and, if available, oEmbed metadata. |
querystring | Returns any key/value pairs present in the URL querystring. Items without a discrete value will be returned as true. |
breadcrumb | Returns a top-level array (breadcrumb) of URLs and link text from page breadcrumbs. |
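Because each post carries an id and, for replies, a parentId, the reply structure of a thread can be rebuilt from the flat posts array. The sketch below is one way to do that under the field semantics in the table above; the helper names children_by_parent and reply_depth are hypothetical.

```python
from collections import defaultdict
from typing import Any, Dict, List


def children_by_parent(posts: List[Dict[str, Any]]) -> Dict[int, List[int]]:
    """Map each parent post ID to the IDs of its direct replies."""
    children = defaultdict(list)
    for post in posts:
        parent = post.get("parentId")
        # Compare against None: a parentId of 0 is valid (reply to first post).
        if parent is not None:
            children[parent].append(post["id"])
    return dict(children)


def reply_depth(post_id: int, posts: List[Dict[str, Any]]) -> int:
    """Count how many ancestors a post has (0 for a top-level post)."""
    parents = {p["id"]: p.get("parentId") for p in posts}
    depth, seen = 0, {post_id}
    while parents.get(post_id) is not None:
        post_id = parents[post_id]
        if post_id in seen:  # guard against malformed, cyclic data
            break
        seen.add(post_id)
        depth += 1
    return depth
```

The depth value can be used, for example, to indent posts when printing a thread.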
The following is an example response from a successful extraction of comments on a Reddit post.
{
"request": {
"pageUrl": "https://old.reddit.com/r/dataisbeautiful/comments/tbvdhu/oc_66_of_top_50_russian_exposed_companies_have/",
"api": "discussion",
"version": 3
},
"objects": [
{
"numPages": 1,
"humanLanguage": "en",
"confidence": 0.05500000089407453,
"diffbotUri": "discussion|3|-870809033",
"pageUrl": "https://old.reddit.com/r/dataisbeautiful/comments/tbvdhu/oc_66_of_top_50_russian_exposed_companies_have/",
"numPosts": 13,
"type": "discussion",
"title": "[OC] 66% of Top 50 Russian Exposed Companies Have Announced Sanctions",
"posts": [
{
"date": "Fri, 11 Mar 2022 00:00:00 GMT",
"images": [
{
"naturalHeight": 767,
"width": 457,
"diffbotUri": "image|3|-804821395",
"pageUrl": "https://old.reddit.com/r/dataisbeautiful/comments/tbvdhu/oc_66_of_top_50_russian_exposed_companies_have/",
"url": "https://preview.redd.it/l76k59t8jsm81.png?width=457&auto=webp&s=632efa1f24e607358bbec99c161a6aa579aebfe1",
"naturalWidth": 457,
"height": 767
}
],
"humanLanguage": "en",
"author": "hicheoo",
"authorUrl": "https://old.reddit.com/user/hicheoo",
"diffbotUri": "post|3|29462830",
"html": "<figure><a href=\"https://i.redd.it/l76k59t8jsm81.png\"><img src=\"https://preview.redd.it/l76k59t8jsm81.png?width=457&auto=webp&s=632efa1f24e607358bbec99c161a6aa579aebfe1\"></img></a></figure>\n<h2>Want to add to the discussion?</h2>\n<p>Post a comment!</p>\n<p>Create an account</p>",
"pageUrl": "https://old.reddit.com/r/dataisbeautiful/comments/tbvdhu/oc_66_of_top_50_russian_exposed_companies_have/",
"id": 0,
"text": "Want to add to the discussion?\nPost a comment!\n\n \nCreate an account",
"type": "post",
"title": "[OC] 66% of Top 50 Russian Exposed Companies Have Announced Sanctions"
},
{
"date": "Fri, 11 Mar 2022 00:00:00 GMT",
"humanLanguage": "en",
"author": "not_mig",
"authorUrl": "https://old.reddit.com/user/not_mig",
"diffbotUri": "post|3|-720375378",
"html": "<p>What's the difference between blue, yellow, and green?</p>",
"pageUrl": "https://old.reddit.com/r/dataisbeautiful/comments/tbvdhu/oc_66_of_top_50_russian_exposed_companies_have/",
"id": 1,
"text": "What's the difference between blue, yellow, and green?",
"type": "post",
"parentId": 0
},
{
"date": "Fri, 11 Mar 2022 00:00:00 GMT",
"humanLanguage": "en",
"author": "hicheoo",
"authorUrl": "https://old.reddit.com/user/hicheoo",
"diffbotUri": "post|3|-148816221",
"html": "<p>They're exemptions. I should've clarified up top, but they're basically all in the description.</p>\n<p>Green: Typical Sanctions<br>\n Yellow: Sanctions, but might be a PR move.<br>\n Blue: Healthcare</p>",
"pageUrl": "https://old.reddit.com/r/dataisbeautiful/comments/tbvdhu/oc_66_of_top_50_russian_exposed_companies_have/",
"id": 2,
"text": "They're exemptions. I should've clarified up top, but they're basically all in the description.\nGreen: Typical Sanctions\nYellow: Sanctions, but might be a PR move.\nBlue: Healthcare",
"type": "post",
"parentId": 1
},
{
"date": "Fri, 11 Mar 2022 00:00:00 GMT",
"humanLanguage": "en",
"author": "Zealousideal-Lie7255",
"authorUrl": "https://old.reddit.com/user/Zealousideal-Lie7255",
"diffbotUri": "post|3|-683402068",
"html": "<p>A lot of oil service companies have no reported sanctions. Like Schlumberger, Baker Hughes. Some Chinese companies too.</p>",
"pageUrl": "https://old.reddit.com/r/dataisbeautiful/comments/tbvdhu/oc_66_of_top_50_russian_exposed_companies_have/",
"id": 3,
"text": "A lot of oil service companies have no reported sanctions. Like Schlumberger, Baker Hughes. Some Chinese companies too.",
"type": "post",
"parentId": 0
},
{
"date": "Fri, 11 Mar 2022 00:00:00 GMT",
"humanLanguage": "en",
"author": "varnima",
"authorUrl": "https://old.reddit.com/user/varnima",
"diffbotUri": "post|3|-603833918",
"html": "<p>JetBrains changed and imposed sanctions <a href=\"https://blog.jetbrains.com/blog/2022/03/11/jetbrains-statement-on-ukraine/\">https://blog.jetbrains.com/blog/2022/03/11/jetbrains-statement-on-ukraine/</a></p>",
"pageUrl": "https://old.reddit.com/r/dataisbeautiful/comments/tbvdhu/oc_66_of_top_50_russian_exposed_companies_have/",
"id": 4,
"text": "JetBrains changed and imposed sanctions https://blog.jetbrains.com/blog/2022/03/11/jetbrains-statement-on-ukraine/",
"type": "post",
"parentId": 0
},
{
"date": "Fri, 11 Mar 2022 00:00:00 GMT",
"humanLanguage": "en",
"author": "hicheoo",
"authorUrl": "https://old.reddit.com/user/hicheoo",
"diffbotUri": "post|3|-296888207",
"html": "<p>Yeah, they're green in the chart.</p>",
"pageUrl": "https://old.reddit.com/r/dataisbeautiful/comments/tbvdhu/oc_66_of_top_50_russian_exposed_companies_have/",
"id": 5,
"text": "Yeah, they're green in the chart.",
"type": "post",
"parentId": 4
},
{
"date": "Fri, 11 Mar 2022 00:00:00 GMT",
"humanLanguage": "en",
"author": "hicheoo",
"authorUrl": "https://old.reddit.com/user/hicheoo",
"diffbotUri": "post|3|624793084",
"html": "<p><strong>Sources:</strong> - Diffbot Sanctions Tracker (<a href=\"https://www.diffbot.com/insights/every-company-affected-by-sanctions/\">https://www.diffbot.com/insights/every-company-affected-by-sanctions/</a>) - Diffbot Knowledge Graph (more detail on query below)</p>\n<p><strong>Data Viz Tool:</strong> Infogram</p>\n<p><strong>Disclaimer:</strong> I work for Diffbot</p>\n<p>I started by querying the Knowledge Graph for people who live in Russia but work for a non-Russian company. Faceting this query by their employer provides me with a list of non-Russian companies ranked by # of Russian employees.</p>\n<p><code>\ntype:Person location.country.name:\"Russia\" employments.{employer.{location.country.name!=\"Russia\" nbLocations>0} isCurrent:true} facet:employments.{employer.name isCurrent:true}\n</code></p>\n<p>This data underrepresents actual employment figures, as there are many employees who do not maintain an internet presence linking them to their employer. Underrepresentation should be fairly equal across all companies, and relative position in the rankings should be accurate.</p>",
"pageUrl": "https://old.reddit.com/r/dataisbeautiful/comments/tbvdhu/oc_66_of_top_50_russian_exposed_companies_have/",
"id": 6,
"text": "Sources: - Diffbot Sanctions Tracker (https://www.diffbot.com/insights/every-company-affected-by-sanctions/) - Diffbot Knowledge Graph (more detail on query below)\nData Viz Tool: Infogram\nDisclaimer: I work for Diffbot\nI started by querying the Knowledge Graph for people who live in Russia but work for a non-Russian company. Faceting this query by their employer provides me with a list of non-Russian companies ranked by # of Russian employees.\ntype:Person location.country.name:\"Russia\" employments.{employer.{location.country.name!=\"Russia\" nbLocations>0} isCurrent:true} facet:employments.{employer.name isCurrent:true}\nThis data underrepresents actual employment figures, as there are many employees who do not maintain an internet presence linking them to their employer. Underrepresentation should be fairly equal across all companies, and relative position in the rankings should be accurate.",
"type": "post",
"parentId": 0
},
{
"date": "Fri, 11 Mar 2022 00:00:00 GMT",
"humanLanguage": "en",
"author": "zzzmick",
"authorUrl": "https://old.reddit.com/user/zzzmick",
"diffbotUri": "post|3|-130810969",
"html": "<p>epam had over 10k employees in Russia</p>",
"pageUrl": "https://old.reddit.com/r/dataisbeautiful/comments/tbvdhu/oc_66_of_top_50_russian_exposed_companies_have/",
"id": 7,
"text": "epam had over 10k employees in Russia",
"type": "post",
"parentId": 0
},
{
"date": "Fri, 11 Mar 2022 00:00:00 GMT",
"humanLanguage": "en",
"author": "hicheoo",
"authorUrl": "https://old.reddit.com/user/hicheoo",
"diffbotUri": "post|3|-1458692070",
"html": "<p>Yup. The data underrepresents actual employment figures, as there are many employees who do not maintain an internet presence linking them to their employer. Underrepresentation should be fairly equal across all companies, and relative position in the rankings should be accurate.</p>",
"pageUrl": "https://old.reddit.com/r/dataisbeautiful/comments/tbvdhu/oc_66_of_top_50_russian_exposed_companies_have/",
"id": 8,
"text": "Yup. The data underrepresents actual employment figures, as there are many employees who do not maintain an internet presence linking them to their employer. Underrepresentation should be fairly equal across all companies, and relative position in the rankings should be accurate.",
"type": "post",
"parentId": 7
},
{
"date": "Fri, 11 Mar 2022 00:00:00 GMT",
"humanLanguage": "en",
"author": "JanitorKarl",
"authorUrl": "https://old.reddit.com/user/JanitorKarl",
"diffbotUri": "post|3|-149138223",
"html": "<p>Schlumberger and Baker Hughes are both in the oilfield services industry.</p>",
"pageUrl": "https://old.reddit.com/r/dataisbeautiful/comments/tbvdhu/oc_66_of_top_50_russian_exposed_companies_have/",
"id": 9,
"text": "Schlumberger and Baker Hughes are both in the oilfield services industry.",
"type": "post",
"parentId": 0
},
{
"date": "Fri, 11 Mar 2022 00:00:00 GMT",
"humanLanguage": "en",
"author": "flumenia",
"authorUrl": "https://old.reddit.com/user/flumenia",
"diffbotUri": "post|3|889762151",
"html": "<p>What if Microsoft stops to extend licenses of Microsoft Office to Russia? That would make the biggest impact, I guess</p>",
"pageUrl": "https://old.reddit.com/r/dataisbeautiful/comments/tbvdhu/oc_66_of_top_50_russian_exposed_companies_have/",
"id": 10,
"text": "What if Microsoft stops to extend licenses of Microsoft Office to Russia? That would make the biggest impact, I guess",
"type": "post",
"parentId": 0
},
{
"date": "Fri, 11 Mar 2022 00:00:00 GMT",
"humanLanguage": "en",
"author": "Imperial_Empirical",
"authorUrl": "https://old.reddit.com/user/Imperial_Empirical",
"diffbotUri": "post|3|-179317804",
"html": "<p>Putin ordered the development of Russian alternatives after the Crimean annexation due to dependancy/spying fears. I believe from 2016 onwards Microsoft was largely fased out internally.</p>",
"pageUrl": "https://old.reddit.com/r/dataisbeautiful/comments/tbvdhu/oc_66_of_top_50_russian_exposed_companies_have/",
"id": 11,
"text": "Putin ordered the development of Russian alternatives after the Crimean annexation due to dependancy/spying fears. I believe from 2016 onwards Microsoft was largely fased out internally.",
"type": "post",
"parentId": 10
},
{
"date": "Fri, 11 Mar 2022 00:00:00 GMT",
"humanLanguage": "en",
"author": "Nightblood83",
"authorUrl": "https://old.reddit.com/user/Nightblood83",
"diffbotUri": "post|3|-901046006",
"html": "<p>A lot of accountants for commies...</p>",
"pageUrl": "https://old.reddit.com/r/dataisbeautiful/comments/tbvdhu/oc_66_of_top_50_russian_exposed_companies_have/",
"id": 12,
"text": "A lot of accountants for commies...",
"type": "post",
"parentId": 0
}
],
"tags": [
{
"score": 0.8428076505661011,
"count": 5,
"label": "economic sanctions",
"uri": "https://diffbot.com/entity/EWnXSPtH6Osi0pmx8-WPKAg",
"rdfTypes": [
"http://dbpedia.org/ontology/Miscellaneous"
]
}
],
"participants": 9,
"rssUrl": "https://old.reddit.com/r/dataisbeautiful/comments/tbvdhu/oc_66_of_top_50_russian_exposed_companies_have/.rss"
}
]
}
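A response like the one above can be summarized with the standard library alone. The sketch below parses the RFC 1123 date strings with email.utils.parsedate_to_datetime and counts posts per author; the helper names parse_post_date and summarize are hypothetical. Since dates are normalized only "in most cases", unparseable values are returned as None rather than raising.

```python
from collections import Counter
from datetime import datetime
from email.utils import parsedate_to_datetime
from typing import Any, Dict, Optional


def parse_post_date(value: str) -> Optional[datetime]:
    """Parse an RFC 1123 date string; return None if it cannot be parsed."""
    try:
        return parsedate_to_datetime(value)
    except (TypeError, ValueError):
        # Older Python versions raise TypeError, newer ones ValueError.
        return None


def summarize(discussion: Dict[str, Any]) -> Dict[str, Any]:
    """Count posts per author and the number of distinct authors."""
    posts = discussion.get("posts", [])
    per_author = Counter(p["author"] for p in posts if p.get("author"))
    return {
        "posts_per_author": dict(per_author),
        "unique_authors": len(per_author),
    }
```

In the example response, the 13 posts carry 9 distinct author values, which matches the reported participants value of 9.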
Optional Fields
Specify each desired field (comma-delimited) in the &fields= argument. In addition to the fields listed below, more fields are available with all Extract APIs.
Field | Description |
---|---|
sentiment | Returns a sentiment score of each individual post, a value ranging from -1.0 (very negative) to 1.0 (very positive). |
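As a rough illustration of requesting optional fields, the sketch below builds a request URL with a comma-delimited fields value and averages per-post sentiment scores. Several details are assumptions not stated in this section: the endpoint URL, the token and url parameter names, and the per-post key name sentiment. The function names are hypothetical, and nothing is sent over the network.

```python
from typing import Any, Dict, Iterable, List, Optional
from urllib.parse import urlencode

# Assumed endpoint; check your account documentation before relying on it.
ENDPOINT = "https://api.diffbot.com/v3/discussion"


def build_request_url(token: str, page_url: str,
                      fields: Iterable[str] = (),
                      max_pages: Optional[int] = None) -> str:
    """Build a GET URL; "token" and "url" are assumed parameter names."""
    params: Dict[str, Any] = {"token": token, "url": page_url}
    field_list = list(fields)
    if field_list:
        params["fields"] = ",".join(field_list)  # comma-delimited
    if max_pages is not None:
        params["maxPages"] = max_pages
    return ENDPOINT + "?" + urlencode(params)


def mean_sentiment(posts: List[Dict[str, Any]]) -> Optional[float]:
    """Average the per-post scores (assumed key: "sentiment"), if any."""
    scores = [p["sentiment"] for p in posts
              if isinstance(p.get("sentiment"), (int, float))]
    return sum(scores) / len(scores) if scores else None
```

For instance, build_request_url("TOKEN", "https://example.com/thread", fields=["sentiment", "links"]) produces a URL whose fields parameter decodes to sentiment,links.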
Already have the source HTML? POST it to Discussion API.
Discussion API supports a POST option that allows you to upload HTML or plain text for extraction. See Extract Content Not Available Online.
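The POST option might be prepared as follows. This is a sketch under assumptions: the endpoint URL, the token and url parameter names, and the use of a Content-Type header of text/html (or text/plain for plain text) are not stated in this section, and build_post_request is a hypothetical helper. The request object is only constructed here, not sent.

```python
import urllib.request
from urllib.parse import urlencode

# Assumed endpoint; not given in this section.
ENDPOINT = "https://api.diffbot.com/v3/discussion"


def build_post_request(token: str, page_url: str, markup: str,
                       content_type: str = "text/html") -> urllib.request.Request:
    """Prepare a POST request that carries the markup as the request body."""
    query = urlencode({"token": token, "url": page_url})
    return urllib.request.Request(
        ENDPOINT + "?" + query,
        data=markup.encode("utf-8"),  # body must be bytes
        headers={"Content-Type": content_type},
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen would send it; the response body could then be decoded with json.loads.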