How Diffbot handles multi-page articles and discussions
Diffbot’s Article and Discussion APIs support automatic page concatenation: the ability to string together multiple pages into a single response.
By default, the Article API automatically concatenates multi-page articles (up to twenty pages total) into single ‘text’ and ‘html’ responses, and collects media items from all pages into the ‘images’ and ‘videos’ arrays.
To disable this functionality, pass paging=false in your Article API request.
The Discussion API does not concatenate by default. To enable concatenation, use the maxPages argument to set the maximum number of pages returned in a response. Use maxPages=all to return all pages regardless of length.
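As a rough illustration of the two arguments above, the sketch below builds request URLs for both APIs. It assumes the v3 endpoint layout (`https://api.diffbot.com/v3/article` and `.../discussion`) with `token` and `url` query parameters; the helper name `build_request_url`, the placeholder token, and the example.com pages are hypothetical.

```python
from urllib.parse import urlencode

# Assumed v3 endpoint base; confirm against your own API reference.
API_BASE = "https://api.diffbot.com/v3"


def build_request_url(api, token, page_url, **options):
    """Return a GET URL for the given API name ("article" or "discussion").

    Extra keyword arguments (e.g. paging, maxPages) are appended as
    query parameters after the token and the page URL.
    """
    params = {"token": token, "url": page_url}
    params.update(options)
    return f"{API_BASE}/{api}?{urlencode(params)}"


# Article API: concatenation is on by default, so switch it off explicitly.
article_url = build_request_url(
    "article", "YOUR_TOKEN", "https://example.com/story", paging="false"
)

# Discussion API: concatenation is off by default, so request all pages.
discussion_url = build_request_url(
    "discussion", "YOUR_TOKEN", "https://example.com/thread", maxPages="all"
)
```

Passing a number instead (for example `maxPages=5`) would cap the response at that many pages.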
When an article or discussion thread has had multiple pages concatenated, you will see two additional fields in your default response:
- numPages: the total number of pages concatenated to form the full output
- nextPages: a list of the additional URLs that were extracted
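A minimal sketch of reading these two fields from a parsed JSON response follows. It assumes the fields appear on each entry of a top-level `objects` array, and that a response without them was not concatenated (treated here as a single page). The sample payload and the helper name `pagination_info` are made up for illustration.

```python
def pagination_info(response):
    """Return a (numPages, nextPages) tuple for each object in a response.

    Assumption: objects that were not concatenated omit both fields,
    so they are reported as one page with no additional URLs.
    """
    results = []
    for obj in response.get("objects", []):
        num_pages = int(obj.get("numPages", 1))
        next_pages = list(obj.get("nextPages", []))
        results.append((num_pages, next_pages))
    return results


# Hypothetical payload for a three-page article.
sample_response = {
    "objects": [
        {
            "type": "article",
            "numPages": 3,
            "nextPages": [
                "https://example.com/story?page=2",
                "https://example.com/story?page=3",
            ],
        }
    ]
}

for num_pages, next_pages in pagination_info(sample_response):
    print(f"{num_pages} page(s) concatenated; extra URLs: {next_pages}")
```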
Pagination not working as expected?
On occasion, a site’s unique pagination design or terminology will confuse our concatenator. In this case, you can add concatenation for a particular article or discussion page using our Custom API. To set one up:
- Create a new Custom API for your page
- Create a new custom field named
- Select the element that contains the link to the next page.
- Add an “attribute” filter using the Filters drop-down, and in this field enter href to make sure the URL value is returned.
A few notes:
- This method only works for the Article and Discussion APIs.
Sometimes sites don’t identify the next-page link with a unique CSS selector (particularly sites that link to individually numbered pages).
For instance, an older layout of Slate.com used the same class, .sl-art-pag-link, for all links to individual pages, even pages prior to the current page. Using this class alone could result in multiple nextPage values and an infinite processing loop.
Our concatenation algorithm will generally prevent infinite loops and repeated content, but writing better CSS selectors will ensure the best performance. In this case, using the following selector will ensure that only the correct next page is identified:
.sl-art-curpage + .sl-art-pag-link
This uses the plus-sign (adjacent sibling) combinator to match only the page link that is immediately preceded by the current-page element (.sl-art-curpage). This ensures that only the next page, if it exists, is identified.
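To make the effect of the combinator concrete, the sketch below mimics `.sl-art-curpage + .sl-art-pag-link` using only the Python standard library. It is not how Diffbot evaluates selectors. The pagination markup is a hypothetical reconstruction, since the original Slate layout is not shown here, and `next_page_links` is an illustrative helper.

```python
import xml.etree.ElementTree as ET


def next_page_links(markup):
    """Return hrefs of .sl-art-pag-link elements whose immediately
    preceding sibling element has the class .sl-art-curpage."""
    root = ET.fromstring(markup)
    hrefs = []
    for parent in root.iter():
        children = list(parent)  # element children only, in document order
        for previous, current in zip(children, children[1:]):
            previous_classes = previous.get("class", "").split()
            current_classes = current.get("class", "").split()
            if ("sl-art-curpage" in previous_classes
                    and "sl-art-pag-link" in current_classes):
                hrefs.append(current.get("href"))
    return hrefs


# Hypothetical pagination bar: the reader is currently on page 2.
SAMPLE = """<div>
  <a class="sl-art-pag-link" href="/story?page=1">1</a>
  <span class="sl-art-curpage">2</span>
  <a class="sl-art-pag-link" href="/story?page=3">3</a>
  <a class="sl-art-pag-link" href="/story?page=4">4</a>
</div>"""

print(next_page_links(SAMPLE))
```

Matching on the class alone would have returned all three links here, including the one pointing back to page 1.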