Use a collection to extract content from multiple or recurring elements on a page with a Custom API.
For instance, extracting the name and address of each result in a business directory requires a collection, and extracting multiple images from a page typically does too. (In fact, overriding the default media output of our Article API also means editing a collection.)
The first step in creating a collection is identifying the “repeating parent” of the content you wish to extract: the element that wraps each individual item. This depends entirely on the markup of the page. In the business directory example, you might have markup like the following:
<div class="business">
  <h3 class="title">Hamburger Central</h3>
  <span class="phone">650.555.5512</span>
</div>
<div class="business">
  <h3 class="title">Jim's Shake Shop</h3>
  <span class="phone">650.555.9127</span>
</div>
<div class="business">
  <h3 class="title">Steaks and More</h3>
  <span class="phone">650.555.2100</span>
</div>
In the above example, our “repeating parent” is div.business: each div with the class "business" wraps exactly one directory listing.
Once you have created your collection, it’s time to add the custom fields to be extracted from within each collection item. In the above example, we can add a field for the business name (h3.title) and one for the phone number (span.phone).
When the fields are added, your JSON response will include an array of the items on the page, and for each matching item, the fields defined within your custom collection.
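To make the idea concrete, here is a minimal local sketch in Python of what a collection does conceptually with the sample markup: find each repeating parent, then pull the fields out of it. It is not the Custom API's implementation and makes no API calls. The field names `title` and `phone`, the helper `extract_collection`, and the exact shape of the printed output are assumptions for illustration; the real JSON response may be structured differently.

```python
import json
from html.parser import HTMLParser

SAMPLE_HTML = """
<div class="business">
  <h3 class="title">Hamburger Central</h3>
  <span class="phone">650.555.5512</span>
</div>
<div class="business">
  <h3 class="title">Jim's Shake Shop</h3>
  <span class="phone">650.555.9127</span>
</div>
<div class="business">
  <h3 class="title">Steaks and More</h3>
  <span class="phone">650.555.2100</span>
</div>
"""

# (tag, class) pairs standing in for the selectors div.business,
# h3.title and span.phone. The field names are hypothetical.
PARENT = ("div", "business")
FIELDS = {"title": ("h3", "title"), "phone": ("span", "phone")}


class CollectionParser(HTMLParser):
    """Collects one dict per repeating parent, with one key per field."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._item = None    # dict for the parent currently being read
        self._depth = 0      # nesting depth of parent-type tags
        self._field = None   # field whose text is currently being captured

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self._item is None:
            # Outside any item: only a repeating parent is interesting.
            if tag == PARENT[0] and PARENT[1] in classes:
                self._item, self._depth = {}, 1
            return
        if tag == PARENT[0]:
            self._depth += 1
        for name, (field_tag, field_class) in FIELDS.items():
            if tag == field_tag and field_class in classes:
                self._field = name
                self._item[name] = ""

    def handle_data(self, data):
        if self._item is not None and self._field is not None:
            self._item[self._field] += data

    def handle_endtag(self, tag):
        if self._item is None:
            return
        if self._field is not None and tag == FIELDS[self._field][0]:
            # Field element closed: tidy the captured text.
            self._item[self._field] = self._item[self._field].strip()
            self._field = None
        elif tag == PARENT[0]:
            self._depth -= 1
            if self._depth == 0:
                # Repeating parent closed: the item is complete.
                self.items.append(self._item)
                self._item = None


def extract_collection(html):
    """Return a list of field dicts, one per repeating parent in html."""
    parser = CollectionParser()
    parser.feed(html)
    parser.close()
    return parser.items


if __name__ == "__main__":
    print(json.dumps(extract_collection(SAMPLE_HTML), indent=2))
```

Run against the sample markup, this yields one entry per business, each carrying only the two fields defined above, which mirrors the array-of-items structure described for the response.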