This tutorial will guide you through creating a simple crawl job.
Use the maxHops parameter to specify the depth of your crawl from your seed URL(s). A value of 0 limits crawling to your seed URL(s) only; maxHops=1 will spider all links found on your seed URL(s); maxHops=2 will also follow the links on those pages; and so on. A maxHops value of -1 (the default) will spider all links at any depth. This can be used in conjunction with URL crawl patterns to fine-tune your crawl further.
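As a sketch of how a crawl might be configured with this parameter, the snippet below builds the querystring for a hypothetical crawl-creation request; the token, crawl name, and seed URL are placeholders, and only the request parameters are assembled (nothing is sent over the network):

```python
from urllib.parse import urlencode

# Placeholder values -- substitute your own token, crawl name, and seed(s).
params = {
    "token": "YOUR_DIFFBOT_TOKEN",   # hypothetical token
    "name": "sampleCrawl",           # hypothetical crawl name
    "seeds": "https://example.com",  # seed URL(s)
    "maxHops": 1,                    # follow links one hop from the seeds
}

# Build the request querystring locally, without making any API call.
query = urlencode(params)
print(query)
```

With maxHops set to 1, the crawl would process the seed page plus every page it links to, and stop there.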
Crawlbot serves as a controller for sending pages to the appropriate Diffbot API for processing/extraction. By default, these are generic requests to that API, and they return the API's default fields.
For example, Crawlbot URLs handed to the Article API will be processed as standard Article API calls, with no additional parameters.
You can adjust individual API fields returned or the parameters of extraction API requests via the Crawlbot querystring field.
For example, to specify certain fields and adjust the timeout value in your Article API requests, enter timeout=10000&fields=title,text,meta in the querystring field. This will pass &timeout=10000&fields=title,text,meta in each Article API request.
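To illustrate, here is a minimal sketch of how a querystring value like the one above could be appended to an individual per-page Article API call; the endpoint path, token, and page URL are assumptions for the example, not values from this tutorial:

```python
from urllib.parse import urlencode

def build_article_request(page_url, token, querystring=""):
    """Append a crawl's querystring field to a per-page Article API call."""
    base = "https://api.diffbot.com/v3/article"  # assumed Article API endpoint
    query = urlencode({"token": token, "url": page_url})
    if querystring:
        # e.g. "timeout=10000&fields=title,text,meta" from the querystring field
        query += "&" + querystring
    return base + "?" + query

# Hypothetical page and token, with the querystring from the example above.
url = build_article_request(
    "https://example.com/post",
    "YOUR_TOKEN",
    "timeout=10000&fields=title,text,meta",
)
print(url)
```

Every page the crawl hands to the Article API would carry the same extra parameters, so the fields and timeout apply uniformly across the crawl.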