Docs Suite

Crawling behind login-walls

There are many authentication schemes on the web, but two of the most common are username+password HTML forms and HTTP basic authentication.

HTML Forms

Form-based authentication works by setting a cookie in your browser via the Set-Cookie response header. A separate full tutorial covers how to use login cookies to access content behind login walls with the individual APIs. Follow the same procedure to retrieve the login cookie.

If you're using the old Diffbot dashboard to create the crawljob, place the cookie value into the Cookie field:

The custom headers fields in the old dashboard's UI

If you're using the new dashboard, use the "Custom headers" text field and add the Cookie as a single line, like so:

Cookie:SomeKey=SomeValue...

The custom headers field in the new dashboard

Save the crawljob and it will use this cookie when crawling.
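As a rough illustration of turning the login response into that single line, the sketch below uses Python's standard http.cookies module to reduce one or more Set-Cookie header values to a Cookie:Name=Value line. The function name cookie_header_line and the sample cookie names and values are hypothetical; your site's cookies will differ.

```python
from http.cookies import SimpleCookie


def cookie_header_line(set_cookie_values):
    """Build a single 'Cookie:...' line from Set-Cookie header values.

    Attributes such as Path, HttpOnly, or Secure are dropped, because
    only the name=value pairs are sent back to the server.
    """
    pairs = []
    for raw in set_cookie_values:
        jar = SimpleCookie()
        jar.load(raw)
        for name, morsel in jar.items():
            # coded_value keeps the value exactly as the server sent it
            pairs.append(f"{name}={morsel.coded_value}")
    return "Cookie:" + "; ".join(pairs)


# Hypothetical Set-Cookie values copied from a login response
example = cookie_header_line([
    "sessionid=abc123; Path=/; HttpOnly",
    "csrftoken=xyz789; Path=/; Secure",
])
print(example)
```

The printed line is what you would paste into the "Custom headers" field. Multiple cookies are joined with "; " so they still fit on one line.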

HTTP Basic

For HTTP Basic login, the browser sends an Authorization header computed from the username and password. The header has the format Authorization: Basic $hash, where $hash is the Base64 encoding of the string $username:$password. Despite the name, $hash is a reversible encoding rather than a cryptographic hash, so treat the header as a secret.
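A minimal sketch of that computation in Python, using only the standard library. The function name basic_auth_header is hypothetical, and the sample credentials are placeholders, not real ones.

```python
import base64


def basic_auth_header(username, password):
    """Return the full 'Authorization: Basic ...' header line."""
    # Join the credentials with a colon, then Base64-encode the bytes
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return f"Authorization: Basic {token}"


# Placeholder credentials for illustration only
header = basic_auth_header("Aladdin", "open sesame")
print(header)  # Authorization: Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ==
```

Because the username and password are split at the first colon, a username containing a colon cannot be represented in this scheme.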

Learn more about basic authentication.

Once you have the Authorization header, you can supply it via the Custom Headers field in Crawlbot's UI or via the Crawlbot API to perform authenticated crawling.
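For the API route, the sketch below only builds a form-encoded request body and does not send anything. It rests on several assumptions that this page does not state and that should be checked against the Crawlbot API reference: that a crawl job is created with fields named token, name, and seeds, and that each custom header is passed as its own customHeaders field in Name:Value form. The function name build_crawl_form and all sample values are hypothetical.

```python
from urllib.parse import parse_qs, urlencode


def build_crawl_form(token, name, seeds, headers):
    """Build a form-encoded body for creating a crawl job.

    Field names (token, name, seeds, customHeaders) are assumptions;
    verify them against the Crawlbot API reference before use.
    """
    fields = [
        ("token", token),
        ("name", name),
        ("seeds", " ".join(seeds)),  # assumed: whitespace-separated seeds
    ]
    for header_name, header_value in headers.items():
        # Assumed: one customHeaders field per header, colon-delimited
        fields.append(("customHeaders", f"{header_name}:{header_value}"))
    return urlencode(fields)


# Placeholder values for illustration only
body = build_crawl_form(
    token="YOUR_TOKEN",
    name="authenticated-crawl",
    seeds=["https://example.com/members/"],
    headers={"Authorization": "Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ=="},
)

# Decode the body again to confirm the header survives URL encoding
decoded = parse_qs(body)
print(decoded["customHeaders"])
```

Building the body separately from sending it makes it easy to inspect exactly which headers the crawl job would receive before any request is made.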

Last updated by Bruno Skvorc
Copyright © 2021 Diffbot.com