Crawling behind login walls
There are many authentication schemes on the web, but two of the most common are username+password HTML forms and HTTP basic authentication.
HTML Forms
Form-based authentication works by setting a cookie in your browser: after a successful login, the server returns it in a Set-Cookie response header. Here is a full tutorial on how to use login cookies to access content behind login walls in individual APIs. Follow the same procedure for retrieving the login cookie.
If you're using the old Diffbot dashboard to create the crawljob, place the cookie value into the Cookie field:
If you're using the new dashboard, use the "Custom headers" text field and add the Cookie as a single line, like so:
Cookie:SomeKey=SomeValue...
Save the crawljob and it will use this cookie when crawling.
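If the login response sets more than one cookie, they need to be combined into the single line shown above. The following is a minimal sketch of that step, assuming you have already copied the raw Set-Cookie values from the login response. The function name cookie_header_line and the sample cookie names and values are hypothetical and only illustrate the format.

```python
from http.cookies import SimpleCookie


def cookie_header_line(set_cookie_values):
    """Collapse raw Set-Cookie values into one single-line Cookie header.

    Attributes such as Path or HttpOnly are dropped, because only the
    name=value pairs are sent back to the server.
    """
    pairs = []
    for raw in set_cookie_values:
        jar = SimpleCookie()
        jar.load(raw)  # parses "name=value; Path=/; HttpOnly"
        for name, morsel in jar.items():
            pairs.append(f"{name}={morsel.value}")
    # Same shape as the example above: "Cookie:" followed by the pairs.
    return "Cookie:" + "; ".join(pairs)


# Hypothetical values copied from a login response.
line = cookie_header_line([
    "sessionid=abc123; Path=/; HttpOnly",
    "csrftoken=xyz789; Path=/",
])
print(line)
```

The printed line can be pasted into the "Custom headers" field as is.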
HTTP Basic
For HTTP Basic login, the browser sends an Authorization header computed from the username and password. The header has the format Authorization: Basic $hash, where $hash is the Base64 encoding of the string $username:$password. Despite the name, $hash is a reversible encoding, not a cryptographic hash, so treat the header as you would the password itself.
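The header can be computed without a browser. Here is a minimal sketch using only the Python standard library; the function name basic_auth_header is hypothetical, and the credentials are the well-known sample pair from the HTTP Basic specification, not real ones.

```python
import base64


def basic_auth_header(username, password):
    """Build the Authorization header line for HTTP Basic authentication."""
    # Join the credentials with a colon, then Base64-encode the bytes.
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return f"Authorization: Basic {token}"


# Sample credentials for illustration only.
header = basic_auth_header("Aladdin", "open sesame")
print(header)  # Authorization: Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ==
```

Because the first colon separates the username from the password, the username itself must not contain a colon.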
Learn more about basic authentication.
Once you have the Authorization header, supply it via the Custom Headers field in Crawlbot's UI or via the Crawlbot API to perform authenticated crawling.