Error 404: Could Not Download Page

This error means the target website is slow to load, is completely down, or is blocking Diffbot's servers.

    "errorCode": 404,
    "error": "Could not download page (404)"

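When calling the API from code rather than a browser, this error can be detected from the parsed response body. The following is a minimal sketch based only on the two fields shown above; the helper name isDownloadFailure is hypothetical and not part of any Diffbot library.

```javascript
// Hypothetical helper: reports whether a parsed API response carries
// the "Could not download page" error shown above.
function isDownloadFailure(response) {
  return Boolean(response) &&
    response.errorCode === 404 &&
    typeof response.error === 'string' &&
    response.error.startsWith('Could not download page');
}

// Sample response built from the fields quoted in this article.
const sample = { errorCode: 404, error: 'Could not download page (404)' };
console.log(isDownloadFailure(sample)); // true
```
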
Quick Test

To check which case we’re dealing with, first try to process the page with a regular API call.

Paste the API request URL into your browser’s URL bar, replacing TOKEN with your token, ENCODED_URL with the URL-encoded address of the page you want to test, and APINAME with the desired API (“product” for products, “article” for articles, etc.).

If the request is successful, then the page is back up and should work. If it fails but the page opens in your browser, then the site is blocking Diffbot.
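
As a rough illustration of the quick test, the sketch below assembles a request URL from the three placeholders. The endpoint pattern in API_BASE is an assumption, since the original URL is not shown here, and buildTestUrl is a hypothetical helper name; check both against your Diffbot documentation before relying on them.

```javascript
// Assumed endpoint pattern (not confirmed by this article).
const API_BASE = 'https://api.diffbot.com/v3/';

// Hypothetical helper: fills in APINAME, TOKEN and ENCODED_URL.
function buildTestUrl(apiName, token, pageUrl) {
  // encodeURIComponent produces the ENCODED_URL form of the target page.
  return API_BASE + encodeURIComponent(apiName) +
    '?token=' + encodeURIComponent(token) +
    '&url=' + encodeURIComponent(pageUrl);
}

// Example: test the Article API against a placeholder page.
console.log(buildTestUrl('article', 'TOKEN', 'https://example.com/post?id=1'));
```

The printed URL can be pasted into the browser as described above.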

To try to fix this, here are a few methods you can mix and match.

Apply Proxies

Diffbot's default server might be getting blocked for hitting the target website too many times. This happens occasionally and can be fixed by rotating to a different set of IP addresses, called proxies. Apply proxies by adding &proxy to the request.

If this works, turn on the “Use Proxies” switch in your crawl / bulk settings and try again. Note that each proxied call counts as two calls.
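
A small sketch of adding the flag programmatically, assuming the request URL was built as in the quick test. The helper name withProxy and the sample URL are illustrative assumptions; only the &proxy parameter itself comes from the text above.

```javascript
// Hypothetical helper: appends the proxy flag once, leaving the URL
// unchanged if the flag is already present.
function withProxy(requestUrl) {
  if (/[?&]proxy(=|&|$)/.test(requestUrl)) return requestUrl;
  const separator = requestUrl.includes('?') ? '&' : '?';
  return requestUrl + separator + 'proxy';
}

// Assumed request URL with the placeholders left in place.
const request = 'https://api.diffbot.com/v3/article?token=TOKEN&url=ENCODED_URL';
console.log(withProxy(request));
```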

Apply a Render Delay

Some sites may need more time to fully load and render before they are ready for extraction. We can tell Extract API to wait by applying a render delay and/or adjusting the timeout threshold.

  • renderDelay tells Extract API to wait a specified time before executing.
  • timeout tells Extract API to give up after a total length of time rendering and processing a page.

We can test for a rendering issue by applying a conservatively large delay (10s) and timeout (50s, up from 30s).

If the issue is resolved, adjust the renderDelay and timeout values until you arrive at a combination that is acceptable for production use.
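
The test described above could be sketched as follows. The parameter names renderDelay and timeout come from this article; expressing their values in milliseconds is an assumption to verify, and withRenderSettings is a hypothetical helper name.

```javascript
// Hypothetical helper: adds renderDelay and timeout to a request URL.
// Values are assumed to be in milliseconds.
function withRenderSettings(requestUrl, renderDelayMs, timeoutMs) {
  // A delay at or above the timeout would leave no time to extract.
  if (renderDelayMs >= timeoutMs) {
    throw new RangeError('renderDelay must be shorter than timeout');
  }
  const separator = requestUrl.includes('?') ? '&' : '?';
  return requestUrl + separator +
    'renderDelay=' + renderDelayMs + '&timeout=' + timeoutMs;
}

// Conservative starting point from the text: 10s delay, 50s timeout.
const slowPageRequest = withRenderSettings(
  'https://api.diffbot.com/v3/article?token=TOKEN&url=ENCODED_URL', // assumed URL
  10000,
  50000
);
console.log(slowPageRequest);
```

Once the page extracts reliably, lower both numbers step by step to find the smallest values that still work.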

Check Your Custom JavaScript (X-Evaluate Scripts)

If you're running any Custom JavaScript on your Extract/Custom API, try these measures:

  • Check whether the site can run X-eval at all by injecting something simple, such as a script that creates a new element with dummy content, then targeting that element with a Custom API selector to confirm the injection succeeded. If OK, go to next step.
  • Check if the script has the start() and end() functions. If OK, go to next step.
  • Check that the script without start() and end() runs in a browser’s regular console. If OK, go to next step.
  • Check that the script doesn’t take longer than 60 seconds to execute. If OK, go to next step.
  • Avoid using JavaScript classes where a literal exists. Instead of var re = new RegExp('ab+c'); use var re = /ab+c/;
  • Use try/catch blocks to isolate the running and failing parts of the script, line by line, until you find the line on which it fails. You can put the simple injection script from step 1 after every line to confirm that the line executed.
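
The first and last measures can be combined into one debugging aid, sketched below. The function names, the xeval-marker class, and the way the helpers take the page's document as an argument are all illustrative assumptions; in an actual X-evaluate script they would be called between start() and end() with the real document.

```javascript
// Step 1: inject an element with dummy content that a Custom API
// selector (here, the hypothetical ".xeval-marker") can pick up.
function injectMarker(doc, text) {
  var marker = doc.createElement('div');
  marker.className = 'xeval-marker';
  marker.textContent = text;
  doc.body.appendChild(marker);
  return marker;
}

// Last step: run each part of the script in its own try/catch and
// leave a marker after it, so the markers show how far the script got.
function runSteps(doc, steps) {
  for (var i = 0; i < steps.length; i++) {
    try {
      steps[i]();
      injectMarker(doc, 'step ' + (i + 1) + ' ok');
    } catch (err) {
      injectMarker(doc, 'step ' + (i + 1) + ' failed: ' + err.message);
      return i; // index of the first failing step
    }
  }
  return -1; // every step ran without throwing
}

// Intended use inside an X-evaluate script (assumed wrapper shape):
//   start();
//   runSteps(document, [
//     function () { var re = /ab+c/; /* regex literal, not new RegExp */ },
//     function () { /* next part of your script */ }
//   ]);
//   end();
```

Splitting the script into small functions keeps each try/catch narrow, which makes the failing part easier to spot in the extracted markers.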

Still Not Working?

Let us help! Share your troubleshooting attempts with us at [email protected].