Avoid rate limiting or throttling responses when extracting from certain websites.
In some cases, such as when crawling or processing data from certain sites, you may need to diversify the IP addresses of your requests. When that happens, you can use Diffbot’s fleet of proxy IPs to retrieve results more consistently.
Diffbot Offers Two Levels of Proxies
- Our default proxy servers are usable for most sites and at most volumes. Usage of these proxies incurs an additional API call for each page processed: each page processed using a proxy will count as two API calls.
- Our dynamic proxy servers effectively offer a new IP address for each request, and are usable for even the most difficult-to-crawl sites. Usage of dynamic proxy servers is limited to Professional or Enterprise customers, and pricing is dependent on data volume. Contact [email protected] for more information.
Bring Your Own Proxy
Diffbot Extract also supports the use of third party proxies. Third party proxies are recommended when dependability is paramount, because they give you tighter control over Diffbot Extract responses. See below for usage instructions.
Diffbot May Apply Proxies Globally
When we identify that our default IPs are being blocked by a site, we apply a proxy pool globally on our backend to allow the call to succeed. API calls to a domain for which proxies have been applied globally cost 1 additional credit, the same as if the user had applied the &proxy parameter manually.
How to Use Proxies
Default proxies may be enabled for any Extract API request by adding the &proxy parameter along with the rest of your request. A single proxy-enabled request will consume two credits.
Some popular sites always require proxies.
These domains have proxies enabled globally and will automatically consume two credits on each call unless a different proxy rule is specified.
The table below outlines all available options for using proxies with Extract APIs.
Field | Description |
---|---|
proxy | Specify an IP address of a third party proxy that will be used to fetch the target page. (Ex: &proxy=0.0.0.0 ) |
proxyAuth | Used to specify the authentication parameters that will be used with a custom proxy specified in the &proxy parameter. (Ex: proxyAuth=username:password ) |
useProxy=default | Uses our default datacenter proxy for this request. This proxy doesn't require any special authentication and is a secondary measure if our primary datacenter is getting blocked. |
useProxy=none | Disable the use of proxies, even if proxies have been enabled for this particular URL globally. |
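As an illustration of how the fields in the table combine, here is a minimal Python sketch that assembles a request URL. The endpoint (`https://api.diffbot.com/v3/article`), the `token` and `url` query parameters, and the helper name `build_extract_url` are assumptions made for this example and are not defined on this page; substitute the Extract API you actually call.

```python
from urllib.parse import urlencode

# Assumed endpoint for illustration; use whichever Extract API you call.
EXTRACT_ENDPOINT = "https://api.diffbot.com/v3/article"


def build_extract_url(token, page_url, proxy=None, proxy_auth=None, use_proxy=None):
    """Build an Extract request URL with the optional proxy fields from the table.

    This is a hypothetical helper, not part of any Diffbot client library.
    """
    params = {"token": token, "url": page_url}
    if proxy is not None:
        # Address of a third party proxy used to fetch the target page.
        params["proxy"] = proxy
    if proxy_auth is not None:
        if proxy is None:
            raise ValueError("proxyAuth only applies to a custom proxy set via proxy")
        # Credentials in username:password form.
        params["proxyAuth"] = proxy_auth
    if use_proxy is not None:
        if use_proxy not in ("default", "none"):
            raise ValueError("useProxy must be 'default' or 'none'")
        params["useProxy"] = use_proxy
    # urlencode percent-encodes reserved characters such as ':' and '/'.
    return f"{EXTRACT_ENDPOINT}?{urlencode(params)}"


# Opt out of a globally applied proxy for one request.
no_proxy_url = build_extract_url("YOUR_TOKEN", "https://example.com/post", use_proxy="none")

# Route the fetch through a third party proxy (placeholder address and credentials).
custom_proxy_url = build_extract_url(
    "YOUR_TOKEN",
    "https://example.com/post",
    proxy="203.0.113.10",
    proxy_auth="username:password",
)
```

Building the query string with `urlencode` rather than string concatenation keeps the colon in `username:password` and the target URL correctly escaped.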
Details on your account’s proxy usage will be available via our Account API, in your Developer Dashboard, and in your monthly invoices.
Note that the use of proxies will likely increase the response time of individual API calls. See suggestions for improving API response times.
How Proxies Work
Proxies are not a "get out of jail free" card. Even the highest quality proxies will eventually be blocked if usage isn't controlled.
Think of proxies as a pool of available computers to make requests from. Without proxies enabled, you start with the default machine. Everyone else in the Diffbot ecosystem makes requests from this machine as well, so it's prone to blocks by the largest websites.
By enabling the &proxy parameter in your request, your requests will start from a second, less-trafficked machine that's theoretically less likely to be blocked.
In other words, the fewer requests that are made through a machine, the less likely it is to be blocked by websites.
The more the proxy machine is used, however, the more likely it is to get throttled.
Best Practices
The best approach for consistent extractions of sites with strong rate limiting rules is to rotate your proxies. A simple technique is to go without proxies until you need one (that is, until you receive a 400, 403, or 500 error), then try the request again with a proxy.
While there are certainly more advanced approaches, this technique is both simple and economical to deploy.
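The retry-on-error technique described above might be sketched as follows. The function name `extract_with_fallback` and the `fetch` callable are hypothetical: `fetch(page_url, use_proxy)` stands in for whatever code performs your Extract request and is assumed to return a `(status_code, body)` pair. Passing `"default"` as the proxy setting on the retry is also an assumption; a third party proxy would work the same way.

```python
# Status codes this page lists as signs that a proxy is needed.
RETRYABLE_STATUSES = {400, 403, 500}


def extract_with_fallback(page_url, fetch):
    """Try a request without a proxy, then retry once with a proxy if it looks blocked.

    `fetch(page_url, use_proxy)` is a caller-supplied, hypothetical callable that
    performs the Extract request and returns (status_code, body).
    Returns (status_code, body, used_proxy).
    """
    status, body = fetch(page_url, None)
    if status not in RETRYABLE_STATUSES:
        # First attempt was fine: no proxy credit spent.
        return status, body, False
    # Assumption: retry through the default proxy.
    status, body = fetch(page_url, "default")
    return status, body, True


# Usage with a stand-in fetch that simulates a site blocking the default IPs.
def fake_fetch(page_url, use_proxy):
    if use_proxy is None:
        return 403, "blocked"
    return 200, "ok"


result = extract_with_fallback("https://example.com/post", fake_fetch)
```

Because the proxy is used only after a failure, the extra credit is spent only on requests that actually need it, and the proxy machine sees less traffic.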
Troubleshooting Proxies
In most cases, you should see results immediately upon adding a proxy to your request. With third party proxies, Diffbot Extract will not tell you whether the proxy connection was made successfully, so it is recommended to validate the proxy connection with a cURL request.
# A sample cURL statement using proxies
curl -x "127.0.0.1:1234" -U "user:pass" "https://www.diffbot.com"
In rare cases, Diffbot's renderer may not be fully compatible with a third party proxy. The symptom is a proxy cURL request that succeeds and returns the full HTML source, while the same Diffbot Extract request fails (generally returning 400, 403, or 500 errors).
In such cases, rewrite your script to make two requests. The first downloads the full HTML source using your third party proxy. The second sends that HTML as the body of a POST to Diffbot Extract. This technique bypasses Diffbot Extract's renderer entirely: Diffbot rebuilds the page from the provided HTML and extracts the contents from there. See Extract Content Not Available Online for more details.
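A rough Python sketch of the two-request workaround, using only the standard library, could look like this. The endpoint, the `token` and `url` query parameters, and the `Content-Type: text/html` header are assumptions for illustration (check Extract Content Not Available Online for the exact request format), and all function names are hypothetical.

```python
import urllib.request
from urllib.parse import quote, urlencode

# Assumed endpoint for illustration; use whichever Extract API you call.
EXTRACT_ENDPOINT = "https://api.diffbot.com/v3/article"


def build_proxy_opener(proxy_host, username, password):
    """Return a urllib opener that routes HTTP and HTTPS through an authenticated proxy."""
    # Percent-encode credentials so characters like '@' or ':' do not break the URL.
    proxy_url = f"http://{quote(username, safe='')}:{quote(password, safe='')}@{proxy_host}"
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


def build_extract_post(token, page_url, html):
    """Build (but do not send) the POST that hands pre-fetched HTML to Extract."""
    query = urlencode({"token": token, "url": page_url})
    return urllib.request.Request(
        f"{EXTRACT_ENDPOINT}?{query}",
        data=html.encode("utf-8"),
        headers={"Content-Type": "text/html"},  # assumed content type
        method="POST",
    )


def extract_via_own_proxy(token, page_url, proxy_host, username, password):
    """Request 1: fetch HTML through your proxy. Request 2: POST it to Extract."""
    opener = build_proxy_opener(proxy_host, username, password)
    with opener.open(page_url, timeout=30) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    with urllib.request.urlopen(build_extract_post(token, page_url, html), timeout=60) as resp:
        return resp.read().decode("utf-8")
```

Separating request construction from sending makes it possible to inspect the POST (method, headers, body) before spending any credits.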