In some cases, such as when crawling or processing data from certain sites, you may need to diversify the IP addresses of your requests. When that happens, you can use Diffbot’s fleet of proxy IPs to retrieve results more consistently.
Our default proxy servers are usable for most sites and at most volumes. Usage of these proxies incurs an additional API call for each page processed: each page processed using a proxy will count as two API calls.
Our dynamic proxy servers effectively offer a new IP address for each request, and are usable for even the most difficult-to-crawl sites. Usage of dynamic proxy servers is limited to Professional or Enterprise customers, and pricing is dependent on data volume. Contact [email protected] for more information.
Diffbot Extract also supports the use of third party proxies. In fact, third party proxies are recommended for tighter control of Diffbot Extract responses when dependability is paramount. See below for usage instructions.
Default proxies may be enabled for any Extract API request by adding the &proxy parameter along with the rest of your request. A single proxy-enabled request will consume two credits.
Some popular sites always require proxies.
These domains have proxies enabled globally and will automatically consume two credits on each call unless a different proxy rule is specified.
The table below outlines all available options for using proxies with Extract APIs.
| Parameter | Description |
|---|---|
| proxy | Leave value empty to use default proxies, or specify an IP address of a third party proxy that will be used to fetch the target page, instead of Diffbot's default IPs/proxies. (Ex: |
| proxyAuth | Used to specify the authentication parameters that will be used with a custom proxy specified in the |
| useProxy=none | Don't use proxies, even if proxies have been enabled for this particular URL globally. |
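As a rough illustration of the options in the table, the sketch below assembles request URLs for each case. The endpoint `https://api.diffbot.com/v3/article`, the `token` and `url` parameter names, the `user:pass` form of the `proxyAuth` value, and the helper `build_extract_url` are all assumptions made for this example and are not taken from this page, so check them against your own API reference.

```python
from urllib.parse import urlencode

# Assumed endpoint; substitute the Extract API you actually call.
EXTRACT_ENDPOINT = "https://api.diffbot.com/v3/article"


def build_extract_url(token, page_url, proxy=None, proxy_auth=None, disable_proxy=False):
    """Return an Extract request URL with one proxy option applied (hypothetical helper)."""
    params = {"token": token, "url": page_url}  # assumed parameter names
    if disable_proxy:
        # Opt out even when proxies are enabled globally for the domain.
        params["useProxy"] = "none"
    elif proxy is not None:
        # An empty string selects default proxies; an address selects a third party proxy.
        params["proxy"] = proxy
        if proxy_auth:
            # Assumed "user:pass" format for the credentials.
            params["proxyAuth"] = proxy_auth
    return EXTRACT_ENDPOINT + "?" + urlencode(params)


default_proxy_url = build_extract_url("TOKEN", "https://example.com/post", proxy="")
custom_proxy_url = build_extract_url(
    "TOKEN", "https://example.com/post", proxy="127.0.0.1:1234", proxy_auth="user:pass"
)
no_proxy_url = build_extract_url("TOKEN", "https://example.com/post", disable_proxy=True)
```

Keeping the three options in one helper makes it harder to send `proxy` and `useProxy=none` together by accident.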
Details on your account’s proxy usage will be available via our Account API, in your Developer Dashboard, and in your monthly invoices.
Note that the use of proxies will likely increase the response time of individual API calls. See suggestions for improving API response times.
Proxies are not a "get out of jail free" card. Even the highest quality proxies will eventually be blocked if usage isn't controlled.
Think of proxies as essentially a pool of available computers to make requests from. Without proxies enabled, you start with the default machine. This is the machine everyone else in the Diffbot ecosystem will be making requests from as well, so it's prone to blocks by the largest websites.
By enabling the &proxy parameter in your request, your requests will start from a second, less-trafficked machine that's theoretically less likely to be blocked.
In other words, the fewer requests that are made through a machine, the less likely it is to be blocked by websites.
The more the proxy machine is used, however, the more likely it is to get throttled.
The best approach for consistently extracting sites with strong rate limiting rules is to rotate your proxies. A simple technique is to go without proxies until you need one (i.e. when you receive 400, 403, or 500 errors), then retry the request with a proxy.
While there are certainly more advanced approaches, this technique is both simple and economical to deploy.
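The fallback technique described above can be sketched roughly as follows. The `fetch` callable, its `(status, body)` return shape, and the function name `extract_with_proxy_fallback` are hypothetical stand-ins for however your script calls Extract; only the 400, 403, and 500 triggers come from the text.

```python
# Error codes the text above names as a signal to retry with a proxy.
RETRY_STATUSES = {400, 403, 500}


def extract_with_proxy_fallback(fetch, page_url):
    """Try without a proxy first, then retry once through a proxy if blocked.

    `fetch` is a hypothetical callable: fetch(page_url, use_proxy) -> (status, body).
    Returns (status, body, used_proxy).
    """
    status, body = fetch(page_url, False)
    if status in RETRY_STATUSES:
        # The direct request looks blocked, so pay for a single proxied retry.
        status, body = fetch(page_url, True)
        return status, body, True
    return status, body, False


# Stand-in fetcher that is blocked unless a proxy is used.
calls = []


def fake_fetch(page_url, use_proxy):
    calls.append(use_proxy)
    return (200, "ok") if use_proxy else (403, "blocked")


result = extract_with_proxy_fallback(fake_fetch, "https://example.com/post")
```

Because the proxied request only happens after a failure, the extra credit is spent only on pages that actually need it.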
In most cases, you should see results immediately upon adding a proxy to your request. In the case of third party proxies, Diffbot Extract will not tell you whether the proxy connection was made successfully, so it is recommended to validate the proxy connection with a cURL request.
# A sample cURL statement using proxies
curl -x "127.0.0.1:1234" -U "user:pass" "https://www.diffbot.com"
In rare cases, Diffbot's renderer may not be fully compatible with a third party proxy. You can diagnose this when a cURL request through the proxy succeeds and returns the full HTML source, but the Diffbot Extract request fails (generally returning 400, 403, or 500 errors).
In such cases, rewrite your script to make two requests. The first downloads the full HTML source using your third party proxy; the second sends that HTML as the body of a POST to Diffbot Extract. This technique bypasses Diffbot Extract's renderer entirely: Extract rebuilds the page from the provided HTML and extracts the contents from there. See Extract Content Not Available Online for more details.
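A minimal sketch of the two-request approach, using only Python's standard library, might look like this. The endpoint, the `token` and `url` parameters, and the `text/html` content type are assumptions made for this example; the helper names are hypothetical. Confirm the exact POST requirements in Extract Content Not Available Online.

```python
import urllib.request
from urllib.parse import urlencode

# Assumed endpoint; substitute the Extract API you actually call.
EXTRACT_ENDPOINT = "https://api.diffbot.com/v3/article"


def fetch_html_via_proxy(page_url, proxy_url):
    """Request 1: download the raw HTML through a third party proxy.

    `proxy_url` is something like "http://user:pass@127.0.0.1:1234".
    """
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    opener = urllib.request.build_opener(handler)
    with opener.open(page_url, timeout=30) as response:
        return response.read()


def build_extract_post(token, page_url, html_bytes):
    """Request 2: build a POST that carries the HTML as the request body."""
    query = urlencode({"token": token, "url": page_url})  # assumed parameter names
    return urllib.request.Request(
        EXTRACT_ENDPOINT + "?" + query,
        data=html_bytes,
        headers={"Content-Type": "text/html"},  # assumed content type
        method="POST",
    )


# Build the second request from already-downloaded HTML (nothing is sent here).
post_request = build_extract_post(
    "TOKEN", "https://example.com/post", b"<html><body>Hello</body></html>"
)
```

In a real script you would pass the bytes returned by `fetch_html_via_proxy` into `build_extract_post`, then send the result with `urllib.request.urlopen(post_request)`.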