When crawling or processing data from certain sites, you may need to diversify the IP addresses of your requests. In this case you can use Diffbot's fleet of proxy IPs to retrieve results more consistently.
Our default proxy servers are usable for most sites and at most volumes. Usage of these proxies incurs an additional API call for each page processed: each page processed using a proxy will count as two API calls.
Our dynamic proxy servers effectively offer a new IP address for each request, and are usable for even the most difficult-to-crawl sites. Usage of dynamic proxy servers is limited to Professional or Enterprise customers, and pricing is dependent on data volume. Contact [email protected] for more information.
Default proxies may be enabled for any Extract API request by adding the `&proxy` parameter along with the rest of your request. A single proxy-enabled request will consume two credits.
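As a sketch, assuming the v3 Article endpoint and placeholder token and target-URL values, a proxy-enabled request URL can be built like this:

```python
from urllib.parse import urlencode

# Placeholder values -- substitute your own token and target page.
TOKEN = "YOUR_DIFFBOT_TOKEN"
TARGET = "https://example.com/article"

# An empty "proxy" value enables Diffbot's default proxy pool.
params = urlencode({"token": TOKEN, "url": TARGET, "proxy": ""})
request_url = "https://api.diffbot.com/v3/article?" + params

print(request_url)
```

Remember that this request will be billed as two API calls rather than one.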
Some popular sites always require proxies.
These domains have proxies enabled globally and will automatically consume two credits on each call unless a different proxy rule is specified.
The table below outlines all available options for using proxies with Extract APIs.
| Argument | Description |
| --- | --- |
| `proxy` | Leave the value empty to use default proxies, or specify the IP address of a custom proxy that will be used to fetch the target page instead of Diffbot's default IPs/proxies. |
| `proxyAuth` | Specifies the authentication parameters that will be used with a custom proxy supplied in the `proxy` argument. |
| `useproxy=none` | Don't use proxies, even if proxies have been enabled globally for this particular URL. |
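The table's options can be combined into request URLs as follows. This is a sketch with placeholder values: the proxy address uses a documentation-reserved IP, and the `user:password` credential format is illustrative, not confirmed by this document.

```python
from urllib.parse import urlencode

# Placeholder values -- substitute your own token, target page, and proxy.
TOKEN = "YOUR_DIFFBOT_TOKEN"
TARGET = "https://example.com/article"
BASE = "https://api.diffbot.com/v3/article"

# Fetch through a custom proxy, supplying credentials for that proxy.
custom_proxy_url = BASE + "?" + urlencode({
    "token": TOKEN,
    "url": TARGET,
    "proxy": "203.0.113.10:8080",   # your proxy's address (placeholder)
    "proxyAuth": "user:password",   # its credentials (placeholder format)
})

# Opt out of proxies entirely, even on globally proxy-enabled domains.
no_proxy_url = BASE + "?" + urlencode({
    "token": TOKEN,
    "url": TARGET,
    "useproxy": "none",
})
```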
Details on your account’s proxy usage will be available via our Account API, in your Developer Dashboard, and in your monthly invoices.
Note that the use of proxies will likely increase the response time of individual API calls. See our suggestions for improving API response times.
Proxies are not a "get out of jail free" card. Even the highest quality proxies will eventually be blocked if usage isn't controlled.
Think of proxies as essentially a pool of available computers to make requests from. Without proxies enabled, you start with the default machine. This is the machine everyone else in the Diffbot ecosystem will be making requests from as well, so it's prone to blocks by the largest websites.
By enabling the `&proxy` parameter in your request, your requests start from a second, less-trafficked machine that is theoretically less likely to be blocked. In other words, the fewer requests made through a machine, the less likely it is to be blocked by websites.
The more the proxy machine is used, however, the higher the likelihood that it will be throttled.
The best approach for consistently extracting sites with strong rate-limiting rules is to rotate your proxies. A simple technique is to go without proxies until a request fails, then retry that request with a proxy.
While there are certainly more advanced approaches, this technique is both simple and economical to deploy.
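The fallback technique above can be sketched as follows. The `fetch` callable and the check against an `error` field in the response are assumptions about how you wire up your HTTP client and how failures surface; adapt both to your setup.

```python
from urllib.parse import urlencode

TOKEN = "YOUR_DIFFBOT_TOKEN"   # placeholder token
BASE = "https://api.diffbot.com/v3/article"

def extract(target_url, fetch):
    """Extract a page, spending a proxy credit only when necessary.

    `fetch` is any callable mapping a request URL to the parsed JSON
    response (e.g. a thin wrapper around your HTTP client). The first
    attempt goes out without proxies (one credit); only if the response
    reports an error do we retry through the default proxies (two credits).
    """
    plain = BASE + "?" + urlencode({"token": TOKEN, "url": target_url})
    result = fetch(plain)
    if "error" not in result:
        return result
    # Retry through the default proxy pool; an empty value enables it.
    retried = BASE + "?" + urlencode(
        {"token": TOKEN, "url": target_url, "proxy": ""}
    )
    return fetch(retried)
```

Because most pages succeed without a proxy, this keeps the average cost per page close to one credit while still recovering from blocks.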