Proxy management
IP address blocking is one of the oldest and most effective ways of preventing access to a website. It is therefore paramount for a good web scraping library to provide easy to use but powerful tools which can work around IP blocking. The most powerful weapon in your anti IP blocking arsenal is a proxy server.
With the Apify SDK, you can use your own proxy servers, proxy servers acquired from third-party providers, or you can rely on Apify Proxy for your scraping needs.
Quick start
If you want to use Apify Proxy locally, make sure that you run your Actors via the Apify CLI and that you are logged in with your Apify account in the CLI.
Using Apify Proxy
proxy_configuration = await Actor.create_proxy_configuration()
proxy_url = await proxy_configuration.new_url()
Using your own proxies
proxy_configuration = await Actor.create_proxy_configuration(
proxy_urls=[
'http://proxy-1.com',
'http://proxy-2.com',
],
)
proxy_url = await proxy_configuration.new_url()
Proxy Configuration
All your proxy needs are managed by the ProxyConfiguration
class.
You create an instance using the Actor.create_proxy_configuration()
method.
Then you generate proxy URLs using the ProxyConfiguration.new_url()
method.
Apify Proxy vs. your own proxies
The ProxyConfiguration
class covers both Apify Proxy and custom proxy URLs,
so that you can easily switch between proxy providers.
However, some features of the class are available only to Apify Proxy users,
mainly because Apify Proxy is what one would call a super-proxy.
It's not a single proxy server, but an API endpoint that allows connection
through millions of different IP addresses.
So the class essentially has two modes: Apify Proxy or Your proxy.
The difference is easy to remember.
Using the proxy_url
or new_url_function
arguments enables use of your custom proxy URLs,
whereas all the other options are there to configure Apify Proxy.
Visit the Apify Proxy docs for more info on how these parameters work.
IP Rotation and session management
proxyConfiguration.new_url()
allows you to pass a session_id
parameter.
It will then be used to create a session_id
-proxy_url
pair,
and subsequent new_url()
calls with the same session_id
will always return the same proxy_url
.
This is extremely useful in scraping, because you want to create the impression of a real user.
When no session_id
is provided, your custom proxy URLs are rotated round-robin,
whereas Apify Proxy manages their rotation using black magic to get the best performance.
proxy_configuration = await Actor.create_proxy_configuration(
proxy_urls=[
'http://proxy-1.com',
'http://proxy-2.com',
],
)
proxy_url = await proxy_configuration.new_url() # http://proxy-1.com
proxy_url = await proxy_configuration.new_url() # http://proxy-2.com
proxy_url = await proxy_configuration.new_url() # http://proxy-1.com
proxy_url = await proxy_configuration.new_url() # http://proxy-2.com
proxy_url = await proxy_configuration.new_url(session_id='a') # http://proxy-1.com
proxy_url = await proxy_configuration.new_url(session_id='b') # http://proxy-2.com
proxy_url = await proxy_configuration.new_url(session_id='b') # http://proxy-2.com
proxy_url = await proxy_configuration.new_url(session_id='a') # http://proxy-1.com
Apify Proxy Configuration
With Apify Proxy, you can select specific proxy groups to use, or countries to connect from. This allows you to get better proxy performance after some initial research.
proxy_configuration = await Actor.create_proxy_configuration(
groups=['RESIDENTIAL'],
country_code='US',
)
proxy_url = await proxy_configuration.new_url()
Now your connections using proxy_url will use only Residential proxies from the US. Note that you must first get access to a proxy group before you are able to use it. You can find your available proxy groups in the proxy dashboard.
If you don't specify any proxy groups, automatic proxy selection will be used.
Your own proxy configuration
There are two options how to make ProxyConfiguration
work with your own proxies.
Either you can pass it a list of your own proxy servers:
proxy_configuration = await Actor.create_proxy_configuration(
proxy_urls=[
'http://proxy-1.com',
'http://proxy-2.com',
],
)
proxy_url = await proxy_configuration.new_url()
Or you can pass it a method (accepting one optional argument, the session ID), to generate proxy URLs automatically:
def custom_new_url_function(session_id: Optional[str] = None) -> str:
if session_id is not None:
return f'http://my-custom-proxy-supporting-sessions.com?session-id={session_id}
return 'http://my-custom-proxy-not-supporting-sessions.com'
proxy_configuration = await Actor.create_proxy_configuration(
new_url_function = custom_new_url_function,
)
proxy_url_with_session = await proxy_configuration.new_url('a')
proxy_url_without_Session = await proxy_configuration.new_url()
Configuring proxy based on Actor input
To make selecting the proxies that the Actor uses easier,
you can use an input field with the editor proxy
in your input schema.
This input will then be filled with a dictionary containing the proxy settings you or the users of your Actor selected for the Actor run.
You can then use that input to create the proxy configuration:
actor_input = await Actor.get_input() or {}
proxy_settings = actor_input.get('proxySettings')
proxy_configuration = await Actor.create_proxy_configuration(actor_proxy_input=proxy_settings)
proxy_url = await proxy_configuration.new_url()
Using the generated proxy URLs
Requests
To use the generated proxy URLs with the requests
library,
use the proxies
argument:
proxy_configuration = await Actor.create_proxy_configuration()
proxy_url = await proxy_configuration.new_url()
proxies = {
'http': proxy_url,
'https': proxy_url,
}
response = requests.get('http://example.com', proxies=proxies)
# --- OR ---
with requests.Session() as session:
session.proxies.update(proxies)
response = session.get('http://example.com')
HTTPX
To use the generated proxy URLs with the httpx
library,
use the proxies
argument:
proxy_configuration = await Actor.create_proxy_configuration()
proxy_url = await proxy_configuration.new_url()
response = httpx.get('http://example.com', proxy=proxy_url)
# --- OR ---
async with httpx.AsyncClient(proxy=proxy_url) as httpx_client:
response = await httpx_client.get('http://example.com')