diff --git a/sources/academy/tutorials/php/using_apify_from_php.md b/sources/academy/tutorials/php/using_apify_from_php.md index c5a194ff2..d8c9b609d 100644 --- a/sources/academy/tutorials/php/using_apify_from_php.md +++ b/sources/academy/tutorials/php/using_apify_from_php.md @@ -263,7 +263,7 @@ $response = $client->get("https://api.apify.com/v2/browser-info"); echo $response->getBody(); ``` -[See the proxy docs](/platform/proxy/connection-settings) for more details on using specific proxies. +[See the proxy docs](/platform/proxy/usage) for more details on using specific proxies. ## Feedback diff --git a/sources/platform/proxy/connection_settings.md b/sources/platform/proxy/connection_settings.md deleted file mode 100644 index 8a396fcf1..000000000 --- a/sources/platform/proxy/connection_settings.md +++ /dev/null @@ -1,97 +0,0 @@ ---- -title: Connection settings -description: Learn how to connect your application to Apify Proxy. See the required parameters such as the correct username and password. -sidebar_position: 10.1 -slug: /proxy/connection-settings ---- - -# Connection settings - -**Learn how to connect your application to Apify Proxy. See the required parameters such as the correct username and password.** - ---- - -Below are the HTTP proxy connection settings for Apify Proxy. - -| Parameter | Value / explanation | -|---------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| Proxy type | `HTTP` | -| Hostname | `proxy.apify.com` | -| Port | `8000` | -| Username | Specifies the proxy parameters such as groups, [session](./index.md) and location.
See [username parameters](#username-parameters) below for details.<br/>**Note**: this is not your Apify username. |
| Password | Proxy password. Your password is displayed on the [Proxy](https://console.apify.com/proxy) page in Apify Console.<br/>In Apify [actors](../actors/index.mdx), it is passed as the `APIFY_PROXY_PASSWORD` environment variable.<br/>See the [environment variables docs](../actors/development/programming_interface/environment_variables.md) for more details. |
| Connection URL | `http://<username>:<password>@proxy.apify.com:8000` |
| Static IP Addresses | `18.208.102.16`, `35.171.134.41`. These are static IP addresses
that can be used as alternatives to `Hostname`. | - - -**WARNING:** All usage of Apify Proxy with your password is charged towards your account. Do not share the password with untrusted parties or use it from insecure networks – **the password is sent unencrypted** due to the HTTP protocol's [limitations](https://www.guru99.com/difference-http-vs-https.html). - -## Username parameters - -The `username` field enables you to pass parameters like **[groups](#proxy-groups)**, **[session](./index.md) ID** and **country** for your proxy connection. - -For example, if you're using [datacenter proxies](./datacenter_proxy/index.md) and want to use the `new_job_123` session using the `SHADER` group, the username will be: - -```text -groups-SHADER,session-new_job_123 -``` - -The table below describes the available parameters. - - - - - - - - - - - - - - - - -
| Parameter | Description |
|-----------|-------------|
| `groups` | Sets proxied requests to use servers from the selected groups. Set to `groups-[group name]` or `auto` when using datacenter proxies, to `groups-RESIDENTIAL` when using residential proxies, and to `groups-GOOGLE_SERP` when using Google SERP proxies. |
| `session` | If specified, all proxied requests with the same session identifier are routed through the same IP address, for example `session-new_job_123`. This parameter is optional. By default, each proxied request is assigned a randomly picked, least used IP address. The session string can only contain numbers (0-9), letters (a-z or A-Z), dot (.), underscore (_) and tilde (~). The maximum length is 50 characters. |
| `country` | If specified, all proxied requests use proxy servers from the selected country. Note that if there are no proxy servers from the specified country, the connection will fail. For example, `groups-SHADER,country-US` uses proxies from the `SHADER` group located in the USA. This parameter is optional. By default, the proxy uses all available proxy servers from all countries. |
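For illustration, these parameters can also be combined into a full connection URL programmatically. The following is a sketch, not part of Apify's tooling: the `build_proxy_url` helper and the `<password>` placeholder are our own, while the comma-separated parameter syntax, the `+` separator for multiple groups, the `auto` fallback, and the `proxy.apify.com:8000` endpoint come from the settings above.

```python
# Sketch: compose an Apify Proxy username from its parameters and build
# the connection URL. Parameters are comma-separated; multiple groups
# are joined with "+". With no parameters, the username falls back to
# "auto", which selects the default behavior.
def build_proxy_url(password, groups=None, session=None, country=None):
    parts = []
    if groups:
        parts.append("groups-" + "+".join(groups))
    if session:
        parts.append(f"session-{session}")
    if country:
        parts.append(f"country-{country}")
    username = ",".join(parts) or "auto"
    return f"http://{username}:{password}@proxy.apify.com:8000"

print(build_proxy_url("<password>", groups=["SHADER"], session="new_job_123"))
# http://groups-SHADER,session-new_job_123:<password>@proxy.apify.com:8000
```

The resulting URL can then be passed to any HTTP client that accepts a proxy URL, as in the code examples below.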
- -If you want to specify one parameter and not the others, just provide that parameter and omit the others. To use the default behavior (not specifying either `groups`, `session`, or `country`), set the username to **auto**. **auto** serves as a placeholder because the username can't be empty. - -To learn more about [sessions](./index.md#sessions) and [IP address rotation](./index.md#ip-address-rotation), see the [proxy overview page](./index.md). - -## Code examples - -We have code examples for connecting to our proxy using the [Apify SDK](/sdk/js) and [Crawlee](https://crawlee.dev/) and other JavaScript libraries (**axios** and **got-scraping**), as well as examples in Python and PHP. - -* [Datacenter proxy](./datacenter_proxy/examples.md) -* [Residential proxy](./residential_proxy/index.md) -* [Google SERP proxy](./google_serp_proxy/examples.md) - -## Proxy groups - -You can see which proxy groups you have access to on the [Proxy page](https://console.apify.com/proxy) in the Apify Console. - -To use a specific proxy group (or multiple groups), specify it in the `username` parameter. diff --git a/sources/platform/proxy/datacenter_proxy.md b/sources/platform/proxy/datacenter_proxy.md new file mode 100644 index 000000000..5ebc288b2 --- /dev/null +++ b/sources/platform/proxy/datacenter_proxy.md @@ -0,0 +1,455 @@ +--- +title: Datacenter proxy +description: Learn how to reduce blocking when web scraping using IP address rotation. See proxy parameters and learn to implement Apify Proxy in an application. +sidebar_position: 10.2 +slug: /proxy/datacenter-proxy +--- + +import Tabs from '@theme/Tabs'; +import TabItem from '@theme/TabItem'; + +# Datacenter proxy {#datacenter-proxy} + +**Learn how to reduce blocking when web scraping using IP address rotation. See proxy parameters and learn to implement Apify Proxy in an application.** + +--- + +Datacenter proxies are a cheap, fast and stable way to mask your identity online. 
When you access a website using a datacenter proxy, the site can only see the proxy center's credentials, not yours. + +Datacenter proxies allow you to mask and [rotate](./usage.md#ip-address-rotation) your IP address during web scraping and automation jobs, reducing the possibility of them being [blocked](/academy/anti-scraping/techniques#access-denied). For each [HTTP/S request](https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods), the proxy takes the list of all available IP addresses and selects the one used the longest time ago for the specific hostname. + +You can refer to our [blog post](https://blog.apify.com/datacenter-proxies-when-to-use-them-and-how-to-make-the-most-of-them/) for tips on how to make the most out of datacenter proxies. + +## Features {#features} + +* Periodic health checks of proxies in the pool so requests are not forwarded via dead proxies. +* Intelligent rotation of IP addresses so target hosts are accessed via proxies that have accessed them the longest time ago, to reduce the chance of blocking. +* Periodically checks whether proxies are banned by selected target websites. If they are, stops forwarding traffic to them to get the proxies unbanned as soon as possible. +* Ensures proxies are located in specific countries using IP geolocation. +* Allows selection of groups of proxy servers with specific characteristics. +* Supports persistent sessions that enable you to keep the same IP address for certain parts of your crawls. +* Measures statistics of traffic for specific users and hostnames. +* Allows selection of proxy servers by country. + +## Datacenter proxy types + +When using Apify's datacenter proxies, you can either select a proxy group, or the `auto` mode. [Apify Proxy](https://apify.com/proxy) offers either proxy groups that are shared across multiple customers or dedicated ones. + +### Shared proxy groups {#shared-proxy-groups} + +Each user has access to a selected number of proxy servers from a shared pool. 
These servers are spread into groups (called proxy groups). Each group shares a common feature (location, provider, speed and so on). + +For a full list of plans and number of allocated proxy servers for each plan, see our [pricing](https://apify.com/pricing). To get access to more servers, you can upgrade your plan in the [subscription settings](https://console.apify.com/billing/subscription); + +### Dedicated proxy groups {#dedicated-proxy-groups} + +When you purchase access to dedicated proxy groups, they are assigned to you, and only you can use them. You gain access to a range of static IP addresses from these groups. + +This feature is also useful if you have your own pool of proxy servers and still want to benefit from the features of Apify Proxy (like [IP address rotation](./usage.md#ip-address-rotation), [persistent sessions](#session-persistence), and health checking). If you do not have your own pool, the [customer support](https://apify.com/contact) team can set up a dedicated group for you based on your needs and requirements. + +Prices for dedicated proxy servers are mainly based on the number of proxy servers, their type, and location. [Contact us](https://apify.com/contact) for more information. + +## Connecting to datacenter proxies {#connecting-to-datacenter-proxies} + +By default, each proxied HTTP request is potentially sent via a different target proxy server, which adds overhead and could be potentially problematic for websites which save cookies based on IP address. + +If you want to pick an IP address and pass all subsequent connections via that same IP address, you can use the `session` [parameter](./usage.md#sessions). + +### Username parameters {#username-parameters} + +The `username` field enables you to pass various [parameters](./usage.md#connection-settings), such as groups, session and country, for your proxy connection. + +**This parameter is optional**. 
By default, the proxy uses all available proxy servers from all groups you have access to. + +If you do not want to specify either `groups` or `session` parameters and therefore use the default behavior for both, set the username to `auto`. + +### Examples {#examples} + + + + +```javascript +import { Actor } from 'apify'; +import { PuppeteerCrawler } from 'crawlee'; + +await Actor.init(); + +const proxyConfiguration = await Actor.createProxyConfiguration(); + +const crawler = new PuppeteerCrawler({ + proxyConfiguration, + async requestHandler({ page }) { + console.log(await page.content()) + }, +}); + +await crawler.run(['https://proxy.apify.com/?format=json']); + +await Actor.exit(); + +``` + + + + + + +```javascript +import { Actor } from 'apify'; +import { CheerioCrawler } from 'crawlee'; + +await Actor.init(); + +const proxyConfiguration = await Actor.createProxyConfiguration(); + +const crawler = new CheerioCrawler({ + proxyConfiguration, + async requestHandler({ body }) { + // ... 
+ console.log(body); + }, +}); + +await crawler.run(['https://proxy.apify.com']); + +await Actor.exit(); + +``` + + + + + +```python +from apify import Actor +import requests, asyncio + +async def main(): + async with Actor: + proxy_configuration = await Actor.create_proxy_configuration() + proxy_url = await proxy_configuration.new_url() + proxies = { + 'http': proxy_url, + 'https': proxy_url, + } + + for _ in range(10): + response = requests.get('https://api.apify.com/v2/browser-info', proxies=proxies) + print(response.text) + +if __name__ == '__main__': + asyncio.run(main()) +``` + + + + + + +```javascript +import { Actor } from 'apify'; +import { gotScraping } from 'got-scraping'; + +await Actor.init(); + +const proxyConfiguration = await Actor.createProxyConfiguration(); +const proxyUrl = await proxyConfiguration.newUrl(); + +const url = 'https://api.apify.com/v2/browser-info'; + +const response1 = await gotScraping({ + url, + proxyUrl, + responseType: 'json', +}); + +const response2 = await gotScraping({ + url, + proxyUrl, + responseType: 'json', +}); + +console.log(response1.body.clientIp); +console.log('Should be different than'); +console.log(response2.body.clientIp); + +await Actor.exit(); + +``` + + + + +## Session persistence {#session-persistence} + +When you use datacenter proxy with the `session` [parameter](./usage.md#sessions) set in the `username` [field](#username-parameters), a single IP is assigned to the `session ID` provided after you make the first request. + +**Session IDs represent IP addresses. Therefore, you can manage the IP addresses you use by managing sessions.** [[More info](./usage.md#sessions)] + +This IP/session ID combination is persisted and expires 26 hours later. Each additional request resets the expiration time to 26 hours. + +So, if you use the session at least once a day, it will never expire, with two possible exceptions: + +* The proxy server stops responding and is marked as dead during a health check. 
+* If the proxy server is part of a proxy group that is refreshed monthly and is rotated out. + +If the session is discarded due to the reasons above, it is assigned a new IP address. + +To learn more about [sessions](./usage.md#sessions) and [IP address rotation](./usage.md#ip-address-rotation), see the [proxy overview page](./index.md). + + +### Examples using sessions + + + + +```javascript +import { Actor } from 'apify'; +import { PuppeteerCrawler } from 'crawlee'; + +await Actor.init(); + +const proxyConfiguration = await Actor.createProxyConfiguration(); + +const crawler = new PuppeteerCrawler({ + proxyConfiguration, + sessionPoolOptions: { maxPoolSize: 1 }, + async requestHandler({ page}) { + console.log(await page.content()); + }, +}); + +await crawler.run([ + 'https://proxy.apify.com/?format=json', + 'https://proxy.apify.com', +]); + +await Actor.exit(); + +``` + + + + + + +```javascript +import { Actor } from 'apify'; +import { CheerioCrawler } from 'crawlee'; + +await Actor.init(); + +const proxyConfiguration = await Actor.createProxyConfiguration(); + +const crawler = new CheerioCrawler({ + proxyConfiguration, + sessionPoolOptions: { maxPoolSize: 1 }, + async requestHandler({ json }) { + // ... 
+ console.log(json); + }, +}); + +await crawler.run([ + 'https://api.apify.com/v2/browser-info', + 'https://proxy.apify.com/?format=json', +]); + +await Actor.exit(); + +``` + + + + + +```python +from apify import Actor +import requests, asyncio + +async def main(): + async with Actor: + proxy_configuration = await Actor.create_proxy_configuration() + proxy_url = await proxy_configuration.new_url('my_session') + proxies = { + 'http': proxy_url, + 'https': proxy_url, + } + + # each request uses the same IP address + for _ in range(10): + response = requests.get('https://api.apify.com/v2/browser-info', proxies=proxies) + print(response.text) + +if __name__ == '__main__': + asyncio.run(main()) +``` + + + + + + +```javascript +import { Actor } from 'apify'; +import { gotScraping } from 'got-scraping'; + +await Actor.init(); + +const proxyConfiguration = await Actor.createProxyConfiguration(); +const proxyUrl = await proxyConfiguration.newUrl('my_session'); + +const response1 = await gotScraping({ + url: 'https://api.apify.com/v2/browser-info', + proxyUrl, + responseType: 'json', +}); + +const response2 = await gotScraping({ + url: 'https://api.apify.com/v2/browser-info', + proxyUrl, + responseType: 'json', +}); + +console.log(response1.body.clientIp); +console.log("Should be the same as"); +console.log(response2.body.clientIp); + +await Actor.exit(); + +``` + + + + +## Examples using standard libraries and languages {#examples-using-standard-libraries-and-languages} + +You can find your proxy password on the [Proxy page](https://console.apify.com/proxy) of the Apify Console. + +> The `username` field is **not** your Apify username.
+> Instead, you specify proxy settings (e.g. `groups-BUYPROXIES94952`, `session-123`).
+> Use `auto` for default settings. + +For examples using [PHP](https://www.php.net/), you need to have the [cURL](https://www.php.net/manual/en/book.curl.php) extension enabled in your PHP installation. See [installation instructions](https://www.php.net/manual/en/curl.installation.php) for more information. + +Examples in [Python 2](https://www.python.org/download/releases/2.0/) use the [six](https://pypi.org/project/six/) library. Run `pip install six` to enable it. + + + + +```javascript +import axios from 'axios'; + +const proxy = { + protocol: 'http', + host: 'proxy.apify.com', + port: 8000, + // Replace below with your password + // found at https://console.apify.com/proxy + auth: { username: 'auto', password: }, +}; + +const url = 'http://proxy.apify.com/?format=json'; + +const { data } = await axios.get(url, { proxy }); + +console.log(data); + +``` + + + + + + +```python +import urllib.request as request +import ssl + +# Replace below with your password +# found at https://console.apify.com/proxy +password = "" +proxy_url = f"http://auto:{password}@proxy.apify.com:8000" +proxy_handler = request.ProxyHandler({ + "http": proxy_url, + "https": proxy_url, +}) + +ctx = ssl.create_default_context() +ctx.check_hostname = False +ctx.verify_mode = ssl.CERT_NONE +httpHandler = request.HTTPSHandler(context=ctx) + +opener = request.build_opener(httpHandler,proxy_handler) +print(opener.open("http://proxy.apify.com/?format=json").read()) + +``` + + + + + + +```python +import six +from six.moves.urllib import request + +# Replace below with your password +# found at https://console.apify.com/proxy +password = "" +proxy_url = ( + "http://auto:%s@proxy.apify.com:8000" % + (password) +) +proxy_handler = request.ProxyHandler({ + "http": proxy_url, + "https": proxy_url, +}) +opener = request.build_opener(proxy_handler) +print(opener.open("http://proxy.apify.com/?format=json").read()) + +``` + + + + + + +```php + below with your password +// found at 
https://console.apify.com/proxy +curl_setopt($curl, CURLOPT_PROXYUSERPWD, "auto:"); +$response = curl_exec($curl); +curl_close($curl); +if ($response) echo $response; +?> + +``` + + + + + + +```php + below with your password + // found at https://console.apify.com/proxy + 'proxy' => 'http://auto:@proxy.apify.com:8000' +]); + +$response = $client->get("http://proxy.apify.com/?format=json"); +echo $response->getBody(); + +``` + + + diff --git a/sources/platform/proxy/datacenter_proxy/examples.md b/sources/platform/proxy/datacenter_proxy/examples.md deleted file mode 100644 index c3b0408b3..000000000 --- a/sources/platform/proxy/datacenter_proxy/examples.md +++ /dev/null @@ -1,626 +0,0 @@ ---- -title: Examples -description: Learn how to connect to Apify's datacenter proxies from your application with Node.js (axios and got-scraping), Python 2 and 3 and PHP using code examples. -slug: /proxy/datacenter-proxy/examples ---- - -import Tabs from '@theme/Tabs'; -import TabItem from '@theme/TabItem'; - - - -# Connect to datacenter proxies - -**Learn how to connect to Apify's datacenter proxies from your application with Node.js (axios and got-scraping), Python 2 and 3 and PHP using code examples.** - ---- - -This page contains code examples for connecting to [datacenter proxies](./index.md) using [Apify Proxy](https://apify.com/proxy). - -See the [connection settings](../connection_settings.md) page for connection parameters. - -## Using the Apify SDK and Crawlee {#using-the-apify-sdk-and-crawlee} - -If you are developing your own Apify [actor](../../actors/index.mdx) using the [Apify SDK](/sdk/js) and [Crawlee](https://crawlee.dev/), you can use Apify Proxy in: - -* [`CheerioCrawler`](https://crawlee.dev/api/cheerio-crawler/class/CheerioCrawler) by using the [`Actor.createProxyConfiguration()`](/sdk/js/api/apify/class/Actor#createProxyConfiguration) function. 
-* [`PlaywrightCrawler`](https://crawlee.dev/api/playwright-crawler/class/PlaywrightCrawler) by using the [`Actor.createProxyConfiguration()`](/sdk/js/api/apify/class/Actor#createProxyConfiguration) function. -* [`PuppeteerCrawler`](https://crawlee.dev/api/puppeteer-crawler/class/PuppeteerCrawler) by using the [`Actor.createProxyConfiguration()`](/sdk/js/api/apify/class/Actor#createProxyConfiguration) function. -* [`JSDOMCrawler`](https://crawlee.dev/api/jsdom-crawler/class/JSDOMCrawler) by using the [`Actor.createProxyConfiguration()`](/sdk/js/api/apify/class/Actor#createProxyConfiguration) function. -* [`launchPlaywright()`](https://crawlee.dev/api/playwright-crawler/function/launchPlaywright) by specifying the proxy configuration in the function's options. -* [`launchPuppeteer()`](https://crawlee.dev/api/puppeteer-crawler/function/launchPuppeteer) by specifying the proxy configuration in the function's options. -* [`got-scraping`](https://github.com/apify/got-scraping) [NPM package](https://www.npmjs.com/package/got-scraping) by specifying proxy URL in the options. - -The Apify SDK's [ProxyConfiguration](/sdk/js/api/apify/class/ProxyConfiguration) enables you to choose which proxies you use for all connections. You can inspect the current proxy's URL and other attributes using the [proxyInfo](https://crawlee.dev/api/cheerio-crawler/interface/CheerioCrawlingContext#proxyInfo) property of [crawling context](https://crawlee.dev/api/cheerio-crawler/interface/CheerioCrawlingContext) of your crawler's [requestHandler](https://crawlee.dev/api/cheerio-crawler/interface/CheerioCrawlerOptions#requestHandler). - -### Rotate IP addresses {#rotate-ip-addresses} - -IP addresses for each request are selected at random from all available proxy servers. 
- - - - -```javascript -import { Actor } from 'apify'; -import { PuppeteerCrawler } from 'crawlee'; - -await Actor.init(); - -const proxyConfiguration = await Actor.createProxyConfiguration(); - -const crawler = new PuppeteerCrawler({ - proxyConfiguration, - async requestHandler({ page }) { - console.log(await page.content()) - }, -}); - -await crawler.run(['https://proxy.apify.com/?format=json']); - -await Actor.exit(); - -``` - - - - - - -```javascript -import { Actor } from 'apify'; -import { CheerioCrawler } from 'crawlee'; - -await Actor.init(); - -const proxyConfiguration = await Actor.createProxyConfiguration(); - -const crawler = new CheerioCrawler({ - proxyConfiguration, - async requestHandler({ body }) { - // ... - console.log(body); - }, -}); - -await crawler.run(['https://proxy.apify.com']); - -await Actor.exit(); - -``` - - - - - - -```javascript -import { Actor } from 'apify'; -import { launchPuppeteer } from 'crawlee'; - -await Actor.init(); - -const proxyConfiguration = await Actor.createProxyConfiguration(); -const proxyUrl = await proxyConfiguration.newUrl(); - -const browser = await launchPuppeteer({ proxyUrl }); -const page = await browser.newPage(); -await page.goto('https://www.example.com'); -const html = await page.content(); -await browser.close(); - -console.log('HTML:'); -console.log(html); - -await Actor.exit(); - -``` - - - - - - -```javascript -import { Actor } from 'apify'; -import { gotScraping } from 'got-scraping'; - -await Actor.init(); - -const proxyConfiguration = await Actor.createProxyConfiguration(); -const proxyUrl = await proxyConfiguration.newUrl(); - -const url = 'https://api.apify.com/v2/browser-info'; - -const response1 = await gotScraping({ - url, - proxyUrl, - responseType: 'json', -}); - -const response2 = await gotScraping({ - url, - proxyUrl, - responseType: 'json', -}); - -console.log(response1.body.clientIp); -console.log('Should be different than'); -console.log(response2.body.clientIp); - -await Actor.exit(); - 
-``` - - - - -### Single IP address for multiple requests {#single-ip-address-for-multiple-requests} - -Use a single IP address until it fails (gets retired). - -The `maxPoolSize: 1` specified in `sessionPoolOptions` of [PuppeteerCrawler](https://crawlee.dev/api/puppeteer-crawler/class/PuppeteerCrawler) (works the same with other crawler classes) means that a single IP will be used by all browsers until it fails. Then, all running browsers are retired, a new IP is selected and new browsers opened. The browsers all use the new IP. - - - - -```javascript -import { Actor } from 'apify'; -import { PuppeteerCrawler } from 'crawlee'; - -await Actor.init(); - -const proxyConfiguration = await Actor.createProxyConfiguration(); - -const crawler = new PuppeteerCrawler({ - proxyConfiguration, - sessionPoolOptions: { maxPoolSize: 1 }, - async requestHandler({ page}) { - console.log(await page.content()); - }, -}); - -await crawler.run([ - 'https://proxy.apify.com/?format=json', - 'https://proxy.apify.com', -]); - -await Actor.exit(); - -``` - - - - - - -```javascript -import { Actor } from 'apify'; -import { CheerioCrawler } from 'crawlee'; - -await Actor.init(); - -const proxyConfiguration = await Actor.createProxyConfiguration(); - -const crawler = new CheerioCrawler({ - proxyConfiguration, - sessionPoolOptions: { maxPoolSize: 1 }, - async requestHandler({ json }) { - // ... 
- console.log(json); - }, -}); - -await crawler.run([ - 'https://api.apify.com/v2/browser-info', - 'https://proxy.apify.com/?format=json', -]); - -await Actor.exit(); - -``` - - - - - - -```javascript -import { Actor } from 'apify'; -import { launchPuppeteer } from 'crawlee'; - -await Actor.init(); - -const proxyConfiguration = await Actor.createProxyConfiguration(); -const proxyUrl = await proxyConfiguration.newUrl('my_session'); -const browser = await launchPuppeteer({ proxyUrl }); -const page = await browser.newPage(); - -await page.goto('https://proxy.apify.com/?format=json'); -const html = await page.content(); - -await page.goto('https://proxy.apify.com'); -const html2 = await page.content(); - -await browser.close(); - -console.log(html); -console.log('Should display the same clientIp as'); -console.log(html2); - -await Actor.exit(); - -``` - - - - - - -```javascript -import { Actor } from 'apify'; -import { gotScraping } from 'got-scraping'; - -await Actor.init(); - -const proxyConfiguration = await Actor.createProxyConfiguration(); -const proxyUrl = await proxyConfiguration.newUrl('my_session'); - -const response1 = await gotScraping({ - url: 'https://api.apify.com/v2/browser-info', - proxyUrl, - responseType: 'json', -}); - -const response2 = await gotScraping({ - url: 'https://api.apify.com/v2/browser-info', - proxyUrl, - responseType: 'json', -}); - -console.log(response1.body.clientIp); -console.log("Should be the same as"); -console.log(response2.body.clientIp); - -await Actor.exit(); - -``` - - - - -### How to use proxy groups {#how-to-use-proxy-groups} - -For simplicity, the examples above use the automatic proxy configuration (no specific proxy groups are specified), which selects IP addresses from all available groups. 
- -To use IP addresses from specific proxy groups, add the `groups` [property](/sdk/js/api/apify/interface/ProxyConfigurationOptions#groups) -to [`Actor.createProxyConfiguration()`](/sdk/js/api/apify/class/Actor#createProxyConfiguration) and specify the group names. For example: - -```js -import { Actor } from 'apify'; - -await Actor.init(); -// ... -const proxyConfiguration = await Actor.createProxyConfiguration({ - groups: ['GROUP_NAME_1', 'GROUP_NAME_2'], -}); -// ... -await Actor.exit(); -``` - -## Using standard libraries and languages {#using-standard-libraries-and-languages} - -You can find your proxy password on the [Proxy page](https://console.apify.com/proxy) of the Apify Console. - -> The `username` field is **not** your Apify username.
-> Instead, you specify proxy settings (e.g. `groups-BUYPROXIES94952`, `session-123`).
-> Use `auto` for default settings. - -For examples using [PHP](https://www.php.net/), you need to have the [cURL](https://www.php.net/manual/en/book.curl.php) extension enabled in your PHP installation. See [installation instructions](https://www.php.net/manual/en/curl.installation.php) for more information. - -Examples in [Python 2](https://www.python.org/download/releases/2.0/) use the [six](https://pypi.org/project/six/) library. Run `pip install six` to enable it. - -### Use IP rotation {#use-ip-rotation} - -For each request, a random IP address is chosen from all [available proxy groups](https://console.apify.com/proxy). You can use random IP addresses from proxy groups by specifying the group(s) in the `username` parameter. - -A random IP address will be used for each request. - - - - -```javascript -import axios from 'axios'; - -const proxy = { - protocol: 'http', - host: 'proxy.apify.com', - port: 8000, - // Replace below with your password - // found at https://console.apify.com/proxy - auth: { username: 'auto', password: }, -}; - -const url = 'http://proxy.apify.com/?format=json'; - -const { data } = await axios.get(url, { proxy }); - -console.log(data); - -``` - - - - - - -```python -import urllib.request as request -import ssl - -# Replace below with your password -# found at https://console.apify.com/proxy -password = "" -proxy_url = f"http://auto:{password}@proxy.apify.com:8000" -proxy_handler = request.ProxyHandler({ - "http": proxy_url, - "https": proxy_url, -}) - -ctx = ssl.create_default_context() -ctx.check_hostname = False -ctx.verify_mode = ssl.CERT_NONE -httpHandler = request.HTTPSHandler(context=ctx) - -opener = request.build_opener(httpHandler,proxy_handler) -print(opener.open("http://proxy.apify.com/?format=json").read()) - -``` - - - - - - -```python -import six -from six.moves.urllib import request - -# Replace below with your password -# found at https://console.apify.com/proxy -password = "" -proxy_url = ( - 
"http://auto:%s@proxy.apify.com:8000" % - (password) -) -proxy_handler = request.ProxyHandler({ - "http": proxy_url, - "https": proxy_url, -}) -opener = request.build_opener(proxy_handler) -print(opener.open("http://proxy.apify.com/?format=json").read()) - -``` - - - - - - -```php - below with your password -// found at https://console.apify.com/proxy -curl_setopt($curl, CURLOPT_PROXYUSERPWD, "auto:"); -$response = curl_exec($curl); -curl_close($curl); -if ($response) echo $response; -?> - -``` - - - - - - -```php - below with your password - // found at https://console.apify.com/proxy - 'proxy' => 'http://auto:@proxy.apify.com:8000' -]); - -$response = $client->get("http://proxy.apify.com/?format=json"); -echo $response->getBody(); - -``` - - - - -### Multiple requests with the same IP address {#multiple-requests-with-the-same-ip-address} - -The IP address in the example is chosen at random from all available proxy groups. - -To use this option, set a session name in the `username` parameter. 
- - - - -```javascript -import axios from 'axios'; -import { HttpsProxyAgent } from 'hpagent'; - -const httpsAgent = new HttpsProxyAgent({ - // Replace below with your password - // found at https://console.apify.com/proxy - proxy: 'http://session-my_session:@proxy.apify.com:8000', -}); -const axiosWithProxy = axios.create({ httpsAgent }); - -const url = 'https://api.apify.com/v2/browser-info'; - -const response1 = await axiosWithProxy.get(url); -const response2 = await axiosWithProxy.get(url); -// Should return the same clientIp for both requests -console.log('clientIp1:', response1.data.clientIp); -console.log('clientIp2:', response2.data.clientIp); - -``` - - - - - - -```python -import urllib.request as request -import ssl - -def do_request(): - # Replace below with your password - # found at https://console.apify.com/proxy - password = "" - proxy_url = f"http://session-my_session:{password}@proxy.apify.com:8000" - proxy_handler = request.ProxyHandler({ - "http": proxy_url, - "https": proxy_url, - }) - - ctx = ssl.create_default_context() - ctx.check_hostname = False - ctx.verify_mode = ssl.CERT_NONE - httpHandler = request.HTTPSHandler(context=ctx) - - opener = request.build_opener(httpHandler,proxy_handler) - return opener.open("https://api.apify.com/v2/browser-info").read() - -print(do_request()) -print("Should return the same clientIp as ") -print(do_request()) - -``` - - - - - - -```python -import six -from six.moves.urllib import request -import ssl - -def do_request(): - # Replace below with your password - # found at https://console.apify.com/proxy - password = "" - proxy_url = ( - "http://session-my_session:%s@proxy.apify.com:8000" % - (password) - ) - proxy_handler = request.ProxyHandler({ - "http": proxy_url, - "https": proxy_url, - }) - - ctx = ssl.create_default_context() - ctx.check_hostname = False - ctx.verify_mode = ssl.CERT_NONE - httpHandler = request.HTTPSHandler(context=ctx) - - opener = request.build_opener(httpHandler,proxy_handler) - 
return opener.open("https://api.apify.com/v2/browser-info").read() - -print(do_request()) -print("Should return the same clientIp as ") -print(do_request()) - -``` - - - - - - -```php - below with your password - // found at https://console.apify.com/proxy - curl_setopt($curl, CURLOPT_PROXYUSERPWD, "session-my_session:"); - $response = curl_exec($curl); - curl_close($curl); - return $response; -} -$response1 = doRequest(); -$response2 = doRequest(); -echo $response1; -echo "\nShould return the same clientIp as\n"; -echo $response2; -?> - -``` - - - - - - -```php - below with your password - // found at https://console.apify.com/proxy - 'proxy' => 'http://session-my_session:@proxy.apify.com:8000' -]); - -$response = $client->get("https://api.apify.com/v2/browser-info"); -echo $response->getBody(); - -// Should return the same clientIp as -$response = $client->get("https://api.apify.com/v2/browser-info"); -echo $response->getBody(); - -``` - - - - -## Username examples {#username-examples} - -Use randomly allocated IP addresses from the `BUYPROXIES94952` group: - -```text -groups-BUYPROXIES94952 -``` - -Use a randomly allocated IP address for multiple requests: - -```text -session-new_job_123 -``` - -Use the same IP address from the `SHADER` and `BUYPROXIES94952` groups for multiple requests: - -```text -groups-SHADER+BUYPROXIES94952,session-new_job_123 -``` - -Set a session and select an IP from the `BUYPROXIES94952` group located in the USA: - -```text -groups-BUYPROXIES94952,session-new_job_123,country-US -``` diff --git a/sources/platform/proxy/datacenter_proxy/index.md b/sources/platform/proxy/datacenter_proxy/index.md deleted file mode 100644 index 047c7152a..000000000 --- a/sources/platform/proxy/datacenter_proxy/index.md +++ /dev/null @@ -1,88 +0,0 @@ ---- -title: Datacenter proxy -description: Learn how to reduce blocking when web scraping using IP address rotation. See proxy parameters and learn to implement Apify Proxy in an application. 
-sidebar_position: 10.3 -slug: /proxy/datacenter-proxy ---- - -# Datacenter proxy {#datacenter-proxy} - -**Learn how to reduce blocking when web scraping using IP address rotation. See proxy parameters and learn to implement Apify Proxy in an application.** - ---- - -Datacenter proxies are a cheap, fast and stable way to mask your identity online. When you access a website using a datacenter proxy, the site can only see the proxy center's credentials, not yours. - -Datacenter proxies allow you to mask and [rotate](../index.md) your IP address during web scraping and automation jobs, reducing the possibility of them being [blocked](/academy/anti-scraping/techniques#access-denied). For each [HTTP/S request](https://developer.mozilla.org/en-US/docs/Web/HTTP/Methods), the proxy takes the list of all available IP addresses and selects the one used the longest time ago for the specific hostname. - -[Apify Proxy](https://apify.com/proxy) currently offers two types of datacenter proxy: - -* [Shared proxy groups](#shared-proxy-groups) -* [Dedicated proxy groups](#dedicated-proxy-groups) - -## Features {#features} - -* Periodic health checks of proxies in the pool so requests are not forwarded via [dead](../index.md) proxies. -* Intelligent rotation of IP addresses so target hosts are accessed via proxies that have accessed them the longest time ago, to reduce the chance of blocking. -* Periodically checks whether proxies are banned by selected target websites. If they are, stops forwarding traffic to them to get the proxies unbanned as soon as possible. -* Ensures proxies are located in specific countries using IP geolocation. -* Allows selection of groups of proxy servers with specific characteristics. -* Supports persistent sessions that enable you to keep the same IP address for certain parts of your crawls. -* Measures statistics of traffic for specific users and hostnames. -* Allows selection of proxy servers by country. 
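The group, session, and country selection listed in the features above is expressed entirely through the proxy username. As a rough sketch, with a hypothetical helper name and the `groups-`/`session-`/`country-` syntax taken from the username examples in these docs, composing that username looks like this:

```python
def build_proxy_username(groups=None, session=None, country=None):
    """Compose an Apify Proxy username from the optional parameters.

    Hypothetical helper for illustration; the parameter syntax follows
    the username examples in these docs. With no parameters, `auto`
    selects from all proxy groups you have access to.
    """
    parts = []
    if groups:
        parts.append("groups-" + "+".join(groups))
    if session:
        parts.append(f"session-{session}")
    if country:
        parts.append(f"country-{country}")
    return ",".join(parts) if parts else "auto"

# Same IP from two groups for multiple requests, as in the examples:
print(build_proxy_username(groups=["SHADER", "BUYPROXIES94952"], session="new_job_123"))
# groups-SHADER+BUYPROXIES94952,session-new_job_123
```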
- -## Shared proxy groups {#shared-proxy-groups} - -Each user has access to a selected number of proxy servers from a shared pool. These servers are spread into groups (called proxy groups). Each group shares a common feature (location, provider, speed and so on). - -The number of proxy servers available depends on your subscription plan. When you first sign up to Apify platform, you get a 30-day free trial of Apify Proxy. After the trial, you must subscribe to a paid plan to continue using Apify Proxy. - -For a full list of plans and number of allocated proxy servers for each plan, see our [pricing](https://apify.com/pricing). - -To access more servers or to use Apify Proxy without other parts of the Apify platform, [contact us](https://apify.com/contact). - -## Dedicated proxy groups {#dedicated-proxy-groups} - -When you purchase access to dedicated proxy groups, they are assigned to you, and only you can use them. You gain access to a range of static IP addresses from these groups. - -This feature is useful if you have your own pool of proxy servers and still want to benefit from the features of Apify Proxy (like [IP address rotation](../index.md), [persistent sessions](#session-persistence), and health checking). - -If you do not have your own pool, the [customer support](https://apify.com/contact) team can set up a dedicated group for you based on your needs and requirements. - -Prices for dedicated proxy servers are mainly based on the number of proxy servers, their type, and location. [Contact us](https://apify.com/contact) for more information. - -[Contact us](https://apify.com/contact) for more details or if you have any questions. - -## Connecting to datacenter proxies {#connecting-to-datacenter-proxies} - -By default, each proxied HTTP request is potentially sent via a different target proxy server, which adds overhead and could be potentially problematic for websites which save cookies based on IP address. 
- -If you want to pick an IP address and pass all subsequent connections via that same IP address, you can use the `session` [parameter](../index.md). - -For code examples on how to connect to datacenter proxies, see the [examples](./examples.md) page. - -### Username parameters {#username-parameters} - -The `username` field enables you to pass various [parameters](../connection_settings.md), such as groups, session and country, for your proxy connection. - -**This parameter is optional**. By default, the proxy uses all available proxy servers from all groups you have access to. - -If you do not want to specify either `groups` or `session` parameters and therefore use the default behavior for both, set the username to `auto`. - -## Session persistence {#session-persistence} - -When you use datacenter proxy with the `session` [parameter](../index.md) set in the `username` [field](#username-parameters), a single IP is assigned to the `session ID` provided after you make the first request. - -**Session IDs represent IP addresses. Therefore, you can manage the IP addresses you use by managing sessions.** [[More info](../index.md)] - -This IP/session ID combination is persisted and expires 26 hours later. Each additional request resets the expiration time to 26 hours. - -So, if you use the session at least once a day, it will never expire, with two possible exceptions: - -* The proxy server stops responding and is marked as [dead](../index.md) during a health check. -* If the proxy server is part of a proxy group that is refreshed monthly and is rotated out. - -If the session is discarded due to the reasons above, it is assigned a new IP address. - -To learn more about [sessions](../index.md#sessions) and [IP address rotation](../index.md#ip-address-rotation), see the [proxy overview page](../index.md). 
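The persistence rules above can be pictured as a small map from session ID to an IP address with a 26-hour sliding expiry. The class below is a toy model for illustration only (the real pool lives on Apify's proxy servers), with a pluggable clock for clarity:

```python
import itertools
import time

SESSION_TTL_SECONDS = 26 * 60 * 60  # sessions expire 26 hours after last use

class SessionPool:
    """Toy model of session persistence: an IP address is pinned to a
    session ID on first use, every further use resets the 26-hour
    timer, and an expired session is assigned a fresh IP address."""

    def __init__(self, ips, clock=time.time):
        self._next_ip = itertools.cycle(ips).__next__
        self._clock = clock
        self._sessions = {}  # session_id -> (ip, last_used)

    def ip_for(self, session_id):
        now = self._clock()
        entry = self._sessions.get(session_id)
        if entry is None or now - entry[1] >= SESSION_TTL_SECONDS:
            entry = (self._next_ip(), now)  # new or expired: assign an IP
        self._sessions[session_id] = (entry[0], now)  # each use resets expiry
        return entry[0]
```

Using the session at least once a day therefore keeps the same address indefinitely, matching the behavior described above.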
- diff --git a/sources/platform/proxy/google_serp_proxy.md b/sources/platform/proxy/google_serp_proxy.md new file mode 100644 index 000000000..6c03391ad --- /dev/null +++ b/sources/platform/proxy/google_serp_proxy.md @@ -0,0 +1,291 @@ +--- +title: Google SERP proxy +description: Learn how to collect search results from Google Search-powered tools. Get search results from localized domains in multiple countries, e.g. the US and Germany. +sidebar_position: 10.4 +slug: /proxy/google-serp-proxy +--- + +import Tabs from '@theme/Tabs'; +import TabItem from '@theme/TabItem'; + +# Google SERP proxy {#google-serp-proxy} + +**Learn how to collect search results from Google Search-powered tools. Get search results from localized domains in multiple countries, e.g. the US and Germany.** + +--- + +Google SERP proxy allows you to extract search results from Google Search-powered services. It allows searching in [various countries](#country-selection) and to dynamically switch between country domains. + +Our Google SERP proxy currently supports the below services. + +* Google Search (`http://www.google./search`). +* Google Shopping (`http://www.google./search?tbm=shop`). + +> Google SERP proxy can **only** be used for Google Search and Shopping. It cannot be used to access other websites. + +When using the proxy, **pricing is based on the number of requests made**. + +To use Google SERP proxy or for more information, [contact us](https://apify.com/contact). + +## Connecting to Google SERP proxy {#connecting-to-google-serp-proxy} + +Requests made through the proxy are automatically routed through a proxy server from the selected country and pure **HTML code of the search result page is returned**. + +**Important:** Only HTTP requests are allowed, and the Google hostname needs to start with the `www.` prefix. + +For code examples on how to connect to Google SERP proxies, see the [examples](#examples-using-the-apify-sdk) section. 
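Putting the connection details above together, the sketch below (hypothetical helper names; the username, host, and port come from the examples in these docs) builds the proxy URL and enforces the two stated constraints, HTTP only and a `www.`-prefixed Google hostname:

```python
from urllib.parse import urlsplit

def serp_proxy_url(password):
    # groups-GOOGLE_SERP username, host proxy.apify.com, port 8000,
    # as used throughout the examples below.
    return f"http://groups-GOOGLE_SERP:{password}@proxy.apify.com:8000"

def check_serp_target(url):
    """Reject target URLs the Google SERP proxy will not accept."""
    parts = urlsplit(url)
    if parts.scheme != "http":
        raise ValueError("only HTTP requests are allowed")
    if not (parts.hostname or "").startswith("www.google."):
        raise ValueError("hostname must start with the www. prefix")
    return url
```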
+
+### Username parameters {#username-parameters}
+
+The `username` field enables you to pass various [parameters](./usage.md#username-parameters), such as groups and country, for your proxy connection.
+
+When using Google SERP proxy, the username should always be:
+
+```text
+groups-GOOGLE_SERP
+```
+
+Unlike [datacenter](./datacenter_proxy.md) or [residential](./residential_proxy.md) proxies, there is no [session](./usage.md#sessions) parameter.
+
+If you use the `country` [parameter](./usage.md), the Google proxy location is used if you access a website whose hostname (stripped of `www.`) starts with **google**.
+
+## Country selection {#country-selection}
+
+You must use the correct Google domain to get results for your desired country code.
+
+For example:
+
+* Search results from the USA: `http://www.google.com/search?q=`
+
+
+* Shopping results from Great Britain: `http://www.google.co.uk/search?tbm=shop&q=`
+
+See a [full list](https://ipfs.io/ipfs/QmXoypizjW3WknFiJnKLwHCnL72vedxjQkDDP1mXWo6uco/wiki/List_of_Google_domains.html) of available domain names for specific countries. When using them, remember to prepend the domain name with the `www.` prefix.
+
+## Examples {#examples}
+
+### Using the Apify SDK {#examples-using-the-apify-sdk}
+
+If you are developing your own Apify [Actor](../actors/index.mdx) using the Apify SDK ([JavaScript](/sdk/js) and [Python](/sdk/python)) and [Crawlee](https://crawlee.dev/), the most efficient way to use Google SERP proxy is [CheerioCrawler](https://crawlee.dev/api/cheerio-crawler/class/CheerioCrawler). This is because Google SERP proxy [only returns a page's HTML](./index.md). Alternatively, you can use the [got-scraping](https://github.com/apify/got-scraping) [NPM package](https://www.npmjs.com/package/got-scraping) by specifying the proxy URL in the options. For Python, you can leverage the [`requests`](https://pypi.org/project/requests/) library along with the Apify SDK.
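The country selection above comes down to choosing the right Google domain and keeping the required `www.` prefix. A minimal sketch using only the Python standard library (the helper name is made up):

```python
from urllib.parse import urlencode

def google_search_url(query, domain="google.com", shopping=False):
    """Build a localized Google Search URL, adding tbm=shop for
    Shopping results. Illustrative helper, not part of any API."""
    params = {"q": query}
    if shopping:
        params["tbm"] = "shop"
    return f"http://www.{domain}/search?{urlencode(params)}"

print(google_search_url("wikipedia"))
# http://www.google.com/search?q=wikipedia
```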
+ +The following examples get a list of search results for the keyword **wikipedia** from the USA (`google.com`). + + + + +```javascript +import { Actor } from 'apify'; +import { CheerioCrawler } from 'crawlee'; + +await Actor.init(); + +const proxyConfiguration = await Actor.createProxyConfiguration({ + groups: ['GOOGLE_SERP'], +}); + +const crawler = new CheerioCrawler({ + proxyConfiguration, + async requestHandler({ body }) { + // ... + console.log(body) + }, +}); + +await crawler.run(['http://www.google.com/search?q=wikipedia']); + +await Actor.exit(); + +``` + + + + + +```python +from apify import Actor +import requests, asyncio + +async def main(): + async with Actor: + proxy_configuration = await Actor.create_proxy_configuration(groups=['GOOGLE_SERP']) + proxy_url = await proxy_configuration.new_url() + proxies = { + 'http': proxy_url, + 'https': proxy_url, + } + + response = requests.get('http://www.google.com/search?q=wikipedia', proxies=proxies) + print(response.text) + +if __name__ == '__main__': + asyncio.run(main()) + +``` + + + + + +```javascript +import { Actor } from 'apify'; +import { gotScraping } from 'got-scraping'; + +await Actor.init(); + +const proxyConfiguration = await Actor.createProxyConfiguration({ + groups: ['GOOGLE_SERP'], +}); +const proxyUrl = await proxyConfiguration.newUrl(); + +const { body } = await gotScraping({ + url: 'http://www.google.com/search?q=wikipedia', + proxyUrl, +}); + +console.log(body); + +await Actor.exit(); + +``` + + + + +### Using standard libraries and languages {#using-standard-libraries-and-languages} + +You can find your proxy password on the [Proxy page](https://console.apify.com/proxy/access) of Apify Console. + +> The `username` field is **not** your Apify username.
+> Instead, you specify proxy settings (e.g. `groups-GOOGLE_SERP`).
+> Use `groups-GOOGLE_SERP` to use proxies from all available countries.
+
+For examples using [PHP](https://www.php.net/), you need to have the [cURL](https://www.php.net/manual/en/book.curl.php) extension enabled in your PHP installation. See [installation instructions](https://www.php.net/manual/en/curl.installation.php) for more information.
+
+Examples in [Python 2](https://www.python.org/download/releases/2.0/) use the [six](https://pypi.org/project/six/) library. Run `pip install six` to enable it.
+
+The following examples get the HTML of search results for the keyword **wikipedia** from the USA (**google.com**).
+
+Select this option by setting the `username` parameter to `groups-GOOGLE_SERP`. Add the item you want to search to the `query` parameter.
+
+
+
+```javascript
+import axios from 'axios';
+
+const proxy = {
+    protocol: 'http',
+    host: 'proxy.apify.com',
+    port: 8000,
+    // Replace below with your password
+    // found at https://console.apify.com/proxy
+    auth: { username: 'groups-GOOGLE_SERP', password: },
+};
+
+const url = 'http://www.google.com/search';
+const params = { q: 'wikipedia' };
+
+const { data } = await axios.get(url, { proxy, params });
+
+console.log(data);
+
+```
+
+
+
+
+```python
+import urllib.request as request
+import urllib.parse as parse
+
+# Replace below with your password
+# found at https://console.apify.com/proxy
+password = ''
+proxy_url = f"http://groups-GOOGLE_SERP:{password}@proxy.apify.com:8000"
+
+proxy_handler = request.ProxyHandler({
+    'http': proxy_url,
+})
+
+opener = request.build_opener(proxy_handler)
+
+query = parse.urlencode({ 'q': 'wikipedia' })
+print(opener.open(f"http://www.google.com/search?{query}").read())
+
+```
+
+
+
+
+```python
+import six
+from six.moves.urllib import request, parse
+
+# Replace below with your password
+# found at https://console.apify.com/proxy
+password = ''
+proxy_url = (
+    'http://groups-GOOGLE_SERP:%s@proxy.apify.com:8000' %
+    (password)
+)
+proxy_handler =
request.ProxyHandler({ + 'http': proxy_url, +}) +opener = request.build_opener(proxy_handler) +query = parse.urlencode({ 'q': 'wikipedia' }) +url = ( + 'http://www.google.com/search?%s' % + (query) +) +print(opener.open(url).read()) + +``` + + + + + + +```php + below with your password +// found at https://console.apify.com/proxy +curl_setopt($curl, CURLOPT_PROXYUSERPWD, 'groups-GOOGLE_SERP:'); +$response = curl_exec($curl); +curl_close($curl); +echo $response; +?> + +``` + + + + + +```php + below with your password + // found at https://console.apify.com/proxy + 'proxy' => 'http://groups-GOOGLE_SERP:@proxy.apify.com:8000' +]); + +$response = $client->get("http://www.google.com/search", [ + 'query' => ['q' => 'wikipedia'] +]); +echo $response->getBody(); + +``` + + + diff --git a/sources/platform/proxy/google_serp_proxy/examples.md b/sources/platform/proxy/google_serp_proxy/examples.md deleted file mode 100644 index 4f0407150..000000000 --- a/sources/platform/proxy/google_serp_proxy/examples.md +++ /dev/null @@ -1,424 +0,0 @@ ---- -title: Examples -description: Learn how to connect to Google SERP proxies from your applications with Node.js (axios and got-scraping), Python 2 and 3 and PHP using code examples. -slug: /proxy/google-serp-proxy/examples ---- - -import Tabs from '@theme/Tabs'; -import TabItem from '@theme/TabItem'; - - - -# Connect to Google SERP proxies - -**Learn how to connect to Google SERP proxies from your applications with Node.js (axios and got-scraping), Python 2 and 3 and PHP using code examples.** - ---- - -This page contains code examples for connecting to [Google SERP proxies](./index.md) using [Apify Proxy](https://apify.com/proxy). - -See the [connection settings](../connection_settings.md) page for connection parameters. 
- -## Using the Apify SDK {#using-the-apify-sdk} - -If you are developing your own Apify [actor](../../actors/index.mdx) using the [Apify SDK](/sdk/js) and [Crawlee](https://crawlee.dev/), the most efficient way to use Google SERP proxy is [CheerioCrawler](https://crawlee.dev/api/cheerio-crawler/class/CheerioCrawler). This is because Google SERP proxy [only returns a page's HTML](./index.md). Alternatively, you can use the [got-scraping](https://github.com/apify/got-scraping) [NPM package](https://www.npmjs.com/package/got-scraping) by specifying proxy URL in the options. - -Apify Proxy also works with [PuppeteerCrawler](https://crawlee.dev/api/puppeteer-crawler/class/PuppeteerCrawler), [launchPuppeteer()](https://crawlee.dev/api/puppeteer-crawler/function/launchPuppeteer), [PlaywrightCrawler](https://crawlee.dev/api/playwright-crawler/class/PlaywrightCrawler), [launchPlaywright()](https://crawlee.dev/api/playwright-crawler/function/launchPlaywright) and [JSDOMCrawler](https://crawlee.dev/api/jsdom-crawler/class/JSDOMCrawler). However, `CheerioCrawler` is simply the most efficient solution for this use case. - -### Get a list of search results {#get-a-list-of-search-results} - -Get a list of search results for the keyword **wikipedia** from the USA (`google.com`). - - - - -```javascript -import { Actor } from 'apify'; -import { CheerioCrawler } from 'crawlee'; - -await Actor.init(); - -const proxyConfiguration = await Actor.createProxyConfiguration({ - groups: ['GOOGLE_SERP'], -}); - -const crawler = new CheerioCrawler({ - proxyConfiguration, - async requestHandler({ body }) { - // ... 
- console.log(body) - }, -}); - -await crawler.run(['http://www.google.com/search?q=wikipedia']); - -await Actor.exit(); - -``` - - - - - - -```javascript -import { Actor } from 'apify'; -import { gotScraping } from 'got-scraping'; - -await Actor.init(); - -const proxyConfiguration = await Actor.createProxyConfiguration({ - groups: ['GOOGLE_SERP'], -}); -const proxyUrl = await proxyConfiguration.newUrl(); - -const { body } = await gotScraping({ - url: 'http://www.google.com/search?q=wikipedia', - proxyUrl, -}); - -console.log(body); - -await Actor.exit(); - -``` - - - - -### Get a list of shopping results {#get-a-list-of-shopping-results} - -Get a list of shopping results for the query **Apple iPhone XS 64GB** from Great Britain (`google.co.uk`). - - - - - -```javascript -import { Actor } from 'apify'; -import { CheerioCrawler } from 'crawlee'; - -await Actor.init(); - -const proxyConfiguration = await Actor.createProxyConfiguration({ - groups: ['GOOGLE_SERP'], -}); - -const crawler = new CheerioCrawler({ - proxyConfiguration, - async requestHandler({ body }) { - // ... - console.log(body) - }, -}); - -const query = encodeURI('Apple iPhone XS 64GB'); -await crawler.run([`http://www.google.co.uk/search?q=${query}&tbm=shop`]); - -await Actor.exit(); - -``` - - - - - -```javascript -import { Actor } from 'apify'; -import { gotScraping } from 'got-scraping'; - -await Actor.init(); - -const proxyConfiguration = await Actor.createProxyConfiguration({ - groups: ['GOOGLE_SERP'], -}); -const proxyUrl = await proxyConfiguration.newUrl(); - -const query = encodeURI('Apple iPhone XS 64GB'); -const { body } = await gotScraping({ - url: `http://www.google.co.uk/search?tbm=shop&q=${query}`, - proxyUrl, -}); - -console.log(body); - -await Actor.exit(); - -``` - - - - -## Using standard libraries and languages {#using-standard-libraries-and-languages} - -You can find your proxy password on the [Proxy page](https://console.apify.com/proxy) of the Apify Console. 
- -> The `username` field is **not** your Apify username.
-> Instead, you specify proxy settings (e.g. `groups-GOOGLE_SERP`).
-> Use `groups-GOOGLE_SERP` to use proxies from all available countries. - -For examples using [PHP](https://www.php.net/), you need to have the [cURL](https://www.php.net/manual/en/book.curl.php) extension enabled in your PHP installation. See [installation instructions](https://www.php.net/manual/en/curl.installation.php) for more information. - -Examples in [Python 2](https://www.python.org/download/releases/2.0/) use the [six](https://pypi.org/project/six/) library. Run `pip install six` to enable it. - -### HTML from search results {#html-from-search-results} - -Get the HTML of search results for the keyword **wikipedia** from the USA (**google.com**). - -Select this option by setting the `username` parameter to `groups-GOOGLE_SERP`. Add the item you want to search to the `query` parameter. - - - - -```javascript -import axios from 'axios'; - -const proxy = { - protocol: 'http', - host: 'proxy.apify.com', - port: 8000, - // Replace below with your password - // found at https://console.apify.com/proxy - auth: { username: 'groups-GOOGLE_SERP', password: }, -}; - -const url = 'http://www.google.com/search'; -const params = { q: 'wikipedia' }; - -const { data } = await axios.get(url, { proxy, params }); - -console.log(data); - -``` - - - - - - -```python -import urllib.request as request -import urllib.parse as parse - -# Replace below with your password -# found at https://console.apify.com/proxy -password = '' -proxy_url = f"http://groups-GOOGLE_SERP:{password}@proxy.apify.com:8000" - -proxy_handler = request.ProxyHandler({ - 'http': proxy_url, -}) - -opener = request.build_opener(proxy_handler) - -query = parse.urlencode({ 'q': 'wikipedia' }) -print(opener.open(f"http://www.google.com/search?{query}").read()) - -``` - - - - - - -```python -import six -from six.moves.urllib import request, urlencode - -# Replace below with your password -# found at https://console.apify.com/proxy -password = '' -proxy_url = ( - 
'http://groups-GOOGLE_SERP:%s@proxy.apify.com:8000' % - (password) -) -proxy_handler = request.ProxyHandler({ - 'http': proxy_url, -}) -opener = request.build_opener(proxy_handler) -query = parse.urlencode({ 'q': 'wikipedia' }) -url = ( - 'http://www.google.com/search?%s' % - (query) -) -print(opener.open(url).read()) - -``` - - - - - - -```php - below with your password -// found at https://console.apify.com/proxy -curl_setopt($curl, CURLOPT_PROXYUSERPWD, 'groups-GOOGLE_SERP:'); -$response = curl_exec($curl); -curl_close($curl); -echo $response; -?> - -``` - - - - - -```php - below with your password - // found at https://console.apify.com/proxy - 'proxy' => 'http://groups-GOOGLE_SERP:@proxy.apify.com:8000' -]); - -$response = $client->get("http://www.google.com/search", [ - 'query' => ['q' => 'wikipedia'] -]); -echo $response->getBody(); - -``` - - - - -### HTML from localized shopping results {#html-from-localized-shopping-results} - -Get HTML of shopping results for the query **Apple iPhone XS 64GB** from Great Britain (`google.co.uk`). - -Select this option by setting the `username` parameter to `groups-GOOGLE_SERP`. In the `query` parameter, add the item you want to search and specify the **shop** page as a URL parameter. - -Set the domain (your country of choice) in the URL (in the `response` variable). 
- - - - -```javascript -import axios from 'axios'; - -const proxy = { - protocol: 'http', - host: 'proxy.apify.com', - port: 8000, - // Replace below with your password - // found at https://console.apify.com/proxy - auth: { username: 'groups-GOOGLE_SERP', password: }, -}; - -const url = 'http://www.google.co.uk/search'; -const params = { q: 'Apple iPhone XS 64GB', tbm: 'shop' } - -const { data } = await axios.get(url, { proxy, params }); - -console.log(data); - -``` - - - - - - -```python -import urllib.request as request -import urllib.parse as parse - -# Replace below with your password -# found at https://console.apify.com/proxy -password = '' -proxy_url = f"http://groups-GOOGLE_SERP:{password}@proxy.apify.com:8000" -proxy_handler = request.ProxyHandler({ - 'http': proxy_url, -}) -opener = request.build_opener(proxy_handler) - -query = parse.urlencode({ 'q': 'Apple iPhone XS 64GB', 'tbm': 'shop' }) -print(opener.open(f"http://www.google.co.uk/search?{query}").read()) - -``` - - - - - - -```python -import six -from six.moves.urllib import request, urlencode - -# Replace below with your password -# found at https://console.apify.com/proxy -password = '' -proxy_url = ( - 'http://groups-GOOGLE_SERP:%s@proxy.apify.com:8000' % - (password) -) -proxy_handler = request.ProxyHandler({ - 'http': proxy_url, -}) -opener = request.build_opener(proxy_handler) -query = parse.urlencode({ 'q': 'Apple iPhone XS 64GB', 'tbm': 'shop' }) -url = ( - 'http://www.google.co.uk/search?%s' % - (query) -) -print(opener.open(url).read()) - -``` - - - - - - -```php - below with your password -// found at https://console.apify.com/proxy -curl_setopt($curl, CURLOPT_PROXYUSERPWD, 'groups-GOOGLE_SERP:'); -$response = curl_exec($curl); -curl_close($curl); -echo $response; -?> - -``` - - - - - - -```php - below with your password - // found at https://console.apify.com/proxy - 'proxy' => 'http://groups-GOOGLE_SERP:@proxy.apify.com:8000' -]); - -$response = 
$client->get("http://www.google.co.uk/search", [ - 'query' => [ - 'q' => 'Apple iPhone XS 64GB', - 'tbm' => 'shop' - ] -]); -echo $response->getBody(); - -``` - - - diff --git a/sources/platform/proxy/google_serp_proxy/index.md b/sources/platform/proxy/google_serp_proxy/index.md deleted file mode 100644 index ffe5fb54c..000000000 --- a/sources/platform/proxy/google_serp_proxy/index.md +++ /dev/null @@ -1,60 +0,0 @@ ---- -title: Google SERP proxy -description: Learn how to collect search results from Google Search-powered tools. Get search results from localized domains in multiple countries, e.g. the US and Germany. -sidebar_position: 10.5 -slug: /proxy/google-serp-proxy ---- - -# Google SERP proxy {#google-serp-proxy} - -**Learn how to collect search results from Google Search-powered tools. Get search results from localized domains in multiple countries, e.g. the US and Germany.** - ---- - -Google SERP proxy allows you to extract search results from Google Search-powered services. It allows searching in [various countries](#country-selection) and to dynamically switch between country domains. - -Our Google SERP proxy currently supports the below services. - -* Google Search (`http://www.google./search`). -* Google Shopping (`http://www.google./search?tbm=shop`). - -> Google SERP proxy can **only** be used for Google Search and Shopping. It cannot be used to access other websites. - -When using the proxy, **pricing is based on the number of requests made**. - -To use Google SERP proxy or for more information, [contact us](https://apify.com/contact). - -## Connecting to Google SERP proxy {#connecting-to-google-serp-proxy} - -Requests made through the proxy are automatically routed through a proxy server from the selected country and pure **HTML code of the search result page is returned**. - -**Important:** Only HTTP requests are allowed, and the Google hostname needs to start with the `www.` prefix. 
- -For code examples on how to connect to Google SERP proxies, see the [examples](./examples.md) page. - -### Username parameters {#username-parameters} - -The `username` field enables you to pass various [parameters](../connection_settings.md), such as groups and country, for your proxy connection. - -When using Google SERP proxy, the username should always be: - -```text -groups-GOOGLE_SERP -``` - -Unlike [datacenter](../datacenter_proxy/index.md) or [residential](../residential_proxy/index.md) proxies, there is no [session](../connection_settings.md) parameter. - -If you use the `country` [parameter](../connection_settings.md), the Google proxy location is used if you access a website whose hostname (stripped of `www.`) starts with **google**. - -## Country selection {#country-selection} - -You must use the correct Google domain to get results for your desired country code. - -For example: - -* Search results from the USA: `http://www.google.com/search?q=` - - -* Shopping results from Great Britain: `http://www.google.co.uk/seach?tbm=shop&q=` - -See a [full list](https://ipfs.io/ipfs/QmXoypizjW3WknFiJnKLwHCnL72vedxjQkDDP1mXWo6uco/wiki/List_of_Google_domains.html) of available domain names for specific countries. When using them, remember to prepend the domain name with the `www.` prefix. diff --git a/sources/platform/proxy/index.md b/sources/platform/proxy/index.md index c2aa9ada0..48f3fcca4 100644 --- a/sources/platform/proxy/index.md +++ b/sources/platform/proxy/index.md @@ -6,109 +6,97 @@ category: platform slug: /proxy --- +import Tabs from '@theme/Tabs'; +import TabItem from '@theme/TabItem'; +import Card from "@site/src/components/Card"; +import CardGrid from "@site/src/components/CardGrid"; + # [](./proxy) Proxy **Learn to anonymously access websites in scraping/automation jobs. 
Improve data outputs and efficiency of bots, and access websites from various geographies.** --- -[Apify Proxy](https://apify.com/proxy) allows you to change your IP address when web scraping to reduce the chance of being [blocked](/academy/anti-scraping/techniques) because of your geographical location. +> [Apify Proxy](https://apify.com/proxy) allows you to change your IP address when web scraping to reduce the chance of being [blocked](/academy/anti-scraping/techniques) because of your geographical location. You can use proxies in your [actors](../actors/index.mdx) or any other application that supports HTTP proxies. Apify Proxy monitors the health of your IP pool and intelligently [rotates addresses](#ip-address-rotation) to prevent IP address-based blocking. -**You can view your proxy settings and password on the [Proxy](https://console.apify.com/proxy) page in the Apify Console.** - -## Our proxies {#our-proxies} - -[Datacenter proxy](./datacenter_proxy/index.md) – the fastest and cheapest option, it uses datacenters to change your IP address. Note that there is a chance of being blocked because of the activity of other users. [[Code examples](./datacenter_proxy/examples.md)] - -[Residential proxy](./residential_proxy/index.md) – IP addresses located in homes and offices around the world. These IPs are the least likely to be blocked. [[How to connect](./residential_proxy/index.md)] - -[Google SERP proxy](./google_serp_proxy/index.md) – download and extract data from Google Search Engine Result Pages (SERPs). You can select country and language to get localized results. [[Code examples](./google_serp_proxy/examples.md)] - -**For pricing information, visit [apify.com/proxy](https://apify.com/proxy).** - -## Using your own proxies - -In addition to our proxies, you can use your own both in Apify Console and SDK. 
- -### Custom proxies in console {#console} - -To use your own proxies with Apify Console, in your actor's **Input and options** tab, scroll down and open the **Proxy and browser configuration** section. Enter your proxy URLs, and you're good to go. - -![Using custom proxy in Apify Console](../images/proxy-custom.png) - -### Custom proxies in SDK {#SDK} - -In the Apify SDK, use the `proxyConfiguration.newUrl(sessionId)` command to add your custom proxy URLs to the proxy configuration. See the [SDK docs](/sdk/js/api/apify/class/ProxyConfiguration#newUrl) for more details. - -## IP address rotation {#ip-address-rotation} - -Web scrapers can rotate the IP addresses they use to access websites. They assign each request a different IP address, which makes it appear like they are all coming from different users. This greatly enhances performance and data throughout. - -Depending on whether you use a [browser](https://apify.com/apify/web-scraper) or [HTTP requests](https://apify.com/apify/cheerio-scraper) for your scraping jobs, IP address rotation works differently. - -* Browser – a different IP address is used for each browser. -* HTTP request – a different IP address is used for each request. - -**You can use [sessions](#sessions) to manage how you rotate and [persist](#session-persistence) IP addresses.** - -[Click here](/academy/anti-scraping/techniques) to learn more about IP address rotation and our findings on how blocking works. - -## Sessions {#sessions} - -Sessions allow you to use the same IP address for multiple connections. - -To set a new session, pass the [`session`](./connection_settings.md) parameter in your [username](./connection_settings.md#username-parameters) field when connecting to a proxy. This will serve as the session's ID and an IP address will be assigned to it. To [use that IP address in other requests](./datacenter_proxy/examples.md#multiple-requests-with-the-same-ip-address), pass that same session ID in the username field. 
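One common way to realize the per-request rotation described above is the strategy the datacenter proxy docs describe: for each hostname, hand out the IP address that was used the longest time ago. A toy model (illustration only; the real rotation happens on Apify's proxy servers):

```python
class RotatingPool:
    """Toy model of IP address rotation: per hostname, pick the
    address that was used the longest time ago, so every address
    is cycled through before any is reused."""

    def __init__(self, ips):
        self.ips = list(ips)
        self._last_used = {}  # (hostname, ip) -> logical timestamp
        self._clock = 0

    def pick(self, hostname):
        self._clock += 1
        ip = min(self.ips, key=lambda ip: self._last_used.get((hostname, ip), 0))
        self._last_used[(hostname, ip)] = self._clock
        return ip
```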
-
-The created session will store information such as cookies and can be used to generate [browser fingerprints](https://pixelprivacy.com/resources/browser-fingerprinting/). You can also assign custom user data such as authorization tokens and specific headers.
-
-Sessions are available for [datacenter](./datacenter_proxy/index.md) and [residential](./residential_proxy/index.md#session-persistence) proxies.
-
-**This parameter is optional**. By default, each proxied request is assigned a randomly picked least used IP address.
+You can view your proxy settings and password on the [Proxy](https://console.apify.com/proxy) page in Apify Console. For pricing information, visit [apify.com/pricing](https://apify.com/pricing).
-### Session persistence {#session-persistence}
-You can persist your sessions (use the same IP address) by setting the `session` parameter in the `username` [field](./connection_settings.md). This assigns a single IP address to a **session ID** after you make the first request.
+## Quickstart {#quickstart}
-**Session IDs represent IP addresses. Therefore, you can manage the IP addresses you use by managing sessions.** In cases where you need to keep the same session (e.g. when you need to log in to a website), it is best to keep the same proxy. By assigning an IP address to a **session ID**, you can use that IP for every request you make.
+Using Apify Proxy takes just a couple of lines of code, thanks to our SDKs for [JavaScript](/sdk/js) and [Python](/sdk/python):
-For datacenter proxies, a session persists for **26 hours** ([more info](./datacenter_proxy/index.md)). For residential proxies, it persists for **1 minute** ([more info](./residential_proxy/index.md#session-persistence)). Using a session resets its expiry timer.
+
+
-Google SERP proxies do not support sessions.
+```javascript +import { Actor } from 'apify'; +import { PuppeteerCrawler } from 'crawlee'; -## Dead proxies {#dead-proxies} +await Actor.init(); -Our health check performs an HTTP and HTTPS request with each proxy server every few hours. If a server fails both requests 3 times in a row, it's marked as dead and all user sessions with this server are discarded. +const proxyConfiguration = await Actor.createProxyConfiguration(); -Banned proxies are not considered dead, since they become usable after a while. +const crawler = new PuppeteerCrawler({ + proxyConfiguration, + async requestHandler({ page }) { + console.log(await page.content()) + }, +}); -## A different approach to `502 Bad Gateway` +await crawler.run(['https://proxy.apify.com/?format=json']); -There are times when the `502` status code is not comprehensive enough. Therefore, we have modified our server with `590-599` codes instead to provide more insight. +await Actor.exit(); +``` -* `590 Non Successful`: upstream responded with non-200 status code. + + -* `591 RESERVED`: *this status code is reserved for further use.* +```python +import requests, asyncio +from apify import Actor -* `592 Status Code Out Of Range`: upstream responded with status code different than 100-999. +async def main(): + async with Actor: + proxy_configuration = await Actor.create_proxy_configuration() + proxy_url = await proxy_configuration.new_url() -* `593 Not Found`: DNS lookup failed - [`EAI_NODATA`](https://github.com/libuv/libuv/blob/cdbba74d7a756587a696fb3545051f9a525b85ac/include/uv.h#L82) or [`EAI_NONAME`](https://github.com/libuv/libuv/blob/cdbba74d7a756587a696fb3545051f9a525b85ac/include/uv.h#L83). + proxies = { + 'http': proxy_url, + 'https': proxy_url, + } -* `594 Connection Refused`: upstream refused connection. + response = requests.get('https://api.apify.com/v2/browser-info', proxies=proxies) + print(response.text) -* `595 Connection Reset`: connection reset due to loss of connection or timeout. 
+if __name__ == '__main__': + asyncio.run(main()) +``` -* `596 Broken Pipe`: trying to write on a closed socket. + + -* `597 Auth Failed`: incorrect upstream credentials. +## Proxy types {#proxy-types} -* `598 RESERVED`: *this status code is reserved for further use.* +There are several types of proxy servers, each of them with different advantages, disadvantages, and pricing. You can use them to access websites from various geographies and with different levels of anonymity. -* `599 Upstream Error`: generic upstream error. + + + + + -`590` and `592` indicate an issue on the upstream side.
-`593` indicates an incorrect `proxy-chain` configuration.
-`594`, `595` and `596` may occur due to connection loss.
-`597` indicates incorrect upstream credentials.
-`599` is a generic error, where the above is not applicable.
diff --git a/sources/platform/proxy/residential_proxy.md b/sources/platform/proxy/residential_proxy.md
new file mode 100644
index 000000000..4bd63b7c3
--- /dev/null
+++ b/sources/platform/proxy/residential_proxy.md
@@ -0,0 +1,158 @@
+---
+title: Residential proxy
+description: Achieve a higher level of anonymity using IP addresses from human users. Access a wider pool of proxies and reduce blocking by websites' anti-scraping measures.
+sidebar_position: 10.3
+slug: /proxy/residential-proxy
+---
+
+import Tabs from '@theme/Tabs';
+import TabItem from '@theme/TabItem';
+
+# Residential proxy {#residential-proxy}
+
+**Achieve a higher level of anonymity using IP addresses from human users. Access a wider pool of proxies and reduce blocking by websites' anti-scraping measures.**
+
+---
+
+Residential proxies use IP addresses assigned by Internet Service Providers to the homes and offices of actual users. Unlike [datacenter proxies](./datacenter_proxy.md), traffic from residential proxies is indistinguishable from that of legitimate users.
+
+This solution gives you access to a larger pool of servers than datacenter proxies offer. This makes it a better option in cases where you need a large number of different IP addresses.
+
+Residential proxies support [IP address rotation](./usage.md#ip-address-rotation) and [sessions](#session-persistence).
+
+**Pricing is based on data traffic**. It is measured for each connection made and displayed on your [proxy usage dashboard](https://console.apify.com/proxy/usage) in the Apify Console.
+
+## Connecting to residential proxy {#connecting-to-residential-proxy}
+
+Connecting to residential proxy works the same way as [datacenter proxy](./datacenter_proxy.md), with two differences:
+
+1. The `groups` [username parameter](./usage.md#username-parameters) should always specify `RESIDENTIAL`.
+
+2. You can specify the country in which you want your proxies to be.
+ +### How to set a proxy group {#how-to-set-a-proxy-group} + +When using [standard libraries and languages](./datacenter_proxy.md), specify the `groups` parameter in the [username](./usage.md#username-parameters) as `groups-RESIDENTIAL`. + +For example, your **proxy URL** when using the [got-scraping](https://www.npmjs.com/package/got-scraping) JavaScript library will look like this: + +```js +const proxyUrl = 'http://groups-RESIDENTIAL:@proxy.apify.com:8000'; +``` + +In the Apify SDK ([JavaScript](/sdk/js) and [Python](/sdk/python)), you set the **groups** in your proxy configuration: + + + + +```js +import { Actor } from 'apify'; + +await Actor.init(); +// ... +const proxyConfiguration = await Actor.createProxyConfiguration({ + groups: ['RESIDENTIAL'], +}); +// ... +await Actor.exit(); +``` + + + + +```python +from apify import Actor + +async with Actor: + # ... + proxy_configuration = await Actor.create_proxy_configuration(groups=['RESIDENTIAL']) + # ... + +``` + + + + +### How to set a proxy country {#how-to-set-a-proxy-country} + +When using [standard libraries and languages](./datacenter_proxy.md), specify the `country` parameter in the [username](./usage.md#username-parameters) as `country-COUNTRY-CODE`. + +For example, your `username` parameter when using [Python 3](https://docs.python.org/3/) will look like this: + +```python +username = "groups-RESIDENTIAL,country-JP" +``` + +In the Apify SDK ([JavaScript](/sdk/js) and [Python](/sdk/python)), you set the country in your proxy configuration using two-letter [country codes](https://laendercode.net/en/2-letter-list.html). Specify the groups as `RESIDENTIAL`, then add a `countryCode`/`country_code` parameter: + + + + +```js +import { Actor } from 'apify'; + +await Actor.init(); +// ... +const proxyConfiguration = await Actor.createProxyConfiguration({ + groups: ['RESIDENTIAL'], + countryCode: 'FR', +}); +// ... +await Actor.exit(); +``` + + + + +```python +from apify import Actor + +async with Actor: + # ... 
+ proxy_configuration = await Actor.create_proxy_configuration( + groups=['RESIDENTIAL'], + country_code='FR', + ) + # ... + +``` + + + + +## Session persistence {#session-persistence} + +When using residential proxy with the `session` [parameter](./usage.md#sessions) set in the [username](./usage.md#username-parameters), a single IP address is assigned to the **session ID** provided after you make the first request. + +**Session IDs represent IP addresses. Therefore, you can manage the IP addresses you use by managing sessions.** [[More info](./usage.md#sessions)] + +This IP/session ID combination is persisted for 1 minute. Each subsequent request resets the expiration time to 1 minute. + +If the proxy server becomes unresponsive or the session expires, a new IP address is selected for the next request. + +> If you really need to persist the same session, you can try sending some data using that session (e.g. every 20 seconds) to keep it alive.
+> Providing the connection is not interrupted, this will let you keep the IP address for longer. + +To learn more about [sessions](./usage.md#sessions) and [IP address rotation](./usage.md#ip-address-rotation), see the proxy [overview page](./index.md). + +## Tips to keep in mind {#tips-to-keep-in-mind} + +[Residential](./index.md) proxies are less predictable than [datacenter](./datacenter_proxy.md) proxies and are priced differently (by number of IPs vs traffic used). Because of this, there are some important things to consider before using residential proxy in your solutions. + +### Control traffic used by automated browsers {#control-traffic-used-by-automated-browsers} + +Residential proxy is priced by data traffic used. Thus, it's easy to quickly use up all your prepaid traffic. In particular, when accessing websites with large files loaded on every page. + +To reduce your traffic use, we recommend using the `blockRequests()` function of [`playwrightUtils`](https://crawlee.dev/api/playwright-crawler/namespace/playwrightUtils#blockRequests)/[`puppeteerUtils`](https://crawlee.dev/api/puppeteer-crawler/namespace/puppeteerUtils#blockRequests) (depending on the library used). + +### Connected proxy speed variation {#connected-proxy-speed-variation} + +Each host on the residential proxy network uses a different device. They have different network speeds and different latencies. This means that requests made with one [session](./usage.md#sessions) can be extremely fast, while another request with a different session can be extremely slow. The difference can range from a few milliseconds to a few seconds. + +If your solution requires quickly loaded content, the best option is to set a [session](./usage.md#sessions), try a small request and see if the response time is acceptable. If it is, you can use this session for other requests. Otherwise, repeat the attempt with a different session. 
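The probe-a-session-first approach described above can be sketched as plain code. This is an illustrative sketch, not Apify SDK code: the helper names (`residentialUsername`, `proxyUrl`, `findFastSession`), the 1-second threshold, and the password placeholder are all assumptions, and the `probe` callback stands in for whatever HTTP client you use through the proxy (for example got-scraping with its `proxyUrl` option), resolving to the measured response time in milliseconds.

```javascript
// Illustrative sketch – not Apify SDK code. `probe` should perform a small
// request through the given proxy URL and resolve to its latency in ms.

function residentialUsername(sessionId, countryCode) {
  // Build the proxy username, e.g. "groups-RESIDENTIAL,session-abc,country-US".
  const parts = ['groups-RESIDENTIAL', `session-${sessionId}`];
  if (countryCode) parts.push(`country-${countryCode}`);
  return parts.join(',');
}

function proxyUrl(username, password) {
  return `http://${username}:${password}@proxy.apify.com:8000`;
}

// Try a few candidate sessions and keep the first one whose small test
// request is fast enough; give up after `attempts` tries.
async function findFastSession(probe, password, maxLatencyMs = 1000, attempts = 5) {
  for (let i = 0; i < attempts; i += 1) {
    const username = residentialUsername(`probe_${Date.now()}_${i}`);
    try {
      const latencyMs = await probe(proxyUrl(username, password));
      if (latencyMs <= maxLatencyMs) return username; // reuse this session
    } catch {
      // Unresponsive host – fall through and try another session.
    }
  }
  return null; // no acceptable session found
}
```

Reusing the returned username in later requests keeps you on the fast session; a `null` result means none of the probed sessions met the threshold.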
+ +### Connection interruptions {#connection-interruptions} + +While sessions are persistent, they can be destroyed at any time if the host devices are turned off or disconnected. + +For this problem there is no easy solution. One option is to not use residential proxy for larger requests (and use [datacenter](./datacenter_proxy.md) proxy instead). If you have no other choice, expect that interruptions might happen and write your solution with this in mind. diff --git a/sources/platform/proxy/residential_proxy/index.md b/sources/platform/proxy/residential_proxy/index.md deleted file mode 100644 index b39e9b8ad..000000000 --- a/sources/platform/proxy/residential_proxy/index.md +++ /dev/null @@ -1,140 +0,0 @@ ---- -title: Residential proxy -description: Achieve a higher level of anonymity using IP addresses from human users. Access a wider pool of proxies and reduce blocking by websites' anti-scraping measures. -sidebar_position: 10.4 -slug: /proxy/residential-proxy ---- - -# Residential proxy {#residential-proxy} - -**Achieve a higher level of anonymity using IP addresses from human users. Access a wider pool of proxies and reduce blocking by websites' anti-scraping measures.** - ---- - -Residential proxies use IP addresses assigned by Internet Service Providers to the homes and offices of actual users. Unlike [datacenter proxies](../datacenter_proxy/index.md), traffic from residential proxies is indistinguishable from that of legitimate users. - -This solution allows you to access a larger pool of servers than datacenter proxy. This makes it a better option in cases when you need a large number of different IP addresses. - -Residential proxies support [IP address rotation](../index.md) and [sessions](#session-persistence). - -**Pricing is based on data traffic**. It is measured for each connection made and displayed on your [dashboard](https://console.apify.com) in the Apify Console. 
- -## Limitations {#limitations} - -Apify provides 2 levels of residential proxy: - -| Level | Availability | Supported domains | Other limitations | -|--------------------------------|---------------------------|---------------------|-----------------------------------------------------------------------------------------------| -| Restricted residential proxy | Every user | >500 domains | Requires [man-in-the-middle](https://crypto.stanford.edu/ssl-mitm/) access for the connection | -| Unrestricted residential proxy | Enterprise level accounts | Entire web | None | - -### Restricted residential proxy {#restricted-residential-proxy} - -Restricted residential proxy is available for all the users with some conditions. - -Firstly, restricted residential proxy supports only certain domains and paths. The pool of 500 domains is increased every month and covers the most frequent use cases. -Any traffic outside this pool of domains will go through [datacenter proxy](../datacenter_proxy/index.md). - -The second limitation is that restricted residential proxy uses the man-in-the-middle system to monitor traffic -and activities and so requires acceptance of an [SSL certificate](https://apify.com/restricted-residential-proxy-cert.crt). -This is automatically handled by [Apify SDK](/sdk/js/) and [Crawlee](https://crawlee.dev/) for both Puppeteer and Playwright. To manually check if a connection is using a man-in-the-middle connection, [head over to the Apify Proxy page](http://proxy.apify.com). - -### Unrestricted residential proxy {#unrestricted-residential-proxy} - -Unrestricted residential proxy neither limits the domains you can access nor requires a man-in-the-middle access to traffic. -However, it's provided only to enterprise-level accounts on a per-request basis and under an additional contract. - -[Contact us](https://apify.com/contact) if you would like to use the unrestricted residential proxy or for more information. 
- -## Connecting to residential proxy {#connecting-to-residential-proxy} - -Connecting to residential proxy works the same way as [datacenter proxy](../datacenter_proxy/examples.md), with two differences. - -1. The `groups` [username parameter](../connection_settings.md) should always specify `RESIDENTIAL`. - -2. You can specify the country in which you want your proxies to be. - -### How to set a proxy group {#how-to-set-a-proxy-group} - -When using [standard libraries and languages](../datacenter_proxy/examples.md), specify the `groups` parameter in the [username](../connection_settings.md#username-parameters) as `groups-RESIDENTIAL`. - -For example, your **proxy URL** when using the [got-scraping](https://www.npmjs.com/package/got-scraping) JavaScript library will look like this: - -```js -const proxyUrl = 'http://groups-RESIDENTIAL:@proxy.apify.com:8000'; -``` - -In the [Apify SDK](/sdk/js), you set the **group** in your [proxy configuration](/sdk/js/api/apify/interface/ProxyConfigurationOptions#groups): - -```js -import { Actor } from 'apify'; - -await Actor.init(); -// ... -const proxyConfiguration = await Actor.createProxyConfiguration({ - groups: ['RESIDENTIAL'], -}); -// ... -await Actor.exit(); -``` - -### How to set a proxy country {#how-to-set-a-proxy-country} - -When using [standard libraries and languages](../datacenter_proxy/examples.md), specify the `country` parameter in the [username](../connection_settings.md#username-parameters) as `country-COUNTRY-CODE`. - -For example, your `username` parameter when using [Python 3](https://docs.python.org/3/) will look like this: - -```python -username = "groups-RESIDENTIAL,session-my_session,country-JP" -``` - -In the [Apify SDK](/sdk/js), you set the country in your [proxy configuration](/sdk/js/api/apify/interface/ProxyConfigurationOptions#countryCode) using two-letter [country codes](https://laendercode.net/en/2-letter-list.html). Specify the groups as `RESIDENTIAL`, then add a `countryCode` parameter. 
- -```js -import { Actor } from 'apify'; - -await Actor.init(); -// ... -const proxyConfiguration = await Actor.createProxyConfiguration({ - groups: ['RESIDENTIAL'], - countryCode: 'FR', -}); -// ... -await Actor.exit(); -``` - -### Username examples {#username-examples} - -Use randomly allocated IP addresses from all available countries: - -```text -groups-RESIDENTIAL -``` - -A random proxy from the US: - -```text -groups-RESIDENTIAL,country-US -``` - -Set a session and select an IP address from the United States: - -```text -groups-RESIDENTIAL,session-my_session_1,country-US -``` - - -## Session persistence {#session-persistence} - -When using residential proxy with the `session` [parameter](../index.md) set in the [username](../connection_settings.md#username-parameters), a single IP address is assigned to the **session ID** provided after you make the first request. - -**Session IDs represent IP addresses. Therefore, you can manage the IP addresses you use by managing sessions.** [[More info](../index.md)] - -This IP/session ID combination is persisted for 1 minute. Each subsequent request resets the expiration time to 1 minute. - -If the proxy server becomes unresponsive or the session expires, a new IP address is selected for the next request. - -> If you really need to persist the same session, you can try sending some data using that session (e.g. every 20 seconds) to keep it alive.
-> Providing the connection is not interrupted, this will let you keep the IP address for longer. - -To learn more about [sessions](../index.md#sessions) and [IP address rotation](../index.md#ip-address-rotation), see the proxy [overview page](../index.md). diff --git a/sources/platform/proxy/residential_proxy/tips_and_tricks.md b/sources/platform/proxy/residential_proxy/tips_and_tricks.md deleted file mode 100644 index 4d188a8af..000000000 --- a/sources/platform/proxy/residential_proxy/tips_and_tricks.md +++ /dev/null @@ -1,31 +0,0 @@ ---- -title: Tips and tricks -description: Helpful tips for using your application with Apify's residential proxies. Control traffic, deal with interrupted connections and manage expenses. -slug: /proxy/residential-proxy/tips-and-tricks ---- - -# Tips and tricks {#tips-and-tricks} - -**Helpful tips for using your application with Apify's residential proxies. Control traffic, deal with interrupted connections and manage expenses.** - ---- - -[Residential](./index.md) proxies are less predictable than [datacenter](../datacenter_proxy/index.md) proxies and are priced differently (by number of IPs vs traffic used). Because of this, there are some important things to consider before using residential proxy in your solutions. - -## Control traffic used by automated browsers {#control-traffic-used-by-automated-browsers} - -Residential proxy is priced by data traffic used. Thus, it's easy to quickly use up all your prepaid traffic. In particular, when accessing websites with large files loaded on every page. - -To reduce your traffic use, we recommend using the `blockRequests()` function of [`playwrightUtils`](https://crawlee.dev/api/playwright-crawler/namespace/playwrightUtils#blockRequests)/[`puppeteerUtils`](https://crawlee.dev/api/puppeteer-crawler/namespace/puppeteerUtils#blockRequests) (depending on the library used). 
- -## Connected proxy speed variation {#connected-proxy-speed-variation} - -Each host on the residential proxy network uses a different device. They have different network speeds and different latencies. This means that requests made with one [session](../index.md) can be extremely fast, while another request with a different session can be extremely slow. The difference can range from a few milliseconds to a few seconds. - -If your solution requires quickly loaded content, the best option is to set a [session](../index.md), try a small request and see if the response time is acceptable. If it is, you can use this session for other requests. Otherwise, repeat the attempt with a different session. - -## Connection interruptions {#connection-interruptions} - -While sessions are persistent, they can be destroyed at any time if the host devices are turned off or disconnected. - -For this problem there is no easy solution. One option is to not use residential proxy for larger requests (and use [datacenter](../datacenter_proxy/index.md) proxy instead). If you have no other choice, expect that interruptions might happen and write your solution with this in mind. diff --git a/sources/platform/proxy/troubleshooting.md b/sources/platform/proxy/troubleshooting.md deleted file mode 100644 index df853fad3..000000000 --- a/sources/platform/proxy/troubleshooting.md +++ /dev/null @@ -1,25 +0,0 @@ ---- -title: Troubleshooting -description: Useful tips for debugging applications that use Apify Proxy. Check the status of your proxies and view information about the client IP address. -sidebar_position: 10.6 -slug: /proxy/troubleshooting ---- - -# Troubleshooting {#troubleshooting} - -**Useful tips for debugging applications that use Apify Proxy. 
Check the status of your proxies and view information about the client IP address.**
-
----
-
-To view your connection status to [Apify Proxy](https://apify.com/proxy), open the URL below in the browser using the proxy:
-
-[http://proxy.apify.com/](http://proxy.apify.com/)
-
-If the proxy connection is working, the page should look something like this:
-
-![Apify proxy status page](./images/proxy-status.png)
-
-To test that your requests are proxied and IP addresses are being [rotated](/academy/anti-scraping/techniques) correctly, open the following API endpoint via the proxy. It shows information about the client IP address.
-
-[https://api.apify.com/v2/browser-info/](https://api.apify.com/v2/browser-info/)
-
diff --git a/sources/platform/proxy/usage.md b/sources/platform/proxy/usage.md
new file mode 100644
index 000000000..ef99ffb5c
--- /dev/null
+++ b/sources/platform/proxy/usage.md
@@ -0,0 +1,154 @@
+---
+title: Usage
+description: Learn how to configure and use Apify Proxy. See the required parameters such as the correct username and password.
+sidebar_position: 10.1
+slug: /proxy/usage
+---
+
+# Usage
+
+**Learn how to configure and use Apify Proxy. See the required parameters such as the correct username and password.**
+
+## Connection settings
+
+To connect to Apify Proxy, you use the [HTTP proxy protocol](https://en.wikipedia.org/wiki/Proxy_server#Web_proxy_servers). This means that you need to configure your HTTP client to use the proxy server at `proxy.apify.com:8000` and provide it with your Apify Proxy password and the other parameters described below.
+
+The full connection string has the following format:
+
+```text
+http://<username>:<password>@proxy.apify.com:8000
+```
+
+| Parameter | Value / explanation |
+|-----------|---------------------|
+| Proxy type | `HTTP` |
+| Hostname | `proxy.apify.com`, alternatively you can use the static IP addresses `18.208.102.16` or `35.171.134.41`. |
+| Port | `8000` |
+| Username | Specifies the proxy parameters such as groups, [session](#sessions) and location.<br/>See [username parameters](#username-parameters) below for details.<br/>**Note**: this is not your Apify username. |
+| Password | Proxy password. Your password is displayed on the [Proxy](https://console.apify.com/proxy/groups) page in Apify Console. In Apify [actors](../actors/index.mdx), it is passed as the `APIFY_PROXY_PASSWORD` environment variable. See the [environment variables docs](../actors/development/programming_interface/environment_variables.md) for more details. |
+
+> **WARNING:** All usage of Apify Proxy with your password is charged towards your account. Do not share the password with untrusted parties or use it from insecure networks – **the password is sent unencrypted** due to the HTTP protocol's [limitations](https://www.guru99.com/difference-http-vs-https.html).
+
+### Username parameters
+
+The `username` field enables you to pass parameters like **[groups](#proxy-groups)**, **[session ID](#sessions)** and **country** for your proxy connection.
+
+For example, if you're using [datacenter proxies](./datacenter_proxy.md) and want to use the `new_job_123` session with the `SHADER` group, the username will be:
+
+```text
+groups-SHADER,session-new_job_123
+```
+
+The table below describes the available parameters.
+
+| Parameter | Type | Description |
+|-----------|------|-------------|
+| `groups` | Required | Set proxied requests to use servers from the selected groups:<br/>- `groups-[group name]` or `auto` when using datacenter proxies.<br/>- `groups-RESIDENTIAL` when using residential proxies.<br/>- `groups-GOOGLE_SERP` when using Google SERP proxies. |
+| `session` | Optional | If specified to `session-new_job_123`, for example, all proxied requests with the same session identifier are routed through the same IP address. If not specified, each proxied request is assigned a randomly picked least used IP address.<br/><br/>The session string can only contain numbers (0-9), letters (a-z or A-Z), dot (.), underscore (_), a tilde (~). The maximum length is 50 characters.<br/><br/>Session management may work differently for residential and SERP proxies. Check the relevant documentation for more details. |
+| `country` | Optional | If specified, all proxied requests will use proxy servers from a selected country. Note that if there are no proxy servers from the specified country, the connection will fail. For example `groups-SHADER,country-US` uses proxies from the `SHADER` group located in the USA. By default, the proxy uses all available proxy servers from all countries. |
+
+If you want to specify one parameter and not the others, just provide that parameter and omit the others. To use the default behavior (not specifying either `groups`, `session`, or `country`), set the username to **auto**. **auto** serves as a placeholder because the username can't be empty.
+
+## Code examples
+
+We have code examples for connecting to our proxy using the Apify SDK ([JavaScript](/sdk/js) and [Python](/sdk/python)), [Crawlee](https://crawlee.dev/), and other libraries, as well as examples in PHP:
+
+* [Datacenter proxy](./datacenter_proxy.md#examples)
+* [Residential proxy](./residential_proxy.md#connecting-to-residential-proxy)
+* [Google SERP proxy](./google_serp_proxy.md#examples)
+
+For code examples related to proxy management in the Apify SDK and Crawlee, see:
+
+* [Apify SDK JavaScript](/sdk/js/docs/guides/proxy-management)
+* [Apify SDK Python](/sdk/python/docs/concepts/proxy-management)
+* [Crawlee](https://crawlee.dev/docs/guides/proxy-management)
+
+## IP address rotation {#ip-address-rotation}
+
+Web scrapers can rotate the IP addresses they use to access websites. They assign each request a different IP address, which makes it appear like they are all coming from different users. This greatly enhances performance and data throughput.
+
+Depending on whether you use a [browser](https://apify.com/apify/web-scraper) or [HTTP requests](https://apify.com/apify/cheerio-scraper) for your scraping jobs, IP address rotation works differently.
+
+* Browser – a different IP address is used for each browser.
+* HTTP request – a different IP address is used for each request.
+
+Use [sessions](#sessions) to control how you rotate and [persist](#session-persistence) IP addresses. See our guide [Anti-scraping techniques](/academy/anti-scraping/techniques) to learn more about IP address rotation and our findings on how blocking works.
+
+## Sessions {#sessions}
+
+Sessions allow you to use the same IP address for multiple connections.
In cases where you need to keep the same session (e.g. when you need to log in to a website), it is best to keep the same proxy, and thus the same IP address. On the other hand, by switching the IP address, you can avoid being blocked by the website.
+
+To set a new session, pass the `session` parameter in your [username](./usage.md#username-parameters) field when connecting to a proxy. This will serve as the session's ID, and an IP address will be assigned to it. To [use that IP address in other requests](./datacenter_proxy.md#multiple-requests-with-the-same-ip-address), pass that same session ID in the username field.
+
+We recommend using the [SessionPool](https://crawlee.dev/api/core/class/SessionPool) abstraction when managing sessions. The created session will then store information such as cookies and can be used to generate [browser fingerprints](/academy/anti-scraping/mitigation/generating-fingerprints). You can also assign custom user data such as authorization tokens and specific headers.
+
+Sessions are available for [datacenter](./datacenter_proxy.md) and [residential](./residential_proxy.md#session-persistence) proxies. For datacenter proxies, a session persists for **26 hours** ([more info](./datacenter_proxy.md)). For residential proxies, it persists for **1 minute**, but you can prolong its lifetime by regularly using the session ([more info](./residential_proxy.md#session-persistence)). Google SERP proxies do not support sessions.
+
+## Proxy groups
+
+You can see which proxy groups you have access to on the [Proxy page](https://console.apify.com/proxy/groups) in the Apify Console. To use a specific proxy group (or multiple groups), specify it in the `username` parameter.
+
+## Troubleshooting
+
+To view your connection status to [Apify Proxy](https://apify.com/proxy), open [http://proxy.apify.com/](http://proxy.apify.com/) in a browser that uses the proxy.
If the proxy connection is working, the page should look something like this:
+
+![Apify proxy status page](./images/proxy-status.png)
+
+To test that your requests are proxied and IP addresses are being [rotated](/academy/anti-scraping/techniques) correctly, open the following API endpoint via the proxy. It shows information about the client IP address.
+
+[https://api.apify.com/v2/browser-info/](https://api.apify.com/v2/browser-info/)
+
+### A different approach to `502 Bad Gateway`
+
+There are times when the `502` status code is not comprehensive enough. Therefore, we have modified our server to return `590-599` codes instead, which provide more insight:
+
+* `590 Non Successful`: upstream responded with non-200 status code.
+* `591 RESERVED`: *this status code is reserved for further use.*
+* `592 Status Code Out Of Range`: upstream responded with status code different than 100-999.
+* `593 Not Found`: DNS lookup failed - [`EAI_NODATA`](https://github.com/libuv/libuv/blob/cdbba74d7a756587a696fb3545051f9a525b85ac/include/uv.h#L82) or [`EAI_NONAME`](https://github.com/libuv/libuv/blob/cdbba74d7a756587a696fb3545051f9a525b85ac/include/uv.h#L83).
+* `594 Connection Refused`: upstream refused connection.
+* `595 Connection Reset`: connection reset due to loss of connection or timeout.
+* `596 Broken Pipe`: trying to write on a closed socket.
+* `597 Auth Failed`: incorrect upstream credentials.
+* `598 RESERVED`: *this status code is reserved for further use.*
+* `599 Upstream Error`: generic upstream error.
+
+The typical issues behind these codes are:
+
+* `590` and `592` indicate an issue on the upstream side.
+* `593` indicates an incorrect `proxy-chain` configuration.
+* `594`, `595` and `596` may occur due to connection loss.
+* `597` indicates incorrect upstream credentials.
+* `599` is a generic error, where the above is not applicable.
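In a client, the grouping above maps naturally onto a retry decision. The following is a minimal sketch assuming a simple policy (retry transient failures with a new session, never retry configuration or credential problems); the function names are illustrative and not part of any Apify library.

```javascript
// Illustrative sketch – groups the 590-599 codes by the meanings listed above.
function classifyProxyError(statusCode) {
  switch (statusCode) {
    case 590: case 592: case 599:
      return 'upstream-error';   // issue on the upstream side / generic error
    case 593:
      return 'config-error';     // incorrect proxy-chain configuration
    case 594: case 595: case 596:
      return 'connection-lost';  // transient loss of connection
    case 597:
      return 'auth-failed';      // incorrect upstream credentials
    default:
      return 'other';            // includes the reserved 591 and 598
  }
}

// Only transient failures are worth retrying through a different session;
// configuration and credential problems need fixing first.
function shouldRetryWithNewSession(statusCode) {
  const kind = classifyProxyError(statusCode);
  return kind === 'upstream-error' || kind === 'connection-lost';
}
```

A retry loop would call `shouldRetryWithNewSession` on each failed response and, when it returns `true`, repeat the request with a fresh `session` parameter in the username.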
diff --git a/sources/platform/proxy/your_own_proxies.md b/sources/platform/proxy/your_own_proxies.md
new file mode 100644
index 000000000..17f5834de
--- /dev/null
+++ b/sources/platform/proxy/your_own_proxies.md
@@ -0,0 +1,20 @@
+---
+title: Using your own proxies
+description: Use your own proxies while using the Apify platform.
+sidebar_position: 10.5
+slug: /proxy/using-your-own-proxies
+---
+
+# Using your own proxies
+
+In addition to our proxies, you can use your own proxies, both in Apify Console and the SDK.
+
+## Custom proxies in console {#console}
+
+To use your own proxies with Apify Console, in your actor's **Input and options** tab, scroll down and open the **Proxy and browser configuration** section. Enter your proxy URLs, and you're good to go.
+
+![Using custom proxy in Apify Console](../images/proxy-custom.png)
+
+## Custom proxies in SDK {#SDK}
+
+In the Apify SDK, use the `proxyConfiguration.newUrl(sessionId)` (JavaScript) or `proxy_configuration.new_url(session_id)` (Python) function to retrieve one of your custom proxy URLs from the proxy configuration. See the [JavaScript](/sdk/js/api/apify/class/ProxyConfiguration#newUrl) or [Python](/sdk/python/reference/class/ProxyConfiguration#new_url) SDK docs for more details.
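To see what calling `newUrl(sessionId)` with a list of custom proxy URLs does, the rotation behaviour can be sketched in isolation. This is a simplified model for illustration, not the SDK's actual `ProxyConfiguration` implementation: plain round-robin over your URLs, except that a known session ID keeps its previously assigned URL.

```javascript
// Simplified model of custom-proxy rotation – not the real ProxyConfiguration.
class CustomProxyRotator {
  constructor(proxyUrls) {
    this.proxyUrls = proxyUrls;
    this.nextIndex = 0;
    this.sessionMap = new Map(); // sessionId -> sticky proxy URL
  }

  newUrl(sessionId) {
    // A session that already has a URL keeps it (sticky sessions).
    if (sessionId !== undefined && this.sessionMap.has(sessionId)) {
      return this.sessionMap.get(sessionId);
    }
    // Otherwise hand out URLs round-robin.
    const url = this.proxyUrls[this.nextIndex % this.proxyUrls.length];
    this.nextIndex += 1;
    if (sessionId !== undefined) this.sessionMap.set(sessionId, url);
    return url;
  }
}
```

With the real JavaScript SDK you would instead pass your URLs to `Actor.createProxyConfiguration({ proxyUrls: [...] })` and call `newUrl()` on the result; the sticky-session behaviour is the part this sketch is meant to illustrate.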