From 7a09574ffaf6fc4cc71a397b7074eae50e4176df Mon Sep 17 00:00:00 2001 From: Honza Javorek Date: Wed, 24 Apr 2024 17:55:02 +0200 Subject: [PATCH 1/2] fix: typos and stylistic improvements --- .../apify_scrapers/getting_started.md | 2 +- .../filter_blocked_requests_using_sessions.md | 2 +- .../academy/webscraping/anti_scraping/index.md | 6 +++--- .../anti_scraping/mitigation/proxies.md | 18 ++++++++++++++---- .../anti_scraping/techniques/fingerprinting.md | 2 +- .../anti_scraping/techniques/rate_limiting.md | 2 +- .../handling_pagination.md | 2 +- .../locating_and_learning.md | 2 +- .../puppeteer_playwright/browser_contexts.md | 2 +- .../executing_scripts/extracting_data.md | 6 +----- .../puppeteer_playwright/page/page_methods.md | 2 +- .../puppeteer_playwright/proxies.md | 2 +- .../reading_intercepting_requests.md | 2 +- .../data_extraction/browser_devtools.md | 4 +--- 14 files changed, 29 insertions(+), 25 deletions(-) diff --git a/sources/academy/tutorials/apify_scrapers/getting_started.md b/sources/academy/tutorials/apify_scrapers/getting_started.md index 6a8aa11d1..f8460e173 100644 --- a/sources/academy/tutorials/apify_scrapers/getting_started.md +++ b/sources/academy/tutorials/apify_scrapers/getting_started.md @@ -290,7 +290,7 @@ The scraper: ## [](#scraping-practice) Scraping practice -We've covered all the concepts that we need to understand to successfully scrape the data in our goal, so let's get to it. We will only output data that are already available to us in the page's URL. Remember from [our goal](#the-goal) that we also want to include the **URL** and a **Unique identifier** in our results. To get those, we need the `request.url` because it is the URL and includes the Unique identifier. +We've covered all the concepts that we need to understand to successfully scrape the data in our goal, so let's get to it. We will only output data that are already available to us in the page's URL. 
Remember from [our goal](#the-goal) that we also want to include the **URL** and a **Unique identifier** in our results. To get those, we need the `request.url`, because it is the URL and includes the Unique identifier. ```js const { url } = request; diff --git a/sources/academy/tutorials/node_js/filter_blocked_requests_using_sessions.md b/sources/academy/tutorials/node_js/filter_blocked_requests_using_sessions.md index 318c66b48..56a82f186 100644 --- a/sources/academy/tutorials/node_js/filter_blocked_requests_using_sessions.md +++ b/sources/academy/tutorials/node_js/filter_blocked_requests_using_sessions.md @@ -23,7 +23,7 @@ You want to crawl a website with a proxy pool, but most of your proxies are bloc Nobody can make sure that a proxy will work infinitely. The only real solution to this problem is to use [residential proxies](/platform/proxy#residential-proxy), but they can sometimes be too costly. -However, usually, at least some of our proxies work. To crawl successfully, it is therefore imperative to handle blocked requests properly. You first need to discover that you are blocked, which usually means that either your request returned status greater or equal to 400 (it didn't return the proper response) or that the page displayed a captcha. To ensure that this bad request is retried, you usually throw an error and it gets automatically retried later (our [SDK](/sdk/js/) handles this for you). Check out [this article](https://docs.apify.com/academy/node-js/handle-blocked-requests-puppeteer) as inspiration for how to handle this situation with `PuppeteerCrawler`  class. +However, usually, at least some of our proxies work. To crawl successfully, it is therefore imperative to handle blocked requests properly. You first need to discover that you are blocked, which usually means that either your request returned status greater or equal to 400 (it didn't return the proper response) or that the page displayed a captcha. 
To ensure that this bad request is retried, you usually throw an error and it gets automatically retried later (our [SDK](/sdk/js/) handles this for you). Check out [this article](https://docs.apify.com/academy/node-js/handle-blocked-requests-puppeteer) as inspiration for how to handle this situation with `PuppeteerCrawler` class. ### Solution diff --git a/sources/academy/webscraping/anti_scraping/index.md b/sources/academy/webscraping/anti_scraping/index.md index 5afa7e654..27d432723 100644 --- a/sources/academy/webscraping/anti_scraping/index.md +++ b/sources/academy/webscraping/anti_scraping/index.md @@ -12,7 +12,7 @@ slug: /anti-scraping --- -If at any point in time you've strayed away from the Academy's demo content, and into the wild west by writing some scrapers of your own, you may have been hit with anti-scraping measures. This is extremely common in the scraping world; however, the good thing is that there are always solutions. +If at any point in time you've strayed away from the Academy's demo content, and into the Wild West by writing some scrapers of your own, you may have been hit with anti-scraping measures. This is extremely common in the scraping world; however, the good thing is that there are always solutions. This section covers the essentials of mitigating anti-scraping protections, such as proxies, HTTP headers and cookies, and a few other things to consider when working on a reliable and scalable crawler. Proper usage of the methods taught in the next lessons will allow you to extract data which is specific to a certain location, enable your crawler to browse websites as a logged-in user, and more. @@ -91,7 +91,7 @@ A common workflow of a website after it has detected a bot goes as follows: 2. A [Turing test](https://en.wikipedia.org/wiki/Turing_test) is provided to the bot. Typically a **captcha**. If the bot succeeds, it is added to the whitelist. 3. If the captcha is failed, the bot is added to the blacklist. 
-One thing to keep in mind while navigating through this course is that advanced scraping methods are able to identify non-humans not only by one value (such as a single header value, or IP address), but are able to identify them through more complex things such as header combinations. +One thing to keep in mind while navigating through this course is that advanced anti-scraping methods are able to identify non-humans not only by one value (such as a single header value, or IP address), but are able to identify them through more complex things such as header combinations. Watch a conference talk by [Ondra Urban](https://github.com/mnmkng), which provides an overview of various anti-scraping measures and tactics for circumventing them. @@ -111,7 +111,7 @@ Because we here at Apify scrape for a living, we have discovered many popular an ### IP rate-limiting -This is the most straightforward and standard protection, which is mainly implemented to prevent DDoS attacks, but it also works for blocking scrapers. Websites using rating don't allow to more than some defined number of requests from one IP address in a certain time span. If the max-request number is low, then there is a high potential for false-positive due to IP address uniqueness, such as in large companies where hundreds of employees can share the same IP address. +This is the most straightforward and standard protection, which is mainly implemented to prevent DDoS attacks, but it also works for blocking scrapers. Websites using rate limiting don't allow more than some defined number of requests from one IP address in a certain time span. If the max-request number is low, then there is a high potential for false positives due to IP address uniqueness, such as in large companies where hundreds of employees can share the same IP address. 
> Learn more about rate limiting [here](./techniques/rate_limiting.md) diff --git a/sources/academy/webscraping/anti_scraping/mitigation/proxies.md b/sources/academy/webscraping/anti_scraping/mitigation/proxies.md index bbbb16fd2..2498f1c40 100644 --- a/sources/academy/webscraping/anti_scraping/mitigation/proxies.md +++ b/sources/academy/webscraping/anti_scraping/mitigation/proxies.md @@ -26,17 +26,27 @@ Although IP quality is still the most important factor when it comes to using pr Fixing rate-limiting issues is only the tip of the iceberg of what proxies can do for your scrapers, though. By implementing proxies properly, you can successfully avoid the majority of anti-scraping measures listed in the [previous lesson](../index.md). -## A bit about proxy links {#understanding-proxy-links} +## About proxy links {#understanding-proxy-links} -When using proxies in your crawlers, you'll most likely be using them in a format that looks like this: +To use a proxy, you need a proxy link, which contains the connection details, sometimes including credentials. ```text http://proxy.example.com:8080 ``` -This link is separated into two main components: the **host**, and the **port**. In our case, our hostname is `http://proxy.example.com`, and our port is `8080`. Sometimes, a proxy might use an IP address as the host, such as `103.130.104.33`. +The proxy link above has several parts: -If authentication (a username and a password) is required, the format will look a bit different: +- `http://` tells us we're using HTTP protocol, +- `proxy.example.com` is a hostname, i.e. an address to the proxy server, +- `8080` is a port number. 
+ Sometimes the proxy server has no name, so the link contains an IP address instead: + +```text +http://203.0.113.10:8080 +``` + +If the proxy requires authentication, the proxy link can contain a username and password: ```text http://USERNAME:PASSWORD@proxy.example.com:8080 diff --git a/sources/academy/webscraping/anti_scraping/techniques/fingerprinting.md b/sources/academy/webscraping/anti_scraping/techniques/fingerprinting.md index 12890576f..1fadca91f 100644 --- a/sources/academy/webscraping/anti_scraping/techniques/fingerprinting.md +++ b/sources/academy/webscraping/anti_scraping/techniques/fingerprinting.md @@ -176,7 +176,7 @@ The script is modified with some random JavaScript elements. Additionally, it al ### Data obfuscation -Two main data obfuscation techniues are widely employed: +Two main data obfuscation techniques are widely employed: 1. **String splitting** uses the concatenation of multiple substrings. It is mostly used alongside an `eval()` or `document.write()`. 2. **Keyword replacement** allows the script to mask the accessed properties. This allows the script to have a random order of the substrings and makes it harder to detect. diff --git a/sources/academy/webscraping/anti_scraping/techniques/rate_limiting.md b/sources/academy/webscraping/anti_scraping/techniques/rate_limiting.md index a5d7e47a3..b61cd06ab 100644 --- a/sources/academy/webscraping/anti_scraping/techniques/rate_limiting.md +++ b/sources/academy/webscraping/anti_scraping/techniques/rate_limiting.md @@ -17,7 +17,7 @@ In the past, most websites had their own anti-scraping solutions, the most commo In cases when a higher number of requests is expected for the crawler, using a [proxy](../mitigation/proxies.md) and rotating the IPs is essential to let the crawler run as smoothly as possible and avoid being blocked. 
-## Dealing rate limiting with proxy/session rotating {#dealing-with-rate-limiting} +## Dealing with rate limiting by rotating proxy or session {#dealing-with-rate-limiting} The most popular and effective way of avoiding rate-limiting issues is by rotating [proxies](../mitigation/proxies.md) after every **n** number of requests, which makes your scraper appear as if it is making requests from various different places. Since the majority of rate-limiting solutions are based on IP addresses, rotating IPs allows a scraper to make large amounts to a website without getting restricted. diff --git a/sources/academy/webscraping/api_scraping/general_api_scraping/handling_pagination.md b/sources/academy/webscraping/api_scraping/general_api_scraping/handling_pagination.md index 7782b8fad..412f777e8 100644 --- a/sources/academy/webscraping/api_scraping/general_api_scraping/handling_pagination.md +++ b/sources/academy/webscraping/api_scraping/general_api_scraping/handling_pagination.md @@ -37,7 +37,7 @@ If we were to make a request with the **limit** set to **5** and the **offset** ## Cursor pagination {#cursor-pagination} -Becoming more and more common is cursor-based pagination. Like with offset-based pagination, a **limit** parameter is usually present; however, instead of **offset**, **cursor** is used instead. A cursor is just a marker (sometimes a token, a date, or just a number) for an item in the dataset. All results returned back from the API will be records that come after the item matching the **cursor** parameter provided. +Sometimes pagination uses a **cursor** instead of an **offset**. A cursor is a marker of an item in the dataset. It can be a date, a number, or a more or less random string of letters and numbers. A request with a **cursor** parameter will result in an API response containing the items that follow the item the cursor points to. 
One of the most painful things about scraping APIs with cursor pagination is that you can't skip to, for example, the 5th page. You have to paginate through each page one by one. diff --git a/sources/academy/webscraping/api_scraping/general_api_scraping/locating_and_learning.md b/sources/academy/webscraping/api_scraping/general_api_scraping/locating_and_learning.md index 8342f186b..d8909832b 100644 --- a/sources/academy/webscraping/api_scraping/general_api_scraping/locating_and_learning.md +++ b/sources/academy/webscraping/api_scraping/general_api_scraping/locating_and_learning.md @@ -21,7 +21,7 @@ _Here's what we can see in the Network tab after reloading the page:_ Let's say that our target data is a full list of Tiësto's uploaded songs on SoundCloud. We can use the **Filter** option to search for the keyword `tracks`, and see if any endpoints have been hit that include that word. Multiple results may still be in the list when using this feature, so it is important to carefully examine the payloads and responses of each request in order to ensure that the correct one is found. -> **Note:** The keyword/piece of data that is used in this filtered search should be a target keyword or a piece of target data that that can be assumed will most likely be a part of the endpoint. +> To find what we're looking for, we must wisely choose what piece of data (in this case a keyword) we filter by. Think of something that is most likely to be part of the endpoint (in this case a string `tracks`). 
After a little bit of digging through the different response values of each request in our filtered list within the Network tab, we can discover this endpoint, which returns a JSON list including 20 of Tiësto's latest tracks: diff --git a/sources/academy/webscraping/puppeteer_playwright/browser_contexts.md b/sources/academy/webscraping/puppeteer_playwright/browser_contexts.md index 3fd94756c..8891772f1 100644 --- a/sources/academy/webscraping/puppeteer_playwright/browser_contexts.md +++ b/sources/academy/webscraping/puppeteer_playwright/browser_contexts.md @@ -77,7 +77,7 @@ await browser.close(); ## Using browser contexts {#using-browser-contexts} -In both Playwright and Puppeteer, various devices (iPhones, iPads, Androids, etc.) can be emulated by using [`playwright.devices`](https://playwright.dev/docs/api/class-playwright#playwright-devices) or [`puppeteer.devices`](https://pptr.dev/#?product=Puppeteer&version=v14.1.0&show=api-puppeteerdevices). We'll be using this to create two different browser contexts, one emulating an iPhone, and one emulating an Android: +In both Playwright and Puppeteer, various devices (iPhones, iPads, Androids, etc.) can be emulated by using [`playwright.devices`](https://playwright.dev/docs/api/class-playwright#playwright-devices) or [`puppeteer.devices`](https://pptr.dev/#?product=Puppeteer&version=v14.1.0&show=api-puppeteerdevices). 
We'll be using this to create two different browser contexts, one emulating an iPhone, and one emulating an Android device: diff --git a/sources/academy/webscraping/puppeteer_playwright/executing_scripts/extracting_data.md b/sources/academy/webscraping/puppeteer_playwright/executing_scripts/extracting_data.md index 861ebfc08..42461c011 100644 --- a/sources/academy/webscraping/puppeteer_playwright/executing_scripts/extracting_data.md +++ b/sources/academy/webscraping/puppeteer_playwright/executing_scripts/extracting_data.md @@ -14,11 +14,7 @@ import TabItem from '@theme/TabItem'; --- -Now that we know how to execute scripts on a page, we're ready to learn a bit about [data extraction](../../scraping_basics_javascript/data_extraction/index.md). In this lesson, we'll be scraping all the on-sale products from our [Fakestore](https://demo-webstore.apify.org/search/on-sale) website. - -> Most web data extraction cases involve looping through a list of items of some sort. - -Playwright & Puppeteer offer two main methods for data extraction +Now that we know how to execute scripts on a page, we're ready to learn a bit about [data extraction](../../scraping_basics_javascript/data_extraction/index.md). In this lesson, we'll be scraping all the on-sale products from our [Fakestore](https://demo-webstore.apify.org/search/on-sale) website. Playwright & Puppeteer offer two main methods for data extraction: 1. Directly in `page.evaluate()` and other evaluate functions such as `page.$$eval()`. 2. 
In the Node.js context using a parsing library such as [Cheerio](https://www.npmjs.com/package/cheerio) diff --git a/sources/academy/webscraping/puppeteer_playwright/page/page_methods.md b/sources/academy/webscraping/puppeteer_playwright/page/page_methods.md index 40671a324..7874517f6 100644 --- a/sources/academy/webscraping/puppeteer_playwright/page/page_methods.md +++ b/sources/academy/webscraping/puppeteer_playwright/page/page_methods.md @@ -14,7 +14,7 @@ import TabItem from '@theme/TabItem'; --- -Other than having methods for interacting with a page and waiting for events and elements, the **Page** object also supports various methods for doing other things, such as [reloading](https://pptr.dev/#?product=Puppeteer&version=v13.7.0&show=api-pagereloadoptions), [screenshotting](https://playwright.dev/docs/api/class-page#page-screenshot), [changing headers](https://playwright.dev/docs/api/class-page#page-set-extra-http-headers), and extracting the [page's content](https://pptr.dev/#?product=Puppeteer&show=api-pagecontent). +Other than having methods for interacting with a page and waiting for events and elements, the **Page** object also supports various methods for doing other things, such as [reloading](https://pptr.dev/api/puppeteer.page.reload), [screenshotting](https://playwright.dev/docs/api/class-page#page-screenshot), [changing headers](https://playwright.dev/docs/api/class-page#page-set-extra-http-headers), and extracting the [page's content](https://pptr.dev/api/puppeteer.page.content/). Last lesson, we left off at a point where we were waiting for the page to navigate so that we can extract the page's title and take a screenshot of it. In this lesson, we'll be learning about the two methods we can use to achieve both of those things. 
diff --git a/sources/academy/webscraping/puppeteer_playwright/proxies.md b/sources/academy/webscraping/puppeteer_playwright/proxies.md index 556638ab6..60c1d0441 100644 --- a/sources/academy/webscraping/puppeteer_playwright/proxies.md +++ b/sources/academy/webscraping/puppeteer_playwright/proxies.md @@ -169,7 +169,7 @@ const browser = await puppeteer.launch({ -However, authentication parameters need to be passed in separately in order to work. In Puppeteer, the username and password need to be passed into the `page.authenticate()` prior to any navigations being made, while in Playwright they can be passed into the **proxy** option object. +However, authentication parameters need to be passed in separately in order to work. In Puppeteer, the username and password need to be passed to the `page.authenticate()` prior to any navigations being made, while in Playwright they can be passed to the **proxy** option object. diff --git a/sources/academy/webscraping/puppeteer_playwright/reading_intercepting_requests.md b/sources/academy/webscraping/puppeteer_playwright/reading_intercepting_requests.md index 0d2c2449a..2d7d4d007 100644 --- a/sources/academy/webscraping/puppeteer_playwright/reading_intercepting_requests.md +++ b/sources/academy/webscraping/puppeteer_playwright/reading_intercepting_requests.md @@ -231,7 +231,7 @@ Upon running this code, we'll see the API response logged into the console: One of the most popular ways of speeding up website loading in Puppeteer and Playwright is by blocking certain resources from loading. These resources are usually CSS files, images, and other miscellaneous resources that aren't super necessary (mainly because the computer doesn't have eyes - it doesn't care how the website looks!). -In Puppeteer, we must first enable request interception with the `page.setRequestInterception()` function. Then, we can check whether or not the request's resource ends with one of our blocked extensions. If so, we'll abort the request. 
Otherwise, we'll let it continue. All of this logic will still be within the `page.on()` method. +In Puppeteer, we must first enable request interception with the `page.setRequestInterception()` function. Then, we can check whether or not the request's resource ends with one of our blocked file extensions. If so, we'll abort the request. Otherwise, we'll let it continue. All of this logic will still be within the `page.on()` method. With Playwright, request interception is a bit different. We use the [`page.route()`](https://playwright.dev/docs/api/class-page#page-route) function instead of `page.on()`, passing in a string, regular expression, or a function that will match the URL of the request we'd like to read from. The second parameter is also a callback function, but with the [**Route**](https://playwright.dev/docs/api/class-route) object passed into it instead. diff --git a/sources/academy/webscraping/scraping_basics_javascript/data_extraction/browser_devtools.md b/sources/academy/webscraping/scraping_basics_javascript/data_extraction/browser_devtools.md index 39b8b08a6..e4d24df9b 100644 --- a/sources/academy/webscraping/scraping_basics_javascript/data_extraction/browser_devtools.md +++ b/sources/academy/webscraping/scraping_basics_javascript/data_extraction/browser_devtools.md @@ -9,9 +9,7 @@ slug: /web-scraping-for-beginners/data-extraction/browser-devtools --- -Even though DevTools stands for developer tools, everyone can use them to inspect a website. Each major browser has its own DevTools. We will use Chrome DevTools as an example, but the advice is applicable to any browser, as the tools are extremely similar. To open Chrome DevTools, you can press **F12** or right-click anywhere in the page and choose **Inspect**. - -Now go to [Wikipedia](https://www.wikipedia.org/) and open your DevTools there. Inspecting the same website as us will make this lesson easier to follow. 
+Even though DevTools stands for developer tools, everyone can use them to inspect a website. Each major browser has its own DevTools. We will use Chrome DevTools as an example, but the advice is applicable to any browser, as the tools are extremely similar. To open Chrome DevTools, you can press **F12** or right-click anywhere in the page and choose **Inspect**. Now go to [Wikipedia](https://www.wikipedia.org/) and open your DevTools there. ![Wikipedia with Chrome DevTools open](./images/browser-devtools-wikipedia.png) From 8a5b2420ff890cc53a6084f3a5ad88db58f4413e Mon Sep 17 00:00:00 2001 From: Honza Javorek Date: Thu, 12 Sep 2024 15:03:56 +0200 Subject: [PATCH 2/2] style: turn blockquote to an admonition --- .../general_api_scraping/locating_and_learning.md | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/sources/academy/webscraping/api_scraping/general_api_scraping/locating_and_learning.md b/sources/academy/webscraping/api_scraping/general_api_scraping/locating_and_learning.md index d8909832b..e7c4062d1 100644 --- a/sources/academy/webscraping/api_scraping/general_api_scraping/locating_and_learning.md +++ b/sources/academy/webscraping/api_scraping/general_api_scraping/locating_and_learning.md @@ -21,7 +21,11 @@ _Here's what we can see in the Network tab after reloading the page:_ Let's say that our target data is a full list of Tiësto's uploaded songs on SoundCloud. We can use the **Filter** option to search for the keyword `tracks`, and see if any endpoints have been hit that include that word. Multiple results may still be in the list when using this feature, so it is important to carefully examine the payloads and responses of each request in order to ensure that the correct one is found. -> To find what we're looking for, we must wisely choose what piece of data (in this case a keyword) we filter by. Think of something that is most likely to be part of the endpoint (in this case a string `tracks`). 
+:::note Filtering requests + +To find what we're looking for, we must wisely choose what piece of data (in this case a keyword) we filter by. Think of something that is most likely to be part of the endpoint (in this case a string `tracks`). + +::: After a little bit of digging through the different response values of each request in our filtered list within the Network tab, we can discover this endpoint, which returns a JSON list including 20 of Tiësto's latest tracks: