Merge pull request #1010 from honzajavorek/honzajavorek/spelling
fix: spelling of several terms
honzajavorek authored May 22, 2024
2 parents 657da03 + 260a9ea commit a883f83
Showing 20 changed files with 46 additions and 46 deletions.
2 changes: 1 addition & 1 deletion sources/academy/glossary/concepts/dynamic_pages.md
@@ -36,6 +36,6 @@ Sometimes, it can be quite obvious when content is dynamically being rendered. F

![Image](https://blog.apify.com/content/images/2022/02/dynamicLoading-1--1--2.gif)

-Here, it's very clear that new content is being generated. As we scroll down the Twitter feed, we can see the scroll bar jumping back up, signifying that more elements have been created using Javascript.
+Here, it's very clear that new content is being generated. As we scroll down the Twitter feed, we can see the scroll bar jumping back up, signifying that more elements have been created using JavaScript.

Other times, it's less obvious though. Content can appear to be static (non-dynamic) when it is not, or even sometimes the other way around.
4 changes: 2 additions & 2 deletions sources/academy/glossary/tools/apify_cli.md
@@ -15,15 +15,15 @@ The [Apify CLI](/cli) helps you create, develop, build and run Apify actors, and

## Installing {#installing}

-To install the Apfiy CLI, you'll first need NPM, which comes preinstalled with Node.js. If you haven't yet installed Node, learn how to do that [here](../../webscraping/web_scraping_for_beginners/data_extraction/computer_preparation.md). Additionally, make sure you've got an Apify account, as you will need to log in to the CLI to gain access to its full potential.
+To install the Apify CLI, you'll first need npm, which comes preinstalled with Node.js. If you haven't yet installed Node, learn how to do that [here](../../webscraping/web_scraping_for_beginners/data_extraction/computer_preparation.md). Additionally, make sure you've got an Apify account, as you will need to log in to the CLI to gain access to its full potential.

Open up a terminal instance and run the following command:

```shell
npm i -g apify-cli
```

-This will install the CLI via NPM.
+This will install the CLI via npm.

## Logging in {#logging-in}

@@ -11,7 +11,7 @@ slug: /tools/quick-javascript-switcher

---

-**Quick Javascript Switcher** is a very simple Chrome extension that allows you to switch on/off the JavaScript for the current page with one click. It can be added to your browser via the [Chrome Web Store](https://chrome.google.com/webstore/category/extensions). After adding it to Chrome, you'll see its respective button next to any other Chrome extensions you might have installed.
+**Quick JavaScript Switcher** is a very simple Chrome extension that allows you to switch on/off the JavaScript for the current page with one click. It can be added to your browser via the [Chrome Web Store](https://chrome.google.com/webstore/category/extensions). After adding it to Chrome, you'll see its respective button next to any other Chrome extensions you might have installed.

If JavaScript is enabled, clicking the button will switch it off and reload the page. The next click will re-enable JavaScript and refresh the page. This extension is useful for checking whether a certain website will work without JavaScript (and thus could be parsed with plain HTTP requests, without a browser) or not.

4 changes: 2 additions & 2 deletions sources/academy/glossary/tools/user_agent_switcher.md
@@ -15,14 +15,14 @@ slug: /tools/user-agent-switcher

![User-Agent Switcher groups](./images/user-agent-switcher-groups.png)

-Clicking on a group will display a list of possible user-agents to set.
+Clicking on a group will display a list of possible User-Agents to set.

![Default available Internet Explorer agents](./images/user-agent-switcher-agents.png)

After setting the **User-Agent**, the page will be refreshed.

## Configuration

-The extension configuration page allows you to edit the **User-Agent** list in case you want to add a specific user-agent that isn't already provided. You can find some other options, but most likely you will never need to modify those.
+The extension configuration page allows you to edit the **User-Agent** list in case you want to add a specific User-Agent that isn't already provided. You can find some other options, but most likely you will never need to modify those.

![User-Agent Switcher configuration page](./images/user-agent-switcher-config.png)
10 changes: 5 additions & 5 deletions sources/academy/platform/deploying_your_code/docker_file.md
@@ -49,22 +49,22 @@ Here's the Dockerfile for our Node.js example project's actor:
FROM apify/actor-node:16

# Second, copy just package.json and package-lock.json since they are the only files
-# that affect NPM install in the next step
+# that affect npm install in the next step
COPY package*.json ./

-# Install NPM packages, skip optional and development dependencies to keep the
+# Install npm packages, skip optional and development dependencies to keep the
# image small. Avoid logging too much and print the dependency tree for debugging
RUN npm --quiet set progress=false \
&& npm install --only=prod --no-optional \
&& echo "Installed NPM packages:" \
&& echo "Installed npm packages:" \
&& (npm list --all || true) \
&& echo "Node.js version:" \
&& node --version \
&& echo "NPM version:" \
&& echo "npm version:" \
&& npm --version

# Next, copy the remaining files and directories with the source code.
-# Since we do this after NPM install, quick build will be really fast
+# Since we do this after npm install, quick build will be really fast
# for simple source file changes.
COPY . ./

@@ -18,15 +18,15 @@ You can use one of the two main ways to programmatically interact with the Apify
## Learning 🧠 {#learning}

- Scroll through the [Apify API docs](/api/v2) (there's a whole lot there, so you're not expected to memorize everything).
-- Read about the Apify client in [Apify's docs](/api/client/js). It can also be seen on [GitHub](https://github.com/apify/apify-client-js) and [NPM](https://www.npmjs.com/package/apify-client).
+- Read about the Apify client in [Apify's docs](/api/client/js). It can also be seen on [GitHub](https://github.com/apify/apify-client-js) and [npm](https://www.npmjs.com/package/apify-client).
- Learn about the [`Actor.newClient()`](/sdk/js/reference/class/Actor#newClient) function in the Apify SDK (see the sketch after this list).
- Skim through [this article](https://help.apify.com/en/articles/2868670-how-to-pass-data-from-web-scraper-to-another-actor) about API integration (this article is old; however, still relevant).
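
A minimal sketch of `Actor.newClient()` in use (the `user('me')` lookup is just an example call, and a valid Apify token is assumed to be available in the environment):

```js
import { Actor } from 'apify';

await Actor.init();

// Returns an ApifyClient instance preconfigured with the token
// of the current run, so no manual authentication is needed.
const client = Actor.newClient();

// Example call: fetch details about the account the token belongs to.
const user = await client.user('me').get();
console.log(`Authenticated as: ${user.username}`);

await Actor.exit();
```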

## Knowledge check 📝 {#quiz}

1. What is the relationship between the Apify API and the Apify client? Are there any significant differences?
2. How do you pass input when running an actor or task via API?
-3. Do you need to install the `apify-client` NPM package when already using the `apify` package?
+3. Do you need to install the `apify-client` npm package when already using the `apify` package?

## Our task

@@ -236,7 +236,7 @@ The one main difference is that the Apify client automatically uses [**exponenti

**A:** The input should be passed into the **body** of the request when running an actor/task via API.
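
As an illustration, starting a run via the API with `fetch` might look like this (the actor ID, token, and input fields are placeholders):

```js
// Placeholders: substitute your own actor ID, API token, and input.
const response = await fetch(
    'https://api.apify.com/v2/acts/ACTOR_ID/runs?token=MY_APIFY_TOKEN',
    {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        // The actor input goes into the JSON body of the request.
        body: JSON.stringify({ startUrls: [{ url: 'https://example.com' }] }),
    },
);

const { data: run } = await response.json();
console.log(`Started run with ID: ${run.id}`);
```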

-**Q: Do you need to install the `apify-client` NPM package when already using the `apify` package?**
+**Q: Do you need to install the `apify-client` npm package when already using the `apify` package?**

**A:** No. The Apify client is available right in the SDK with the `Actor.newClient()` function.

2 changes: 1 addition & 1 deletion sources/academy/platform/getting_started/apify_client.md
@@ -26,7 +26,7 @@ You can access `apify-client` examples in the Console Actor detail page. Click t

## Installing and importing {#installing-and-importing}

-If you are going to use the client in Node.js, use this command within one of your projects to install the package through NPM:
+If you are going to use the client in Node.js, use this command within one of your projects to install the package through npm:

```shell
npm install apify-client
@@ -30,7 +30,7 @@ Some websites do not load any data without a browser, as they need to execute so

## Making the choice {#making-the-choice}

-When choosing which scraper to use, we would suggest first checking whether the website works without JavaScript or not. Probably the easiest way to do so is to use the [Quick Javascript Switcher](../../glossary/tools/quick_javascript_switcher.md) extension for Chrome. If JavaScript is not needed, or you've spotted some XHR requests in the **Network** tab with the data you need, you probably won't need to use an automated browser browser. You can then check what data is received in response using [Postman](../../glossary/tools/postman.md) or [Insomnia](../../glossary/tools/insomnia.md) or try to send a few requests programmatically. If the data is there and you're not blocked straight away, a request-based scraper is probably the way to go.
+When choosing which scraper to use, we would suggest first checking whether the website works without JavaScript or not. Probably the easiest way to do so is to use the [Quick JavaScript Switcher](../../glossary/tools/quick_javascript_switcher.md) extension for Chrome. If JavaScript is not needed, or you've spotted some XHR requests in the **Network** tab with the data you need, you probably won't need to use an automated browser. You can then check what data is received in response using [Postman](../../glossary/tools/postman.md) or [Insomnia](../../glossary/tools/insomnia.md) or try to send a few requests programmatically. If the data is there and you're not blocked straight away, a request-based scraper is probably the way to go.
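
As a sketch of that programmatic check, a probe with `got-scraping` might look like this (the endpoint URL is hypothetical):

```js
import { gotScraping } from 'got-scraping';

// If this returns the data you need, a plain HTTP scraper
// is probably sufficient and no browser is required.
const { statusCode, body } = await gotScraping({
    url: 'https://example.com/api/products?page=1', // hypothetical endpoint
    responseType: 'json',
});

console.log(statusCode, body);
```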

It also depends of course on whether you need to fill in some data (like a username and password) or select a location (such as entering a zip code manually). Tasks where interacting with the page is absolutely necessary cannot be done using plain HTTP scraping, and require headless browsers. In some cases, you might also decide to use a browser-based solution in order to better blend in with the rest of the "regular" traffic coming from real users.

6 changes: 3 additions & 3 deletions sources/academy/webscraping/anti_scraping/index.md
@@ -111,13 +111,13 @@ Because we here at Apify scrape for a living, we have discovered many popular an
### IP rate-limiting

-This is the most straightforward and standard protection, which is mainly implemented to prevent DDOS attacks, but it also works for blocking scrapers. Websites using rating don't allow to more than some defined number of requests from one IP address in a certain time span. If the max-request number is low, then there is a high potential for false-positive due to IP address uniqueness, such as in large companies where hundreds of employees can share the same IP address.
+This is the most straightforward and standard protection, which is mainly implemented to prevent DDoS attacks, but it also works for blocking scrapers. Websites using rate limiting don't allow more than a defined number of requests from one IP address in a certain time span. If the max-request number is low, there is a high potential for false positives, since IP addresses aren't always unique to a single person: in large companies, hundreds of employees can share the same IP address.
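
To illustrate the principle (not any particular website's implementation), a fixed-window limiter might look like this:

```js
// Illustrative only: allow at most 100 requests per IP per minute.
const WINDOW_MS = 60_000;
const LIMIT = 100;
const hits = new Map(); // ip -> { windowStart, count }

function allowRequest(ip, now = Date.now()) {
    const entry = hits.get(ip);
    if (!entry || now - entry.windowStart >= WINDOW_MS) {
        hits.set(ip, { windowStart: now, count: 1 });
        return true;
    }
    entry.count += 1;
    return entry.count <= LIMIT;
}
```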

> Learn more about rate limiting [here](./techniques/rate_limiting.md)
### Header checking

-This type of bot identification is based on the given fact that humans are accessing web pages through browsers, which have specific [header](../../glossary/concepts/http_headers.md) sets which they send along with every request. The most commonly known header that helps to detect bots is the `user-agent` header, which holds a value that identifies which browser is being used, and what version it's running. Though `user-agent` is the most commonly used header for the **Header checking** method, other headers are sometimes used as well. The evaluation is often also run based on the header consistency, and includes a known combination of browser headers.
+This type of bot identification is based on the fact that humans access web pages through browsers, which send specific [header](../../glossary/concepts/http_headers.md) sets along with every request. The most commonly known header that helps to detect bots is the `User-Agent` header, which holds a value that identifies which browser is being used, and what version it's running. Though `User-Agent` is the most commonly used header for the **Header checking** method, other headers are sometimes used as well. The evaluation often also checks header consistency against known combinations of browser headers.
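
On the scraper's side, libraries such as `got-scraping` can generate a consistent, browser-like header set automatically; a sketch (the option values are illustrative):

```js
import { gotScraping } from 'got-scraping';

// got-scraping's header generator produces a matching set of
// browser headers (User-Agent, Accept, sec-ch-ua, and so on).
const response = await gotScraping({
    url: 'https://example.com', // hypothetical target
    headerGeneratorOptions: {
        browsers: [{ name: 'chrome', minVersion: 100 }],
        operatingSystems: ['windows'],
        devices: ['desktop'],
    },
});

console.log(response.statusCode);
```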

### URL analysis

@@ -131,7 +131,7 @@ One of the best ways of avoiding the possible breaking of your scraper due to we

### IP session consistency

-This technique is commonly used to entirely block the bot from accessing the website altogether. It works on the principle that every entity that accesses the site gets a token. This token is then saved together with the IP address and HTTP request information such as user-agent and other specific headers. If the entity makes another request, but without the session token, the IP address is added on the greylist.
+This technique is commonly used to block a bot from accessing the website entirely. It works on the principle that every entity that accesses the site gets a token. This token is then saved together with the IP address and HTTP request information such as User-Agent and other specific headers. If the entity makes another request, but without the session token, the IP address is added to the greylist.
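
A rough server-side sketch of that principle (purely illustrative; the header name is made up):

```js
const sessions = new Map(); // token -> { ip, userAgent }
const greylist = new Set();

function checkRequest(req) {
    const token = req.headers['x-session-token']; // hypothetical header
    if (!token || !sessions.has(token)) {
        // Request without a known session token: greylist the IP.
        greylist.add(req.ip);
        return false;
    }
    // The token must come from the same IP and User-Agent it was issued to.
    const session = sessions.get(token);
    return session.ip === req.ip
        && session.userAgent === req.headers['user-agent'];
}
```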

### Interval analysis

@@ -1,13 +1,13 @@
---
title: Generating fingerprints
-description: Learn how to use two super handy NPM libraries to easily generate fingerprints and inject them into a Playwright or Puppeteer page.
+description: Learn how to use two super handy npm libraries to easily generate fingerprints and inject them into a Playwright or Puppeteer page.
sidebar_position: 3
slug: /anti-scraping/mitigation/generating-fingerprints
---

# Generating fingerprints {#generating-fingerprints}

-**Learn how to use two super handy NPM libraries to easily generate fingerprints and inject them into a Playwright or Puppeteer page.**
+**Learn how to use two super handy npm libraries to easily generate fingerprints and inject them into a Playwright or Puppeteer page.**

---
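
Assuming the two libraries are `fingerprint-generator` and `fingerprint-injector` (the option values below are illustrative), the core idea looks roughly like this:

```js
import { chromium } from 'playwright';
import { newInjectedPage } from 'fingerprint-injector';

const browser = await chromium.launch();

// Generates a matching fingerprint under the hood and injects it
// into the page before any site scripts run.
const page = await newInjectedPage(browser, {
    fingerprintOptions: {
        devices: ['desktop'],
        operatingSystems: ['windows'],
    },
});

await page.goto('https://example.com'); // hypothetical target
console.log(await page.evaluate(() => navigator.userAgent));

await browser.close();
```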

@@ -140,4 +140,4 @@ Notice that we didn't provide it a list of proxy URLs. This is because the `SHAD

## Next up {#next}

-[Next up](./generating_fingerprints.md), we'll be checking out how to use two NPM packages to generate and inject [browser fingerprints](../techniques/fingerprinting.md).
+[Next up](./generating_fingerprints.md), we'll be checking out how to use two npm packages to generate and inject [browser fingerprints](../techniques/fingerprinting.md).
@@ -11,15 +11,15 @@ slug: /anti-scraping/techniques/browser-challenges
## Browser challenges

-Browser challenges are a type of security measure that relies on browser fingerprints. These challenges typically involve a javascript script that collects both static and dynamic browser fingerprints. Static fingerprints include attributes such as user-agent, video card, and number of CPU cores available. Dynamic fingerprints, on the other hand, might involve rendering fonts or objects in the canvas (known as a [canvas fingerprint](./fingerprinting.md#with-canvases)), or playing audio in the [AudioContext](./fingerprinting.md#from-audiocontext). We were covering the details in the previous [fingerprinting](./fingerprinting.md) lesson.
+Browser challenges are a type of security measure that relies on browser fingerprints. These challenges typically involve a JavaScript program that collects both static and dynamic browser fingerprints. Static fingerprints include attributes such as User-Agent, video card, and number of CPU cores available. Dynamic fingerprints, on the other hand, might involve rendering fonts or objects in the canvas (known as a [canvas fingerprint](./fingerprinting.md#with-canvases)), or playing audio in the [AudioContext](./fingerprinting.md#from-audiocontext). We covered the details in the previous [fingerprinting](./fingerprinting.md) lesson.
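
For instance, a canvas fingerprint, one of the dynamic signals mentioned above, boils down to something like this browser-side sketch:

```js
// Runs in the browser: render text to a canvas and serialize the result.
const canvas = document.createElement('canvas');
const ctx = canvas.getContext('2d');
ctx.font = '14px Arial';
ctx.fillText('fingerprint-probe', 2, 16);

// The exact pixels differ subtly across GPUs, drivers, and installed
// fonts, so the serialized image doubles as a device identifier.
const signature = canvas.toDataURL();
console.log(signature.slice(0, 48));
```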

While some browser challenges are relatively straightforward (for example, loading an image and checking whether it renders correctly), others can be much more complex. One well-known example of a complex browser challenge is Cloudflare's browser screen check. In this challenge, Cloudflare visually inspects the browser screen and blocks the first request if any inconsistencies are found. This approach provides an extra layer of protection against automated attacks.

Many online protections incorporate browser challenges into their security measures, but the specific techniques used can vary.

## Cloudflare browser challenge

-One of the most well-known browser challenges is the one used by Cloudflare. Cloudflare has a massive dataset of legitimate canvas fingerprints and user-agent pairs, which they use in conjunction with machine learning algorithms to detect any device property spoofing. This might include spoofed user-agents, operating systems, or GPUs.
+One of the most well-known browser challenges is the one used by Cloudflare. Cloudflare has a massive dataset of legitimate canvas fingerprints and User-Agent pairs, which they use in conjunction with machine learning algorithms to detect any device property spoofing. This might include spoofed User-Agent headers, operating systems, or GPUs.

![Cloudflare browser check](https://images.ctfassets.net/slt3lc6tev37/55EYMR81XJCIG5uxLjQQOx/252a98adf90fa0ff2f70437cc5c0a3af/under-attack-mode_enabled.gif)

@@ -36,7 +36,7 @@ To make sure we're all on the same page, we're going to set up the project toget
npm init -y && npm install graphql-tag puppeteer got-scraping
```

-This command will first initialize the project with NPM, then will install the `puppeteer`, `graphql-tag`, and `got-scraping` packages, which we will need in this lesson.
+This command will first initialize the project with npm, then will install the `puppeteer`, `graphql-tag`, and `got-scraping` packages, which we will need in this lesson.

Finally, create a file called **index.js**. This is the file we will be working in for the rest of the lesson.

@@ -113,7 +113,7 @@ Also in the previous lesson, we learned that the **media** type is dependent on
query SearchQuery($query: String!, $max_age: Int!) {
organization {
media(query: $query, max_age: $max_age , first: 1000) {

}
}
}
@@ -190,7 +190,7 @@ const GET_LATEST = gql`
`;
```
-Alternatively, if you don't want to write your GraphQL queries right within your Javascript code, you can write them in files using the **.graphql** format, then read them from the filesystem or import them.
+Alternatively, if you don't want to write your GraphQL queries right within your JavaScript code, you can write them in files using the **.graphql** format, then read them from the filesystem or import them.
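
For example, reading a query from a file might look like this (the file path is hypothetical):

```js
import { readFileSync } from 'node:fs';
import gql from 'graphql-tag';

// Read the query from a standalone .graphql file and parse it at runtime.
const source = readFileSync('./queries/getLatest.graphql', 'utf8');
const GET_LATEST = gql`${source}`;
```
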
> In order to receive nice GraphQL syntax highlighting in these template literals, download the [GraphQL VSCode extension](https://marketplace.visualstudio.com/items?itemName=GraphQL.vscode-graphql)
@@ -56,7 +56,7 @@ This means that when using TS (a popular acronym for "TypeScript") on a large pr

## How different is TypeScript from JavaScript? {#how-different-is-it}

-Think of it this way: Javascript **IS** Typescript, but TypeScript isn't JavaScript. All JavaScript code is valid TypeScript code, which means that you can pretty much turn any **.js** file into a **.ts** file and it'll still work just the same after being compiled. It also means that to learn TypeScript, you aren't going to have to learn a whole new programming language if you already know JavaScript.
+Think of it this way: JavaScript **IS** TypeScript, but TypeScript isn't JavaScript. All JavaScript code is valid TypeScript code, which means that you can pretty much turn any **.js** file into a **.ts** file and it'll still work just the same after being compiled. It also means that to learn TypeScript, you aren't going to have to learn a whole new programming language if you already know JavaScript.

What are the differences? Well, there's really just one: TypeScript files cannot be run directly. They must first be compiled into regular JavaScript.
