fix(academy): typos, updates and clarifications (#1218)
- fix typos, mainly excess**
- update google accept cookies
- update google search element selection
- logical correction
honzajavorek authored Oct 8, 2024
2 parents ec5b323 + e17d550 commit e54ba89
Showing 10 changed files with 29 additions and 27 deletions.
@@ -7,7 +7,7 @@ slug: /expert-scraping-with-apify/actors-webhooks

# Webhooks & advanced Actor overview {#webhooks-and-advanced-actors}

- **Learn more advanced details about Actors, how they work, and the default configurations they can take. **Also**,** learn how** to integrate your Actor with webhooks.**
+ **Learn more advanced details about Actors, how they work, and the default configurations they can take. Also, learn how to integrate your Actor with webhooks.**

---

@@ -40,6 +40,8 @@ const dataset = await Actor.openDataset(datasetId);
// ...
```

+ > Tip: You will need to use `forceCloud` option - `Actor.openDataset(<name/id>, { forceCloud: true });` - to open dataset from platform storage while running Actor locally.
Next, we'll grab hold of the dataset's items with the `dataset.getData()` function:

```js
@@ -141,7 +143,7 @@ https://api.apify.com/v2/acts/USERNAME~filter-actor/runs?token=YOUR_TOKEN_HERE
Whichever one you choose is totally up to your preference.
- Next, within the Actor, we will click the **Integrations** tab and choose **Webhook**, then fill out the details to look like this:
+ Next, within the Amazon scraping Actor, we will click the **Integrations** tab and choose **Webhook**, then fill out the details to look like this:
![Configuring a webhook](./images/adding-webhook.jpg)
@@ -163,7 +165,7 @@ Additionally, we should be able to see that our **filter-actor** was run, and ha
**Q: How do you allocate more CPU for an Actor's run?**
- **A:** On the platform, more memory can be allocated in the Actor's input configuration, and the default allocated CPU can be changed in the Actor's **Settings** tab. When running locally, you can use the **APIFY_MEMORY_MBYTES**** environment variable to set the allocated CPU. 4GB is equal to 1 CPU core on the Apify platform.
+ **A:** On the platform, more memory can be allocated in the Actor's input configuration, and the default allocated CPU can be changed in the Actor's **Settings** tab. When running locally, you can use the **APIFY_MEMORY_MBYTES** environment variable to set the allocated CPU. 4GB is equal to 1 CPU core on the Apify platform.
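As an aside, the 4 GB = 1 CPU core relationship stated in the answer can be sketched as a tiny helper (illustrative only, not part of the Apify SDK):

```javascript
// On the Apify platform, allocated CPU scales linearly with memory:
// 4096 MB (4 GB) corresponds to 1 full CPU core.
// Hypothetical helper for illustration, not an SDK function.
function allocatedCpuCores(memoryMbytes) {
    return memoryMbytes / 4096;
}

console.log(allocatedCpuCores(4096)); // 1 (one full core)
console.log(allocatedCpuCores(1024)); // 0.25 (a quarter of a core)
```

So setting **APIFY_MEMORY_MBYTES** to `8192` locally would emulate a run with two cores' worth of CPU.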
**Q: Within itself, can you get the exact time that an Actor was started?**
@@ -50,7 +50,7 @@ const crawler = new CheerioCrawler({
});
```

- Now, we'll use the **maxUsageCount** key to force each session to be thrown away after 5 uses and **maxErrorScore**** to trash a session once it receives an error.
+ Now, we'll use the **maxUsageCount** key to force each session to be thrown away after 5 uses and **maxErrorScore** to trash a session once it receives an error.
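As an aside, the retirement rules these two options configure can be modeled in plain JavaScript (a simplified, hypothetical model for intuition, not Crawlee's actual implementation):

```javascript
// Simplified model of when a session pool retires a session:
// either it has been used too many times, or it has accumulated
// too high an error score. Not Crawlee's real Session class.
class FakeSession {
    constructor({ maxUsageCount = 5, maxErrorScore = 3 } = {}) {
        this.maxUsageCount = maxUsageCount;
        this.maxErrorScore = maxErrorScore;
        this.usageCount = 0;
        this.errorScore = 0;
    }

    use() {
        this.usageCount += 1;
    }

    markError() {
        this.errorScore += 1;
    }

    get retired() {
        return this.usageCount >= this.maxUsageCount
            || this.errorScore >= this.maxErrorScore;
    }
}

const session = new FakeSession({ maxUsageCount: 5, maxErrorScore: 1 });
session.use();
session.markError();
console.log(session.retired); // true, a single error retires it with maxErrorScore: 1
```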

```js
const crawler = new CheerioCrawler({
@@ -63,7 +63,7 @@ await Stats.initialize();
## Tracking errors {#tracking-errors}
- In order to keep track of errors, we must write a new function within the crawler's configuration called **failedRequestHandler**. Passed into this function is an object containing an **Error** object for the error which occurred and the **Request** object, as well as information about the session and proxy which were used for the request.
+ In order to keep track of errors, we must write a new function within the crawler's configuration called **errorHandler**. Passed into this function is an object containing an **Error** object for the error which occurred and the **Request** object, as well as information about the session and proxy which were used for the request.
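As an aside, a minimal version of the `Stats` error tracker referenced here could look like this (a hypothetical sketch; the course's actual `Stats` object does more, such as persisting its state):

```javascript
// Minimal in-memory error tracker keyed by URL.
// Hypothetical sketch of what Stats.addError might do internally.
const Stats = {
    errors: {},
    addError(url, message) {
        if (!this.errors[url]) this.errors[url] = [];
        this.errors[url].push(message ?? 'Unknown error');
    },
};

Stats.addError('https://example.com/1', 'Request timed out');
Stats.addError('https://example.com/1', 'Blocked (403)');
console.log(Stats.errors['https://example.com/1'].length); // 2
```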
```js
const crawler = new CheerioCrawler({
@@ -79,7 +79,7 @@ const crawler = new CheerioCrawler({
maxConcurrency: 50,
requestHandler: router,
// Handle all failed requests
-     failedRequestHandler: async ({ error, request }) => {
+     errorHandler: async ({ error, request }) => {
// Add an error for this url to our error tracker
Stats.addError(request.url, error?.message);
},
@@ -67,7 +67,7 @@ That's it! Now, our Actor will push its data to a dataset named **amazon-offers-

We now want to store the cheapest item in the default key-value store under a key named **CHEAPEST-ITEM**. The most efficient and practical way of doing this is by filtering through all of the newly named dataset's items and pushing the cheapest one to the store.

- Let's add the following code to the bottom of the Actor after **Crawl** finished** is logged to the console:
+ Let's add the following code to the bottom of the Actor after **Crawl finished** is logged to the console:
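As an aside, the "filter for the cheapest item" logic can be sketched in plain JavaScript (the `{ offer: '$19.99' }` item shape is an assumption for illustration, not the Actor's exact schema):

```javascript
// Pick the cheapest item from a list of dataset items.
// The { offer: '$19.99' } shape is assumed for illustration.
function findCheapest(items) {
    // Strip currency symbols and parse the numeric price
    const price = (item) => Number(item.offer.replace(/[^\d.]/g, ''));
    return items.reduce((cheapest, item) => (price(item) < price(cheapest) ? item : cheapest));
}

const cheapest = findCheapest([
    { offer: '$29.99' },
    { offer: '$19.99' },
    { offer: '$24.50' },
]);
console.log(cheapest.offer); // '$19.99'
```

The resulting object is what would then be saved to the key-value store under **CHEAPEST-ITEM** with `Actor.setValue()`.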

```js
// ...
@@ -40,7 +40,7 @@ Once again, we'll be adding onto our main Amazon-scraping Actor in this activity

We have decided that we want to retain the data scraped by the Actor for a long period of time, so instead of pushing to the default dataset, we will be pushing to a named dataset. Additionally, we want to save the absolute cheapest item found by the scraper into the default key-value store under a key named **CHEAPEST-ITEM**.

- Finally, we'll create a task for the Actor that saves the configuration with the **keyword** set to **google pixel****.
+ Finally, we'll create a task for the Actor that saves the configuration with the **keyword** set to **google pixel**.
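As an aside, the saved task input would then be a JSON object along these lines (the **keyword** field name is taken from this course's Actor input; the exact schema may differ):

```json
{
    "keyword": "google pixel"
}
```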

[**Solution**](./solutions/using_storage_creating_tasks.md)

@@ -36,15 +36,15 @@ Let's first focus on the first 3 steps listed above. By using `page.click()` and
<TabItem value="Playwright" label="Playwright">

```js
- // Click the "I agree" button
+ // Click the "Accept all" button
await page.click('button:has-text("Accept all")');
```

</TabItem>
<TabItem value="Puppeteer" label="Puppeteer">

```js
- // Click the "I agree" button
+ // Click the "Accept all" button
await page.click('button + button');
```

@@ -53,15 +53,15 @@ await page.click('button + button');

With `page.click()`, Puppeteer and Playwright actually drag the mouse and click, allowing the bot to act more human-like. This is different from programmatically clicking with `Element.click()` in vanilla client-side JavaScript.

- Notice that in the Playwright example, we are using a different selector than in the Puppeteer example. This is because Playwright supports [many custom CSS selectors](https://playwright.dev/docs/other-locators#css-elements-matching-one-of-the-conditions), such as the **has-text** pseudo class. As a rule of thumb, using text selectors is much more preferable to using regular selectors, as they are much less likely to break. If Google makes the sibling above the **I agree** button a `<div>` element instead of a `<button>` element, our `button + button` selector will break. However, the button will always have the text **I agree**; therefore, `button:has-text("I agree")` is more reliable.
+ Notice that in the Playwright example, we are using a different selector than in the Puppeteer example. This is because Playwright supports [many custom CSS selectors](https://playwright.dev/docs/other-locators#css-elements-matching-one-of-the-conditions), such as the **has-text** pseudo class. As a rule of thumb, using text selectors is much more preferable to using regular selectors, as they are much less likely to break. If Google makes the sibling above the **Accept all** button a `<div>` element instead of a `<button>` element, our `button + button` selector will break. However, the button will always have the text **Accept all**; therefore, `button:has-text("Accept all")` is more reliable.

> If you're not already familiar with CSS selectors and how to find them, we recommend referring to [this lesson](../../scraping_basics_javascript/data_extraction/using_devtools.md) in the **Web scraping for beginners** course.
- Then, we can type some text into an input field with `page.type()`; passing a CSS selector as the first, and the string to input as the second parameter:
+ Then, we can type some text into an input field `<textarea>` with `page.type()`; passing a CSS selector as the first, and the string to input as the second parameter:

```js
// Type the query into the search box
- await page.type('input[title="Search"]', 'hello world');
+ await page.type('textarea[title]', 'hello world');
```

Finally, we can press a single key by accessing the `keyboard` property of `page` and calling the `press()` function on it:
@@ -85,11 +85,11 @@ const page = await browser.newPage();

await page.goto('https://www.google.com/');

- // Click the "I agree" button
+ // Click the "Accept all" button
await page.click('button:has-text("Accept all")');

// Type the query into the search box
- await page.type('textarea[title="Search"]', 'hello world');
+ await page.type('textarea[title]', 'hello world');

// Press enter
await page.keyboard.press('Enter');
@@ -110,11 +110,11 @@ const page = await browser.newPage();

await page.goto('https://www.google.com/');

- // Click the "I agree" button
+ // Click the "Accept all" button
await page.click('button + button');

// Type the query into the search box
- await page.type('textarea[title="Search"]', 'hello world');
+ await page.type('textarea[title]', 'hello world');

// Press enter
await page.keyboard.press('Enter');
@@ -146,7 +146,7 @@ await page.goto('https://www.google.com/');

await page.click('button:has-text("Accept all")');

- await page.type('textarea[title="Search"]', 'hello world');
+ await page.type('textarea[title]', 'hello world');

await page.keyboard.press('Enter');

@@ -172,7 +172,7 @@ await page.goto('https://www.google.com/');

await page.click('button + button');

- await page.type('textarea[title="Search"]', 'hello world');
+ await page.type('textarea[title]', 'hello world');

await page.keyboard.press('Enter');

@@ -63,10 +63,10 @@ const page = await browser.newPage();
await page.goto('https://google.com');

// Agree to the cookies policy
- await page.click('button:has-text("I agree")');
+ await page.click('button:has-text("Accept all")');

// Type the query and visit the results page
- await page.type('input[title="Search"]', 'hello world');
+ await page.type('textarea[title]', 'hello world');
await page.keyboard.press('Enter');

// Click on the first result
@@ -99,7 +99,7 @@ await page.goto('https://google.com');
await page.click('button + button');

// Type the query and visit the results page
- await page.type('input[title="Search"]', 'hello world');
+ await page.type('textarea[title]', 'hello world');
await page.keyboard.press('Enter');

// Wait for the first result to appear on the page,
@@ -39,7 +39,7 @@ await page.goto('https://www.google.com/');

await page.click('button + button');

- await page.type('input[title="Search"]', 'hello world');
+ await page.type('textarea[title]', 'hello world');
await page.keyboard.press('Enter');

// Wait for the element to be present on the page prior to clicking it
@@ -104,10 +104,10 @@ const page = await browser.newPage();
await page.goto('https://google.com');

// Agree to the cookies policy
- await page.click('button:has-text("I agree")');
+ await page.click('button:has-text("Accept all")');

// Type the query and visit the results page
- await page.type('input[title="Search"]', 'hello world');
+ await page.type('textarea[title]', 'hello world');
await page.keyboard.press('Enter');

// Click on the first result
Expand Down Expand Up @@ -139,7 +139,7 @@ await page.goto('https://google.com');
await page.click('button + button');

// Type the query and visit the results page
- await page.type('input[title="Search"]', 'hello world');
+ await page.type('textarea[title]', 'hello world');
await page.keyboard.press('Enter');

// Wait for the first result to appear on the page,
2 changes: 1 addition & 1 deletion sources/academy/webscraping/typescript/mini_project.md
@@ -366,7 +366,7 @@ async function scrape(input: UserInput) {
}
```

- Now, we can access `result[0].images` on the return value of `scrape` if **removeImages** was false without any compiler errors being thrown. But, if we switch **removeImages** to false, TypeScript will yell at us.
+ Now, we can access `result[0].images` on the return value of `scrape` if **removeImages** was false without any compiler errors being thrown. But, if we switch **removeImages** to true, TypeScript will yell at us.

![No more error](./images/no-more-error.png)
