fix: forgot to resolve some merge conflicts
honzajavorek committed Sep 10, 2024
1 parent c4024d0 commit d7894d6
Showing 5 changed files with 4 additions and 70 deletions.
4 changes: 0 additions & 4 deletions sources/academy/platform/get_most_of_actors/actor_readme.md
@@ -16,11 +16,7 @@ slug: /get-most-of-actors/actor-readme
- Whenever you build an Actor, think of the original request/idea and the "use case" = "user need" it should solve. Take notes and share them with Apify, so we can help you write a blog post supporting your Actor with more information, a more detailed explanation, and better SEO.
- Consider adding a video, images, and screenshots to your README to break up the text.
- This is an example of an Actor with a README that corresponds well to the guidelines below:
  - [apify.com/tri_angle/airbnb-scraper](https://apify.com/tri_angle/airbnb-scraper)
- Tip no.1: if you want to add snippets of code anywhere in your README, you can use [Carbon](https://github.com/carbon-app/carbon).
- Tip no.2: if you need any quick Markdown guidance, check out https://www.markdownguide.org/cheat-sheet/

@@ -54,11 +54,7 @@ const crawler = new PuppeteerCrawler({
});
```

It is up to the developer to spot when something is wrong with a request. A website can interfere with your crawling in [many ways](https://docs.apify.com/academy/anti-scraping). Page loading can be cancelled right away, it can time out, or the page can display a captcha, an error or warning message, or missing or corrupted data. The developer can then choose whether to handle these problems in code or focus on receiving proper data. Either way, if the request went wrong, you should throw a proper error.

Now that we know when a request is blocked, we can use the retire() function and continue crawling with a new proxy. Google is one of the most popular websites for scrapers, so let's code a Google search crawler. The two main blocking mechanisms Google uses are displaying its (in)famous 'sorry' captcha or not loading the page at all, so we will focus on covering these.
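As a sketch of that detection logic (the helper name and exact signals below are illustrative, not from this lesson): blocked Google responses can typically be recognized by the redirect to the 'sorry' captcha page or by rate-limiting status codes, and that check is what should trigger the retirement:

```javascript
// Hypothetical helper (illustrative, not from the lesson): decide whether
// a Google response looks blocked, based on the 'sorry' captcha redirect
// or typical rate-limiting status codes.
function isBlockedResponse(statusCode, url) {
    return url.includes('/sorry/') || statusCode === 429 || statusCode === 503;
}

// A captcha redirect counts as blocked even with a 200 status:
console.log(isBlockedResponse(200, 'https://www.google.com/sorry/index')); // true
console.log(isBlockedResponse(200, 'https://www.google.com/search?q=beer')); // false
```

In a crawler, a `true` result would be the point to retire the session and throw, so the request is retried through a different proxy.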

6 changes: 1 addition & 5 deletions sources/academy/tutorials/node_js/scraping_from_sitemaps.md
@@ -13,11 +13,7 @@ import Example from '!!raw-loader!roa-loader!./scraping_from_sitemaps.js';

---

Let's say we want to scrape a database of craft beers ([brewbound.com](https://www.brewbound.com/)) before summer starts. If we are lucky, the website will contain a sitemap at [brewbound.com/sitemap.xml](https://www.brewbound.com/sitemap.xml).

> Check out [Sitemap Sniffer](https://apify.com/vaclavrut/sitemap-sniffer), which can discover sitemaps in hidden locations!
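Once the sitemap is downloaded, pulling the page URLs out of it is a small step. A minimal sketch (the helper name is ours, and a real sitemap may also be an index pointing to further sitemaps):

```javascript
// Extract <loc> URLs from a sitemap XML string with a simple regex.
// Minimal sketch; a production crawler would also handle <sitemapindex>.
function parseSitemapLocs(xml) {
    return [...xml.matchAll(/<loc>\s*([^<\s]+)\s*<\/loc>/g)].map((match) => match[1]);
}

const xml = `<?xml version="1.0" encoding="UTF-8"?>
<urlset>
  <url><loc>https://www.brewbound.com/breweries/example-brewery</loc></url>
  <url><loc>https://www.brewbound.com/breweries/another-brewery</loc></url>
</urlset>`;

console.log(parseSitemapLocs(xml));
```

The extracted URLs can then be fed straight into a request queue.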
@@ -148,11 +148,7 @@ Now that we have this visualization to work off of, it will be much easier to bu

In future lessons, we'll be building more complex queries using **dynamic variables** and advanced features such as **fragments**; however, for now let's get our feet wet by using the data we have from GraphQL Voyager to build a query.

Right now, our goal is to fetch the 1000 most recent articles on [Cheddar](https://www.cheddar.com). From each article, we'd like to fetch the **title** and the **publish date**. After a bit of digging through the schema, we've come across the **media** field within the **organization** type, which has both **title** and **public_at** fields - seems to check out!

![The media field pointing to datatype slugable](./images/media-field.jpg)
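Based on those fields, a first draft of the query might look like the sketch below. The exact argument and connection shape is an assumption on our part (the schema details come from GraphQL Voyager, not from this excerpt):

```javascript
// Draft query assembled as a plain string. The field names (organization,
// media, title, public_at) follow the schema explored above, but the
// pagination argument shape is an assumption.
const articlesQuery = `
query {
  organization {
    media(first: 1000) {
      title
      public_at
    }
  }
}`;

console.log(articlesQuery.trim());
```

In later lessons this string would be sent as the `query` property of a JSON POST body to the GraphQL endpoint.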

@@ -25,7 +25,7 @@ This article focuses on GitHub, but [we also have a guide for Bitbucket](https:/
To set up automated builds and tests for your Actors you need to:

1. Create a GitHub repository for your Actor code.
1. Get your Apify API token from the [Apify Console](https://console.apify.com/settings/integrations)

![Apify token in app](./images/ci-token.png)

@@ -75,11 +75,7 @@ To set up automated builds and tests for your Actors you need to:
</TabItem>
<TabItem value="beta.yml" label="beta.yml">
```yaml
name: Test and build beta version
```

@@ -107,59 +103,13 @@ To set up automated builds and tests for your Actors you need to:
</TabItem>
</Tabs>
## GitHub integration
To set up automatic builds from GitHub:
1. Go to your Actor's detail page and copy the Build Actor API endpoint URL from the API tab.
1. In your GitHub repository, go to Settings > Webhooks > Add webhook.
1. Paste the API URL into the Payload URL field.
![GitHub integration](./images/ci-github-integration.png)
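The Payload URL pasted into the webhook is the Build Actor endpoint with the token and version passed as query parameters. A sketch of how it is composed (the helper function is ours; the URL shape follows the Apify API v2):

```javascript
// Compose the Build Actor API endpoint URL used as a webhook Payload URL.
// Helper is illustrative; query parameter names follow the Apify API v2.
function buildActorUrl(actorId, token, version) {
    return `https://api.apify.com/v2/acts/${actorId}/builds?token=${token}&version=${version}`;
}

console.log(buildActorUrl('apify~hello-world', '<API_TOKEN>', '0.1'));
// https://api.apify.com/v2/acts/apify~hello-world/builds?token=<API_TOKEN>&version=0.1
```

With this in place, every push to the repository triggers a fresh build of the Actor.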
