chore: throw on build with broken anchors (#1190)
Docusaurus v3 adds the
[`onBrokenAnchors`](https://docusaurus.io/docs/api/docusaurus-config#onBrokenAnchors)
setting that allows us to fail the build when Docusaurus finds broken
anchor links (internal fragment links).
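
For reference, a minimal sketch of the relevant reporting-severity options in `docusaurus.config.js`; the `onBrokenLinks` key is an assumption based on the surrounding config, and the exact change is in the diff below:

```js
// docusaurus.config.js (sketch) — severity for broken references at build time.
module.exports = {
    onBrokenLinks: 'throw', // assumed to sit alongside the keys below
    onBrokenMarkdownLinks: 'throw',
    // New in Docusaurus v3; introduced by this commit as 'warn' in the diff below.
    onBrokenAnchors: 'warn',
    // ...rest of the config
};
```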

Closes #952

---------

Co-authored-by: Michał Olender <[email protected]>
barjin and TC-MO authored Oct 8, 2024
1 parent eb7a498 commit ec5b323
Showing 29 changed files with 134 additions and 138 deletions.
2 changes: 2 additions & 0 deletions docusaurus.config.js
@@ -51,6 +51,8 @@ module.exports = {
     /** @type {import('@docusaurus/types').ReportingSeverity} */ ('throw'),
   onBrokenMarkdownLinks:
     /** @type {import('@docusaurus/types').ReportingSeverity} */ ('throw'),
+  onBrokenAnchors:
+    /** @type {import('@docusaurus/types').ReportingSeverity} */ ('warn'),
   themes: [
     [
       require.resolve('./apify-docs-theme'),
@@ -53,7 +53,7 @@ Each property's key corresponds to the name we're expecting within our code, whi

## Property types & editor types {#property-types}

-Within our new **numbers** property, there are two more fields we must specify. Firstly, we must let the platform know that we're expecting an array of numbers with the **type** field. Then, we should also instruct Apify on which UI component to render for this input property. In our case, we have an array of numbers, which means we should use the **json** editor type that we discovered in the ["array" section](/platform/actors/development/actor-definition/input-schema#array) of the input schema documentation. We could also use **stringList**, but then we'd have to parse out the numbers from the strings.
+Within our new **numbers** property, there are two more fields we must specify. Firstly, we must let the platform know that we're expecting an array of numbers with the **type** field. Then, we should also instruct Apify on which UI component to render for this input property. In our case, we have an array of numbers, which means we should use the **json** editor type that we discovered in the ["array" section](/platform/actors/development/actor-definition/input-schema/specification/v1#array) of the input schema documentation. We could also use **stringList**, but then we'd have to parse out the numbers from the strings.

```json
{
```
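
The diff view cuts the snippet off after the opening brace; a plausible completion, where the **title** and **description** values are illustrative assumptions and only **type** and **editor** are grounded in the paragraph above:

```json
{
    "title": "Numbers",
    "description": "An array of numbers to be processed.",
    "type": "array",
    "editor": "json"
}
```
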
@@ -20,7 +20,7 @@ You might have already noticed that we've been using the **RESIDENTIAL** proxy g
## Learning 🧠 {#learning}

- Skim [this page](https://apify.com/proxy) for a general idea of Apify Proxy.
-- Give the [proxy documentation](/platform/proxy#our-proxies) a solid readover (feel free to skip most of the examples).
+- Give the [proxy documentation](/platform/proxy) a solid readover (feel free to skip most of the examples).
- Check out the [anti-scraping guide](../../webscraping/anti_scraping/index.md).
- Gain a solid understanding of the [SessionPool](https://crawlee.dev/api/core/class/SessionPool).
- Look at a few Actors on the [Apify store](https://apify.com/store). How are they utilizing proxies?
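
To make the **RESIDENTIAL** group mentioned above concrete, here is a minimal sketch of plugging an Apify proxy group into a Crawlee crawler with a session pool; the target URL is a placeholder:

```js
import { Actor } from 'apify';
import { CheerioCrawler } from 'crawlee';

await Actor.init();

// Route requests through the RESIDENTIAL proxy group from the lesson.
const proxyConfiguration = await Actor.createProxyConfiguration({
    groups: ['RESIDENTIAL'],
});

const crawler = new CheerioCrawler({
    proxyConfiguration,
    useSessionPool: true, // rotate sessions (and thus proxy IPs) automatically
    async requestHandler({ request, $ }) {
        console.log(`${request.url}: ${$('title').text()}`);
    },
});

await crawler.run(['https://example.com']);
await Actor.exit();
```
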
@@ -231,7 +231,7 @@ That's everything! Now, even if the Actor migrates (or is gracefully aborted and

**A:** It's best not to use this option by default. If the run fails, there must be a reason, which needs to be thought through first, meaning that the edge case of failing should be handled when resurrecting the Actor. The state should be persisted beforehand.

-**Q: Migrations happen randomly, but by [aborting gracefully](/platform/actors/running#aborting-runs), you can simulate a similar situation. Try this out on the platform and observe what happens. What changes occur, and what remains the same for the restarted Actor's run?**
+**Q: Migrations happen randomly, but by [aborting gracefully](/platform/actors/running/runs-and-builds#aborting-runs), you can simulate a similar situation. Try this out on the platform and observe what happens. What changes occur, and what remains the same for the restarted Actor's run?**

**A:** After aborting or throwing an error mid-process, it manages to start back from where it was upon resurrection.
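
A sketch of the persistence pattern these answers describe, using the Apify SDK's platform events; the state shape and the `STATE` key are illustrative assumptions:

```js
import { Actor } from 'apify';

await Actor.init();

// On a resurrected or migrated run, pick up the previously persisted state.
const state = (await Actor.getValue('STATE')) ?? { processedUrls: [] };

// Persist on the platform's periodic event and right before a migration.
Actor.on('persistState', async () => { await Actor.setValue('STATE', state); });
Actor.on('migrating', async () => { await Actor.setValue('STATE', state); });

// ... scraping logic that updates `state.processedUrls` as it goes ...

await Actor.exit();
```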

@@ -24,7 +24,7 @@ Storage allows us to save persistent data for further processing. As you'll lear
## Learning 🧠 {#learning}

- Check out [the docs about Actor tasks](/platform/actors/running/tasks).
-- Read about the [two main storage options](/platform/storage#dataset) on the Apify platform.
+- Read about the [two main storage options](/platform/storage/dataset) on the Apify platform.
- Understand the [crucial differences between named and unnamed storages](/platform/storage/usage#named-and-unnamed-storages).
- Learn about the [`Dataset`](/sdk/js/reference/class/Dataset) and [`KeyValueStore`](/sdk/js/reference/class/KeyValueStore) objects in the Apify SDK.
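
A quick sketch of the two storage types from the list above in action; the values pushed are placeholders:

```js
import { Actor } from 'apify';

await Actor.init();

// Dataset: append-only storage, one JSON object per result.
await Actor.pushData({ url: 'https://example.com', title: 'Example' });

// Key-value store: arbitrary records stored under a key (input, state, files).
const input = await Actor.getInput();
console.log('Received input:', input);
await Actor.setValue('OUTPUT', { itemCount: 1 });

await Actor.exit();
```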

4 changes: 2 additions & 2 deletions sources/academy/platform/getting_started/inputs_outputs.md
@@ -65,7 +65,7 @@ Then, replace everything in **INPUT_SCHEMA.json** with this:
}
```

-> If you're interested in learning more about how the code works, and what the **INPUT_SCHEMA.json** means, read about [inputs](/sdk/js/docs/examples/accept-user-input) and [adding data to a dataset](/sdk/js/docs/examples/add-data-to-dataset) in the Apify SDK documentation, and refer to the [input schema docs](/platform/actors/development/actor-definition/input-schema#integer).
+> If you're interested in learning more about how the code works, and what the **INPUT_SCHEMA.json** means, read about [inputs](/sdk/js/docs/examples/accept-user-input) and [adding data to a dataset](/sdk/js/docs/examples/add-data-to-dataset) in the Apify SDK documentation, and refer to the [input schema docs](/platform/actors/development/actor-definition/input-schema/specification/v1#integer).
Finally, **Save** and **Build** the Actor just as you did in the previous lesson.

@@ -89,7 +89,7 @@ On the results tab, there are a whole lot of options for which format to view/do

There's our solution! Did it work for you as well? Now, we can download the data right from the results tab to be used elsewhere, or even programmatically retrieve it by using [Apify's API](/api/v2) (we'll be discussing how to do this in the next lesson).

-It's important to note that the default dataset of the Actor, which we pushed our solution to, will be retained for 7 days. If we wanted the data to be retained for an indefinite period of time, we'd have to use a named dataset. For more information about named storages vs unnamed storages, read a bit about [data retention on the Apify platform](/platform/storage#data-retention).
+It's important to note that the default dataset of the Actor, which we pushed our solution to, will be retained for 7 days. If we wanted the data to be retained for an indefinite period of time, we'd have to use a named dataset. For more information about named storages vs unnamed storages, read a bit about [data retention on the Apify platform](/platform/storage/usage#data-retention).
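
For example, pushing to a named dataset keeps the data past the retention window of unnamed storages; the dataset name below is a placeholder:

```js
import { Actor } from 'apify';

await Actor.init();

// Named datasets are retained indefinitely, unlike the default (unnamed) one.
const dataset = await Actor.openDataset('my-permanent-results');
await dataset.pushData({ solution: 42 });

await Actor.exit();
```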

## Next up {#next}

@@ -28,7 +28,7 @@ If the Actor being run via API takes 5 minutes or less to complete a typical run

> If you are unsure about the differences between an Actor and a task, you can read about them in the [tasks](/platform/actors/running/tasks) documentation. In brief, tasks are pre-configured inputs for Actors.
-The API endpoints and usage (for both sync and async) for [Actors](/api/v2#/reference/actors/run-collection/run-actor) and [tasks](/api/v2#/reference/actor-tasks/run-collection/run-task) are essentially the same.
+The API endpoints and usage (for both sync and async) for [Actors](/api/v2#tag/ActorsRun-collection/operation/act_runs_post) and [tasks](/api/v2#/reference/actor-tasks/run-collection/run-task) are essentially the same.
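
A sketch of starting a run through that endpoint from Node.js; `ACTOR_ID`, the token variable, and the input body are placeholders:

```js
// POST to the "run Actor" endpoint; it returns immediately with the run object.
const response = await fetch(
    `https://api.apify.com/v2/acts/ACTOR_ID/runs?token=${process.env.APIFY_TOKEN}`,
    {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ someInput: 'value' }),
    },
);
const { data: run } = await response.json();
console.log(`Run ${run.id} started with status ${run.status}`);
```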

To run, or **call**, an Actor/task, you will need a few things:

36 changes: 18 additions & 18 deletions sources/academy/tutorials/apify_scrapers/cheerio_scraper.md
@@ -17,7 +17,7 @@ tutorial, great! You are ready to continue where we left off. If you haven't see
check it out, it will help you learn about Apify and scraping in general and set you up for this tutorial,
because this one builds on topics and code examples discussed there.

-## [](#getting-to-know-our-tools) Getting to know our tools
+## Getting to know our tools

In the [Getting started with Apify scrapers](/academy/apify-scrapers/getting-started) tutorial, we've confirmed that the scraper works as expected,
so now it's time to add more data to the results.
@@ -36,7 +36,7 @@ Now that's out of the way, let's open one of the Actor detail pages in the Store
> If you're wondering why we're using Web Scraper as an example instead of Cheerio Scraper,
it's only because we didn't want to triple the number of screenshots we needed to make. Lazy developers!

-## [](#building-our-page-function) Building our Page function
+## Building our Page function

Before we start, let's do a quick recap of the data we chose to scrape:

Expand All @@ -52,7 +52,7 @@ Before we start, let's do a quick recap of the data we chose to scrape:
We've already scraped numbers 1 and 2 in the [Getting started with Apify scrapers](/academy/apify-scrapers/getting-started)
tutorial, so let's get to the next one on the list: title.

-### [](#title) Title
+### Title

![$1](https://raw.githubusercontent.com/apify/actor-scraper/master/docs/img/title.webp)

@@ -79,7 +79,7 @@ async function pageFunction(context) {
}
```
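
The diff shows only the tail of this snippet; a sketch of the whole step, with the selector assumed from the description above (the title lives in an `<h1>` inside `<header>`):

```js
async function pageFunction(context) {
    const { $ } = context;

    return {
        // Assumed selector: the <h1> nested in the page's <header>.
        title: $('header h1').text(),
    };
}
```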

-### [](#description) Description
+### Description

Getting the Actor's description is a little more involved, but still pretty straightforward. We cannot search for a `<p>` tag, because there are a lot of them on the page. We need to narrow our search down a little. Using the DevTools, we find that the Actor description is nested within
the `<header>` element too, same as the title. Moreover, the actual description is nested inside a `<span>` tag with a class `actor-description`.
@@ -97,7 +97,7 @@ async function pageFunction(context) {
}
```
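
Again, the diff truncates the snippet; a sketch based on the selectors named above:

```js
async function pageFunction(context) {
    const { $ } = context;

    return {
        title: $('header h1').text(),
        // The description sits in <span class="actor-description"> inside <header>.
        description: $('header span.actor-description').text(),
    };
}
```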

-### [](#modified-date) Modified date
+### Modified date

The DevTools tell us that the `modifiedDate` can be found in a `<time>` element.

@@ -125,7 +125,7 @@ But we would much rather see a readable date in our results, not a unix timestam
constructor will not accept a `string`, so we cast the `string` to a `number` using the `Number()` function before actually calling `new Date()`.
Phew!
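
Putting the `<time>` element and the `Number()` cast together, a sketch of this step inside the same `pageFunction`; the `datetime` attribute name is an assumption:

```js
// The timestamp arrives as a string, e.g. '1541066744758'.
const timestampString = $('time').attr('datetime');
// Cast to a number first, because new Date() will not accept a string here.
const modifiedDate = new Date(Number(timestampString));
```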

-### [](#run-count) Run count
+### Run count

And so we're finishing up with the `runCount`. There's no specific element like `<time>`, so we need to create
a complex selector and then do a transformation on the result.
@@ -164,7 +164,7 @@ using a regular expression, but its type is still a `string`, so we finally conv
>
> This will give us a string (e.g. `'1234567'`) that can be converted via the `Number` function.
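
A sketch of that transformation; the selector is an assumption, while the regex-and-`Number` step follows the note above:

```js
// Assumed selector for the stats list item that holds the run count.
const runCountText = $('ul.ActorHeader-stats li:nth-of-type(3)').text();
// Keep only digits and commas, drop the commas, then convert to a number.
const runCount = Number(runCountText.match(/[\d,]+/)[0].replace(/,/g, ''));
```
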
-### [](#wrapping-it-up) Wrapping it up
+### Wrapping it up

And there we have it! All the data we needed in a single object. For the sake of completeness, let's add
the properties we parsed from the URL earlier and we're good to go.
@@ -242,13 +242,13 @@ async function pageFunction(context) {
}
```

-### [](#test-run) Test run
+### Test run

As always, try hitting that **Save & Run** button and visit
the **Dataset** preview of clean items. You should see a nice table of all the attributes correctly scraped.
You nailed it!

-## [](#pagination) Pagination
+## Pagination

Pagination is a term that represents "going to the next page of results". You may have noticed that we did not
actually scrape all the Actors, just the first page of results. That's because to load the rest of the Actors,
@@ -264,7 +264,7 @@ with Cheerio? We don't have a browser to do it and we only have the HTML of the
answer is that we can't click a button. Does that mean that we cannot get the data at all? Usually not,
but it requires some clever DevTools-Fu.

-### [](#analyzing-the-page) Analyzing the page
+### Analyzing the page

While with Web Scraper and **Puppeteer Scraper** ([apify/puppeteer-scraper](https://apify.com/apify/puppeteer-scraper)), we could get away with clicking a button,
with Cheerio Scraper we need to dig a little deeper into the page's architecture. For this, we will use
@@ -280,7 +280,7 @@ Then we click the **Show more** button and wait for incoming requests to appear
Now, this is interesting. It seems that we've only received two images after clicking the button and no additional
data. This means that the data about Actors must already be available in the page and the **Show more** button only displays it. This is good news.

-### [](#finding-the-actors) Finding the Actors
+### Finding the Actors

Now that we know the information we seek is already in the page, we just need to find it. The first Actor in the store
is Web Scraper, so let's try using the search tool in the **Elements** tab to find some reference to it. The first
@@ -309,7 +309,7 @@ so you might already be wondering, can I make one request to the store to get th
and then parse it out and be done with it in a single request? Yes you can! And that's the power
of clever page analysis.

-### [](#using-the-data-to-enqueue-all-actor-details) Using the data to enqueue all Actor details
+### Using the data to enqueue all Actor details

We don't really need to go to all the Actor details now, but for the sake of practice, let's imagine we only found
Actor names such as `cheerio-scraper` and their owners, such as `apify` in the data. We will use this information
@@ -342,7 +342,7 @@ how to route those requests.
> If you're wondering how we know the structure of the URL, see the [Getting started
with Apify Scrapers](./getting_started.md) tutorial again.
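
A sketch of that enqueueing plan, assuming the scrapers' `context.enqueueRequest` helper; the `actors` array stands in for the owner/name pairs found in the page data, and `userData.label` is a routing convention:

```js
// Illustrative stand-in for the owner/name pairs extracted from the page.
const actors = [{ owner: 'apify', name: 'cheerio-scraper' }];

for (const { owner, name } of actors) {
    await context.enqueueRequest({
        url: `https://apify.com/${owner}/${name}`,
        userData: { label: 'DETAIL' }, // lets pageFunction route these requests
    });
}
```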

-### [](#plugging-it-into-the-page-function) Plugging it into the Page function
+### Plugging it into the Page function

We've got the general algorithm ready, so all that's left is to integrate it into our earlier `pageFunction`.
Remember the `// Do some stuff later` comment? Let's replace it.
@@ -411,13 +411,13 @@ to get all results with Cheerio only and other times it takes hours of research.
the right scraper for your job. But don't get discouraged. Oftentimes, the only thing you will ever need is to
define a correct Pseudo URL. Do your research first before giving up on Cheerio Scraper.

-## [](#downloading-our-scraped-data) Downloading the scraped data
+## Downloading the scraped data

You already know the **Dataset** tab of the run console since this is where we've always previewed our data. Notice the row of data formats such as JSON, CSV, and Excel. Below it are options for viewing and downloading the data. Go ahead and try it.

> If you prefer working with an API, you can find the example endpoint under the API tab: **Get dataset items**.
-### [](#clean-items) Clean items
+### Clean items

You can view and download your data without modifications, or you can choose to only get **clean** items. Data that aren't cleaned include a record
for each `pageFunction` invocation, even if you did not return any results. The record also includes hidden fields
Expand All @@ -427,7 +427,7 @@ Clean items, on the other hand, include only the data you returned from the `pag

To control this, open the **Advanced options** view on the **Dataset** tab.

-## [](#bonus-making-your-code-neater) Bonus: Making your code neater
+## Bonus: Making your code neater

You may have noticed that the `pageFunction` gets quite bulky. To make better sense of your code and have an easier
time maintaining or extending your task, feel free to define other functions inside the `pageFunction`
@@ -495,11 +495,11 @@ async function pageFunction(context) {
> If you're confused by the functions being declared below their executions, it's called hoisting and it's a feature
of JavaScript. It helps you put what matters on top, if you so desire.

-## [](#final-word) Final word
+## Final word

Thank you for reading this whole tutorial! Really! It's important to us that our users have the best information available to them so that they can use Apify easily and effectively. We're glad that you made it all the way here, and congratulations on creating your first scraping task. We hope that you liked the tutorial, and if there's anything you'd like to ask, [join us on Discord](https://discord.gg/jyEM2PRvMU)!

-## [](#whats-next) What's next
+## What's next

* Check out the [Apify SDK](https://docs.apify.com/sdk) and its [Getting started](https://docs.apify.com/sdk/js/docs/guides/apify-platform) tutorial if you'd like to try building your own Actors. It's a bit more complex and involved than writing a `pageFunction`, but it allows you to fine-tune all the details of your scraper to your liking.
* [Take a deep dive into Actors](/platform/actors), from how they work to [publishing](/platform/actors/publishing) them in Apify Store, and even [making money](https://blog.apify.com/make-regular-passive-income-developing-web-automation-actors-b0392278d085/) on Actors.
