Skip to content

Commit

Permalink
feat: kick off the third devtools lesson
Browse files Browse the repository at this point in the history
  • Loading branch information
honzajavorek committed Dec 6, 2024
1 parent c1a3dce commit 4bd5048
Show file tree
Hide file tree
Showing 5 changed files with 82 additions and 1 deletion.
Original file line number Diff line number Diff line change
Expand Up @@ -6,12 +6,93 @@ sidebar_position: 3
slug: /scraping-basics-python/devtools-extracting-data
---

import Exercises from './_exercises.mdx';

**In this lesson we'll use the browser tools for developers to manually extract product data from an e-commerce website.**

---

In our pursuit to scrape products from the [Sales page](https://warehouse-theme-metal.myshopify.com/collections/sales), we've been able to locate parent elements containing relevant data. Now how do we extract the data?

## Finding product details

Previously, we've figured out how to save the subwoofer product card to a variable in the **Console**:

```js
products = document.querySelectorAll('.product-item');
subwoofer = products[2];
```

The product details are within the element as text, so maybe if we extract the text, we could work out the individual values?

```js
subwoofer.textContent;
```

That indeed outputs all the text, but in a form which would be hard to break down to relevant pieces.

![Printing text content of the parent element](./images/devtools-extracting-text.png)

We'll need to first locate relevant child elements and extract the data from each of them individually.

## Extracting title

We'll use the **Elements** tab of DevTools to inspect all child elements of the product card for the Sony subwoofer. We can see that the title of the product is inside an `a` element with several classes. From those the `product-item__title` seems like a great choice to locate the element.

![Finding child elements](./images/devtools-product-details.png)

JavaScript represents HTML elements as [Element](https://developer.mozilla.org/en-US/docs/Web/API/Element) objects. Among properties we've already played with, such as `textContent` or `outerHTML`, it also has the [`querySelector()`](https://developer.mozilla.org/en-US/docs/Web/API/Element/querySelector) method. Here the method looks for matches only within children of the element:

```js
title = subwoofer.querySelector('.product-item__title');
title.textContent;
```

Notice we're calling `querySelector()` on the `subwoofer` variable, not `document`. And just like this, we've scraped our first piece of data! We've extracted the product title:

![Extracting product title](./images/devtools-extracting-title.png)

## Extracting price

To figure out how to get the price, we'll use the **Elements** tab of DevTools again. We notice there are two prices, a regular price and a sale price. For the purposes of watching prices we'll need the sale price. Both are `span` elements with the `price` class.

![Finding child elements](./images/devtools-product-details.png)

We could either rely on the fact that the sale price is likely to be always the one which is highlighted, or that it's always the first price. For now we'll rely on the former and we'll let `querySelector()` to simply return the first result:

```js
price = subwoofer.querySelector('.price');
price.textContent;
```

It works, but the price isn't alone in the result. Before we'd use such data, we'd need to do some **data cleaning**:

![Extracting product price](./images/devtools-extracting-price.png)

But for now that's okay. We're just testing the waters now, so that we have an idea about what our scraper will need to do. Once we'll get to extracting prices in Python, we'll figure out how to get numbers out of them.

## Extracting URL

:::danger Work in Progress

Under development.

:::

## Extracting all URLs

:::danger Work in Progress

Under development.

:::

---

<Exercises />

:::danger Work in Progress

This lesson is under development. Please read [Extracting data with DevTools](../scraping_basics_javascript/data_extraction/devtools_continued.md) in the meantime so you can follow the upcoming lessons.
Under development.

:::
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.

0 comments on commit 4bd5048

Please sign in to comment.