consolidated pages and added partitions page
C00ldudeNoonan committed Oct 21, 2024
1 parent 6f078db commit 8b6d1f6
Showing 7 changed files with 133 additions and 160 deletions.
67 changes: 42 additions & 25 deletions docs/docs-beta/docs/tutorial/01-etl-tutorial-introduction.md
@@ -20,6 +20,8 @@ If you haven't already, complete the [Quick Start](/getting-started/quickstart)
- Running a pipeline by materializing assets
- Adding schedules, sensors, and partitions to your assets

[Add image for what the completed global asset graph looks like]

## Step 1: Set up your Dagster environment

First, set up a new Dagster project.
@@ -42,46 +44,61 @@ First, set up a new Dagster project.
3. Install Dagster and the required dependencies:

```bash title="Install Dagster and dependencies"
pip install dagster dagster-webserver pandas
pip install dagster dagster-webserver pandas dagster-duckdb
```

## Step 2: Copying Data Files
## Step 2: Copy the project scaffold

Next we will get the raw data for the project.
Next, we will get the raw data for the project along with the project scaffold. Dagster has several pre-built scaffolds you can install depending on your use case. You can see the full, up-to-date list by running `dagster project list-examples`.

1. Create a new folder for the raw data:
Use the project scaffold command for this project.

```bash title="Create the data directory"
mkdir data
cd data
```bash title="ETL Project Scaffold"
dagster project from-example --example getting_started_etl_tutorial
```

2. Copy the raw csv files:
The project should have this structure.

```bash title="Copy the csv files"
curl -L -o products.csv https://raw.githubusercontent.com/dagster-io/dagster/refs/heads/master/examples/docs_beta_snippets/docs_beta_snippets/guides/tutorials/etl_tutorial/data/products.csv
```
dagster-etl-tutorial/
├── etl_tutorial/
│   └── definitions.py
├── data/
│   ├── products.csv
│   ├── sales_data.csv
│   ├── sales_reps.csv
│   └── sample_request/
│       └── request.json
├── pyproject.toml
├── setup.cfg
└── setup.py
```

curl -L -o sales_reps.csv https://raw.githubusercontent.com/dagster-io/dagster/refs/heads/master/examples/docs_beta_snippets/docs_beta_snippets/guides/tutorials/etl_tutorial/data/sales_reps.csv
## Dagster Project Structure

In the root directory there are three configuration files that are common in Python package management. These manage dependencies and identify the Dagster modules in the project. The `etl_tutorial` folder is where the Dagster definitions for this code location live. The `data` directory is where the raw data for the project is stored, and we will reference these files in our software-defined assets.


### File/Directory Descriptions

- **etl_tutorial/**: This is a Python module that contains your Dagster code. It is the main directory where you will define your assets, jobs, schedules, sensors, and resources.

- **definitions.py**: This file is typically used to define jobs, schedules, and sensors. It organizes the various components of your Dagster project. This allows Dagster to load the definitions in a module.

- **pyproject.toml**: This file is used to specify build system requirements and package metadata for Python projects. It is part of the Python packaging ecosystem.

- **setup.cfg**: This file is used for configuration of your Python package. It can include metadata about the package, dependencies, and other configuration options.

- **setup.py**: This script is used to build and distribute your Python package. It is a standard file in Python projects for specifying package details.

curl -L -o sales_data.csv https://raw.githubusercontent.com/dagster-io/dagster/refs/heads/master/examples/docs_beta_snippets/docs_beta_snippets/guides/tutorials/etl_tutorial/data/sales_data.csv
```
3. Copy Sample Request json file

```bash title="Create the sample request"
mkdir sample_request
cd sample_request
curl -L -o request.json https://raw.githubusercontent.com/dagster-io/dagster/refs/heads/master/examples/docs_beta_snippets/docs_beta_snippets/guides/tutorials/etl_tutorial/data/sample_request/request.json

# navigating back to the root directory
cd ../..
```
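
The layout described above can be sanity-checked with a short script. This is a minimal sketch, not part of the tutorial: the `missing_paths` helper and the `EXPECTED_PATHS` list are illustrative names, and the nesting of `sample_request/` under `data/` is assumed from the project tree.

```python
from pathlib import Path

# Paths taken from the project tree, relative to the project root.
# Assumption: sample_request/ sits inside data/.
EXPECTED_PATHS = [
    "etl_tutorial/definitions.py",
    "data/products.csv",
    "data/sales_data.csv",
    "data/sales_reps.csv",
    "data/sample_request/request.json",
    "pyproject.toml",
    "setup.cfg",
    "setup.py",
]


def missing_paths(project_root: str) -> list[str]:
    """Return the expected paths that do not exist under project_root."""
    root = Path(project_root)
    return [p for p in EXPECTED_PATHS if not (root / p).exists()]


if __name__ == "__main__":
    # Assumes the script is run from the project root directory.
    missing = missing_paths(".")
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("Project layout matches the tutorial scaffold.")
```

Running it from the project root prints either the missing paths or a confirmation message.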


## What you've learned

- Set up a Python virtual environment and installed Dagster
- Copied raw data for project
- Set up the project scaffold
- Learned how a Dagster project is structured and what its files do

## Next steps

- Continue this tutorial with [setting up your dagster project ](/tutorial/dagster-project-setup)
- Continue this tutorial with [your first asset](/tutorial/02-your-first-asset)
125 changes: 0 additions & 125 deletions docs/docs-beta/docs/tutorial/02-dagster-project-setup.md

This file was deleted.

@@ -10,17 +10,21 @@ last_update:

Now that we have the raw data files and the Dagster project set up, let's create some assets that load those CSV files into DuckDB.

Asset definitions enable a declarative approach to data management, in which code is the source of truth on what data assets should exist and how those assets are computed.

<iframe width="560" height="315" src="https://www.youtube.com/embed/In4CUoFKOfY?si=Xnk_CADS1pf7D5BA" title="YouTube video player" frameborder="0" allow="accelerometer; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>

## What you'll learn

- Creating our intial defintions object
- Creating our initial definitions object
- Adding a DuckDB resource
- Building some basic software-defined assets

## Building definitions object

The definitions object [need docs reference] in Dagster serves as the central configuration point for defining and organizing various componenets within a Dagster Project. It acts as a container that holds all the necessary configurations for a code location, ensuring that everything is organized and easily accessible.
The definitions object [need docs reference] in Dagster serves as the central configuration point for defining and organizing various components within a Dagster Project. It acts as a container that holds all the necessary configurations for a code location, ensuring that everything is organized and easily accessible.


1. Creating Definitions Object and duckdb resource
1. Creating the Definitions object and DuckDB resource

Open the definitions.py file and add the following import statements and definitions object.

@@ -62,7 +66,7 @@ Same thing for Sales Data

4. Bringing our assets into the Definitions object

Now to pull these assets into our definitions object simply add them to the empty list in the assets parameter.
Now to pull these assets into our definitions object, add them to the empty list in the assets parameter.

```python
defs = dg.Definitions(
@@ -74,13 +78,26 @@ Now to pull these assets into our definitions object simply add them to the empt
),
```

## Materialize Assets

Let's fire up Dagster and materialize these assets. If you are not in the project root directory, navigate there now.

Run the `dagster dev` command. Dagster should open up in your browser. Navigate to the Global asset lineage page. You should see this:

[screenshot of global asset lineage]

Click on `products` and then Materialize. Navigate to the jobs screen.

[screenshot of run]

Do the same for `sales_reps` and `sales_data`.

## What you've learned

- Created a Dagster Definition
- Built our ingestion assets



## Next steps

- Continue this tutorial with your [Asset Dependencies]
- Continue this tutorial with [Asset Dependencies](/tutorial/02-your-first-asset)
54 changes: 54 additions & 0 deletions docs/docs-beta/docs/tutorial/03-asset-dependencies-and-checks.md
@@ -0,0 +1,54 @@
---
title: Asset Dependencies and Checks
description: Reference Assets as dependencies to other assets and asset checks.

last_update:
  date: 2024-10-16
  author: Alex Noonan
---

# Asset Dependencies and Asset Checks

The DAG, or directed acyclic graph, is a key part of Dagster and an improvement over the typical cron workflow for orchestration. With a DAG approach, complex data pipelines become easier to understand. The key benefits of DAGs are:

1. Clarity: The DAG provides a clear visual representation of the entire workflow.
2. Efficiency: Parallel tasks can be identified and executed simultaneously.
3. Reliability: Dependencies ensure that tasks are executed in the correct order.
4. Scalability: Complex workflows can be managed effectively.
5. Maintenance: It's easier to update or troubleshoot specific parts of the workflow.
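
To make the ordering and parallelism points concrete, here is a small sketch using Python's standard-library `graphlib`. It is not how Dagster schedules work internally; it only illustrates what a DAG encodes. The asset names come from this tutorial, and the assumption that `joined_data` depends on all three raw tables follows the description of the downstream asset on this page.

```python
from graphlib import TopologicalSorter

# Each key is an asset; the set lists the upstream assets it depends on.
dependencies = {
    "products": set(),
    "sales_reps": set(),
    "sales_data": set(),
    "joined_data": {"products", "sales_reps", "sales_data"},
    "monthly_sales_performance": {"joined_data"},
}

sorter = TopologicalSorter(dependencies)
sorter.prepare()

# Collect "waves" of assets whose upstream dependencies are all complete.
# Assets in the same wave do not depend on each other and could run in parallel.
waves = []
while sorter.is_active():
    ready = sorted(sorter.get_ready())
    waves.append(ready)
    sorter.done(*ready)

for number, wave in enumerate(waves, start=1):
    print(f"wave {number}: {', '.join(wave)}")
```

The three raw tables land in the first wave, `joined_data` in the second, and `monthly_sales_performance` in the third.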

## What you'll learn

- Creating downstream assets
- How to make an [asset check](guides/asset-checks.md)

## Creating a downstream asset

Now that all of our raw data is loaded and staged in the DuckDB database, our next step is to merge it together. The data is structured as a fact table (sales data) with two dimensions (sales reps and products). To accomplish this in SQL, we start from the `sales_data` table and left join sales reps and products on their respective ID columns. We also keep this view concise by selecting only the columns relevant for analysis.

<CodeExample filePath="guides/tutorials/etl_tutorial/etl_tutorial/definitions.py" language="python" lineStart="89" lineEnd="132"/>

As you can see, this asset looks a lot like our previous ones, with a few small changes. We put this asset into a different group. To make this asset dependent on the raw tables, we add their asset keys to the `deps` parameter in the asset definition.
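
The fact-and-dimension join can be tried outside Dagster. The sketch below uses Python's built-in `sqlite3` as a stand-in for DuckDB so that it runs without extra dependencies. The column names and sample rows are hypothetical and are not taken from the tutorial's CSV files.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical schemas: two dimension tables and one fact table.
    CREATE TABLE products (product_id INTEGER, product_name TEXT);
    CREATE TABLE sales_reps (rep_id INTEGER, rep_name TEXT);
    CREATE TABLE sales_data (
        order_id INTEGER, product_id INTEGER, rep_id INTEGER, dollar_amount REAL
    );

    INSERT INTO products VALUES (1, 'Widget'), (2, 'Gadget');
    INSERT INTO sales_reps VALUES (10, 'Ana'), (11, 'Ben');
    INSERT INTO sales_data VALUES
        (100, 1, 10, 25.0), (101, 2, 11, 40.0), (102, 3, 10, 15.0);

    -- Start from the fact table and left join each dimension on its ID column,
    -- keeping only the columns needed for analysis.
    CREATE VIEW joined_data AS
    SELECT s.order_id, p.product_name, r.rep_name, s.dollar_amount
    FROM sales_data AS s
    LEFT JOIN products AS p ON s.product_id = p.product_id
    LEFT JOIN sales_reps AS r ON s.rep_id = r.rep_id;
    """
)

rows = conn.execute("SELECT * FROM joined_data ORDER BY order_id").fetchall()
for row in rows:
    print(row)
```

Order 102 references a product that is not in `products`, so the left join keeps the row and leaves `product_name` empty (`None` in Python).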

## Asset checks

Data quality is critical in analytics. Just as car manufacturers inspect parts after each production step to catch defects and to spot processes that produce more defects than is acceptable, we want to inspect our data after it is produced. In this case, we want to create a check that identifies any rows with a product or sales rep that is missing from the table.

<CodeExample filePath="guides/tutorials/etl_tutorial/etl_tutorial/definitions.py" language="python" lineStart="134" lineEnd="149"/>
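
The logic behind such a check can be sketched without Dagster. The example below uses the standard-library `sqlite3` module as a stand-in for DuckDB, with a hypothetical `joined_data` table and hypothetical column names. In the tutorial, the result is reported through a `dg.AssetCheckResult` rather than printed.

```python
import sqlite3


def count_missing_dimensions(conn: sqlite3.Connection) -> int:
    """Count joined rows where either dimension lookup came back empty."""
    query = """
        SELECT COUNT(*)
        FROM joined_data
        WHERE product_name IS NULL OR rep_name IS NULL
    """
    return conn.execute(query).fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical joined table with two rows that lack a dimension value.
    CREATE TABLE joined_data (order_id INTEGER, product_name TEXT, rep_name TEXT);
    INSERT INTO joined_data VALUES
        (100, 'Widget', 'Ana'), (101, NULL, 'Ben'), (102, 'Gadget', NULL);
    """
)

missing = count_missing_dimensions(conn)
passed = missing == 0  # the check passes only when no dimension is missing
print(f"passed={passed}, rows with a missing dimension={missing}")
```

Counting the offending rows, instead of only returning a pass or fail flag, makes it easier to see how far off the data is when the check fails.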



## Materialize the asset

Go back into the UI, refresh definitions, and materialize this asset.

[Screenshot of the Asset details page and asset check]

## What you've learned

- Creating downstream assets
- Software-defined asset checks


## Next steps

- Continue this tutorial with [Partitions](/tutorial/02-your-first-asset)
10 changes: 10 additions & 0 deletions docs/docs-beta/docs/tutorial/04-partitions.md
@@ -0,0 +1,10 @@
---
title: Partitions
description: Partitioning Assets by datetime and categories
last_update:
  date: 2024-10-16
  author: Alex Noonan
---



6 changes: 3 additions & 3 deletions docs/docs-beta/sidebars.ts
@@ -12,9 +12,9 @@ const sidebars: SidebarsConfig = {
label: 'Tutorial',
collapsed: false,
items: [
'tutorial/01-etl-tutorial-introduction',
'tutorial/02-dagster-project-setup',
'tutorial/03-your-first-asset',
'tutorial/etl-tutorial-introduction',
'tutorial/your-first-asset',
'tutorial/asset-dependencies-and-checks',
],
},
{
@@ -158,7 +158,7 @@ def missing_dimension_check(duckdb: DuckDBResource) -> dg.AssetCheckResult:
    compute_kind="duckdb",
    group_name="analysis",
    deps=[joined_data],
    auto_materialize_policy=dg.AutoMaterializePolicy.eager(),
    auto_materialize_policy=dg.AutoMaterializePolicy.eager(),  # need to adjust to declarative automation
)
def monthly_sales_performance(
    context: dg.AssetExecutionContext, duckdb: DuckDBResource