Replies: 4 comments 8 replies
-
Another twist to think about as a future direction would be when a partition represents an aggregation of other assets: "yearly partition A of asset X depends on monthly partitions 1-12 of asset Y".
-
@sryza, could you share a bit more on how the API for this might work?
-
Easily defining a partitioned asset and then depending on the combined partition output somehow would be amazing! The docs in this area need to be better; so far I've had to read the HN example over and over and do a lot of guessing to make something work :(
-
With the upcoming 0.14 release of Dagster, will partitioned (software-defined) assets be part of the API?
-
Edit: the examples in this discussion are now out-of-date. Please refer to the docs on software-defined assets instead: https://docs.dagster.io/concepts/assets/software-defined-assets
Dagster 0.12.12 introduced experimental "software-defined asset" APIs: `@asset` and `build_assets_job`. These APIs sit on top of the new graph/job/op APIs and enable a novel way of constructing Dagster jobs that puts assets at the forefront.

As a reminder, to Dagster, an "asset" is a data product: an object produced by a data pipeline, e.g. a table, ML model, or report.
Conceptually, software-defined assets invert the typical relationship between assets and computation. Instead of defining a graph of ops and recording which assets those ops end up materializing, you define a set of assets, each of which knows how to compute its contents from upstream assets.
Taking a software-defined asset approach has a few main benefits. For one, because each asset declares the upstream assets it depends on, there's no need to use `@graph`/`@pipeline` to wire up dependencies between your ops.

Defining an asset
A software-defined asset combines an asset key, an op that knows how to compute the asset's contents, and a set of upstream assets whose contents are provided as inputs to that op.
Here's an example of a pair of assets defined using the `@asset` decorator. Zooming in on the "events" asset: the decorated function computes the asset's contents, and the function name supplies the asset key.
The asset APIs work most elegantly when you're able to separate IO from compute using IOManagers. The IOManager handles reading and writing the inputs and outputs to persistent storage, while the body of the asset's function handles the logical data transformation.
Building a job from a set of assets
You can build a Dagster job that materializes a set of assets. The generated job can be used anywhere you'd use a regular Dagster job: you can invoke `execute_in_process`, include it inside a Dagster repository, etc.

Viewing assets in Dagit
To turn on the experimental asset UI, click the gear icon in the top right of Dagit, and switch on "Experimental Asset APIs":
Then, when you navigate to a job that was built from a set of assets, you'll see a page that looks like this:
This is different from the standard Job / Pipeline page in a few ways:
Assets and dbt
Software-defined assets support a dbt-native approach to orchestration. A dbt model is essentially a software-defined asset: it has an asset key (the name of the dbt model), an op (the SQL select statement that computes the model), and upstream assets (the `ref`s and `source`s inside the select statement).

You can load all the models in a dbt project into assets:
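To make that mapping concrete without requiring dbt, here is a toy, plain-Python sketch; everything in it is invented for illustration (the real dagster-dbt integration parses the dbt project itself rather than regex-scanning SQL). It derives a model's upstream asset names from the `ref`s and `source`s in its SELECT statement:

```python
import re


def upstream_assets_from_sql(sql):
    """Toy illustration: derive upstream asset names for a dbt model
    from the ref() and source() calls inside its SQL."""
    # Each ref('model') points at another dbt model, i.e. another asset.
    refs = re.findall(r"ref\(\s*'([^']+)'\s*\)", sql)
    # Each source('schema', 'table') points at a non-dbt upstream asset;
    # join the two parts into a single illustrative asset name.
    sources = [
        f"{schema}_{table}"
        for schema, table in re.findall(
            r"source\(\s*'([^']+)'\s*,\s*'([^']+)'\s*\)", sql
        )
    ]
    return refs + sources


model_sql = """
select user_id, count(*) as n_events
from {{ ref('stg_events') }}
join {{ source('raw', 'users') }} using (user_id)
group by user_id
"""
```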
You can then visualize your dbt model graph in Dagit, execute models individually, and track lineage between dbt models and non-dbt assets, or between dbt models in different dbt projects. One of the things this is useful for is determining the consequences of changing or removing a dbt model.
The Dagit screenshot above shows a trio of assets loaded from a dbt project. Dagster automatically loads in column documentation from dbt's schema.yml, as well as the SQL for the model.
Future Directions
What's laid out above is the initial foundation of software-defined assets in Dagster. Here's what we foresee building on top of it in the future:
A command, similar to `terraform plan`, that allows you to view a diff between your current deployed data and your assets defined in code.