RFC: Partitioned Assets #2950

sryza · 2020-09-24T00:23:53Z

sryza
Sep 24, 2020

Motivation

Conceptually, when we say that “a pipeline is partitioned”, we really mean that it produces or consumes partitioned assets.

More concretely, anyone responsible for maintaining an asset would find it useful to be able to answer these questions about it:

- What partitions of my asset exist and are up-to-date?
  - People and code depending on an asset care about this as well.
- What run produced each partition of my asset?

Proposed Changes

Add a partition_name attribute to AssetMaterialization:

yield AssetMaterialization(
    asset_key=["prod", "table_1"],
    partition_name="2020-08-05",
)
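As a rough sketch of how a partition name on each materialization could answer the questions under Motivation: the record type and helper below are hypothetical stand-ins, not Dagster APIs, and they assume each stored event carries the ID of the run that emitted it.

```python
from typing import Dict, List, NamedTuple, Optional


class MaterializationRecord(NamedTuple):
    """Hypothetical stored event: a materialization plus the run that emitted it."""

    asset_key: List[str]
    partition_name: Optional[str]
    run_id: str


def latest_run_by_partition(
    records: List[MaterializationRecord], asset_key: List[str]
) -> Dict[str, str]:
    # Records are assumed to be ordered oldest to newest,
    # so a later run overwrites an earlier one for the same partition.
    latest: Dict[str, str] = {}
    for record in records:
        if record.asset_key == asset_key and record.partition_name is not None:
            latest[record.partition_name] = record.run_id
    return latest


records = [
    MaterializationRecord(["prod", "table_1"], "2020-08-05", "run_1"),
    MaterializationRecord(["prod", "table_1"], "2020-08-06", "run_2"),
    MaterializationRecord(["prod", "table_1"], "2020-08-05", "run_3"),
]
by_partition = latest_run_by_partition(records, ["prod", "table_1"])
```

The keys of the result are the partitions that exist, and the values are the runs that most recently produced them.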

Should we allow a single AssetMaterialization to include multiple partitions?

Dagit

Some cool views we could build:

Asset-Partition Matrix

Plot a value over time like we do with current run-based longitudinal views

alangenfeld · 2020-09-24T15:36:14Z

alangenfeld
Sep 24, 2020
Maintainer

This seems like a slam dunk to me, very small complexity cost for substantial utility.

Should we allow a single AssetMaterialization to include multiple partitions?

When I think of this, I think of broader support for multi-dimensional partitioning. I believe supporting that here makes sense as part of that broader effort. For example: daily_active (2020-08-07, US) and daily_active (2020-08-07, GB).

Having a single asset belong to multiple partitions makes less sense to me. If this was what you meant, do you have a motivating example?

5 replies

sryza Sep 24, 2020
Author

Having a single asset belong to multiple partitions makes less sense to me. If this was what you meant, do you have a motivating example?

I'm imagining situations where a single step execution fills in multiple partitions of the same asset. E.g. if I want to do a backfill over a big Spark table, I'd probably rather have Spark handle the parallelism than Dagster, so I'd like to kick off a single step with a big Spark job that does all the partitions.

alangenfeld Sep 24, 2020
Maintainer

single step execution fills in multiple partitions of the same asset

So in my head the way this would work would be to yield multiple AssetMaterializations, each with a different partition name.

To try to refine what I am saying: having multiple partition names from the same partition set on a single AssetMaterialization feels off to me. Having multiple partition names, each from a separate partition set, does make sense to me.
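A minimal sketch of the one-event-per-partition approach, assuming the proposed partition_name attribute. The AssetMaterialization class here is a stand-in so the snippet runs without Dagster, and backfill_step is a hypothetical name.

```python
from typing import Iterator, List, NamedTuple, Optional


class AssetMaterialization(NamedTuple):
    """Stand-in for Dagster's event type; partition_name is the attribute this RFC proposes."""

    asset_key: List[str]
    partition_name: Optional[str] = None


def backfill_step(partition_names: List[str]) -> Iterator[AssetMaterialization]:
    # A single step (e.g. one large Spark job) writes every partition,
    # then reports each partition as its own materialization event.
    for name in partition_names:
        yield AssetMaterialization(
            asset_key=["prod", "table_1"],
            partition_name=name,
        )


events = list(backfill_step(["2020-08-05", "2020-08-06", "2020-08-07"]))
```

Because each partition gets its own event, each one can also carry its own metadata.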

sryza Sep 24, 2020
Author

Yeah, agreed. Each partition could have different metadata.

schrockn Sep 24, 2020
Maintainer

Not totally related, but it would be good to have a compelling multi-dimensional partition example checked into the repo so that we can ground these discussions a bit.

kinghuang Oct 3, 2020

I support yielding multiple AssetMaterializations (or AssetPartitionMaterialization as described below), each with a different partition. We have Dask-based pipelines that process and output multiple partitions right now, which we haven't been able to represent as Dagster partitions.

schrockn · 2020-09-24T15:41:36Z

schrockn
Sep 24, 2020
Maintainer

It looks like the ship has sailed on this, but to me key makes much more sense than name to uniquely identify a partition.

3 replies

sryza Sep 24, 2020
Author

No code has been merged, if that's what you mean.

I went with name because that's what's currently in Partition. Could also change it there?

sryza Sep 24, 2020
Author

Oh never mind, that's what you're referring to already.

schrockn Sep 24, 2020
Maintainer

yeah exactly

alangenfeld · 2020-09-24T16:18:47Z

alangenfeld
Sep 24, 2020
Maintainer

Just to talk through it a bit more, if you have

yield AssetMaterialization(
    asset_key=["prod", "table_1"],
    partition_name="2020-08-05",
)

At some level, that is not really any different than

yield AssetMaterialization(
    asset_key=["prod", "table_1", "2020-08-05"],
)

except for losing the information that prod and table_1 identify the conceptual "asset" and 2020-08-05 identifies a partition of it. The physical storage or address of the asset likely feels closer to ["prod", "table_1", "2020-08-05"].

What if we pushed this information down into AssetKey itself in some way? Something like

yield AssetMaterialization(
    asset_key=AssetKey(["prod", "table_1"]).partition("2020-08-05"),
)
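To make the idea concrete, here is one way a partition() method could behave. This is only a sketch: both classes are stand-ins rather than Dagster's real types, and the return type name, its fields, and to_path are assumptions for illustration.

```python
from typing import NamedTuple, Sequence, Tuple


class AssetKey(NamedTuple):
    """Stand-in for Dagster's AssetKey: identifies the conceptual asset."""

    path: Sequence[str]

    def partition(self, name: str) -> "AssetPartitionKey":
        # Hypothetical method from the comment above; not an existing Dagster API.
        return AssetPartitionKey(asset_key=self, partition_name=name)


class AssetPartitionKey(NamedTuple):
    """Hypothetical handle for one partition of an asset."""

    asset_key: AssetKey
    partition_name: str

    def to_path(self) -> Tuple[str, ...]:
        # Flattened form, closer to the physical address of the partition.
        return tuple(self.asset_key.path) + (self.partition_name,)


key = AssetKey(["prod", "table_1"]).partition("2020-08-05")
```

The conceptual asset stays recoverable through key.asset_key, while to_path() gives the flattened form when an address-like key is needed.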

2 replies

sryza Sep 24, 2020
Author

What if we pushed this information down into AssetKey itself in some way?

@alangenfeld and I chatted about this a bit. It makes sense to me to encapsulate the set of attributes that make up the handle for the materialized object. My concern was that there should be one AssetKey per conceptual asset.

We discussed adding an AssetPartitionKey, which would include both an AssetKey and a partition name. I've been playing around with this a bit.

If I'm understanding correctly, it requires us to do one of the following:

- Have AssetMaterializations contain a union of (AssetKey, AssetPartitionKey). This kind of union adds some awkwardness to dealing with AssetMaterializations.
- Have a separate AssetPartitionMaterialization, which we'd then pipe through alongside AssetMaterialization.

At the graphql layer, if we don't add AssetPartitionKey, we'd have:

def get_asset_events(graphene_info, asset_key: AssetKey, partition_name: Optional[str]): ...
def get_asset_run_ids(graphene_info, asset_key: AssetKey, partition_name: Optional[str]): ...

With AssetPartitionKey, we'd have:

def get_asset_events(graphene_info, asset_key: AssetKey): ...
def get_asset_run_ids(graphene_info, asset_key: AssetKey): ...
def get_asset_partition_events(graphene_info, asset_key: AssetPartitionKey): ...
def get_asset_partition_run_ids(graphene_info, asset_key: AssetPartitionKey): ...

Alex - do you have thoughts on what's preferable? My impulse is that the non-AssetPartitionKey world is a little cleaner, but this is also my first time dealing with this part of the system.

alangenfeld Sep 25, 2020
Maintainer

This kind of union adds some awkwardness to dealing with AssetMaterializations.

It might mitigate some of the awkwardness if you have a common interface (IAssetKey) that both types implement. Similarly on the GraphQL input side you could do something like

input AssetKeyInput {
  path: [String!]!
  partition_name: String # optional
}
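On the Python side, the common-interface idea might look roughly like the sketch below. Only the name IAssetKey comes from the comment above; the method names asset_path and partition, and both key classes, are assumptions made for illustration.

```python
from typing import List, NamedTuple, Optional, Protocol


class IAssetKey(Protocol):
    """Hypothetical shared interface implemented by both key types."""

    def asset_path(self) -> List[str]: ...

    def partition(self) -> Optional[str]: ...


class AssetKey(NamedTuple):
    path: List[str]

    def asset_path(self) -> List[str]:
        return list(self.path)

    def partition(self) -> Optional[str]:
        return None  # an unpartitioned key has no partition


class AssetPartitionKey(NamedTuple):
    asset_key: AssetKey
    partition_name: str

    def asset_path(self) -> List[str]:
        return self.asset_key.asset_path()

    def partition(self) -> Optional[str]:
        return self.partition_name


def describe(key: IAssetKey) -> str:
    # Callers handle either key type without isinstance checks.
    name = key.partition()
    suffix = f" [{name}]" if name is not None else ""
    return "/".join(key.asset_path()) + suffix


plain = describe(AssetKey(["prod", "table_1"]))
partitioned = describe(AssetPartitionKey(AssetKey(["prod", "table_1"]), "2020-08-05"))
```

Code that receives a materialization can then treat the key uniformly and only look at partition() when it cares about partitions.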

do you have thoughts on what's preferable?

Unclear that there is an actual substantive difference either way, so I think it's the implementer's call.

sryza · 2020-09-29T16:01:07Z

sryza
Sep 29, 2020
Author

Tests now passing: https://dagster.phacility.com/D4526

0 replies

kinghuang · 2020-10-03T18:57:29Z

kinghuang
Oct 3, 2020

With this RFC, how will PartitionSetDefinitions be adapted?

Right now, I effectively treat PartitionSetDefinition as a partitioned version of PresetDefinition, since the two don't interact with each other (relates to #2704). However, when I was first learning about Dagster, what I envisioned PartitionSetDefinition doing was more along the lines of what this RFC proposes to do to assets: define the partitions of a specific asset, not the partitions of a pipeline.

If I have a solid that takes a partitioned dataframe facts_df and joins it with a non-partitioned dataframe dimensions_df, what I really want to do is be able to select a partition of facts_df to operate on.

This is especially true when there are multiple partitioned assets in a pipeline. Say there are two assets A and B with 4 and 6 partitions, respectively. Ideally, I'd like to provide PartitionSetDefinitions for each asset independently, and have a way to make a selection for each asset. Under the current system, I have to make a PartitionSetDefinition that is a product of the two (4A × 6B), which gets long and unwieldy.
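To illustrate the size of the problem, here is a small sketch with made-up partition names for A and B. The "|" separator and the shape of the per-asset selection are assumptions, not anything Dagster defines.

```python
from itertools import product

# Hypothetical partition names for two independently partitioned assets.
a_partitions = [f"a{i}" for i in range(4)]
b_partitions = [f"b{i}" for i in range(6)]

# Current approach: one pipeline-level partition set that enumerates
# every combination of an A partition with a B partition.
combined = [f"{a}|{b}" for a, b in product(a_partitions, b_partitions)]

# Desired approach: one selection per asset (10 names to define instead of 24).
selection = {"A": "a2", "B": "b5"}
```

The combined list grows multiplicatively with each additional partitioned asset, while per-asset definitions grow only additively.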

1 reply

sryza Oct 13, 2020
Author

With this RFC, how will PartitionSetDefinitions be adapted?

We don't have plans to adapt PartitionSetDefinitions yet, but it's something I've been thinking about. One thing that comes up for me when thinking through this is that a given asset partition tends to depend on particular partitions of its upstream assets.

Some cases to think through:

- The case that Dagster supports best now: a set of assets partitioned in the same way where each partition has a corresponding partition in the upstream assets. E.g. a chain of tables all partitioned hourly.
- A daily rollup table, whose daily partitions depend on all the hourly partitions for that day in an upstream table.
- A rolling window table, where each partition depends on the prior 24 partitions in the parent table.

For a deep pipeline, if each asset has its own partitioning, it might be onerous for users to need to specify every partition for every run. In all these cases, if you know the particular partition you're trying to produce in some set of tables, you can infer the partitions you need to deal with in the upstream tables.
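The three cases above could each be expressed as a function from a downstream partition to the upstream partitions it needs. This is a sketch only: the function names and the hourly naming scheme are assumptions, and the rolling window is taken to exclude the current partition.

```python
from datetime import datetime, timedelta
from typing import List

HOUR_FMT = "%Y-%m-%d-%H"  # assumed naming scheme for hourly partitions
DAY_FMT = "%Y-%m-%d"  # assumed naming scheme for daily partitions


def same_partition(partition: str) -> List[str]:
    # Case 1: identically partitioned tables map one-to-one.
    return [partition]


def daily_rollup_upstream(day: str) -> List[str]:
    # Case 2: a daily partition needs all 24 hourly partitions of that day.
    start = datetime.strptime(day, DAY_FMT)
    return [(start + timedelta(hours=h)).strftime(HOUR_FMT) for h in range(24)]


def rolling_window_upstream(hour: str, window: int = 24) -> List[str]:
    # Case 3: an hourly partition needs the `window` partitions before it,
    # listed oldest first.
    end = datetime.strptime(hour, HOUR_FMT)
    return [
        (end - timedelta(hours=h)).strftime(HOUR_FMT) for h in range(window, 0, -1)
    ]


rollup_inputs = daily_rollup_upstream("2020-08-05")
window_inputs = rolling_window_upstream("2020-08-05-00")
```

Given mappings like these, a user would only need to name the partition they want at the end of the pipeline, and the upstream partitions could be derived.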

@kinghuang does this resonate with the usages you're thinking about?
