Replies: 1 comment
Related to #22567
Dagster's definition of partitions as lifted from the docs:
However, this assumes that the asset's partitions are written and read symmetrically, i.e. that how you write the asset corresponds to how you read it. This is not always the case. Take this example, also from the docs:
If you were to implement this using the current Dagster approach, you would read from the table once per partition, each time applying a date filter to retrieve only the relevant data. However, this can be significantly inefficient with respect to how the downstream asset is produced. The cost would be especially noticeable if the granularity were finer (e.g. hourly partitions in Dagster), if the Dagster partitions were not aligned with database indexes, or if the query placed undue load on the database server.
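To make the per-partition read pattern concrete, here is a minimal sketch in plain Python using the standard library's `sqlite3`. It does not use Dagster itself; the `orders` schema, the sample rows, and the helper names `read_partition` and `daily_keys` are assumptions for illustration only.

```python
import sqlite3
from datetime import date, timedelta

# Hypothetical schema: an `orders` table with an ISO-formatted `order_date` column.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, order_date TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (order_date, amount) VALUES (?, ?)",
    [("2024-01-01", 10.0), ("2024-01-01", 5.0), ("2024-01-02", 7.5), ("2024-01-03", 2.0)],
)


def read_partition(conn, partition_key):
    """Run one query for one partition, filtering the table down to a single day."""
    return conn.execute(
        "SELECT id, order_date, amount FROM orders WHERE order_date = ? ORDER BY id",
        (partition_key,),
    ).fetchall()


def daily_keys(start, end):
    """Return daily partition keys from start to end, inclusive."""
    days = (end - start).days + 1
    return [(start + timedelta(days=i)).isoformat() for i in range(days)]


partition_keys = daily_keys(date(2024, 1, 1), date(2024, 1, 3))

# One filtered query per partition: N partitions means N round trips to the database.
rows_by_partition = {key: read_partition(conn, key) for key in partition_keys}
query_count = len(partition_keys)
```

The number of queries grows linearly with the number of partitions, so moving from daily to hourly keys would multiply the round trips by 24 for the same data.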
A much faster, safer, and less resource-intensive way of materializing the parquet files would be to process the table's transaction log, or change data feed, using an offset cursor to incrementally load orders from the database in batches.
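A rough sketch of what such an incremental load could look like, again in plain Python with `sqlite3` rather than Dagster. As a simplifying assumption, a monotonically increasing `id` column stands in for a real transaction log or change data feed, so only appended rows are picked up; the schema and the `load_batch` helper are hypothetical.

```python
import sqlite3
from collections import defaultdict

# Hypothetical schema; `id` is assumed to increase with every inserted order.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, order_date TEXT, amount REAL)"
)


def load_batch(conn, cursor, batch_size=1000):
    """Read rows with id > cursor and group them by day.

    Returns (rows grouped by partition key, new cursor value).
    """
    rows = conn.execute(
        "SELECT id, order_date, amount FROM orders WHERE id > ? ORDER BY id LIMIT ?",
        (cursor, batch_size),
    ).fetchall()
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[1]].append(row)  # row[1] is order_date, the partition key
    # Advance the cursor only if something was read.
    new_cursor = rows[-1][0] if rows else cursor
    return dict(partitions), new_cursor


insert_sql = "INSERT INTO orders (order_date, amount) VALUES (?, ?)"
conn.executemany(
    insert_sql, [("2024-01-01", 10.0), ("2024-01-02", 7.5), ("2024-01-02", 1.0)]
)

# Run 1: a single batch spans two daily partitions.
first, cursor = load_batch(conn, cursor=0)

# Run 2: nothing new since the last cursor, so no partitions are touched.
second, cursor = load_batch(conn, cursor)

# Run 3: one new order arrives, so exactly one partition is touched.
conn.execute(insert_sql, ("2024-01-03", 4.0))
third, cursor = load_batch(conn, cursor)
```

Each run issues one query regardless of how many days the new rows cover, and the set of partitions it touches is only known after the batch has been read.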
Dagster currently has no primitives for defining an asymmetrical read/write, where a single run writes to multiple partitions but assets further downstream (which would typically have partition alignment with the parquet files asset) read on a per-partition basis.
Currently I'm experimenting with having two logical assets to represent the different modes (a non-partitioned incremental load asset, and a partitioned read asset -- both representing the same physically partitioned asset).
But I couldn't help but wonder, would it make sense to implement a feature where an incremental materialization may write to zero, one, or more partitions in a single run?