Problem Statement
The lifecycle of fixing data quality issues can be very long. The best way to reduce the turnaround is to make detection as fast as possible. The earliest a data quality issue can be caught is before the DataFrame is written as a table. Catching it there would also make it possible to stop the pipeline when the data being written has downstream effects.
This is a common pattern in data engineering, sometimes called Stage (write to a temporary table), Check (execute data quality checks), Exchange (write to the production table); Iceberg refers to it as write-audit-publish. The idea is the same: before marking data as production, execute some checks.
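A minimal sketch of the flow, assuming Spark; the method name, the staging-table naming, and the shape of the checks are illustrative assumptions, not an existing Chronon API:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Stage -> Check -> Exchange (write-audit-publish) sketch.
// Each check inspects the staged data and returns a failure message, or None if it passes.
def stageCheckExchange(
    spark: SparkSession,
    df: DataFrame,
    prodTable: String,
    checks: Seq[DataFrame => Option[String]]
): Unit = {
  val stagingTable = s"${prodTable}_staging"

  // Stage: write to a temporary table first.
  df.write.mode("overwrite").saveAsTable(stagingTable)

  // Check: run the data quality checks against the staged data.
  val staged   = spark.table(stagingTable)
  val failures = checks.flatMap(check => check(staged))
  if (failures.nonEmpty)
    throw new IllegalStateException(s"Data quality checks failed: ${failures.mkString("; ")}")

  // Exchange: only after all checks pass, publish to the production table.
  staged.write.mode("overwrite").saveAsTable(prodTable)
}
```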
Chronon has a tableUtils module that takes care of writing the data and even collects some stats to make these writes more efficient. The idea would be to define a schema for the verifications we may want to run on the data before writing, to minimize the time to detect, for example: an excessive null rate (commonly associated with bad timestamps or missing input data), missing data (easily detected by a drop in row count), bot activity (new heavy hitters), bad timestamps that reflect past activity, or new values in categorical columns.
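One possible shape for such check definitions is sketched below; every name and field here is an assumption for illustration, not a proposal for the final schema:

```scala
// Hypothetical data quality check definitions, one per failure mode mentioned above.
sealed trait DataQualityCheck

// Fail if the null rate of a column exceeds a threshold (bad timestamps, missing input data).
case class MaxNullRate(column: String, threshold: Double) extends DataQualityCheck

// Fail if the row count drops by more than `maxDropFraction` relative to the previous partition.
case class MaxRowCountDrop(maxDropFraction: Double) extends DataQualityCheck

// Fail if a single key accounts for more than `maxShare` of rows (new heavy hitters / bot activity).
case class MaxKeyShare(column: String, maxShare: Double) extends DataQualityCheck

// Fail if timestamps fall outside the expected partition range (stale or future activity).
case class TimestampInRange(column: String, minTs: Long, maxTs: Long) extends DataQualityCheck

// Fail if a categorical column contains values outside an allowed set.
case class AllowedValues(column: String, allowed: Set[String]) extends DataQualityCheck

// A named set of checks attached to a table write.
case class DataQualitySpec(table: String, checks: Seq[DataQualityCheck])
```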
Requirements
[ ] Schema for data quality check definitions
[ ] Expand or migrate DataFrame Stats to take on this new responsibility before writing data.
[ ] Extra checks may have performance implications, so they should run only when requested in the configuration, keeping the balance between performance and data reliability acceptable.
Verification
New behavior can be unit tested.
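As a sketch of what such a unit test could look like, assuming the checks are evaluated as pure functions over pre-collected stats (all names below are hypothetical, not existing Chronon code):

```scala
import org.scalatest.funsuite.AnyFunSuite

// Hypothetical stats container and gate used only for this test sketch.
case class ColumnStats(rowCount: Long, nullRates: Map[String, Double])

object Gate {
  // Returns failure messages; an empty result means the write may proceed.
  def run(stats: ColumnStats, checks: Seq[ColumnStats => Option[String]]): Seq[String] =
    checks.flatMap(_(stats))
}

class DataQualityGateTest extends AnyFunSuite {
  private val nullRateCheck: ColumnStats => Option[String] = stats =>
    stats.nullRates.get("ts").filter(_ > 0.1).map(rate => s"ts null rate too high: $rate")

  test("an excessive null rate blocks the write") {
    val stats = ColumnStats(rowCount = 1000L, nullRates = Map("ts" -> 0.4))
    assert(Gate.run(stats, Seq(nullRateCheck)).nonEmpty)
  }

  test("healthy data passes the gate") {
    val stats = ColumnStats(rowCount = 1000L, nullRates = Map("ts" -> 0.01))
    assert(Gate.run(stats, Seq(nullRateCheck)).isEmpty)
  }
}
```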
Approach
TBD
User API (when required)
TBD