Problem Statement
The lifecycle of fixing data quality issues can be very long. The best way to reduce the turnaround is to make detection as fast as possible. The earliest a data quality issue can be caught is before the DataFrame is written as a table. Catching it there would also make it possible to stop the pipeline when the data being written has downstream effects.
This is a common pattern in data engineering, sometimes called Stage (write to a temporary table), Check (execute data quality checks), Exchange (write to the production table); Iceberg refers to it as write-audit-publish. The idea is the same: before marking data as production, execute some checks.
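A minimal sketch of the flow, assuming Spark; the method name, the staging-table naming, and the shape of the checks are illustrative assumptions, not an existing Chronon API:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Stage -> Check -> Exchange (write-audit-publish) sketch.
// Each check inspects the staged data and returns a failure message, or None if it passes.
def stageCheckExchange(
    spark: SparkSession,
    df: DataFrame,
    prodTable: String,
    checks: Seq[DataFrame => Option[String]]
): Unit = {
  val stagingTable = s"${prodTable}_staging"

  // Stage: write to a temporary table first.
  df.write.mode("overwrite").saveAsTable(stagingTable)

  // Check: run the data quality checks against the staged data.
  val staged   = spark.table(stagingTable)
  val failures = checks.flatMap(check => check(staged))
  if (failures.nonEmpty)
    throw new IllegalStateException(s"Data quality checks failed: ${failures.mkString("; ")}")

  // Exchange: only after all checks pass, publish to the production table.
  staged.write.mode("overwrite").saveAsTable(prodTable)
}
```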
Chronon has a tableUtils module that takes care of writing the data and even collects some stats to make these writes more efficient. The idea would be to define a schema for the verifications we may want to run on the data before writing, to minimize the time to detect, for example: an excessive null rate (commonly associated with bad timestamps or missing input data), missing data (easily detected by a drop in row count), bot activity (new heavy hitters), bad timestamps that reflect past activity, or new values in categorical columns.
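One possible shape for such check definitions is sketched below; every name and field here is an assumption for illustration, not a proposal for the final schema:

```scala
// Hypothetical data quality check definitions, one per failure mode mentioned above.
sealed trait DataQualityCheck

// Fail if the null rate of a column exceeds a threshold (bad timestamps, missing input data).
case class MaxNullRate(column: String, threshold: Double) extends DataQualityCheck

// Fail if the row count drops by more than `maxDropFraction` relative to the previous partition.
case class MaxRowCountDrop(maxDropFraction: Double) extends DataQualityCheck

// Fail if a single key accounts for more than `maxShare` of rows (new heavy hitters / bot activity).
case class MaxKeyShare(column: String, maxShare: Double) extends DataQualityCheck

// Fail if timestamps fall outside the expected partition range (stale or future activity).
case class TimestampInRange(column: String, minTs: Long, maxTs: Long) extends DataQualityCheck

// Fail if a categorical column contains values outside an allowed set.
case class AllowedValues(column: String, allowed: Set[String]) extends DataQualityCheck

// A named set of checks attached to a table write.
case class DataQualitySpec(table: String, checks: Seq[DataQualityCheck])
```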
Requirements
[ ] Schema for data quality check definitions
[ ] Expand or migrate DataFrame Stats to take on this new responsibility before writing data.
[ ] Extra checks may have performance implications, so they should run only when requested in the configuration, keeping the balance between performance and data reliability acceptable.
Verification
New behavior can be unit tested.
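As a sketch of what such a unit test could look like, assuming the checks are evaluated as pure functions over pre-collected stats (all names below are hypothetical, not existing Chronon code):

```scala
import org.scalatest.funsuite.AnyFunSuite

// Hypothetical stats container and gate used only for this test sketch.
case class ColumnStats(rowCount: Long, nullRates: Map[String, Double])

object Gate {
  // Returns failure messages; an empty result means the write may proceed.
  def run(stats: ColumnStats, checks: Seq[ColumnStats => Option[String]]): Seq[String] =
    checks.flatMap(_(stats))
}

class DataQualityGateTest extends AnyFunSuite {
  private val nullRateCheck: ColumnStats => Option[String] = stats =>
    stats.nullRates.get("ts").filter(_ > 0.1).map(rate => s"ts null rate too high: $rate")

  test("an excessive null rate blocks the write") {
    val stats = ColumnStats(rowCount = 1000L, nullRates = Map("ts" -> 0.4))
    assert(Gate.run(stats, Seq(nullRateCheck)).nonEmpty)
  }

  test("healthy data passes the gate") {
    val stats = ColumnStats(rowCount = 1000L, nullRates = Map("ts" -> 0.01))
    assert(Gate.run(stats, Seq(nullRateCheck)).isEmpty)
  }
}
```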
Approach
TBD
User API (when required)
TBD