Validate that task ID data type is consistent across rounds #28

annakrystalli · 2024-08-02T07:51:16Z

For a hub to be successfully accessed as an arrow dataset, column data types should not change from round to round.
Task IDs covered by our schema generally shouldn't change data type in later rounds, since the schema largely fixes their type. However, two kinds of task ID have the potential to vary between modeling tasks/rounds and change over time:

Task IDs that accept more than one data type
Custom task IDs, which are beyond our control

Such changes could cause problems downstream. This is mainly a problem for parquet files (but has a small chance to cause problems in csvs too).

Dynamic check for more than one data type in task ID columns

Develop a dynamic config level validation check that:

Validates that task ID values across all rounds and modeling tasks share a single data type.
If not, determine the simplest data type that can encode all values.
If later rounds introduce a change in data type, issue a warning that hub integrity might be affected by such a change.
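
The three steps above could be sketched roughly as follows. This is an illustration only, not an existing hubverse function: it assumes a simplified config layout (`rounds` > `model_tasks` > `task_ids`, each with `required`/`optional` value lists), and the names `TYPE_ORDER`, `value_type`, `simplest_common_type`, `check_task_id_types` and `example_config` are all hypothetical. The type ordering (logical < integer < double < character) is also an assumption about which type can encode which.

```python
import warnings

# Assumed ordering: each type can encode every type listed before it.
TYPE_ORDER = ["logical", "integer", "double", "character"]


def value_type(value):
    """Map a single config value to a type name (hypothetical mapping)."""
    # bool is checked first because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "logical"
    if isinstance(value, int):
        return "integer"
    if isinstance(value, float):
        return "double"
    return "character"


def simplest_common_type(types):
    """Return the least general type that can still encode all given types."""
    return max(types, key=TYPE_ORDER.index)


def check_task_id_types(config):
    """Return {task_id: common type} and warn when a later round changes type."""
    history = {}  # task_id -> [(round_id, type for that round), ...]
    for rnd in config["rounds"]:
        per_round = {}
        for model_task in rnd["model_tasks"]:
            for task_id, spec in model_task["task_ids"].items():
                # required/optional may be null (None) in the config.
                values = (spec.get("required") or []) + (spec.get("optional") or [])
                for value in values:
                    per_round.setdefault(task_id, set()).add(value_type(value))
        for task_id, types in per_round.items():
            history.setdefault(task_id, []).append(
                (rnd["round_id"], simplest_common_type(types))
            )

    common = {}
    for task_id, rounds in history.items():
        common[task_id] = simplest_common_type([t for _, t in rounds])
        # Compare each round with the one before it.
        for (prev_id, prev_t), (cur_id, cur_t) in zip(rounds, rounds[1:]):
            if cur_t != prev_t:
                warnings.warn(
                    f"Task ID '{task_id}' changes type from {prev_t} "
                    f"(round {prev_id}) to {cur_t} (round {cur_id}); "
                    "hub integrity might be affected."
                )
    return common


# Made-up sample config: 'horizon' is integer in round 1, character in round 2.
example_config = {
    "rounds": [
        {"round_id": "2024-01", "model_tasks": [{"task_ids": {
            "location": {"required": ["US"], "optional": ["01", "02"]},
            "horizon": {"required": None, "optional": [1, 2]},
        }}]},
        {"round_id": "2024-02", "model_tasks": [{"task_ids": {
            "location": {"required": ["US"], "optional": ["01", "02"]},
            "horizon": {"required": None, "optional": ["1 wk", "2 wk"]},
        }}]},
    ]
}
```

Calling `check_task_id_types(example_config)` on this sample reports `character` as the common type for both task IDs and emits one warning for `horizon`, whose type changes in the second round. A real check would also need to handle dates and any types the schema allows beyond these four.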

annakrystalli added the validation label Aug 2, 2024

annakrystalli added this to hubverse Development overview Aug 2, 2024

github-project-automation bot moved this to Todo in hubverse Development overview Aug 2, 2024

annakrystalli added this to the robust-hub-schema milestone Aug 2, 2024
