Validate that task ID data type is consistent across rounds #28

Open
annakrystalli opened this issue Aug 2, 2024 · 0 comments

For a hub to be successfully accessed as an Arrow dataset, column data types must not change from round to round.
Most task IDs covered by our schema shouldn't change data type in later rounds, because the schema largely fixes their type. However:

  1. Task IDs that accept more than one data type
  2. Custom task IDs, which are beyond our control

have the potential to vary between modeling tasks/rounds and to change over time, which could cause problems downstream. This is mainly a problem for parquet files (but has a small chance of causing problems in CSVs too).

Dynamic check for more than one data type in task ID columns

Develop a dynamic config-level validation check that:

  • Validates that task ID values across all rounds and modeling tasks share a single data type.
  • If not, determines the simplest data type that can encode all values.
  • If later rounds introduce a change in data type, issues a warning that hub integrity might be affected by such a change.
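
The check described above could be sketched roughly as follows. This is only an illustration in Python, not a proposed implementation: the function names (`value_type`, `check_task_id_types`), the type names and the coercion order in `TYPE_ORDER` are all assumptions, and the config is assumed to be a parsed `tasks.json`-like dict with `rounds`, `model_tasks` and `task_ids` entries holding `required`/`optional` value lists.

```python
from collections import defaultdict

# Assumed coercion order, from most to least restrictive.
TYPE_ORDER = ["logical", "integer", "double", "character"]


def value_type(value):
    """Map a JSON-decoded value to a coarse type name (hypothetical helper)."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "logical"
    if isinstance(value, int):
        return "integer"
    if isinstance(value, float):
        return "double"
    return "character"


def check_task_id_types(config):
    """Return (simplest common type per task ID, list of warning messages)."""
    seen = defaultdict(set)  # task ID -> type names observed so far
    warnings = []
    for rnd in config["rounds"]:
        for model_task in rnd["model_tasks"]:
            for task_id, spec in model_task["task_ids"].items():
                # required/optional may be null in the config
                values = (spec.get("required") or []) + (spec.get("optional") or [])
                combined = seen[task_id] | {value_type(v) for v in values}
                # Warn whenever a new type joins an already non-uniform set
                if len(combined) > 1 and combined != seen[task_id]:
                    warnings.append(
                        f"Task ID '{task_id}' has mixed data types "
                        f"{sorted(combined)} as of round '{rnd['round_id']}'; "
                        "hub integrity might be affected."
                    )
                seen[task_id] = combined
    # The simplest type able to encode all values is the least restrictive one seen
    simplest = {
        task_id: max(types, key=TYPE_ORDER.index)
        for task_id, types in seen.items()
        if types
    }
    return simplest, warnings


# Made-up example config: 'horizon' switches from integer to double in round 2
config = {
    "rounds": [
        {
            "round_id": "round-1",
            "model_tasks": [
                {"task_ids": {
                    "horizon": {"required": [1, 2], "optional": None},
                    "location": {"required": ["US"], "optional": ["01"]},
                }}
            ],
        },
        {
            "round_id": "round-2",
            "model_tasks": [
                {"task_ids": {
                    "horizon": {"required": None, "optional": [0.5, 1]},
                    "location": {"required": ["US"], "optional": None},
                }}
            ],
        },
    ]
}

simplest, warnings = check_task_id_types(config)
```

In this sketch, date-like values are treated as plain character strings, and a type change only produces a warning rather than a validation failure, in line with the third bullet.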
Projects
Status: Todo