
Issue an error (instead of a warning) if incoming column data types don't match the hub schema #111

Closed
bsweger opened this issue Aug 22, 2024 · 4 comments

bsweger commented Aug 22, 2024

When researching hubverse-org/hubverse-transform#14, I realized that parquet model-output files with a column data type that doesn't match the config-derived schema will result in a validation warning instead of an error (full convo is in the linked issue, see relevant snippet below).

Such a mismatch can result in downstream errors when working with the data (for both local/github and cloud-based hubs). Therefore, we should consider re-classifying the warning to an error.

How do we want/expect modelers to respond to validation warnings vs errors? For FIPS codes, when running the validation process against a parquet file that doesn't include all 50 states and contains no string values (like "US"), validation will fail because the 01-09 FIPS codes get their leading zeros dropped and don't validate against the list of valid locations in tasks.json.

But if you then remove all model-output entries for locations 01 - 09, the validation returns the "data types do not match hub schema" warning, but not an error (because two-digit FIPS codes do match the list of valid locations).


! 2024-05-12-parquet-test_data.parquet: Column data types do not match hub schema. `origin_date` should be "Date" not "character",
  `location` should be "character" not "integer", `output_type_id` should be "character" not "double"
✖ 2024-05-12-parquet-test_data.parquet: `tbl` contains invalid values/value combinations. Column `location` contains invalid values "9",
  "8", and "6".

@bsweger bsweger changed the title Issue and error (instead of a warning) if incoming column data types don't match the hub schema Issue an error (instead of a warning) if incoming column data types don't match the hub schema Aug 22, 2024
@annakrystalli
Member

While I'm not averse to this suggestion, I want to make sure any misconceptions are cleared up first.

As stated in the docs both errors and warnings are considered validation failures and will throw an error via check_for_errors(). This is consistent with R convention where the standard is for R CMD CHECK to fail if it returns any errors or warnings. So validation warnings cannot be optionally ignored and files with data types inconsistent with the hub schema would not pass validation. As an aside, there is also now a whole section on the importance of maintaining a stable hub schema.

The error vs warning distinction is admittedly used primarily for purposes internal to the validation process, i.e. an error usually causes an early return of the validation process because downstream tests cannot be performed until the failed check is addressed. In the case of deviation from the hub schema, because the rest of the validation checks are performed on all-character versions of the data, validation can continue despite data type differences, hence this failure is handled as a warning and the rest of validation continues. That doesn't mean the warning can be ignored. It will still trigger an error in check_for_errors().

So while I could change this to an error, I don't really want it to cause an early return when it's possible to run further checks, so changing it would go against the paradigm all the other checks follow. Having said that, I could make that clearer in the docs, i.e. that the warning vs error distinction has more to do with the mechanics of the validation process than with severity.
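
To make that paradigm concrete, here is a rough, purely illustrative sketch of the control flow (a toy stand-in for the real functions, not the actual hubValidations internals), using the check_failure / check_error classes discussed below:

# Illustrative only: "check_error" results stop the check sequence early,
# "check_failure" (warning-level) results let later checks run, but
# check_for_errors() fails the file on either kind.
run_checks <- function(checks) {
  results <- list()
  for (check in checks) {
    res <- check()
    results <- c(results, list(res))
    if (inherits(res, "check_error")) break # early return: later checks depend on this one
  }
  results
}

check_for_errors <- function(results) {
  failed <- vapply(results, inherits, logical(1), what = c("check_failure", "check_error"))
  if (any(failed)) stop("Some checks produced failures or errors.", call. = FALSE)
  invisible(results)
}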

Having said all that, while check_for_errors() does focus on warnings & errors, directing users to what needs fixing, when a hub_validations object is just printed, the ! symbol definitely does not draw the same attention as an ✖ and can be difficult to distinguish from the info (i) symbol. There is also no simple indication of whether the file is valid or not.

So I'm definitely up for thinking through how to make warnings more visible in general, e.g. here are some other symbol options in cli:

cli::cli_text("{cli::symbol$cross} error | {cli::symbol$circle_cross} warning | {cli::symbol$checkbox_on} warning")
#> ✖ error | ⓧ warning | ☒ warning

and perhaps also issuing an overall summary when printing a hub_validations object, e.g:

cli::cli_h2(cli::format_inline("{cli::symbol$tick} File valid"))
#> 
#> ── ✔ File valid ──
#> 
cli::cli_h2(cli::format_inline("{cli::symbol$cross} File invalid"))
#> 
#> ── ✖ File invalid ──
#> 

@bsweger
Author

bsweger commented Aug 28, 2024

@annakrystalli Thanks for the detailed explanation of what's happening under the hood!

Taking a step back from the underlying technical issues, I keep coming back to this statement:

This is consistent with R convention where the standard is for R CMD CHECK to fail if it returns any errors or warnings. So validation warnings cannot be optionally ignored and files with data types inconsistent with the hub schema would not pass validation.

I'm wary of applying R coding conventions as guidance for overall hub usability. Given that hub admins have the option to merge a model-output submission even if the overall validations fail, a reasonable user could assume that merging warnings won't break anything in the same way that merging an error would.

A model output file that doesn't conform to its hub's schema is an error that can impact downstream operations (especially for hubs that accept parquet submissions), so my .02 is that we should treat it as such.

All that said, given that this is an edge case with a low probability of occurring in general hub usage, addressing it might not be a high priority for us. [To reproduce the "invalid schema" warning with a hub that accepts FIPS codes for locations, you'd have to submit a model output file that contains only 2-digit FIPS codes and no character values like "US".]

@zkamvar
Member

zkamvar commented Aug 28, 2024

Opinion

Giving my opinion though it was not requested, when I run a program, the messages I expect are:

  • things I can ignore
  • things that I need to address

The blogdown package does a really good job at this: https://alison.netlify.app/ares-kind-tools/#120

R's warnings are a bit weird because they exist in a grey area between "things I need to address" and "things I can ignore". They always appear after your run is finished and are often hidden if you have more than 50.

Suggestion

I'm wary of applying R coding conventions as guidance for overall hub usability. Given that hub admins have the option to merge a model-output submission even if the overall validations fail, a reasonable user could assume that merging warnings won't break anything in the same way that merging an error would.

For what it's worth, I believe calling it warning/error in this case is mostly semantic. The output of check_for_errors() is an error if there are either check_failure or check_error elements.

I have a suggestion: since they both result in an error, this might be solved by converting the following to rlang::error_cnd:

res <- rlang::warning_cnd(

(I tested this and the only things that change are minor attributes of snapshot tests.)

This way, they are both presented as different types of errors with the difference being if the error prevented downstream checks from being run.
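
A minimal standalone sketch of what that swap amounts to (plain rlang calls with the check classes mentioned above; not the actual hubValidations source):

library(rlang)

# Both constructors build a condition object; the practical difference is the
# condition class, which controls how the result is labelled in the output
# and whether downstream checks keep running.
as_warning <- warning_cnd(class = "check_failure", message = "Column data types do not match hub schema.")
as_error   <- error_cnd(class = "check_error", message = "Column data types do not match hub schema.")

inherits(as_warning, "warning")
#> [1] TRUE
inherits(as_error, "error")
#> [1] TRUE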

@annakrystalli
Member

Thanks for the input both!

I feel more comfortable changing all check_failures to error rather than just the one. So I've gone ahead and implemented that in #114

I also feel that for users (admins and teams), the print method is the most important aspect of communicating the type and implications of different errors, so I've also modified the print method and added more info to the vignette.

One thing this does remove, though, is the warning-class object as a possible output of any check. While hubValidations checks don't currently make use of that kind of class, we or custom user functions might in the future.

@annakrystalli annakrystalli self-assigned this Sep 2, 2024
@annakrystalli annakrystalli moved this from Todo to Ready for Review in hubverse Development overview Sep 2, 2024
@github-project-automation github-project-automation bot moved this from Ready for Review to Done in hubverse Development overview Sep 5, 2024