
Issue an error (instead of a warning) if incoming column data types don't match the hub schema #111

Closed
bsweger opened this issue Aug 22, 2024 · 4 comments

bsweger commented Aug 22, 2024

When researching hubverse-org/hubverse-transform#14, I realized that parquet model-output files with a column data type that doesn't match the config-derived schema will result in a validation warning instead of an error (full convo is in the linked issue, see relevant snippet below).

Such a mismatch can result in downstream errors when working with the data (for both local/github and cloud-based hubs). Therefore, we should consider re-classifying the warning to an error.

How do we want/expect modelers to respond to validation warnings vs errors? For FIPS codes, when running the validation process against a parquet file that doesn't include all 50 states and contains no string values (like "US"), validation will fail because the 01-09 FIPS codes get their leading zeros dropped and don't validate against the list of valid locations in tasks.json.

But if you then remove all model-output entries for locations 01 - 09, the validation returns the "data types do not match hub schema" warning, but not an error (because two-digit FIPS codes do match the list of valid locations).


! 2024-05-12-parquet-test_data.parquet: Column data types do not match hub schema. `origin_date` should be "Date" not "character",
  `location` should be "character" not "integer", `output_type_id` should be "character" not "double"
✖ 2024-05-12-parquet-test_data.parquet: `tbl` contains invalid values/value combinations. Column `location` contains invalid values "9",
  "8", and "6".

@bsweger bsweger changed the title Issue and error (instead of a warning) if incoming column data types don't match the hub schema Issue an error (instead of a warning) if incoming column data types don't match the hub schema Aug 22, 2024
@annakrystalli
Member

While I'm not averse to this suggestion, I want to make sure any misconceptions are cleared up first.

As stated in the docs both errors and warnings are considered validation failures and will throw an error via check_for_errors(). This is consistent with R convention where the standard is for R CMD CHECK to fail if it returns any errors or warnings. So validation warnings cannot be optionally ignored and files with data types inconsistent with the hub schema would not pass validation. As an aside, there is also now a whole section on the importance of maintaining a stable hub schema.

The error vs warning distinction is admittedly used primarily for purposes internal to the validation process, i.e. an error usually causes an early return of the validation process because downstream tests cannot be performed until the failed check is addressed. In the case of deviation from the hub schema, because the rest of the validation checks are performed on all-character versions of the data, validation can continue despite data type differences, hence this failure is handled as a warning and the rest of validation continues. That doesn't mean the warning can be ignored. It will still trigger an error in check_for_errors().

So while I could change this to an error, I don't really want it to cause an early return when it's possible to run further checks, so changing it would go against the paradigm all the other checks follow. Having said that, I could make that clearer in the docs, i.e. that the warning vs error distinction has more to do with the mechanics of the validation process than with severity.
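
To make that paradigm concrete, here is a rough, purely illustrative sketch of the control flow (a toy stand-in for the real functions, not the actual hubValidations internals), using the check_failure / check_error classes discussed below:

# Illustrative only: "check_error" results stop the check sequence early,
# "check_failure" (warning-level) results let later checks run, but
# check_for_errors() fails the file on either kind.
run_checks <- function(checks) {
  results <- list()
  for (check in checks) {
    res <- check()
    results <- c(results, list(res))
    if (inherits(res, "check_error")) break # early return: later checks depend on this one
  }
  results
}

check_for_errors <- function(results) {
  failed <- vapply(results, inherits, logical(1), what = c("check_failure", "check_error"))
  if (any(failed)) stop("Some checks produced failures or errors.", call. = FALSE)
  invisible(results)
}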

Having said all that, while check_for_errors() does focus on warnings & errors, directing users to what needs fixing, when a hub_validations object is just printed, the ! symbol definitely does not draw the same attention as an ✖ and can be difficult to distinguish from the info (i) symbol. There is also no simple indication of whether the file is valid or not.

So I'm definitely up for thinking through how to make warnings more visible in general, e.g. here are some other symbol options in cli:

cli::cli_text("{cli::symbol$cross} error | {cli::symbol$circle_cross} warning | {cli::symbol$checkbox_on} warning")
#> ✖ error | ⓧ warning | ☒ warning

and perhaps also issuing an overall summary when printing a hub_validations object, e.g:

cli::cli_h2(cli::format_inline("{cli::symbol$tick} File valid"))
#> 
#> ── ✔ File valid ──
#> 
cli::cli_h2(cli::format_inline("{cli::symbol$cross} File invalid"))
#> 
#> ── ✖ File invalid ──
#> 

@bsweger
Author

bsweger commented Aug 28, 2024

@annakrystalli Thanks for the detailed explanation of what's happening under the hood!

Taking a step back from the underlying technical issues, I keep coming back to this statement:

This is consistent with R convention where the standard is for R CMD CHECK to fail if it returns any errors or warnings. So validation warnings cannot be optionally ignored and files with data types inconsistent with the hub schema would not pass validation.

I'm wary of applying R coding conventions as guidance for overall hub usability. Given that hub admins have the option to merge a model-output submission even if the overall validations fail, a reasonable user could assume that merging warnings won't break anything in the same way that merging an error would.

A model output file that doesn't conform to its hub's schema is an error that can impact downstream operations (especially for hubs that accept parquet submissions), so my .02 is that we should treat it as such.

All that said, given that this is an edge case with a low probability of occurring in general hub usage, addressing it might not be a high priority for us. [To reproduce the "invalid schema" warning with a hub that accepts FIPS codes for locations, you'd have to submit a model output file that contains only 2-digit FIPS codes and no character values like "US".]

@zkamvar
Member

zkamvar commented Aug 28, 2024

Opinion

Giving my opinion though it was not requested, when I run a program, the messages I expect are:

  • things I can ignore
  • things that I need to address

The blogdown package does a really good job at this: https://alison.netlify.app/ares-kind-tools/#120

R's warnings are a bit weird because they exist in a grey area between "things I need to address" and "things I can ignore". They always appear after your run is finished and are often hidden if you have more than 50.

Suggestion

I'm wary of applying R coding conventions as guidance for overall hub usability. Given that hub admins have the option to merge a model-output submission even if the overall validations fail, a reasonable user could assume that merging warnings won't break anything in the same way that merging an error would.

For what it's worth, I believe calling it warning/error in this case is mostly semantic. The output of check_for_errors() is an error if there are either check_failure or check_error elements.

I have a suggestion: since they both result in an error, this might be solved by converting the following to rlang::error_cnd:

res <- rlang::warning_cnd(

(I tested this and the only things that change are minor attributes of snapshot tests.)

This way, they are both presented as different types of errors with the difference being if the error prevented downstream checks from being run.
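
A minimal standalone sketch of what that swap amounts to (plain rlang calls with the check classes mentioned above; not the actual hubValidations source):

library(rlang)

# Both constructors build a condition object; the practical difference is the
# condition class, which controls how the result is labelled in the output
# and whether downstream checks keep running.
as_warning <- warning_cnd(class = "check_failure", message = "Column data types do not match hub schema.")
as_error   <- error_cnd(class = "check_error", message = "Column data types do not match hub schema.")

inherits(as_warning, "warning")
#> [1] TRUE
inherits(as_error, "error")
#> [1] TRUE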

@annakrystalli
Member

Thanks for the input both!

I feel more comfortable changing all check_failures to error rather than just the one. So I've gone ahead and implemented that in #114

I also feel that for users (admins and teams), the print method is the most important aspect of communicating the type and implications of different errors, so I've also modified the print method and added more info to the vignette.

One thing this does remove, though, is the warning-class object as a possible output of any check. While hubValidations checks don't currently make use of that kind of class, we or custom user functions might in the future.

@annakrystalli annakrystalli self-assigned this Sep 2, 2024
@annakrystalli annakrystalli moved this from Todo to Ready for Review in hubverse Development overview Sep 2, 2024
@github-project-automation github-project-automation bot moved this from Ready for Review to Done in hubverse Development overview Sep 5, 2024