Improve handling of numeric output type IDs. #60

Merged
20 commits merged from num-type-ids into main on Nov 27, 2023
Conversation

@annakrystalli (Member) commented on Nov 1, 2023

Resolves #54 & #58 (+ related discussion in #59)

Rationale behind fix:

I think I came up with a solution to the issues we were having with trailing zeros in tasks.json output type IDs, as well as the high-precision floating point issues identified by @elray1 in #58. When the output type ID column is character, the fix coerces any type IDs associated with a numeric output type to numeric and then back to character. This strips trailing zeros and high-precision digits (much like what happens in expand.grid), so everything matches nicely. It follows the rationale that trailing zeros are meaningless for numbers and that numbers are generally compared using tolerances rather than exact matches. The round trip to numeric produces a canonical representation that discards any high-precision digits, and the conversion back to character means values can then be compared for exact matches. I wouldn't call it particularly elegant, but it seems to be working.
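To make the round trip concrete, here is a minimal sketch of the idea (illustrative only; tbl and type_coerce are hypothetical stand-ins, not the PR's exact code):

# Round-trip sketch: numeric-typed output type IDs stored as character are
# normalised by coercing to numeric and back, dropping trailing zeros and
# excess floating point digits. `tbl` and `type_coerce` are stand-ins.
tbl <- data.frame(
  output_type = c("quantile", "quantile"),
  output_type_id = c("0.300", "0.1000000000000000055511"),
  stringsAsFactors = FALSE
)
type_coerce <- tbl$output_type == "quantile"
tbl$output_type_id[type_coerce] <- as.character(
  as.numeric(tbl$output_type_id[type_coerce])
)
tbl$output_type_id
#> [1] "0.3" "0.1"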

However, thought is required on any downstream implications. Currently, the 0.1000000000000000055511 value @elray1 introduced to demonstrate the behaviour would match a 0.1 quantile value and pass validation. We would need to check this doesn't affect ensembling downstream; a similar conversion might be required there too to ensure matching.

Also fixes the bug reported by the European flu-forecast-hub and improves the capturing of execution error messages.

Resolved review threads (outdated):
  • tests/testthat/testdata/hub-chr/hub-config/admin.json
  • tests/testthat/testdata/hub-it/hub-config/admin.json
  • tests/testthat/testdata/hub-num/hub-config/admin.json
Comment on lines +141 to +143
# Write the numeric round-trip representation back into the character
# output_type_id column for the valid entries
tbl$output_type_id[type_coerce][valid] <- as.character(
  num_output_type_id[valid]
)
@elray1 (Contributor) commented:
I think the use of as.character here would open up the problems I outlined in discussion #59. I added an example of a unit test that passes, but that I think should not pass, here. In that example, the output_type_id in the submission file is 0.09999999999999997779554. On my system, this is not equal to the floating point representation of 0.1, but as.character casts it to "0.1". Here's an illustration:

> a <- 0.0999999999999999777955
> b <- 0.1000000000000000055511
> c <- 0.1
> a == b
[1] FALSE
> a == c
[1] FALSE
> b == c
[1] TRUE
> as.character(a) == as.character(b)
[1] TRUE
> as.character(a) == as.character(c)
[1] TRUE
> as.character(b) == as.character(c)
[1] TRUE
> as.character(a)
[1] "0.1"

As a result of this, if we use values coerced via as.character to do validations, we may be generous in passing submissions that as.character rounds to the value specified in the tasks config file. This would then cause issues later on in places like ensembling code or filtering code, where grouping or filtering on the output_type_id may give unexpected results.

I am wondering if a way forward might be to do the data type coercion in a different order, or just deal with character representations from the start for purposes of validating output_type_ids and task id variables? Something like:

  1. For purposes of validation, read in all fields other than the value column as character strings. Additionally, read the contents of tasks.json as strings if that's possible (may not be?)?
  2. Check exact equality of values for output_type_ids and task id variables with the required/optional values specified in tasks.json, as strings. For example, if the tasks.json file says that a valid quantile level is 0.1, we expect to see the exact characters "0.1" in the submission file. This would change our current passing validation results for submissions with values like 0.1000000000000000055511 to failures.

Something like this would entirely sidestep the issues of what gets rounded to 0.1 or what the floating point representation of 0.1 is.
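As a rough sketch of what steps 1 and 2 might look like (assuming a csv submission read with readr; the file path and the set of valid levels below are made up for illustration):

library(readr)

# 1. Read every column except `value` as character so nothing is re-parsed
#    as numeric on the way in.
tbl <- read_csv(
  "model-output/team-model/2023-10-14-team-model.csv",  # hypothetical path
  col_types = cols(value = col_double(), .default = col_character())
)

# 2. Compare output_type_id values as exact strings against the config values.
valid_quantile_levels <- c("0.1", "0.25", "0.5", "0.75", "0.9")  # made up
setdiff(tbl$output_type_id, valid_quantile_levels)  # anything left here fails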

Another option might be to use arrow's cast function for type conversion instead of as.character, since it seems to be more careful in what it rounds. But my question about that is whether the outputs here would be stable across platforms. For example, maybe on one computer the floating point representation of 0.1 is 0.1000000000000000055511 and on another computer (e.g., one that's not using 64 bit representations of numbers) it's something slightly different. For that reason, it seems better to avoid relying on equality of floating point representations if possible.
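For reference, the arrow cast being referred to would look something like this (purely illustrative; whether the output is stable across platforms is exactly the open question):

library(arrow)

# Cast a double array to string with arrow's cast rather than base
# as.character(). How arrow formats these values, and whether that is
# platform-stable, is the question raised above.
a <- Array$create(c(0.1, 0.300, 0.1000000000000000055511))
a$cast(string())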

@annakrystalli (Member, Author) commented:

Thanks @elray1. Great points. I agree that, although some output type IDs are numeric, given we want to match output type ID values effectively as categories and use them for grouping (both in validation and downstream), it indeed feels more appropriate to work with the data as character for these purposes and match exact values, rather than checking for equality using numeric tolerances or rounding behaviours.

So I will look into the most efficient way of including all-character versions of the data in our model output validation checks.

I will merge your currently failing test into the test suite. I also feel the original test you included (of a 0.1000000000000000055511 value submitted instead of a 0.1 value) should now also fail.

Persistent issue with trailing zeros in tasks.json

The one remaining problem, however, that as.character() was solving for us is removing trailing zeros. When reading tasks.json into R from JSON and then passing values to expand.grid to create the grid of all possible valid values, any trailing zeros are removed from numeric output type IDs. Although I spent quite some time trying to figure out a way to read values from JSON into R as character (retaining any trailing zeros), I could not find one, so I don't think the suggestion:

Additionally, read the contents of tasks.json as strings if that's possible (may not be?)?
is currently possible.
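For example, any JSON parser turns numeric literals into doubles before R ever sees a string, so the zeros are gone at parse time (shown here with jsonlite; the field name is made up):

# Trailing zeros are lost as soon as the JSON numbers are parsed to doubles.
x <- jsonlite::fromJSON('{"output_type_id": [0.300, 0.350]}')$output_type_id
as.character(x)
#> [1] "0.3"  "0.35"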

Using as.character() on model output data as proposed in this PR was addressing this issue:

as.character(c(0.300, 0.350))
#> [1] "0.3"  "0.35"

so that including 0.300 as an output type ID in tasks.json would match a 0.300 value in the data (which previously failed because the value from config would be converted to 0.3 while the value in the model output data would be read in as "0.300"). Having said that, the approach in this PR is still a bit of a hack: to match values, trailing zeros are removed from both sides, which is not quite right. I think we will have to tackle this with strong documentation and perhaps some sort of validation check, although I suspect a validation check will be difficult for the same reasons. Detecting trailing zeros in numerics will likely be hampered by the very requirement to retain trailing zeros... our original problem!
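One possible (sketchy) route for such a check would be to scan the raw tasks.json text for numeric literals with trailing zeros before any parser collapses them. The regex below is an untested assumption, not part of this PR:

# Flag numeric literals with trailing zeros in the raw JSON text, before
# parsing. The regex is a rough assumption (e.g. exponent notation is not
# handled) and would need proper testing.
raw <- readLines("hub-config/tasks.json", warn = FALSE)
hits <- grep("[0-9]\\.[0-9]*0\\b", raw, value = TRUE)
if (length(hits) > 0) {
  warning(
    "tasks.json may contain numeric values with trailing zeros:\n",
    paste(trimws(hits), collapse = "\n")
  )
}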

@annakrystalli (Member, Author) commented:

OK, just to clarify what the new approach I'm putting together is doing:

  • The read_model_out_file fn now has an argument coerce_types with values hub, chr and none, instead of the logical use_hub_schema (see the sketch after this list).
    • hub: reads in (csv) or coerces (parquet & arrow, which have a schema defined in the file, using arrow cast) to the hub schema.
    • chr: reads in (csv) or coerces (parquet & arrow) all columns as character.
    • none: no coercion. Reads using default arrow::read_* fun settings.
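Very roughly, the dispatch looks something like the sketch below. This is a hypothetical simplification, not the actual function body; schema_as_character is a made-up helper:

# Hypothetical simplification of the coerce_types dispatch; the real
# read_model_out_file differs, this just makes the three modes concrete.
read_model_out_file_sketch <- function(path, hub_schema,
                                       coerce_types = c("hub", "chr", "none")) {
  coerce_types <- match.arg(coerce_types)

  # Made-up helper: a schema with the same field names but all utf8 types.
  schema_as_character <- function(sch) {
    nm <- names(sch)
    do.call(arrow::schema, setNames(rep(list(arrow::utf8()), length(nm)), nm))
  }

  if (tools::file_ext(path) == "csv") {
    # csvs carry no embedded types, so column types are set at read time.
    switch(coerce_types,
      hub  = arrow::read_csv_arrow(path, col_types = hub_schema),
      chr  = arrow::read_csv_arrow(path,
                                   col_types = schema_as_character(hub_schema)),
      none = arrow::read_csv_arrow(path)
    )
  } else {
    # parquet (arrow files analogous) carry their own schema; read then cast.
    tbl <- arrow::read_parquet(path, as_data_frame = FALSE)
    out <- switch(coerce_types,
      hub  = tbl$cast(hub_schema),
      chr  = tbl$cast(schema_as_character(hub_schema)),
      none = tbl
    )
    as.data.frame(out)
  }
}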

This now causes both 0.1000000000000000055511 and 0.09999999999999997779554 to fail validation. It also means that values with trailing zeros (e.g. 0.300) will always fail, regardless of whether 0.300 is specified in tasks.json (leading to the issue discussed in #54). While I do feel this is actually correct behaviour when validating model output data, it still feels unsatisfactory from the perspective that values with trailing zeros can currently be specified in config files, yet they will not be respected when read in and will cause validation errors. I think this will remain problematic until we find a way to detect and warn about trailing zeros during config file validation.

@annakrystalli (Member, Author) commented on Nov 10, 2023

Confirm read and column type checking behaviour for different file types

Also, can we just confirm that you are happy with the following column type checks? (They are the same as before, after addressing the issues discussed in #54, but needed some recoding due to the changed default behaviour in read_model_out_file.)

  • When reading in csv files (which do not have column type information embedded), whether a file can be parsed using the hub schema on read is checked via the check_file_read fn. If that check passes, the file is then read in using the hub schema. A csv that is read in will therefore always pass the column type check.
  • For arrow and parquet files, files are read in according to the file schema (not coerced to the hub schema). As such, the column type check verifies that the file schema matches the hub schema and may therefore fail (see the sketch below).
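A minimal illustration of the parquet-side check (hub_schema and the file path are hypothetical stand-ins):

library(arrow)

# Read without coercion so the file's own schema is preserved, then compare
# it to the hub schema. `hub_schema` and the path are made up.
tbl <- read_parquet("model-output.parquet", as_data_frame = FALSE)
tbl$schema == hub_schema  # FALSE means the column type check fails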

Could you just give me a thumbs up if you agree?

@annakrystalli merged commit 7138feb into main on Nov 27, 2023
8 checks passed
@annakrystalli deleted the num-type-ids branch on Nov 27, 2023 at 12:03
Successfully merging this pull request may close these issues.

Unexpected error about output_type_id data type with FluSight submission