Improve handling of numeric output type IDs. #60

Merged
20 commits merged from num-type-ids into main on Nov 27, 2023
Conversation

@annakrystalli (Member) commented on Nov 1, 2023

Resolves #54 & #58 (+ related discussion in #59)

Rationale behind fix:

I think I came up with a solution to the issues we were having with trailing zeros in tasks.json output type IDs, as well as the high-precision floating point issues identified by @elray1 in #58. When the output type ID column is character, the fix coerces any type IDs associated with a numeric output type to numeric and then back to character. This strips trailing zeros and high-precision digits (much like what happens in expand.grid), so everything matches nicely. It follows the rationale that trailing zeros are meaningless for numbers and that numbers are generally compared using tolerances rather than exact matches. The round trip to numeric produces a canonical representation that discards any high-precision digits, and the conversion back to character means values can then be compared for exact matches. I wouldn't call it particularly elegant, but it seems to be working.
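To make the round trip concrete, here is a minimal sketch of the idea (illustrative only; tbl and type_coerce are hypothetical stand-ins, not the PR's exact code):

# Round-trip sketch: numeric-typed output type IDs stored as character are
# normalised by coercing to numeric and back, dropping trailing zeros and
# excess floating point digits. `tbl` and `type_coerce` are stand-ins.
tbl <- data.frame(
  output_type = c("quantile", "quantile"),
  output_type_id = c("0.300", "0.1000000000000000055511"),
  stringsAsFactors = FALSE
)
type_coerce <- tbl$output_type == "quantile"
tbl$output_type_id[type_coerce] <- as.character(
  as.numeric(tbl$output_type_id[type_coerce])
)
tbl$output_type_id
#> [1] "0.3" "0.1"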

However, thought is required on any downstream implications. Currently, the 0.1000000000000000055511 value @elray1 introduced to demonstrate the behaviour would match a 0.1 quantile value and pass validation. We would need to check this doesn't affect ensembling downstream; a similar conversion might be required there too to ensure matching.

Also fixes the bug reported by the European flu-forecast-hub and improves the capturing of execution error messages.

Resolved review threads (outdated):
  • tests/testthat/testdata/hub-chr/hub-config/admin.json
  • tests/testthat/testdata/hub-it/hub-config/admin.json
  • tests/testthat/testdata/hub-num/hub-config/admin.json
Comment on lines +141 to +143
# Write the numeric round-trip representation back into the character
# output_type_id column for the valid entries
tbl$output_type_id[type_coerce][valid] <- as.character(
  num_output_type_id[valid]
)
@elray1 (Contributor) commented:
I think the use of as.character here would open up the problems I outlined in discussion #59. I added an example of a unit test that passes, but that I think should not pass, here. In that example, the output_type_id in the submission file is 0.09999999999999997779554. On my system, this is not equal to the floating point representation of 0.1, but as.character casts it to "0.1". Here's an illustration:

> a <- 0.0999999999999999777955
> b <- 0.1000000000000000055511
> c <- 0.1
> a == b
[1] FALSE
> a == c
[1] FALSE
> b == c
[1] TRUE
> as.character(a) == as.character(b)
[1] TRUE
> as.character(a) == as.character(c)
[1] TRUE
> as.character(b) == as.character(c)
[1] TRUE
> as.character(a)
[1] "0.1"

As a result of this, if we use values coerced via as.character to do validations, we may be generous in passing submissions that as.character rounds to the value specified in the tasks config file. This would then cause issues later on in places like ensembling code or filtering code, where grouping or filtering on the output_type_id may give unexpected results.

I am wondering if a way forward might be to do the data type coercion in a different order, or just deal with character representations from the start for purposes of validating output_type_ids and task id variables? Something like:

  1. For purposes of validation, read in all fields other than the value column as character strings. Additionally, read the contents of tasks.json as strings if that's possible (may not be?)?
  2. Check exact equality of values for output_type_ids and task id variables with the required/optional values specified in tasks.json, as strings. For example, if the tasks.json file says that a valid quantile level is 0.1, we expect to see the exact characters "0.1" in the submission file. This would change our current passing validation results for submissions with values like 0.1000000000000000055511 to failures.

Something like this would entirely sidestep the issues of what gets rounded to 0.1 or what the floating point representation of 0.1 is.
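As a rough sketch of what steps 1 and 2 might look like (assuming a csv submission read with readr; the file path and the set of valid levels below are made up for illustration):

library(readr)

# 1. Read every column except `value` as character so nothing is re-parsed
#    as numeric on the way in.
tbl <- read_csv(
  "model-output/team-model/2023-10-14-team-model.csv",  # hypothetical path
  col_types = cols(value = col_double(), .default = col_character())
)

# 2. Compare output_type_id values as exact strings against the config values.
valid_quantile_levels <- c("0.1", "0.25", "0.5", "0.75", "0.9")  # made up
setdiff(tbl$output_type_id, valid_quantile_levels)  # anything left here fails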

Another option might be to use arrow's cast function for type conversion instead of as.character, since it seems to be more careful in what it rounds. But my question about that is whether the outputs here would be stable across platforms. For example, maybe on one computer the floating point representation of 0.1 is 0.1000000000000000055511 and on another computer (e.g., one that's not using 64 bit representations of numbers) it's something slightly different. For that reason, it seems better to avoid relying on equality of floating point representations if possible.
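For reference, the arrow cast being referred to would look something like this (purely illustrative; whether the output is stable across platforms is exactly the open question):

library(arrow)

# Cast a double array to string with arrow's cast rather than base
# as.character(). How arrow formats these values, and whether that is
# platform-stable, is the question raised above.
a <- Array$create(c(0.1, 0.300, 0.1000000000000000055511))
a$cast(string())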

@annakrystalli (Member, Author) commented:

Thanks @elray1. Great points. I agree that, although some output type IDs are numeric, given we want to match output type ID values effectively as categories and use them for grouping (both in validation and downstream), it indeed feels more appropriate to work with the data as character for these purposes and match exact values, rather than checking for equality using numeric tolerances or rounding behaviours.

So I will look into the most efficient way of including all-character versions of the data in our model output validation checks.

I will merge your currently failing test into the test suite. I also feel the original test you included (of a 0.1000000000000000055511 value submitted instead of a 0.1 value) should now also fail.

Persistent issue with trailing zeros in tasks.json

The one remaining problem, however, that as.character() was solving for us is removing trailing zeros. When reading tasks.json into R from JSON and then passing values to expand.grid to create the grid of all possible valid values, any trailing zeros are removed from numeric output type IDs. Although I spent quite some time trying to figure out a way to read values from JSON into R as character (retaining any trailing zeros), I could not find one, so I don't think the suggestion:

Additionally, read the contents of tasks.json as strings if that's possible (may not be?)?
is currently possible.
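For example, any JSON parser turns numeric literals into doubles before R ever sees a string, so the zeros are gone at parse time (shown here with jsonlite; the field name is made up):

# Trailing zeros are lost as soon as the JSON numbers are parsed to doubles.
x <- jsonlite::fromJSON('{"output_type_id": [0.300, 0.350]}')$output_type_id
as.character(x)
#> [1] "0.3"  "0.35"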

Using as.character() on model output data as proposed in this PR was addressing this issue:

as.character(c(0.300, 0.350))
#> [1] "0.3"  "0.35"

so that including 0.300 as an output type ID in tasks.json would match a 0.300 value in the data (which previously failed because the value from config would be converted to 0.3 while the value in the model output data would be read in as "0.300"). Having said that, the approach in this PR is still a bit of a hack: to match values, trailing zeros are removed from both sides, which is not quite right. I think we will have to tackle this with strong documentation and perhaps some sort of validation check, although I suspect a validation check will be difficult for the same reasons. Detecting trailing zeros in numerics will likely be hampered by the very requirement to retain trailing zeros... our original problem!
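One possible (sketchy) route for such a check would be to scan the raw tasks.json text for numeric literals with trailing zeros before any parser collapses them. The regex below is an untested assumption, not part of this PR:

# Flag numeric literals with trailing zeros in the raw JSON text, before
# parsing. The regex is a rough assumption (e.g. exponent notation is not
# handled) and would need proper testing.
raw <- readLines("hub-config/tasks.json", warn = FALSE)
hits <- grep("[0-9]\\.[0-9]*0\\b", raw, value = TRUE)
if (length(hits) > 0) {
  warning(
    "tasks.json may contain numeric values with trailing zeros:\n",
    paste(trimws(hits), collapse = "\n")
  )
}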

@annakrystalli (Member, Author) commented:

OK, just to clarify what the new approach I'm putting together is doing:

  • The read_model_out_file fn now has an argument coerce_types with values hub, chr and none, instead of the logical use_hub_schema (see the sketch after this list).
    • hub: reads in (csv) or coerces (parquet & arrow, which have a schema defined in the file, using arrow cast) to the hub schema.
    • chr: reads in (csv) or coerces (parquet & arrow) all columns as character.
    • none: no coercion. Reads using default arrow::read_* fun settings.
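Very roughly, the dispatch looks something like the sketch below. This is a hypothetical simplification, not the actual function body; schema_as_character is a made-up helper:

# Hypothetical simplification of the coerce_types dispatch; the real
# read_model_out_file differs, this just makes the three modes concrete.
read_model_out_file_sketch <- function(path, hub_schema,
                                       coerce_types = c("hub", "chr", "none")) {
  coerce_types <- match.arg(coerce_types)

  # Made-up helper: a schema with the same field names but all utf8 types.
  schema_as_character <- function(sch) {
    nm <- names(sch)
    do.call(arrow::schema, setNames(rep(list(arrow::utf8()), length(nm)), nm))
  }

  if (tools::file_ext(path) == "csv") {
    # csvs carry no embedded types, so column types are set at read time.
    switch(coerce_types,
      hub  = arrow::read_csv_arrow(path, col_types = hub_schema),
      chr  = arrow::read_csv_arrow(path,
                                   col_types = schema_as_character(hub_schema)),
      none = arrow::read_csv_arrow(path)
    )
  } else {
    # parquet (arrow files analogous) carry their own schema; read then cast.
    tbl <- arrow::read_parquet(path, as_data_frame = FALSE)
    out <- switch(coerce_types,
      hub  = tbl$cast(hub_schema),
      chr  = tbl$cast(schema_as_character(hub_schema)),
      none = tbl
    )
    as.data.frame(out)
  }
}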

This now causes both 0.1000000000000000055511 and 0.09999999999999997779554 to fail validation. It also means that values with trailing zeros (e.g. 0.300) will always fail, regardless of whether 0.300 is specified in tasks.json (leading to the issue discussed in #54). While I do feel this is actually correct behaviour when validating model output data, it still feels unsatisfactory from the perspective that values with trailing zeros can currently be specified in config files, yet they will not be respected when read in and will cause validation errors. I think this will remain problematic until we find a way to detect and warn about trailing zeros during config file validation.

@annakrystalli (Member, Author) commented on Nov 10, 2023

Confirm read and column type checking behaviour for different file types

Also, can we just confirm that you are happy with the following column type checks? (They are the same as before, after addressing the issues discussed in #54, but needed some recoding due to the changed default behaviour in read_model_out_file.)

  • When reading in csv files (which do not have column type information embedded), whether a file can be parsed using the hub schema on read is checked via the check_file_read fn. If that check passes, the file is then read in using the hub schema. A csv that is read in will therefore always pass the column type check.
  • For arrow and parquet files, files are read in according to the file schema (not coerced to the hub schema). As such, the column type check verifies that the file schema matches the hub schema and may therefore fail (see the sketch below).
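A minimal illustration of the parquet-side check (hub_schema and the file path are hypothetical stand-ins):

library(arrow)

# Read without coercion so the file's own schema is preserved, then compare
# it to the hub schema. `hub_schema` and the path are made up.
tbl <- read_parquet("model-output.parquet", as_data_frame = FALSE)
tbl$schema == hub_schema  # FALSE means the column type check fails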

Could you just give me a thumbs up if you agree?

@annakrystalli merged commit 7138feb into main on Nov 27, 2023
8 checks passed
@annakrystalli deleted the num-type-ids branch on Nov 27, 2023 at 12:03
Successfully merging this pull request may close these issues.

Unexpected error about output_type_id data type with FluSight submission