Improve handling of numeric output type IDs. #60
Conversation
tbl$output_type_id[type_coerce][valid] <- as.character(
  num_output_type_id[valid]
)
I think the use of as.character here would open up the problems I outlined in discussion #59. I added an example of a unit test that passes, but that I think should not pass, here. In that example, the output_type_id in the submission file is 0.09999999999999997779554. On my system, this is not equal to the floating point representation of 0.1, but as.character casts it to "0.1". Here's an illustration:
> a <- 0.0999999999999999777955
> b <- 0.1000000000000000055511
> c <- 0.1
> a == b
[1] FALSE
> a == c
[1] FALSE
> b == c
[1] TRUE
> as.character(a) == as.character(b)
[1] TRUE
> as.character(a) == as.character(c)
[1] TRUE
> as.character(b) == as.character(c)
[1] TRUE
> as.character(a)
[1] "0.1"
As a result of this, if we use values coerced via as.character to do validations, we may be generous in passing submissions that as.character rounds to the value specified in the tasks config file. This would then cause issues later on in places like ensembling code or filtering code, where grouping or filtering on the output_type_id may give unexpected results.
I am wondering if a way forward might be to do the data type coercion in a different order, or just deal with character representations from the start for purposes of validating output_type_ids and task id variables? Something like:
- For purposes of validation, read in all fields other than the value column as character strings. Additionally, read the contents of tasks.json as strings if that's possible (may not be?).
- Check exact equality of values for output_type_ids and task id variables with the required/optional values specified in tasks.json, as strings. For example, if the tasks.json file says that a valid quantile level is 0.1, we expect to see the exact characters "0.1" in the submission file. This would change our current passing validation results for submissions with values like 0.1000000000000000055511 to failures.
Something like this would entirely sidestep the issues of what gets rounded to 0.1 or what the floating point representation of 0.1 is.
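A minimal sketch of that exact-character comparison (written in Python for illustration; the set of valid levels and the function name are hypothetical, not part of any hub package):

```python
# Hypothetical valid quantile levels, taken verbatim from a tasks.json config.
valid_quantile_levels = {"0.1", "0.25", "0.5", "0.75", "0.9"}

def output_type_id_is_valid(raw_field: str) -> bool:
    # Compare the raw characters from the submission file against the
    # characters in the config -- no numeric parsing, so no rounding.
    return raw_field in valid_quantile_levels

print(output_type_id_is_valid("0.1"))                       # True
print(output_type_id_is_valid("0.1000000000000000055511"))  # False
```

Because no value is ever parsed as a number, the question of which decimal strings round to the same double never arises.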
Another option might be to use arrow's cast function for type conversion instead of as.character, since it seems to be more careful in what it rounds. But my question about that is whether the outputs here would be stable across platforms. For example, maybe on one computer the floating point representation of 0.1 is 0.1000000000000000055511 and on another computer (e.g., one that's not using 64 bit representations of numbers) it's something slightly different. For that reason, it seems better to avoid relying on equality of floating point representations if possible.
Thanks @elray1. Great points, and I agree: although some output IDs are numeric, given that we want to match output ID values effectively as categories and use them for grouping (both in validation and downstream), it feels more appropriate to work with the data as character for this purpose and to match exact values, rather than checking equality using numeric tolerances or rounding behaviours.
So I will look into the most efficient way of including all character versions of data in our model output validation checks.
I will merge your currently failing test to include in the test suite. I also feel the original test you included (of a 0.1000000000000000055511 value included instead of a 0.1 value) should now also fail.
Persistent issue with trailing zeros in tasks.json
The one remaining problem, however, that as.character() was solving for us is removing trailing zeros. When reading tasks.json into R from JSON and then passing values to expand.grid to create the grid of all possible valid values, any trailing zeros are removed from numeric output type IDs. Although I spent quite some time trying to figure out a way to read values from JSON into R as character (retaining any trailing zeros), I could not find one, so I don't think the suggestion "Additionally, read the contents of tasks.json as strings if that's possible (may not be?)" is currently possible.
Using as.character() on model output data as proposed in this PR was addressing this issue:
as.character(c(0.300, 0.350))
#> [1] "0.3" "0.35"
so that including 0.300
as an output type ID in a tasks.json
would match a 0.300
value in the data (which previously failed because the value from config would be converted to 0.3
while the value in model output data would be read in as 0.300
. Having said that, the approach in this PR is still a bit of a hack, as, to match values, trailing zeros are removed from both which is not quite right. I think we will have to tackle this with strong documentation and perhaps some sort of validation check (although I think a validation check will be difficult for the same reasons. Detecting trailing zeros in numerics will likely be hampered by the requirement to retain trailing zeros...our original problem!
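The root cause is language-independent: trailing zeros are lost the moment text is parsed as a number, so no later numeric-to-string conversion can recover them. A Python illustration of the same loss:

```python
# Once parsed, "0.300" and "0.3" are the identical floating point value,
# so the trailing zeros are unrecoverable from the number alone.
print(float("0.300") == float("0.3"))  # True
print(str(float("0.300")))             # 0.3

# The only place the distinction survives is the raw text itself,
# which is why exact string comparison treats them as different.
print("0.300" == "0.3")                # False
```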
OK, just to clarify what the new approach I'm putting together is doing:
- The read_model_out_file fn now has argument coerce_types with values hub, chr and none instead of logical use_hub_schema.
  - hub reads in (csvs) or coerces (parquet & arrow, which have a schema defined in the file, using arrow cast) to the hub schema.
  - chr reads in (csvs) or coerces (parquet & arrow) all columns as character.
  - none: no coercion; read using default arrow::read_* fun settings.
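As a rough sketch of the three modes (in Python, with a plain dict standing in for an arrow table; none of these names reflect the actual hub package implementation):

```python
# Hypothetical sketch of the coerce_types dispatch described above.
def coerce_table(table, coerce_types="hub", hub_schema=None):
    if coerce_types == "hub":
        # cast each column using the type from the hub schema
        return {col: [hub_schema[col](v) for v in vals]
                for col, vals in table.items()}
    if coerce_types == "chr":
        # cast every column to character, for exact string matching
        return {col: [str(v) for v in vals] for col, vals in table.items()}
    return table  # "none": leave columns exactly as read

raw = {"output_type_id": ["0.1", "0.300"], "value": ["5", "7"]}
schema = {"output_type_id": str, "value": int}
print(coerce_table(raw, "chr"))                   # all columns stay character
print(coerce_table(raw, "hub", schema)["value"])  # [5, 7]
```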
This now causes both 0.1000000000000000055511 and 0.09999999999999997779554 to fail validation. It also means that values with trailing zeros (e.g. 0.300) will always fail, regardless of whether 0.300 is specified in tasks.json (leading to the issue discussed in #54). While I do feel this is actually correct behaviour when validating model output data, it still feels unsatisfactory to me from the standpoint that values with trailing zeros can currently be specified in config files but will not be respected when reading in and will cause validation errors. I think this will remain problematic until we find a way to detect and throw a warning about trailing zeros during config file validation.
Confirm read and column type checking behaviour for different file types
Also, can we just confirm that you are happy with the following column type checks (the same as before, after addressing the issues discussed in #54, but needing some recoding due to the changed default behaviour in read_model_out_file)?
- When reading in csv files (which do not have column type information embedded), whether a file can be parsed using the hub schema on read is checked via the check_file_read fn. If that check passes, the file is then read in using the hub schema. If a csv is read in, therefore, it will always pass the column type check.
- For arrow and parquet files, files are read in according to the file schema (not coerced to the hub schema). As such, the check on column types verifies that the file schema matches the hub schema (and may therefore fail).
Could just give me a thumbs up if you agree?
Co-authored-by: Evan Ray <[email protected]>
…ck_tbl_values_errors_float_approx_1 add test cases (currently failing) that validations reject floating point values that are approximately 1
Resolves #54 & #58 (+ related discussion in #59)
Rationale behind fix:
Also fixes the bug reported by European flu-forecast-hub and improves execution error message capturing.