Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add check_tbl_derived_task_id_vals() function #176

Merged
merged 4 commits into from
Dec 12, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions NAMESPACE
Original file line number Diff line number Diff line change
Expand Up @@ -27,6 +27,7 @@ export(check_submission_metadata_file_exists)
export(check_submission_time)
export(check_tbl_col_types)
export(check_tbl_colnames)
export(check_tbl_derived_task_id_vals)
export(check_tbl_match_round_id)
export(check_tbl_rows_unique)
export(check_tbl_spl_compound_taskid_set)
Expand Down
1 change: 1 addition & 0 deletions NEWS.md
Original file line number Diff line number Diff line change
@@ -1,5 +1,6 @@
# hubValidations (development version)

* Added `check_tbl_derived_task_id_vals()` check to `validate_model_data()` that ensures that values in derived task ID columns match expected values for the corresponding derived task IDs in the round as defined in `tasks.json` config (#110). Given the dependence of derived task IDs on the values of their source task ID values, the check ignores the combinations of derived task ID values with those of other task IDs and focuses only on identifying values that do not match corresponding accepted values.
* `submission_tmpl()` gains the `force_output_types` allowing users to force optional output types to be included in a submission template when `required_vals_only = TRUE`. In conjunction with the use of the `output_types` argument, this allows users to create submission templates which include optional output types they plan to submit.
* `check_tbl_values_required()` no longer reports false positives for v4 hubs, which fixes the bug reported in #177. Evaluation of whether all combinations of required values have been submitted through `check_tbl_values_required()` is now chunked by output type for v4 config and above. This reduces memory pressure and should speed up required value validation in hubs with complex task.json files.

Expand Down
86 changes: 86 additions & 0 deletions R/check_tbl_derived_task_id_vals.R
Original file line number Diff line number Diff line change
@@ -0,0 +1,86 @@
#' Check derived task ID columns contain valid values
#'
#' This check is used to validate that values in any derived task ID columns
#' matches accepted values for each derived task ID in the config.
#' Given the dependence of derived task IDs on the values of other values,
#' it ignores the combinations of derived task ID values with those of other task IDs
#' and focuses only on identifying values that do not match the accepted values.
#' @param tbl a tibble/data.frame of the contents of the file being validated. Column types must **all be character**.
#' @inherit check_tbl_colnames params
#' @inheritParams expand_model_out_grid
#' @return
#' Depending on whether validation has succeeded, one of:
#' - `<message/check_success>` condition class object.
#' - `<error/check_failure>` condition class object.
#'
#' If no `derived_task_ids` are specified, the check is skipped and a
#' `<message/check_info>` condition class object is retuned.
#'
#' Returned object also inherits from subclass `<hub_check>`.
#' @export
check_tbl_derived_task_id_vals <- function(
tbl, round_id, file_path, hub_path,
derived_task_ids = get_hub_derived_task_ids(
hub_path, round_id
)) {
if (is.null(derived_task_ids)) {
return(
capture_check_info(
file_path = file_path,
msg = "No derived task IDs to check. Skipping derived task ID value check."
)
)
}
config_tasks <- read_config(hub_path, "tasks")

derived_task_id_vals <- get_round_config_values(
config_tasks = config_tasks,
round_id = round_id,
derived_task_ids = NULL
)[derived_task_ids]

setdiff_vals <- purrr::map2(
tbl[derived_task_ids], derived_task_id_vals, setdiff
)

invalid_derived_task_ids <- lengths(setdiff_vals) > 0L

check <- !any(invalid_derived_task_ids)

if (check) {
details <- NULL
errors <- NULL
} else {
invalid_vals <- setdiff_vals[invalid_derived_task_ids]
invalid_vals_msg <- purrr::map_chr(
seq_along(invalid_vals),
\(.x) {
paste0(
"{.arg {names(invalid_vals)[", .x,
"]}}: {.val {invalid_vals[[", .x, "]]}}"
)
}
)
details <- c(
invalid_vals_msg,
"see {.code errors} for more details."
) |>
paste(collapse = ", ") |>
cli::format_inline()
errors <- invalid_vals
}

capture_check_cnd(
check = check,
file_path = file_path,
msg_subject = "{.var tbl}",
msg_attribute = "",
msg_verbs = c(
"contains valid derived task ID values.",
"contains invalid derived task ID values."
),
errors = errors,
error = FALSE,
details = details
)
}
10 changes: 10 additions & 0 deletions R/validate_model_data.R
Original file line number Diff line number Diff line change
Expand Up @@ -170,6 +170,16 @@ validate_model_data <- function(hub_path, file_path, round_id_col = NULL,
return(checks)
}

checks$derived_task_id_vals <- try_check(
check_tbl_derived_task_id_vals(
tbl_chr,
round_id = round_id,
file_path = file_path,
hub_path = hub_path,
derived_task_ids = derived_task_ids
), file_path
)

checks$rows_unique <- try_check(
check_tbl_rows_unique(
tbl_chr,
Expand Down
7 changes: 4 additions & 3 deletions inst/check_table.csv
Original file line number Diff line number Diff line change
Expand Up @@ -14,9 +14,10 @@ unique_round_id,Round ID column contains a single unique round ID. Skipped if `r
match_round_id,Round ID from file contents matches round ID from file name. Skipped if `round_id_from_var` is FALSE in config.,TRUE,check_error,validate_model_data,check_tbl_match_round_id,,FALSE
colnames,File column names match expected column names for round (i.e. task ID names + hub standard column names),TRUE,check_error,validate_model_data,check_tbl_colnames,,FALSE
col_types,File column types match expected column types from config. Mainly applicable to parquet & arrow files.,FALSE,check_failure,validate_model_data,check_tbl_col_types,,FALSE
valid_vals,Columns (excluding `value` column) contain valid combinations of task ID / output type / output type ID values,TRUE,check_error,validate_model_data,check_tbl_values,error_tbl: table of invalid task ID/output type/output type ID value combinations,FALSE
rows_unique,Columns (excluding `value` column) contain unique combinations of task ID / output type / output type ID values,FALSE,check_failure,validate_model_data,check_tbl_rows_unique,,FALSE
req_vals,Columns (excluding `value` column) contain all required combinations of task ID / output type / output type ID values,FALSE,check_failure,validate_model_data,check_tbl_values_required,missing_df: table of missing task ID/output type/output type ID value combinations,FALSE
valid_vals,Columns (excluding the `value` and any derived task ID columns) contain valid combinations of task ID / output type / output type ID values,TRUE,check_error,validate_model_data,check_tbl_values,error_tbl: table of invalid task ID/output type/output type ID value combinations,FALSE
derived_task_id_vals,Derived task ID columns contain valid values.,FALSE,check_failure,validate_model_data,check_tbl_derived_task_id_vals,errors: named list of derived task ID values. Each element contains the invalid values for each derived task ID that failed the check.,FALSE
rows_unique,Columns (excluding the `value` and any derived task ID columns) contain unique combinations of task ID / output type / output type ID values,FALSE,check_failure,validate_model_data,check_tbl_rows_unique,,FALSE
req_vals,Columns (excluding the `value` and any derived task ID columns) contain all required combinations of task ID / output type / output type ID values,FALSE,check_failure,validate_model_data,check_tbl_values_required,missing_df: table of missing task ID/output type/output type ID value combinations,FALSE
value_col_valid,Values in `value` column are coercible to data type configured for each output type,FALSE,check_failure,validate_model_data,check_tbl_value_col,,FALSE
value_col_non_desc,Values in `value` column are non-decreasing as output_type_ids increase for all unique task ID /output type value combinations. Applies to `quantile` or `cdf` output types only,FALSE,check_failure,validate_model_data,check_tbl_value_col_ascending,error_tbl: table of rows affected,FALSE
value_col_sum1,Values in the `value` column of `pmf` output type data for each unique task ID combination sum to 1.,FALSE,check_failure,validate_model_data,check_tbl_value_col_sum1,error_tbl: table of rows affected,FALSE
Expand Down
56 changes: 56 additions & 0 deletions man/check_tbl_derived_task_id_vals.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

13 changes: 10 additions & 3 deletions man/validate_model_data.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

13 changes: 10 additions & 3 deletions man/validate_pr.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

13 changes: 10 additions & 3 deletions man/validate_submission.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

28 changes: 28 additions & 0 deletions tests/testthat/_snaps/check_tbl_derived_task_id_vals.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,28 @@
# check_tbl_derived_task_ids_vals works

Code
check_tbl_derived_task_id_vals(tbl, round_id, file_path, hub_path)
Output
<message/check_success>
Message:
`tbl` contains valid derived task ID values.

---

Code
check_tbl_derived_task_id_vals(tbl, round_id, file_path, hub_path,
derived_task_ids = NULL)
Output
<message/check_info>
Message:
No derived task IDs to check. Skipping derived task ID value check.

---

Code
check_tbl_derived_task_id_vals(tbl, round_id, file_path, hub_path)
Output
<error/check_failure>
Error:
! `tbl` contains invalid derived task ID values. `target_date`: "random_val", see `errors` for more details.

Loading
Loading