93/PR 3/3 - Propagate performance optimisation throughout checks & validation functions #109

annakrystalli · 2024-08-14T15:51:40Z

This PR completes the first stage of optimising validation performance described in #93

It builds on the functionality introduced to:

Create output type ID specific subsets of expanded valid value grids (98/ Subset expanded grid of valid values by output type #107)
Ignore derived task IDs in expanded valid value grids (93/add derived_task_id argument #108)

and propagates these to relevant check_*() and validate_*() functions to enable output type validation batching as well as the creation of more performant expanded grids during validation.

This has led to significant performance improvements, especially when dealing with complex configs with derived task IDs (see below).

I've also added a section to the documentation to help admins configure their validation workflows to make use of the derived_task_ids feature, as at the moment this can only be done manually (but see hubverse-org/schemas#96 for future plans).
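To illustrate the idea of output type batching mentioned above, here is a minimal base R sketch. It is a toy only: `expand_grid_for()` and `check_batch()` are hypothetical helpers invented for this example and are not the package's internal functions. The point is that each batch of rows is compared against a grid built for its own output type rather than against one grid covering every output type.

```r
# Toy illustration only: none of these objects are hubValidations internals.
# A small "submission" containing two output types.
tbl <- data.frame(
  location = c("AT", "BE", "AT", "XX"),
  output_type = c("quantile", "quantile", "sample", "sample"),
  output_type_id = c("0.5", "0.5", "s1", "s1"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: build the grid of valid values for ONE output type.
expand_grid_for <- function(output_type) {
  ids <- switch(output_type,
    quantile = c("0.25", "0.5", "0.75"),
    sample = c("s1", "s2")
  )
  expand.grid(
    location = c("AT", "BE"),
    output_type = output_type,
    output_type_id = ids,
    stringsAsFactors = FALSE
  )
}

# Hypothetical helper: flag which rows of a single-output-type batch
# appear in the grid for that output type.
check_batch <- function(batch) {
  grid <- expand_grid_for(batch$output_type[1])
  key <- function(df) {
    paste(df$location, df$output_type, df$output_type_id, sep = "|")
  }
  key(batch) %in% key(grid)
}

# Split the submission by output type and check each batch separately.
batches <- split(tbl, tbl$output_type)
valid_by_type <- lapply(batches, check_batch)
```

In this toy, only the sample row with location "XX" is flagged as invalid, and neither batch ever needs a grid containing the other output type's IDs.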

Benchmarks

So far I have tested with a test model output file and a new Respicast config file supplied by @M-7th, which includes a derived target_end_date task ID, and the results with these fixes are extremely promising!

expression	optimised	min	median	mem_alloc
check_tbl_required_values	FALSE	9.62m	9.62m	563.6GB
check_tbl_required_values	TRUE	12.15s	12.15s	12.9GB
check_tbl_spl_compound_taskid_set	FALSE	24.09s	24.09s	14.6GB
check_tbl_spl_compound_taskid_set	TRUE	1.03s	1.03s	637.5MB
check_tbl_spl_compound_tid	FALSE	40.66s	40.66s	23.9GB
check_tbl_spl_compound_tid	TRUE	96.18ms	98.97ms	38.3MB
check_tbl_spl_n	FALSE	41.7s	41.7s	23.9GB
check_tbl_spl_n	TRUE	93.11ms	98.31ms	38.3MB
check_tbl_spl_non_compound_tid	FALSE	39.45s	39.45s	23.9GB
check_tbl_spl_non_compound_tid	TRUE	90.79ms	100.18ms	38.6MB
check_tbl_value_col	FALSE	24.81s	24.81s	15.9GB
check_tbl_value_col	TRUE	31.43ms	31.89ms	22.9MB
check_tbl_values	FALSE	40.63s	40.63s	18.2GB
check_tbl_values	TRUE	43.69ms	44.76ms	20.5MB
validate_submission	FALSE	10.81m	10.81m	683.9GB
validate_submission	TRUE	12.44s	12.44s	13.7GB

View Respicast config

{
    "schema_version": "https://raw.githubusercontent.com/hubverse-org/schemas/main/v3.0.1/tasks-schema.json",
    "rounds": [
        {
            "round_id_from_variable": true,
            "round_id": "round_id",
            "model_tasks": [
                {
                    "task_ids": {
                        "round_id": {
                            "required": null,
                            "optional": ["2024_2025_1_COVID", "2024_2025_1_FLU"]
                        },
                        "scenario_id": {
                            "required": null,
                            "optional": ["A", "B", "C", "D", "E", "F"]
                        },
                        "target": {
                            "required": null,
                            "optional": ["ILI incidence", "ili_plus"]
                        },
                        "location": {
                            "required": null,
                            "optional": ["AT","BE","BG","CH","CY","CZ","DE","DK","EE","ES","FI","FR","GR","HR","HU","IE","IS","IT","LI","LT","LU","LV","MT","NL","NO","PL","PT","RO","SE","SI","SK","GB-ENG","GB-WLS","GB-NIR","GB-SCT"]
                        },
                        "pop_group": {
                            "required": null,
                            "optional": ["0-4_vaxYes", "0-4_vaxNo"]
                        },
                        "target_end_date": {
                            "required": null,
                            "optional": ["2024-08-11","2024-08-18","2024-08-25","2024-09-01","2024-09-08","2024-09-15","2024-09-22","2024-09-29","2024-10-06","2024-10-13","2024-10-20","2024-10-27","2024-11-03","2024-11-10","2024-11-17","2024-11-24","2024-12-01","2024-12-08","2024-12-15","2024-12-22","2024-12-29","2025-01-05","2025-01-12","2025-01-19","2025-01-26","2025-02-02","2025-02-09","2025-02-16","2025-02-23","2025-03-02","2025-03-09","2025-03-16","2025-03-23","2025-03-30","2025-04-06","2025-04-13","2025-04-20","2025-04-27","2025-05-04","2025-05-11","2025-05-18","2025-05-25","2025-06-01"]
                        },
                        "horizon": {
                            "required": null,
                            "optional": [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43]
                        }
                    },
                    "output_type": {
                        "sample": {
                            "output_type_id_params": {
                               "is_required": true,
                               "type": "character",
                               "max_length": 6,
                               "min_samples_per_task": 1,
                               "max_samples_per_task": 100,
                               "compound_taskid_set": ["round_id", "scenario_id", "target", "location", "pop_group"]
                           },
                           "value":{
                               "type": "double",
                               "minimum": 0
                           }
                        },
                        "quantile": {
                            "output_type_id": {
                                "required": null,
                                "optional": [0.01,0.025,0.05,0.1,0.15,0.2,0.25,0.3,0.35,0.4,0.45,0.5,0.55,0.6,0.65,0.7,0.75,0.8,0.85,0.9,0.95,0.975,0.99]                                
                            },
                            "value": {
                                "type": "double",
                                "minimum": 0
                            }
                        }
                    },
                    "target_metadata": [
                        {
                           "target_id": "ILI incidence",
                           "target_name": "Weekly incidence for Influenza like illness",
                           "target_units": "cases per 100,000 population",
                           "target_keys": {
                               "target": "ILI incidence"
                           },
                           "description": "This target represents the count of new ILI cases per 100,000 in the week ending on the date [horizon] weeks after the reference_date",
                           "target_type": "continuous",
                           "is_step_ahead": true,
                           "time_unit": "week"
                        },
                        {
                           "target_id": "ili_plus",
                           "target_name": "Weekly incidence for Influenza like illness",
                           "target_units": "cases per 100,000 population",
                           "target_keys": {
                               "target": "ili_plus"
                           },
                           "description": "This target represents the count of new ILI cases per 100,000 in the week ending on the date [horizon] weeks after the reference_date",
                           "target_type": "continuous",
                           "is_step_ahead": true,
                           "time_unit": "week"
                        }
                    ]
                }
            ],
            "submissions_due": {"start":"2024-04-14","end":"2024-09-14"}
        }
    ],
    "output_type_id_datatype": "character"
}
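For a rough sense of why ignoring the derived target_end_date task ID helps so much with this config, the back-of-envelope sketch below multiplies the number of values per task ID listed above. It assumes a plain Cartesian product of all task ID values with the quantile output type IDs; the way the package actually builds its expanded grids may differ, so treat the figures as indicative only.

```r
# Number of values per task ID, counted from the config above.
n_values <- c(
  round_id = 2, scenario_id = 6, target = 2, location = 35,
  pop_group = 2, target_end_date = 43, horizon = 43
)
# Number of quantile output type IDs in the config above.
n_quantiles <- 23

# Rows in a fully expanded quantile grid (assumed plain Cartesian product).
rows_full <- prod(n_values) * n_quantiles

# Rows when the derived target_end_date task ID is left out of the grid.
keep <- names(n_values) != "target_end_date"
rows_ignoring_derived <- prod(n_values[keep]) * n_quantiles

# Factor by which the grid shrinks: one per target_end_date value.
shrink_factor <- rows_full / rows_ignoring_derived
```

Under these assumptions the grid shrinks by a factor equal to the number of target_end_date values, since each combination of the remaining task IDs no longer has to be crossed with every date.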

System Info

benchmarkme::get_cpu()
#> $vendor_id
#> character(0)
#> 
#> $model_name
#> [1] "Apple M1 Pro"
#> 
#> $no_of_cores
#> [1] 10
benchmarkme::get_ram()
#> 34.4 GB
benchmarkme::get_platform_info()
#> $OS.type
#> [1] "unix"
#> 
#> $file.sep
#> [1] "/"
#> 
#> $dynlib.ext
#> [1] ".so"
#> 
#> $GUI
#> [1] "X11"
#> 
#> $endian
#> [1] "little"
#> 
#> $pkgType
#> [1] "mac.binary.big-sur-arm64"
#> 
#> $path.sep
#> [1] ":"
#> 
#> $r_arch
#> [1] ""

Created on 2024-08-16 with reprex v2.1.0

annakrystalli · 2024-08-15T13:59:12Z

R/check_tbl_value_col.R

-  #     }
-  #     details <- details_bullets_div(details)
-  # }
-


This is just obsolete commented-out code, which I removed.

annakrystalli · 2024-08-15T13:59:47Z

R/check_tbl_values.R

-    )
-  }
-  tbl
-}


These two functions also no longer appear to be used, so I deleted them.

annakrystalli · 2024-08-15T14:05:48Z

R/check_tbl_values_required.R

@@ -5,18 +5,23 @@
 #' @inherit check_tbl_colnames params
 #' @inherit check_tbl_col_types return
 #' @export
-check_tbl_values_required <- function(tbl, round_id, file_path, hub_path) {
+check_tbl_values_required <- function(tbl, round_id, file_path, hub_path,
+                                      derived_task_ids = NULL) {


Unfortunately, this function cannot be batched by output type. It needs to continue to be batched by model task to ensure that, if optional task ID values are supplied for a given model task, the correct required output types are also supplied. As such, we need to evaluate model tasks as a whole rather than splitting them, unless we complicate this already very complicated function even more. I experimented a bit but decided it wasn't worth the effort at this time. Being able to ignore derived task IDs already confers speed-ups, and I'm hoping to gain more by speeding up conc_rows() in the next round of optimisations.

annakrystalli · 2024-08-15T14:07:28Z

R/expand_model_out_grid.R

        "i" = "{.arg output_types} must be members of: {.val {round_output_types}}"
      ),
      call = call
    )
  }
-  valid_output_types
+  output_types
 }


This is the fix requested by @elray1 here: #107 (review)

elray1

Overall, looks good!

I read through the unit tests fairly carefully. Had a couple of minor questions about whether we might want to check that the derived_task_ids are truly ignored (not that I doubt they are)
I read the code changes moderately carefully. Most looked clear. I don't know all the places where we might want/need to add the new derived_task_ids argument, but I trust that you found them. There is a lot of logic in some places, I'm not fresh on this code base, and it's not all documented as thoroughly as I would find helpful. So basically I gave up on trying to understand all the code updates. Asked a question in one place. The fact that the unit tests are passing is reassuring :)
Made a suggestion in the documentation but then saw you had written similar text elsewhere.

I will wait to approve changes till I see what your thoughts are on a couple of these, but also not specifically requesting changes.

elray1 · 2024-08-15T21:04:48Z

tests/testthat/test-validate_submission.R

@@ -319,3 +319,31 @@ test_that("validate_submission handles overriding output type id data type corre
    )[["col_types"]]
  )
 })
+
+test_that("Ignoring derived_task_ids in validate_submission works", {


Similar question to the one in another test: to test whether derived_task_ids were actually ignored, it seems like a test would involve setting a derived task ID to something invalid and then checking that no error is raised?

I actually think maybe the question is not relevant where I asked it first, but may be relevant here.

Great points about the tests! I sort of knew from the benchmarks that it was working but, admittedly, this wasn't being tested that explicitly throughout (mainly just in the expand_model_out_grid() tests).

So I added a bunch of tests throughout, following your suggestion of introducing a deliberate error, in 23b4655
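The testing approach discussed in this thread (put a deliberately invalid value in a derived task ID column and confirm it is not flagged) can be sketched in base R as follows. `check_values_toy()` is a hypothetical stand-in written for this example and is not the hubValidations implementation; the actual tests added in 23b4655 are not reproduced here.

```r
# Hypothetical stand-in for a value check; NOT the hubValidations code.
# Returns TRUE when every checked column contains only valid values.
check_values_toy <- function(tbl, valid_values, derived_task_ids = NULL) {
  # Columns named in derived_task_ids are skipped entirely.
  cols <- setdiff(names(valid_values), derived_task_ids)
  ok <- vapply(
    cols,
    function(col) all(tbl[[col]] %in% valid_values[[col]]),
    logical(1)
  )
  all(ok)
}

valid_values <- list(
  horizon = 1:2,
  target_end_date = c("2024-08-11", "2024-08-18")
)

# Deliberately invalid value in the derived task ID column.
tbl <- data.frame(
  horizon = c(1, 2),
  target_end_date = c("2024-08-11", "random_date"),
  stringsAsFactors = FALSE
)

# Without the argument, the invalid date is caught.
fails_by_default <- !check_values_toy(tbl, valid_values)

# With the column declared as derived, the invalid date is ignored.
passes_when_ignored <- check_values_toy(
  tbl, valid_values,
  derived_task_ids = "target_end_date"
)
```

Checking both directions matters: the first assertion shows the invalid value would normally be detected, so the second demonstrates that the column really was ignored rather than the check being ineffective.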

annakrystalli added 11 commits August 9, 2024 16:52

optimise check_tbl_values by output_type batching & ignoring derived task-ids. Related to #93

2aa70e9

Ignore derived task-ids in check_tbl_values_required

93a4bf2

add match_tbl_to_model_task function

2a2e948

refactor check_tbl_value_col to work on smaller subsets of data

2bc332c

propagate output subsetting and derived_task_ids arg to spl checks

7fe78b4

propagate output subsetting and derived_task_ids arg to higher level validation fns

1b003f2

Fix error_tbl bug by using original tbl rowids

0686ff4

remove unused functions

df0815e

Fix linter issues

4f18bca

Bump version

809a340

Add more detail to NEWS

451b7a2

Base automatically changed from 93/add-derived-tid-arg to 98/subset-grid-by-out-type August 15, 2024 07:03

annakrystalli added 3 commits August 15, 2024 10:14

merge earlier PRs

4a45645

Merge branch '98/subset-grid-by-out-type' into 93/3-batch-value-validation # Conflicts: # R/check_tbl_spl_compound_tid.R # R/v3-sample-utils.R # tests/testthat/test-check_tbl_spl_compound_tid.R # tests/testthat/test-check_tbl_spl_non_compound_tid.R

Throw error if invalid output_type supplied instead of ignoring.

edbf939

add output_types and derived_task_ids arguments to submission_tmpl

28715b3

annakrystalli linked an issue Aug 15, 2024 that may be closed by this pull request

Improve validation performance #93

Open

4 tasks

Add note on derived task IDs in pkgdown docs

8ec8261

annakrystalli changed the title ~~93/3 batch value validation~~ 93/PR 3/3 - Propagate performance optimisation throughout checks & validation functions Aug 15, 2024

annakrystalli linked an issue Aug 15, 2024 that may be closed by this pull request

Columns where values are dependent on the value of other columns cause problems in value combination validation. #38

Open

annakrystalli removed a link to an issue Aug 15, 2024

Improve validation performance #93

Open

4 tasks

annakrystalli requested review from elray1 and LucieContamin August 15, 2024 14:07

annakrystalli added the performance label Aug 15, 2024

annakrystalli marked this pull request as ready for review August 15, 2024 14:08

annakrystalli mentioned this pull request Aug 15, 2024

Add simple value checks for derived task IDs #110

Closed

annakrystalli self-assigned this Aug 15, 2024

elray1 reviewed Aug 15, 2024

annakrystalli added 2 commits August 16, 2024 10:09

Add fn comments

2163c5c

add tests that explicitly check derived_task_ids are ignored

23b4655

elray1 approved these changes Aug 16, 2024

Update R/check_tbl_values.R

7b8a953

Co-authored-by: Evan Ray <[email protected]>

annakrystalli merged commit 29c3909 into 98/subset-grid-by-out-type Aug 16, 2024

annakrystalli deleted the 93/3-batch-value-validation branch August 16, 2024 08:29
