Improve validation performance #93

Open
2 of 4 tasks
annakrystalli opened this issue Jul 2, 2024 · 0 comments
annakrystalli commented Jul 2, 2024

Background

While our current validation functionality works well enough on smaller files and on hubs with less complex config files, it can be much slower on larger files or more complex configs. This has also been noted and reported by the community (e.g. #86).

The most time-consuming checks are those verifying that combinations of values are valid and that all required value combinations have been submitted.

There are a number of reasons/bottlenecks:

  • The size of the expanded grid of all possible values in a complex hub can be very large.
  • The expansion of value combinations that are effectively invalid because the value of one task id depends on the values of others. For example, target_end_date has only a single valid value for each combination of origin_date and horizon (see #38, "Columns where values are dependent on the value of other columns cause problems in value combination validation"). Expanding the values of such task ids unnecessarily increases the size of the expanded value grid, while the actual validation is performed via the optional validation check hubValidations::opt_check_tbl_horizon_timediff().
  • Creating an index via conc_rows to then split the submitted table and check for required values in check_tbl_values_required.

These are likely the most effective areas to direct effort to improve performance.
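
To make the first two bottlenecks concrete, here is a minimal sketch in Python (not hubValidations code) of how a dependent task id inflates the expanded grid. The task id values are made up for illustration, and weekly horizons are an assumption.

```python
from datetime import date, timedelta
from itertools import product

# Hypothetical task id values; not taken from any real hub config.
origin_dates = [date(2024, 6, 1), date(2024, 6, 8)]
horizons = [1, 2, 3]
locations = ["US", "01", "02"]

# target_end_date is fully determined by origin_date and horizon
# (weekly horizons assumed), so the config only lists its unique values.
target_end_dates = sorted(
    {o + timedelta(weeks=h) for o in origin_dates for h in horizons}
)

# Expanding every task id, including the dependent one.
full_grid = list(product(origin_dates, horizons, locations, target_end_dates))

# Only rows where target_end_date matches origin_date + horizon are meaningful.
valid_rows = [
    row for row in full_grid if row[3] == row[0] + timedelta(weeks=row[1])
]

# Leaving the dependent task id out of the expansion altogether.
reduced_grid = list(product(origin_dates, horizons, locations))

print(len(full_grid), len(valid_rows), len(reduced_grid))
```

In this toy case most of the fully expanded rows can never be valid, and the gap widens as more origin dates and horizons are added.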

Specific Actions

  • Perform memory-intensive validations in a piecemeal way: Once #98 (Allow subsetting of expand_model_out_val_grid() by output type) is implemented, we should refactor any checks that use expand_model_out_val_grid() to perform the checks one output type at a time. This avoids holding the full expanded grid in memory at any one time.
  • Memoise expand_model_out_val_grid(): As this function is called a number of times but always returns the same result for the same config, it's a good candidate for memoisation (Memoise expand_model_out_vals_grid #85).
  • Optimise conc_rows: I've already tried and failed to improve the performance of this function, but given that it's the main bottleneck in check_tbl_values_required, it feels important to revisit it and try again.
  • Introduce a mechanism for excluding task ids from the expanded grid of valid values: This relates to task ids like target_end_date, for which expanding their values is meaningless yet can be very memory-intensive. For such task ids, validation would involve:
    • Validating that the unique values in the task id column are valid with respect to the config (instead of checking them as part of combinations).
    • Using custom/optional functions to validate expected properties/relationships of such variables.
    • Encoding such task ids, most likely as NAs, in expanded grids.
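
As a rough illustration of how three of these actions could fit together (memoisation, per-output-type checks and excluded task ids), here is a Python sketch. It is not the hubValidations implementation: `CONFIG`, `expand_grid()` and `check_rows()` are hypothetical stand-ins, and `None` plays the role of NA.

```python
from functools import lru_cache
from itertools import product

# Hypothetical, heavily simplified config; real hub configs are richer.
CONFIG = {
    "task_ids": {
        "origin_date": ("2024-06-01", "2024-06-08"),
        "horizon": (1, 2),
        "target_end_date": ("2024-06-08", "2024-06-15", "2024-06-22"),
    },
    "output_types": {
        "mean": (None,),
        "quantile": (0.25, 0.5, 0.75),
    },
}


@lru_cache(maxsize=None)  # memoise: same arguments, same grid, computed once
def expand_grid(output_type, exclude=()):
    """Expand valid value combinations for a single output type.

    Task ids named in `exclude` are not expanded; they are encoded as None.
    """
    columns = [
        (None,) if name in exclude else values
        for name, values in CONFIG["task_ids"].items()
    ]
    columns.append(CONFIG["output_types"][output_type])
    return frozenset(product(*columns))


def check_rows(rows, exclude=()):
    """Return the rows whose value combinations are invalid.

    Rows are checked one output type at a time, so only that output type's
    grid is needed at each step. Rows with an output type that is not in
    CONFIG are ignored here; a separate check would have to catch those.
    """
    invalid = []
    for output_type in CONFIG["output_types"]:
        grid = expand_grid(output_type, exclude)
        for row in rows:
            if row["output_type"] != output_type:
                continue
            key = tuple(
                None if name in exclude else row[name]
                for name in CONFIG["task_ids"]
            ) + (row["output_type_id"],)
            # Excluded task ids are validated on their own values only.
            excluded_ok = all(
                row[name] in CONFIG["task_ids"][name] for name in exclude
            )
            if key not in grid or not excluded_ok:
                invalid.append(row)
    return invalid
```

Calling `check_rows(rows, exclude=("target_end_date",))` checks combinations of the remaining task ids against a smaller grid, while target_end_date is only checked against its own list of valid values. Whether it matches origin_date and horizon would still need a dedicated check, as described above.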
@annakrystalli annakrystalli added this to the optimise-validations milestone Jul 2, 2024
@nickreich nickreich moved this from Todo to Up Next in hubverse Development overview Jul 17, 2024
@annakrystalli annakrystalli moved this from Up Next to In Progress in hubverse Development overview Aug 12, 2024
@annakrystalli annakrystalli self-assigned this Aug 12, 2024
Projects
Status: Wishlist