Use a statistical model fitted to the original dataset to synthesize data #179
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files (coverage diff, develop vs. #179):

|          | develop | #179   | +/- |
|----------|---------|--------|-----|
| Coverage | 93.57%  | 93.57% |     |
| Files    | 28      | 28     |     |
| Lines    | 1464    | 1464   |     |
| Hits     | 1370    | 1370   |     |
| Misses   | 94      | 94     |     |

☔ View full report in Codecov by Sentry.
Can you fix the QA issues, please?
I'm not sure about this PR. I think the implementation and the time series model are very neat and the R2 is good, but the resulting coefficients file is 39MB (compressed or not?) and the validation script requires LFS to restore the original 1.6 GB data file.
The main point of #149 was to remove the large dataset and drop LFS, and the retained inputs are ~50MB. We can tile that small dataset to increase the profiling load. This PR adds more variation in the input data by retaining more of the original signal of the large dataset, but it isn't clear that this alone improves the profiling.
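For illustration, a rough sketch of what tiling the retained inputs along the time axis could look like, using xarray; the input file name, the `tile_along_time` helper and the repeat count are placeholders rather than anything in the repository:

```python
# Sketch: tile a small dataset along time to increase profiling load.
import xarray as xr


def tile_along_time(ds: xr.Dataset, repeats: int) -> xr.Dataset:
    """Concatenate `repeats` shifted copies of `ds` along the time dimension."""
    # Assume a regularly spaced time coordinate; shift each copy so the
    # tiled dataset keeps a monotonically increasing time axis.
    step = ds.time.values[1] - ds.time.values[0]
    span = ds.time.values[-1] - ds.time.values[0] + step
    copies = []
    for i in range(repeats):
        copy = ds.copy()
        copy["time"] = ds.time.values + i * span
        copies.append(copy)
    return xr.concat(copies, dim="time")


small = xr.open_dataset("pyrealm_build_data/reduced_inputs.nc")  # placeholder path
profiling_inputs = tile_along_time(small, repeats=10)
```

Tiling multiplies the profiling load without adding any new signal, which is the trade-off being discussed here.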
I have fixed the QA now. I think the validation script only needs to run once, to ensure the reconstructed dataset is similar enough to the original dataset. After that the profiling can just call the …
Well, yes, but the PR adds …
See the comments on this PR about the use of LFS. To add to that and the comments below, can you provide complete docstrings (Args etc.) and a clearer description of how the functions are meant to be used, which is currently not obvious? It would also be good to have longer, more explanatory variable names and more comments.
Locally I have checked out the LFS object, so the validation script can run locally to check the quality of the reconstructed dataset. In the CI workflow, however, since we have removed the step of fetching LFS files, this test will be skipped. This looks sensible: developers can check the quality of the dataset reconstruction in their local environment, and the reconstructed dataset can then be used in the CI workflow for profiling without needing to run the quality check again.
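As a minimal sketch of that skip behaviour, the regression test could guard itself on the presence of the LFS object, something like the following; the file paths, the variable name and the `reconstruct` helper are hypothetical placeholders, not the exact names used in this PR:

```python
# Sketch: skip the reconstruction-quality test when the LFS original is absent.
from pathlib import Path

import numpy as np
import pytest
import xarray as xr

ORIGINAL = Path("pyrealm_build_data/original_inputs.nc")  # LFS-tracked original
PARAMS = Path("pyrealm_build_data/data_model_params.nc")

# When LFS files are not fetched (as in CI), only a tiny pointer file is
# present, so checking the size catches both missing and unfetched files.
lfs_missing = not ORIGINAL.exists() or ORIGINAL.stat().st_size < 1_000_000


@pytest.mark.skipif(lfs_missing, reason="Original LFS dataset not checked out")
def test_reconstruction_quality():
    from pyrealm_build_data.synth_data import reconstruct  # hypothetical helper

    original = xr.open_dataset(ORIGINAL)["temp"].values.ravel()
    rebuilt = reconstruct(PARAMS)["temp"].values.ravel()

    # Require the reconstruction to explain at least 85% of the variance.
    r2 = 1 - np.sum((original - rebuilt) ** 2) / np.sum(
        (original - original.mean()) ** 2
    )
    assert r2 >= 0.85
```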
Yes, but this requires that the whole repo retains the use of LFS. I think the point here is that we need a block of data to use in profiling:
A simple solution at the moment would be to validate the generated dataset using the reduced (1 year) dataset, which does not require LFS.
When #189 is implemented, the dataset can also be uploaded to Zenodo.
Another option would be to use the reduced dataset for validity checking.
With #256, I think this becomes stale and we can close it. It only really exists because of the processing time and data file size limitations of trying to run the profiling on GitHub Actions. If we move back to local profiling as planned, we'd still need test datasets, but I think we can scale them back up to give more stringent and stable tests and then store them on e.g. Zenodo. We can then close this.
Happy to have this closed.
Storing the dataset on Zenodo is a good alternative.
Closed as we have moved away from profiling on GitHub runners, which removes the need to compress the profiling datasets.
Description
A statistical model (with a linear component and several seasonal components with different periods) was fitted to the large LFS dataset by the script `pyrealm_build_data/synth_data.py`, and the fitted parameters were stored in `pyrealm_build_data/data_model_params.nc`. These can later be used to reconstruct the original dataset with decent accuracy, replacing the 1.8GB dataset with a 0.3MB file of coefficients.

A test file was also added in `tests/regression/data/` to check the quality of the reconstructed dataset by asserting that its R2 score against the original is at least 0.85 (1.0 means a perfect reconstruction).
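As an illustrative sketch (not the actual `synth_data.py` implementation), a model of this kind can be fitted by ordinary least squares on a design matrix with an intercept, a linear trend and sine/cosine pairs for each seasonal period; the periods, the synthetic stand-in series and the seed below are assumptions:

```python
# Sketch: fit a linear-plus-seasonal model and check the R2 of the reconstruction.
import numpy as np


def design_matrix(t: np.ndarray, periods: tuple[float, ...]) -> np.ndarray:
    """Columns: intercept, linear trend, then a sin/cos pair per period."""
    cols = [np.ones_like(t), t]
    for period in periods:
        cols.append(np.sin(2 * np.pi * t / period))
        cols.append(np.cos(2 * np.pi * t / period))
    return np.column_stack(cols)


# Hours since the start of a two-year hourly record.
t = np.arange(2 * 365 * 24, dtype=float)

# Stand-in for one variable of the original dataset: trend + daily and
# annual cycles + noise. The real script would read this from the LFS file.
rng = np.random.default_rng(0)
y = (
    0.0001 * t
    + 5 * np.sin(2 * np.pi * t / 24)
    + 10 * np.cos(2 * np.pi * t / (365.25 * 24))
    + rng.normal(0, 1, t.size)
)

# Ordinary least squares fit of the linear + seasonal model.
X = design_matrix(t, periods=(24.0, 365.25 * 24))
coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)

# Reconstruct from the coefficient vector and score the fit; the PR's
# regression test requires an R2 of at least 0.85.
reconstructed = X @ coeffs
r2 = 1 - np.sum((y - reconstructed) ** 2) / np.sum((y - y.mean()) ** 2)
assert r2 >= 0.85
```

Only the fitted coefficients need to be stored, which is what allows the on-disk file to be so much smaller than the original dataset.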