
Use a statistical model fitted to the original dataset to synthesize data #179

Closed
tztsai wants to merge 21 commits into develop from synth-data

Conversation


@tztsai tztsai commented Feb 2, 2024

Description

A statistical model (with a linear component and several seasonal components with different periods) was fitted to the large LFS dataset by the script pyrealm_build_data/synth_data.py and stored in pyrealm_build_data/data_model_params.nc. It can later be used to reconstruct the original dataset with decent accuracy, replacing the 1.8GB dataset with a 0.3MB file of coefficients.
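The actual fitting code is in pyrealm_build_data/synth_data.py and is not reproduced in this thread. As a rough sketch of the approach described above (function names, periods, and array shapes below are illustrative assumptions, not the PR's API), a linear trend plus sinusoidal seasonal components can be fitted with ordinary least squares, after which only the coefficient vector needs to be stored:

```python
import numpy as np


def seasonal_design_matrix(t, periods):
    """Intercept, linear trend, and a sin/cos pair per seasonal period.

    t is elapsed time in days; periods are in the same units.
    """
    columns = [np.ones_like(t), t]
    for period in periods:
        columns.append(np.sin(2 * np.pi * t / period))
        columns.append(np.cos(2 * np.pi * t / period))
    return np.column_stack(columns)


def fit_seasonal_model(t, y, periods):
    """Least-squares fit; the returned coefficients are all that must be stored."""
    coeffs, *_ = np.linalg.lstsq(seasonal_design_matrix(t, periods), y, rcond=None)
    return coeffs


def reconstruct(t, coeffs, periods):
    """Rebuild the series from the stored coefficients at arbitrary times t."""
    return seasonal_design_matrix(t, periods) @ coeffs


# Synthetic check: two years of 6-hourly data with annual and daily cycles.
t = np.arange(0, 730, 0.25)
y = 10 + 0.01 * t + 3 * np.sin(2 * np.pi * t / 365.25) + np.cos(2 * np.pi * t)
coeffs = fit_seasonal_model(t, y, periods=[365.25, 1.0])
y_hat = reconstruct(t, coeffs, periods=[365.25, 1.0])
print(np.allclose(y, y_hat, atol=1e-6))  # True: the model family matches exactly
```

Storing only the coefficients per grid cell, rather than the series itself, is what shrinks the file from gigabytes to kilobytes.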

A test file was also added in tests/regression/data/ to ensure the quality of the reconstructed dataset by asserting that its R2 score is at least 0.85 (1.0 means a perfect reconstruction).
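For reference, the R2 acceptance criterion amounts to a coefficient-of-determination check along these lines (a sketch only; the real test lives in tests/regression/data/ and its exact helpers are not shown in this thread):

```python
import numpy as np


def r_squared(original, rebuilt):
    """Coefficient of determination: 1.0 means a perfect reconstruction."""
    ss_res = np.sum((original - rebuilt) ** 2)
    ss_tot = np.sum((original - np.mean(original)) ** 2)
    return 1.0 - ss_res / ss_tot


# The regression test would assert r_squared(original, rebuilt) >= 0.85,
# skipping (e.g. via pytest.mark.skipif) when the LFS-tracked original
# dataset has not been checked out.
y = np.array([1.0, 2.0, 3.0, 4.0])
print(r_squared(y, y))  # 1.0
```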

@tztsai tztsai requested a review from davidorme February 2, 2024 10:48

codecov-commenter commented Feb 2, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (caab98b) 93.57% compared to head (c7fa14d) 93.57%.

Additional details and impacted files
@@           Coverage Diff            @@
##           develop     #179   +/-   ##
========================================
  Coverage    93.57%   93.57%           
========================================
  Files           28       28           
  Lines         1464     1464           
========================================
  Hits          1370     1370           
  Misses          94       94           



@davidorme davidorme left a comment


Can you fix the QA issues, please?

I'm not sure about this PR. I think the implementation and the time series model is very neat and the R2 is good, but the resulting coefficients file is 39MB (compressed or not?) and the validation script requires LFS to restore the original 1.6 GB data file.

The main point of #149 was to remove the large dataset and drop LFS, and the retained inputs are ~50MB. We can tile that small dataset to increase the profiling load. This PR adds more variation in the input data, by retaining more of the original signal of the large dataset, but it isn't clear that this alone improves the profiling.
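The tiling alternative mentioned here is simple to sketch (array shapes and repeat counts below are invented for illustration):

```python
import numpy as np


def tile_time_series(data, n_repeats):
    """Repeat a (time, lat, lon) array along its time axis so a small
    retained dataset can stand in for a much larger profiling workload."""
    return np.concatenate([data] * n_repeats, axis=0)


one_year = np.random.rand(365, 10, 20)     # a year of daily gridded inputs
workload = tile_time_series(one_year, 30)  # ~30 years of profiling load
print(workload.shape)  # (10950, 10, 20)
```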


tztsai commented Feb 19, 2024

> Can you fix the QA issues, please?
>
> I'm not sure about this PR. I think the implementation and the time series model is very neat and the R2 is good, but the resulting coefficients file is 39MB (compressed or not?) and the validation script requires LFS to restore the original 1.6 GB data file.
>
> The main point of #149 was to remove the large dataset and drop LFS, and the retained inputs are ~50MB. We can tile that small dataset to increase the profiling load. This PR adds more variation in the input data, by retaining more of the original signal of the large dataset, but it isn't clear that this alone improves the profiling.

I have fixed the QA now. I think the validation script only needs to run once, to ensure the reconstructed dataset is similar enough to the original dataset. After that the profiling can just call the reconstruct function without any use of LFS. In addition, the previous coefficients file stored features for each time step of the original dataset; by letting the reconstruct function calculate these features dynamically from a custom datetime index, the file only needs to occupy ~300KB.
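Computing the features dynamically from a datetime index, rather than storing them per time step, might look like this (a sketch under assumed periods; the PR's actual reconstruct function is not shown here):

```python
import numpy as np
import pandas as pd


def seasonal_features(index, periods_days):
    """Build the intercept/trend/sin/cos feature matrix on the fly from any
    datetime index, so only the fitted coefficients need to be stored."""
    t = ((index - index[0]) / pd.Timedelta(days=1)).to_numpy()  # elapsed days
    columns = [np.ones_like(t), t]
    for period in periods_days:
        columns.append(np.sin(2 * np.pi * t / period))
        columns.append(np.cos(2 * np.pi * t / period))
    return np.column_stack(columns)


index = pd.date_range("2000-01-01", periods=8, freq="6h")
X = seasonal_features(index, periods_days=[365.25, 1.0])
print(X.shape)  # (8, 6): reconstruction is then just X @ stored_coefficients
```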

@davidorme

> I think the validation script only needs to run once to ensure the reconstructed dataset is similar enough to the original dataset.

Well, yes but the PR adds tests/regression/data/test_synth_data.py to the regression testing suite and that is entirely reliant on the LFS file. It skips if the file is missing, but we're either restoring the need for LFS if we want that file to run or adding a test that will always skip.


@davidorme davidorme left a comment


See the comments on this PR about the use of LFS. To add to that and the comments below, can you provide complete docstrings (Args etc.) and a clear description of how the functions are meant to be used, which is not clear at present? It would also be good to have longer, more explanatory variable names and more comments.

pyrealm_build_data/synth_data.py: five inline review threads (four marked outdated, all resolved)
@MarionBWeinzierl MarionBWeinzierl added the enhancement (New feature or request) and rse labels Feb 20, 2024

tztsai commented Feb 20, 2024

>> I think the validation script only needs to run once to ensure the reconstructed dataset is similar enough to the original dataset.

> Well, yes but the PR adds tests/regression/data/test_synth_data.py to the regression testing suite and that is entirely reliant on the LFS file. It skips if the file is missing, but we're either restoring the need for LFS if we want that file to run or adding a test that will always skip.

Locally I have checked out the LFS object, so the validation script can run locally to check the quality of the reconstructed dataset. In the CI workflow, however, this test file will be skipped, since we have removed the step of fetching LFS files. This seems sensible: developers can check the quality of the dataset reconstruction in their local environment, and the reconstructed dataset can later be used for profiling in the CI workflow without needing to run the quality check again.


davidorme commented Feb 21, 2024

>>> I think the validation script only needs to run once to ensure the reconstructed dataset is similar enough to the original dataset.

>> Well, yes but the PR adds tests/regression/data/test_synth_data.py to the regression testing suite and that is entirely reliant on the LFS file. It skips if the file is missing, but we're either restoring the need for LFS if we want that file to run or adding a test that will always skip.

> Locally I have checked out the LFS object, so the validation script can run locally to check the quality of the reconstructed dataset. In the CI workflow, however, since we have removed the step of fetching LFS files, this test file will be skipped. This looks sensible - the developers can check the quality of the dataset reconstruction in the local environment, and the reconstructed dataset will be used in the CI workflow for profiling later without the need of running the quality check again.

Yes, but this requires that the whole repo retains git LFS so that this flow can be used. And we had deliberately reduced the dataset size to avoid using git LFS.

I think the point here is that we need a block of data to use in profiling:

  • We have a block of real data that we can tile at the moment.
  • We could adopt this as a time series model without really caring about the precision of the replication. We probably genuinely don't care, although it would be good to have provenance.
  • I had been vaguely thinking about procedurally generated data (something like Perlin noise) instead.
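For what a procedurally generated alternative could look like, here is a minimal 1-D value-noise sketch (a simpler cousin of Perlin noise; all names and parameters are illustrative, not a proposal from the PR):

```python
import numpy as np


def value_noise_1d(n_points, n_knots, rng):
    """Smoothly interpolated random knot values: deterministic, tuneable
    pseudo-data with no provenance issues and no large file to store."""
    knots = rng.standard_normal(n_knots)
    x = np.linspace(0, n_knots - 1, n_points)
    i = np.clip(np.floor(x).astype(int), 0, n_knots - 2)
    frac = x - i
    smooth = frac * frac * (3 - 2 * frac)  # smoothstep easing between knots
    return knots[i] * (1 - smooth) + knots[i + 1] * smooth


rng = np.random.default_rng(42)
series = value_noise_1d(1000, 12, rng)
print(series.shape)  # (1000,)
```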


tztsai commented Feb 21, 2024

>>>> I think the validation script only needs to run once to ensure the reconstructed dataset is similar enough to the original dataset.

>>> Well, yes but the PR adds tests/regression/data/test_synth_data.py to the regression testing suite and that is entirely reliant on the LFS file. It skips if the file is missing, but we're either restoring the need for LFS if we want that file to run or adding a test that will always skip.

>> Locally I have checked out the LFS object, so the validation script can run locally to check the quality of the reconstructed dataset. In the CI workflow, however, since we have removed the step of fetching LFS files, this test file will be skipped. This looks sensible - the developers can check the quality of the dataset reconstruction in the local environment, and the reconstructed dataset will be used in the CI workflow for profiling later without the need of running the quality check again.

> Yes but this requires that the whole repo retains the use of git lfs so that this flow can be used. And we had deliberately reduced the dataset size to avoid using git lfs.

A simple solution at the moment would be to validate the generated dataset using the reduced (1 year) dataset which does not require LFS.

@tztsai tztsai requested a review from davidorme February 27, 2024 10:57
@MarionBWeinzierl

When #189 is implemented, the dataset can also be uploaded to Zenodo.

@MarionBWeinzierl

Another option would be to use the reduced dataset for validity checking.

@davidorme

@MarionBWeinzierl

With #256, I think this becomes stale and we can close it. It only really exists because of the processing time and data file size limitations of trying to run the profiling on GitHub Actions.

If we move back to local profiling as planned, we'd still need test datasets, but I think we can scale them back up to give more stringent and stable tests, and then store them on e.g. Zenodo. We can then close this.

@MarionBWeinzierl

Happy to have this closed.


tztsai commented Oct 22, 2024

Storing the dataset on Zenodo is a good alternative.

@davidorme

Closed as we have moved away from profiling on GitHub runners, which removes the need to compress the profiling datasets.

@davidorme davidorme closed this Oct 22, 2024
@davidorme davidorme removed their request for review October 22, 2024 17:24
@davidorme davidorme deleted the synth-data branch October 22, 2024 17:25
Labels
enhancement (New feature or request), rse
Projects
Status: Done
4 participants