
Use a statistical model fitted to the original dataset to synthesize data #179

Closed
tztsai wants to merge 21 commits into develop from synth-data

Conversation


@tztsai tztsai commented Feb 2, 2024

Description

A statistical model (with a linear component and several seasonal components with different periods) was fitted to the large LFS dataset by the script pyrealm_build_data/synth_data.py and stored in pyrealm_build_data/data_model_params.nc. It can later be used to reconstruct the original dataset with decent accuracy, replacing the 1.8GB dataset with a 0.3MB file of coefficients.
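The actual fitting code is in pyrealm_build_data/synth_data.py and is not reproduced in this thread. As a rough sketch of the approach described above (function names, periods, and array shapes below are illustrative assumptions, not the PR's API), a linear trend plus sinusoidal seasonal components can be fitted with ordinary least squares, after which only the coefficient vector needs to be stored:

```python
import numpy as np


def seasonal_design_matrix(t, periods):
    """Intercept, linear trend, and a sin/cos pair per seasonal period.

    t is elapsed time in days; periods are in the same units.
    """
    columns = [np.ones_like(t), t]
    for period in periods:
        columns.append(np.sin(2 * np.pi * t / period))
        columns.append(np.cos(2 * np.pi * t / period))
    return np.column_stack(columns)


def fit_seasonal_model(t, y, periods):
    """Least-squares fit; the returned coefficients are all that must be stored."""
    coeffs, *_ = np.linalg.lstsq(seasonal_design_matrix(t, periods), y, rcond=None)
    return coeffs


def reconstruct(t, coeffs, periods):
    """Rebuild the series from the stored coefficients at arbitrary times t."""
    return seasonal_design_matrix(t, periods) @ coeffs


# Synthetic check: two years of 6-hourly data with annual and daily cycles.
t = np.arange(0, 730, 0.25)
y = 10 + 0.01 * t + 3 * np.sin(2 * np.pi * t / 365.25) + np.cos(2 * np.pi * t)
coeffs = fit_seasonal_model(t, y, periods=[365.25, 1.0])
y_hat = reconstruct(t, coeffs, periods=[365.25, 1.0])
print(np.allclose(y, y_hat, atol=1e-6))  # True: the model family matches exactly
```

Storing only the coefficients per grid cell, rather than the series itself, is what shrinks the file from gigabytes to kilobytes.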

A test file was also added in tests/regression/data/ to ensure the quality of the reconstructed dataset by asserting that its R2 score is at least 0.85 (1.0 means a perfect reconstruction).
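For reference, the R2 acceptance criterion amounts to a coefficient-of-determination check along these lines (a sketch only; the real test lives in tests/regression/data/ and its exact helpers are not shown in this thread):

```python
import numpy as np


def r_squared(original, rebuilt):
    """Coefficient of determination: 1.0 means a perfect reconstruction."""
    ss_res = np.sum((original - rebuilt) ** 2)
    ss_tot = np.sum((original - np.mean(original)) ** 2)
    return 1.0 - ss_res / ss_tot


# The regression test would assert r_squared(original, rebuilt) >= 0.85,
# skipping (e.g. via pytest.mark.skipif) when the LFS-tracked original
# dataset has not been checked out.
y = np.array([1.0, 2.0, 3.0, 4.0])
print(r_squared(y, y))  # 1.0
```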

@tztsai tztsai requested a review from davidorme February 2, 2024 10:48

codecov-commenter commented Feb 2, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (caab98b) 93.57% compared to head (c7fa14d) 93.57%.

Additional details and impacted files
@@           Coverage Diff            @@
##           develop     #179   +/-   ##
========================================
  Coverage    93.57%   93.57%           
========================================
  Files           28       28           
  Lines         1464     1464           
========================================
  Hits          1370     1370           
  Misses          94       94           



@davidorme davidorme left a comment


Can you fix the QA issues, please?

I'm not sure about this PR. I think the implementation and the time series model is very neat and the R2 is good, but the resulting coefficients file is 39MB (compressed or not?) and the validation script requires LFS to restore the original 1.6 GB data file.

The main point of #149 was to remove the large dataset and drop LFS, and the retained inputs are ~50MB. We can tile that small dataset to increase the profiling load. This PR adds more variation in the input data, by retaining more of the original signal of the large dataset, but it isn't clear that this alone improves the profiling.
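The tiling alternative mentioned here is simple to sketch (array shapes and repeat counts below are invented for illustration):

```python
import numpy as np


def tile_time_series(data, n_repeats):
    """Repeat a (time, lat, lon) array along its time axis so a small
    retained dataset can stand in for a much larger profiling workload."""
    return np.concatenate([data] * n_repeats, axis=0)


one_year = np.random.rand(365, 10, 20)     # a year of daily gridded inputs
workload = tile_time_series(one_year, 30)  # ~30 years of profiling load
print(workload.shape)  # (10950, 10, 20)
```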


tztsai commented Feb 19, 2024

> Can you fix the QA issues, please?
>
> I'm not sure about this PR. I think the implementation and the time series model is very neat and the R2 is good, but the resulting coefficients file is 39MB (compressed or not?) and the validation script requires LFS to restore the original 1.6 GB data file.
>
> The main point of #149 was to remove the large dataset and drop LFS, and the retained inputs are ~50MB. We can tile that small dataset to increase the profiling load. This PR adds more variation in the input data, by retaining more of the original signal of the large dataset, but it isn't clear that this alone improves the profiling.

I have fixed the QA now. I think the validation script only needs to run once, to ensure the reconstructed dataset is similar enough to the original dataset. After that the profiling can just call the reconstruct function without any use of LFS. In addition, the previous coefficients file stored features for each time step of the original dataset; by letting the reconstruct function calculate these features dynamically from a custom datetime index, the file only needs to occupy ~300KB.
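Computing the features dynamically from a datetime index, rather than storing them per time step, might look like this (a sketch under assumed periods; the PR's actual reconstruct function is not shown here):

```python
import numpy as np
import pandas as pd


def seasonal_features(index, periods_days):
    """Build the intercept/trend/sin/cos feature matrix on the fly from any
    datetime index, so only the fitted coefficients need to be stored."""
    t = ((index - index[0]) / pd.Timedelta(days=1)).to_numpy()  # elapsed days
    columns = [np.ones_like(t), t]
    for period in periods_days:
        columns.append(np.sin(2 * np.pi * t / period))
        columns.append(np.cos(2 * np.pi * t / period))
    return np.column_stack(columns)


index = pd.date_range("2000-01-01", periods=8, freq="6h")
X = seasonal_features(index, periods_days=[365.25, 1.0])
print(X.shape)  # (8, 6): reconstruction is then just X @ stored_coefficients
```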

@davidorme

> I think the validation script only needs to run once to ensure the reconstructed dataset is similar enough to the original dataset.

Well, yes but the PR adds tests/regression/data/test_synth_data.py to the regression testing suite and that is entirely reliant on the LFS file. It skips if the file is missing, but we're either restoring the need for LFS if we want that file to run or adding a test that will always skip.


@davidorme davidorme left a comment


See the comments on this PR about the use of LFS. To add to that and the comments below, can you provide complete docstrings (Args etc.) and a clear description of how the functions are meant to be used, which is not clear at present? It would also be good to have longer, more explanatory variable names and more comments.

pyrealm_build_data/synth_data.py: five inline review threads (four marked outdated, all resolved)
@MarionBWeinzierl MarionBWeinzierl added the enhancement (New feature or request) and rse labels Feb 20, 2024

tztsai commented Feb 20, 2024

>> I think the validation script only needs to run once to ensure the reconstructed dataset is similar enough to the original dataset.

> Well, yes but the PR adds tests/regression/data/test_synth_data.py to the regression testing suite and that is entirely reliant on the LFS file. It skips if the file is missing, but we're either restoring the need for LFS if we want that file to run or adding a test that will always skip.

Locally I have checked out the LFS object, so the validation script can run locally to check the quality of the reconstructed dataset. In the CI workflow, however, this test file will be skipped, since we have removed the step of fetching LFS files. This seems sensible: developers can check the quality of the dataset reconstruction in their local environment, and the reconstructed dataset can later be used for profiling in the CI workflow without needing to run the quality check again.


davidorme commented Feb 21, 2024

>>> I think the validation script only needs to run once to ensure the reconstructed dataset is similar enough to the original dataset.

>> Well, yes but the PR adds tests/regression/data/test_synth_data.py to the regression testing suite and that is entirely reliant on the LFS file. It skips if the file is missing, but we're either restoring the need for LFS if we want that file to run or adding a test that will always skip.

> Locally I have checked out the LFS object, so the validation script can run locally to check the quality of the reconstructed dataset. In the CI workflow, however, since we have removed the step of fetching LFS files, this test file will be skipped. This looks sensible - the developers can check the quality of the dataset reconstruction in the local environment, and the reconstructed dataset will be used in the CI workflow for profiling later without the need of running the quality check again.

Yes, but this requires that the whole repo retains git LFS so that this flow can be used. And we had deliberately reduced the dataset size to avoid using git LFS.

I think the point here is that we need a block of data to use in profiling:

  • We have a block of real data that we can tile at the moment.
  • We could adopt this as a time series model without really caring about the precision of the replication. We probably genuinely don't care, although it would be good to have provenance.
  • I had been vaguely thinking about procedurally generated data (something like Perlin noise) instead.
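For what a procedurally generated alternative could look like, here is a minimal 1-D value-noise sketch (a simpler cousin of Perlin noise; all names and parameters are illustrative, not a proposal from the PR):

```python
import numpy as np


def value_noise_1d(n_points, n_knots, rng):
    """Smoothly interpolated random knot values: deterministic, tuneable
    pseudo-data with no provenance issues and no large file to store."""
    knots = rng.standard_normal(n_knots)
    x = np.linspace(0, n_knots - 1, n_points)
    i = np.clip(np.floor(x).astype(int), 0, n_knots - 2)
    frac = x - i
    smooth = frac * frac * (3 - 2 * frac)  # smoothstep easing between knots
    return knots[i] * (1 - smooth) + knots[i + 1] * smooth


rng = np.random.default_rng(42)
series = value_noise_1d(1000, 12, rng)
print(series.shape)  # (1000,)
```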


tztsai commented Feb 21, 2024

>>>> I think the validation script only needs to run once to ensure the reconstructed dataset is similar enough to the original dataset.

>>> Well, yes but the PR adds tests/regression/data/test_synth_data.py to the regression testing suite and that is entirely reliant on the LFS file. It skips if the file is missing, but we're either restoring the need for LFS if we want that file to run or adding a test that will always skip.

>> Locally I have checked out the LFS object, so the validation script can run locally to check the quality of the reconstructed dataset. In the CI workflow, however, since we have removed the step of fetching LFS files, this test file will be skipped. This looks sensible - the developers can check the quality of the dataset reconstruction in the local environment, and the reconstructed dataset will be used in the CI workflow for profiling later without the need of running the quality check again.

> Yes but this requires that the whole repo retains the use of git lfs so that this flow can be used. And we had deliberately reduced the dataset size to avoid using git lfs.

A simple solution at the moment would be to validate the generated dataset using the reduced (1 year) dataset which does not require LFS.

@tztsai tztsai requested a review from davidorme February 27, 2024 10:57
@MarionBWeinzierl

When #189 is implemented, the dataset can also be uploaded to Zenodo.

@MarionBWeinzierl

Another option would be to use the reduced dataset for validity checking.

@davidorme

@MarionBWeinzierl

With #256, I think this becomes stale and we can close it. It only really exists because of the processing time and data file size limitations of trying to run the profiling on GitHub Actions.

If we move back to local profiling as planned, we'd still need test datasets, but I think we can scale them back up to give more stringent and stable tests, and then store them on e.g. Zenodo. We can then close this.

@MarionBWeinzierl

Happy to have this closed.


tztsai commented Oct 22, 2024

Storing the dataset on Zenodo is a good alternative.

@davidorme

Closed as we have moved away from profiling on GitHub runners, which removes the need to compress the profiling datasets.

@davidorme davidorme closed this Oct 22, 2024
@davidorme davidorme removed their request for review October 22, 2024 17:24
@davidorme davidorme deleted the synth-data branch October 22, 2024 17:25
Labels
enhancement (New feature or request), rse
Projects
Status: Done
4 participants