
Add new models (Neural Networks, Ridge Regression, Gradient Boosting) + Ensemble method #54

Open · wants to merge 31 commits into base: main

Conversation

@tztsai (Collaborator) commented Jun 10, 2024

In order to implement a neural network to replace the bagging trees for MLacc, we need to:

  • For each ipool, collect all input datasets into a single dataset for NN training
  • Restore the time dimension of the dataset at monthly frequency to increase the data size
  • Implement and train an NN for each ipool; tune the hyperparameters using a validation dataset
  • Test the evaluation performance and compare it with that of the previous ML model
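The per-ipool training loop outlined above could be sketched roughly as follows. This is an illustrative sketch only: `train_pool_model`, the hidden-layer grid, and the 80/20 validation split are assumptions, not the project's actual code in Tools/ML.py.

```python
# Hedged sketch: train one NN per ipool, tuning hyperparameters on a
# held-out validation split. All names and settings are illustrative.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def train_pool_model(X, y, seed=0):
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    best_model, best_score = None, -np.inf
    # Small hyperparameter sweep, scored on the validation split
    for hidden in [(32,), (64,), (64, 32)]:
        model = make_pipeline(
            StandardScaler(),
            MLPRegressor(hidden_layer_sizes=hidden, max_iter=1000,
                         random_state=seed),
        )
        model.fit(X_train, y_train)
        score = model.score(X_val, y_val)  # R^2 on held-out data
        if score > best_score:
            best_model, best_score = model, score
    return best_model, best_score
```

In practice this would be called once per ipool, with each pool's collected dataset.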

Fixes #53 #54 #57

@tztsai tztsai marked this pull request as draft June 10, 2024 17:53
@tztsai tztsai changed the base branch from main to reformat-data June 10, 2024 17:53
@tztsai tztsai changed the base branch from reformat-data to main June 11, 2024 10:47
@tztsai tztsai changed the base branch from main to reformat-data June 11, 2024 10:47
@ma595 ma595 force-pushed the reformat-data branch 2 times, most recently from 62fc49f to 36d5274 Compare June 18, 2024 09:50
@ma595 ma595 changed the base branch from reformat-data to main June 18, 2024 09:55
@ma595 (Collaborator) commented Jul 18, 2024

This needs a bit of a clean up / rebase.

@ma595 ma595 force-pushed the nn branch 2 times, most recently from be527fc to fa21916 Compare July 23, 2024 15:58
Tools/mapGlobe.py (outdated, resolved)
Tools/ML.py (outdated, resolved)
Tools/ML.py (outdated, resolved)
@ma595 (Collaborator) left a comment

@tztsai Checking your changes from last week. Can you please clarify a few points for my benefit? Thanks.

Tools/ML.py (outdated, resolved)
Tools/mapGlobe.py (resolved)
@tztsai tztsai marked this pull request as ready for review August 28, 2024 12:14
@tztsai tztsai self-assigned this Aug 29, 2024
@ma595 (Collaborator) commented Sep 16, 2024

@tztsai This is really good work. I will run the code and review over the next few days.

Because of the size of this PR and the fact that it touches so many files, it might be good to split it in two; this would make it easier to review. My idea is two PRs that do the following:

    1. New refactored data pipeline and an associated test to check the results are the same as before (if relevant), probably stopping at 60cfbdf;
    2. Additional ML models, tuning and parallelisation, in a follow-up PR.

I would suggest adding some comments in the docstrings to provide a high level summary of what some of the functions are doing. Some more comments in the code itself would be helpful too. I am happy to take some of this on as I review.

It would also be useful to have an automated test to ensure that regressions to the data pipeline have not been introduced:

  • Show that the same dataset is being generated at the start of training as is generated in the old code (main).
  • Do tell me if this doesn't make sense, for instance due to any stochasticity in the pipeline.
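The regression check suggested above might look roughly like the sketch below: dump `combine_XY` from both branches to CSV and compare the columns they share. The helper name and file layout are assumptions for illustration.

```python
# Hedged sketch of a data-pipeline regression test: compare a dataset dump
# from this branch against a reference dump generated from main.
import pandas as pd

def assert_dataset_unchanged(new_path, ref_path, rtol=1e-6):
    """Fail if the shared columns of the two dataset dumps differ."""
    new = pd.read_csv(new_path, index_col=0).sort_index(axis=1)
    ref = pd.read_csv(ref_path, index_col=0).sort_index(axis=1)
    # Restrict to the columns both versions produce, then compare values
    shared = ref.columns.intersection(new.columns)
    pd.testing.assert_frame_equal(
        new[shared], ref[shared], check_exact=False, rtol=rtol
    )
```

A tolerance-based comparison (rather than exact equality) leaves room for harmless floating-point differences; any stochasticity in the pipeline would need to be seeded first.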

On the ML side, could you provide a summary of how the performance changes? I noticed you made a CHANGES.txt file. Possibly summarise the improvements to ML performance there for now.

I looked at increasing the size of Ncc as well, and it seems that the increase you first tried was beyond the size of the available data. CHANGES.txt still states that the Ncc size was increased; could this please be corrected?

@ma595 (Collaborator) commented Sep 16, 2024

d6ea910
The message associated with this commit does not fully reflect what has been changed.

Tools/readvar.py (outdated, resolved)
@ma595 (Collaborator) commented Sep 18, 2024

The performance is starting to look good, well done @tztsai! Just a comment on splitting up: I plan to split this into 2 or 3 PRs, but I am unsure whether 2 or 3 is right, as the refactoring and the new ML implementation are quite tightly coupled.

@ma595 ma595 linked an issue Oct 16, 2024 that may be closed by this pull request
@dorchard dorchard added the iccs label Nov 4, 2024
@ma595 (Collaborator) commented Nov 20, 2024

OK, as this PR has been sitting here for a few months, I have decided to squash the changes instead of rebasing and cleaning up in the interest of time. I'll use the CHANGES.md as the basis of the squash message.

Collaborator:

This needs reverting to the original values

Collaborator:

Can you provide a bit more explanation within the PR as to why these changes need to be made to extrp_global?

Collaborator:

Can you give a high level summary of what's happening in this file?

Tools/train.py (resolved)
res_df = pd.concat(result, keys=Yvar.keys(), names=["comp"])
print(res_df)

scores = res_df.mean()[["R2", "slope"]].to_frame().T
Collaborator:

What is this additional code doing here?

Collaborator:

Is it all for just checking if performance has been degraded?

Collaborator (author):

Yes, but the MLacc_results.csv format was also changed (including a new column indicating the algorithm, aggregated scores, etc.)
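The aggregation in the excerpt quoted earlier in this thread can be sketched as follows; the component names and score values here are invented for illustration, not real MLacc results.

```python
# Hedged sketch of the score aggregation: stack per-component score tables
# under a "comp" index level, then summarise mean R2 and slope in one row.
import pandas as pd

result = [
    pd.DataFrame({"R2": [0.91, 0.88], "slope": [0.97, 1.02]}),  # e.g. biomass
    pd.DataFrame({"R2": [0.84, 0.86], "slope": [0.95, 0.99]}),  # e.g. soil
]
res_df = pd.concat(result, keys=["biomass", "soil"], names=["comp"])
# One-row summary: mean R2 and slope across all components
scores = res_df.mean()[["R2", "slope"]].to_frame().T
print(scores)
```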

Global_Predicted_Y_map = predY
else:
Global_Predicted_Y_map, predY = mapGlobe.extrp_global(
packdata,
ipft,
PFT_mask,
var_pred_name,
Tree_Ens,
combine_XY.columns.drop("Y"),
Collaborator:

Can you explain this change?

Collaborator (author):

Previously, var_pred_name was hand-coded as a list of column names. Now that we have changed the input features in readvar.py (adding a few new features), it needs to be derived as combine_XY.columns.drop("Y")
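A small illustration of deriving the predictor names from the combined table, as described above; the column names in this snippet are invented examples, not the project's actual feature set.

```python
# Hedged sketch: var_pred_name becomes every column except the target "Y",
# so newly added features are picked up automatically.
import pandas as pd

combine_XY = pd.DataFrame(columns=["Tmean", "Rainf_mean", "clay_frac", "Y"])
var_pred_name = combine_XY.columns.drop("Y")
print(list(var_pred_name))
```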

res[f"dim_{i+1}"] = k
res[f"ind_{i+1}"] = v
result.append(res)
# break
Collaborator:

Can these be removed?

Collaborator:

Can this file be presented in a better state to the user, i.e. with a good set of defaults?

Collaborator:

Why is MLmap now removed? Is everything now taken care of with MLmap_multidim?

Collaborator (author):

Yes, this function was never used.

- Implemented multithreaded parallelisation to train an ML model per target variable
- Added standard scaling to preprocess the data before ML training
- Updated README.md and CONTRIBUTING.md
- Added an explanation of the varlist.json file
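The first two changelog items above could be sketched roughly as follows. This is an assumed illustration (model choice, function names, and the thread-pool layout are not the project's actual code), showing per-target training in a thread pool with standard scaling before fitting.

```python
# Hedged sketch: fit one scaled model per target variable, in parallel.
from concurrent.futures import ThreadPoolExecutor

from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def fit_one_target(X, y):
    # Standard scaling happens inside the pipeline, before the regressor
    model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
    return model.fit(X, y)

def fit_all_targets(X, Y):
    """Y maps target-variable name -> 1-D array of labels."""
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(fit_one_target, X, y)
                   for name, y in Y.items()}
    return {name: f.result() for name, f in futures.items()}
```

Threads (rather than processes) are a reasonable fit here because scikit-learn fits release the GIL in their numerical kernels.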
Collaborator:

Please add a bit more content to this file as outlined below.

@ma595 ma595 changed the title Replace Bagging Trees with Neural Networks for MLacc Add new models (Neural Networks, Ridge Regression, Gradient Boosting) + Ensemble method Dec 3, 2024
@ma595 (Collaborator) commented Dec 11, 2024

A few updates:

  • I've rebased changes from main and am working through them.
  • First observation after rebasing: none of the tests currently pass.
    • Tamp was excluded from packdata (causing test_task1.py to fail). Was there any justification for this, @tztsai?
    • All temperature statistics were incorrect because Tair was being overwritten after conversion to Celsius.
    • I added Tamp back and reordered the code in Cluster.py; test_init.py, test_task1.py and test_task2.py now pass.
  • test_task4.py still fails, which isn't a big surprise.
    • The debugging strategy was to dump combine_XY on both the main and nn branches. Obviously differences emerge:
    • We conclude that this is because the nn branch has some additional variables: ['LWdown_std', 'PSurf_std', 'Qair_std', 'SWdown_std', 'Snowf_mean', 'Snowf_std'] (27 columns of data vs 21).
    • All other columns match between the nn and main branches.
Index(['Unnamed: 0', 'GS_length', 'LAI0', 'LWdown_mean', 'LWdown_std', 'NPP0',
       'PSurf_mean', 'PSurf_std', 'Pre_GS', 'Qair_mean', 'Qair_std',
       'Rainf_mean', 'Rainf_std', 'SWdown_mean', 'SWdown_std', 'Snowf_mean',
       'Snowf_std', 'Tamp', 'Temp_GS', 'Tmax', 'Tmean', 'Tmin', 'Tstd', 'Y',
       'clay_frac', 'interx1', 'interx2'],
      dtype='object')
Index(['Unnamed: 0', 'GS_length', 'LAI0', 'LWdown_mean', 'NPP0', 'PSurf_mean',
       'Pre_GS', 'Qair_mean', 'Rainf_mean', 'Rainf_std', 'SWdown_mean', 'Tamp',
       'Temp_GS', 'Tmax', 'Tmean', 'Tmin', 'Tstd', 'Y', 'clay_frac', 'interx1',
       'interx2'],
      dtype='object')
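The column comparison above can be reproduced with a set difference on the two indexes; the lists below are abbreviated versions of the dumps for illustration.

```python
# Hedged sketch: diff combine_XY columns between branches to find the
# extra variables (column lists abbreviated from the dumps above).
import pandas as pd

nn_cols = pd.Index(['GS_length', 'LAI0', 'LWdown_mean', 'LWdown_std',
                    'PSurf_mean', 'PSurf_std', 'Qair_std', 'Y'])
main_cols = pd.Index(['GS_length', 'LAI0', 'LWdown_mean', 'PSurf_mean', 'Y'])
extra = nn_cols.difference(main_cols)  # columns only the nn branch has
print(list(extra))
```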


Successfully merging this pull request may close these issues:

  • Update step 5 (evaluation)
  • Replace the bagging trees with a single NN model for MLacc

4 participants