
Add new models (Neural Networks, Ridge Regression, Gradient Boosting) + Ensemble method #54

Open · wants to merge 31 commits into base: main

Conversation

@tztsai (Collaborator) commented Jun 10, 2024

In order to implement a neural network to replace the bagging trees for MLacc, we need to:

  • For each ipool, collect all input datasets into a single dataset for NN training
  • Restore the time dimension of the dataset at monthly frequency to increase the data size
  • Implement and train an NN for each ipool; tune the hyperparameters using a validation dataset
  • Test the evaluation performance and compare it with that of the previous ML model
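The per-ipool training loop outlined above could be sketched roughly as follows. This is an illustrative sketch only: `train_pool_model`, the hidden-layer grid, and the 80/20 validation split are assumptions, not the project's actual code in Tools/ML.py.

```python
# Hedged sketch: train one NN per ipool, tuning hyperparameters on a
# held-out validation split. All names and settings are illustrative.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def train_pool_model(X, y, seed=0):
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    best_model, best_score = None, -np.inf
    # Small hyperparameter sweep, scored on the validation split
    for hidden in [(32,), (64,), (64, 32)]:
        model = make_pipeline(
            StandardScaler(),
            MLPRegressor(hidden_layer_sizes=hidden, max_iter=1000,
                         random_state=seed),
        )
        model.fit(X_train, y_train)
        score = model.score(X_val, y_val)  # R^2 on held-out data
        if score > best_score:
            best_model, best_score = model, score
    return best_model, best_score
```

In practice this would be called once per ipool, with each pool's collected dataset.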

Fixes #53 #54 #57

@tztsai tztsai marked this pull request as draft June 10, 2024 17:53
@tztsai tztsai changed the base branch from main to reformat-data June 10, 2024 17:53
@tztsai tztsai changed the base branch from reformat-data to main June 11, 2024 10:47
@tztsai tztsai changed the base branch from main to reformat-data June 11, 2024 10:47
@ma595 ma595 force-pushed the reformat-data branch 2 times, most recently from 62fc49f to 36d5274 Compare June 18, 2024 09:50
@ma595 ma595 changed the base branch from reformat-data to main June 18, 2024 09:55
@ma595 (Collaborator) commented Jul 18, 2024

This needs a bit of a clean up / rebase.

@ma595 ma595 force-pushed the nn branch 2 times, most recently from be527fc to fa21916 Compare July 23, 2024 15:58
Tools/mapGlobe.py (outdated, resolved)
Tools/ML.py (outdated, resolved)
Tools/ML.py (outdated, resolved)
@ma595 (Collaborator) left a comment

@tztsai Checking your changes from last week. Can you please clarify a few points for my benefit? Thanks.

Tools/ML.py (outdated, resolved)
Tools/mapGlobe.py (resolved)
@tztsai tztsai marked this pull request as ready for review August 28, 2024 12:14
@tztsai tztsai self-assigned this Aug 29, 2024
@ma595 (Collaborator) commented Sep 16, 2024

@tztsai This is really good work. I will run the code and review over the next few days.

Because of the size of this PR and the fact that it touches so many files, it might be good to split it in two; this would make it easier to review. My idea is two PRs that do the following:

    1. New refactored data pipeline and an associated test to check the results are the same as before (if relevant), probably stopping at 60cfbdf;
    2. Additional ML models, tuning and parallelisation, in a follow-up PR.

I would suggest adding some comments in the docstrings to provide a high level summary of what some of the functions are doing. Some more comments in the code itself would be helpful too. I am happy to take some of this on as I review.

It would also be useful to have an automated test to ensure that regressions to the data pipeline have not been introduced:

  • Show that the same dataset is being generated at the start of training as is generated in the old code (main).
  • Do tell me if this doesn't make sense, for instance due to any stochasticity in the pipeline.
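The regression check suggested above might look roughly like the sketch below: dump `combine_XY` from both branches to CSV and compare the columns they share. The helper name and file layout are assumptions for illustration.

```python
# Hedged sketch of a data-pipeline regression test: compare a dataset dump
# from this branch against a reference dump generated from main.
import pandas as pd

def assert_dataset_unchanged(new_path, ref_path, rtol=1e-6):
    """Fail if the shared columns of the two dataset dumps differ."""
    new = pd.read_csv(new_path, index_col=0).sort_index(axis=1)
    ref = pd.read_csv(ref_path, index_col=0).sort_index(axis=1)
    # Restrict to the columns both versions produce, then compare values
    shared = ref.columns.intersection(new.columns)
    pd.testing.assert_frame_equal(
        new[shared], ref[shared], check_exact=False, rtol=rtol
    )
```

A tolerance-based comparison (rather than exact equality) leaves room for harmless floating-point differences; any stochasticity in the pipeline would need to be seeded first.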

On the ML side, could you provide a summary of how the performance changes? I noticed you made a CHANGES.txt file. Possibly summarise the improvements to ML performance there for now.

I looked at increasing the size of Ncc as well, and it seems that the increase you first tried was beyond the size of the available data. CHANGES.txt still states that the Ncc size was increased; could this please be corrected?

@ma595 (Collaborator) commented Sep 16, 2024

d6ea910
The message associated with this commit does not fully reflect what has been changed.

Tools/readvar.py (outdated, resolved)
@ma595 (Collaborator) commented Sep 18, 2024

The performance is starting to look good, well done @tztsai! Just a comment on splitting up: I plan to split this into 2 or 3 PRs, but I am unsure whether 2 or 3 is right, as the refactoring and the new ML implementation are quite tightly coupled.

@ma595 ma595 linked an issue Oct 16, 2024 that may be closed by this pull request
@dorchard dorchard added the iccs label Nov 4, 2024
@ma595 (Collaborator) commented Nov 20, 2024

OK, as this PR has been sitting here for a few months, I have decided to squash the changes instead of rebasing and cleaning up in the interest of time. I'll use the CHANGES.md as the basis of the squash message.

Collaborator:

This needs reverting to the original values

Collaborator:

Can you provide a bit more explanation within the PR as to why these changes need to be made to extrp_global?

Collaborator:

Can you give a high level summary of what's happening in this file?

Tools/train.py (resolved)
res_df = pd.concat(result, keys=Yvar.keys(), names=["comp"])
print(res_df)

scores = res_df.mean()[["R2", "slope"]].to_frame().T
Collaborator:

What is this additional code doing here?

Collaborator:

Is it all for just checking if performance has been degraded?

Collaborator (author):

Yes, but the MLacc_results.csv format was also changed (including a new column indicating the algorithm, aggregated scores, etc.)
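The aggregation in the excerpt quoted earlier in this thread can be sketched as follows; the component names and score values here are invented for illustration, not real MLacc results.

```python
# Hedged sketch of the score aggregation: stack per-component score tables
# under a "comp" index level, then summarise mean R2 and slope in one row.
import pandas as pd

result = [
    pd.DataFrame({"R2": [0.91, 0.88], "slope": [0.97, 1.02]}),  # e.g. biomass
    pd.DataFrame({"R2": [0.84, 0.86], "slope": [0.95, 0.99]}),  # e.g. soil
]
res_df = pd.concat(result, keys=["biomass", "soil"], names=["comp"])
# One-row summary: mean R2 and slope across all components
scores = res_df.mean()[["R2", "slope"]].to_frame().T
print(scores)
```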

Global_Predicted_Y_map = predY
else:
Global_Predicted_Y_map, predY = mapGlobe.extrp_global(
packdata,
ipft,
PFT_mask,
var_pred_name,
Tree_Ens,
combine_XY.columns.drop("Y"),
Collaborator:

Can you explain this change?

Collaborator (author):

Previously, var_pred_name was hand-coded as a list of column names. Now that we have changed the input features in readvar.py (adding a few new features), it needs to be derived as combine_XY.columns.drop("Y")
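A small illustration of deriving the predictor names from the combined table, as described above; the column names in this snippet are invented examples, not the project's actual feature set.

```python
# Hedged sketch: var_pred_name becomes every column except the target "Y",
# so newly added features are picked up automatically.
import pandas as pd

combine_XY = pd.DataFrame(columns=["Tmean", "Rainf_mean", "clay_frac", "Y"])
var_pred_name = combine_XY.columns.drop("Y")
print(list(var_pred_name))
```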

res[f"dim_{i+1}"] = k
res[f"ind_{i+1}"] = v
result.append(res)
# break
Collaborator:

Can these be removed?

Collaborator:

Can this file be presented in a better state to the user, i.e. with a good set of defaults?

Collaborator:

Why is MLmap now removed? Is everything now taken care of with MLmap_multidim?

Collaborator (author):

Yes, this function was never used.

- Implemented multithreaded parallelisation to train an ML model per target variable
- Added standard scaling to preprocess the data before ML training
- Updated README.md and CONTRIBUTING.md
- Added an explanation of the varlist.json file
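The first two changelog items above could be sketched roughly as follows. This is an assumed illustration (model choice, function names, and the thread-pool layout are not the project's actual code), showing per-target training in a thread pool with standard scaling before fitting.

```python
# Hedged sketch: fit one scaled model per target variable, in parallel.
from concurrent.futures import ThreadPoolExecutor

from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def fit_one_target(X, y):
    # Standard scaling happens inside the pipeline, before the regressor
    model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
    return model.fit(X, y)

def fit_all_targets(X, Y):
    """Y maps target-variable name -> 1-D array of labels."""
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(fit_one_target, X, y)
                   for name, y in Y.items()}
    return {name: f.result() for name, f in futures.items()}
```

Threads (rather than processes) are a reasonable fit here because scikit-learn fits release the GIL in their numerical kernels.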
Collaborator:

Please add a bit more content to this file as outlined below.

@ma595 ma595 changed the title Replace Bagging Trees with Neural Networks for MLacc Add new models (Neural Networks, Ridge Regression, Gradient Boosting) + Ensemble method Dec 3, 2024
@ma595 (Collaborator) commented Dec 11, 2024

A few updates:

  • I've rebased changes from main and am working through them.
  • First observation after rebasing: none of the tests currently pass.
    • Tamp was excluded from packdata (causing test_task1.py to fail). Was there any justification for this, @tztsai?
    • All temperature statistics were incorrect because Tair was being overwritten after conversion to Celsius.
    • I added Tamp back and reordered the code in Cluster.py; test_init.py, test_task1.py and test_task2.py now pass.
  • test_task4.py still fails, which isn't a big surprise.
    • The debugging strategy was to dump combine_XY on both the main and nn branches. Obviously differences emerge:
    • We conclude that this is because the nn branch has some additional variables: ['LWdown_std', 'PSurf_std', 'Qair_std', 'SWdown_std', 'Snowf_mean', 'Snowf_std'] (27 columns of data vs 21).
    • All other columns match between the nn and main branches.
Index(['Unnamed: 0', 'GS_length', 'LAI0', 'LWdown_mean', 'LWdown_std', 'NPP0',
       'PSurf_mean', 'PSurf_std', 'Pre_GS', 'Qair_mean', 'Qair_std',
       'Rainf_mean', 'Rainf_std', 'SWdown_mean', 'SWdown_std', 'Snowf_mean',
       'Snowf_std', 'Tamp', 'Temp_GS', 'Tmax', 'Tmean', 'Tmin', 'Tstd', 'Y',
       'clay_frac', 'interx1', 'interx2'],
      dtype='object')
Index(['Unnamed: 0', 'GS_length', 'LAI0', 'LWdown_mean', 'NPP0', 'PSurf_mean',
       'Pre_GS', 'Qair_mean', 'Rainf_mean', 'Rainf_std', 'SWdown_mean', 'Tamp',
       'Temp_GS', 'Tmax', 'Tmean', 'Tmin', 'Tstd', 'Y', 'clay_frac', 'interx1',
       'interx2'],
      dtype='object')
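The column comparison above can be reproduced with a set difference on the two indexes; the lists below are abbreviated versions of the dumps for illustration.

```python
# Hedged sketch: diff combine_XY columns between branches to find the
# extra variables (column lists abbreviated from the dumps above).
import pandas as pd

nn_cols = pd.Index(['GS_length', 'LAI0', 'LWdown_mean', 'LWdown_std',
                    'PSurf_mean', 'PSurf_std', 'Qair_std', 'Y'])
main_cols = pd.Index(['GS_length', 'LAI0', 'LWdown_mean', 'PSurf_mean', 'Y'])
extra = nn_cols.difference(main_cols)  # columns only the nn branch has
print(list(extra))
```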


Successfully merging this pull request may close these issues:

  • Update step 5 (evaluation)
  • Replace the bagging trees with a single NN model for MLacc

4 participants