Add new models (Neural Networks, Ridge Regression, Gradient Boosting) + Ensemble method #54
This needs a bit of a clean-up / rebase.
@tztsai Checking your changes from last week. Can you please clarify for my benefit? Thanks.
@tztsai This is really good work. I will run the code and review over the next few days. Because of the size of this PR and the fact it touches so many files, it might be good to split up into two. This would make it easier to review. My idea is to implement two PRs that do the following:
I would suggest adding some comments in the docstrings to provide a high level summary of what some of the functions are doing. Some more comments in the code itself would be helpful too. I am happy to take some of this on as I review. It would also be useful to have an automated test to ensure that regressions to the data pipeline have not been introduced:
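For example, something along these lines could work (a minimal pytest sketch; the reference file name, location, and tolerance are assumptions, not the project's actual layout):

```python
# Hypothetical regression-test sketch: compare the pipeline's score output
# against a stored reference copy checked into the repository.
import pandas as pd


def test_pipeline_scores_unchanged():
    reference = pd.read_csv("tests/reference_MLacc_results.csv")
    current = pd.read_csv("MLacc_results.csv")  # output of a fresh pipeline run
    # Allow small numerical drift but fail on real regressions.
    pd.testing.assert_frame_equal(current, reference, check_exact=False, atol=1e-6)
```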
On the ML side, could you provide a summary of how the performance changes? I noticed you made a CHANGES.txt file. Possibly summarise the improvements to ML performance there for now. I looked at increasing the size of Ncc as well, and it seems that the increase you first tried was beyond the size of the available data. CHANGES.txt still suggests that the Ncc size was increased. Could this please be corrected?
The performance is starting to look good, well done @tztsai! Just a comment on splitting up: I plan to split it into 2 or 3 PRs:
I am unsure whether to split into 2 or 3, as the refactoring and the new ML implementation are quite tightly coupled.
OK, as this PR has been sitting here for a few months, I have decided to squash the changes instead of rebasing and cleaning up, in the interest of time. I'll use the CHANGES.md as the basis of the squash message.
This needs reverting to the original values.
Can you provide a bit more explanation within the PR as to why these changes need to be made to extrp_global?
Can you give a high-level summary of what's happening in this file?
```python
res_df = pd.concat(result, keys=Yvar.keys(), names=["comp"])
print(res_df)

scores = res_df.mean()[["R2", "slope"]].to_frame().T
```
What is this additional code doing here?
Is it all for just checking if performance has been degraded?
Yes, but the MLacc_results.csv format was also changed (including a new column indicating the alg, aggregating scores, etc.).
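For anyone following along, a minimal sketch of that kind of aggregation (the algorithm labels and score values are made up; only the alg, R2, and slope column names come from the discussion above):

```python
import pandas as pd

# Hypothetical per-target results from two algorithms.
res_df = pd.DataFrame(
    {
        "alg": ["nn", "nn", "gbm", "gbm"],
        "R2": [0.91, 0.88, 0.93, 0.90],
        "slope": [0.97, 1.02, 0.99, 1.01],
    }
)

# Aggregate mean scores per algorithm before writing MLacc_results.csv.
scores = res_df.groupby("alg")[["R2", "slope"]].mean()
print(scores)
```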
```python
    Global_Predicted_Y_map = predY
else:
    Global_Predicted_Y_map, predY = mapGlobe.extrp_global(
        packdata,
        ipft,
        PFT_mask,
        var_pred_name,
        Tree_Ens,
        combine_XY.columns.drop("Y"),
```
Can you explain this change?
Previously, var_pred_name was hand-coded as a list of column names. Now that we have changed the input features in readvar.py (added a few new features), it needs to be updated to combine_XY.columns.drop("Y").
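To illustrate the pattern (the column names and values here are invented; only the combine_XY.columns.drop("Y") idiom comes from the diff):

```python
import pandas as pd

# Hypothetical combined training table: predictor columns plus the target "Y".
combine_XY = pd.DataFrame(
    {"tair": [280.0, 285.0], "precip": [1.2, 0.8], "Y": [0.4, 0.6]}
)

# Predictor names are derived from the data instead of being hand-coded,
# so features added in readvar.py are picked up automatically.
var_pred_name = combine_XY.columns.drop("Y")
print(list(var_pred_name))  # ['tair', 'precip']
```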
res[f"dim_{i+1}"] = k | ||
res[f"ind_{i+1}"] = v | ||
result.append(res) | ||
# break |
Can these be removed?
Can this file be presented in a better state to the user, i.e. with a good set of defaults?
Why is MLmap now removed? Is everything now taken care of with MLmap_multidim?
Yes, this function was never used.
- Implemented multithreaded parallelization to train an ML model per target variable
- Added standard scaling to preprocess the data before ML training
- Updated README.md and CONTRIBUTING.md
- Added explanation of the varlist.json file
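Roughly, the first two changelog items above combine like this (a sketch with made-up data; Ridge stands in for whichever model is actually trained per target):

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
targets = {"Y1": rng.normal(size=100), "Y2": rng.normal(size=100)}


def fit_one(item):
    name, y = item
    # Standard scaling happens inside the pipeline, before the regressor.
    model = make_pipeline(StandardScaler(), Ridge())
    return name, model.fit(X, y)


# One model per target variable, trained across a thread pool.
with ThreadPoolExecutor() as pool:
    models = dict(pool.map(fit_one, targets.items()))
```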
Please add a bit more content to this file as outlined below.
A few updates:

In order to implement a neural network to replace the bagging trees for MLacc, we need to:
- for each ipool, collect all input datasets into a single dataset for NN training
- per ipool, tune the hyperparameters using a validation dataset

Fixes #53 #54 #57
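As a closing note, a hedged sketch of that plan (MLPRegressor and the tiny hyperparameter grid are stand-ins; real inputs would be the single dataset collected across each ipool):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
# Stand-in for the collected training dataset.
X = rng.normal(size=(200, 8))
y = rng.normal(size=200)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Tune one hyperparameter (hidden-layer width) against the validation set.
best_h, best_score = None, -np.inf
for h in (16, 32, 64):
    model = MLPRegressor(hidden_layer_sizes=(h,), max_iter=500, random_state=0)
    model.fit(X_train, y_train)
    score = model.score(X_val, y_val)  # R^2 on the held-out validation data
    if score > best_score:
        best_h, best_score = h, score
print(f"best hidden size: {best_h} (validation R2 = {best_score:.3f})")
```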