Add new models (Neural Networks, Ridge Regression, Gradient Boosting) + Ensemble method #54
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,50 @@ | ||
# Summary of Changes | ||
|
||
- Converted dataset format to xarray.Dataset - merged auxil into packdata, etc. | ||
- Reformatted documentation and added more comments | ||
- Rewrote tests as pytest files | ||
- Reformatted the config file as a Python file | ||
- Used ruff for reformatting and linting | ||
- Added pre-commit checks and a GitHub workflow for CI checks | ||
- Added input features (std of some variables like Qair, Psurf, etc.) | ||
- Increased Nc values (by a factor of 2) | ||
Review comment: I think this should be removed, as it shouldn't be relied on for improving performance. |
||
- Instead of averaging over the whole time span, took only monthly averages and separated the data per year to increase dataset size | ||
- Simplified readvar.py | ||
- Added new ML algorithm options: XGBoost, RandomForest, MLP, Lasso, Stacking Ensemble | ||
- Combined all ML evaluation results into a single CSV table | ||
- Implemented multithreaded parallelization to train one ML model per target variable in parallel | ||
- Added standard scaling to preprocess the data before ML training | ||
- Updated README.md and CONTRIBUTING.md | ||
- Added explanation of the varlist.json file | ||
Review comment: Please add a bit more content to this file as outlined below. |
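The standard-scaling and per-target parallel training items in the list above can be pictured with a minimal, standard-library sketch. It is not the PR's implementation: the function names (`standardize`, `fit_line`, `train_all`), the toy data, and the one-feature least-squares model are all assumptions, chosen only to keep the example self-contained.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import fmean, pstdev


def standardize(values):
    """Scale values to zero mean and unit variance (standard scaling)."""
    mean = fmean(values)
    std = pstdev(values) or 1.0  # fall back to 1.0 for a constant feature
    return [(v - mean) / std for v in values]


def fit_line(x, y):
    """Ordinary least-squares fit of y = a * x + b; returns (a, b)."""
    mx, my = fmean(x), fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    a = sxy / sxx
    return a, my - a * mx


def train_all(feature, targets, max_workers=4):
    """Fit one model per target variable, each submitted to a worker thread."""
    x = standardize(feature)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {name: pool.submit(fit_line, x, y) for name, y in targets.items()}
        return {name: fut.result() for name, fut in futures.items()}


# Hypothetical toy data: one input feature and two target variables.
feature = [1.0, 2.0, 3.0, 4.0]
targets = {"target_a": [2.0, 4.0, 6.0, 8.0], "target_b": [1.0, 1.0, 1.0, 1.0]}
models = train_all(feature, targets)
```

Threads mainly pay off in Python when the model fitting spends its time in native code that releases the interpreter lock; whether that holds for the estimators used in this PR is not stated here.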
||
|
||
## TODO | ||
- Investigate, for poorly performing variables: feature importance, target variable correlation, sample size | ||
|
||
## Performance Benchmark | ||
|
||
### Separated Years | ||
| Algorithm | R2 | slope | | ||
|-----------|--------------------|--------------------| | ||
| bt | 0.6528848053359372 | 0.9542832780095561 | | ||
| rf | **0.6530125681385772** | 0.9540960891785613 | | ||
| gbm | 0.6512312928704228 | 0.950301142771462 | | ||
| lasso | 0.3712954690899892 | **0.9877068027413212** | | ||
| stack | 0.6525452816395577 | 0.9525996582374604 | | ||
|
||
### Separated Years without BAT | ||
| Algorithm | R2 | slope | | ||
|-----------|--------------------|--------------------| | ||
| bt | **0.9324186841654837** | 0.9485227574788934 | | ||
| rf | 0.932251880129098 | 0.9485025443202926 | | ||
| gbm | 0.9302603471413085 | 0.9438685125142494 | | ||
| lasso | 0.8361626491858671 | **0.9557082517475748** | | ||
| stack | 0.9314316345860048 | 0.9493522690938773 | | ||
|
||
### Averaged Years | ||
| Algorithm | R2 | slope | | ||
|-----------|--------------------|--------------------| | ||
| bt | 0.31926695871812644| 0.9591009540297856 | | ||
| rf | 0.321636483649443 | 0.9590895662312189 | | ||
| gbm | **0.328685618778433** | 0.9583401949919101 | | ||
| lasso | 0.09302916905772492| **0.9930868709808638** | | ||
| stack | 0.3225905817064779 | 0.9753542956777028 | |
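For readers unsure how the two benchmark columns might be computed, here is a small sketch. The tables do not define their metrics, so both definitions are assumptions: R2 is taken as the usual coefficient of determination, and slope as the least-squares slope of predicted values regressed on observed values. The function names and sample numbers are illustrative only.

```python
from statistics import fmean


def r2_score(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = fmean(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def regression_slope(observed, predicted):
    """Least-squares slope of predicted regressed on observed (assumed definition)."""
    mo, mp = fmean(observed), fmean(predicted)
    cov = sum((o - mo) * (p - mp) for o, p in zip(observed, predicted))
    var = sum((o - mo) ** 2 for o in observed)
    return cov / var


# Hypothetical observed and predicted values for one target variable.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
r2 = r2_score(observed, predicted)
slope = regression_slope(observed, predicted)
```

Under these definitions a perfect model gives R2 = 1 and slope = 1, which is consistent with the tables bolding the highest R2 and the slope closest to 1.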
Review comment: Can this file be presented in a better state to the user? i.e. A good set of defaults. |
Review comment: This needs reverting to the original values. |
Review comment: Does this improve performance? What are Qair, Psurf? Why did we take std of some variables before and not now? Are these now used in training or just stored in Packdata.nc?
Review comment: The original code did not take the std of these variables; I added them.
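To make the discussion of std features and per-year monthly averages concrete, the sketch below groups samples by (year, month) and computes both the mean and the standard deviation for each group. It is a plain-Python stand-in, not the PR's xarray code: the function name `monthly_stats`, the sample values, and the use of the population standard deviation are assumptions.

```python
from collections import defaultdict
from statistics import fmean, pstdev


def monthly_stats(samples):
    """Group (year, month, value) samples; return (mean, std) per (year, month)."""
    groups = defaultdict(list)
    for year, month, value in samples:
        groups[(year, month)].append(value)
    return {key: (fmean(vals), pstdev(vals)) for key, vals in groups.items()}


# Hypothetical sub-monthly samples of a single input variable.
samples = [
    (1901, 1, 2.0), (1901, 1, 4.0),
    (1901, 2, 5.0), (1901, 2, 5.0),
    (1902, 1, 1.0), (1902, 1, 3.0),
]
stats = monthly_stats(samples)
```

Keeping the year in the grouping key yields one row per month per year rather than one row per month overall, which is the dataset-size effect the summary describes.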