Add new models (Neural Networks, Ridge Regression, Gradient Boosting) + Ensemble method #54

Open · wants to merge 31 commits into base: main

Commits (31, all by tztsai)
- d72354d: update tests and requirement.txt (Sep 11, 2024)
- d6ea910: enlarge dataset by increasing Nc and expanding year dim (Sep 11, 2024)
- 60cfbdf: fix minor issues and add CHANGES.txt (Sep 12, 2024)
- 5a8790c: remove old test log files (Sep 12, 2024)
- c46cb68: add docstrings and performance check after ML evaluation (Sep 12, 2024)
- f27f764: Reduce Ncc (Sep 13, 2024)
- ac934f6: implement multithread parallelization (Sep 16, 2024)
- d1a934a: add more estimators (Sep 16, 2024)
- 7556550: update CHANGES.txt (Sep 17, 2024)
- fe9c508: add standard scaling (Sep 17, 2024)
- cd83184: fix sample weight argument (Sep 17, 2024)
- 2f64489: Updated CHANGES.txt and add model score logging (Sep 17, 2024)
- 62dd7a5: run benchmarking (Sep 17, 2024)
- 0d51ed0: Update NN args and added performance to CHANGES.txt (Sep 18, 2024)
- 4e69bd7: Add take_year_average in config (Sep 18, 2024)
- eee043d: Reformat CHANGES.txt into markdown (Sep 18, 2024)
- 7b09be8: Reformat benchmark table (Sep 18, 2024)
- 2fadec7: Pass config instead of resultpath and loocv to ML functions (Sep 18, 2024)
- d7bc5de: Add benchmark for training without BAT (Sep 18, 2024)
- e00b304: fix eval plotting (Sep 19, 2024)
- 095303b: tune hyperparams (Sep 20, 2024)
- 3d21149: add "best" alg selection (Sep 23, 2024)
- 002124f: fix bug (Sep 23, 2024)
- a37a23c: rerun benchmark (Sep 24, 2024)
- 0cedc5f: trying ridge and update CHANGES.md (Sep 24, 2024)
- 8d25f6c: update select_best_model (Sep 24, 2024)
- e40693b: update select_best_model (Sep 24, 2024)
- 852babd: improve select_best_model (Sep 24, 2024)
- c229433: fix MLacc_results index (Oct 15, 2024)
- 85be85f: fix issues with ipft and ivar in labels (Oct 16, 2024)
- 94d289b: replace np.arr with np.ma (Oct 17, 2024)
CHANGES.md (+50, -0, new file)

# Summary of Changes

- Converted the dataset format to xarray.Dataset (merged auxil into packdata, etc.)
- Reformatted documentation and added more comments
- Rewrote tests as pytest files
- Reformatted the config file as a Python file
- Used ruff for reformatting and linting
- Added pre-commit checks and a GitHub workflow for CI checks
- Added input features (std of some variables like Qair, Psurf, etc.)

Collaborator:
Does this improve performance? What are Qair, Psurf? Why did we take std of some variables before and not now? Are these now used in training or just stored in Packdata.nc?

Collaborator (Author):
The original code did not take std of these variables, and I added them.
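
For context, a minimal sketch of how such std features could be computed with xarray. The helper name `add_std_features` and the `month` dimension name are assumptions for illustration; the PR's actual preprocessing may differ:

```python
import xarray as xr

def add_std_features(packdata: xr.Dataset, variables=("Qair", "Psurf")) -> xr.Dataset:
    """Add the standard deviation over the monthly dimension as extra input features."""
    for name in variables:
        if name in packdata:
            # e.g. Qair -> Qair_std, capturing the temporal variability of the variable
            packdata[f"{name}_std"] = packdata[name].std("month")
    return packdata
```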

- Increase Nc values (by a factor of 2)

Collaborator:
I think this should be removed as it shouldn't be relied on for improving performance.
- Instead of averaging over the whole time span, take only monthly averages and keep each year separate to increase the dataset size
- Simplified readvar.py
- Added new ML algorithm options: XGBoost, RandomForest, MLP, Lasso, Stacking Ensemble
- Combined all ML evaluation results into a single CSV table
- Implemented multithreading to train the ML models for different target variables in parallel
- Added standard scaling to preprocess the data before ML training
- Updated README.md and CONTRIBUTING.md
- Added explanation of the varlist.json file

Collaborator:
Please add a bit more content to this file as outlined below.
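
A minimal sketch of the multithreaded per-target training with standard scaling described in the changes list above. The function names and the RandomForest choice are illustrative assumptions, not the PR's actual API; the seed mirrors the config's `random_seed = 1000`:

```python
from concurrent.futures import ThreadPoolExecutor

from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def train_one(X, y):
    # Scale features before fitting, as described in the changes above.
    model = make_pipeline(StandardScaler(), RandomForestRegressor(random_state=1000))
    return model.fit(X, y)

def train_all(X, targets):
    # Train one model per target variable, each in its own thread.
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(train_one, X, y) for name, y in targets.items()}
        return {name: fut.result() for name, fut in futures.items()}
```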


## TODO
- For variables with poor performance, investigate feature importance, target-variable correlation, and sample size
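
One way the feature-importance item could be approached, as a hedged sketch using scikit-learn's permutation importance (this function and its arguments are illustrative, not part of this PR):

```python
from sklearn.inspection import permutation_importance

def rank_features(model, X_val, y_val, feature_names):
    # Rank input features by how much shuffling each one degrades R2
    # on held-out data; large drops indicate influential features.
    result = permutation_importance(
        model, X_val, y_val, scoring="r2", n_repeats=10, random_state=1000
    )
    order = result.importances_mean.argsort()[::-1]
    return [(feature_names[i], result.importances_mean[i]) for i in order]
```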

## Performance Benchmark

### Separated Years
| Algorithm | R2 | slope |
|-----------|--------------------|--------------------|
| bt | 0.6528848053359372 | 0.9542832780095561 |
| rf | **0.6530125681385772** | 0.9540960891785613 |
| gbm | 0.6512312928704228 | 0.950301142771462 |
| lasso | 0.3712954690899892 | **0.9877068027413212** |
| stack | 0.6525452816395577 | 0.9525996582374604 |

### Separated Years without BAT
| Algorithm | R2 | slope |
|-----------|--------------------|--------------------|
| bt | **0.9324186841654837** | 0.9485227574788934 |
| rf | 0.932251880129098 | 0.9485025443202926 |
| gbm | 0.9302603471413085 | 0.9438685125142494 |
| lasso | 0.8361626491858671 | **0.9557082517475748** |
| stack | 0.9314316345860048 | 0.9493522690938773 |

### Averaged Years
| Algorithm | R2 | slope |
|-----------|--------------------|--------------------|
| bt | 0.31926695871812644| 0.9591009540297856 |
| rf | 0.321636483649443 | 0.9590895662312189 |
| gbm | **0.328685618778433** | 0.9583401949919101 |
| lasso | 0.09302916905772492| **0.9930868709808638** |
| stack | 0.3225905817064779 | 0.9753542956777028 |
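
For reference, a plausible computation of the two columns above, assuming R2 is scikit-learn's coefficient of determination and slope is the least-squares slope of predicted vs. true values (the benchmark's exact definitions live in the evaluation code):

```python
import numpy as np
from sklearn.metrics import r2_score

def benchmark_metrics(y_true, y_pred):
    # R2: coefficient of determination; slope: least-squares slope of the
    # predicted-vs-true fit, where 1.0 indicates a well-calibrated model.
    slope, _intercept = np.polyfit(y_true, y_pred, 1)
    return r2_score(y_true, y_pred), slope
```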
DEF_Trunk/config.py (+17, -6)

Collaborator:
Can this file be presented in a better state to the user? i.e. a good set of defaults.

@@ -2,17 +2,28 @@

 logfile = "log.MLacc_Trunk"
 tasks = [
-    2,
+    # 2,
     4,
     5,
 ]  # 1=test clustering, 2=clustering, 3=compress forcing, 4=ML, 5=evaluation
 results_dir = "./EXE_DIR/"
 reference_dir = "/home/surface10/mrasolon/files_for_zenodo/reference/EXE_DIR/"
-start_from_scratch = True
+start_from_scratch = False
+take_year_average = False
+smote_bat = False
 kmeans_clusters = 4
 max_kmeans_clusters = 9
 random_seed = 1000
+algorithms = [
+    # "bt",
+    # "rf",
+    # "gbm",
+    # "nn",
+    # "ridge",
+    "best",
+]  # bt: BaggingTrees, rf: RandomForest, nn: MLPRegressor, gbm: XGBRegressor, lasso: Lasso, best: SelectBestModel
+leave_one_out_cv = False
-repro_test_task_1 = True
-repro_test_task_2 = True
-repro_test_task_3 = True
-repro_test_task_4 = True
+repro_test_task_1 = False
+repro_test_task_2 = False
+repro_test_task_3 = False
+repro_test_task_4 = False
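
The algorithm keys above map to estimators roughly as follows. This is a hypothetical sketch: the PR's actual factory and hyperparameters may differ, and `bt` (BaggingTrees) is taken here to mean scikit-learn's BaggingRegressor:

```python
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.linear_model import Lasso, Ridge
from sklearn.neural_network import MLPRegressor
from xgboost import XGBRegressor

# Hypothetical key-to-estimator mapping for the config's `algorithms` list.
ESTIMATORS = {
    "bt": BaggingRegressor,       # BaggingTrees
    "rf": RandomForestRegressor,  # RandomForest
    "gbm": XGBRegressor,          # gradient boosting
    "nn": MLPRegressor,           # neural network
    "lasso": Lasso,
    "ridge": Ridge,
}

def make_model(name, seed=1000):
    # "best" is not a single estimator: it trains the candidates and keeps
    # the top scorer (see the select_best_model commits in this PR).
    if name == "best":
        raise ValueError("'best' is resolved by model selection, not here")
    return ESTIMATORS[name](random_state=seed)
```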
DEF_Trunk/varlist.json (+1, -1)

Collaborator:
This needs reverting to the original values.

@@ -44,7 +44,7 @@
 {
 "test_K":[2,3,4,5,6,7,8,9],
 "pfts":[2,3,4,5,6,7,8,9,10,11,12,13,14,15],
-"Ncc":[10,20,10,10,10,20,20,20,10,10,10,10,10,10]
+"Ncc": [20, 40, 20, 20, 20, 40, 40, 40, 20, 20, 20, 20, 20, 20]
 },
 "resp":
 {
Tools/Cluster.py (+50, -22)

@@ -17,17 +17,26 @@
 from Tools import *


-##@param[in] packdata packaged data
-##@param[in] PFT_mask PFT mask where PFT fraction >0.01
-##@param[in] ipft ith PFT to deal with
-##@param[in] var_pred predicting variables
-##@param[in] var_pred_name names of predicting variables
-##@param[in] K K
-##@param[in] Nc number of sites of select
-##@retval cluster_dic # to be complete by Yan
-##@retval distance # to be complete by Yan
-##@retval All_selectedID # to be complete by Yan
 def Cluster_Ana(packdata, PFT_mask, ipft, var_pred_name, K, Nc):
+    """
+    Perform clustering analysis on the data for a specific Plant Functional Type (PFT).
+
+    Args:
+        packdata (xarray.Dataset): Dataset containing input variables.
+        PFT_mask (numpy.ndarray): Mask for Plant Functional Types.
+        ipft (int): Index of the current Plant Functional Type.
+        var_pred_name (list): List of predictor variable names.
+        K (int): Number of clusters.
+        Nc (int): Number of sites to select from each cluster.
+
+    Returns:
+        tuple:
+            - cluster_dic (dict): Dictionary containing cluster information.
+            - distance (float): Sum of squared distances of samples to their closest cluster center.
+            - All_selectedID (numpy.ndarray): Array of selected site IDs.
+    """
+    if "year" in packdata.dims:
+        packdata = packdata.mean("year", keep_attrs=True)
     if "Ndep_nhx_pft" in var_pred_name:
         packdata.Ndep_nhx_pft = packdata.Ndep_nhx[ipft - 1]
     if "Ndep_noy_pft" in var_pred_name:
@@ -56,17 +65,27 @@ def Cluster_Ana(packdata, PFT_mask, ipft, var_pred_name, K, Nc):
             SelectedID = locations[RandomS]
         else:
             SelectedID = locations
+        print(
+            f"Selected {len(SelectedID)} ({len(SelectedID)/len(locations):.2%}) sites in cluster {clus}"
+        )
         cluster_dic["clus_%.2i_loc_select" % clus] = SelectedID
         All_selectedID = np.append(All_selectedID, SelectedID, axis=0)

     return cluster_dic, distance, All_selectedID


-##@param[in] packdata packaged data
-##@param[in] varlist list of variables, including name of source files, variable names, etc.
-##@param[in] logfile logfile
-##@retval dis_all # Eulerian (?) distance corresponding to different number of Ks
 def Cluster_test(packdata, varlist, logfile):
+    """
+    Test clustering with different K values for all specified PFTs.
+
+    Args:
+        packdata (xarray.Dataset): Dataset containing input variables.
+        varlist (dict): Dictionary of variable information.
+        logfile (file): File object for logging.
+
+    Returns:
+        numpy.ndarray: Array of distances for different K values and PFTs.
+    """
     # 1.clustering def
     # Make a mask map according to PFT fractions: nan - <0.00000001; 1 - >=0.00000001
     # I used the output 'VEGET_COV_MAX' by ORCHIDEE-CNP with run the spin-up for 1 year.
@@ -92,22 +111,31 @@ def Cluster_test(packdata, varlist, logfile):
     return dis_all


-##@param[in] packdata packaged data
-##@param[in] varlist list of variables, including name of source files, variable names, etc.
-##@param[in] KK K value chosen to do final clustering
-##@param[in] logfile logfile
-##@retval IDx chosen IDs of pixels for MLacc
-##@retval IDloc # to be complete by Yan (just for plotting)
-##@retval IDsel # to be complete by Yan (just for plotting)
 def Cluster_all(packdata, varlist, KK, logfile):
+    """
+    Perform clustering for all specified PFTs with a chosen K value.
+
+    Args:
+        packdata (xarray.Dataset): Dataset containing input variables.
+        varlist (dict): Dictionary of variable information.
+        KK (int): Chosen K value for clustering.
+        logfile (file): File object for logging.
+
+    Returns:
+        tuple:
+            - IDx (numpy.ndarray): Array of chosen pixel IDs for MLacc.
+            - IDloc (numpy.ndarray): Array of cluster locations (for plotting).
+            - IDsel (numpy.ndarray): Array of selected cluster locations (for plotting).
+    """
     adict = locals()
     kpfts = varlist["clustering"]["pfts"]
     Ncc = varlist["clustering"]["Ncc"]
     PFT_mask, PFT_mask_lai = genMask.PFT(
         packdata, varlist, varlist["PFTmask"]["cluster_thres"]
     )

-    var_pred_name = varlist["pred"]["clustering"]
+    # var_pred_name = varlist["pred"]["clustering"]
+    var_pred_name = [k for k, v in packdata.items() if "veget" not in v.dims]
     for veg in kpfts:
         ClusD, disx, training_ID = Cluster_Ana(
             packdata,
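
To make the selection logic above concrete, a self-contained sketch of the pattern Cluster_Ana implements: K-means over the valid pixels, then up to Nc representative sites drawn from each cluster. This is simplified; the real function also handles PFT masks and xarray inputs:

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_and_select(features, K, Nc, seed=1000):
    # K-means over the pixel feature matrix of shape (n_pixels, n_features).
    km = KMeans(n_clusters=K, random_state=seed).fit(features)
    rng = np.random.default_rng(seed)
    selected = []
    for clus in range(K):
        locations = np.flatnonzero(km.labels_ == clus)
        if len(locations) > Nc:
            # Randomly subsample Nc representative sites, as in Cluster_Ana.
            locations = rng.choice(locations, size=Nc, replace=False)
        selected.append(locations)
    # km.inertia_ corresponds to the "distance" return value in the docstring.
    return km, km.inertia_, np.concatenate(selected)
```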