Matbench task model accuracy for tabular data: e.g. lattice parameters, space group, and unit cell volume #51
Question brought up during the meeting: whether to include compositional information.
@cseeg Lattice parameters and unit cell volume can be accessed through the pymatgen.core.structure.Structure objects that Matbench gives you (see lines 310 to 315 in a5dbbac).
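For concreteness, here is a minimal sketch of pulling those values out of a single pymatgen Structure. The lattice/volume attributes are standard pymatgen; the helper name is just for illustration.

```python
from pymatgen.core.structure import Structure


def lattice_features(structure: Structure) -> list:
    """Return the 7 tabular features: a, b, c, alpha, beta, gamma, volume."""
    a, b, c = structure.lattice.abc                 # lattice parameter lengths (Angstrom)
    alpha, beta, gamma = structure.lattice.angles   # lattice angles (degrees)
    volume = structure.volume                       # unit cell volume (Angstrom^3)
    return [a, b, c, alpha, beta, gamma, volume]
```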
How this fits into the bigger picture: Faris is working on a convolutional neural network that uses the full 64x64 representation, so that will be one of the main comparisons, along with the dummy baseline and other Matbench models. If Faris' model performs worse than your model, that would indicate the representation probably has too many parameters relative to the number of datapoints (64*64 = 4096 features vs. fewer than 10 for yours) to be useful for regression. If Faris' model performs better, then it might be worth adding composition information to your model as a follow-up (i.e. let it know what the chemical formula is), or we might just stop there.

> There are a couple ways these results can affect the design decisions of

Mostly thinking of it as additional baselines and another perspective on the representation's behavior in a more established space (regression/classification performance).
Initial notebook using default XGBoost parameters at #78; Matbench submission to follow soon.
Matbench PR submitted in materialsproject/matbench#152.
Hyperopt submission ready-to-go by @cseeg. Planning to submit a Matbench PR soon.
@cseeg the hyperopt submission notebook is close, but it needs to be reworked and rerun. The hyperparameter optimization should occur once for each Matbench fold in the loop, i.e. remove the hardcoded hyperparameters:

```python
# Define dictionary of hyperparameters. This came from the
# HYPERPARAM TUNING WITH HYPEROPT + RECURSIVE FEATURE ADDITION (RFA) section
params = {'colsample_bytree': 0.7271776258515598,
          'learning_rate': 0.032792408056138485,
          'max_depth': 19}

# Set up and train XGBoost model
train = xgb.DMatrix(X, label=y)
num_round = 100
my_model = xgb.train(params, train, num_round)
```

The hyperparameter optimization should happen here instead, before the `xgb.train` call above, along the lines of:

```python
# Assumed imports for this excerpt (not shown in the notebook): numpy as np,
# xgboost as xgb, XGBRegressor from xgboost, train_test_split from
# sklearn.model_selection, hp and Trials from hyperopt, BoostRFA from shap-hypetune.

# Define regressor and split the dataset into training and validation sets
X_regr_train, X_regr_valid, y_regr_train, y_regr_valid = train_test_split(
    X, y, test_size=0.3, shuffle=True, random_state=42
)
regr_xgb = XGBRegressor(n_estimators=150, random_state=0, verbosity=0, n_jobs=-1)

# Dictionary of hyperopt search spaces to sample from
# (note: the hyperopt labels 'num_leaves' and 'colsample_by_tree' don't match the dict keys)
param_dist_hyperopt = {
    'max_depth': 15 + hp.randint('num_leaves', 5),
    'learning_rate': hp.loguniform('learning_rate', np.log(0.01), np.log(0.2)),
    'colsample_bytree': hp.uniform('colsample_by_tree', 0.6, 1.0)
}

# Define and fit model (hyperopt search + recursive feature addition)
model = BoostRFA(
    regr_xgb, param_grid=param_dist_hyperopt, min_features_to_select=1, step=1,
    n_iter=50, sampling_seed=0
)
model.fit(
    X_regr_train, y_regr_train, trials=Trials(),
    eval_set=[(X_regr_valid, y_regr_valid)], early_stopping_rounds=6, verbose=0
)
model.best_params_
```

Then:

```python
my_model = xgb.train(model.best_params_, train, num_round)
```

To recap, for each Matbench fold, split the …
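For reference, a minimal sketch of running the optimization once per fold, assuming the Matbench `MatbenchBenchmark` task API, the `matbench_mp_e_form` formation energy task, a `featurize()` helper like the lattice-feature snippet earlier in this thread, and a placeholder `tune()` standing in for the BoostRFA/hyperopt search above:

```python
import numpy as np
import xgboost as xgb
from matbench.bench import MatbenchBenchmark


def featurize(structures):
    # 7 tabular features per structure: a, b, c, alpha, beta, gamma, volume
    return np.array(
        [[*s.lattice.abc, *s.lattice.angles, s.volume] for s in structures]
    )


mb = MatbenchBenchmark(autoload=False, subset=["matbench_mp_e_form"])

for task in mb.tasks:
    task.load()
    for fold in task.folds:
        train_inputs, train_outputs = task.get_train_and_val_data(fold)
        X_train = featurize(train_inputs)
        y_train = np.array(train_outputs)

        # Hyperparameter optimization happens here, once per fold
        # (tune() is a placeholder for the BoostRFA/hyperopt search)
        best_params = tune(X_train, y_train)

        dtrain = xgb.DMatrix(X_train, label=y_train)
        booster = xgb.train(best_params, dtrain, num_boost_round=100)

        # Predict on the fold's test structures and record with Matbench
        X_test = featurize(task.get_test_data(fold, include_target=False))
        preds = booster.predict(xgb.DMatrix(X_test))
        task.record(fold, preds)

mb.to_file("results.json.gz")
```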
I didn't check the full notebook, but you might want to check out Optuna as an alternative to hyperopt. It tends to be more efficient than hyperopt and also has a pruning callback for XGBoost (there is a note on this in The Kaggle Book).
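A rough sketch of what the Optuna route could look like, assuming the train/validation split from the notebook excerpt above and Optuna's `XGBoostPruningCallback`; the search ranges are illustrative, not tuned values:

```python
import numpy as np
import optuna
import xgboost as xgb

dtrain = xgb.DMatrix(X_regr_train, label=y_regr_train)
dvalid = xgb.DMatrix(X_regr_valid, label=y_regr_valid)


def objective(trial):
    params = {
        "objective": "reg:squarederror",
        "max_depth": trial.suggest_int("max_depth", 3, 20),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.2, log=True),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.6, 1.0),
    }
    # Prunes unpromising trials based on the validation RMSE reported each round
    pruning_cb = optuna.integration.XGBoostPruningCallback(trial, "validation-rmse")
    booster = xgb.train(
        params, dtrain, num_boost_round=100,
        evals=[(dvalid, "validation")], callbacks=[pruning_cb], verbose_eval=False,
    )
    preds = booster.predict(dvalid)
    return float(np.sqrt(np.mean((preds - np.array(y_regr_valid)) ** 2)))


study = optuna.create_study(direction="minimize", pruner=optuna.pruners.MedianPruner())
study.optimize(objective, n_trials=50)
study.best_params
```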
@kjappelbaum oof, I forgot that hyperopt is a package. I've been (in poor taste) using it as an abbreviation for hyperparameter optimization. Glad you mentioned this. I believe @cseeg was using … I've enjoyed using RayTune quite a bit, especially given its integration with Ax. It looks like it has Optuna support as well (other link). I should probably give Optuna a try at some point.
Yes, what Sterling said is correct. I was looking through this Kaggle post to understand more about …
Ah, gotcha, didn't realize …
The task is to use a hyperparameter-tuned XGBoost model for a Matbench submission on regressing formation energy using only the lattice parameter lengths, angles, and unit cell volume as inputs. This will help us know how "good" the xtal2png representation is from a model accuracy perspective.
@cseeg