Matbench task model accuracy for tabular data: e.g. lattice parameters, space group, and unit cell volume #51
Question brought up during the meeting: whether to include compositional information.
@cseeg Lattice parameters and unit cell volume can be accessed through the pymatgen.core.structure.Structure objects that Matbench gives you (see lines 310 to 315 in a5dbbac).
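For concreteness, here is a minimal sketch of pulling those values out of a single pymatgen Structure. The lattice/volume attributes are standard pymatgen; the helper name is just for illustration.

```python
from pymatgen.core.structure import Structure


def lattice_features(structure: Structure) -> list:
    """Return the 7 tabular features: a, b, c, alpha, beta, gamma, volume."""
    a, b, c = structure.lattice.abc                 # lattice parameter lengths (Angstrom)
    alpha, beta, gamma = structure.lattice.angles   # lattice angles (degrees)
    volume = structure.volume                       # unit cell volume (Angstrom^3)
    return [a, b, c, alpha, beta, gamma, volume]
```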
How this fits into the bigger picture: Faris is working on a convolutional neural network that uses the full 64x64 representation, so that will be one of the main comparisons, along with the dummy baseline and other Matbench models. If Faris' model performs worse than your model, that would indicate the representation probably has too many parameters relative to the number of datapoints (64*64 = 4096 features vs. fewer than 10 for yours) to be useful for regression. If Faris' model performs better, then it might be worth adding composition information to your model as a follow-up (i.e. let it know what the chemical formula is), or we might just stop there.

> There are a couple ways these results can affect the design decisions of

Mostly thinking of it as additional baselines and another perspective on the representation's behavior in a more established space (regression/classification performance).
Initial notebook using default XGBoost parameters at #78; Matbench submission to follow soon.
Matbench PR submitted in materialsproject/matbench#152.
Hyperopt submission ready-to-go by @cseeg. Planning to submit a Matbench PR soon.
@cseeg the hyperopt submission notebook is close, but it needs to be reworked and rerun. The hyperparameter optimization should occur once for each Matbench fold in the loop, i.e. remove the hardcoded hyperparameters:

```python
# Define dictionary of hyperparameters. This came from the
# HYPERPARAM TUNING WITH HYPEROPT + RECURSIVE FEATURE ADDITION (RFA) section
params = {'colsample_bytree': 0.7271776258515598,
          'learning_rate': 0.032792408056138485,
          'max_depth': 19}

# Set up and train XGBoost model
train = xgb.DMatrix(X, label=y)
num_round = 100
my_model = xgb.train(params, train, num_round)
```

The hyperparameter optimization should happen here instead, before the `xgb.train` call above, along the lines of:

```python
# Assumed imports for this excerpt (not shown in the notebook): numpy as np,
# xgboost as xgb, XGBRegressor from xgboost, train_test_split from
# sklearn.model_selection, hp and Trials from hyperopt, BoostRFA from shap-hypetune.

# Define regressor and split the dataset into training and validation sets
X_regr_train, X_regr_valid, y_regr_train, y_regr_valid = train_test_split(
    X, y, test_size=0.3, shuffle=True, random_state=42
)
regr_xgb = XGBRegressor(n_estimators=150, random_state=0, verbosity=0, n_jobs=-1)

# Dictionary of hyperopt search spaces to sample from
# (note: the hyperopt labels 'num_leaves' and 'colsample_by_tree' don't match the dict keys)
param_dist_hyperopt = {
    'max_depth': 15 + hp.randint('num_leaves', 5),
    'learning_rate': hp.loguniform('learning_rate', np.log(0.01), np.log(0.2)),
    'colsample_bytree': hp.uniform('colsample_by_tree', 0.6, 1.0)
}

# Define and fit model (hyperopt search + recursive feature addition)
model = BoostRFA(
    regr_xgb, param_grid=param_dist_hyperopt, min_features_to_select=1, step=1,
    n_iter=50, sampling_seed=0
)
model.fit(
    X_regr_train, y_regr_train, trials=Trials(),
    eval_set=[(X_regr_valid, y_regr_valid)], early_stopping_rounds=6, verbose=0
)
model.best_params_
```

Then:

```python
my_model = xgb.train(model.best_params_, train, num_round)
```

To recap, for each Matbench fold, split the …
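For reference, a minimal sketch of running the optimization once per fold, assuming the Matbench `MatbenchBenchmark` task API, the `matbench_mp_e_form` formation energy task, a `featurize()` helper like the lattice-feature snippet earlier in this thread, and a placeholder `tune()` standing in for the BoostRFA/hyperopt search above:

```python
import numpy as np
import xgboost as xgb
from matbench.bench import MatbenchBenchmark


def featurize(structures):
    # 7 tabular features per structure: a, b, c, alpha, beta, gamma, volume
    return np.array(
        [[*s.lattice.abc, *s.lattice.angles, s.volume] for s in structures]
    )


mb = MatbenchBenchmark(autoload=False, subset=["matbench_mp_e_form"])

for task in mb.tasks:
    task.load()
    for fold in task.folds:
        train_inputs, train_outputs = task.get_train_and_val_data(fold)
        X_train = featurize(train_inputs)
        y_train = np.array(train_outputs)

        # Hyperparameter optimization happens here, once per fold
        # (tune() is a placeholder for the BoostRFA/hyperopt search)
        best_params = tune(X_train, y_train)

        dtrain = xgb.DMatrix(X_train, label=y_train)
        booster = xgb.train(best_params, dtrain, num_boost_round=100)

        # Predict on the fold's test structures and record with Matbench
        X_test = featurize(task.get_test_data(fold, include_target=False))
        preds = booster.predict(xgb.DMatrix(X_test))
        task.record(fold, preds)

mb.to_file("results.json.gz")
```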
I didn't check the full notebook, but you might want to check out Optuna as an alternative to hyperopt. It tends to be more efficient than hyperopt and also has a pruning callback for XGBoost (there is a note on this in The Kaggle Book).
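A rough sketch of what the Optuna route could look like, assuming the train/validation split from the notebook excerpt above and Optuna's `XGBoostPruningCallback`; the search ranges are illustrative, not tuned values:

```python
import numpy as np
import optuna
import xgboost as xgb

dtrain = xgb.DMatrix(X_regr_train, label=y_regr_train)
dvalid = xgb.DMatrix(X_regr_valid, label=y_regr_valid)


def objective(trial):
    params = {
        "objective": "reg:squarederror",
        "max_depth": trial.suggest_int("max_depth", 3, 20),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.2, log=True),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.6, 1.0),
    }
    # Prunes unpromising trials based on the validation RMSE reported each round
    pruning_cb = optuna.integration.XGBoostPruningCallback(trial, "validation-rmse")
    booster = xgb.train(
        params, dtrain, num_boost_round=100,
        evals=[(dvalid, "validation")], callbacks=[pruning_cb], verbose_eval=False,
    )
    preds = booster.predict(dvalid)
    return float(np.sqrt(np.mean((preds - np.array(y_regr_valid)) ** 2)))


study = optuna.create_study(direction="minimize", pruner=optuna.pruners.MedianPruner())
study.optimize(objective, n_trials=50)
study.best_params
```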
@kjappelbaum oof, I forgot that hyperopt is a package. I've been (in poor taste) using it as an abbreviation for hyperparameter optimization. Glad you mentioned this. I believe @cseeg was using … I've enjoyed using RayTune quite a bit, especially given its integration with Ax. It looks like it has Optuna support as well (other link). I should probably give Optuna a try at some point.
Yes, what Sterling said is correct. I was looking through this Kaggle post to understand more about …
Ah, gotcha, didn't realize …
The task is to use a hyperparameter-tuned XGBoost model for a Matbench submission on regressing formation energy using only the lattice parameter lengths, angles, and unit cell volume as inputs. This will help us know how "good" the xtal2png representation is from a model accuracy perspective.
@cseeg