
Selecting the elastic net mixing parameter #56

Closed · dhimmel opened this issue Oct 11, 2016 · 5 comments


dhimmel commented Oct 11, 2016

Thus far we've been using grid search (cross-validation) to select the optimal elastic net mixing parameter. For SGDClassifier, this mixing parameter is set using l1_ratio, where l1_ratio = 0 performs ridge (L2) regularization and l1_ratio = 1 performs lasso (L1) regularization.
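
For reference, here's a minimal sketch of that setup in scikit-learn (values are illustrative, not from our notebooks):

```
from sklearn.linear_model import SGDClassifier

# l1_ratio controls the elastic net mix of the penalty term:
#   l1_ratio = 0 -> pure ridge (L2)
#   l1_ratio = 1 -> pure lasso (L1)
ridge_like = SGDClassifier(penalty='elasticnet', l1_ratio=0.0)
lasso_like = SGDClassifier(penalty='elasticnet', l1_ratio=1.0)
```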

Here's what I'm thinking:

Grid search is not the appropriate way to select the mixing parameter. Ridge (with the optimal regularization strength, alpha) will always perform better than the optimal lasso, because there's a cost for the convenience of sparsity: lasso must make difficult decisions about which features to select. The resulting sparsity can aid model interpretation, but it weakens performance, because identifying only the predictive features is an impossible task.

For example, see our grid from this notebook (note this used MAD feature selection to select only 500 features, which likely accentuates the performance deficit as l1_ratio increases).

[figure: grid of cross-validation performance across l1_ratio values]

So my sense is that l1_ratio should be chosen based on what properties we want the model to have, not based on maximum CV performance. If we only care about performance, we might as well save ourselves the computation time and always go with ridge or the default l1_ratio = 0.15. l1_ratio = 0.15 can still filter out ~50% of features with little performance degradation. But if you want real sparsity (lasso), there's going to be a performance cost -- and the user, not grid search, will have to make that decision.
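
As a quick illustration of that filtering effect, here's a sketch that counts zeroed coefficients (synthetic stand-in data; the ~50% figure depends on the dataset and on alpha):

```
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Synthetic stand-in for the expression matrix (illustrative only).
X, y = make_classification(n_samples=500, n_features=500,
                           n_informative=20, random_state=0)

# Default mixing parameter: mostly ridge with a modest L1 component.
clf = SGDClassifier(penalty='elasticnet', l1_ratio=0.15,
                    alpha=0.01, random_state=0)
clf.fit(X, y)

# Fraction of coefficients driven exactly to zero by the L1 term.
zeroed = (clf.coef_ == 0).mean()
print('{:.0%} of features filtered out'.format(zeroed))
```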


dhimmel commented Oct 11, 2016

Also, I'd rather spend more time optimizing alpha (regularization strength). glmnet in R defaults to trying a sequence of 100 different regularization strengths.
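
Something along these lines (a sketch; the grid bounds are assumptions, not values from the repo):

```
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

# Fix the mixing parameter and spend the computation on alpha instead,
# mimicking glmnet's default of scanning ~100 regularization strengths.
clf = SGDClassifier(penalty='elasticnet', l1_ratio=0.15, random_state=0)
param_grid = {'alpha': np.logspace(-10, 2, num=100)}
search = GridSearchCV(clf, param_grid, cv=5)
```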

dhimmel added a commit to dhimmel/machine-learning that referenced this issue Oct 11, 2016
Do not optimize `l1_ratio`. Instead use the default of 0.15. Search a denser grid for `alpha`. See cognoma#56
dhimmel added a commit that referenced this issue Oct 11, 2016
* Begin constructing a MVP machine learner

* Export JSON API input for Hippo pathway

* Ignore __pycache__

* classify() functioning with mock input

* Save output corresponding to hippo-input.json

From the `cognoml` directory, ran:

```
python analysis.py > ../data/api/hippo-output.json
```

* Export model information to JSON output

Also filter zero-variance features.

* Return unselected observations

Unselected observations (samples in the dataset that were not selected
by the user) are now returned. These observations receive predictions
but are missing (-1 encoded) for fields such as `testing` and `status`.

Sorted model parameters by key.

* Save grid_search performance metrics

* Move classifier and pipeline to its own module

* Add setup.py to make module installable

* Review comments: spacing and results doc

* Check whether pipeline has function before calling

Meant to address https://git.io/vPvtI

* Acquire data from figshare

* Update for sklearn 0.18.0, Fix pipeline

Fix pipeline according to:
scikit-learn/scikit-learn#7536 (comment)

Extract selected feature names according to:
scikit-learn/scikit-learn#7536 (comment)

* Semantic improvements of get_feature_df

* Update API JSON files

* Mention hippo-output-schema.json in docstring

* Address @gwaygenomics review comments

Does not address "Lasso or Ridge only?"

* Grid search: optimize alpha not l1_ratio

Do not optimize `l1_ratio`. Instead use the default of 0.15. Search a denser grid for `alpha`. See #56

cgreene commented Oct 11, 2016

Agree with @dhimmel about ridge/lasso trade-offs. We could ask the user how much they value sparsity vs performance if we can figure out a way that's not too confusing.
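
One hypothetical shape for that user-facing choice (all names here are invented for illustration, not part of the codebase):

```
# Hypothetical presets mapping a user's sparsity preference to l1_ratio.
SPARSITY_PRESETS = {
    'none': 0.0,   # pure ridge: best expected performance
    'some': 0.15,  # the default: filters many features cheaply
    'full': 1.0,   # pure lasso: maximal sparsity, at a performance cost
}

def choose_l1_ratio(preference):
    """Return the elastic net mixing parameter for a user preference."""
    return SPARSITY_PRESETS[preference]
```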


gwaybio commented Oct 17, 2016

> So my sense is that l1_ratio should be chosen based on what properties we want the model to have, not based on maximum CV performance.

Agreed!

dhimmel added a commit to cognoma/cognoml that referenced this issue Oct 25, 2016
@patrick-miller

If we are performing PCA on the expression matrix to create our features, then I am not sure how important sparsity is going to be in the final classifier. This is probably even more true when the number of components we choose is <= 100.
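
For concreteness, the pipeline shape being described might look like this (component count and estimator settings are illustrative):

```
from sklearn.decomposition import PCA
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# With <= 100 PCA components as features, zeroed coefficients no longer map
# to individual genes, so an L1 penalty buys little interpretability here.
pipeline = Pipeline([
    ('pca', PCA(n_components=100)),
    ('classify', SGDClassifier(penalty='elasticnet', l1_ratio=0.15)),
])
```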

@rdvelazquez

Closed by #114
