[WIP] Number of PCA Components to Keep #113

rdvelazquez · 2017-08-29T02:43:39Z

First step in addressing #106. This is still a work in progress but I thought I'd at least check in and post what I'm working on. Any and all input is very welcome and appreciated.

The notebook is a little long because it's basically my working notes, but I think there is enough documentation in the notebook that it's fairly self-explanatory.

My takeaways thus far:

The performance gain by searching over a larger range of n_components seems to be small (~1%-2% gain in testing AUROC on average), even when the range of n_components is selected for each query based on a heuristic around class balance.
There is a larger performance gain if the l1_ratio is changed from 0.15 to 0 and the range of alpha is expanded (~5%-7% gain in testing AUROC on average). There isn't much performance gain if these parameters are changed independently of each other.

To Do:

Do a similar evaluation for queries with only a subset of diseases or a single disease (this notebook currently only looks at queries with all the samples).
Revise the classifier with the findings (if a revision is warranted)
Revise the evaluation to account for covariates (Less of a priority based on the mixed findings thus far)

rdvelazquez · 2017-09-15T11:17:21Z

My last commit added a notebook, number_of_pca_components_(subset_by_disease), that evaluates queries with only a subset of diseases (or a single disease). I also added a short section to the original notebook, number_of_pca_components, to evaluate the impact of searching over a range of n_components as opposed to using a single value.

Main Takeaways:

Ensuring that n_components < (total_number_of_samples * training% * cross_val_training%) will prevent an error caused by trying to perform PCA with n_components greater than the number of samples.
Setting stratify=y in train_test_split will prevent potential errors caused by the testing set only having one class.
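
As a rough illustration of the first takeaway, the bound can be computed before fitting. This is a minimal sketch, not code from the notebooks: the function names and the example fractions are assumptions, and the result is rounded down conservatively because the real fold sizes depend on how the splitter rounds.

```python
import math


def max_pca_components(n_samples, train_fraction, cv_train_fraction):
    """Return a conservative upper limit for n_components.

    Follows the rule n_components < n_samples * training% * cross_val_training%.
    All three argument names are hypothetical.
    """
    fold_size = n_samples * train_fraction * cv_train_fraction
    # Round down, then subtract one so the inequality is strict.
    return max(math.floor(fold_size) - 1, 0)


def cap_n_components(requested, n_samples, train_fraction, cv_train_fraction):
    """Clip a requested n_components to the conservative limit."""
    limit = max_pca_components(n_samples, train_fraction, cv_train_fraction)
    return min(requested, limit)


# A small query: 40 samples, half used for training, half of those per CV fold.
small_query = cap_n_components(50, n_samples=40, train_fraction=0.5, cv_train_fraction=0.5)
```

With these illustrative numbers the training fold holds 10 samples, so a request for 50 components is clipped to 9.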

Hopefully these two notebooks provide some useful quantitative information about how selecting hyperparameters will affect performance (AUROC) across a range of query scenarios. There's a lot more that could be done, but I think we are at a point where we can make some decisions about how we want to modify the main notebook. Memory usage should be a part of these decisions as well (see #88). Here are some of the decisions that need to be made:

l1_ratio (0 or 0.15): accuracy vs. sparsity (see Selecting the elastic net mixing parameter #56)
alpha (how large a range to search over): accuracy vs. speed/memory (may also depend on the l1_ratio decision)
n_components: one value, a range, or a function to select the value/range

My recommendation:

l1_ratio = 0
alpha = [10** x for x in range(-10, 10)]
n_components = 50
- A function to select n_components (see cells 14-17 of number_of_pca_components_(subset_by_disease)) is nice, but I'm not sure the benefits are worth the added complexity.
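
To make the recommended grid concrete, here is a small sketch of the values it expands to. The dictionary keys are illustrative only and are not the parameter names used in the main notebook. One thing worth noting is that Python's `range` excludes its stop value, so `range(-10, 10)` tops out at 10^9; if 10^10 is meant to be included, the stop would need to be 11.

```python
# Exponents searched by the recommended expression; range() excludes the stop value.
exponents = list(range(-10, 10))
alphas = [10.0 ** x for x in exponents]

# Hypothetical grid layout; real key names depend on how the pipeline is built.
param_grid = {
    "l1_ratio": [0],
    "alpha": alphas,
    "n_components": [50],
}

# Variant that also includes 10^10 as the largest alpha.
exponents_inclusive = list(range(-10, 11))
alphas_inclusive = [10.0 ** x for x in exponents_inclusive]
```

The first grid has 20 alpha values and the inclusive variant has 21.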

dhimmel · 2017-09-19T18:55:04Z

> Setting stratify=y in train_test_split will prevent potential errors caused by the testing set only having one class.

Ah I didn't realize we weren't setting stratify=y in train_test_split in 2.mutation-classifier.ipynb. Let's change that!
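
A minimal sketch of what that change could look like, assuming scikit-learn's `train_test_split`. The toy data, `test_size`, and `random_state` are placeholders and not the notebook's actual settings.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy, heavily imbalanced data: 10 positives out of 100 samples (illustrative only).
X = np.arange(200).reshape(100, 2)
y = np.array([1] * 10 + [0] * 90)

# stratify=y keeps the class proportions the same in both partitions,
# so the small test set cannot end up with a single class.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=0
)

classes_in_test = set(y_test.tolist())
```

Without `stratify`, a 10-sample test set drawn from this data could easily contain no positives, which would make AUROC undefined for that split.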

> Ensuring that n_components < (total_number_of_samples * training% * cross_val_training%) will prevent an error caused by trying to perform PCA with n_components greater than the number of samples.

Once we settle on all these numbers, we should enforce this on the frontend.

> A function to select n_components (see cells 14-17 of number_of_pca_components_(subset_by_disease)) is nice, but I'm not sure the benefits are worth the added complexity.

I think this makes sense. You could also make it very simple, like three sample size ranges that map to three different numbers of PCA components. I also support how you took the minimum of the positives and negatives and used that as the relevant number. @rdvelazquez can you open a pull request to incorporate the changes you think should go into 2.mutation-classifier.ipynb?

You can also merge this PR when it's no longer in progress.

rdvelazquez · 2017-09-20T11:24:55Z

> Once we settle on all these numbers, we should enforce this on the frontend.

Agreed. I will try to help out with that as I have time and if it is within my limited front-end abilities.

> @rdvelazquez can you open a pull request to incorporate the changes you think should go into 2.mutation-classifier.ipynb?

Will do!

> You can also merge this PR when it's no longer in progress.

I think @patrick-miller may have been looking at this PR so I'll wait to see if he has any comments.

patrick-miller · 2017-09-21T00:00:01Z

This is a very nice study. I agree with most of your takeaways. For alpha, do we have to search over such a large range on the small side? I saw instances where the max was needed, but not where the min was. Perhaps we could increase the minimum to reduce the parameter space.

I would've expected a larger positive correlation between optimal alpha and n_components. But, it looks like it doesn't matter too much. I agree with @dhimmel that if you want to use different n_components then something simple based on % of positives would be good, e.g. [20, 40, 80].
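
A simple selector along these lines could be sketched as follows. This is not the function from the notebook: the name `select_n_components` and the cutoffs of 100 and 500 are made up for illustration, and it keys on the smaller of the positive and negative counts as discussed earlier in the thread.

```python
def select_n_components(y, cutoffs=(100, 500), choices=(20, 40, 80)):
    """Pick n_components from three tiers based on the smaller class size.

    The cutoffs are hypothetical; only the [20, 40, 80] tiers come from the
    discussion above.
    """
    positives = sum(1 for label in y if label == 1)
    negatives = len(y) - positives
    smaller = min(positives, negatives)
    # Walk the tiers from smallest to largest and stop at the first match.
    for cutoff, choice in zip(cutoffs, choices):
        if smaller < cutoff:
            return choice
    return choices[-1]


# Example: 50 positives and 1000 negatives falls in the lowest tier.
rare_mutation = select_n_components([1] * 50 + [0] * 1000)
```

Keeping the tiers as arguments makes it easy to try different cutoffs without touching the logic.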

rdvelazquez · 2017-09-21T16:50:55Z

I revised the number_of_pca_components_(subset_by_disease) notebook to:

Include a simpler function to select n_components based on the number of positives
- This function performed similarly (in terms of average AUROC) to the more complex function
- I will incorporate this function into the main notebook (2.mutation-classifier) in my next PR
Include a graph to display the range of alphas that were selected (I plotted log base 10 of alpha to make it easier to understand)
- There were two queries in the number_of_pca_components_(subset_by_disease) notebook that actually selected 10^-10
- We could reduce the alpha range (maybe to 10^-7 to 10^7) and just live with a few queries here or there being at the edge of the range, but I think it's best to just keep it as is (10^-10 to 10^10)
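
One way to keep an eye on this is to flag queries whose selected alpha sits at either end of the grid. The sketch below is only an illustration: the helper name, the query names, and the selected values are invented, not results from the notebooks.

```python
import math


def at_grid_edge(best_alpha, alphas):
    """Return True if the selected alpha is the smallest or largest grid value."""
    lowest, highest = min(alphas), max(alphas)
    return math.isclose(best_alpha, lowest) or math.isclose(best_alpha, highest)


# Grid spanning 10^-10 through 10^10 inclusive.
alphas = [10.0 ** x for x in range(-10, 11)]

# Hypothetical selections, keyed by made-up query names.
selected = {"query_a": 1e-10, "query_b": 0.001, "query_c": 1e10}

edge_queries = [name for name, alpha in selected.items() if at_grid_edge(alpha, alphas)]
```

If many queries show up in `edge_queries`, that would suggest the range is too narrow; a handful is probably acceptable.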

@dhimmel and @patrick-miller Thanks for looking at this and providing comments! I'll squash merge this now. If there's anything else we want to look at we can always just open a new PR and revise these notebooks.

Ryan Velazquez added 2 commits August 28, 2017 22:09

Add explore-number-of-pca-components notebook

84931a7

additional evaluation

acfc22f

revise subset_by_disease notebook

95d1ec2

rdvelazquez merged commit 3ad1939 into cognoma:master Sep 21, 2017

rdvelazquez mentioned this pull request Sep 21, 2017

Selecting the number of components returned by PCA #106

Closed

rdvelazquez deleted the explore-number-of-components branch September 21, 2017 17:09

This was referenced Sep 21, 2017

Revise parameter grid #114

Merged

Machine Learning Punch List for Launch #110

Closed

Create benchmark data sets #11

Closed
