[WIP] Number of PCA Components to Keep #113

Merged
merged 3 commits into from
Sep 21, 2017


rdvelazquez
Member

First step in addressing #106. This is still a work in progress but I thought I'd at least check in and post what I'm working on. Any and all input is very welcome and appreciated.

The notebook is a little long because it's basically my working notes but I think there is enough documentation in the notebook that it's fairly self explanatory.

My takeaways thus far:

  1. The performance gain by searching over a larger range of n_components seems to be small (~1%-2% gain in testing AUROC on average), even when the range of n_components is selected for each query based on a heuristic around class balance.
  2. There is a larger performance gain if the l1_ratio is changed from 0.15 to 0 and the range of alpha is expanded (~5%-7% gain in testing AUROC on average). There isn't much performance gain if these parameters are changed independently of each other.

To Do:

  1. Do a similar evaluation for queries with only a subset of diseases or a single disease (this notebook currently only looks at queries with all the samples).
  2. Revise the classifier with the findings (if a revision is warranted)
  3. Revise the evaluation to account for covariates (Less of a priority based on the mixed findings thus far)

@rdvelazquez
Member Author

rdvelazquez commented Sep 15, 2017

My last commit added a notebook, number_of_pca_components_(subset_by_disease), that evaluates queries with only a subset of diseases (or a single disease). I also added a little to the original notebook, number_of_pca_components, to evaluate the impact of searching over a range of n_components as opposed to using a single value.

Main Takeaways:

  1. Ensuring that n_components < (total_number_of_samples * training% * cross_val_training%) will prevent an error caused by trying to perform PCA with n_components greater than the number of samples.
  2. Setting stratify=y in test train split will prevent potential errors caused by the testing set only having one class.
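
For reference, the bound in takeaway 1 can be sketched as below. This is only an illustration: the function names and the example split fractions (75% training, 50% cross-validation training) are assumptions, not values taken from the notebooks.

```python
import math


def max_n_components(total_samples, train_fraction, cv_train_fraction):
    """Largest integer strictly below total_samples * train_fraction * cv_train_fraction."""
    limit = total_samples * train_fraction * cv_train_fraction
    return max(math.ceil(limit) - 1, 0)


def clamp_n_components(requested, total_samples, train_fraction, cv_train_fraction):
    """Shrink a requested n_components so it stays under the bound above."""
    return min(requested, max_n_components(total_samples, train_fraction, cv_train_fraction))


# Hypothetical query: 120 samples, 75% used for training,
# and 50% of the training set used in each cross-validation training fold.
print(clamp_n_components(50, 120, 0.75, 0.5))
```

With a small query like the hypothetical one above, a requested value of 50 gets reduced; with a large query it passes through unchanged.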

Hopefully these two notebooks provide some useful quantitative information about how selecting hyperparameters will affect performance (AUROC) across a range of query scenarios. There's a lot more that could be done, but I think we are at a point where we can make some decisions about how we want to modify the main notebook. Memory usage should be a part of these decisions as well (see #88). Here are some of the decisions that need to be made:

  • l1_ratio: 0 or 0.15: Accuracy vs. sparsity (see Selecting the elastic net mixing parameter #56)
  • alpha: How large of a range to search over: Accuracy vs. speed/memory (may also depend on the l1_ratio decision)
  • n_components: One value, range, or function to select value/range

My recommendation:

  • l1_ratio = 0
  • alpha = [10** x for x in range(-10, 10)]
  • n_components = 50
    A function to select n_components, (see cells 14-17 of number_of_pca_components_(subset_by_disease)) is nice but I'm not sure the benefits are worth the added complexity.
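
As a quick check of what the recommended alpha expression produces, here is a small sketch. The `param_grid` key names are illustrative only and are not the names used in the notebooks; the inclusive variant is an assumption about intent, not the author's code.

```python
# Grid exactly as written in the recommendation above.
alphas = [10 ** x for x in range(-10, 10)]

# range() excludes its stop value, so the grid above ends at 10**9.
# An inclusive variant that also reaches 10**10 (assumption, see lead-in):
alphas_inclusive = [10 ** x for x in range(-10, 11)]

# Hypothetical parameter grid combining the three recommended settings.
param_grid = {
    "l1_ratio": [0],
    "alpha": alphas,
    "n_components": [50],
}

print(len(alphas), min(alphas), max(alphas))
print(len(alphas_inclusive), max(alphas_inclusive))
```

Note that `range(-10, 10)` yields 20 exponents, from -10 through 9, so the top of the grid is 10^9 unless the stop value is raised to 11.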

@dhimmel
Member

dhimmel commented Sep 19, 2017

Setting stratify=y in test train split will prevent potential errors caused by the testing set only having one class.

Ah I didn't realize we weren't setting stratify=y in train_test_split in 2.mutation-classifier.ipynb. Let's change that!
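
To illustrate why stratifying helps, here is a pure-Python sketch of a stratified split. It is not scikit-learn's implementation: the function name, the rounding rule, and the example labels are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction, seed=0):
    """Split indices so every class contributes to the test set.

    Each class is shuffled separately and a share of it (at least one
    sample) goes to the test set; the remainder goes to the training set.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = max(1, round(len(indices) * test_fraction))
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Hypothetical imbalanced query: 95 negatives and 5 positives.
labels = [0] * 95 + [1] * 5
train_idx, test_idx = stratified_split(labels, test_fraction=0.1)
print({labels[i] for i in test_idx})
```

With a purely random split of the same labels, a 10% test set could easily contain no positives at all, which is the failure stratification avoids.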

Ensuring that n_components < (total_number_of_samples * training% * cross_val_training%) will prevent an error caused by trying to perform PCA with n_components greater than the number of samples.

Once we settle on all these numbers, we should enforce this on the frontend.

A function to select n_components, (see cells 14-17 of number_of_pca_components_(subset_by_disease)) is nice but I'm not sure the benefits are worth the added complexity.

I think this makes sense. You could also make it very simple, like three sample size ranges that map to three different numbers of PCA components. And I support how you took the min of the positives and negatives and used that as the relevant number. @rdvelazquez can you open a pull request to incorporate the changes you think should go into 2.mutation-classifier.ipynb?

You can also merge this PR when it's no longer in progress.

@rdvelazquez
Member Author

Once we settle on all these numbers, we should enforce this on the frontend.

Agreed. I will try to help out with that as I have time and if it is within my limited front-end abilities.

@rdvelazquez can you open a pull request to incorporate the changes you think should go into 2.mutation-classifier.ipynb.

Will do!

You can also merge this PR when it's no longer in progress.

I think @patrick-miller may have been looking at this PR so I'll wait to see if he has any comments.

@patrick-miller
Member

This is a very nice study. I agree with most of your takeaways. For alpha, do we have to search over that large of a range on the small side? I saw instances where the max was needed, but not where the min was. Perhaps we could increase the minimum to reduce the parameter space.

I would've expected a larger positive correlation between optimal alpha and n_components. But, it looks like it doesn't matter too much. I agree with @dhimmel that if you want to use different n_components then something simple based on % of positives would be good, e.g. [20, 40, 80].
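
A simple three-tier mapping along these lines might look like the sketch below. The values 20, 40, and 80 come from the suggestion above and the use of the smaller class follows the earlier comment; the function name and both cutoffs (100 and 500) are hypothetical and would need tuning against the notebook results.

```python
def select_n_components(n_positives, n_negatives, small_cutoff=100, large_cutoff=500):
    """Map the size of the smaller class to one of three n_components values.

    The cutoffs are placeholders, not values taken from the notebooks.
    """
    minority = min(n_positives, n_negatives)
    if minority < small_cutoff:
        return 20
    if minority < large_cutoff:
        return 40
    return 80


# Hypothetical queries with few, moderate, and many minority-class samples.
for positives, negatives in [(30, 400), (250, 900), (1200, 3000)]:
    print(positives, negatives, select_n_components(positives, negatives))
```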

@rdvelazquez
Member Author

I revised the number_of_pca_components_(subset_by_disease) notebook to:

  • Include a simpler function to select n_components based on the number of positives

    • This function performed similarly (in terms of average AUROC) to the more complex function
    • I will incorporate this function into the main notebook (2.mutation-classifier) in my next PR
  • Include a graph to display the range of alphas that were selected (I plotted log base 10 of alpha to make it easier to understand)

    • There were two queries in the number_of_pca_components_(subset_by_disease) notebook that actually selected 10^-10
    • We could reduce the alpha range (maybe to 10^-7 - 10^7) and just live with a few queries here or there being at the edge of the range but I think it's best to just keep it as is (10^-10 - 10^10)
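
One way to spot queries that land on the edge of the alpha range is sketched below. The function name and the example selections are hypothetical; only the 10^-10 to 10^10 range comes from the discussion above.

```python
import math


def alpha_edge_report(selected_alphas, grid):
    """Return (log10(alpha), is_on_edge) for each selected alpha.

    An alpha is "on the edge" when it equals the smallest or largest grid
    value, which hints that the search range may be too narrow.
    """
    low, high = min(grid), max(grid)
    report = []
    for alpha in selected_alphas:
        on_edge = alpha == low or alpha == high
        report.append((round(math.log10(alpha)), on_edge))
    return report


# Grid spanning 10^-10 through 10^10, as discussed above.
grid = [10 ** x for x in range(-10, 11)]

# Hypothetical selections from three queries.
selected = [10 ** -10, 10 ** -3, 10 ** 2]
print(alpha_edge_report(selected, grid))
```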

@dhimmel and @patrick-miller Thanks for looking at this and providing comments! I'll squash merge this now. If there's anything else we want to look at we can always just open a new PR and revise these notebooks.

@rdvelazquez rdvelazquez merged commit 3ad1939 into cognoma:master Sep 21, 2017
@rdvelazquez rdvelazquez deleted the explore-number-of-components branch September 21, 2017 17:09