Selecting the number of components returned by PCA #106

Closed
rdvelazquez opened this issue Jul 11, 2017 · 16 comments

@rdvelazquez
Member

A topic that has come up a number of times is how many components should be returned by PCA...

  • The number of components (n_components) can be a parameter that is searched across in GridSearchCV. This used to cause problems with thrashing (Integrating dimensionality reduction into the pipeline #43), but these problems seem to have been eliminated by using the dask-searchcv implementation of GridSearchCV (Evaluate dask-searchcv to speed up GridSearchCV #94).
  • Although n_components can be included in GridSearchCV, we would like to limit the range that needs to be searched over based on the specifics of the query (i.e. how many samples are included {the user's filter by disease} and how many positive/negative mutations there are {the user's filter by gene(s)}).
  • Anecdotally, it seems that the optimal n_components is larger for balanced datasets (equal number of mutated and non-mutated samples) and smaller for unbalanced datasets (typically small number of mutated samples). Using a small n_components for balanced datasets results in low training and testing scores. Using a large n_components for unbalanced datasets results in higher training and lower testing scores (over-fitting).
  • As @dhimmel pointed out in Evaluate dask-searchcv to speed up GridSearchCV #94,

When working with small n_positives, you'll likely need to switch the CV assessment to use repeated cross validation or a large number of StratifiedShuffleSplits. See discussion on #71 by @htcai. We ended up going with:

sss = StratifiedShuffleSplit(n_splits=100, test_size=0.1, random_state=0)

I'm thinking the next step should be creating datasets that provide good coverage of the different query scenarios (#11) and performing GridSearchCV on them, searching over a range of n_components to see how changing n_components affects performance (AUROC).
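
Roughly, that grid search could look like the sketch below. The pipeline steps, grid values, and variable names are placeholders for illustration (not the project's actual notebook), and dask-searchcv's GridSearchCV can be dropped in for scikit-learn's:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('pca', PCA(random_state=0)),
    # loss='log' is the logistic loss (named 'log_loss' in newer scikit-learn)
    ('classify', SGDClassifier(loss='log', penalty='elasticnet', random_state=0)),
])
param_grid = {
    'pca__n_components': [10, 30, 50, 100, 250],  # range to be narrowed per query scenario
    'classify__alpha': np.logspace(-3, 1, 5),
}
sss = StratifiedShuffleSplit(n_splits=100, test_size=0.1, random_state=0)
search = GridSearchCV(pipeline, param_grid, cv=sss, scoring='roc_auc')
# search.fit(X_train, y_train)  # X_train: expression matrix, y_train: mutation status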

@dhimmel, @htcai, @patrick-miller feel free to comment now or we can discuss at tonight's meetup.

@patrick-miller
Member

patrick-miller commented Jul 11, 2017 via email

@rdvelazquez
Member Author

My guess is that the interplay between it and n_components is meaningful. I think it makes sense to optimize for them jointly.

Agreed. I would assume the interplay between n_components and l1_ratio would also be very important (an l1_ratio greater than 0 will include fewer features and likely have a large impact on the optimal n_components). I'll put a reference to #56 here as well, as that issue talks about what l1_ratio to use.

@htcai
Member

htcai commented Jul 11, 2017

@rdvelazquez Thanks for opening this issue! I also think this is a must-do if we want to enable searching over the optimal n_components for the Cognoma web interface. It is in line with my experience that unbalanced mutations usually have a smaller optimal n_components.

In order to limit the range of the search for the optimal n_components, we may need to experiment with a large variety of genes (including many balanced and many unbalanced) and a wide range of n_components values. Then we may be able to select a smaller common set of n_components values to search over when users query our datasets.

Although I'm not sure, it would be great if we could select a particular range of n_components based on the pos/neg ratio of each gene.

I will play around with dask-searchcv and see how much time / RAM it can save for the pipeline.

@rdvelazquez
Member Author

Thanks @htcai !

Although I'm not sure, it would be great if we could select a particular range of n_components based on the pos/neg ratio of each gene.

I'm thinking the total sample size (i.e. ~7,000 samples if including all diseases and <7,000 if only including a subset of diseases) may have an impact as well.

I will play around with dask-searchcv and see how much time / RAM it can save for the pipeline.

Sounds good. I evaluated the speed increase with dask-searchcv here.

@dhimmel
Member

dhimmel commented Jul 14, 2017

For the time being, before we get a better indication of the ideal PCA n_components by number of positives and negatives, I'd suggest expanding to something like [10, 20, 50, 100, 250].

As per #56, I suggest sticking with the default l1_ratio but expanding the range of alpha. You can use numpy.logspace or numpy.geomspace to generate the alpha range.
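
For example (the bounds and number of points here are illustrative, not a recommendation from this thread):

import numpy as np

n_components_range = [10, 20, 50, 100, 250]
alpha_range = np.logspace(-3, 1, 9)  # 9 values evenly spaced on a log scale from 1e-3 to 10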

I also think we may want to switch to:

sss = StratifiedShuffleSplit(n_splits=100, test_size=0.1, random_state=0)

since non-repeated cross-validation is generally going to be too noisy for our purposes.

@patrick-miller
Member

@dhimmel when you say stick to the default l1_ratio do you mean the sklearn default of 0.15? I made a comment in #56 mentioning that we may want to consider just using ridge regression since we have switched from SelectKBest to PCA. The number of features we are using has been reduced significantly, so I'm not sure what we get from sparsity in the classifier.

@dhimmel
Member

dhimmel commented Jul 16, 2017

when you say stick to the default l1_ratio do you mean the sklearn default of 0.15

That's what I was thinking (l1_ratio = 0.15). But your logic for using ridge (l1_ratio = 0) when we're using PCA makes sense to me. I don't have a strong opinion since I can see some principal components being totally irrelevant to the classification task and hence truly zero. I think we should pick whichever value gives us the most usable / user-friendly results. For example, with elastic net or lasso you may get all zero coefficient models which will likely confuse the user as to what's going on.
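
For concreteness, the two settings being weighed would look roughly like this with SGDClassifier (the alpha value is a placeholder; loss='log' is the logistic loss, renamed 'log_loss' in newer scikit-learn):

from sklearn.linear_model import SGDClassifier

elastic_net = SGDClassifier(loss='log', penalty='elasticnet', l1_ratio=0.15, alpha=0.01)  # sklearn default l1_ratio
ridge_like = SGDClassifier(loss='log', penalty='l2', alpha=0.01)  # equivalent to l1_ratio=0, i.e. pure ridge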

@patrick-miller
Member

patrick-miller commented Jul 16, 2017

I'm starting to run into RAM bloating issues again with the switch to StratifiedShuffleSplit and increasing the parameter search space. It balloons up quickly even with just 10 splits. I think we are going to need to move this to EC2.

As for performance with TP53, I have alpha = 1 and n_components = 640 performing the best... but 640 is at the edge of the search range (AUROC ~93%).

[Heatmap: cross-validation AUROC over the alpha × n_components grid]

@rdvelazquez
Member Author

rdvelazquez commented Jul 17, 2017

Nice progress @patrick-miller. Thanks for posting the heatmap. That's very informative.

Just to confirm... the heatmap above is with an l1_ratio of 0.15, and the RAM issues are with dask-searchcv?

I'm starting to run into RAM bloating issues again with the switch to StratifiedShuffleSplit and increasing the parameter search space. It balloons up quickly even with just 10 splits.

We may want to consider pulling PCA out of CV. I see three places we could do PCA (I'm suggesting we consider trying option 2 below instead of option 3):

  1. Before the test/train split: This is a data leak and could give testing scores that are falsely high. We don't want this.
  2. After test/train split but before CV: This is not a data leak (our testing set is still completely isolated from our training set). The only issue here is that CV won't be done completely correctly, so a set of parameters may be selected that are not as good as if PCA were done within CV... I don't think this will have all that significant an impact, and I think the trade-off (gaining speed and memory space) is worth it.
  3. After test/train split and within CV (the current implementation): This is the most "correct" way of doing it, but it is slower and more memory intensive, especially as the number of CV splits increases, because PCA is performed on each CV split. (As an example: with 10-fold CV, PCA is fit to 90% of the data and the remaining 10% is transformed with that fit, and this process is repeated for each of the 10 splits.)

I may try to implement this in a notebook (a Jupyter Notebook is worth a thousand words).
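
In the meantime, here is a rough sketch of option 2, assuming X and y are the expression matrix and mutation-status vector and using placeholder parameter values:

from sklearn.decomposition import PCA
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit, train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.1, random_state=0)

# Fit PCA once on the training set only (outside CV); the testing set stays isolated.
pca = PCA(n_components=100, random_state=0).fit(X_train)
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)  # transformed with the training-set fit

sss = StratifiedShuffleSplit(n_splits=100, test_size=0.1, random_state=0)
param_grid = {'alpha': [0.001, 0.01, 0.1, 1]}
search = GridSearchCV(SGDClassifier(loss='log', penalty='elasticnet', random_state=0),
                      param_grid, cv=sss, scoring='roc_auc')
search.fit(X_train_pca, y_train)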

I think we are going to need to move this to EC2.

Just as an FYI - @dcgoss is getting ml-workers pretty close to up and running. This is the repo where the notebooks will be run in production... on EC2 😄

I'm also working on creating a benchmark dataset that provides good coverage of the different query scenarios. I'll post or PR this once it's further along.

@patrick-miller
Member

That is with l1_ratio=0 and using dask-searchcv.

With respect to (2), it may not be a data leak, but it could bias the classifier. There is a significant difference in performance between models with different n_components, so this could alter which model is chosen.

@rdvelazquez
Member Author

There is a significant difference in performance between models with different n_components, so this could alter which model is chosen.

True, but we would still search over a range of n_components... I'm thinking we would perform PCA on the training set, the PCA components would be our X train, and we would use something like SelectKBest to search over a range of features (PCA components) to include in the model; instead of performing PCA within CV, it would already have been performed on the whole training set.
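
Roughly what I have in mind, as a sketch with assumed names and placeholder values rather than a finished notebook:

from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.pipeline import Pipeline

# One decomposition on the whole training set, outside CV.
pca = PCA(n_components=250, random_state=0).fit(X_train)
X_train_pca = pca.transform(X_train)

pipeline = Pipeline([
    ('select', SelectKBest(f_classif)),  # keep the k highest-scoring components
    ('classify', SGDClassifier(loss='log', penalty='elasticnet', random_state=0)),
])
param_grid = {
    'select__k': [10, 20, 50, 100, 250],
    'classify__alpha': [0.001, 0.01, 0.1, 1],
}
sss = StratifiedShuffleSplit(n_splits=100, test_size=0.1, random_state=0)
search = GridSearchCV(pipeline, param_grid, cv=sss, scoring='roc_auc')
search.fit(X_train_pca, y_train)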

@patrick-miller
Member

Using SelectKBest is in essence doing the same thing as iterating over a variety of n_components after running the decomposition. I need to think about it a little more, but I believe you might still be leaking information here -- from the train partition of the CV fold to the test partition of the CV fold. You would get different principal components if you perform the decomp within cross-validation, so it may be a little less robust.

@rdvelazquez
Member Author

but I believe you might still be leaking information here -- from the train partition of the CV fold to the test partition of the CV fold.

I agree, but this may not have much of an effect, and I don't think it's a big deal if our training AUROC score is slightly inflated as long as the testing AUROC score is still accurate.

I also agree that the cross-validation will be less robust this way but if it lets us search over a larger range of n_components I think that will far outweigh the small loss of robustness.

P.S. I know this would be a fairly large departure from the way the notebook is currently set up, so I'm not saying we should do it this way; I'm just thinking about ways to get this thing to run.

@htcai
Member

htcai commented Jul 19, 2017

I also agree that we should fit a PCA instance for each CV fold. A fitted PCA instance is part of the final model, as is the SGDClassifier.

I did PCA together with scaling before the grid search in #71, just because fitting PCA within the grid search was not feasible given the computation resources I had. Still, we can have a separate notebook evaluating the loss of robustness.

@dhimmel
Member

dhimmel commented Jul 19, 2017

A few notes.

The case @patrick-miller describes where n_components = 640 (or more) is optimal will be relatively uncommon, since TP53 is the most mutated gene. In general, there will be fewer positives. I'm okay with not always getting the best model in these extreme cases. The researcher could always iterate by adding larger n_components values to the notebook. The RAM issues are a concern because we want users to be able to run the notebooks locally.

We used to perform option 2 mentioned by @rdvelazquez, but switched to the more correct option 3. I agree that in most cases option 2 will cause little damage with great speedup. Furthermore, the testing values will still be correct, just not the cross-validation scores. However, I'd prefer to use the theoretically correct method, since we don't want to teach users incorrect methods. We'll have to find the delicate balance of a grid search that evaluates enough parameter combinations... but not too many.

@rdvelazquez
Member Author

Closed by #113

We will be using a function to select the number of components based on the number of positive samples in the query (or the number of negatives if it is a rare instance with more positives than negatives). The current function looks like:

if min(num_pos, num_neg) > 500:
    n_components = 100
elif min(num_pos, num_neg) > 250:
    n_components = 50
else:
    n_components = 30
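
Wrapped as a function (the name below is hypothetical, not necessarily what #113 uses), the rule reads:

def select_n_components(num_pos, num_neg):
    """Pick n_components for PCA based on the size of the smaller class."""
    if min(num_pos, num_neg) > 500:
        return 100
    elif min(num_pos, num_neg) > 250:
        return 50
    else:
        return 30

# e.g. select_n_components(num_pos=600, num_neg=6400) returns 100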
