Selecting the number of components returned by PCA #106

Closed
rdvelazquez opened this issue Jul 11, 2017 · 16 comments

@rdvelazquez
Member

A topic that has come up a number of times is how many components should be returned by PCA...

  • The number of components (n_components) can be a parameter that is searched across in GridSearchCV. This used to cause problems with thrashing (Integrating dimensionality reduction into the pipeline #43), but these problems seem to have been eliminated by using the dask-searchcv implementation of GridSearchCV (Evaluate dask-searchcv to speed up GridSearchCV #94).
  • Although n_components can be included in GridSearchCV, we would like to limit the range that needs to be searched over based on the specifics of the query (i.e. how many samples are included {the user's filter by disease} and how many positive/negative mutations there are {the user's filter by gene(s)}).
  • Anecdotally, it seems that the optimal n_components is larger for balanced datasets (equal number of mutated and non-mutated samples) and smaller for unbalanced datasets (typically small number of mutated samples). Using a small n_components for balanced datasets results in low training and testing scores. Using a large n_components for unbalanced datasets results in higher training and lower testing scores (over-fitting).
  • As @dhimmel pointed out in Evaluate dask-searchcv to speed up GridSearchCV #94,

When working with small n_positives, you'll likely need to switch the CV assessment to use repeated cross validation or a large number of StratifiedShuffleSplits. See discussion on #71 by @htcai. We ended up going with:

sss = StratifiedShuffleSplit(n_splits=100, test_size=0.1, random_state=0)

I'm thinking the next step should be creating datasets that provide good coverage of the different query scenarios (#11) and performing GridSearchCV on them, searching over a range of n_components to see how changing n_components affects performance (AUROC).
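
Roughly, that grid search could look like the sketch below. The pipeline steps, grid values, and variable names are placeholders for illustration (not the project's actual notebook), and dask-searchcv's GridSearchCV can be dropped in for scikit-learn's:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('pca', PCA(random_state=0)),
    # loss='log' is the logistic loss (named 'log_loss' in newer scikit-learn)
    ('classify', SGDClassifier(loss='log', penalty='elasticnet', random_state=0)),
])
param_grid = {
    'pca__n_components': [10, 30, 50, 100, 250],  # range to be narrowed per query scenario
    'classify__alpha': np.logspace(-3, 1, 5),
}
sss = StratifiedShuffleSplit(n_splits=100, test_size=0.1, random_state=0)
search = GridSearchCV(pipeline, param_grid, cv=sss, scoring='roc_auc')
# search.fit(X_train, y_train)  # X_train: expression matrix, y_train: mutation status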

@dhimmel, @htcai, @patrick-miller feel free to comment now or we can discuss at tonight's meetup.

@patrick-miller
Member

patrick-miller commented Jul 11, 2017 via email

@rdvelazquez
Member Author

My guess is that the interplay between it and n_components is meaningful. I think it makes sense to optimize for them jointly.

Agreed. I would assume the interplay between n_components and l1_ratio would also be very important (an l1_ratio greater than 0 will include fewer features and likely have a large impact on the optimal n_components). I'll put a reference to #56 here as well, as that issue talks about what l1_ratio to use.

@htcai
Member

htcai commented Jul 11, 2017

@rdvelazquez Thanks for opening this issue! I also think this is a must-do if we want to enable searching over the optimal n_components for the Cognoma web interface. It is in line with my experience that unbalanced mutations usually have a smaller optimal n_components.

In order to limit the range of the search for the optimal n_components, we may need to experiment with a large variety of genes (including many balanced and many unbalanced) and a wide range of n_components values. Then we may be able to select a smaller common set of n_components values to search over when users query our datasets.

Although I'm not sure, it would be great if we could select a particular range of n_components based on the pos/neg ratio of each gene.

I will play around with dask-searchcv and see how much time / RAM it can save for the pipeline.

@rdvelazquez
Member Author

Thanks @htcai !

Although I'm not sure, it would be great if we could select a particular range of n_components based on the pos/neg ratio of each gene.

I'm thinking the total sample size (i.e. ~7,000 samples if including all diseases and <7,000 if only including a subset of diseases) may have an impact as well.

I will play around with dask-searchcv and see how much time / RAM it can save for the pipeline.

Sounds good. I evaluated the speed increase with dask-searchcv here.

@dhimmel
Member

dhimmel commented Jul 14, 2017

For the time being, before we get a better indication of the ideal PCA n_components by number of positives and negatives, I'd suggest expanding to something like [10, 20, 50, 100, 250].

As per #56, I suggest sticking with the default l1_ratio but expanding the range of alpha. You can use numpy.logspace or numpy.geomspace to generate the alpha range.
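
For example (the bounds and number of points here are illustrative, not a recommendation from this thread):

import numpy as np

n_components_range = [10, 20, 50, 100, 250]
alpha_range = np.logspace(-3, 1, 9)  # 9 values evenly spaced on a log scale from 1e-3 to 10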

I also think we may want to switch to:

sss = StratifiedShuffleSplit(n_splits=100, test_size=0.1, random_state=0)

since non-repeated cross-validation is generally going to be too noisy for our purposes.

@patrick-miller
Member

@dhimmel when you say stick to the default l1_ratio do you mean the sklearn default of 0.15? I made a comment in #56 mentioning that we may want to consider just using ridge regression since we have switched from SelectKBest to PCA. The number of features we are using has been reduced significantly, so I'm not sure what we get from sparsity in the classifier.

@dhimmel
Member

dhimmel commented Jul 16, 2017

when you say stick to the default l1_ratio do you mean the sklearn default of 0.15

That's what I was thinking (l1_ratio = 0.15). But your logic for using ridge (l1_ratio = 0) when we're using PCA makes sense to me. I don't have a strong opinion since I can see some principal components being totally irrelevant to the classification task and hence truly zero. I think we should pick whichever value gives us the most usable / user-friendly results. For example, with elastic net or lasso you may get all zero coefficient models which will likely confuse the user as to what's going on.
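
For concreteness, the two settings being weighed would look roughly like this with SGDClassifier (the alpha value is a placeholder; loss='log' is the logistic loss, renamed 'log_loss' in newer scikit-learn):

from sklearn.linear_model import SGDClassifier

elastic_net = SGDClassifier(loss='log', penalty='elasticnet', l1_ratio=0.15, alpha=0.01)  # sklearn default l1_ratio
ridge_like = SGDClassifier(loss='log', penalty='l2', alpha=0.01)  # equivalent to l1_ratio=0, i.e. pure ridge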

@patrick-miller
Member

patrick-miller commented Jul 16, 2017

I'm starting to run into RAM bloating issues again with the switch to StratifiedShuffleSplit and increasing the parameter search space. It balloons up quickly even with just 10 splits. I think we are going to need to move this to EC2.

As for performance with TP53, I have alpha = 1 and n_components = 640 performing the best... but 640 is at the edge of the search range (AUROC ~93%).

[Heatmap: cross-validation AUROC over the alpha × n_components grid]

@rdvelazquez
Member Author

rdvelazquez commented Jul 17, 2017

Nice progress @patrick-miller. Thanks for posting the heatmap. That's very informative.

Just to confirm... the heatmap above is with an l1_ratio of 0.15, and the RAM issues are with dask-searchcv?

I'm starting to run into RAM bloating issues again with the switch to StratifiedShuffleSplit and increasing the parameter search space. It balloons up quickly even with just 10 splits.

We may want to consider pulling PCA out of CV. I see three places we could do PCA (I'm suggesting we consider trying option 2 below instead of option 3):

  1. Before the test/train split: This is a data leak and could give testing scores that are falsely high. We don't want this.
  2. After test/train split but before CV: This is not a data leak (our testing set is still completely isolated from our training set). The only issue here is that CV won't be done completely correctly, so a set of parameters may be selected that are not as good as if PCA were done within CV... I don't think this will have all that significant an impact, and I think the trade-off (gaining speed and memory space) is worth it.
  3. After test/train split and within CV (the current implementation): This is the most "correct" way of doing it, but it is slower and more memory intensive, especially as the number of CV splits increases, because PCA is performed on each CV split. (As an example: with 10-fold CV, PCA is fit to 90% of the data and the remaining 10% is transformed with that fit, and this process is repeated for each of the 10 splits.)

I may try to implement this in a notebook (a Jupyter Notebook is worth a thousand words).
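
In the meantime, here is a rough sketch of option 2, assuming X and y are the expression matrix and mutation-status vector and using placeholder parameter values:

from sklearn.decomposition import PCA
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit, train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.1, random_state=0)

# Fit PCA once on the training set only (outside CV); the testing set stays isolated.
pca = PCA(n_components=100, random_state=0).fit(X_train)
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)  # transformed with the training-set fit

sss = StratifiedShuffleSplit(n_splits=100, test_size=0.1, random_state=0)
param_grid = {'alpha': [0.001, 0.01, 0.1, 1]}
search = GridSearchCV(SGDClassifier(loss='log', penalty='elasticnet', random_state=0),
                      param_grid, cv=sss, scoring='roc_auc')
search.fit(X_train_pca, y_train)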

I think we are going to need to move this to EC2.

Just as an FYI - @dcgoss is getting ml-workers pretty close to up and running. This is the repo where the notebooks will be run in production... on EC2 😄

I'm also working on creating a benchmark dataset that provides good coverage of the different query scenarios. I'll post or PR this once it's further along.

@patrick-miller
Member

That is with l1_ratio=0 and using dask-searchcv.

With respect to (2), it may not be a data leak, but it could bias the classifier. There is a significant difference in performance between models with different n_components, so this could alter which model is chosen.

@rdvelazquez
Member Author

There is a significant difference in performance between models with different n_components, so this could alter which model is chosen.

True, but we would still search over a range of n_components... I'm thinking we would perform PCA on the training set, the PCA components would be our X train, and we would use something like SelectKBest to search over a range of features (PCA components) to include in the model; instead of performing PCA within CV, it would already have been performed on the whole training set.
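
Roughly what I have in mind, as a sketch with assumed names and placeholder values rather than a finished notebook:

from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.pipeline import Pipeline

# One decomposition on the whole training set, outside CV.
pca = PCA(n_components=250, random_state=0).fit(X_train)
X_train_pca = pca.transform(X_train)

pipeline = Pipeline([
    ('select', SelectKBest(f_classif)),  # keep the k highest-scoring components
    ('classify', SGDClassifier(loss='log', penalty='elasticnet', random_state=0)),
])
param_grid = {
    'select__k': [10, 20, 50, 100, 250],
    'classify__alpha': [0.001, 0.01, 0.1, 1],
}
sss = StratifiedShuffleSplit(n_splits=100, test_size=0.1, random_state=0)
search = GridSearchCV(pipeline, param_grid, cv=sss, scoring='roc_auc')
search.fit(X_train_pca, y_train)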

@patrick-miller
Member

Using SelectKBest is in essence doing the same thing as iterating over a variety of n_components after running the decomposition. I need to think about it a little more, but I believe you might still be leaking information here -- from the train partition of the CV fold to the test partition of the CV fold. You would get different principal components if you perform the decomp within cross-validation, so it may be a little less robust.

@rdvelazquez
Member Author

but I believe you might still be leaking information here -- from the train partition of the CV fold to the test partition of the CV fold.

I agree, but this may not have much of an effect, and I don't think it's a big deal if our training AUROC score is slightly inflated as long as the testing AUROC score is still accurate.

I also agree that the cross-validation will be less robust this way but if it lets us search over a larger range of n_components I think that will far outweigh the small loss of robustness.

P.S. I know this would be a fairly large departure from the way the notebook is currently set up, so I'm not saying we should do it this way; I'm just thinking about ways to get this thing to run.

@htcai
Member

htcai commented Jul 19, 2017

I also agree that we should fit a PCA instance for each CV fold. A fitted PCA instance is part of the final model, as is the SGDClassifier.

I did PCA together with scaling before the grid search in #71, just because fitting PCA within the grid search was not feasible given the computation resources I had. Still, we can have a separate notebook evaluating the loss of robustness.

@dhimmel
Member

dhimmel commented Jul 19, 2017

A few notes.

The case @patrick-miller describes where n_components = 640 (or more) is optimal will be relatively uncommon, since TP53 is the most mutated gene. In general, there will be fewer positives. I'm okay with not always getting the best model in these extreme cases. The researcher could always iterate by adding larger n_components values to the notebook. The RAM issues are a concern because we want users to be able to run the notebooks locally.

We used to perform option 2 mentioned by @rdvelazquez, but switched to the more correct option 3. I agree that in most cases option 2 will cause little damage with great speedup. Furthermore, the testing values will still be correct, just not the cross-validation scores. However, I'd prefer to use the theoretically correct method, since we don't want to teach users incorrect methods. We'll have to find the delicate balance of a grid search that evaluates enough parameter combinations... but not too many.

@rdvelazquez
Member Author

Closed by #113

We will be using a function to select the number of components based on the number of positive samples in the query (or the number of negatives if it is a rare instance with more positives than negatives). The current function looks like:

if min(num_pos, num_neg) > 500:
    n_components = 100
elif min(num_pos, num_neg) > 250:
    n_components = 50
else:
    n_components = 30
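
Wrapped as a function (the name below is hypothetical, not necessarily what #113 uses), the rule reads:

def select_n_components(num_pos, num_neg):
    """Pick n_components for PCA based on the size of the smaller class."""
    if min(num_pos, num_neg) > 500:
        return 100
    elif min(num_pos, num_neg) > 250:
        return 50
    else:
        return 30

# e.g. select_n_components(num_pos=600, num_neg=6400) returns 100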
