Selecting the number of components returned by PCA #106
Thanks Ryan, this sounds like a good plan. Because of the increased speed
of the pipeline, I'm in favor of just upping the number of components we
search over.
One thing that I will add is that we haven't optimized for the
regularization strength yet (`alpha`). My guess is that the interplay
between it and `n_components` is meaningful. I think it makes sense to
optimize for them jointly.
The process would be the same:
- select several balanced and unbalanced genes
- optimize over a large range of values for `n_components` and `alpha`
- determine the max range for each parameter
- decide if we should limit the range based on a heuristic around class
balance
I won't be there tonight, but I should be free to work on something this
weekend.
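A minimal sketch of what jointly optimizing `alpha` and `n_components` in one grid could look like, assuming a scikit-learn pipeline along the lines the thread discusses (PCA feeding an elastic-net `SGDClassifier`); the names, placeholder data, and parameter ranges here are illustrative, not the repo's actual settings:

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([
    ('standardize', StandardScaler()),
    ('pca', PCA(random_state=0)),
    # loss='log' in the scikit-learn of this era; newer releases spell it 'log_loss'
    ('classify', SGDClassifier(loss='log', penalty='elasticnet')),
])

param_grid = {
    'pca__n_components': [30, 50, 100],                   # illustrative range
    'classify__alpha': [10 ** e for e in range(-3, 2)],   # regularization strength
}

grid = GridSearchCV(pipeline, param_grid, scoring='roc_auc')
# grid.fit(X_train, y_train)  # X_train / y_train are placeholder names
```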
Agreed. I would assume the interplay between `alpha` and `n_components` is meaningful.
@rdvelazquez Thanks for opening this issue! I also think this is a must-do if we want to enable search over optimal `n_components`. Although I'm not sure, it would be great if we could select a particular range of `n_components` in advance, based on the specifics of the query, in order to limit the range of the search. I will play around with dask-searchcv and see how much time / RAM it can save for the pipeline.
Thanks @htcai !
I'm thinking the total sample size (i.e. ~7,000 samples if including all diseases and <7,000 if only including a subset of diseases) may have an impact as well.
Sounds good. I evaluated the speed increase with dask-searchcv here.
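dask-searchcv is designed as a drop-in replacement for scikit-learn's `GridSearchCV`, so the swap is small. A sketch, assuming the `pipeline` and `param_grid` names from the example above:

```python
import dask_searchcv as dcv

# Same estimator API as sklearn's GridSearchCV, but shared pipeline
# stages (scaling, PCA) are computed once per fold rather than once
# per parameter combination.
grid = dcv.GridSearchCV(pipeline, param_grid, scoring='roc_auc')
# grid.fit(X_train, y_train)
```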
For the time being, before we get a better indication of the ideal PCA `n_components` by number of positives and negatives, I'd suggest expanding the search to a larger range of `n_components`. As per #56, I suggest sticking with the default `l1_ratio`. I also think we may want to switch to:

```python
sss = StratifiedShuffleSplit(n_splits=100, test_size=0.1, random_state=0)
```

since non-repeated cross-validation is generally going to be too noisy for our purposes.
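If we do make that switch, the splitter passes straight through as the `cv=` argument. A sketch, reusing the `pipeline`, `param_grid`, and `dcv` names from the examples above:

```python
from sklearn.model_selection import StratifiedShuffleSplit

# 100 stratified 90/10 shuffle splits instead of a single k-fold pass
sss = StratifiedShuffleSplit(n_splits=100, test_size=0.1, random_state=0)
grid = dcv.GridSearchCV(pipeline, param_grid, cv=sss, scoring='roc_auc')
```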
@dhimmel when you say stick to the default, do you mean the default `l1_ratio`?
That's what I was thinking (`l1_ratio=0.15`, the scikit-learn default).
I'm starting to run into RAM bloating issues again with the switch to `StratifiedShuffleSplit` with `n_splits=100`. As for performance with TP53, I have posted a heatmap of the results.
Nice progress @patrick-miller. Thanks for posting the heatmap; that's very informative. Just to confirm... the heatmap above is with `l1_ratio=0.15`, and the RAM issues are using dask-searchcv?
We may want to consider pulling PCA out of CV. I see three places we could do PCA (I'm saying we should consider trying 2 below instead of 3):

1. On all samples (training and testing) before the train/test split
2. On the full training set once, before the grid search / cross-validation
3. Inside the cross-validation pipeline, refit on the training portion of each fold (the current approach)
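A minimal sketch of option 2, with `X_train`/`y_train` as placeholder names and illustrative parameter values: PCA is fit once on the full training set, and only the classifier is refit inside the grid search.

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

# Option 2: a single PCA fit, outside of cross-validation.
pca = PCA(n_components=100).fit(X_train)
X_pca = pca.transform(X_train)

grid = GridSearchCV(
    SGDClassifier(loss='log', penalty='elasticnet', l1_ratio=0.15),
    param_grid={'alpha': [10 ** e for e in range(-3, 2)]},
    scoring='roc_auc',
)
grid.fit(X_pca, y_train)  # every CV fold reuses the one PCA fit
```

Because PCA components are ordered by explained variance, a search over `n_components` under this scheme can just slice the first k columns of `X_pca` rather than refitting PCA for each candidate value.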
I may try to implement this in a notebook (a Jupyter Notebook is worth a thousand words).
Just as an FYI - @dcgoss is getting ml-workers pretty close to up and running. This is the repo where the notebooks will be run in production... on EC2 😄 I'm also working on creating a benchmark dataset that provides good coverage of the different query scenarios. I'll post or PR this once it's further along.
That is with `l1_ratio=0.15`, and yes, the RAM issues are with dask-searchcv. With respect to (2), it may not be a data leak, but it could bias the classifier: there is a significant difference in performance between models with different `n_components` values.
True, but I'm not saying we wouldn't still search over a range of `n_components`.
I agree, but this may not have much of an effect, and I don't think it's a big deal if our training scores end up somewhat optimistic. I also agree that the cross-validation will be less robust this way, but if it lets us search over a larger range of `n_components`, it may be worth it.

P.S. I know this would be a fairly large departure from the way the notebook is currently set up, so I'm not saying we should do it this way; I'm just thinking about ways to get this thing to run.
I also agree that we should fit PCA outside of the grid search. I did PCA together with scaling before the grid search in #71, just because doing it inside the pipeline was not feasible given the computation resources I had. Still, we can have a separate notebook evaluating the loss of robustness.
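A sketch of what that separate notebook might measure, again with `X_train`/`y_train` as placeholders and `n_components=50` as a representative setting:

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Option 3: PCA refit within every CV fold (theoretically correct).
inside = Pipeline([
    ('pca', PCA(n_components=50)),
    ('classify', SGDClassifier(loss='log', penalty='elasticnet')),
])
scores_inside = cross_val_score(inside, X_train, y_train, scoring='roc_auc')

# Option 2: one PCA fit on all training data, shared across folds.
X_pca = PCA(n_components=50).fit_transform(X_train)
scores_outside = cross_val_score(
    SGDClassifier(loss='log', penalty='elasticnet'),
    X_pca, y_train, scoring='roc_auc')

# The gap between the two score distributions estimates the optimism
# introduced by fitting PCA outside of cross-validation.
```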
A few notes. The case @patrick-miller describes, where performance differs substantially across `n_components` values, is worth keeping in mind. We used to perform option 2 mentioned by @rdvelazquez, but switched to the more correct option 3. I agree that in most cases option 2 will cause little damage with great speedup. Furthermore, the testing values will still be correct, just not the cross-validation scores. However, I'd prefer to use the theoretically correct method, since we don't want to teach users incorrect methods. We'll have to find the delicate balance of a grid search that evaluates enough parameter combinations... but not too many.
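For a sense of scale (illustrative numbers, not figures from this thread): a grid of 3 `n_components` values and 5 `alpha` values evaluated over 100 shuffle splits already requires 3 × 5 × 100 = 1,500 pipeline fits before the final refit.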
Closed by #113. We will be using a function to select the number of components based on the number of positive samples in the query (or the number of negatives if it is a rare instance with more positives than negatives). The current function looks like:

```python
if min(num_pos, num_neg) > 500:
    n_components = 100
elif min(num_pos, num_neg) > 250:
    n_components = 50
else:
    n_components = 30
```
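Wrapped as a callable for clarity (the function name and signature here are illustrative; #113 has the authoritative version):

```python
def select_n_components(num_pos, num_neg):
    """Choose PCA n_components from the minority-class size (illustrative wrapper)."""
    if min(num_pos, num_neg) > 500:
        return 100
    elif min(num_pos, num_neg) > 250:
        return 50
    return 30

# Example: a query with 300 positives and 6,700 negatives gets 50 components,
# since min(300, 6700) = 300 falls between 250 and 500.
print(select_n_components(300, 6700))  # -> 50
```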
A topic that has come up a number of times is how many components should be returned by PCA. The number of components (`n_components`) can be a parameter that is searched across in `GridSearchCV`. This used to cause problems with thrashing (Integrating dimensionality reduction into the pipeline #43), but those problems seem to have been eliminated by using the dask-searchcv implementation of `GridSearchCV` (Evaluate dask-searchcv to speed up GridSearchCV #94).

Now that `n_components` can be included in `GridSearchCV`, we would like to limit the range that needs to be searched over based on the specifics of the query (i.e. how many samples are included {the user's filter by disease} and how many positive/negative mutations there are {the user's filter by gene(s)}). My hunch is that the ideal `n_components` is larger for balanced datasets (equal numbers of mutated and non-mutated samples) and smaller for unbalanced datasets (typically a small number of mutated samples). Using a small `n_components` for balanced datasets results in low training and testing scores. Using a large `n_components` for unbalanced datasets results in higher training and lower testing scores (over-fitting).

I'm thinking the next step should be creating a dataset that provides good coverage of the different query scenarios (#11) and performing `GridSearchCV` on these datasets, searching over a range of `n_components` to see how changing `n_components` affects performance (AUROC).

@dhimmel, @htcai, @patrick-miller feel free to comment now or we can discuss at tonight's meetup.