Add Gene Expression Coefficients for Individual Genes #105
Conversation
…d of Principal Components
@rdvelazquez, yes this would be helpful. Can you first make the changes in the
```python
n_genes = len(expression_df.columns)
X_train_expression = X_train[X.columns[-n_genes:]]
pca = PCA(n_components = 50, random_state = 0)
pca.fit(X_train_expression)
```
Hmm, you're refitting the PCA? Shouldn't we extract the `components_` from the transformation already fit in the pipeline rather than creating a new transformation?
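A minimal sketch of what that could look like (the pipeline layout, the step names 'scale'/'pca'/'classify', and the synthetic data are assumptions for illustration, not the repo's actual code, which uses the TCGA expression data and an elastic-net classifier):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the expression matrix (samples x genes).
rng = np.random.RandomState(0)
X_train = rng.lognormal(size=(200, 500))
y_train = rng.randint(0, 2, size=200)

pipeline = Pipeline([
    ('scale', StandardScaler()),
    ('pca', PCA(n_components=50, random_state=0)),
    ('classify', LogisticRegression()),
])
pipeline.fit(X_train, y_train)

# Reuse the PCA that is already fit inside the pipeline rather than refitting.
fit_pca = pipeline.named_steps['pca']
print(fit_pca.components_.shape)  # (n_components, n_genes) = (50, 500)
```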
Agreed. I moved the notebook and .py file to a new location. We may be able to just keep this as a work-in-progress pull request for now and then incorporate the changes into the original notebook 2 once we confirm that everything is working OK; no need to add this as a separate notebook. In hindsight, I could have just made the changes in the original notebook 2 rather than creating a copy.
The other methodology that was discussed at the Meetup was to run a matrix through the fit pipeline to discover the independent contributions of each gene. If the gene expressions were binary (0/1), the input matrix would just be the identity matrix (one row per gene, with a 1 for that gene and 0's elsewhere).
Instead, we want something like a matrix that has 1's on the diagonal and -1's elsewhere.
I don't think this is quite sufficient, because the standardization happens inside the pipeline, so this matrix would also be normalized. We want a matrix that looks like this after the StandardScaler. It would be ideal if there were an easy way to run the matrix through the pipeline while skipping this step, but I don't think that is doable. We can pull out the scaler and perform its inverse on the above matrix, and I think that would work as well.
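Continuing the sketch above (reusing its hypothetical `pipeline` and `X_train`), one way to realize the "pull out the scaler and invert it" idea might look like the following; treat it as a sketch of the idea being discussed, not a tested implementation:

```python
import numpy as np

# Build a one-gene-at-a-time probe matrix. Feeding the identity matrix directly
# would be re-standardized inside the pipeline, so invert the scaler first: after
# the pipeline's StandardScaler, the rows become (approximately) the intended
# one-hot rows again.
scaler = pipeline.named_steps['scale']
n_genes = X_train.shape[1]
probe = scaler.inverse_transform(np.eye(n_genes))

# Score each probe row with the fitted classifier; decision_function keeps the
# sign, so positive and negative per-gene effects remain distinguishable.
gene_effect = pipeline.decision_function(probe)
print(gene_effect.shape)  # one value per gene
```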
Great summary @patrick-miller.
I think this matrix itself wouldn't be normalized; each sample would just be transformed using the normalization that was fit on the training set. I think this would be OK, but I also like your idea of pulling out the scaler and performing its inverse. After talking about it last night, I think we agreed that the current implementation in this PR (taking the dot product of the PCA coefficients and the classifier coefficients and summing the resulting column for each gene; see the notebook for details) is likely giving us the correct ranking of genes by importance. The "one-hot encoding" that you are describing would be a way to confirm that the current implementation is correct, but I think the current implementation may be more intuitive and meaningful. Thoughts? Feel free to try the "one-hot encoding" method as a check. I'll clean up my implementation in this notebook and try to run the classifier using just the top [10, 20, 50...] genes (as the current implementation ranks them) as another check. (I may not get to this for a little while.) I'm tagging @wisygig, @George-Zipperlen and @htcai as they were also involved in this discussion. Do you know Kyle's GitHub handle?
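A rough sketch of the dot-product idea described above, again continuing the hypothetical pipeline from the earlier snippet (variable names are mine, not the notebook's):

```python
import pandas as pd

# PCA loadings are (n_components, n_genes) and the classifier coefficients are
# (1, n_components), so their product collapses to one weight per gene.
fit_pca = pipeline.named_steps['pca']
classifier = pipeline.named_steps['classify']
gene_weights = (classifier.coef_ @ fit_pca.components_).ravel()

# Wrap in a Series so genes can be ranked by the magnitude of their weight.
gene_weights = pd.Series(gene_weights,
                         index=[f'gene_{i}' for i in range(len(gene_weights))])
print(gene_weights.abs().sort_values(ascending=False).head(10))
```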
My last commit did two things: it cleaned up the implementation and it tested the classifier using only the top-ranked genes.
[Table of testing AUROC for models trained on only the top-ranked genes]
The fairly high AUROC for the top 10 and 30 genes provides some confirmation that this method is correctly extracting the information about which individual genes contributed the most to the classifier. As an aside: the fact that the top 10 genes from the MAD feature selection notebook performed better than the top 10 genes as ranked by this notebook was surprising to me... When I re-ran this notebook with an l1_ratio of 0 instead of 0.15 (see #56), the model trained on only the top 10 genes had a testing AUROC of 85.5%. This may be another line of evidence supporting the use of an l1_ratio of 0 for models trained on PCA components.
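The top-n check could be sketched roughly as below (hypothetical names, continuing the synthetic example; the real notebook uses its own train/test split and elastic-net classifier, so these synthetic numbers will not reproduce the AUROC values quoted above):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Refit a simple classifier on only the top-k genes (by absolute weight) and
# compare testing AUROC. This is only a rough sanity check of the gene ranking.
X_tr, X_te, y_tr, y_te = train_test_split(X_train, y_train, random_state=0)
ranked_genes = gene_weights.abs().sort_values(ascending=False).index

for k in (10, 30, 100):
    top_idx = [int(g.split('_')[1]) for g in ranked_genes[:k]]
    model = LogisticRegression().fit(X_tr[:, top_idx], y_tr)
    auroc = roc_auc_score(y_te, model.predict_proba(X_te[:, top_idx])[:, 1])
    print(k, round(auroc, 3))
```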
@patrick-miller any comments on this PR?
The logic looks good to me. You can probably shorten it up a little bit and make the style conform to the rest of the notebook (snake vs. camel case), but I'm not sure if we have a style guide for the project (@dhimmel). If he wants, I can make those suggestions. Totally agree on the
It is definitely important to measure the contribution of each individual gene in the fitted model. Although I still need to work through every detail of the notebook, my initial thought is that we may want to sum the magnitudes of each gene's weights regardless of positive / negative. For example, two features could have similar signed sums even though one of them (say, with weights -8 and 9 on two components) clearly carries much more information.
Thanks for looking at this @htcai!
I'm with you until "regardless of positive / negative". I think the sign of the "weight" relates to which direction that "weight" influences the prediction. So in your example, the two weights for feature A (-8 and 9) could be thought of as acting in opposite directions (the -8 strongly predicting no mutation and the 9 strongly predicting a mutation), so these two "predictions" (for lack of a better word) would in essence cancel each other out. If you want to test out your hypothesis, feel free to modify the notebook.
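A toy illustration of the sign question (the -8 and 9 come from the example above; everything else, including the second feature's weights and the component coefficients, is made up for illustration):

```python
import numpy as np

# Columns are two hypothetical features A and B; rows are two principal components.
loadings = np.array([[-8.0, 4.0],
                     [ 9.0, 4.0]])
clf_coef = np.array([1.0, 1.0])  # pretend the classifier weights both components equally

signed_sum = clf_coef @ loadings                     # A:  1, B: 8 -> A's effects mostly cancel
magnitude_sum = np.abs(clf_coef) @ np.abs(loadings)  # A: 17, B: 8 -> A looks most "important"
print(signed_sum, magnitude_sum)
```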
@patrick-miller the
PEP8. I also use flake8, which is even stricter. I'd recommend function and variable names that are snake_case. @rdvelazquez, can you move the notebook and script export into its own directory in explore?
The function should return the weighting for all genes. Leave the top_n decision to the user at display time.
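To illustrate the suggestion, a hypothetical helper might look something like this (the function name, docstring, and step labels are assumptions, not the repo's actual code); any top-n filtering is left to the caller at display time:

```python
import pandas as pd

def get_gene_coefficients(pipeline, gene_names):
    """Return the classifier weight attributed to every gene.

    Combines the fitted PCA loadings with the classifier coefficients and
    returns a weight for all genes; top-n selection is done by the caller.
    """
    pca = pipeline.named_steps['pca']
    classifier = pipeline.named_steps['classify']
    weights = (classifier.coef_ @ pca.components_).ravel()
    return pd.Series(weights, index=gene_names, name='gene_coefficient')

# Example display-time use: show only the ten genes with the largest |weight|.
# gene_coefs = get_gene_coefficients(pipeline, expression_df.columns)
# gene_coefs.reindex(gene_coefs.abs().sort_values(ascending=False).index[:10])
```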
Thanks for looking at this @dhimmel! I tried to make the function follow PEP8. I'm not too familiar with docstrings so I just tried to follow pep257 as best I could. I had never heard of flake8 before but it was really easy to use and pretty amazing. I just pulled out the function into a separate file and ran flake8 on the function. (I had lots of extra white space and lines >79 characters) I also moved the files to a new directory in explore.
This is the way the function was originally written, but it was just poorly named. I renamed it to reflect the fact that it returns the coefficients for all genes, not just the top genes.
This is now ready to review/merge.
Thanks for this work @rdvelazquez.
Go ahead and open the PR to edit 2.mutation-classifier.ipynb. I'll review the code to a higher degree in that PR, since it will become part of the reference implementation.
`utils` is just a collection of utility functions that we didn't want
clogging up the notebooks. It is in the same base directory.
On Jul 16, 2017, 2:52 PM, "Haitao Cai" wrote:
btw, what is the package `utils`? It seems that it is not included in the `environment.yml`. According to Google, the only two results containing `fill_spec_with_data` are from this repo.
This notebook is my first attempt at extracting the information about which individual genes contributed the most to the classifier. This isn't straightforward because the classifier wasn't trained on the individual genes but on the principal components of the individual genes (the projections of the expression data onto the PCA components). Outputting which principal components were most impactful is easy, but that information is not very meaningful or useful.
I looked online for examples similar to what we are trying to do and couldn't find any.
I tried to provide comments and print statements to make my method easier to follow; the notebook is fairly self-explanatory (I only edited lines 22-25 and included a brief summary after line 25). @patrick-miller and @dhimmel, feel free to comment now or we can discuss at Tuesday's meetup.