Add Gene Expression Coefficients for Individual Genes #105

rdvelazquez · 2017-06-25T20:40:46Z

This notebook is my first attempt at extracting which individual genes contributed the most to the classifier. This isn't straightforward because the classifier wasn't trained on the individual genes but on their principal components (the projections of the expression data onto the eigenvectors from PCA). Outputting which principal components were most impactful is easy, but that information is not very meaningful or useful.

I looked online for examples similar to what we are trying to do and couldn't find any.

I tried to provide comments and print statements to make my method easier to follow; the notebook is fairly self-explanatory (I only edited lines 22-25 and included a brief summary after line 25). @patrick-miller and @dhimmel, feel free to comment now, or we can discuss at Tuesday's meetup.

dhimmel · 2017-06-26T14:47:29Z

@rdvelazquez, yes this would be helpful. Can you first make the changes in the explore directory. We want to keep the number of notebooks in the root directory as low as possible.

dhimmel · 2017-06-26T14:51:09Z

2.mutation-classifier_(WIP_Gene-Weights).py

+n_genes = len(expression_df.columns)
+X_train_expression = X_train[X.columns[-n_genes:]]
+pca = PCA(n_components = 50, random_state = 0)
+pca.fit(X_train_expression)


Hmm you're refitting the PCA? Shouldn't we extract the components_ from the transformation already fit in the pipeline rather than create a new transformation?
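
A minimal sketch of that suggestion, assuming the classifier is a scikit-learn `Pipeline` whose PCA step is registered under the hypothetical name `pca`. The step names and the random stand-in data are assumptions for illustration, not the notebook's actual objects:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data: 60 samples x 20 "genes" (hypothetical, for illustration only).
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(60, 20))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

# Step names ("scale", "pca", "classify") are assumed, not taken from the notebook.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=5, random_state=0)),
    ("classify", SGDClassifier(random_state=0)),
])
pipeline.fit(X_demo, y_demo)

# Reuse the PCA that was fit inside the pipeline instead of fitting a new one.
fitted_pca = pipeline.named_steps["pca"]
components = fitted_pca.components_  # shape: (n_components, n_genes)
```

If the pipeline is wrapped in a grid search, the fitted pipeline would first be reached through `best_estimator_`.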

rdvelazquez · 2017-06-27T15:38:26Z

> Can you first make the changes in the explore directory. We want to keep the number of notebooks in the root directory as low as possible.

Agreed. I moved the notebook and .py file to explore.

We may be able to keep this as a work-in-progress pull request for now and then incorporate the changes into the original notebook 2 once we confirm that everything is working; there is no need to add this as a separate notebook. In hindsight, I could have just made the changes in the original notebook 2 rather than creating a copy.

patrick-miller · 2017-06-28T15:47:53Z

The other methodology that was discussed at the Meetup was to run a matrix through the fit pipeline to discover the independent contributions of each gene. If the gene expressions were binary (0/1), the input matrix would just be:

n_genes = 10000
gene_matrix = np.eye(n_genes)

Instead, we want something like the following which has 1's on the diagonal and -1's elsewhere:

n_genes = 10000
gene_matrix = np.zeros((n_genes, n_genes)) - 1
np.fill_diagonal(gene_matrix, 1)

I don't think this is quite sufficient, because the standardization happens inside the pipeline so this matrix would also be normalized. We want a matrix that looks like this after the StandardScaler.

It would be ideal if there was an easy way to run the matrix through the pipeline while skipping this step, but I don't think that is doable. We can pull out the scaler and perform the inverse of it on the above matrix, and I think that would work as well.
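
A rough sketch of the inverse-scaler idea, assuming a scikit-learn `Pipeline` with a `StandardScaler` step. The step names, the small `n_genes`, and the random data are all hypothetical stand-ins:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
n_genes = 20  # small stand-in for the 10000 genes mentioned above
X_demo = rng.normal(loc=5.0, scale=2.0, size=(80, n_genes))
y_demo = (X_demo[:, 0] > 5.0).astype(int)

# Step names are assumed for illustration.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=5, random_state=0)),
    ("classify", SGDClassifier(random_state=0)),
])
pipeline.fit(X_demo, y_demo)

# Target matrix in standardized space: 1 on the diagonal, -1 elsewhere.
gene_matrix = np.full((n_genes, n_genes), -1.0)
np.fill_diagonal(gene_matrix, 1.0)

# Undo the scaler so that the pipeline's own scaling step restores gene_matrix.
scaler = pipeline.named_steps["scale"]
raw_matrix = scaler.inverse_transform(gene_matrix)

# One decision score per row, i.e. per gene set to +1.
scores = pipeline.decision_function(raw_matrix)
```

Because `raw_matrix` is built with `inverse_transform`, the scaling step inside the pipeline maps it back to `gene_matrix` before PCA is applied.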

rdvelazquez · 2017-06-28T16:31:13Z

Great summary @patrick-miller.

> I don't think this is quite sufficient, because the standardization happens inside the pipeline so this matrix would also be normalized.

I think this matrix itself wouldn't be normalized; each sample would just be transformed using the normalization that was fit on the training set. I think this would be OK, but I also like your idea of pulling out the scaler and performing the inverse.
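
To illustrate the point about the fitted normalization, here is a small sketch with made-up numbers showing that `StandardScaler.transform` reuses the statistics learned during `fit` rather than re-standardizing the new matrix:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up training data: 3 samples x 2 features.
X_train = np.array([[0.0, 10.0], [2.0, 30.0], [4.0, 50.0]])
scaler = StandardScaler().fit(X_train)

new_rows = np.array([[1.0, -1.0], [-1.0, 1.0]])

# transform() applies the training mean_ and scale_; it does not refit on new_rows.
transformed = scaler.transform(new_rows)
manual = (new_rows - scaler.mean_) / scaler.scale_
```

The columns of `transformed` therefore do not generally have zero mean or unit variance.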

After talking about it last night, I think we agreed that the current implementation in this PR (taking the dot product of the PCA coefficients and the classifier coefficients and summing the resulting column for each gene; see the notebook for details) is likely giving us the correct ranking of genes by importance. The "one hot encoding" that you are describing would be a way to confirm that the current implementation is correct, but I think the current implementation may be more intuitive and meaningful. Thoughts?
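
As a hedged illustration of the dot-product approach (not the notebook's actual code), the sketch below fits a scaler, PCA, and linear classifier on random stand-in data and then collapses the component-level coefficients into one value per gene. All variable names here are hypothetical:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X_demo = rng.normal(size=(80, 20))  # 80 samples x 20 stand-in "genes"
y_demo = (X_demo[:, 0] - X_demo[:, 3] > 0).astype(int)

scaler = StandardScaler().fit(X_demo)
X_scaled = scaler.transform(X_demo)
pca = PCA(n_components=5, random_state=0).fit(X_scaled)
clf = SGDClassifier(random_state=0).fit(pca.transform(X_scaled), y_demo)

# Weight each component's gene loadings by that component's classifier
# coefficient, then sum over components to get one value per gene.
weighted = pca.components_ * clf.coef_.T  # shape: (n_components, n_genes)
gene_coefs = weighted.sum(axis=0)         # shape: (n_genes,)
```

For a linear classifier on unwhitened PCA output, applying `gene_coefs` to the centered, standardized expression values and adding the intercept reproduces the classifier's decision scores.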

Feel free to try the "one hot encoding" method as a check. I'll clean up my implementation in this notebook and try to run the classifier using just the top [10, 20, 50...] genes (as the current implementation ranks them) as another check. (I may not get to this for a little while.)

I'm tagging @wisygig, @George-Zipperlen and @htcai as they were also involved in this discussion. Do you know Kyle's GitHub handle?

rdvelazquez · 2017-07-09T11:23:54Z

My last commit did two things:

Reduced the gene-coefficient extraction to a single simplified function. This function can be added to the main notebook 2 if we agree that this is the best way to extract the gene coefficients.
Evaluated the extracted gene coefficients by training a model on the top 10 and top 30 individual genes (ranked by the absolute value of their "coefficients"). I also included a model trained on the top 10 genes, as ranked by the old MAD feature selection notebook, as a frame of reference. The results:

Testing AUROC

Top 10 genes: 82.1%
Top 30 genes: 90.2%
Top 10 genes (from MAD Notebook): 85.7%

The fairly high AUROC for the top 10 and top 30 genes provides some confirmation that this method is correctly extracting the information about which individual genes contributed the most to the classifier.
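
The kind of check described above could be sketched roughly as follows. The data, the `gene_coefs` vector, and the helper name `auroc_for_top_genes` are all hypothetical, and the classifier settings are scikit-learn defaults rather than the notebook's:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X_demo = rng.normal(size=(200, 30))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)
# Hypothetical per-gene coefficients; in practice these would come from the fitted model.
gene_coefs = rng.normal(size=30)


def auroc_for_top_genes(X, y, coefs, top_n):
    """Train on the top_n genes by |coefficient| and return the testing AUROC."""
    top_idx = np.argsort(-np.abs(coefs))[:top_n]
    X_train, X_test, y_train, y_test = train_test_split(
        X[:, top_idx], y, test_size=0.25, random_state=0, stratify=y)
    model = make_pipeline(StandardScaler(), SGDClassifier(random_state=0))
    model.fit(X_train, y_train)
    return roc_auc_score(y_test, model.decision_function(X_test))


auroc_top10 = auroc_for_top_genes(X_demo, y_demo, gene_coefs, top_n=10)
```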

As an aside: The fact that the top 10 genes from the MAD feature selection notebook performed better than the top 10 genes as ranked by this notebook was surprising to me... When I re-ran this notebook with an l1_ratio of 0 instead of 0.15 (see #56) the model trained on only the top 10 genes had a testing AUROC of 85.5%. This may be another line of evidence supporting the use of an l1_ratio of 0 for models trained on PCA components.

rdvelazquez · 2017-07-11T14:13:00Z

@patrick-miller any comments on this PR?

patrick-miller · 2017-07-11T14:48:04Z

The logic looks good to me. You can probably shorten it up a little bit, and make the style conform to the rest of the notebook (snake vs. camel case), but I'm not sure if we have a style guide for the project (@dhimmel). If he wants, I can make those suggestions.

Totally agree on the l1_ratio comment -- let me know if it gets discussed tonight.

htcai · 2017-07-11T23:55:04Z

It is definitely important to measure the contribution of each individual gene in the fitted model. Although I still need to work through every detail of IdentifyTopGenesAfterPCA, I am wondering whether it makes more sense to take the absolute or squared values of combinedDF before taking the column sum.

For example, consider two features A and B with "weights" [-8, 9] and [2, 2], respectively. I am inclined to believe that A has a higher weight than B, even though the former sums to 1 whereas the latter sums to 4. I informally understand "weight" in this scenario as the extent to which the fitted model (or equivalently, prediction) is influenced by a feature, regardless of positive / negative.

rdvelazquez · 2017-07-12T12:33:54Z

Thanks for looking at this @htcai!

> I informally understand "weight" in this scenario as the extent to which the fitted model (or equivalently, prediction) is influenced by a feature, regardless of positive / negative.

I'm with you until "regardless of positive / negative". I think the sign of the "weight" indicates the direction in which that "weight" influences the prediction. So in your example, the two weights for feature A (-8 and 9) could be thought of as acting in opposite directions (the -8 strongly predicting no mutation and the 9 strongly predicting a mutation), so these two "predictions" (for lack of a better word) would in essence cancel each other out.

If you want to test out your hypothesis, feel free to modify IdentifyTopGenesAfterPCA to use the absolute values and see how the top genes returned by that modified function do. This notebook is already set up in a way that would make that fairly easy to do.
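
The two aggregation rules under discussion, applied to the example weights for A and B. This is only an illustration of the arithmetic, not the notebook's function:

```python
import numpy as np

# Rows are features; columns are the per-component "weights" from the example.
weights = np.array([[-8, 9],   # feature A
                    [2, 2]])   # feature B

signed_sum = weights.sum(axis=1)            # A -> 1,  B -> 4
absolute_sum = np.abs(weights).sum(axis=1)  # A -> 17, B -> 4
```

The signed sum ranks B above A, while the absolute sum ranks A above B, which is exactly the disagreement described above.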

rdvelazquez · 2017-07-12T12:36:58Z

@patrick-miller the l1_ratio didn't really get discussed last night. Where @dhimmel and I left this was:

@dhimmel will review this PR and provide comments.
I'll address his comments and we will add this notebook into explore.
I'll add the IdentifyTopGenesAfterPCA function to utils.py and use the function to show the top genes in notebook 2.

dhimmel · 2017-07-13T23:57:41Z

> I'm not sure if we have a style guide for the project

PEP8. I also use flake8 which is even stricter. I'd recommend function and variable names that are all_lowercase.

@rdvelazquez can you move the notebook and script export into its own directory in explore?

> I'll add the IdentifyTopGenesAfterPCA function to utils.py and use the function to show the top genes in notebook 2.

The function should return the weighting for all genes. Leave the top_n decision to the user at display time.
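
One way that separation could look, as a sketch only. The function name `all_gene_weights`, its signature, and the toy inputs are hypothetical and not the function from this PR:

```python
import numpy as np
import pandas as pd


def all_gene_weights(components, component_coefs, gene_names):
    """Return one weight per gene, sorted by absolute value (no truncation)."""
    weights = np.asarray(component_coefs) @ np.asarray(components)
    series = pd.Series(weights, index=gene_names, name="weight")
    order = series.abs().sort_values(ascending=False).index
    return series.reindex(order)


# Toy inputs: 2 components x 3 genes (hypothetical values).
components = [[0.5, -0.5, 0.0],
              [0.0, 0.5, 0.5]]
component_coefs = [2.0, -1.0]
weights = all_gene_weights(components, component_coefs, ["g1", "g2", "g3"])

# The caller decides how many genes to show at display time.
top_2 = weights.head(2)
```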

rdvelazquez · 2017-07-14T04:24:57Z

Thanks for looking at this @dhimmel!

I tried to make the function follow PEP8. I'm not too familiar with docstrings, so I just tried to follow PEP 257 as best I could. I had never heard of flake8 before, but it was really easy to use and pretty amazing. I just pulled the function out into a separate file and ran flake8 on it. (I had lots of extra whitespace and lines longer than 79 characters.)

I also moved the files to a new directory in explore.

> The function should return the weighting for all genes. Leave the top_n decision to the user at display time.

This is the way the function was originally written; it was just poorly named. I renamed it to reflect the fact that it returns all the gene coefficients, not just those for the top genes.

rdvelazquez · 2017-07-14T12:46:37Z

This is now ready to review/merge.

dhimmel

Thanks for this work @rdvelazquez.

Go ahead and open the PR to edit 2.mutation-classifier.ipynb. I'll review the code to a higher degree in that PR, since it will become part of the reference implementation.

patrick-miller · 2017-07-16T20:55:48Z

`utils` is just a collection of utility functions that we didn't want clogging up the notebooks. It is in the same base directory.

On Jul 16, 2017 2:52 PM, "Haitao Cai" wrote: btw, what is the package utils? It seems that it is not included in the environment.yml. According to Google, the only two results containing fill_spec_with_data are from this repo.

Ryan Velazquez added 2 commits June 25, 2017 16:23

WIP - Adding Gene Expression Coefficinets for Individual Genes instea…

15d757f

…d of Principal Components

add .py file

61a3f68

dhimmel reviewed Jun 26, 2017

move files to explore and extract pca from pipeline

5480609

Reduce to function and test top genes

5f09bbb

rename func and follow pep 8

ca85c45

Ryan Velazquez added 3 commits July 14, 2017 00:36

move to new folder

e65838b

export .py file and delete duplicate

ded267c

replace .py file with correct version

f5795e0

dhimmel approved these changes Jul 14, 2017

dhimmel changed the title ~~WIP - Adding Gene Expression Coefficients for Individual Genes~~ Add Gene Expression Coefficients for Individual Genes Jul 14, 2017

dhimmel merged commit 5b6167b into cognoma:master Jul 14, 2017

rdvelazquez deleted the gene-coefficients branch July 16, 2017 03:02

rdvelazquez mentioned this pull request Jul 16, 2017

Get gene coefficients #109

Merged
