Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Gene Expression Coefficients for Individual Genes #105

Merged
merged 8 commits into from
Jul 14, 2017

Conversation

rdvelazquez
Copy link
Member

This notebook is my first attempt at trying to extract the information about which individual genes contributed the most to the classifier. This isn't straightforward because the classifier wasn't trained on the individual genes but on the principal components of the individual genes (the resulting eigenvalues from PCA). Outputting which principal components were most impactful is easy but that information is not very meaningful or useful.

I looked online for examples similar to what we are trying to do and couldn't find any.

I tried to provide comments and print statements to make my method easier to follow; the notebook is fairly self explanatory (I only edited lines 22-25 and included a brief summary after line 25). @patrick-miller and @dhimmel feel free to comment now or we can discuss at Tuesday's meetup.

@dhimmel
Copy link
Member

dhimmel commented Jun 26, 2017

@rdvelazquez, yes this would be helpful. Can you first make the changes in the explore directory. We want to keep the number of notebooks in the root directory as low as possible.

n_genes = len(expression_df.columns)
X_train_expression = X_train[X.columns[-n_genes:]]
pca = PCA(n_components = 50, random_state = 0)
pca.fit(X_train_expression)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm you're refitting the PCA? Shouldn't we extract the components_ from the transformation already fit in the pipeline rather than create a new transformation?

@rdvelazquez
Copy link
Member Author

Can you first make the changes in the explore directory. We want to keep the number of notebooks in the root directory as low as possible.

Agreed. I moved the notebook and .py file to explore.

We may be able to just keep this as a work in progress pull request for now and then incorporate the changes into the original notebook 2 once we confirm that everything is working ok; no need to add this notebook as a separate notebook. In hind-sight I could have just made the changes in the original notebook 2 rather than creating a copy.

@patrick-miller
Copy link
Member

patrick-miller commented Jun 28, 2017

The other methodology that was discussed at the Meetup was to run a matrix through the fit pipeline to discover the independent contributions of each gene. If the gene expressions were binary (0/1), the input matrix would just be:

n_genes = 10000
gene_matrix = np.eye(n_genes)

Instead, we want something like the following which has 1's on the diagonal and -1's elsewhere:

n_genes = 10000
gene_matrix = np.zeros((n_genes, n_genes)) - 1
np.fill_diagonal(gene_matrix, 1)

I don't think this is quite sufficient, because the standardization happens inside the pipeline so this matrix would also be normalized. We want a matrix that looks like this after the StandardScaler.

It would be ideal if there was an easy way to run the matrix through the pipeline while skipping this step, but I don't think that is doable. We can pull out the scaler and perform the inverse of it on the above matrix, and I think that would work as well.

@rdvelazquez
Copy link
Member Author

rdvelazquez commented Jun 28, 2017

Great summary @patrick-miller.

I don't think this is quite sufficient, because the standardization happens inside the pipeline so this matrix would also be normalized.

I think this matrix itself wouldn't be normalized, just each sample would be transformed to the normilzation that was fit on the training set. I think this would be OK but I also like your idea of pulling out the scaler and performing the inverse.

After talking about it last night I think we agreed that the current implementation that I have in this PR (taking the dot product of the PCA coefficients and the classifier coefficients and summing the resulting column for each gene) (see the notebook for details) is likely giving us the correct ranking of genes by importance. The "one hot encoding" that you are describing would be a way to confirm that the current implementation is correct but I think the current implementation may be more intuitive and meaningful. Thoughts?

Feel free to try the "one hot encoding" method as a check. I'll clean-up my implementation in this notebook and try to run the classifier using just the top [10, 20, 50...] genes (as the current implementation ranks them) as another check. (I may not get to this for a little while)

I'm tagging @wisygig, @George-Zipperlen and @htcai as they were also involved in this discussion. Do you know Kyle's github handle?

@rdvelazquez
Copy link
Member Author

My last commit did two things:

  1. Reduced extracting the gene coefficients into a simplified function. This function can be added to the main notebook 2 if we agree that this is the best way to be extracting the gene coefficients.

  2. Evaluated the extracted gene coefficients by training a model on the top 10 and top 30 individual genes (ranked by absolute value of their "coefficients"). I also include a model trained on the top 10 genes, as ranked by the old MAD feature selection notebook, as a frame of reference. The results:

Testing AUROC

  • Top 10 genes: 82.1%
  • Top 30 genes: 90.2%
  • Top 10 genes (from MAD Notebook): 85.7%

The fairly high AUROC for the top 10 and 30 genes provides some confirmation that this method is correctly extracting the information about which individual genes contributed the most the the classifier.

As an aside: The fact that the top 10 genes from the MAD feature selection notebook performed better than the top 10 genes as ranked by this notebook was surprising to me... When I re-ran this notebook with an l1_ratio of 0 instead of 0.15 (see #56) the model trained on only the top 10 genes had a testing AUROC of 85.5%. This may be another line of evidence supporting the use of an l1_ratio of 0 for models trained on PCA components.

@rdvelazquez
Copy link
Member Author

@patrick-miller any comments on this PR?

@patrick-miller
Copy link
Member

The logic looks good to me. You can probably shorten it up a little bit, and make the style conform to the rest of the notebook (snake vs. camel case), but I'm not sure if we have a style guide for the project (@dhimmel). If he wants, I can make those suggestions.

Totally agree on the l1_ratio comment -- let me know if it gets discussed tonight.

@htcai
Copy link
Member

htcai commented Jul 11, 2017

It is definitely important to measure the contribution of each individual gene in the fitted model. Although I still need to work through every detail of IdentifyTopGenesAfterPCA, I am wondering whether it makes more sense to take the absolute or squared values of combinedDF before taking the column sum.

For example, two features A and B, which have "weights" [-8, 9] and [2, 2] respectively. I may intend to believe that A has a higher weight than does B, though the former has a sum of 1 whereas the latter 4. I informally understand "weight" in this scenario as the extent to which the fitted model (or equivalently, prediction) is influenced by a feature, regardless of positive / negative.

@rdvelazquez
Copy link
Member Author

rdvelazquez commented Jul 12, 2017

Thanks for looking at this @htcai!

I informally understand "weight" in this scenario as the extent to which the fitted model (or equivalently, prediction) is influenced by a feature, regardless of positive / negative.

I'm with you until "regardless of positive / negative". I think the sign of the "weight" relates to which direction that "weight" influences the prediction. So in your example, the two weights for feature A (-8 and 9) could be thought of as acting in opposite directions (the -8 strongly predicting no mutation and the 9 strongly predicting a mutation) so these to "predictions" (for lack of a better word) would in essence cancel each other out.

If you want to test out your hypothesis, feel free to modify IdentifyTopGenesAfterPCA to use the absolute values and see how the top genes returned by that modified function do. This notebook is already set up in a way that would make that fairly easy to do.

@rdvelazquez
Copy link
Member Author

@patrick-miller the l1_ratio didn't really get discussed last night. Where @dhimmel and I left this was:

  1. @dhimmel will review this PR and provide comments.
  2. I'll address his comments and we will add this notebook into explore
  3. I'll add the IdentifyTopGenesAfterPCA function to utils.py and use the function to show the top genes in notebook 2

@dhimmel
Copy link
Member

dhimmel commented Jul 13, 2017

I'm not sure if we have a style guide for the project

PEP8. I also use flake8 which is even stricter. I'd recommend function and variable names that are all_lowercase.

@rdvelazquez can you move the notebook and script export into its own directory in explore?

I'll add the IdentifyTopGenesAfterPCA function to utils.py and use the function to show the top genes in notebook 2.

The function should return the weighting for all genes. Leave the top_n decision to the user at display time.

@rdvelazquez
Copy link
Member Author

Thanks for looking at this @dhimmel!

I tried to make the function follow PEP8. I'm not too familiar with docstrings so I just tried to follow pep257 as best I could. I had never heard of flake8 before but it was really easy to use and pretty amazing. I just pulled out the function into a separate file and ran flake8 on the function. (I had lots of extra white space and lines >79 characters)

I also moved the files to a new directory in explore.

The function should return the weighting for all genes. Leave the top_n decision to the user at display time.

This is the way the function was originally written but it was just poorly named. I renamed it to reflect the fact that it returns all the gene coefficients not just for the top genes.

@rdvelazquez
Copy link
Member Author

This is now ready to review/merge.

Copy link
Member

@dhimmel dhimmel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this work @rdvelazquez.

Go ahead and open the PR to edit 2.mutation-classifier.ipynb. I'll review the code to a higher degree in that PR, since it will become part of the reference implementation.

@dhimmel dhimmel changed the title WIP - Adding Gene Expression Coefficients for Individual Genes Add Gene Expression Coefficients for Individual Genes Jul 14, 2017
@dhimmel dhimmel merged commit 5b6167b into cognoma:master Jul 14, 2017
@rdvelazquez rdvelazquez deleted the gene-coefficients branch July 16, 2017 03:02
@patrick-miller
Copy link
Member

patrick-miller commented Jul 16, 2017 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants