-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
06 explore infercnv for Wilms tumor -06 #828
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left a few quick comments from skimming the code, but before I do a larger review here (though fyi, I think this will be fairly quick!), I wanted to ask about your next steps with i3
. Specifically, it seems to me that you should be able to re-use your existing infercnv script as well as this template notebook, but with a few more arguments, for both i3 and i6 models?. In that case, we'd want to tell the script which model to run as an input option to handle with optparse
, and we'd want to tell the notebook which model was run with a param
. I am less familiar with running infercnv than you are, so please tell me if you think this doesn't make sense!
If it does make sense, then as part of this PR, we should also take the opportunity to the code ready for that next step, including:
- Updating the infercnv script to take an argument for whether you're running i3 or i6, and update the output file names to reflect which model was run. You can choose how you want to handle that organization, as long as the file name indicates if it was i3 or i6.
- Renaming the current result files to have that i6 name in them - you don't need to re-run the script, just rename the files and sync back up with results. This will save a lot of time!!
- Adding a parameter to the rmarkdown file which will help determine which infercnv result file to read in, and make sure that the knitted HTML output file name will contain i3 or i6 too. You'd also want to state clearly in the notebook introduction if it's exploring i3 or i6 results.
I recommend making these changes now since it will affect code in this notebook, and it makes sense to do the script changes simultaneously. It will make this PR a bit larger, but I'm totally ok with that for this one :) Then, your next PR can be running the code with i3 and committing result notebooks. Again let me know if you think this plan doesn't actually make sense for how infercnv works!
analyses/cell-type-wilms-tumor-06/notebook_template/06_cnv_infercnv_exploration.Rmd
Outdated
Show resolved
Hide resolved
analyses/cell-type-wilms-tumor-06/notebook_template/06_cnv_infercnv_exploration.Rmd
Outdated
Show resolved
Hide resolved
analyses/cell-type-wilms-tumor-06/notebook_template/06_cnv_infercnv_exploration.Rmd
Outdated
Show resolved
Hide resolved
analyses/cell-type-wilms-tumor-06/notebook_template/06_cnv_infercnv_exploration.Rmd
Outdated
Show resolved
Hide resolved
analyses/cell-type-wilms-tumor-06/notebook_template/06_cnv_infercnv_exploration.Rmd
Outdated
Show resolved
Hide resolved
analyses/cell-type-wilms-tumor-06/notebook_template/06_cnv_infercnv_exploration.Rmd
Outdated
Show resolved
Hide resolved
Thank you @sjspielman for the review! The good news is that it took 2 hours to run |
so I will now
Hopefully done by tomorrow 😃 |
FYI, there's a neat trick in Rmarkdown that might help with this! Inside of the chunk definition backticks, you can use a keyword
When syncing your S3 results bucket, you might want to run the |
Hi @sjspielman , So I adapted the Results are uploaded to the s3 bucket researcher-008971640512-us-east-2. For the next step, after discussing with team members, I would like to try for the same five samples to:
If the results are comparable with the one obtained with their own normal cells ech, we could use this spike-in strategy for every sample in the dataset, as we need imho a normal reference, and it should be mixed as possible of cell types to avoid artifact due to cell type specific regions on some chromosome (17 for immune cells for example). Thank you 😃 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The i3/i6 code looks good and results are clearly generated, so that's good to see! I left a few small comments. As far as your next steps go, I have a few thoughts:
- First, which findings in particular to you hope to be able to see when combining "normal" cells across references, specifically in terms of validating this approach? Having some hypotheses here will be a helpful guide: What will results look like if it's a good strategy to combine normal cells across patients? What will the results look like if it's not a good strategy?
- Second, I'm not entirely sure how to think of the influence batch effects & patient variability for this analysis. I might expect different profiles of "normal" cells across patients just driven by biological variability, and it might not provide a good baseline and/or be very influenced by batch effects. That said, we probably don't want to be correct for batch effects data here since that will smooth out biological signal which I assume is needed for this kind of analysis (right?). This is another reason why it would be good to have a clear hypothesis in mind - if you have a particular trend you're looking for already, it will help interpret whether the merging strategy itself is reliable or if it's producing artifacts.
- You wrote in your notebook that you wonder if some endothelial cells are misclassified. Indeed, that's always a concern! Something we did not do is look in-depth at the scores for some of these classifications. I wonder if exploring the scores would be helpful for you to pick which cells to use as normal. I don't really know how these scores are calculated, but I assume that we might be more confident in cell labels if their scores are higher above some threshold? This might be worth checking - if many of the endothelial cells, for example, have low scores, perhaps that suggests they are not a reliable reference? This might also be something to discuss with your team.
Also, anytime you want, you can always feel free to open a Discussion to get feedback more generally for your next steps so other Data Lab team members and/or contributors have a chance to weigh in too :)
analyses/cell-type-wilms-tumor-06/notebook_template/06_cnv_infercnv_exploration.Rmd
Outdated
Show resolved
Hide resolved
analyses/cell-type-wilms-tumor-06/notebook_template/06_cnv_infercnv_exploration.Rmd
Outdated
Show resolved
Hide resolved
analyses/cell-type-wilms-tumor-06/notebook_template/06_cnv_infercnv_exploration.Rmd
Outdated
Show resolved
Hide resolved
analyses/cell-type-wilms-tumor-06/notebook_template/06_cnv_infercnv_exploration.Rmd
Show resolved
Hide resolved
### Heatmap of infercnv results | ||
|
||
Here we plot the output of `infercnv` as heatmaps of CNV. | ||
We first look at the png file generated by the `infercnv` function. | ||
We then used the `infercnv object` to look at mean CNV value across compartments (immune, endothelial, stroma and fetal nephron). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At first I was going to leave a review here that it's worth noting these results are generated without HMM. But then I wondered - are these results identical regardless of HMM, and specifying an HMM just gives you additional information on top of what running without would give you? Is that your experience with infercnv? If yes, then this simplifies how you can run 06_infercnv.R
. You would never need to run it without an HMM, since that would just duplicate the main CNV results. If no, and results are different, then why don't we show the heatmaps here for HMM results?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From my understanding, HMM prediction is an additional step happening after the standard inferCNV processing routines.
I manually checked the heatmaps 06_infercnv_{sample_id}_reference-both_HMM-no_heatmap.png
and 06_infercnv_{sample_id}_reference-both_HMM-i3_heatmap.png
, they look indeed identical.
The reason why I first ran infercnv
without the additional HMM CNV prediction model was to quickly compare the choice of the normal reference. I think the HMM model is not required to look at the variability between the CNV profiles when taking no, or immune and/or endothelial cells as normal reference.
From thee 4 heatmaps, I conclude that we need to use normal cells as reference. Else, the mean of expression is taking as the normal and the more abundant cells will thus look normal, as illustrate with the example in sample -179:
Here, fetal nephron, which we know/think contains cancer cells, looks the more normal. Also chr17 seems to be gained in immune and endothelial cells, which is an artefact due to the fact that chr17 encodes lot of chemokines secreted by the 2 cells types.
Then, taking the endothelial cells as reference, it seems that all compartments have a chr3 and chr5 loss.
This is however an artefact due to the fact that key endothelial genes are encoded in chr3 and chr5. This artefact is corrected when using both immune and endothelial cells in the reference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you would never need to run it without an HMM, since that would just duplicate the main CNV results.
I agree for the condition -- reference "both"
, I could only run it with the HMM model.
For the other --reference
conditions, I would however not spend too much time and resources calculating the HMM model.
Is that what you mean?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason why I first ran infercnv without the additional HMM CNV prediction model was to quickly compare the choice of the normal reference.
This makes total sense to me as the right place to start!
Is that what you mean?
Yes, this basically what I mean! At this point, we know that the results are the same with and without HMM. I also looked a bit through inferCNV's code, and it indeed seems that the HMM doesn't affect the other code. There is just an if
statement to add on the HMM, but it doesn't change the other calculations. Therefore, you never need to run without HMM. Even if this only saves a small amount of time and resources, especially for running in CI.
|
||
`infercnv` was run using the `06_inferCNV.R` script with and without a normal reference. We also tested the impact of the subselection of normal cells using either immune, and/or endothelial cells as healthy reference. | ||
|
||
In this notebook, we just want to compare the heatmaps of CNV profiles, and evaluate how comparable are the methods and how sensible they are to key parameters such as selection of healthy reference. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would also mention that you are exploring the use of the HMM, as well, and briefly explain the difference among "no", "i6",and "i3" conditions.
analyses/cell-type-wilms-tumor-06/scripts/explore-cnv-methods.R
Outdated
Show resolved
Hide resolved
analyses/cell-type-wilms-tumor-06/notebook_template/06_cnv_infercnv_exploration.Rmd
Outdated
Show resolved
Hide resolved
I 💯 % agree with your concern regarding the batch effect and patient variability and therefore I don't know if this approach will work. But this is the only strategy I can imagine so far for the patient with very few normal cells, as we showed that in the Wilms tumor dataset, I hope (🤞) that as the reference is then the mean of expression of all reference cells, batch and patient effect induced by merging the immune and endothelial cells from multiple patients will be smoothed while taking the mean. (would that make sense?)
I was thinking to the following strategy:
We were thinking that independently of the approach, the best validation would be to have a ground truth (SNP array?), or at least some information for some samples. I think looking at 1q in clinic for Wilms tumor patient should be a standard in the COG protocol. Getting this 1q status for some patients would allow us to check (even if not completely) our |
Hi @sjspielman , 🎯 1. Improve the For the latest 3rd objective, I generated the I hope these are not too much changes at once, I hope this won’t become too confusing 😕 … Do not hesitate to reach out to me if things need to be clarified! I needed to make point 2 and 3 clearer in my mind to better see where we go. I am now quite happy with the stage of this PR and will stop modifying before having your feedback. For the next steps now, I was thinking to.
This should cover quite some cells and leave very few unknown 🤞. Thank you!! |
Hi @maud-p, just a quick update - we're out of town at a conference this week, so I will be in and out of being able to review. I left some comments on your other PR looking at scores which mostly focus on the need to have more of a justification for why you have chosen 0.85. Once we have a threshold identified in that PR, we can follow up here on actually using that threshold. I've briefly looked at the changes to your inferCNV script, and looks mostly fine (but I haven't run it yet). I do recommend adding a bit more documentation for what "pull" means, since it is not obvious from any comments in the code yet. In terms of the external validation you mention, it seems potentially promising but I imagine it would be quite a bit of additional work that would not be able to happen before the cell type submission deadline. This might be something to follow up on after you generate your first cell type annotation table as a way to refine the analysis in the future - these modules can certainly be "living" analyses well into the future! |
Hi @sjspielman, thank you! I'll add few lines about the Regarding the deadline, I wanted to let you know that my deadline is the 29/10 as I will be traveling after this and won't be able to work/focus on the analysis the last 3 days... 😕 However, I think we are on a good track to have within the next weeks solid annotations and I would definitely like to continue working on it after the deadline. Also to then use the annotations to answer more specific questions 🚀 |
Hi @maud-p, I don't see any new commits to clarify the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've returned to this review by trying to run through the notebook code, and I encountered quite a few bugs running the code from a fresh R session so I was not able to run it. Here is some of what I observed:
- It appears you need the
ggdist
dependency for some of your plotting. To add this, you'll need to addlibrary(ggdist)
to the filecomponents/dependencies.R
, and then re-runrenv::snapshot()
to save it to the lockfile. - The
visualize_feature()
function is not working properly. It looks like you are using thefeatures
argument as two separate variables - as both a character vector of features to plot, and the plot title. When you use this function, the argument you are passing in corresponds to the plot title, not a vector of features to plot. Something here needs to be fixed, probably both with the function itself and how you call it. - I wasn't able to make most of the plots for the Seurat object from the
pull
reference condition, but I'm not 100% sure what's going on there - maybe something with how you constructed the Seurat object itself in the06b_build-normal_reference.R
is not consistent with the other Seurat objects, so code cannot be generalized?
Generally speaking, this notebook does have quite a bit of repeated code where you explore the same plots for each of the inferCNV results. It would be nice to firm this up to avoid future bugs. For this, you'd want to make a wrapper function to perform all the tasks you need for each result. All of the test you have explaining the plots can be added in a separate section that provides the overall context. On one hand, doing this may actually help you fix existing bugs! But on the other hand, I recognize you are short for time right now, so instead it might be best to file an issue to circle back to to improve the molecularity of this notebook. This way we'll have this fully tracked for later. But again, this may help you debug in the first place, so it could be worth doing now. Let me know if you want to chat about this more!!
Separate from the notebook, I'm confused about the result files you have here specifically. In the results
directory, you have files, e.g. for sample SCPCS000184
, which are not saved in the 06_infercnv
directory.
06_infercnv_HMM-i3_SCPCS000184_reference-both.rds
06_infercnv_HMM-i3_SCPCS000184_reference-pull.rds
06_infercnv_HMM-i6_SCPCS000184_reference-both.rds
It isn't immediately clear to me how these files differ from files in the 06_infercnv
folder, and further which ones are meant to be used when. More documentation will help here! It would be really helpful to document this in results/README
, and more importantly, in the 06_infercnv.R
script itself. At the top of that script, please explain which files it exports.
Let me know where I can clarify or help here!
analyses/cell-type-wilms-tumor-06/notebook_template/06_cnv_infercnv_exploration.Rmd
Outdated
Show resolved
Hide resolved
analyses/cell-type-wilms-tumor-06/notebook_template/06_cnv_infercnv_exploration.Rmd
Outdated
Show resolved
Hide resolved
analyses/cell-type-wilms-tumor-06/notebook_template/06_cnv_infercnv_exploration.Rmd
Outdated
Show resolved
Hide resolved
analyses/cell-type-wilms-tumor-06/notebook_template/06_cnv_infercnv_exploration.Rmd
Show resolved
Hide resolved
Hi @sjspielman , could be that I haven't pushed correctly all the modifications. Tbh I have a bit of troubles working on different branches. I'll try to correct this! In the meantime, I realized that I was actually meaning pool and not pull .... sorry for the confusion, my English is not the best... 😕 |
add ggdist dependencie
Let me know if I can help in any way! Happy to hop on Slack and chat, feel free to DM me there! Still a bit of time today and tomorrow :)
Ah this makes sense, but it's no problem! I was thinking it meant to "pull" annotations from one dataset to another, so it made sense to me this way too! |
Thank you! I try to take your comments one by one and keep going, I'll let you know if I get stuck! Thank you 😃 |
@sjspielman I am not sure to understand the issue here. As I am only using the |
@sjspielman the transfer to the s3 bucket just finished. |
It ran through for me with SCPCS000194! I'm going to redownload SCPCS000184 and try again with that sample, too. |
I figured out the issue! I had regenerated label transfer results locally with the current data release, but it appears that you used an older data release version for this analysis, and similarly the results in the bucket were probably generated a while ago. When I re-download the label transfer results (namely, the 02b Rds files) from the results bucket, the code works. Given that code works consistently with files generated from the same data release, this code should all fine. We in the Data Lab will address any issues on our end as we continue to work on updating the Azimuth code, but you are fine to continue here!! |
Thank you so much for your efforts!!! |
@maud-p Before I post my review here - please pull this code to update locally, since I merged in main to fix the conflict. Let me know if you have any issues after doing that! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks as always for all your hard work on this PR!! I have left a final round of comments, all very small changes. Before accepting any suggestions, please remember to pull first and make sure everything is ok. Then, you'll want to re-render the notebooks so they have the spacing changes, and after that I expect to approve :)
analyses/cell-type-wilms-tumor-06/scripts/06b_build-normal_reference.R
Outdated
Show resolved
Hide resolved
analyses/cell-type-wilms-tumor-06/notebook_template/06_cnv_infercnv_exploration.Rmd
Outdated
Show resolved
Hide resolved
I encounter some issue to pull, I have some conflicts. So I am cloning in a new folder the repo locally and will go from here, just take a bit of time ;) |
just to have it tracked somewhere, I did : |
Co-authored-by: Stephanie Spielman <[email protected]>
Co-authored-by: Stephanie Spielman <[email protected]>
Co-authored-by: Stephanie Spielman <[email protected]>
…erence.R Co-authored-by: Stephanie Spielman <[email protected]>
Co-authored-by: Stephanie Spielman <[email protected]>
…erence.R Co-authored-by: Stephanie Spielman <[email protected]>
…ercnv_exploration.Rmd Co-authored-by: Stephanie Spielman <[email protected]>
Please feel free to DM me on Slack if I can help resolve the conflicts! |
should be running :) |
Looks like conflicts are resolved! But, please still re-render the notebooks since there was a change there, and re-request my review when they are pushed :) Then, I expect this will be ready to merge! |
update html reports
@sjspielman thank you so much for the very productive day today!! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good!!!
Code in this PR does not (currently) run in CI, so we can merge this before the final check is complete since we know that |
Purpose/implementation Section
Please link to the GitHub issue that this pull request addresses.
This PR is following the discussion from the PR#776.
It explores the
infercnv
results generated in PR#801.Please link to the GitHub issue that this pull request addresses.
#790
What is the goal of this pull request?
We create a notebook template to explore the
infercnv
results generated using the script06_infercnv.R
Briefly describe the general approach you took to achieve this goal.
infercnv was run using the 06_inferCNV.R script with and without a normal reference. We thus tested the impact of the subselection of normal cells using either immune, and/or endothelial cells as healthy reference.
In this notebook, we want to compare the heatmaps of CNV profiles, and evaluate how comparable are the methods and how sensible they are to key parameters such as selection of healthy reference.
We also explore tools to better visualize and interprete CNV results.
If known, do you anticipate filing additional pull requests to complete this analysis module?
Yes, I think the next one will be to test to run
infercnv
with a HMM i3 model, check the running time (costs) and explore the results, in a similar way than in this PR. I would like to try to stick to the use of both endothelial and immune cells as normal cells for reference. As we don't know the behavior of the model in extreme case with only one-five normal cells, I would suggest to try for one sample to graduatly use 1, 5, 10, 50 nornal cells as reference and compare the result. I guess it would at least avoid to take the mean of expression as reference and smooth out CNV present in the majority of cells.Opinion very welcome 😃 🚀
Provide directions for reviewers
I think the present notebook could maybe be improved with the calculation of a general CNV score and see if this single score could help us identifying normal cells.
I am however not sure what would be the best way of calculating this score (as simple as the sum of all loss/dupli in each chromosome?)
What are the software and computational requirements needed to be able to run the code in this PR?
The script to generated the
infercnv
result required a lot of resource, but the notebook to explore the result do not require much 😄Author checklists
Analysis module and review
README.md
has been updated to reflect code changes in this pull request.Reproducibility checklist
Dockerfile
.environment.yml
file.renv.lock
file.