-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Scripts to run copykat
and infercnv
for a subselection of Wilms tumor from SCPCP000006
#801
Conversation
Hi, |
Hi @maud-p , I'm looking into the Dockerfile issue, and it is indeed strange that it is failing now when it has previously built fine, since the packages you added do not appear related to the build failure. Can you add this line into your Dockerfile right after the line that installs
(by the way, while you are there, you can also update the line in the Dockerfile that installs Also, just as an FYI: I am going to be filing a different issue soon that will be relevant to your module. We have been trying to find ways to get the OpenScPCA test data to work with Azimuth, but it has proven challenging. So, instead, we're going to propose to update your module to entirely use Seurat for the label transfer step; we in the Data Lab will take care of writing that code and updating the module, and this should not interfere with your progress! We don't anticipate this will affect change results much since you're earlier exploration previously Azimuth and Seurat they were very similar. I will tag you in the issue I file, and we can discuss more there if you like :) |
Second comment: In terms of the "Run module" workflow that is failing, I am pretty sure that is a system dependency problem where the
Please let me know if you need help with this; I can also file a PR to your branch to make this change too! |
Wooo I am impressed, it works on my side, let's see how it is on github Action. Thank you so much!!!! |
Co-authored-by: Joshua Shapiro <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Note: we'll get this PR ready to go before moving onto #802, since #802 depends on this code. Once this PR goes in, you can then update #802 to be consistent with the changes made here, and then we'll look at that one.)
It's good to see that Docker is building now without errors! But, the reason I believe this is still not running in CI is because the data for input to copykat/infercnv is not generated in CI. This code therefore needs to also be inside an if
statement in 00_run_workflow.R
where it's only run if we are not testing. In addition, the notebooks processing that output don't exist in this PR, so that will cause an error too.
Since we'd also like to keep 00_run_workflow.R
as modular as possible, here's what I recommend:
- Create a new R script that contains all the new code you added to
00_run_workflow.R
, and call that script from00_run_workflow.R
(inside anif
statement since we don't want it to run in CI) - But, do not include the code that runs the notebooks. You can add it back in as part of the next PR when those notebooks actually get added to the repo.
- Note that I also included a comment to create a separate script to download the gene order file, and the step to run that script should be here too.
For the first round of review, I left some inline comments for the inferCNV script, and here is some feedback for the copyKat script. Some of the concepts here apply to the inferCNV code as well, so please read the comments here first and then the inline comments, and let me know if anything needs to be clarified!!
First, this script uses a ton of resources. You have a default of 16 cores and you are calling it with 32 is using quite a few resources (32 cpus). This needs to be documented somewhere. At this point, we should add a scripts/README.md
file that briefly explains what each script in there does as well as what resources is requires. In addition, resource requirements should be noted in the overall module README
file.
We also should nail down the file/directory naming more precisely. I suggest we save files as follows:
- It is fine to have nested directories within
results/{sample_id}/05_copykat
for the different conditions, but we still want the file names to communicate the content without needing to know what directory it's in. So, please name files something like:05_copykat_{sample_id}_{ref/noref}_{distance}.rds
- As part of updating this, please make sure that all code to create directories and file names is in the same place during setup steps (in other words, not later in the script when you actually run copyKAT). You will also want to update the results README to communicate this file scheme.
In addition, rather than relying on copyKAT's automatic saving, we should more explicitly save results of interest to files. We should add code to save the literal copyKAT object created to an RDS file. (That said, please let me know if you think I'm mistaken and we actually explicitly need the files that are automatically written out, vs just saving the full object.) This way, we will always be running code from the same directory, and we will have control over how and where files are saved.
I am hoping it's possible to do something similar for inferCNV too - save all results in results/06_inferCNV
, and rather than relying out automatic export, actually save the inferCNV
object to a file named 06_infercnv_{sample_id}_reference-{whatever the reference is}.rds
. Note that if the reference
is NULL
, you'll need to use a different string to communicate that since NULL
won't write as a string.
Co-authored-by: Stephanie Spielman <[email protected]>
Co-authored-by: Stephanie Spielman <[email protected]>
Co-authored-by: Stephanie Spielman <[email protected]>
Co-authored-by: Stephanie Spielman <[email protected]>
Co-authored-by: Stephanie Spielman <[email protected]>
Co-authored-by: Stephanie Spielman <[email protected]>
Co-authored-by: Stephanie Spielman <[email protected]>
Co-authored-by: Stephanie Spielman <[email protected]>
Hi @sjspielman , thank you for the review! I will commit soon the changes, mostly the scripts structure and file saving + README.md for the scripts and results folder. I checked that the script is running on few sample/condition, I will load these few and incomplete result on the s3 bucket so you can start assessing the code and output if you like. The Thank you! |
Also, for the output of |
changes to PR#801
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just had a quick peek at some of the changes you've made so far, looks on the right track!! Left a couple very small comments in the meantime :)
# change working directory of the script to the scratch directory | ||
# this ensures copykat files get saved to the right location | ||
# there is no option to specify an output directory when running copykat | ||
setwd(scratch_dir) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With full file paths defined, we shouldn't need to change directories here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure, because there is no option to specify an output directory when running copykat, and it will save automatically quite some files.
Or did I missed something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we are always going to copy files that we're saving, it shouldn't matter where we run the code from. For any files that copyKAT automatically saved that we don't need to keep, we can just delete those files. Are there any additional files copyKAT is saving that we are not copying into results
?
output_jpeg_ref <- file.path(result_dir, "05_copyKAT", "ref", opts$distance, glue::glue("05_copykat_",opts$sample_id,"_ref_distance-", opts$distance, "_copykat_heatmap.jpeg")) | ||
output_jpeg_noref <- file.path(result_dir, "05_copyKAT", "noref", opts$distance, glue::glue("05_copykat_",opts$sample_id,"_noref_distance-", opts$distance, "_copykat_heatmap.jpeg")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to save these jpeg files? One reason I ask if that our pre-commit hooks currently are only allowing PNG images. Either way, unless the results are getting specifically used, you do not have to save each individual file that copyKAT
outputs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted to save the heatmaps to have then a quick look at the outputs with the different parameters (choice of normal and distance), as done for the Ewing sarcoma analysis (lines 116 and 120)
What about converting to png and then save the png?
# We build the gene position file reference for infercnv ------------------------ | ||
system(command = glue::glue("Rscript ", file.path(module_base, "scripts", "06a_build-geneposition.R"))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's instead call this directly from the 00_run_workflow.R
script instead. This way, there are no "hidden" steps being called within scripts.
Hi @sjspielman thank you for the review! I would like to share with you some thoughs, as I am not sure what to do in which order. Now that the code is better organized, it is easy to run/change different parameter, change the saved output etc BUT it takes forever to run. What do you think if
Maybe one example to illustrate. I was trying to check what to do with I also read that using a subcluster analysis mode and HMM --> I definitely would love to explore these Also, I am really not convinced by
I am also not sure what result to upload to s3 bucket as I now ran different additional parameters (like HMM) that generated different outputs. Let me know what do you think :) Thank you! |
changes to PR801
changes to PR#801
Hi @sjspielman ,
For each of the 5 selected samples and each condition (reference and distance), we saved in
With the next PR, we will chose 1 r 2 methods that are the best to explore the data. I think that the CNV heatmap of Major point is that this I hope I am +/- clear, please do not hesitate to reach out if something isn't that clear 😄 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @maud-p, sorry for a bit of a pause here, I was out of the office last Friday. I think the code is about there, but just a few more comments when I think about next steps which I am now realizing; sorry for not thinking of this sooner! When we eventually run copykat and/or infercnv on more samples, we'll be calling the 05 & 06 scripts directly from the script that runs the workflow. The copykat script currently assumes that it should be run under two conditions (reference and no reference), but this will not necessarily match how we will use the script next. So, let's simplify it a bit: please include an option with optparse in that script that specifies whether or not it should use the normal reference (this is similar to what you do with infercnv). This means your script will define only 1 set of output file names (which can be named depending on if using the reference is true or false), and it will only need to run copykat one time which you can do a little more simply using the ifelse
function directly in the code, for example...
copykat.result<- copykat(rawmat=exp.rawdata,
sam.name=opts$sample_id,
distance=opts$distance,
# Use ifelse to only provide normal cell names if reference is true
norm.cell.names=ifelse(opt$use_reference, normal_cell, ""),
genome="hg20",
n.cores= opts$n_core,
id.type = "E",
plot.genes = FALSE,
output.seg = FALSE,
KS.cut = 0.05)
To define file names, you can have only one if
statement and then populate everything accordingly, for example:
if (opt$use_reference) {
ref_string <- "ref"
} else {
ref_string <- "noref"
}
Then, use the ref_string
variable when defining file names.
For now, do not worry about re-running the code after making these changes, since it will take a long time. We can take care of assuring the code works as expected on our side when we go to figure the test data situation with Azimuth (still forthcoming, stay tuned!).
EDIT: note that later in discussion here we decided to leave this code as is, since we almost certainly will not be using copyKAT further.
I am now going to start looking carefully at your results and will come back and comment on that soon. First, let me just say - I don't think it's going to be worth using the HMM with the very long run times you mention, and it may start to get expensive. What I recommend doing is noting in comments in the script that the HMM is also an option, but has a very long run time so it was not chosen. This way in the future, someone can come back to try it later - the module isn't going anywhere, and it can always be further explored in the future! But for now, I don't want that run time to hold you up further.
Also, fyi - I tagged you in a separate discussion post (#808) which another OpenScPCA contributor who is working on a different set of Wilm's samples opened to discuss using CNV methods. There may be some useful insights for you there, as well as insights you might like to share based on their exploration. If you'd like to join in that discussion, please feel free!
# change working directory of the script to the scratch directory | ||
# this ensures copykat files get saved to the right location | ||
# there is no option to specify an output directory when running copykat | ||
setwd(scratch_dir) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we are always going to copy files that we're saving, it shouldn't matter where we run the code from. For any files that copyKAT automatically saved that we don't need to keep, we can just delete those files. Are there any additional files copyKAT is saving that we are not copying into results
?
fs::file_copy(scratch_prediction, output_prediction_noref, overwrite = TRUE) | ||
fs::file_copy(scratch_CNA, output_CNA_noref, overwrite = TRUE) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rather than file_copy
, we probably want file_move
- this way there will not still be an original copy of the file left behind.
fs::file_copy(scratch_prediction, output_prediction_noref, overwrite = TRUE) | |
fs::file_copy(scratch_CNA, output_CNA_noref, overwrite = TRUE) | |
fs::file_move(scratch_prediction, output_prediction_noref, overwrite = TRUE) | |
fs::file_move(scratch_CNA, output_CNA_noref, overwrite = TRUE) |
fs::file_copy(scratch_prediction, output_prediction_ref, overwrite = TRUE) | ||
fs::file_copy(scratch_CNA, output_CNA_ref, overwrite = TRUE) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fs::file_copy(scratch_prediction, output_prediction_ref, overwrite = TRUE) | |
fs::file_copy(scratch_CNA, output_CNA_ref, overwrite = TRUE) | |
fs::file_move(scratch_prediction, output_prediction_ref, overwrite = TRUE) | |
fs::file_move(scratch_CNA, output_CNA_ref, overwrite = TRUE) |
|
||
# We only run the CNV HMM prediction model if the reference is "both" | ||
HMM_logical = FALSE | ||
if(opts$reference == "both){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if(opts$reference == "both){ | |
if(opts$reference == "both"){ |
# We only run the CNV HMM prediction model if the reference is "both" | ||
HMM_logical = FALSE | ||
if(opts$reference == "both){ | ||
HMM_logical = TRUE |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
HMM_logical = TRUE | |
HMM_logical <- TRUE |
analyses/cell-type-wilms-tumor-06/scripts/06a_build-geneposition.R
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's actually rename this to explore-cnv-methods.R
, since if/when we run cnv methods on other samples, we'll probably run the other scripts directly since we'll be looping over samples from the module run script.
@maud-p - So, I just left you a review about making the copyKAT script more flexible, and since then I've started to look more in-depth into the actual copyKAT results. I am seeing radically different results among conditions, so I am very hesitant to trust any of these results. The heatmaps do not appear to show a strong visual signal, so I went ahead and made some quick comparisons across inferences. Below, I have pasted some contingency tables comparing predictions between conditions. Here tables comparing results from SPEARMAN distance between reference and no reference, where each table is a sample:
Here's how predictions look between spearman and euclidean WITH NO reference:
Finally, predictions between spearman and euclidean WITH reference:
While it is clear that all parameter settings here agree on which cells not to classify, there is serious disagreement over which cells are diploid vs. aneuploid. Given that there is such little consistency across conditions, I do not think we should proceed with |
Dear @sjspielman, Thank you very much for looking into it! I didn't wanted to influence and overconclude too rapidly but I definitly think that we shouldn't spend too much time on Looking at the CNV heatmaps, there is no obvious CNV and I don't see how We also ran internally I am much more happier with the I am not sure if we can really avoid working with HMM, I don't know if we can extract Would it be an alternative option to downsample the One though: looking at As you suggested, I let the copykat a bit on the side, reports generated for the next PR show that we can't conclude much fro it. We can keep them as an info that Thank you! |
…on.R Co-authored-by: Stephanie Spielman <[email protected]>
I think this could be an interesting alternative approach, but in order for it to be feasible and reliable, there would be quite a few more things we'd need to do. First, if clusters will truly form the basis of the results themselves, we'd want to revisit clustering in the first place to ensure they are "optimal." The quality of clusters we have was not rigorously explored, including a systematic varying of clustering parameters. (This is actually something we think about in the Data Lab a lot, since it's really hard to choose the "best" way to build clusters. In fact, I'm actually in the process of writing an analysis module that demonstrates how to use functions in package we wrote for OpenScPCA-analysis to thoroughly explore cluster quality to identify a more reliable clustering result.) Next, the success of that analysis would also depend on infercnv's power to detect signal from smaller datasets, as well as how much dataset size influences results, so we'd possibly want to do some of bootstrapping with infercnv to get reliable cluster-wide results. Plus, how much we can generalize those results will also depend on how "tight" clusters are - some clusters will be more heterogeneous than others, and those annotations would be much less reliable compared to very compact clusters where cells are more similar. All this is to say - it's definitely a very interesting idea to explore, but I think it's actually a much bigger analysis! It might be something to pursue later in general, but I don't think right now is the time for it. I did have a look at the inferCNV heatmaps, and I agree they do look a bit better than copyKAT, but I also I'm not sure how this will work out across the whole cohort since not every sample will have a set of normal cells to use as reference. Further, as I mentioned before in the context of copyKAT, having agreement among different conditions is a good way to evaluate how reliable a method is, and if we get really different results for the same sample run with and without a reference, that makes me a bit worried. I have not personally run inferCNV unfortunately, so I don't have many specific suggestions there, but I do see that their wiki says their 3-state model is similar to a different method Either way, let's go ahead and get this code in and move onto the next notebooks where we can explore more carefully to "officially" rule out copyKAT, and make decisions about inferCNV. For that, please go ahead and accept the suggestions I left, and also rename that script as I commented; being called "run_cnv.R" implies that it's the main script that will run CNV across samples, but it's really just meant to explore the methods to get a sense of what we should run across samples. Then, I'll approve this PR and get it merged, so that the other branch with your notebooks can be updated. Once that PR is updated so its tests pass, we'll start that review! |
changes to PR#801
I should have done the requested changes, hopefully I haven't mess up things (I somehow loosed track of my branches 😖 ) Once this is merged, I'd like to (in separated PR)
I think that the predicted cell types from the kidney atlas will be the most informative. Just thinking, when comparing fetal with a mature kidney. I am not 100% sure but I would say a kidney from a borned child should be closer to mature, while Wilms tumor is known to be more resembling to fetal kidney. When we look at the composition of mature versus fetal kidney: If we now look at more differentiated and functional units of kidney, they are present in different proportion in embryonic and mature kidney. Cells annotated as "kidney cells" or "podocytes" can be either cancer epithelial cells or normal kidney cells.
I'll stop here the examples but I have the impression that combining info that we already have might be helpful 😄 I'll try to generate a notebook template to explore this better. |
Thank you for sharing, I don't know about it. I had a quick look and I am note sure if we wouldn't need the bam files to run |
Good to know, I'll keep it in mind then, we can re-talk about it later if needed! Thank you for sharing your thoughts on it 😃 |
Yeah, i3 would be worth trying both for speed reasons, and for "data-related" reasons. Since we expect some, but not a huge amount, of CNV in this data, I worry that an i6 model might be too parameter-rich for the level of signal in the data. The i3 model might be a better fit for the amount of information our data has. But, we'll think about that in the next PR :) |
Purpose/implementation Section
Please link to the GitHub issue that this pull request addresses.
We follow the issue 790 and the PR 776
What is the goal of this pull request?
On a #776 (comment), I want to try to infer aneuploidy and/or CNV to help identifying normal and cancer cells.
For this, I compare the use of copykat and infercnv. For each of the method, I wanted to compare few parameters.
copykat
In copykat, few parameters can be used to fine-tuned the results. Especially, we can try running copykat with or without a set of normal cells. It is important to note that , CopyKAT had difficulty in predicting tumor and normal cells in the cases of pediatric and liquid tumors that have a few CNAs. CopyKAT provides two ways to bypass this to give certain output instead of being dead staright: 1) input a vector of cell names of known normal cells from the same dataset 2) or try to search for T cells. (see copykat). I thus tested both with and without a reference but I am quite convinced that giving few normal cells help the function.
One parameter I also wanted to test is the clustering method. In copykat, parameters for clustering include "euclidean" distance and correlational distance, ie. 1-"pearson" and "spearman" similarity. In general, corretional distances tend to favor noisy data, while euclidean distance tends to favor data with larger CN segments. I thus tested eucliedean and spearman. In our dataset, I think euclidean (default) is perfoming best.
infercnv
Another way of inferring copy number alterations from tumor single cell RNA-Seq data is using infercnv. In a previous discussion, we were not sure about the impact of the definition of the heatly reference. For that reason, I ran infercvn with no normal cells as reference or immune and/or endothelial cells as reference.
Briefly describe the general approach you took to achieve this goal.
In this PR, we run the scripts
05_copyKAT.R
and06_infercnv.R
for 5 samples.We selected previously 5 samples to test for these parameters:
This is generated outputs that are stores in
results
as described in theresults\READ.ME
If known, do you anticipate filing additional pull requests to complete this analysis module?
Results
For each of the samples, we ran
05_copyKAT.R
script, defining thedistance
parameters.The outputs of
copykat
are saved in theresults\{sample_id}\05_copykat
folder.For each of the samples, we ran
06_infercnv.R
script, defining either no (none) normal cells as reference, or immune and/or endothelial cells.The outputs of
infercnv
are saved in theresults\{sample_id}\06_infercnv
folder.What is the name of your results bucket on S3?
researcher-008971640512-us-east-2
What types of results does your code produce (e.g., table, figure)?
For now, we mostly look at the heatmap output that is generated and saved into jpeg file.
However, for later analysis, we might need one of the table generated. For that reason, I kept every input generated so far.
We can then decide what is useful and what can be removed in a second step.
What is your summary of the results?
See next PR, the results will be explored to compare the different methods.
Provide directions for reviewers
What are the software and computational requirements needed to be able to run the code in this PR?
Are there particularly areas you'd like reviewers to have a close look at?
Is there anything that you want to discuss further?
Author checklists
Analysis module and review
README.md
has been updated to reflect code changes in this pull request.[?] Maybe I'll update the
README.md
once we chose 1 or 2 methods for CNV inference?Reproducibility checklist
Dockerfile
.environment.yml
file.renv.lock
file.