-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
celltype/tumor annotation for non-ETP T-ALL (SCPCP000003) #789
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @UTSouthwesternDSSR — thank you for this contribution! I am still reviewing the results but wanted to provide some initial feedback on the code.
To summarize my comments at a high level:
- Your module assumes a 1:1 relationship between sample and library identifiers, but that isn’t always true in ScPCA data. From what I saw of the ETP-ALL module, I believe you’d like to use this code for other sets of samples, so making this change for safety may make things easier later. I’ve made suggestions about using the library identifier as the primary ID, but I may not have caught every reference to a sample identifier.
analyses/cell-type-nonETP-ALL-03/Azimuth_BM_level1.csv
could have more descriptive column names that make it easier for someone looking at the code for the first time to understand what it contains.- I want to discuss whether or not using an Excel file is necessary and moving the gene set list generation outside of individual functions so you can run it once instead of multiple times every time you annotate a sample.
- We want to tie the scType code to a particular permalink of that repository’s code.
- It would be best to create the
results/rds
directory programmatically instead of assuming it already exists, which causes failures on other machines. - There is at least one package missing from the
renv.lock
file, which I think could be captured by changing thelibrary()
calls in the annotation script to no longer uselapply()
.
And then for science and display questions, as I’m still working through that part of the review:
- Is the same cell type label being assigned to all cells in a Leiden cluster?
- Would it be possible to combine the UMAP plots into one multi-panel plot with something like
patchwork
? If so, I think folks would have an easier time with the review and interpretation of your results. If this change will take a lot of time, we can skip it.
Thank you again for your contribution. Please let me know if you have any questions or if there’s any way we can help!
@@ -43,7 +43,7 @@ run_sam <- function(ind.sample, ind.library){ | |||
final.obj[["ADT"]]@data <- logcounts(altExp(sce, "adt")) | |||
final.obj[["ADT"]] <- AddMetaData(final.obj[["ADT"]], as.data.frame(rowData(altExp(sce, "adt")))) | |||
|
|||
saveRDS(final.obj, file.path(out_loc,"scratch",paste0(ind.sample,".rds"))) | |||
saveRDS(final.obj, file.path(out_loc,"results/rds",paste0(ind.sample,".rds"))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you need a directory creation step somewhere in this script now that you've changed the output location based on my local testing and potentially the CI error. That can be accomplished with:
dir.create(file.path(out_loc, "results/rds", recursive = TRUE, showWarnings = FALSE))
And will re-run without error or warning if it already exists
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In keeping with library ID as the primary identifier:
saveRDS(final.obj, file.path(out_loc,"results/rds",paste0(ind.sample,".rds"))) | |
saveRDS(final.obj, file.path(out_loc,"results/rds",paste0(ind.library,".rds"))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, for the ETP-ALL samples, Sample SCPCS000066 has two libraries, and I actually merge the two rds file without doing any batch effect correction. I write another function of run_multisam
for those with multiple libraries. It requires a little bit of manual code to provide input to that function, but it works perfectly fine for ETP-ALL samples, since there is only one case.
That's the reason why I thought saving as sampleID making more sense, because technically we should merge multiple libraries together if they belong to the same sample? Isn't this will make the analysis more powerful, since there are more cells added?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While it seems like the batch effect with that case is not bad, I would still lean toward leaving the two libraries separate. Having more cells may not always result in better results (especially if there were a batch effect), but that is largely an empirical question; I think we would want to evaluate what the tradeoffs are for different parts of the analysis.
To me, keeping the libraries separate seems like the more conservative (and simpler) approach, so it is the one I would start with.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I will remove the run_multisam
function from the ETP-ALL module script.
# load gene set preparation function | ||
#source("https://raw.githubusercontent.com/IanevskiAleksandr/sc-type/master/R/gene_sets_prepare.R") | ||
# load cell type annotation function | ||
source("https://raw.githubusercontent.com/IanevskiAleksandr/sc-type/master/R/sctype_score_.R") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We probably want to tie this to a specific commit in case any update that's not backward compatible with this code is pushed to master
. The following worked for me locally:
source("https://raw.githubusercontent.com/IanevskiAleksandr/sc-type/master/R/sctype_score_.R") | |
source("https://github.com/IanevskiAleksandr/sc-type/raw/6db9eef49f185cf4d79bfec92a20fcf1edcccafb/R/sctype_score_.R") |
#Based on the annotated B cells, we provide them as the normal cells for CopyKat to identify tumor cells. | ||
#The outputs are final rds file and two umap plots showing cell type and tumor/normal cells status. | ||
|
||
lapply(c("dplyr","Seurat","HGNChelper","openxlsx"), library, character.only = T) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think you'll need openxlsx
anymore, so I've removed it in the suggestion. HGNChelper
is not in the renv.lock
file and doesn't get captured when I tried renv::snapshot()
locally, so I think we need to remove the lapply()
here for that to work as expected:
lapply(c("dplyr","Seurat","HGNChelper","openxlsx"), library, character.only = T) | |
library(dplyr) | |
library(Seurat) | |
library(HGNChelper) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on my cursory reading of the annotation script, I understand that geneSymbolmore1
is a gene set up in a cell type, and geneSymbolmore2
is a gene set down in a cell type. Can we rename these columns to be more descriptive about their contents? For example, ensembl_id_up
and ensembl_id_down
. We want it to be easy for folks to understand if they are seeing the file for the first time.
I realize this could be because you get the table from Azimuth, but that's nicely documented in your README, so I'm not worried about differing from Azimuth too much.
@@ -42,7 +42,7 @@ The `scripts/00-01_processing_rds.R` requires the processed SingleCellExperiment | |||
../../download-results.py --projects SCPCP000003 --modules doublet-detection | |||
``` | |||
|
|||
As for the annotation, `scripts/02-03_annotation.R` requires cell type marker gene file, `Azimuth_BM_level1.csv`, as an input for ScType. This excel file contains a list of positive marker genes in Ensembl ID under `geneSymbolmore1` for each cell type, and *TMEM56* is not detected in our dataset, thus it is being removed as part of the markers for Late Eryth. As of now, there is no negative marker genes provided under `geneSymbolmore2`. | |||
As for the annotation, `scripts/02-03_annotation.R` requires cell type marker gene file, `Azimuth_BM_level1.csv`, as an input for ScType. This excel file contains a list of positive marker genes in Ensembl ID under `geneSymbolmore1` for each cell type; *TMEM56* and *CD235a* are not detected in our dataset, thus they are being removed as part of the markers for Late Eryth and Pre Eryth respectively. As of now, there is no negative marker genes provided under `geneSymbolmore2`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you take my other suggestion:
As for the annotation, `scripts/02-03_annotation.R` requires cell type marker gene file, `Azimuth_BM_level1.csv`, as an input for ScType. This excel file contains a list of positive marker genes in Ensembl ID under `geneSymbolmore1` for each cell type; *TMEM56* and *CD235a* are not detected in our dataset, thus they are being removed as part of the markers for Late Eryth and Pre Eryth respectively. As of now, there is no negative marker genes provided under `geneSymbolmore2`. | |
As for the annotation, `scripts/02-03_annotation.R` requires cell type marker gene file, `Azimuth_BM_level1.csv`, as an input for ScType. This excel file contains a list of positive marker genes in Ensembl ID under `ensembl_id_up` for each cell type; *TMEM56* and *CD235a* are not detected in our dataset, thus they are being removed as part of the markers for Late Eryth and Pre Eryth respectively. As of now, there is no negative marker genes provided under `ensembl_id_down`. |
But having read this, maybe ensembl_id_positive_marker
and ensembl_id_negative_marker
would be even better!
file = file.path(out_loc,"results",paste0(ind.sam,"_metadata.txt"))) | ||
final.obj$sctype_classification <- seu$sctype_classification | ||
final.obj$lowConfidence_annot <- seu$lowConfidence_annot | ||
saveRDS(final.obj, file = file.path(out_loc,"results/rds",paste0(ind.sam,".rds"))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
saveRDS(final.obj, file = file.path(out_loc,"results/rds",paste0(ind.sam,".rds"))) | |
saveRDS(final.obj, file = file.path(out_loc,"results/rds",paste0(ind.lib,".rds"))) |
cL_results <- do.call("rbind", lapply(unique([email protected]$leiden_clusters), function(cl){ | ||
es.max.cl = sort(rowSums(es.max[ ,rownames([email protected][[email protected]$leiden_clusters==cl, ])]), decreasing = !0) | ||
head(data.frame(cluster = cl, type = names(es.max.cl), scores = es.max.cl, ncells = sum([email protected]$leiden_clusters==cl)), 10) | ||
})) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add comments about what this code is doing please?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I have added the comment as follow:
merge by cluster [repetitively finding top 10 cell types with the highest total ScType score for each cluster]
es.max.cl - stores the descending value of total ScType score for each cell type in a particular cluster
The following line prints out the top 10 cell types with the highest total ScType score for a cluster, along with #cells in that cluster
which is then stored in cL_results (_sctype_top10_celltypes_perCluster.txt) - table with 4 columns (clusterID, cell type, total ScType score, #cells)
lapply(c("dplyr","Seurat","HGNChelper","openxlsx"), library, character.only = T) | ||
library(copykat) | ||
library(ggplot2) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You might consider setting the number of cores based on what machine you are on with something like:
n_cores <- parallel::detectCores() - 1
copykat.test <- copykat(rawmat=seu@assays[["RNA"]]@counts, id.type="Ensemble", | ||
ngene.chr=5, win.size=25, KS.cut=0.1, sam.name=ind.sam, | ||
distance="euclidean", norm.cell.names=norm.cells, | ||
output.seg="FALSE", plot.genes="TRUE", genome="hg20",n.cores=1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
output.seg="FALSE", plot.genes="TRUE", genome="hg20",n.cores=1) | |
output.seg="FALSE", plot.genes="TRUE", genome="hg20",n.cores=n_cores) |
seu$copykat.pred <- copykat.test$prediction$copykat.pred[idx] | ||
DimPlot(seu, reduction = "Xumap_", group.by = "copykat.pred") + | ||
ggtitle(paste0(ind.sam,": copykat prediction")) | ||
ggsave(file.path(out_loc,"plots",paste0(ind.sam,"_copykatPred.pdf")), width = 7, height = 7) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ggsave(file.path(out_loc,"plots",paste0(ind.sam,"_copykatPred.pdf")), width = 7, height = 7) | |
ggsave(file.path(out_loc,"plots",paste0(ind.lib,"_copykatPred.pdf")), width = 7, height = 7) |
|
I have uploaded the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the updates, @UTSouthwesternDSSR! This looks close to being ready to merge in. I was not clear when I asked for a multipanel plot – I was looking for all UMAPs from an individual library in one multipanel figure. I've added a comment to the multipanel_plot.R
file with something that I tested locally and it appeared to work.
I think with that change, we can remove the plotting of individual panels from the annotation script. I've made suggestions to that effect throughout.
I also had a question about whether or not we should be saving more output from CopyKAT, which I think is somewhat related to what I'll lay out below.
To respond to your question:
For those cells that are labeled as "Unknown", do we have to dig it more (as in a manual way) trying to identify their cell types?
We generally want annotations assigned as part of this project to be of higher quality than those we already include on the Single-cell Pediatric Cancer Atlas Portal. Because many cancer types exist in the ScPCA Portal and we thought it was important to supply users with a "first pass" at cell type labels to save a little time, we used two automated methods to assign cell types: SingleR and CellAssign. (More information about that in the ScPCA Portal docs: https://scpca.readthedocs.io/en/stable/processing_information.html#cell-type-annotation and more information about where to find those assignments here: https://scpca.readthedocs.io/en/stable/sce_file_contents.html#singlecellexperiment-cell-metrics and https://scpca.readthedocs.io/en/stable/sce_file_contents.html#anndata-cell-metrics)
So, we'd like to see some demonstration that the labels assigned are an improvement on the SingleR and CellAssign classifications we already have, i.e., more valuable to people who might want to use the data. One way to get an improvement is to pick more appropriate references – that's where something like the blast signature from the literature you've included comes in! At the very least, we want to understand how ScType assignments differ from existing ones.
To directly answer your question, yes, I might expect a complementary (and more manual) approach may be required to meet our goals.
When it comes to ScType assignments and the nature of how those are applied (i.e., to all cells in a cluster with >25% of cells in that cluster having the highest score as some cell type, as I understand it), it appears that the cluster assignments matter a lot. So, we'd probably want to know:
- Are the clusters relatively stable? Are they well-separated? You can see some of how we're thinking about cluster evaluation in a package being written in a feature branch here: https://github.com/AlexsLemonade/OpenScPCA-analysis/blob/e2c71af46616547846aa1fa816bbd5c409043ae1/packages/rOpenScPCA/R/evaluate-clusters.R
- Will ScType assign cell type labels even if the cluster assignments are not informative? What happens if an overwhelmingly dominant cell type exists in an individual sample with poorly separated clusters? An example of how to get to this question: if one were to shuffle the cluster labels, would you still get high-confidence assignments, and how frequently does that happen?
I would also like to know what cell types CopyKAT calls aneuploid, and it might be interesting to examine the differences in blast module scores grouped by CopyKAT calls.
We have had some worrisome experiences with CopyKAT in our hands (#537), but I'd expect performance to vary very much between cancer types!
I don't expect you to add anything further to this pull request unless you'd like to refactor anything to make it easier to get at these questions in a later pull request!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is my fault for not being specific! I was looking for all the UMAPs for an individual library in the same plot, so something like the following:
#!/usr/bin/env Rscript
#This script combines multiple UMAP plots from an individual library into one multi-panel plot
library(Seurat)
library(ggplot2)
multiplot <- function(annot.obj,
library.id,
ct.colors,
n.row = 2,
variables.to.plot = c("leiden_clusters",
"sctype_classification",
"lowConfidence_annot",
"copykat.pred")){
plot.list <- list()
for (plot.type in variables.to.plot){
if (plot.type %in% c("sctype_classification", "lowConfidence_annot")) {
clrs <- ct.colors
} else {
clrs <- NULL
}
tryCatch({
plot.list[[plot.type]] <- DimPlot(annot.obj,
reduction = "Xumap_",
group.by = plot.type,
label = T,
cols = clrs,
repel = T) +
NoLegend()
}, error=function(e){})
}
cowplot::plot_grid(plotlist = plot.list, nrow = n.row) +
cowplot::draw_figure_label(library.id,
position = "top",
size = 18,
fontface = "bold")
ggsave(file.path(out_loc,"plots",paste0("multipanels_",library.id,".png")),
width = 12,
height = 12,
bg = "white")
}
project_root <- rprojroot::find_root(rprojroot::is_git_root)
projectID <- "SCPCP000003"
out_loc <- file.path(project_root, "analyses/cell-type-nonETP-ALL-03")
data_loc <- file.path(project_root, "data/current",projectID)
metadata <- read.table(file.path(data_loc,"single_cell_metadata.tsv"), sep = "\t", header = T)
metadata <- metadata[which(metadata$scpca_project_id == projectID &
metadata$diagnosis == "Non-early T-cell precursor T-cell acute lymphoblastic leukemia"), ]
libraryID <- metadata$scpca_library_id
ct_color <- c("darkorchid","skyblue2","dodgerblue2","gold","beige","sienna1","green4","navy",
"chocolate4","red","darkred","#6A3D9A","maroon","yellow4","grey35","black","lightpink","grey80")
names(ct_color) <- c("B","CD4 T","CD8 T","DC","HSPC","Mono","NK","Other T","Macrophage",
"Early Eryth","Late Eryth","Plasma","Platelet","Stromal","Blast","Cancer","Pre Eryth","Unknown")
seu.list <- list()
for (lib_iter in 1:length(libraryID)){
seu.list[[lib_iter]] <- readRDS(file.path(out_loc,"results/rds",paste0(libraryID[lib_iter],".rds")))
names(seu.list)[lib_iter] <- libraryID[lib_iter]
}
purrr::walk2(
seu.list,
names(seu.list),
~ multiplot(annot.obj = .x,
library.id = .y,
ct.colors = ct_color)
)
p1 <- DimPlot(seu, reduction = "Xumap_", group.by = "sctype_classification", cols = ct_color) + | ||
ggtitle(paste0(ind.lib,": cell type")) | ||
p2 <- DimPlot(seu, reduction = "Xumap_", group.by = "lowConfidence_annot", cols = ct_color) + | ||
ggtitle("low confidence annotation") | ||
p1 + p2 | ||
ggsave(file.path(out_loc,"plots",paste0(ind.lib,"_celltype.png")), width = 10, height = 4, dpi = 150) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
p1 <- DimPlot(seu, reduction = "Xumap_", group.by = "sctype_classification", cols = ct_color) + | |
ggtitle(paste0(ind.lib,": cell type")) | |
p2 <- DimPlot(seu, reduction = "Xumap_", group.by = "lowConfidence_annot", cols = ct_color) + | |
ggtitle("low confidence annotation") | |
p1 + p2 | |
ggsave(file.path(out_loc,"plots",paste0(ind.lib,"_celltype.png")), width = 10, height = 4, dpi = 150) |
DimPlot(seu, reduction = "Xumap_", group.by = "copykat.pred") + | ||
ggtitle(paste0(ind.lib,": copykat prediction")) | ||
ggsave(file.path(out_loc,"plots",paste0(ind.lib,"_copykatPred.png")), width = 6, height = 6, dpi = 150) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DimPlot(seu, reduction = "Xumap_", group.by = "copykat.pred") + | |
ggtitle(paste0(ind.lib,": copykat prediction")) | |
ggsave(file.path(out_loc,"plots",paste0(ind.lib,"_copykatPred.png")), width = 6, height = 6, dpi = 150) |
ct_color <- c("darkorchid","skyblue2","dodgerblue2","gold","beige","sienna1","green4","navy", | ||
"chocolate4","red","darkred","#6A3D9A","maroon","yellow4","grey35","black","lightpink","grey80") | ||
names(ct_color) <- c("B","CD4 T","CD8 T","DC","HSPC","Mono","NK","Other T","Macrophage", | ||
"Early Eryth","Late Eryth","Plasma","Platelet","Stromal","Blast","Cancer","Pre Eryth","Unknown") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ct_color <- c("darkorchid","skyblue2","dodgerblue2","gold","beige","sienna1","green4","navy", | |
"chocolate4","red","darkred","#6A3D9A","maroon","yellow4","grey35","black","lightpink","grey80") | |
names(ct_color) <- c("B","CD4 T","CD8 T","DC","HSPC","Mono","NK","Other T","Macrophage", | |
"Early Eryth","Late Eryth","Plasma","Platelet","Stromal","Blast","Cancer","Pre Eryth","Unknown") |
norm.cells <- colnames(seu)[which(seu$sctype_classification=="B")] | ||
n_cores <- parallel::detectCores() - 1 | ||
if (length(norm.cells) > 0){ #the sample has B cells annotated | ||
copykat.test <- copykat(rawmat=seu@assays[["RNA"]]@counts, id.type="Ensemble", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there anything besides the final call from copyKAT that we want to save to use in a later analysis?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on their test output, I uploaded the _copykat_heatmap.jpeg
and _copykat_CNA_results.txt
under results/copykat_output
(in S3 bucket). I also added _copykat_CNA_raw_results_gene_by_cell.txt
. I am not sure if this file will be useful, but there is gene annotation.
I added the following lines in 02-03_annotation.R
dir.create(file.path(out_loc, "results/copykat_output"), showWarnings = FALSE)
setwd(file.path(out_loc, "results/copykat_output"))
write.table(data.frame(FetchData(seu, vars = voi)), sep = "\t", quote = F, | ||
file = file.path(out_loc,"results",paste0(ind.lib,"_metadata.txt"))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we save the cell identifiers as a column rather than rownames here? Perhaps using something like the following:
voi_df <- data.frame(FetchData(seu, vars = voi)) |>
tibble::rownames_to_column(var = "barcode")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I have made changes as following:
voi_df <- data.frame(FetchData(seu, vars = voi)) |> tibble::rownames_to_column(var = "barcode")
write.table(voi_df, sep = "\t", quote = F, row.names = F,
file = file.path(out_loc,"results",paste0(ind.lib,"_metadata.txt")))
# multipanel plots are not committed | ||
/plots/multipanels_* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These + the dot plot are the only ones I would commit. I've suggested a change to the annotation script to remove plotting of individual UMAPs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to clarify, do you mean to upload the multipanels_* to the repository? Even if I change the dpi = 150, the size of these mutipanels_*png ranges from 192kb to 676kb (only one file is below 200kb). Would it be fine if I save all of these multipanels plot as JPEG?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, good point! Yes, let's try JPEGs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even with JPEGs (dpi = 150), the range is from 320kb to 652kb
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you have any other ideas on how to make the size lower than 200kb?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We just merged in #792, so if you merge AlexsLemonade:main
into this branch, you should be able to commit the larger PNGs. Please let us know if that doesn't work as expected!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have merged it, but the multipanels_
png files do not show up in the right panel of GitKraken Desktop (the changes section). So I cannot commit them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you remove this line from the .gitignore
, do they show up in GitKraken?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it works! Thank you so much.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks ready to go in 👍🏻 Thank you for this contribution! Please let us know if you'd like to discuss the next steps. We are happy to correspond via Discussions or set up a short call if needed.
Purpose/implementation Section
The goal of this PR is to annotate cell types with ScType and identify tumor cells with CopyKat using the annotated B cells in the same sample (if there is any).
Please link to the GitHub issue that this pull request addresses.
What is the goal of this pull request?
Annotate cell types with ScType and identify tumor cells with CopyKat using the annotated B cells in the same sample (if there is any).
Briefly describe the general approach you took to achieve this goal.
If known, do you anticipate filing additional pull requests to complete this analysis module?
Out of 11 samples, 7 samples have B cells annotated (except SCPCS000091, SCPCS000092, SCPCS000098, and SCPCS000100). Although SCPCS000099 contains B cells, these cells are not useful for the identification of tumor cells, as shown by the extremely large number of not.defined cells with default parameters.
I may try to merge the samples, and use the annotated B cells as the normal cell to run CopyKat, hoping that we could identify tumor cells from those samples without B cells. This could be another pull request, if it works.
Results
What is the name of your results bucket on S3?
rds
objects are found ins3://researcher-650251722463-us-east-2/cell-type-nonETP-ALL-03/results/rds
s3://researcher-650251722463-us-east-2/cell-type-nonETP-ALL-03/results/
s3://researcher-650251722463-us-east-2/cell-type-nonETP-ALL-03/plots
What types of results does your code produce (e.g., table, figure)?
rds
objects_metadata.txt
(cell ID, leiden clusters, cell type annotation, low confidence cell type annotation, CopyKat prediction [for the 7 samples]) and_sctype_top10_celltypes_perCluster.txt
(top 10 possible cell types with their respective sctype score in each cluster)umap
plots showing leiden clustering, cell type, and CopyKat prediction respectivelydot
plots showing the average expression of group of markers for each cell type usingAddModuleScore()
What is your summary of the results?
With the default threshold of having sctype score > 25% of ncells in a cluster (
sctype_classification
), there are a large number of cells being annotated as "Unknown" in each sample, reaching to as high as ~45% (disregarding SCPCS000099: 48% and SCPCS000100: 52%, which have much lower number of features detected).Therefore, I relaxed the threshold from 25% to 10% (
lowConfidence_annot
), and the percentage of "Unknown" is now dropped to 30% (disregarding SCPCS000099: 33% and SCPCS000100: 48%), with 5 samples having no "Unknown".I only ran CopyKat for 7 samples, excluding SCPCS000091, SCPCS000092, SCPCS000098, and SCPCS000100. However,
tumor prediction does not work for SCPCS000099, given that it has 36% of "not.defined" cells, while the other samples have less than 1%. Even if I relaxed the cutoff of ngene.chr from 5 to 1, the number of "not.defined" cells decreases, but they all get annotated as "diploid".
Provide directions for reviewers
What are the software and computational requirements needed to be able to run the code in this PR?
renv.lock
andconda.lock
.Are there particularly areas you'd like reviewers to have a close look at?
Is there anything that you want to discuss further?
Thank you so much for any suggestions/feedback!
Author checklists
Check all those that apply.
Note that you may find it easier to check off these items after the pull request is actually filed.
Analysis module and review
README.md
has been updated to reflect code changes in this pull request.Reproducibility checklist
Dockerfile
.environment.yml
file.renv.lock
file.