-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update clustering report to include leiden clustering #895
Update clustering report to include leiden clustering #895
Conversation
Just noting that a74fe36 is because I was seeing a failure in the CNV annotation workflow and CIT wasn't passing. After looking into it, InferCNV was now detecting annotations on chrKT. I made a change to just include chr1-22 rather for summarizing results but it's unrelated to other changes here. |
I've only looked at your opening comment so far, but I wanted to note before I dive into the code - the resolution parameter for CPM should probably be way lower, like a few orders of magnitude. I'd try it in the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code looks good to me overall! The main recommendation I have is, as I commented earlier, to play around with lower resolution values for CPM.
- Average cluster purity: This metric also evaluates cluster separation and tells us the proportion of neighboring cells that are assigned to the same cluster. | ||
Purity values range from 0-1 with higher purity values indicating clusters that are well separated. | ||
- Cluster stability: This evaluates how stable the clustering is to input data. | ||
Stability values range from 01- with higher values of cluster stability indicating more reproducible clusters. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Stability values range from 01- with higher values of cluster stability indicating more reproducible clusters. | |
Stability values range from 0-1 with higher values of cluster stability indicating more reproducible clusters. |
|
||
- Average silhouette width: This metric evaluates cluster separation. | ||
Cells with large positive silhouette widths are closer to other cells in the same cluster than to cells in different clusters. | ||
Higher values indicate tighter clusters. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
might note it goes [-1,1]
resolution = ~ glue::glue("{.}-res"))) + | ||
theme( | ||
aspect.ratio = 1, | ||
legend.position = "none" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a lot of UMAPs here and it's hard to see where one stops and the other starts in the grid. I think adding a border will help!
legend.position = "none" | |
legend.position = "none", | |
panel.border = element_rect(color = "black", fill = NA)) |
I went ahead and updated the resolution range used for leiden-CPM to test .001, .005, and .01. I tried .0001 and everything was just one cluster so that didn't seem helpful and above .01 things start to get assigned to all individual clusters for the most part. But I kept the plots grouped in the same way so there's one plot for louvain, one plot for leiden-cpm, and one plot for leiden-modularity for each metric. To me that makes the most sense rather than looking at two different objective functions with very different resolution ranges in the same plot. Note that I also updated the UMAPs to have borders and fixed an issue in the AUCell/SingleR workflow in d7ed51b. When generating reports for that workflow, there were samples where marker gene expression was now 0 (due to the new rounding), so plots couldn't be made. I just added a message that the plots are missing and accounted for libraries with no marker gene expression. Here's an updated report: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me! I left a few small comments, the main one being that it might make sense to use actual Rmd params for the algorithm parameters.
I realize that I don't think these clustering notebooks are being tested in CI right now. Remind me where we see these notebooks being slotted into the workflow eventually? It may be fine to not have in CI for this PR, but it should happen sooner rather than later, which is a thing I say from experience 🫠
|
||
- Objective function: CPM and modularity | ||
- Nearest neighbors: 5, 10, 15, 20, 25, 30, 35, and 40 | ||
- Resolution: 0.5, 1, 1.5 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needs the new values used for CPM
|
||
## Calculate clusters | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might define a chunk here for parameters, e.g. nn <- seq(5, 40, 5)
. Or we could actually make these actual Rmd params with the values you use here as defaults, which would allow some flexibility if it's needed when running this as a template across samples.
facet_grid(rows = vars(nn), | ||
cols = vars(resolution), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I might actually flip these so the overall look is more horizontal? You may also need to change the r chunk fig width/height.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since there are more nn
values used than resolution I think having a wider plot would be hard and make the UMAPs really small. So I'm not super inclined to change this.
@sjspielman I updated this so that the parameter values being tested are provided as params to the notebook.
The plan is for this to be a workflow that renders this report on all samples in the project (see #686). My next step was to create that workflow and then add running that workflow to CI. I didn't make the plot change only because it made some really small umaps that were hard to read. This should be ready for another look! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Purpose/implementation Section
Please link to the GitHub issue that this pull request addresses.
Closes #878
What is the goal of this pull request?
Here I am updating the clustering report in the Ewing's module to use the functions in
rOpenScPCA
. While doing that I also made some changes to the clustering parameters tested to look at both louvain and leiden clustering.Briefly describe the general approach you took to achieve this goal.
rOpenScPCA::sweep_clusters()
function. I varied the nearest neighbors (5-40 with increments of 5 as we were doing previously), the resolution (.5, 1, 1.5) and the objective function (CPM and modularity).rOpenScPCA
when it made sense, but mostly I still have the functions I wrote that get the results across a variety of parameters tested. I think I'm going to file issues to add this functionality torOpenScPCA
in the future.clustering-functions.R
script in this module.There are two other sections of this report that I ended up moving to a completely different report that I will update in a subsequent PR. These sections compare cluster assignments to cell type assignments and look at marker gene expression across clusters. I started by trying to create those plots across all clustering parameters tested but that resulted in a very long report and also a lot of plots that weren't very meaningful. From the metrics output it's pretty obvious which clustering is bad (looking at you leiden-CPM) and which parameters may be better. Because of that I am reframing this slightly so that the first report generates cluster results and metrics across all possible parameters. Then this report can be evaluated to identify an algorithm, nn, and resolution (or smaller range of nn and resolution) that could be used as input to the second report. This second report would help validate that these clusters are indeed separating by cell type and gene set as expected. This will also make it easy to move the first report with just the metrics here to the
hello-clusters
module since it is not dataset specific while the second report is a little more specific to the Ewing module.I wanted to keep the code for this second report so I really just copied over a lot of the existing code and noted a TODO about actually finishing the report next.
If known, do you anticipate filing additional pull requests to complete this analysis module?
Yes! Next up I'm going to do two things:
Results
What types of results does your code produce (e.g., table, figure)?
This is just a template report so no real results yet, but I'm including a rendered example report for easier review.
01-clustering-metrics.html.zip
For reference here's the previous version of the report:
01-clustering.html.zip
Provide directions for reviewers
What are the software and computational requirements needed to be able to run the code in this PR?
I was able to generate the report locally.
Are there particularly areas you'd like reviewers to have a close look at?
I think the main focus of the review here should be on two things:
01-clustering-metrics.Rmd
can be improved? Note that the only thing that changed from the previous plots is the faceting, but the content and type of plots were not changed.Is there anything that you want to discuss further?
Just another note that you will not need to review/ go line by line for
02-clustering-celltypes.Rmd
. This report contains the second half of the original clustering report just copied over and I have not made any other adjustments to the code. I plan on fully updating that report in a later PR.Author checklists
Analysis module and review
README.md
has been updated to reflect code changes in this pull request.Reproducibility checklist
Dockerfile
.environment.yml
file.renv.lock
file.