Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

hello-clusters notebook: Perform and evaluate clustering #874

Merged

Conversation

sjspielman
Copy link
Member

Closes #796

This PR adds the first notebook to the hello-clusters module. A first round of high-level review might be good to start with for comments on organization (including within the notebook, and the notebook's location itself), content, and scope. Or, go for a fuller review if you think it's reasonable enough already!
Here is the rendered notebook to help with review:
01_perform-evaluate-clustering.nb.html.zip

In addition to the notebook, I updated the module README and activated the module workflow for testing this notebook.

@sjspielman sjspielman removed the request for review from jaclyn-taroni November 12, 2024 18:12
@sjspielman sjspielman requested a review from jashapiro November 12, 2024 21:55
Copy link
Member

@jashapiro jashapiro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

High level comments:

I think the overall content is fine here, but I found the integration of SCE and Seurat somewhat confusing/distracting. I think I would arrange it to do all of the content with the PCA matrix directly (assuming that I am correct about the organization), and then have separate sections about the considerations for input and output with SCE or Seurat objects.

My other main thought is that the statistics here are all kind of hard to interpret on their own. I would probably couch this notebook as a demonstration of the evaluation functions rather than an actual evaluation. The real evaluation would come with some comparisons among different clustering parameters, which I would expect in a later notebook.

I also want to quibble with your repeated strong recommendation of setting the seed for every function with a random component. Since R uses a global RNG, this can be a bit of a dangerous practice. It is often better to set the seed once in a notebook, rather than continuously resetting it. In this case it may not matter, but if there is any looping (for example if bootstrapping and calculating calculating statistics on each round) you can end up causing trouble.

@jashapiro
Copy link
Member

then have separate sections about the considerations for input and output with SCE or Seurat objects.

A secondary thought on this component is that I think we probably want to cover how to use existing cluster assignments (particularly for Seurat) for the silhouette width and purity. It seems likely that people will use default Seurat functions to calculate clusters and then may want to look at those statistics for the Seurat-calculated clusters. Similarly, people may want to look at the default clusters that our SCE objects include.

@sjspielman
Copy link
Member Author

A secondary thought on this component is that I think we probably want to cover how to use existing cluster assignments (particularly for Seurat) for the silhouette width and purity.

Great call, incoming.

My other main thought is that the statistics here are all kind of hard to interpret on their own. I would probably couch this notebook as a demonstration of the evaluation functions rather than an actual evaluation.

I agree they are not really the most informative without a full evaluation/comparison. Would you suggest removing plots altogether here then and just focusing on the function usage?

@sjspielman
Copy link
Member Author

This is now ready for another look! Changes broadly include:

  • Code now uses a pca matrix throughout, except for the new section towards the end that shows how to use an object
  • A section for evaluating existing cluster results from Seurat or the ScPCA ones as examples
  • The Seurat object used throughout the examples is now generated via a Seurat pipeline from the raw counts, which is more realistic for how contributors would be using a Seurat object (based on our experience so far). I figure in the future, we can replace the conversion code here with a function we add to rOpenScPCA for doing the conversion.
  • I pitched evaluation more as "calculating QC metrics" rather than evaluating per se

Here is the current version of the rendered notebook:
01_perform-evaluate-clustering.nb.html.zip

@sjspielman sjspielman requested a review from jashapiro November 15, 2024 19:45
@sjspielman
Copy link
Member Author

Small question here: I'm on the fence for keeping params$seed vs hardcoding a seed in there, which would be "visually more appealing" in the html. Do you have a thought?

@jashapiro
Copy link
Member

Here is the current version of the rendered notebook:
01_perform-evaluate-clustering.nb.html.zip

This version doesn't have any results/plots in it, and the version in the repo is out of date. Can you update the rendered version in the repo?

@jashapiro
Copy link
Member

Small question here: I'm on the fence for keeping params$seed vs hardcoding a seed in there, which would be "visually more appealing" in the html. Do you have a thought?

Hardcoding the seed here seems fine.

@sjspielman
Copy link
Member Author

This version doesn't have any results/plots in it, and the version in the repo is out of date. Can you update the rendered version in the repo?

Boo, sorry, I'll regenerate.
That said, I did remove the plots which is tentatively what I took your review to mean (see #874 (comment)). But, can easily restore!

@sjspielman
Copy link
Member Author

Copy link
Member

@jashapiro jashapiro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks pretty good.

I had a few relatively small comments, with the most recurrent one (I stopped after a couple) about printing out large tables of results, which I think we should probably avoid. I wasn't actually saying not to include plots at all; I do think they are useful for showing the range of each statistic. I would just keep the plotting code as simple as possible, which probably means not trying to include median lines, etc.

I also suggest moving all the Seurat content together, rather than building the object then abandoning it for a while. I'd also show the "using previous" results in that context; in a section where you are already working with Seurat or SCE objects.

Finally, I think we can simplify some of the end where you are adding results to an object to just show adding a single column; the renaming a table and joining seems like it is straying from the main goal of the notebook.

analyses/hello-clusters/01_perform-evaluate-clustering.Rmd Outdated Show resolved Hide resolved
analyses/hello-clusters/01_perform-evaluate-clustering.Rmd Outdated Show resolved Hide resolved
analyses/hello-clusters/01_perform-evaluate-clustering.Rmd Outdated Show resolved Hide resolved
analyses/hello-clusters/01_perform-evaluate-clustering.Rmd Outdated Show resolved Hide resolved
analyses/hello-clusters/01_perform-evaluate-clustering.Rmd Outdated Show resolved Hide resolved
analyses/hello-clusters/01_perform-evaluate-clustering.Rmd Outdated Show resolved Hide resolved
analyses/hello-clusters/01_perform-evaluate-clustering.Rmd Outdated Show resolved Hide resolved
analyses/hello-clusters/01_perform-evaluate-clustering.Rmd Outdated Show resolved Hide resolved
analyses/hello-clusters/01_perform-evaluate-clustering.Rmd Outdated Show resolved Hide resolved
analyses/hello-clusters/01_perform-evaluate-clustering.Rmd Outdated Show resolved Hide resolved
Comment on lines 84 to 86
# Convert to a Seurat object
seurat_obj <- CreateSeuratObject(counts = counts(sce), assay = "RNA")

Copy link
Member Author

@sjspielman sjspielman Dec 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to be updated with rOpenScPCA::sce_to_seurat() once AlexsLemonade/rOpenScPCA#15 goes in.
Also, move this down into the Seurat clusters section.

@sjspielman
Copy link
Member Author

This has now be revived and refreshed based on reviews and ready for another look! The code has generally be simplified and rearranged according to reviews, and I restored some plots (with a green fill that I accept might not survive).
01_perform-evaluate-clustering.nb.html.zip

Note that I do want to eventually link to the forthcoming Seurat module #945 in the part where I actually use sce_to_seurat(), but that will probably be a separate PR due to relative timing?

@sjspielman sjspielman requested a review from jashapiro December 16, 2024 15:20
Copy link
Member

@jashapiro jashapiro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks good to go in. I will insist on changing colors only to make sure you are using different colors for silhouette, purity, and stability. I don't care what they are, as long as they are distinct.

```{r violin purity}
ggplot(purity_results) +
aes(x = cluster, y = purity) +
geom_violin(fill = "darkolivegreen3") +
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't care much about color choice, but I do care that when you are plotting different metrics you should use different colors.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do care that when you are plotting different metrics you should use different colors.

Now that I can do 🎨

Comment on lines 438 to 440
algorithm = "louvain",
weighting = "jaccard",
nn = 20
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we want to show setting these programmatically, as that is what would really be best...

Suggested change
algorithm = "louvain",
weighting = "jaccard",
nn = 20
algorithm = tolower(metadata(sce)$cluster_algorithm),
weighting = tolower(metadata(sce)$cluster_weighting),
nn = metadata(sce)$cluster_nn

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at this, I think I would like it if the rOpenScPCA functions were case-agnostic. But then you can't use match.arg, which kind of sucks. So maybe not worth doing.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel fairly sure the answer here isn't "let's add another dependency!", but I did think this was useful to know about https://cran.r-project.org/web/packages/strex/vignettes/argument-matching.html

Comment on lines 388 to 395
# Print the clustering algorithm used
metadata(sce)$cluster_algorithm

# Print the weighting scheme
metadata(sce)$cluster_weighting

# Print the number of nearest neighbors
metadata(sce)$cluster_nn
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just for printing reasons, maybe:

metadata(sce)[c("cluster_algorithm", "cluster_weighting", "cluster_nn")]

I don't love it but I think it might work if you had a sentence that explicitly stated what these variables mean.

@sjspielman sjspielman merged commit e351f58 into AlexsLemonade:main Dec 17, 2024
5 checks passed
@sjspielman sjspielman deleted the sjspielman/796-hello-clusters-nb1 branch December 17, 2024 15:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add notebook to demonstrate clustering with rOpenScPCA
2 participants