hello-clusters notebook: Perform and evaluate clustering #874

sjspielman · 2024-11-12T18:12:11Z

Closes #796

This PR adds the first notebook to the hello-clusters module. A first round of high-level review might be good to start with for comments on organization (including within the notebook, and the notebook's location itself), content, and scope. Or, go for a fuller review if you think it's reasonable enough already!
Here is the rendered notebook to help with review:
01_perform-evaluate-clustering.nb.html.zip

In addition to the notebook, I updated the module README and activated the module workflow for testing this notebook.

jashapiro

High level comments:

I think the overall content is fine here, but I found the integration of SCE and Seurat somewhat confusing/distracting. I think I would arrange it to do all of the content with the PCA matrix directly (assuming that I am correct about the organization), and then have separate sections about the considerations for input and output with SCE or Seurat objects.

My other main thought is that the statistics here are all kind of hard to interpret on their own. I would probably couch this notebook as a demonstration of the evaluation functions rather than an actual evaluation. The real evaluation would come with some comparisons among different clustering parameters, which I would expect in a later notebook.

I also want to quibble with your repeated strong recommendation of setting the seed for every function with a random component. Since R uses a global RNG, this can be a bit of a dangerous practice. It is often better to set the seed once in a notebook, rather than continuously resetting it. In this case it may not matter, but if there is any looping (for example, if bootstrapping and calculating statistics on each round) you can end up causing trouble.
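
To make the concern about resetting the seed concrete, here is a minimal base R sketch. The data and the loop are hypothetical and not taken from the notebook. Resetting the seed inside a loop makes every bootstrap round draw the same resample, while setting it once keeps the rounds distinct and the run as a whole still reproducible.

```r
# Hypothetical data: 50 values standing in for any per-cell statistic
set.seed(2024)
values <- rnorm(50)

# Problematic: the seed is reset in every round, so each "bootstrap"
# replicate resamples exactly the same indices
reset_means <- vapply(1:5, function(i) {
  set.seed(2024)
  mean(sample(values, replace = TRUE))
}, numeric(1))

# Safer: set the seed once, then let the global RNG advance
set.seed(2024)
single_seed_means <- vapply(1:5, function(i) {
  mean(sample(values, replace = TRUE))
}, numeric(1))

length(unique(reset_means))       # 1: all replicates are identical
length(unique(single_seed_means)) # greater than 1: replicates differ
```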

analyses/hello-clusters/01_perform-evaluate-clustering.Rmd

jashapiro · 2024-11-13T14:05:19Z

then have separate sections about the considerations for input and output with SCE or Seurat objects.

A secondary thought on this component is that I think we probably want to cover how to use existing cluster assignments (particularly for Seurat) for the silhouette width and purity. It seems likely that people will use default Seurat functions to calculate clusters and then may want to look at those statistics for the Seurat-calculated clusters. Similarly, people may want to look at the default clusters that our SCE objects include.
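
As a rough illustration of evaluating existing assignments rather than recalculating clusters, the sketch below computes silhouette widths for a vector of pre-existing labels. It is only a stand-in: it uses `cluster::silhouette()` on simulated data instead of the notebook's own functions, and the matrix, labels, and column names are all hypothetical.

```r
library(cluster) # recommended package distributed with R

set.seed(2024)
# Hypothetical stand-in for a PCA matrix: 60 cells x 5 PCs, two groups
pca_matrix <- rbind(
  matrix(rnorm(30 * 5, mean = 0), ncol = 5),
  matrix(rnorm(30 * 5, mean = 4), ncol = 5)
)
rownames(pca_matrix) <- paste0("cell_", seq_len(nrow(pca_matrix)))

# Hypothetical pre-existing assignments, e.g. as they might be pulled
# from an object's cell metadata; they are NOT recalculated here
existing_clusters <- rep(c(1L, 2L), each = 30)

# Silhouette width per cell for the existing assignments
sil <- silhouette(existing_clusters, dist(pca_matrix))
silhouette_df <- data.frame(
  cell_id = rownames(pca_matrix),
  cluster = factor(sil[, "cluster"]),
  silhouette_width = sil[, "sil_width"]
)

summary(silhouette_df$silhouette_width)
```

The key point is that the labels only need to be in the same cell order as the rows of the matrix; where they came from does not matter to the calculation.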

sjspielman · 2024-11-14T18:01:20Z

A secondary thought on this component is that I think we probably want to cover how to use existing cluster assignments (particularly for Seurat) for the silhouette width and purity.

Great call, incoming.

My other main thought is that the statistics here are all kind of hard to interpret on their own. I would probably couch this notebook as a demonstration of the evaluation functions rather than an actual evaluation.

I agree they are not really the most informative without a full evaluation/comparison. Would you suggest removing plots altogether here then and just focusing on the function usage?

sjspielman · 2024-11-15T19:45:44Z

This is now ready for another look! Changes broadly include:

Code now uses a pca matrix throughout, except for the new section towards the end that shows how to use an object
A section for evaluating existing cluster results from Seurat or the ScPCA ones as examples
The Seurat object used throughout the examples is now generated via a Seurat pipeline from the raw counts, which is more realistic for how contributors would be using a Seurat object (based on our experience so far). I figure in the future, we can replace the conversion code here with a function we add to rOpenScPCA for doing the conversion.
I pitched evaluation more as "calculating QC metrics" rather than evaluating per se

Here is the current version of the rendered notebook:
01_perform-evaluate-clustering.nb.html.zip

sjspielman · 2024-11-15T19:54:10Z

Small question here: I'm on the fence for keeping params$seed vs hardcoding a seed in there, which would be "visually more appealing" in the html. Do you have a thought?

jashapiro · 2024-11-15T19:55:10Z

Here is the current version of the rendered notebook:
01_perform-evaluate-clustering.nb.html.zip

This version doesn't have any results/plots in it, and the version in the repo is out of date. Can you update the rendered version in the repo?

jashapiro · 2024-11-15T19:56:35Z

Small question here: I'm on the fence for keeping params$seed vs hardcoding a seed in there, which would be "visually more appealing" in the html. Do you have a thought?

Hardcoding the seed here seems fine.

sjspielman · 2024-11-15T19:57:23Z

This version doesn't have any results/plots in it, and the version in the repo is out of date. Can you update the rendered version in the repo?

Boo, sorry, I'll regenerate.
That said, I did remove the plots which is tentatively what I took your review to mean (see #874 (comment)). But, can easily restore!

sjspielman · 2024-11-15T20:03:41Z

Better! 01_perform-evaluate-clustering.nb.html.zip

jashapiro

This looks pretty good.

I had a few relatively small comments, with the most recurrent one (I stopped after a couple) about printing out large tables of results, which I think we should probably avoid. I wasn't actually saying not to include plots at all; I do think they are useful for showing the range of each statistic. I would just keep the plotting code as simple as possible, which probably means not trying to include median lines, etc.

I also suggest moving all the Seurat content together, rather than building the object then abandoning it for a while. I'd also show the "using previous" results in that context; in a section where you are already working with Seurat or SCE objects.

Finally, I think we can simplify some of the end where you are adding results to an object to just show adding a single column; the renaming a table and joining seems like it is straying from the main goal of the notebook.
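
For reference, adding a single column to an SCE object could look roughly like the sketch below. The toy object, the `cluster_df` data frame, and the `cluster` column name are hypothetical; the only assumption about the real results is that they carry a cell identifier that can be matched to the object's column names.

```r
library(SingleCellExperiment)

# Hypothetical toy object: 10 genes x 6 cells
counts_mat <- matrix(
  rpois(60, lambda = 5),
  nrow = 10,
  dimnames = list(paste0("gene_", 1:10), paste0("cell_", 1:6))
)
sce <- SingleCellExperiment(assays = list(counts = counts_mat))

# Hypothetical clustering results, deliberately in a different cell order
cluster_df <- data.frame(
  cell_id = paste0("cell_", 6:1),
  cluster = factor(c(2, 2, 2, 1, 1, 1))
)

# Align results to the object's cell order, then add one colData column
cell_order <- match(colnames(sce), cluster_df$cell_id)
sce$cluster <- cluster_df$cluster[cell_order]

colData(sce)
```

Matching on the cell identifier before assigning avoids silently attaching labels to the wrong cells if the two orders differ.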

analyses/hello-clusters/01_perform-evaluate-clustering.Rmd

sjspielman · 2024-12-13T19:17:24Z

analyses/hello-clusters/01_perform-evaluate-clustering.Rmd

+# Convert to a Seurat object
+seurat_obj <- CreateSeuratObject(counts = counts(sce), assay = "RNA")



This needs to be updated with rOpenScPCA::sce_to_seurat() once AlexsLemonade/rOpenScPCA#15 goes in.
Also, move this down into the Seurat clusters section.

sjspielman · 2024-12-16T15:20:22Z

This has now been revived and refreshed based on reviews and is ready for another look! The code has generally been simplified and rearranged according to reviews, and I restored some plots (with a green fill that I accept might not survive).
01_perform-evaluate-clustering.nb.html.zip

Note that I do want to eventually link to the forthcoming Seurat module #945 in the part where I actually use sce_to_seurat(), but that will probably be a separate PR due to relative timing?

jashapiro

I think this looks good to go in. I will insist on changing colors only to make sure you are using different colors for silhouette, purity, and stability. I don't care what they are, as long as they are distinct.

jashapiro · 2024-12-16T21:07:21Z

analyses/hello-clusters/01_perform-evaluate-clustering.Rmd

+```{r violin purity}
+ggplot(purity_results) +
+  aes(x = cluster, y = purity) +
+  geom_violin(fill = "darkolivegreen3") +


I don't care much about color choice, but I do care that when you are plotting different metrics you should use different colors.

I do care that when you are plotting different metrics you should use different colors.

Now that I can do 🎨
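
One way to keep the fill colors distinct per metric is to define them once in a named vector and look them up when plotting. This is only a sketch on simulated data: the data frame, column names, helper function, and the specific colors are hypothetical, and stability is left out because its results have a different shape.

```r
library(ggplot2)

set.seed(2024)
# Hypothetical per-cell results; column names are illustrative only
metrics_df <- data.frame(
  cluster = factor(rep(1:3, each = 40)),
  silhouette_width = runif(120, min = -0.2, max = 0.8),
  purity = runif(120, min = 0.5, max = 1)
)

# One fill color per metric, defined in a single place
metric_fills <- c(
  silhouette_width = "darkmagenta",
  purity = "darkolivegreen3"
)

# Minimal violin plot for one metric, using that metric's fill color
plot_metric <- function(df, metric) {
  ggplot(df) +
    aes(x = cluster, y = .data[[metric]]) +
    geom_violin(fill = metric_fills[[metric]]) +
    labs(y = metric)
}

silhouette_plot <- plot_metric(metrics_df, "silhouette_width")
purity_plot <- plot_metric(metrics_df, "purity")
```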

jashapiro · 2024-12-16T21:15:45Z

analyses/hello-clusters/01_perform-evaluate-clustering.Rmd

+  algorithm = "louvain",
+  weighting = "jaccard",
+  nn = 20


I wonder if we want to show setting these programmatically, as that is what would really be best...

Suggested change

-  algorithm = "louvain",
-  weighting = "jaccard",
-  nn = 20
+  algorithm = tolower(metadata(sce)$cluster_algorithm),
+  weighting = tolower(metadata(sce)$cluster_weighting),
+  nn = metadata(sce)$cluster_nn

Looking at this, I think I would like it if the rOpenScPCA functions were case-agnostic. But then you can't use match.arg, which kind of sucks. So maybe not worth doing.

I feel fairly sure the answer here isn't "let's add another dependency!", but I did think this was useful to know about https://cran.r-project.org/web/packages/strex/vignettes/argument-matching.html
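
For what it is worth, one possible base R workaround is to lowercase the argument and pass the choices to `match.arg()` explicitly, at the cost of listing the choices twice. The sketch below is hypothetical: the function name and the set of choices are illustrative and do not reflect the actual rOpenScPCA implementation.

```r
# Hypothetical wrapper; the function name and the choices are illustrative
choose_algorithm <- function(algorithm = c("louvain", "walktrap", "leiden")) {
  choices <- c("louvain", "walktrap", "leiden")
  # Lowercase first, then match against explicitly supplied choices
  match.arg(tolower(algorithm), choices)
}

choose_algorithm("Louvain") # "louvain"
choose_algorithm("LEIDEN")  # "leiden"
choose_algorithm()          # "louvain" (the first choice is the default)
```

An unrecognized value such as `"kmeans"` still raises the usual `match.arg()` error.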

jashapiro · 2024-12-16T21:31:11Z

analyses/hello-clusters/01_perform-evaluate-clustering.Rmd

+# Print the clustering algorithm used
+metadata(sce)$cluster_algorithm
+
+# Print the weighting scheme
+metadata(sce)$cluster_weighting
+
+# Print the number of nearest neighbors
+metadata(sce)$cluster_nn


Just for printing reasons, maybe:

metadata(sce)[c("cluster_algorithm", "cluster_weighting", "cluster_nn")]

I don't love it but I think it might work if you had a sentence that explicitly stated what these variables mean.
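
A small sketch of what that subsetting returns, using a plain list as a stand-in for `metadata(sce)`. The values, and the extra `library_id` entry, are hypothetical.

```r
# Plain list standing in for metadata(sce); values are hypothetical
sce_metadata <- list(
  library_id = "example_library",
  cluster_algorithm = "Louvain",
  cluster_weighting = "Jaccard",
  cluster_nn = 20
)

# Subsetting with `[` keeps the names alongside the values when printing
param_names <- c("cluster_algorithm", "cluster_weighting", "cluster_nn")
cluster_params <- sce_metadata[param_names]
str(cluster_params)
```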

sjspielman added 10 commits November 8, 2024 15:44

WIP: began sketching out notebook to eval clustering

3b2cc56

Continued WIP: begin to flesh out sections, some reorg as it comes together

d6414e9

WIP: much progress. Code basically complete and most text complete

ecf1d09

final touchups to complete the first draft of this notebook, and left some TODO comments to come back to

aa3d207

update README

25778f2

add line to run this notebook, and chmod

bd59389

Turn on GHA on PRs, update data download, and run module script

7b61e9b

samples flag is plural

adc43bf

dont need repo name with renv::update

a6623e6

bump ropenscpca for calculate_stability usage

039ad4c

sjspielman requested a review from jaclyn-taroni as a code owner November 12, 2024 18:12

sjspielman removed the request for review from jaclyn-taroni November 12, 2024 18:12

sjspielman added 3 commits November 12, 2024 15:56

update ropenscpca

a262ffc

need igraph deps

a99e962

missing a quote. sad.

464fc2a

sjspielman requested a review from jashapiro November 12, 2024 21:55

jashapiro reviewed Nov 13, 2024

sjspielman added 3 commits November 13, 2024 15:21

Merge branch 'main' into sjspielman/796-hello-clusters-nb1

dab73e2

response to reviews: add parentheses for functions, just use backticks for pca names, and add missing wording

4f68ca9

one seed to rule them all, and fix yaml

0a7ed18

sjspielman added 7 commits November 14, 2024 14:50

WIP: rearranging notebook

26045f0

Continuing notebook reorg, added code for a seurat section for which glmGamPoi is now needed in renv

6534d6d

WIP: delete extra text, and dont do stability with seurat

7b9ea6c

Finish notebook rearrangement

15f4a24

Merge branch 'main' into sjspielman/796-hello-clusters-nb1

de4b1ce

fix header depth

32a401f

fix wording

a8c3f29

sjspielman requested a review from jashapiro November 15, 2024 19:45

sjspielman added 2 commits November 15, 2024 15:00

Add missing chunk name, and regenerate with script

cddab18

rm seed param and hardcode, and regenerate for real

3953290

jashapiro reviewed Nov 19, 2024

sjspielman commented Dec 13, 2024

sjspielman and others added 10 commits December 13, 2024 14:21

Apply suggestions from code review

a447bdf

Co-authored-by: Joshua Shapiro

Merge branch 'main' into sjspielman/796-hello-clusters-nb1

d037180

update renv to most recent ropenscpca

d222606

respond to reviews, and use rOpenScPCA:: throughout

ac1febf

rerender notebook

614c926

rearrange calculating qc metrics on existing clusters

66f651f

speeling and less printing

22e3670

Merge branch 'main' into sjspielman/796-hello-clusters-nb1

302d17e

add some simple plots with a color i like that may not survive review

b9f0579

simpler saving

34a2218

sjspielman requested a review from jashapiro December 16, 2024 15:20

jashapiro approved these changes Dec 16, 2024

sjspielman added 3 commits December 17, 2024 08:29

Merge branch 'main' into sjspielman/796-hello-clusters-nb1

a481799

moar colors

dad9669

update how metadata clustering params are presented, printed, and used

3996b35

sjspielman merged commit e351f58 into AlexsLemonade:main Dec 17, 2024
5 checks passed

sjspielman deleted the sjspielman/796-hello-clusters-nb1 branch December 17, 2024 15:53
