hello-clusters notebook: Perform and evaluate clustering #874
… some TODO comments to come back to
High level comments:
I think the overall content is fine here, but I found the integration of SCE and Seurat somewhat confusing/distracting. I think I would arrange it to do all of the content with the PCA matrix directly (assuming that I am correct about the organization), and then have separate sections about the considerations for input and output with SCE or Seurat objects.
My other main thought is that the statistics here are all kind of hard to interpret on their own. I would probably couch this notebook as a demonstration of the evaluation functions rather than an actual evaluation. The real evaluation would come with some comparisons among different clustering parameters, which I would expect in a later notebook.
I also want to quibble with your repeated strong recommendation of setting the seed for every function with a random component. Since R uses a global RNG, this can be a bit of a dangerous practice. It is often better to set the seed once in a notebook, rather than continuously resetting it. In this case it may not matter, but if there is any looping (for example, if bootstrapping and calculating statistics on each round) you can end up causing trouble.
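A minimal base R sketch of the seed concern above: because `set.seed()` resets the single global RNG, calling it inside a loop makes every iteration draw the same "random" sample. The toy data and object names are illustrative assumptions, not code from the notebook.

```r
# Toy data standing in for any per-cell statistic (illustrative only)
values <- c(2.1, 3.4, 1.8, 5.0, 4.2, 3.3, 2.7, 4.8)

# Problematic: resetting the seed inside the loop means every
# bootstrap replicate resamples exactly the same indices
means_reset <- vapply(1:5, function(i) {
  set.seed(2024)
  mean(sample(values, replace = TRUE))
}, numeric(1))

# Safer: set the seed once, then let the global RNG advance
set.seed(2024)
means_once <- vapply(1:5, function(i) {
  mean(sample(values, replace = TRUE))
}, numeric(1))

# Number of distinct replicate means under each approach
length(unique(means_reset))
length(unique(means_once))
```

With the seed reset each round, the five "replicates" collapse to a single value, so any spread estimated from them would be meaningless.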
A secondary thought on this component is that I think we probably want to cover how to use existing cluster assignments (particularly for Seurat) for the silhouette width and purity. It seems likely that people will use default Seurat functions to calculate clusters and then may want to look at those statistics for the Seurat-calculated clusters. Similarly, people may want to look at the default clusters that our SCE objects include.
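As a rough sketch of evaluating cluster assignments that already exist, the example below computes silhouette widths from a PCA-like matrix and a vector of pre-existing labels. It uses `cluster::silhouette()` as a stand-in, not the rOpenScPCA functions the notebook demonstrates, and the simulated matrix and labels are assumptions. In practice the labels would be pulled from the object, for example the `seurat_clusters` column that Seurat's `FindClusters()` adds.

```r
library(cluster) # ships with R as a recommended package

set.seed(2024)

# Simulated stand-in for a cells x PCs matrix: two well-separated groups
pca_matrix <- rbind(
  matrix(rnorm(50 * 5, mean = 0), ncol = 5),
  matrix(rnorm(50 * 5, mean = 6), ncol = 5)
)
rownames(pca_matrix) <- paste0("cell_", seq_len(nrow(pca_matrix)))

# Pre-existing cluster labels, as they might be pulled from an object
existing_clusters <- factor(rep(c("1", "2"), each = 50))

# silhouette() needs integer cluster codes and a distance object
sil <- silhouette(as.integer(existing_clusters), dist(pca_matrix))

silhouette_df <- data.frame(
  cell_id = rownames(pca_matrix),
  cluster = existing_clusters,
  silhouette_width = sil[, "sil_width"]
)

# Summarize per cluster rather than printing the full per-cell table
tapply(silhouette_df$silhouette_width, silhouette_df$cluster, median)
```

Neighborhood purity is not shown here; the same pattern of passing a matrix plus an existing label vector would apply.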
…s for pca names, and add missing wording
Great call, incoming.
I agree they are not really the most informative without a full evaluation/comparison. Would you suggest removing plots altogether here then and just focusing on the function usage?
…glmGamPoi is now needed in renv
This is now ready for another look! Changes broadly include:
Here is the current version of the rendered notebook:
Small question here: I'm on the fence for keeping
This version doesn't have any results/plots in it, and the version in the repo is out of date. Can you update the rendered version in the repo?
Hardcoding the seed here seems fine.
Boo, sorry, I'll regenerate.
This looks pretty good.
I had a few relatively small comments, with the most recurrent one (I stopped after a couple) about printing out large tables of results, which I think we should probably avoid. I wasn't actually saying not to include plots at all; I do think they are useful for showing the range of each statistic. I would just keep the plotting code as simple as possible, which probably means not trying to include median lines, etc.
I also suggest moving all the Seurat content together, rather than building the object then abandoning it for a while. I'd also show the "using previous" results in that context; in a section where you are already working with Seurat or SCE objects.
Finally, I think we can simplify some of the end where you are adding results to an object to just show adding a single column; the renaming a table and joining seems like it is straying from the main goal of the notebook.
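A small sketch of what "adding a single column" could look like, using plain data frames as stand-ins for per-cell metadata (such as the `colData` of an SCE) and for a results table. All names and values are hypothetical.

```r
# Data frame standing in for per-cell metadata (e.g. colData of an SCE)
cell_metadata <- data.frame(
  cell_id = c("cell_a", "cell_b", "cell_c"),
  cluster = factor(c(1, 1, 2))
)

# Hypothetical evaluation results, deliberately in a different row order
purity_results <- data.frame(
  cell_id = c("cell_c", "cell_a", "cell_b"),
  purity = c(0.91, 0.78, 0.85)
)

# Align by cell id, then add just the one column of interest
row_order <- match(cell_metadata$cell_id, purity_results$cell_id)
cell_metadata$purity <- purity_results$purity[row_order]

cell_metadata
```

Using `match()` keeps the step to one assignment while still guarding against the two tables being in different orders.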
# Convert to a Seurat object
seurat_obj <- CreateSeuratObject(counts = counts(sce), assay = "RNA")
This needs to be updated with rOpenScPCA::sce_to_seurat() once AlexsLemonade/rOpenScPCA#15 goes in.
Also, move this down into the Seurat clusters section.
Co-authored-by: Joshua Shapiro <[email protected]>
This has now been revived and refreshed based on reviews and is ready for another look! The code has generally been simplified and rearranged according to reviews, and I restored some plots (with a green fill that I accept might not survive). Note that I do want to eventually link to the forthcoming Seurat module #945 in the part where I actually use
I think this looks good to go in. I will insist on changing colors only to make sure you are using different colors for silhouette, purity, and stability. I don't care what they are, as long as they are distinct.
```{r violin purity}
ggplot(purity_results) +
aes(x = cluster, y = purity) +
geom_violin(fill = "darkolivegreen3") +
I don't care much about color choice, but I do care that when you are plotting different metrics you should use different colors.
I do care that when you are plotting different metrics you should use different colors.
Now that I can do 🎨
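One way to guarantee distinct colors per metric is to define the fills once in a named vector and look them up when plotting. The sketch below assumes ggplot2 is installed; the simulated data, column names, color choices, and the `plot_metric()` helper are all hypothetical.

```r
library(ggplot2)

set.seed(2024)

# Simulated per-cell results; column names are illustrative
results <- data.frame(
  cluster = factor(rep(1:3, each = 30)),
  silhouette_width = runif(90, min = -0.2, max = 0.8),
  purity = runif(90, min = 0.5, max = 1)
)

# One distinct fill per metric, defined in a single place.
# Stability would be plotted from its own results table.
metric_fills <- c(
  silhouette_width = "darkmagenta",
  purity = "darkolivegreen3",
  stability = "steelblue"
)

# Minimal violin plot of one metric, with no extra annotation layers
plot_metric <- function(df, metric) {
  ggplot(df) +
    aes(x = cluster, y = .data[[metric]]) +
    geom_violin(fill = metric_fills[[metric]]) +
    labs(y = metric)
}

silhouette_plot <- plot_metric(results, "silhouette_width")
purity_plot <- plot_metric(results, "purity")
```

Keeping the plot to a single `geom_violin()` layer also matches the earlier request to keep plotting code as simple as possible.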
algorithm = "louvain",
weighting = "jaccard",
nn = 20
I wonder if we want to show setting these programmatically, as that is what would really be best...
- algorithm = "louvain",
- weighting = "jaccard",
- nn = 20
+ algorithm = tolower(metadata(sce)$cluster_algorithm),
+ weighting = tolower(metadata(sce)$cluster_weighting),
+ nn = metadata(sce)$cluster_nn
Looking at this, I think I would like it if the rOpenScPCA functions were case-agnostic. But then you can't use match.arg, which kind of sucks. So maybe not worth doing.
I feel fairly sure the answer here isn't "let's add another dependency!", but I did think this was useful to know about https://cran.r-project.org/web/packages/strex/vignettes/argument-matching.html
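For what it's worth, one dependency-free possibility is to lowercase the value and pass it to `match.arg()` together with an explicit `choices` vector, since the one-argument form cannot do this on its own. This is only a sketch: `pick_algorithm()` and its list of choices are hypothetical and are not taken from the rOpenScPCA source.

```r
# Hypothetical wrapper; the choices listed here are illustrative only
pick_algorithm <- function(algorithm = c("louvain", "walktrap", "leiden")) {
  choices <- c("louvain", "walktrap", "leiden")
  # Lowercase the input first, then let match.arg() validate it
  match.arg(tolower(algorithm), choices)
}

pick_algorithm("Louvain")
pick_algorithm("LEIDEN")
pick_algorithm() # no argument: falls back to the first choice
```

The trade-off is that the choices are written twice (in the signature for documentation and in the body for matching), which may or may not be worth it.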
# Print the clustering algorithm used
metadata(sce)$cluster_algorithm

# Print the weighting scheme
metadata(sce)$cluster_weighting

# Print the number of nearest neighbors
metadata(sce)$cluster_nn
Just for printing reasons, maybe:
metadata(sce)[c("cluster_algorithm", "cluster_weighting", "cluster_nn")]
I don't love it but I think it might work if you had a sentence that explicitly stated what these variables mean.
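To illustrate the printing behavior, here is a sketch with a plain list standing in for `metadata(sce)`; the stored values and the extra field are hypothetical.

```r
# Plain list standing in for metadata(sce); values are hypothetical
sce_metadata <- list(
  cluster_algorithm = "louvain",
  cluster_weighting = "jaccard",
  cluster_nn = 20,
  other_field = "not shown"
)

# Single-bracket subsetting by name returns a smaller named list,
# so all three parameters print together from one expression
cluster_params <- sce_metadata[
  c("cluster_algorithm", "cluster_weighting", "cluster_nn")
]
cluster_params
```

Because the result is a named list, each value prints under its own `$name` header, which is why a sentence explaining the three names would help readers.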
Closes #796
This PR adds the first notebook to the hello-clusters module. A first round of high-level review might be good to start with for comments on organization (including within the notebook, and the notebook's location itself), content, and scope. Or, go for a fuller review if you think it's reasonable enough already!
Here is the rendered notebook to help with review:
01_perform-evaluate-clustering.nb.html.zip
In addition to the notebook, I updated the module README and activated the module workflow for testing this notebook.