Add workflow for evaluating clustering to Ewing's module #908

allyhawkins · 2024-11-22T17:42:35Z

Purpose/implementation Section

Please link to the GitHub issue that this pull request addresses.

Closes #897
Closes #686

What is the goal of this pull request?

This PR takes the notebook that was modified in #895 and moves it into a workflow that can be run to perform and evaluate clustering across all samples in the Ewing's project (SCPCP000015). The workflow includes two steps:

1. Run clustering across a set of parameters.
2. Render a report summarizing the clustering metrics across all parameters tested, generating one report per library.

Briefly describe the general approach you took to achieve this goal.

I first moved the actual clustering out of the notebook and into its own script. The script now takes an SCE object and the parameter values to test, and outputs a TSV file with the cluster results from all parameters tested. In the script I added an argument for each of louvain, leiden-cpm, and leiden-modularity to specify whether to run it, but I could also be convinced to remove those arguments and just run all three by default?
The notebook 01-clustering-metrics.Rmd now reads in this TSV file and creates the plots looking at individual metrics.
I added an evaluate-clusters.sh workflow, added it to CI, and tested it locally with the test data. This workflow runs the clustering script and renders the clustering report for all samples in the project.
I updated the necessary README files, including the main README, to describe what is currently in the workflow.
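
As a rough illustration of the first step, the sketch below shows one way a script could turn comma-separated parameter strings into a grid of clustering runs. This is not the code from 01-clustering.R: the helper name `parse_param_list`, the `resolution` parameter, the algorithm labels, and all values are assumptions made for illustration, and the clustering call itself is left as a comment.

```r
# Hypothetical helper (not from the PR): convert "5,10,15" into c(5, 10, 15)
parse_param_list <- function(param) {
  values <- suppressWarnings(
    as.numeric(strsplit(param, ",", fixed = TRUE)[[1]])
  )
  if (anyNA(values)) {
    stop("All parameter values must be numeric: ", param)
  }
  values
}

# Illustrative values only
nn_values <- parse_param_list("5,10,15")
resolution_values <- parse_param_list("0.5,1")

# One row per combination of algorithm and parameter values
param_grid <- expand.grid(
  algorithm = c("louvain", "leiden"),
  nn = nn_values,
  resolution = resolution_values,
  stringsAsFactors = FALSE
)

# Each row would then drive one clustering run on the SCE object,
# and the per-run results would be combined into a single TSV file.
print(nrow(param_grid))
```

Building the grid up front keeps the "which combinations to test" logic separate from the clustering itself, which is one way to handle parameters that interact with each other.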

If known, do you anticipate filing additional pull requests to complete this analysis module?

Yes. Next I am going to work on the second notebook in the workflow (see #896). This notebook will take the clustering results TSV and a chosen set of parameters, and compare the resulting clusters to cell types and marker gene set scores. I'm still thinking about exactly how this notebook will look and where it will fit into the workflow, but the idea is to look at the rendered metrics report, choose a set of parameters to narrow in on, and then look at those clusters in a biological context.

Provide directions for reviewers

What are the software and computational requirements needed to be able to run the code in this PR?

Should be able to run this locally without any issues.

Are there particular areas you'd like reviewers to have a close look at?

Just a note that the workflow itself has the exact same logic as aucell-singler-annotation.sh and shares a lot of the same code. The biggest code change that needs review is the clustering script, since that is brand new.

Author checklists

Analysis module and review

This analysis module uses the analysis template and has the expected directory structure.
The analysis module README.md has been updated to reflect code changes in this pull request.
The analytical code is documented and contains comments.
Any results and/or plots this code produces have been added to your S3 bucket for review.

Reproducibility checklist

Code in this pull request has been added to the GitHub Action workflow that runs this module.
The dependencies required to run the code in this pull request have been added to the analysis module Dockerfile.
If applicable, the dependencies required to run the code in this pull request have been added to the analysis module conda environment.yml file.
If applicable, R package dependencies required to run the code in this pull request have been added to the analysis module renv.lock file.

sjspielman

I did a quick end-of-Friday review, but I plan to take a more careful look next Monday! I think it's probably all there now as is, but I don't 100% trust my brain at the moment to be sure of that 🥴

In the script I put in arguments for each louvain, leiden-cpm, and leiden-modularity to specify running each of them, but I could also be convinced to remove those arguments and just run those by default?

I definitely understand why you did this, but I think since this script is very specific to this module and this circumstance, it would be fine to rip them out. But it doesn't hurt to keep them, I think. I did think a little bit about whether there might be a different way to specify parameters and algorithms, and I couldn't come up with a better approach given how all the different parameters interact, and again since this is quite specific to your module I think it's fine.

analyses/cell-type-ewings/README.md

analyses/cell-type-ewings/evaluate-clusters.sh

analyses/cell-type-ewings/scripts/clustering-workflow/01-clustering.R

allyhawkins · 2024-11-25T16:28:19Z

In the script I put in arguments for each louvain, leiden-cpm, and leiden-modularity to specify running each of them, but I could also be convinced to remove those arguments and just run those by default?

I definitely understand why you did this, but I think since this script is very specific to this module and this circumstance, it would be fine to rip them out. But it doesn't hurt to keep them, I think. I did think a little bit about whether there might be a different way to specify parameters and algorithms, and I couldn't come up with a better approach given how all the different parameters interact, and again since this is quite specific to your module I think it's fine.

I too was struggling with how best to specify the parameters! I decided to keep the algorithm flags for now. I figured they would be easy to remove in the future if we need to, and they also give us flexibility if we ever want to use this script to run just one algorithm.

I made the minor changes you found and then updated to use the helper function for reading in the list of parameters. This should be ready for another look!

sjspielman

LGTM!

I made one suggestion to fix a bug I probably introduced with my earlier suggestion, and I have one other comment that I can't make directly here - Looking at the resulting plots in an HTML, I realize we probably want some explicit breaks in the x-axis across nn, so there's a tick/value for each parameter value. Right now ggplot2 is choosing breaks for you since it's treating nn as numeric. So, we should either add some breaks for the actual nn value, or tell R to treat it like a factor instead of numeric. But I don't think I need to see again!
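
To make the two options in this comment concrete, here is a minimal ggplot2 sketch. The data frame, its column names, and its values are invented for illustration and are not the module's actual metrics.

```r
library(ggplot2)

# Toy data, invented for illustration only
metrics_df <- data.frame(
  nn = c(5, 10, 20, 40),
  metric_value = c(0.2, 0.3, 0.25, 0.1)
)

# Option 1: keep nn numeric, but place a tick at each tested value
nn_breaks <- sort(unique(metrics_df$nn))
plot_breaks <- ggplot(metrics_df, aes(x = nn, y = metric_value)) +
  geom_point() +
  scale_x_continuous(breaks = nn_breaks)

# Option 2: treat nn as a factor so each value gets its own axis position
plot_factor <- ggplot(metrics_df, aes(x = factor(nn), y = metric_value)) +
  geom_point() +
  labs(x = "nn")
```

Option 1 preserves the numeric spacing between nn values, while option 2 spaces them evenly. The later commit "set breaks for x-axis to be all nn" suggests the first approach was the one adopted.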

sjspielman · 2024-11-25T17:50:46Z

analyses/cell-type-ewings/scripts/clustering-workflow/01-clustering.R

+  param |>
+    stringr::str_split_1(param, ",") |>


probably my bad...

Suggested change

-  param |>
-    stringr::str_split_1(param, ",") |>
+  param |>
+    stringr::str_split_1(",") |>
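
For context on why the original form fails, here is a small standalone sketch. It assumes stringr 1.5.0 or later (where `str_split_1()` is available) and uses a made-up `param` value; it is not the surrounding code from 01-clustering.R.

```r
# Made-up input for illustration
param <- "5,10,15"

# Original form: the pipe already passes `param` as the first argument,
# so this is equivalent to str_split_1(param, param, ","), which errors
# because str_split_1() only accepts a string and a pattern.
buggy <- try(
  param |> stringr::str_split_1(param, ","),
  silent = TRUE
)

# Suggested form: only the pattern is given explicitly
fixed <- param |>
  stringr::str_split_1(",")
```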

allyhawkins added 10 commits November 21, 2024 14:53

script to get clusters

740dff5

remove extra cluster code

276031b

account for only 1 cluster in test data

ae81c28

workflow for clustering

92ee018

save report as html

4913373

use comma separated list for range variables

f23d8cd

update scripts and notebook readmes

bb2fe6d

update main readme

cc76c07

add clustering workflow to GHA

1717896

new line

dc08502

allyhawkins requested a review from jaclyn-taroni as a code owner November 22, 2024 17:42

allyhawkins requested review from sjspielman and removed request for jaclyn-taroni November 22, 2024 17:42

sjspielman reviewed Nov 22, 2024

allyhawkins and others added 3 commits November 25, 2024 10:19

Apply suggestions from code review

76bc4c5

Co-authored-by: Stephanie Spielman <[email protected]>

add jaccard to readme

b045eb8

use helper function for params

c104fe3

allyhawkins requested a review from sjspielman November 25, 2024 16:28

string -> stringr

2a622df

sjspielman approved these changes Nov 25, 2024

allyhawkins added 4 commits November 25, 2024 12:33

fix helper function errors

dbebaab

set breaks for x-axis to be all nn

58180cb

check that clustering results file exists

e5efab7

Merge remote-tracking branch 'AlexsLemonade/main' into allyhawkins/ewing-clustering-workflow

fd3fbc5

allyhawkins merged commit 0b9ce4c into AlexsLemonade:main Nov 26, 2024
3 checks passed

allyhawkins deleted the allyhawkins/ewing-clustering-workflow branch November 26, 2024 14:27
