Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add workflow for evaluating clustering to Ewing's module #908

Merged

Conversation

allyhawkins
Copy link
Member

Purpose/implementation Section

Please link to the GitHub issue that this pull request addresses.

Closes #897
Closes #686

What is the goal of this pull request?

This PR takes the notebook that was modified in #895 and moves it into a workflow that can be run to perform and evaluate clustering across all samples in the Ewing's project (SCPCP000015). The workflow includes two steps:

  1. Run clustering across a set of parameters
  2. Render a report summarizing the clustering metrics across all parameters tested, generating one report per library

Briefly describe the general approach you took to achieve this goal.

  • I first moved the actual clustering out of the notebook and into its own script. Now the script takes in different parameter values to test and an SCE object and outputs a TSV file with the cluster results from all parameters tested. In the script I put in arguments for each louvain, leiden-cpm, and leiden-modularity to specify running each of them, but I could also be convinced to remove those arguments and just run those by default?
  • The notebook 01-clustering-metrics.Rmd now reads in this TSV file and creates the plots looking at individual metrics.
  • I added a evaluate-clusters.sh workflow that has been added to CI and I tested locally with the test data. This workflow runs the clustering script and renders the clustering report for all samples in the project.
  • I updated the necessary README files including the main README to describe what is currently in the workflow.

If known, do you anticipate filing additional pull requests to complete this analysis module?

Yes. Next I am going to work on the second notebook in the workflow (see #896). This notebook will take the clustering results TSV and a set of parameters to look at and compare clusters to cell types and marker gene set scores. I'm still thinking about exactly how this notebook will look and where it will fit into the workflow, but the idea is to be able to look at the output from the metrics report rendered and choose a set of parameters to narrow in on and look at those clusters in a biological context.

Provide directions for reviewers

What are the software and computational requirements needed to be able to run the code in this PR?

Should be able to run this locally without any issues.

Are there particularly areas you'd like reviewers to have a close look at?

Just a note that the actual workflow itself has the exact same logic as the aucell-singler-annotation.sh and a lot of the same code. The biggest code changes that will need to be reviewed is the clustering script since that is brand new.

Author checklists

Analysis module and review

Reproducibility checklist

  • Code in this pull request has been added to the GitHub Action workflow that runs this module.
  • The dependencies required to run the code in this pull request have been added to the analysis module Dockerfile.
  • If applicable, the dependencies required to run the code in this pull request have been added to the analysis module conda environment.yml file.
  • If applicable, R package dependencies required to run the code in this pull request have been added to the analysis module renv.lock file.

@allyhawkins allyhawkins requested review from sjspielman and removed request for jaclyn-taroni November 22, 2024 17:42
Copy link
Member

@sjspielman sjspielman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did a quick-end-of-friday review, but I plan to do a more careful look next Monday! I think it's probably all there now as is, but I don't 100% trust my brain at the moment to be sure of that 🥴

In the script I put in arguments for each louvain, leiden-cpm, and leiden-modularity to specify running each of them, but I could also be convinced to remove those arguments and just run those by default?

I definitely understand why you did this, but I think since this script is very specific to this module and this circumstance, it would be fine to rip them out. But it doesn't hurt to keep them, I think. I did think a little bit about whether there might be a different way to specify parameters and algorithms, and I couldn't come up with a better approach given how all the different parameters interact, and again since this is quite specific to your module I think it's fine.

analyses/cell-type-ewings/README.md Outdated Show resolved Hide resolved
analyses/cell-type-ewings/README.md Outdated Show resolved Hide resolved
analyses/cell-type-ewings/evaluate-clusters.sh Outdated Show resolved Hide resolved
@allyhawkins
Copy link
Member Author

In the script I put in arguments for each louvain, leiden-cpm, and leiden-modularity to specify running each of them, but I could also be convinced to remove those arguments and just run those by default?

I definitely understand why you did this, but I think since this script is very specific to this module and this circumstance, it would be fine to rip them out. But it doesn't hurt to keep them, I think. I did think a little bit about whether there might be a different way to specify parameters and algorithms, and I couldn't come up with a better approach given how all the different parameters interact, and again since this is quite specific to your module I think it's fine.

I too was struggling with how best to specify the parameters! I decided to keep in the algorithm flags for now and figured it would be easier to remove in the future if we need to, but also will allow for flexibility if we ever want to use this script to just run one algorithm.

I made the minor changes you found and then updated to use the helper function for reading in the list of parameters. This should be ready for another look!

Copy link
Member

@sjspielman sjspielman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

I made one suggestion to fix a bug I probably introduced with my earlier suggestion, and I have one other comment that I can't make directly here - Looking at the resulting plots in an HTML, I realize we probably want some explicit breaks in the x-axis across nn, so there's a tick/value for each parameter value. Right now ggplot2 is choosing breaks for you since it's treating nn as numeric. So, we should either add some breaks for the actual nn value, or tell R to treat it like a factor instead of numeric. But I don't think I need to see again!

Comment on lines 107 to 108
param |>
stringr::str_split_1(param, ",") |>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably my bad...

Suggested change
param |>
stringr::str_split_1(param, ",") |>
param |>
stringr::str_split_1(",") |>

@allyhawkins allyhawkins merged commit 0b9ce4c into AlexsLemonade:main Nov 26, 2024
3 checks passed
@allyhawkins allyhawkins deleted the allyhawkins/ewing-clustering-workflow branch November 26, 2024 14:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants