Decrease size of docker image for Ewing's module #884

Open · allyhawkins opened this issue Nov 15, 2024 · 4 comments

@allyhawkins (Member)

> I've made a change for now, but if we could make this smaller, that'd be swell. I prefer how we were handling permissions.

Originally posted by @jaclyn-taroni in #881 (comment)

It's probably a good idea to make the Ewing image smaller so we don't have to use the bigger runner for the GitHub Actions (GHA) workflow.

Currently, we have two versions of the tumor cell annotation workflow, and we are only using one of them moving forward. The workflow we aren't using requires a conda environment that the other workflow doesn't need, so I think we could create a smaller image without that conda environment. We should go back through the code and double-check that this is a reasonable approach.

allyhawkins self-assigned this Nov 15, 2024
@allyhawkins (Member, Author)

After looking into this more, there are a few options we could take. For some added context, we currently have two separate workflows being run in the Ewing's module for CI, with another workflow on the way.

  • CNV annotation: This workflow was used for initial exploration of annotating tumor cells with different methods, including marker gene annotation, CellAssign, CopyKAT, and InferCNV. First, we manually annotate all tumor cells using a gene expression cutoff, and then we compare those annotations to the output from CellAssign, CopyKAT, and InferCNV.
    This workflow uses both R and Python and requires a Docker image that contains both the renv and conda environments.

  • AUCell annotation: This workflow is used to annotate tumor cells using a combination of AUCell and SingleR. It only uses R and does not depend on the output from the CNV annotation workflow.

Note that we are going to be adding a workflow to assign and evaluate clusters that will depend on the output from the AUCell workflow and will only use R.

After running the CNV workflow on a subset of samples, we pretty much stopped using CellAssign because the tumor cell assignments didn't make sense. We did see that CopyKAT and InferCNV worked decently for a small portion of samples, but we ran into trouble with them once we expanded to more samples. Because of this, we turned to using AUCell and SingleR, which resulted in tumor cell annotations we feel much more confident in. As of right now, we are only using the output from the second (AUCell) workflow to produce annotations.

Based on this information, I think we have three options to consider:

Option 1: The easiest option would be to remove running CellAssign from the CNV workflow, and thus remove the dependency on having Python and the conda environment in the Docker image used for that workflow. Here we would remove the part of the workflow that generates the CellAssign report, and we would stop installing conda and the conda environment from the lock file in the Dockerfile. But we would still keep the conda-lock.yml file, the script that runs CellAssign, and the template notebook that is rendered. This is the least reproducible option but the least time consuming; the only reason I'm suggesting it is that we have completely moved away from using CellAssign in this module. A minimal sketch of the Dockerfile change is below.
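
To make that concrete, here's a rough sketch of the relevant Dockerfile pieces. The base image, file names, and paths are assumptions for illustration, not copied from the module's actual Dockerfile:

```dockerfile
# Hypothetical sketch of the Option 1 change; base image and paths are
# illustrative, not taken from the module's actual Dockerfile.
FROM rocker/r-ver:4.4.0

# The renv-based R environment stays, since AUCell/SingleR and
# CopyKAT/InferCNV all run in R.
COPY renv.lock renv.lock
RUN Rscript -e "install.packages('renv')" \
    && Rscript -e "renv::restore()"

# Removed under Option 1: installing conda and creating the environment
# from the lock file. The conda-lock.yml itself stays in the repo; it just
# isn't baked into the image anymore.
# COPY conda-lock.yml conda-lock.yml
# RUN conda-lock install --name openscpca-ewing conda-lock.yml
```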

Option 2: We could keep all the workflows as is but create two separate Docker images. Because the CNV annotation workflow uses both R and Python, we would also need to create two separate renv.lock files. This way, only the packages needed to run InferCNV/CopyKAT would be in the image used for the CNV annotation workflow, and the packages needed for AUCell/SingleR/clustering would live in a separate R-only image. We could take a similar approach to what we do in scpcaTools, where we create a minimal lock file and then expanded lock files that each include a specific set of packages (see https://github.com/AlexsLemonade/scpcaTools/blob/main/docker/make-requirements.sh). With this option we would be able to create two jobs in the GHA workflow that each run on a separate container, since the jobs are not dependent on each other (see the sketch below). We would also need to figure out how to name the images if we have multiple.
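
As a rough sketch of what the CI side of option 2 could look like, here are two GHA jobs that each run in their own container. The image names and the AUCell script name are hypothetical placeholders:

```yaml
# Hypothetical sketch for Option 2: each workflow runs in its own
# module-specific image. Image names and script paths are illustrative.
name: Ewing module CI

on:
  pull_request:

jobs:
  cnv-annotation:
    runs-on: ubuntu-latest
    container: ghcr.io/example/ewing-cnv:latest  # R + conda image
    steps:
      - uses: actions/checkout@v4
      - name: Run CNV annotation workflow
        run: bash cnv-annotation.sh

  aucell-annotation:
    runs-on: ubuntu-latest
    container: ghcr.io/example/ewing-aucell:latest  # R-only image
    steps:
      - uses: actions/checkout@v4
      - name: Run AUCell annotation workflow
        run: bash aucell-annotation.sh  # script name is illustrative
```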

Option 3: We could move running the Python script for CellAssign out of the cnv-annotation.sh workflow and run it in its own workflow, on a separate GHA job with a Python-only Docker container. The one caveat here is that the CNV annotation workflow includes rendering a report that takes as input the results from CellAssign and marker gene annotation. We need R for that part, so we can't run the full CellAssign workflow in a Python-only container unless we split it up: one script (and GHA job) that just runs CellAssign, and another script (and GHA job) that takes the CellAssign output as input and renders the report. If we took this route, we would need to either use artifacts in GHA to pass the CellAssign output between jobs (see the sketch below), or take a similar approach to option 2 and create separate lock files with separate R packages for each workflow (one for the CellAssign workflow, one for the other workflows).
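
If we went the artifacts route, the handoff might look roughly like this. Job names, image names, and the script/notebook names are all hypothetical:

```yaml
# Hypothetical sketch of the Option 3 artifact handoff (workflow name and
# triggers omitted): a Python-only job runs CellAssign, and a downstream
# R job renders the report from its output.
jobs:
  run-cellassign:
    runs-on: ubuntu-latest
    container: ghcr.io/example/ewing-python:latest  # Python-only image
    steps:
      - uses: actions/checkout@v4
      - name: Run CellAssign
        run: python run-cellassign.py  # script name is illustrative
      - uses: actions/upload-artifact@v4
        with:
          name: cellassign-results
          path: results/cellassign/

  render-report:
    needs: run-cellassign  # waits for the CellAssign output
    runs-on: ubuntu-latest
    container: ghcr.io/example/ewing-r:latest  # R-only image
    steps:
      - uses: actions/checkout@v4
      - uses: actions/download-artifact@v4
        with:
          name: cellassign-results
          path: results/cellassign/
      - name: Render CellAssign report
        run: Rscript -e "rmarkdown::render('cellassign-report.Rmd')"  # notebook name is illustrative
```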

Options 2 and 3 are more involved than option 1, but they are also more reproducible and keep everything running in CI. Because of that, I think option 2 would be the best approach: we keep the workflows the same but create a separate image for each workflow.

Tagging the science team in case any of you have any thoughts or opinions: @jashapiro @jaclyn-taroni @sjspielman
I will plan to start implementing this next sprint (starting 12/1).

@jaclyn-taroni (Member)

I agree we wouldn't want to do option 1. Options 2 and 3 seem more labor-intensive than I was hoping, so I conclude that this is probably not worth doing right now. What we had to do to use the bigger disk on PRs is not (my) ideal solution, but it is workable.

@sjspielman (Member)

Option 2 seems the best one to me. If/when we do this, we might also want to consider adding docs for how to manage/name multiple Dockerfiles. Or, this might be a rare enough occurrence that we prefer to deal with it on a case-by-case basis.

That said, I don't think Option 1 is the worst temporary solution until there is more time to fully address this.

@allyhawkins (Member, Author)

> Option 2 seems the best one to me. If/when we do this, we might also want to consider adding docs for how to manage/name multiple Dockerfiles. Or, this might be a rare enough occurrence that we prefer to deal with it on a case-by-case basis.

I think the benefit of doing this at some point is figuring out how we want to deal with having multiple images in a module. I agree that it would be helpful to establish some docs/rules around this!

That being said, I also agree that this shouldn't be a priority right now, given how much work it will probably be. So maybe this is just something to play around with whenever we have some extra time (or whenever we develop another module that needs two Docker images, whichever comes first).
