Lab 06: Calling Containers with Nextflow

Ryan edited this page Jan 12, 2024 · 5 revisions

Calling Containers with Nextflow

Now that we've covered the basics of channels, processes, workflows, and choosing containers, it's time to put what we've learned into a framework that we can actually apply to common sequence analysis pipelines.

Practice Dataset

For this pipeline, we'll be using a practice RNA-Seq dataset from the Griffith Lab. This dataset is ideal for practice because it consists of 6 human RNA samples subset to chromosome 22 only, which keeps the input files small.

Input Fastq Files with fromFilePairs

Let's head over to exercises/06_nextflow_containers/pipeline to view a Nextflow pipeline set up to use Singularity containers.

In our main.nf file, we see a new channel factory, fromFilePairs.

ch_fastqs = Channel.fromFilePairs("${params.fastq_seqs}/*read{1,2}.fastq.gz", checkIfExists: true, flat:true)

Previously, we used the fromPath channel factory to load our fasta sequence files into a queue channel. Nextflow also includes a channel factory designed specifically for paired-end fastq sequence files: fromFilePairs.

fromFilePairs uses the glob pattern to group the files that belong to the same sample, and by default the output channel consists of tuples with the following information:

[sample_a, [/my/data/sample_a_1.fastq, /my/data/sample_a_2.fastq]]
[sample_b, [/my/data/sample_b_1.fastq, /my/data/sample_b_2.fastq]]
[sample_c, [/my/data/sample_c_1.fastq, /my/data/sample_c_2.fastq]]

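We can see that the first element is the sample base name (stripped of the R1/R2 and fastq suffix) and the second element is a nested list with the full paths to the R1 and R2 reads in order. Because our main.nf passes flat: true, that nested list is flattened, so each item in ch_fastqs has three elements: the sample name, the R1 path, and the R2 path. Very useful!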
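To compare the default nested output with the flat: true form used in main.nf, both channels can be printed with view. This is a minimal sketch, assuming params.fastq_seqs points at a directory of files named like *read1.fastq.gz and *read2.fastq.gz; the default path below is hypothetical and not taken from the lab repository.

```nextflow
// Hypothetical default; the lab pipeline defines params.fastq_seqs itself
params.fastq_seqs = "data/fastq"

workflow {
    // Default (flat: false): each item is [id, [read1, read2]]
    Channel
        .fromFilePairs("${params.fastq_seqs}/*read{1,2}.fastq.gz", checkIfExists: true)
        .view { id, reads -> "nested: ${id} -> ${reads}" }

    // flat: true: each item is [id, read1, read2]
    Channel
        .fromFilePairs("${params.fastq_seqs}/*read{1,2}.fastq.gz", checkIfExists: true, flat: true)
        .view { id, r1, r2 -> "flat: ${id} -> ${r1}, ${r2}" }
}
```

Running this against the practice data should print one "nested" line and one "flat" line per sample, which makes it easy to check that the glob matched the files you expected.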

If we inspect the process that receives this ch_fastqs channel, in modules/fastqc.nf, we see the input declared as a tuple, with each of its components declared as either a val or a path. The three components match the three elements that flat: true produces for each sample.

input:
    tuple val(id), path(r1), path(r2)
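The way each tuple element binds to a name in the process can be sketched with a small stand-in process. SHOW_PAIR is a hypothetical process written for illustration, not part of the lab pipeline, and it assumes params.fastq_seqs is set as in main.nf.

```nextflow
// Hypothetical process: prints the three values bound from each tuple
process SHOW_PAIR {
    input:
    tuple val(id), path(r1), path(r2)   // id is a plain value; r1 and r2 are staged files

    output:
    stdout

    script:
    """
    echo "sample=${id} forward=${r1} reverse=${r2}"
    """
}

workflow {
    // flat: true yields [id, r1, r2], matching the three-part tuple above
    ch_fastqs = Channel.fromFilePairs("${params.fastq_seqs}/*read{1,2}.fastq.gz", checkIfExists: true, flat: true)
    SHOW_PAIR(ch_fastqs)
    SHOW_PAIR.out.view()
}
```

Without flat: true, each item would carry only two elements (the name and a list of files), and the three-part input declaration would no longer line up with the channel.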

Process Directives for Container Use

While we're in modules/fastqc.nf, let's inspect the new process directive container, directly above the input portion of the FASTQC process.

container = "quay.io/biocontainers/fastqc:0.12.1--hdfd78af_0"

This specifies that the script portion of the process will run inside a container, specifically one created from the quay.io biocontainers image named in the directive. Our previous process examples have relied only upon commands and tools available in the working shell environment, which is what Nextflow uses by default if no container or package is specified.
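For orientation, a whole process using this directive might look roughly like the sketch below. Only the image name and the input declaration come from the lab; the output pattern and the fastqc command line are assumptions, and the actual modules/fastqc.nf may differ. The directive is written here without an equals sign, which is the form shown in the Nextflow documentation.

```nextflow
process FASTQC {
    // Image in which the script block below is executed
    container "quay.io/biocontainers/fastqc:0.12.1--hdfd78af_0"

    input:
    tuple val(id), path(r1), path(r2)

    output:
    // Assumed pattern: FastQC writes one .html and one .zip report per input file
    path "*_fastqc.{zip,html}"

    script:
    """
    fastqc ${r1} ${r2}
    """
}
```

Nothing in the script block mentions the container. The same fastqc command would run in the host shell if the directive were removed, provided fastqc were installed there.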

⭐ How do we know which container software is being used?

⭐ Where is the container image being stored?

Nextflow Configuration Settings for Containers with Singularity

Let's look at the nextflow.config file for this pipeline.

Midway down the file, we see new settings defining a singularity scope.

singularity {
    enabled = true
    cacheDir = "${HOME}/singularity/"
    autoMounts = true
}

In this case, Singularity is enabled to run our process containers, the cache directory is set to save image data to $HOME/singularity, and autoMounts is enabled.
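Nextflow configuration files also accept dot notation for scoped settings, so the same three options could be written one per line. The sketch below does that and adds a comment on each setting; the comments paraphrase the Nextflow configuration documentation rather than anything in the lab's own file.

```nextflow
// nextflow.config: the same singularity scope in dot notation
singularity.enabled    = true                    // run process scripts through Singularity
singularity.cacheDir   = "${HOME}/singularity/"  // directory where pulled images are kept for reuse
singularity.autoMounts = true                    // let Nextflow mount the host paths a task needs
```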