Lab 04: Structuring Projects and Inputs

Ryan edited this page Jan 8, 2024 · 9 revisions

Structuring Projects and Inputs

Inspect project organization

Go to exercise directory 04_structure_and_input and change to the pipeline directory.

This project is organized as follows:

.
├── fasta_seqs
│   ├── seqs_1.fna
│   ├── seqs_2.fna
│   ├── seqs_3.fna
│   ├── seqs_4.fna
│   └── seqs_5.fna
├── pipeline
│   ├── main.nf
│   ├── modules
│   │   └── fasta_utils.nf
│   └── nextflow.config
└── run_01.sh

Notice that we now have a modules subdirectory containing 'fasta_utils.nf'.

We also have a directory of five FASTA sequence files to practice with.

Importing processes

Previously, we included our processes within the "main.nf" file. Although this is possible, it can quickly become very crowded as processes are added, so we will now practice writing our Nextflow processes in separate files and importing them into main.nf:

include { GET_HEADERS } from "./modules/fasta_utils.nf"

Here we import the process "GET_HEADERS" from the file "fasta_utils.nf", which makes it available within the scope of the main.nf workflow.

Workflow conditionals

This "main.nf" workflow contains a new type of syntax in the form of an if statement. If statements follow the Groovy language convention of if (condition) { code to perform if condition is true }. For example:

if (params.fasta_seqs) {
    ch_fastas = Channel.fromPath("${params.fasta_seqs}/*fna", checkIfExists: true)
    GET_HEADERS(ch_fastas)
}

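This "if block" first tests whether the "fasta_seqs" parameter (from the "params" scope) holds a truthy value, such as a directory path. If it does, a channel called ch_fastas is created using the Channel.fromPath channel factory, which matches every file ending in "fna" in that directory, and this channel is passed into the GET_HEADERS process we included earlier. The checkIfExists: true option makes Nextflow stop with an error when nothing is found at that path.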

Publish directories

As we noticed from the "work" output earlier, we didn't actually have access to our results in a useful way. If we inspect "modules/fasta_utils.nf", we'll see that the process begins with publishDir. The lines preceding the Nextflow input line are called directives. Many directives exist, and they add further customization to your processes, including which container or package to use, the number of CPUs, the amount of memory, the executor type (local, slurm, etc.), and many other critical runtime behaviors.

process GET_HEADERS {
    publishDir(path: "${publish_dir}/headers", mode: "symlink")

    input:
        path fasta_file

    output:
        path "*headers.txt", emit: ch_headers

    script:
        """
        grep "^>" ${fasta_file} > ${fasta_file.baseName}_headers.txt
        """
}

In the above example, the publishDir directive takes every file emitted on the output channel (files ending in "headers.txt") and publishes it to the $publish_dir/headers directory. Because the mode is "symlink", the published entries are symbolic links pointing back to the files in the work directory. Notice the dollar sign before "publish_dir": it tells Nextflow to substitute the value of a variable named publish_dir, which must therefore be defined somewhere. We didn't pass "publish_dir" as an input to the process, so it must have been defined elsewhere. As covered in the lecture on Nextflow scope, variables in the process scope (and many others) can be defined in a configuration file.
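
To see what the script block itself produces, the same grep command can be run by hand outside Nextflow. This is only a minimal sketch in a scratch directory: the file name seqs_demo.fna and its two records are made up for illustration, and the Nextflow variables are filled in by hand.

```shell
# Work in a throwaway directory so nothing in the project is touched.
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical two-record FASTA file (not one of the lab's seqs_*.fna files).
printf '>seq_A sample one\nACGT\n>seq_B sample two\nGGCC\n' > seqs_demo.fna

# Same command as the GET_HEADERS script block, with the Nextflow
# variables replaced by hand:
#   ${fasta_file}          -> seqs_demo.fna
#   ${fasta_file.baseName} -> seqs_demo
grep "^>" seqs_demo.fna > seqs_demo_headers.txt

# Show the header lines that were extracted.
cat seqs_demo_headers.txt
```

Only the lines beginning with ">" end up in seqs_demo_headers.txt, and that name matches the "*headers.txt" pattern in the output block, which is how Nextflow picks the file up.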

Configuration files

Let's inspect the included configuration file for this project, nextflow.config.

⚠️ By default, Nextflow looks for configuration settings in a file that must be named "nextflow.config". This configuration file can in turn include additional configuration files.

params {
    fasta_seqs = false
}

process {
    publish_dir = "${params.publish_dir}"
}

This is a very simple configuration file. The params scope sets a default value of false for "fasta_seqs", which is used in the workflow scope. The process scope holds the mysterious publish_dir variable we saw in "fasta_utils.nf". But where is the params.publish_dir variable coming from?

Input variables

So far we've seen ways to define parameters in our main.nf workflow and within our configuration file. It may also be useful to define variables from the command line when we call our Nextflow pipeline. To do this, we use the convention of double-hyphens before the parameter name. For example, if we want to define our fasta_seqs and publish_dir variables, we could use the following syntax when calling our pipeline:

nextflow run ./pipeline/main.nf --fasta_seqs ./fasta_seqs --publish_dir results

⚠️ Single hyphens before an option name are reserved for Nextflow-specific runtime options (e.g., "-resume", "-version", "-with-report", etc.).

Variable priority

In our "nextflow.config" file, we define a default value of false for "fasta_seqs", which allows our main.nf workflow to skip the GET_HEADERS process when no value is supplied for params.fasta_seqs. We'll see in this exercise that providing a correct path to the fasta_seqs directory allows the workflow to call the GET_HEADERS process successfully. This small experiment shows that values supplied on the command line override values in our configuration file. This is very helpful, as it lets us define defaults for any variables we choose in our pipeline.

It is important to test and understand which variables will be given priority, because including multiple configuration files can become complex. Per the Nextflow training documents:

When a pipeline script is launched, Nextflow looks for configuration files in multiple locations. Since each configuration file can contain conflicting settings, the sources are ranked to determine which settings are applied. Possible configuration sources, in order of priority:

  1. Parameters specified on the command line (--something value)
  2. Parameters provided using the -params-file option
  3. Config file specified using the -c my_config option
  4. The config file named nextflow.config in the current directory
  5. The config file named nextflow.config in the workflow project directory
  6. The config file $HOME/.nextflow/config
  7. Values defined within the pipeline script itself (e.g. main.nf)

When more than one of these options for specifying configurations are used, they are merged, so that the settings in the first override the same settings appearing in the second, and so on.
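
The "first source wins" rule can be pictured with a small toy model. This is not how Nextflow is implemented; resolve_setting and the three variables are hypothetical names invented for this sketch, and only three of the seven sources are represented.

```shell
# Toy model of configuration priority (NOT Nextflow's implementation).
# resolve_setting prints the first non-empty argument; arguments are
# passed in order from highest priority to lowest.
resolve_setting() {
    for candidate in "$@"; do
        if [ -n "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1   # no source supplied a value
}

# Hypothetical values for fasta_seqs from three of the sources above.
from_command_line="./fasta_seqs"   # --fasta_seqs ./fasta_seqs
from_params_file=""                # no -params-file given
from_nextflow_config="false"       # default in nextflow.config

# The command-line value wins because it is listed first.
resolve_setting "$from_command_line" "$from_params_file" "$from_nextflow_config"

# With no command-line value, the nextflow.config default is used.
resolve_setting "" "$from_params_file" "$from_nextflow_config"
```

The first call prints ./fasta_seqs and the second prints false, mirroring how the workflow skips GET_HEADERS until a path is given on the command line.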

Calling Nextflow variables vs. bash variables

As part of the exercise challenges for this lab, we'll encounter an additional process in the "modules/fasta_utils.nf" file called CONCAT_HEADERS. The parts and syntax of this process should be familiar, but notice that within the for-loop of our "script" we see the use of \$header (backslash before $header). In this instance, $header is the element being iterated over in our bash for-loop, and therefore it is a bash variable, not an input variable for this Nextflow process. It is fine to use bash variables in Nextflow processes, but you must put the escape character (backslash) before the dollar sign; otherwise Nextflow tries to substitute its own variable named header before bash ever runs.

process CONCAT_HEADERS {
    publishDir(path: "${publish_dir}/concat", mode: "copy")

    input:
        path header_files

    output:
        path "combined_headers.txt", emit: ch_combined

    script:
        """
        for header in *txt; do
          cat \$header >> combined_headers.txt
        done
        """
}
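
Once Nextflow has rendered the script block, the backslash is gone and bash receives a plain $header. The following is a minimal sketch of that rendered loop, run by hand in a scratch directory; the two header files and their contents are made up for illustration and stand in for the staged GET_HEADERS outputs.

```shell
# Work in a throwaway directory so nothing in the project is touched.
workdir=$(mktemp -d)
cd "$workdir"

# Two hypothetical header files standing in for staged inputs.
printf '>seq_A\n' > seqs_1_headers.txt
printf '>seq_B\n' > seqs_2_headers.txt

# What bash sees after \$header in the process has become $header.
# The *txt glob is expanded once, before combined_headers.txt exists,
# so the output file is not read back into itself.
for header in *txt; do
  cat "$header" >> combined_headers.txt
done

# Show the concatenated headers.
cat combined_headers.txt
```

Without the backslash in the process definition, Nextflow would look for its own variable called header and fail before this loop ever reached bash.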