Lab 04: Structuring Projects and Inputs
Go to the exercise directory 04_structure_and_input and change to the pipeline directory.
This project is organized as follows:
.
├── fasta_seqs
│   ├── seqs_1.fna
│   ├── seqs_2.fna
│   ├── seqs_3.fna
│   ├── seqs_4.fna
│   └── seqs_5.fna
├── pipeline
│   ├── main.nf
│   ├── modules
│   │   └── fasta_utils.nf
│   └── nextflow.config
└── run_01.sh
Notice that we now have a modules subdirectory containing "fasta_utils.nf".
We also have a directory of 5 fasta sequence files to practice with.
Previously, we included our processes within the "main.nf" file. Although this is possible, it can quickly become very crowded as processes are added, so we will now practice writing our Nextflow processes in separate files and importing them into main.nf:
include { GET_HEADERS } from "./modules/fasta_utils.nf"
Here we import the process "GET_HEADERS" from the file "fasta_utils.nf" and it is available within the scope of the main.nf workflow.
This "main.nf" workflow contains a new type of syntax: an if statement. If statements follow the Groovy convention of if (condition) { code to run when the condition is true }. For example:
if (params.fasta_seqs) {
    ch_fastas = Channel.fromPath("${params.fasta_seqs}/*fna", checkIfExists: true)
    GET_HEADERS(ch_fastas)
}
This "if block" first tests whether the parameter "fasta_seqs" (from the params scope) holds a value that evaluates to true, such as a directory path. If it does, a channel called ch_fastas is created with the Channel.fromPath channel factory, and this channel is passed into the GET_HEADERS process we included earlier.
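To get a feel for which files the "*fna" pattern would pick up, the matching can be imitated in plain bash. This is only a rough sketch: the scratch directory, the notes.txt file, and the fasta_seqs shell variable are stand-ins invented for illustration, and bash globbing is used here as an approximation of what Channel.fromPath does.

```shell
# Build a scratch copy of the layout (notes.txt is a made-up extra file).
workdir="$(mktemp -d)"
mkdir "$workdir/fasta_seqs"
touch "$workdir/fasta_seqs/seqs_1.fna" \
      "$workdir/fasta_seqs/seqs_2.fna" \
      "$workdir/fasta_seqs/notes.txt"

# Stand-in for params.fasta_seqs as it would be given on the command line.
fasta_seqs="$workdir/fasta_seqs"

# Apply the same "*fna" pattern used in the workflow and count the matches.
matched=0
for f in "$fasta_seqs"/*fna; do
    [ -e "$f" ] || continue   # skip the literal pattern if nothing matched
    echo "would be emitted: $(basename "$f")"
    matched=$((matched + 1))
done
echo "files matched: $matched"
```

Only the two .fna files are counted; notes.txt is ignored because it does not end in "fna".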
As we noticed from the "work" output earlier, we didn't actually have access to our results in a useful way. If we inspect "modules/fasta_utils.nf", we'll see that the process begins with publishDir. The lines preceding the Nextflow input line are called directives. Many directives exist, and they add further customization to your processes, including which container or package to use, the number of CPUs, the executor type (local, slurm, etc.), memory, and other critical runtime behaviors.
process GET_HEADERS {
    publishDir(path: "${publish_dir}/headers", mode: "symlink")

    input:
    path fasta_file

    output:
    path "*headers.txt", emit: ch_headers

    script:
    """
    grep "^>" ${fasta_file} > ${fasta_file.baseName}_headers.txt
    """
}
In the above example, the publishDir directive takes every file from the output channel (files ending in "headers.txt") and writes it to the specified ${publish_dir}/headers directory. Notice the dollar sign before "publish_dir", signifying that it is a variable whose value has been defined somewhere. We didn't send "publish_dir" as an input variable, so it must have been defined elsewhere. As covered in the lecture on Nextflow scope, variables in the process scope (and many others) can be defined in a configuration file.
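To see what the script block of GET_HEADERS produces, the same grep command can be tried by hand. The sketch below uses a tiny made-up FASTA file, and the shell variables fasta_file and base_name are hypothetical stand-ins for the values Nextflow would fill in; the ${fasta_file%.*} expansion is a bash approximation of what baseName returns.

```shell
# Work in a scratch directory so nothing in the lab folder is touched.
workdir="$(mktemp -d)"
cd "$workdir"

# A tiny made-up FASTA file with two records.
printf '>seq_a sample one\nACGT\n>seq_b sample two\nGGCC\n' > seqs_1.fna

# Stand-ins for the values Nextflow substitutes into the script block.
fasta_file="seqs_1.fna"
base_name="${fasta_file%.*}"   # drops the final extension, giving "seqs_1"

# Same command as in the process: keep only the lines starting with ">".
grep "^>" "$fasta_file" > "${base_name}_headers.txt"
cat "${base_name}_headers.txt"
```

The resulting seqs_1_headers.txt ends in "headers.txt", so it would match the output pattern declared in the process.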
Let's inspect the included configuration file for this project, nextflow.config.
params {
    fasta_seqs = false
}
process {
    publish_dir = "${params.publish_dir}"
}
This is a very simple configuration file. The params scope sets a default value of false for "fasta_seqs", which is used in the workflow scope. The process scope has the mysterious publish_dir variable we were seeing in "fasta_utils.nf". But where is the params.publish_dir variable coming from?
So far we've seen ways to define parameters in our main.nf workflow and within our configuration file. It may also be useful to define variables from the command line when we call our Nextflow pipeline. To do this, we use the convention of double-hyphens before the parameter name. For example, if we want to define our fasta_seqs and publish_dir variables, we could use the following syntax when calling our pipeline:
nextflow run ./pipeline/main.nf --fasta_seqs ./fasta_seqs --publish_dir results
In our "nextflow.config" file, we define a default value of false for "fasta_seqs", which lets our main.nf workflow skip the GET_HEADERS process when no value is supplied. We'll see in this exercise that providing a correct path to the fasta_seqs directory allows the workflow to call the GET_HEADERS process successfully. This small experiment shows that values supplied on the command line override values in our configuration file. This is very helpful, as it lets us define defaults for any variable(s) we choose in our pipeline while still changing them for individual runs.
It is important to test and understand which variables will be given priority, because including multiple configuration files can become complex. Per the Nextflow training documents:
When a pipeline script is launched, Nextflow looks for configuration files in multiple locations. Since each configuration file can contain conflicting settings, the sources are ranked to determine which settings are applied. Possible configuration sources, in order of priority:
- Parameters specified on the command line (--something value)
- Parameters provided using the -params-file option
- Config file specified using the -c my_config option
- The config file named nextflow.config in the current directory
- The config file named nextflow.config in the workflow project directory
- The config file $HOME/.nextflow/config
- Values defined within the pipeline script itself (e.g. main.nf)
When more than one of these options for specifying configurations are used, they are merged, so that the settings in the first override the same settings appearing in the second, and so on.
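As a sketch of the second source in that ranking, the two parameters used in this lab could be kept in a params file. The file name params.yaml and the value results_cli are assumptions made for illustration; the nextflow command is shown only as a comment because running it requires Nextflow and the lab files.

```shell
# Work in a scratch directory so nothing in the lab folder is touched.
workdir="$(mktemp -d)"
cd "$workdir"

# Write a params file (hypothetical name) holding the two pipeline parameters.
cat > params.yaml <<'EOF'
fasta_seqs: "./fasta_seqs"
publish_dir: "results"
EOF
cat params.yaml

# Hypothetical invocation, not executed here:
#   nextflow run ./pipeline/main.nf -params-file params.yaml --publish_dir results_cli
# Following the ranking quoted above, publish_dir would be expected to resolve
# to "results_cli" (command line) and fasta_seqs to "./fasta_seqs" (params file),
# with both taking precedence over the defaults in nextflow.config.
```

Keeping run-specific values in a params file and leaving defaults in nextflow.config is one way to make the source of each setting easier to trace.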
As part of the exercise challenges for this lab, we'll encounter an additional process in the "modules/fasta_utils.nf" file called CONCAT_HEADERS. The parts and syntax of this process should be familiar, but notice that within the for-loop of our "script" we see the use of \$header (backslash before $header). In this instance, $header is the element being iterated over in our bash for-loop, and therefore it is a bash variable, not an input variable for this Nextflow process. It is fine to use bash variables in Nextflow processes, but you must put the escape character (backslash) before the dollar sign; otherwise Nextflow tries to resolve the name as one of its own variables before the script ever reaches bash.
process CONCAT_HEADERS {
    publishDir(path: "${publish_dir}/concat", mode: "copy")

    input:
    path header_files

    output:
    path "combined_headers.txt", emit: ch_combined

    script:
    """
    for header in *txt; do
        cat \$header >> combined_headers.txt
    done
    """
}
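The effect of the backslash can be imitated in plain bash with an unquoted heredoc, which behaves in a roughly similar way to the Nextflow script string: unescaped variables are filled in when the script text is generated, while escaped ones are left for bash to resolve when the script runs. This is an analogy rather than a description of Nextflow internals; the pattern variable, the command.sh file name, and the header file contents are all made up for the sketch.

```shell
# Work in a scratch directory with two small header files (made-up content).
workdir="$(mktemp -d)"
cd "$workdir"
printf '>seq_a\n' > seqs_1_headers.txt
printf '>seq_b\n' > seqs_2_headers.txt

# The unquoted heredoc expands ${pattern} now, while \$header is written out
# as a literal $header for bash to resolve later, when command.sh runs.
pattern='*txt'
cat > command.sh <<EOF
for header in ${pattern}; do
    cat \$header >> combined_headers.txt
done
EOF

# Show the generated script, run it, and show the concatenated result.
cat command.sh
bash command.sh
cat combined_headers.txt
```

The generated command.sh contains a plain $header with no backslash, which is what bash needs to see; without the escape, the name would have been substituted (or rejected as undefined) before the loop ever ran.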