Lab 02: Processes and Workflows
To begin, navigate to the directory exercises/01_processes_and_workflows/pipeline. This exercise contains the simplest possible example of a Nextflow pipeline, in which both the processes and the workflow are stored in a single file, "main.nf".
Using a text editor such as nano or vim, inspect the contents of main.nf:
nextflow.enable.dsl=2

process WRITE_ODD {
    input:
    val x

    output:
    path "*txt"

    script:
    """
    echo ${x} > ${x}.txt
    """
}

workflow {
    ch_odd = Channel.of(1, 3, 5, 7)
    ch_even = Channel.of(2, 4, 6, 8)

    WRITE_ODD(ch_odd)

    /*
    goals:
    1. create a process called WRITE_EVEN that takes the queue channel
       "ch_even" as input
    2. see what happens if you re-use WRITE_ODD a second time with
       "ch_even" as input
    3. use the "view" operator to see the output of WRITE_ODD
    */
}
Notice the file begins with:
nextflow.enable.dsl=2
Don't worry much about the deeper meaning of this line. At some point Nextflow's "domain-specific language" (DSL) was updated, and it's advised to always use the newest iteration, DSL2. It's worth noting that help materials for the older syntax are still floating around.
Processes in Nextflow are similar to functions in other programming languages. They typically take an input, run some actions, then produce output. Additionally, each process has its own scope, which means the only variables it has access to are those specifically defined for it (e.g. its inputs, or process scope settings in a nextflow.config file). More on that later.
process WRITE_ODD {
    input:
    val x

    output:
    path "*txt"

    script:
    """
    echo ${x} > ${x}.txt
    """
}
Here we see the three main components of a Nextflow process: the input, output, and script sections. They must appear in this order.
input:
    val x
The input to the WRITE_ODD process in this exercise is referred to as "x" within the scope of the process. When we look ahead to the workflow in a minute, we'll see that x takes on each of the numbers defined in the "ch_odd" channel.
Multiple inputs can be defined, each placed on its own line. Notice how the input variable is preceded by val. Inputs to processes are typed, meaning the kind of information each input contains is declared in advance. Here the input is a simple catch-all value.
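As a rough illustration of a process with more than one input, the sketch below declares two val inputs on separate lines. The process name WRITE_LABELLED and the input name "label" are hypothetical and not part of the exercise; the sketch also assumes DSL2 is enabled, as in main.nf.

```nextflow
nextflow.enable.dsl=2

// Hypothetical process (not part of the exercise) with two typed inputs.
process WRITE_LABELLED {
    input:
    val x        // a plain value, e.g. a number from a channel
    val label    // a second value input, declared on its own line

    output:
    path "*.txt"

    script:
    """
    echo ${label} ${x} > ${label}_${x}.txt
    """
}

workflow {
    ch_numbers = Channel.of(1, 3, 5, 7)

    // Inputs are passed in the order they are declared.
    // A plain value such as 'odd' is reused for every task.
    WRITE_LABELLED(ch_numbers, 'odd')
}
```

The arguments in the workflow call line up with the input declarations by position, so swapping them would swap what "x" and "label" refer to inside the script.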
output:
    path "*txt"
The rules for defining outputs are similar to those for inputs. The outputs are typed, and here the type is path, which lets Nextflow processes and workflows know this data should be passed along as a file path. The "*txt" pattern here will return a channel of all files produced by the script whose names end in "txt".
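To make the pattern matching more concrete, here is a minimal sketch of a process that writes two files per task but declares only one of them as output. WRITE_PAIR is a hypothetical process name used only for illustration, and DSL2 is assumed to be enabled as in main.nf.

```nextflow
nextflow.enable.dsl=2

// Hypothetical process: each task writes a .txt and a .log file,
// but only files matching "*.txt" are declared as output.
process WRITE_PAIR {
    input:
    val x

    output:
    path "*.txt"

    script:
    """
    echo ${x} > ${x}.txt
    echo ${x} > ${x}.log
    """
}

workflow {
    WRITE_PAIR(Channel.of(1, 3))

    // Prints the path of each emitted .txt file.
    // The .log files stay in the task work directory and are not emitted.
    WRITE_PAIR.out.view()
}
```

Files that do not match the declared pattern are still created on disk; they are simply not passed along to the rest of the workflow.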
script:
    """
    echo ${x} > ${x}.txt
    """
The script portion of the process is where the majority of the work in the pipeline occurs. The text between the two lines of three double-quotation marks (""") is the shell script that actually gets executed, typically a call to a command-line tool.
In this example we are echoing the value of "x" (an odd integer from ch_odd) and writing it into a file with the same value in its filename. The use of ${x} references the Nextflow variable "x". This looks identical to ordinary shell syntax, which can be confusing when you want to use variables that you create within the shell script itself. To reference a Bash variable, the dollar sign must be preceded by a backslash ("\$").
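The difference between the two kinds of variable can be sketched as follows. WRITE_COUNTED and the Bash variable n_chars are hypothetical names chosen for this illustration, and DSL2 is assumed to be enabled as in main.nf.

```nextflow
nextflow.enable.dsl=2

// Hypothetical process mixing a Nextflow variable with a Bash variable.
process WRITE_COUNTED {
    input:
    val x

    output:
    path "*.txt"

    script:
    // ${x} is filled in by Nextflow before the script runs.
    // \$( ... ) and \$n_chars are escaped, so Nextflow leaves them alone
    // and Bash evaluates them when the script runs.
    """
    n_chars=\$(echo -n ${x} | wc -c)
    echo "${x} has \$n_chars character(s)" > ${x}.txt
    """
}

workflow {
    WRITE_COUNTED(Channel.of(1, 3, 5, 7))
}
```

Leaving out the backslash would make Nextflow try to resolve n_chars as one of its own variables, which fails because no such Nextflow variable exists.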
Nextflow's built-in Groovy-based language offers many ways to manipulate data and define variables before calling any command-line tools. This isn't used in the above example, but Groovy code can appear before, or even within, the triple quotes, which allows a lot of flexibility.
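A minimal sketch of both placements, assuming the same integer inputs as ch_odd: WRITE_ODD_PADDED and the variable "padded" are hypothetical names, and DSL2 is assumed to be enabled as in main.nf.

```nextflow
nextflow.enable.dsl=2

// Hypothetical variant of WRITE_ODD that uses Groovy in two places.
process WRITE_ODD_PADDED {
    input:
    val x

    output:
    path "*.txt"

    script:
    // Groovy before the triple quotes: zero-pad the number, e.g. 1 -> "001"
    def padded = x.toString().padLeft(3, '0')
    """
    echo "${x} doubled is ${x * 2}" > sample_${padded}.txt
    """
}

workflow {
    WRITE_ODD_PADDED(Channel.of(1, 3, 5, 7))
}
```

The expression ${x * 2} is Groovy evaluated inside the triple quotes, so the shell only ever sees the resulting number.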
The workflow is the entry point to a Nextflow pipeline. Like processes, workflows are wrapped in curly brackets and have their own scope. In this example, the process WRITE_ODD is stored in the same "main.nf" file as the main workflow, so WRITE_ODD is in the workflow's scope (the workflow knows it's there).
The two queue channels are created using the "Channel.of()" channel factory, which makes a queue channel from the values passed to it. The "ch_odd" channel is passed as input to the WRITE_ODD process. The remainder of the workflow is a comment, which Nextflow ignores. Single-line comments start with "//", while longer, multi-line comments are wrapped in "/*" and "*/", as in this example.
workflow {
    ch_odd = Channel.of(1, 3, 5, 7)
    ch_even = Channel.of(2, 4, 6, 8)

    WRITE_ODD(ch_odd)

    /*
    goals:
    1. create a process called WRITE_EVEN that takes the queue channel
       "ch_even" as input
    2. see what happens if you re-use WRITE_ODD a second time with
       "ch_even" as input
    3. use the "view" operator to see the output of WRITE_ODD
    */
}
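For comparison, the sketch below shows both comment styles alongside two ways of building a queue channel. Channel.fromList is a standard Nextflow channel factory that is not used in the exercise itself; the variable name even_numbers is made up for this illustration, and DSL2 is assumed to be enabled as in main.nf.

```nextflow
nextflow.enable.dsl=2

workflow {
    // Single-line comment: values passed directly as arguments
    ch_odd = Channel.of(1, 3, 5, 7)

    /*
    Multi-line comment:
    an existing list can be turned into a queue channel
    with Channel.fromList
    */
    def even_numbers = [2, 4, 6, 8]
    ch_even = Channel.fromList(even_numbers)

    // Print each value emitted by the two channels
    ch_odd.view()
    ch_even.view()
}
```

Both channels emit their values one at a time, so a process receiving either one would run once per value.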
The simplest way to run a Nextflow pipeline is:
nextflow <path to main.nf>
This is shorthand for the longer form, nextflow run <path to main.nf>.
If you move up one directory, out of the "pipeline" directory, you'll see "run_01.sh", which runs the Nextflow pipeline with a few extra fancy arguments:
nextflow ./pipeline/main.nf -with-report report.html -with-timeline timeline.html
The -with-report and -with-timeline arguments tell Nextflow to produce detailed reports summarizing the pipeline run. These can be very useful for understanding the resources used by a pipeline, but they are only produced once the pipeline has run to completion, so they're not terribly useful for debugging.