adding snake

alefisico committed Nov 13, 2024 (1 parent af83bfa, commit 132ec6a)
Showing 2 changed files with 198 additions and 22 deletions.

189 changes: 178 additions & 11 deletions episodes/Snakemake.md

## Understanding the Basics

At its core, Snakemake organizes computational workflows into **rules**, each representing a specific task within the pipeline. These rules are interconnected through their input and output files, forming a dependency graph. Using a Python-like syntax, you can specify the precise commands required to generate output files from input data. Snakemake intelligently analyzes this dependency graph to determine the optimal execution order and parallelizes tasks to maximize efficiency, making it ideal for large-scale data analysis projects.

## The Core Components of a Snakemake Workflow

1. Snakefile: This is the main file of a Snakemake workflow. It contains the definition of all rules and config file references. This is where you define your pipeline.
2. Config file: This file defines parameters and variables that can be used in your rules. _(Useful, but not mandatory; a minimal example follows this list.)_
3. Rules: These define the steps in your pipeline. Each rule has three important parts:
    * Input files: The files that the rule needs to start.
    * Output files: The files that the rule will produce.
    * Shell command: The command to be executed to produce the output files from the input files.
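
For illustration, a minimal `config.yaml` for the example below could look like this (the sample names match the dry run shown later):

```
samples:
  - sample1
  - sample2
  - sample3
  - sample4
  - sample5
```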


## A Simple Example: A Parallel Workflow

```
rule all:
    input:
        expand("results/analysis_{sample}.txt", sample=config["samples"])

rule simulate_data:
    output:
        "data/{sample}.txt"
    shell:
        "python simulate_data.py --sample {wildcards.sample} > {output}"

rule analyze_data:
    input:
        data="data/{sample}.txt"
    output:
        "results/analysis_{sample}.txt"
    shell:
        ...
```

### Explanation of the Workflow
* Rules:
    * **rule simulate_data**: This rule simulates data for each sample. As it has no input dependencies, it can be executed at the beginning of the pipeline.
    * **rule analyze_data**: This rule analyzes the simulated data for each sample. It depends on the output of the `simulate_data` rule, which ensures that it only runs after `simulate_data` is complete.
    * **rule all**: Conventionally placed first, this rule specifies the *final goal* of the workflow. By defining the desired output, Snakemake automatically determines the necessary steps and their execution order. Because it requests concrete file names via `expand()`, it also fixes the values that the `{sample}` wildcard can take.

* Parallelism: Snakemake automatically runs the `simulate_data` and `analyze_data` jobs for different samples in parallel, since the samples are independent of each other.

* Wildcards: The `{sample}` placeholder (available as `{wildcards.sample}` inside shell commands) dynamically generates input and output file names based on the sample name.

This is a basic example of how to use Snakemake to create a parallel workflow. You can customize it to fit your needs by adding more rules, modifying the shell commands, and adjusting the config file.
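
The helper scripts themselves are not part of this lesson. Purely as an illustration, a hypothetical `simulate_data.py` compatible with the shell command above could be as small as the following (the random-number payload is made up):

```
import argparse
import random

# The sample name arrives via --sample, as in the rule's shell command
parser = argparse.ArgumentParser()
parser.add_argument("--sample", required=True)
args = parser.parse_args()

# Print a few fake measurements to stdout; the shell redirection
# "> {output}" in the rule captures them into data/<sample>.txt
for i in range(10):
    print(f"{args.sample} event {i} value {random.gauss(0, 1):.4f}")
```
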
### Running Snakemake

Load your environment or container so that the `snakemake` command is available. Then, to test the simple example with a dry run:
```
snakemake --snakefile Snakefile --configfile config.yaml --dry-run
```

```output

Config file config.yaml is extended by additional config specified via the command line.
host: cmslpc305.fnal.gov
Building DAG of jobs...
Job stats:
job count
------------- -------
all 1
analyze_data 5
simulate_data 5
total 11


[Wed Nov 13 15:19:33 2024]
rule simulate_data:
output: data/sample2.txt
jobid: 4
reason: Missing output files: data/sample2.txt
wildcards: sample=sample2
resources: tmpdir=<TBD>


[Wed Nov 13 15:19:33 2024]
rule simulate_data:
output: data/sample3.txt
jobid: 6
reason: Missing output files: data/sample3.txt
wildcards: sample=sample3
resources: tmpdir=<TBD>


[Wed Nov 13 15:19:33 2024]
rule simulate_data:
output: data/sample4.txt
jobid: 8
reason: Missing output files: data/sample4.txt
wildcards: sample=sample4
resources: tmpdir=<TBD>


[Wed Nov 13 15:19:33 2024]
rule simulate_data:
output: data/sample5.txt
jobid: 10
reason: Missing output files: data/sample5.txt
wildcards: sample=sample5
resources: tmpdir=<TBD>


[Wed Nov 13 15:19:33 2024]
rule simulate_data:
output: data/sample1.txt
jobid: 2
reason: Missing output files: data/sample1.txt
wildcards: sample=sample1
resources: tmpdir=<TBD>


[Wed Nov 13 15:19:33 2024]
rule analyze_data:
input: data/sample5.txt
output: results/analysis_sample5.txt
jobid: 9
reason: Missing output files: results/analysis_sample5.txt; Input files updated by another job: data/sample5.txt
wildcards: sample=sample5
resources: tmpdir=<TBD>


[Wed Nov 13 15:19:33 2024]
rule analyze_data:
input: data/sample1.txt
output: results/analysis_sample1.txt
jobid: 1
reason: Missing output files: results/analysis_sample1.txt; Input files updated by another job: data/sample1.txt
wildcards: sample=sample1
resources: tmpdir=<TBD>


[Wed Nov 13 15:19:33 2024]
rule analyze_data:
input: data/sample4.txt
output: results/analysis_sample4.txt
jobid: 7
reason: Missing output files: results/analysis_sample4.txt; Input files updated by another job: data/sample4.txt
wildcards: sample=sample4
resources: tmpdir=<TBD>


[Wed Nov 13 15:19:33 2024]
rule analyze_data:
input: data/sample3.txt
output: results/analysis_sample3.txt
jobid: 5
reason: Missing output files: results/analysis_sample3.txt; Input files updated by another job: data/sample3.txt
wildcards: sample=sample3
resources: tmpdir=<TBD>


[Wed Nov 13 15:19:33 2024]
rule analyze_data:
input: data/sample2.txt
output: results/analysis_sample2.txt
jobid: 3
reason: Missing output files: results/analysis_sample2.txt; Input files updated by another job: data/sample2.txt
wildcards: sample=sample2
resources: tmpdir=<TBD>


[Wed Nov 13 15:19:33 2024]
rule all:
input: results/analysis_sample1.txt, results/analysis_sample2.txt, results/analysis_sample3.txt, results/analysis_sample4.txt, results/analysis_sample5.txt
jobid: 0
reason: Input files updated by another job: results/analysis_sample3.txt, results/analysis_sample1.txt, results/analysis_sample4.txt, results/analysis_sample2.txt, results/analysis_sample5.txt
resources: tmpdir=<TBD>

Job stats:
job count
------------- -------
all 1
analyze_data 5
simulate_data 5
total 11

Reasons:
(check individual jobs above for details)
input files updated by another job:
all, analyze_data
output files have to be generated:
analyze_data, simulate_data

This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.
```

Each part of this command serves a specific purpose:
* `--snakefile`: This flag specifies the path to the Snakefile, which contains the definitions of the rules and their dependencies.
* `--configfile`: This flag indicates the path to the configuration file (YAML format) where you can define parameters and variables that can be used within the Snakefile. (It is not mandatory.)
* `--dry-run`: This flag tells Snakemake to simulate the workflow execution without actually running the commands. It's useful for visualizing the execution order of rules and identifying potential issues before running the actual workflow.
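
Once the dry run looks correct, the same command without `--dry-run` performs the actual execution. Snakemake then needs to know how many CPU cores it may use to schedule parallel jobs; the core count below is just an example, and the optional `--dag` command assumes Graphviz is installed:

```
# Run the workflow, allowing up to 4 jobs in parallel
snakemake --snakefile Snakefile --configfile config.yaml --cores 4

# Optional: render the dependency graph (requires Graphviz's `dot`)
snakemake --snakefile Snakefile --configfile config.yaml --dag | dot -Tpng > dag.png
```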

## More about wildcards

Wildcards are powerful tools in Snakemake that enable you to create flexible and scalable workflows. They act as placeholders within input and output file paths, allowing you to define generic rules that can handle many different files.

How do wildcards work?
You can define wildcards within curly braces {} in your Snakemake file. When Snakemake executes a rule, it replaces the wildcard with specific values, allowing the rule to process multiple files.

Example:
```
rule analyze_sample:
    input:
        "data/{sample}.txt"
    output:
        "results/{sample}_results.txt"
    shell:
        "python analyze.py {input} {output}"
```

In this example, `{sample}` is a wildcard. Snakemake fills it in by matching the output files requested downstream (by a target rule or on the command line) and executes the rule once per matched value, creating the corresponding input and output files.
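
A target rule makes this matching concrete. In the sketch below (the sample list is purely illustrative), requesting two result files in `rule all` causes `analyze_sample` to run once per sample:

```
rule all:
    input:
        expand("results/{sample}_results.txt", sample=["sampleA", "sampleB"])
```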

By effectively using wildcards, you can significantly simplify your Snakemake workflows and make them more adaptable to varying datasets and experimental designs.
31 changes: 20 additions & 11 deletions learners/setup.md

In my opinion, using a container is the easiest way to interact with the REANA cluster from _any machine_. To do that you can use:
```BASH
apptainer run --env REANA_SERVER_URL=https://reana.cern.ch --env REANA_ACCESS_TOKEN=xxxxxxxxxxxxxxxxx --bind ${PWD}:/srv --pwd /srv docker://docker.io/reanahub/reana-client:0.9.3 ping
```
You can save this in a bash script for convenience, or in your `.bashrc` as:
```BASH
reana_client ()
{
    # Containerized reana-client; forwards any subcommand and its arguments
    local base_command="apptainer run --env REANA_SERVER_URL=https://reana.cern.ch --env REANA_ACCESS_TOKEN=xxxxxxxxxxxxxxxx --bind ${PWD}:/srv --pwd /srv docker://docker.io/reanahub/reana-client:0.9.3"
    if [[ $# -eq 0 ]]; then
        echo "Usage: reana_client <command> [arguments]"
        return 1
    fi
    # Quoting "$@" preserves arguments that contain spaces
    $base_command "$@"
}
```
and, after reloading your `.bashrc`, simply run:
```BASH
reana_client ping
```
::::::::::::::::::::::::

<!--
## Gitlab repository
FIXME: place any data you want learners to use in `episodes/data` and then use
a relative link ( [data zip file](data/lesson-data.zip) ) to provide a
link to it, replacing the example.com link.
Download the [data zip file](https://example.com/FIXME) and unzip it to your Desktop
FIXME -->
