Merge pull request #463 from sunbeam-labs/462-make-updates-to-work-with-storage-plugins

462 make updates to work with storage plugins
Ulthran authored Aug 12, 2024
2 parents 9c8f1ca + d9d2184 commit 3fe697c
Showing 18 changed files with 233 additions and 76 deletions.
5 changes: 4 additions & 1 deletion docs/commands.rst
@@ -46,7 +46,7 @@ Sunbeam Commands
Executes the Sunbeam pipeline by calling Snakemake.
.. code-block:: shell
sunbeam run [-h] [-m] [-s PATH] [--target_list [TARGETS, ...]] [--include [INCLUDES, ...]] [--exclude [EXCLUDE, ...]] [--docker_tag TAG] <snakemake options>
sunbeam run [-h] [-m] [-s PATH] [--target_list [TARGETS, ...]] [--include [INCLUDES, ...]] [--exclude [EXCLUDE, ...]] [--skip SKIP] [--docker_tag TAG] <snakemake options>
.. tip::
The ``--target_list`` option is deprecated. Pass the targets directly to ``sunbeam run`` instead.
@@ -58,6 +58,8 @@ Sunbeam Commands
``sunbeam run --profile /path/to/project/ all_decontam all_assembly all_annotation``
3. The equivalent of 2, using the deprecated ``--target_list`` option:
``sunbeam run --profile /path/to/project/ --target_list all_decontam all_assembly all_annotation``
4. To run assembly on samples that have already been decontaminated:
``sunbeam run --profile /path/to/project/ --skip decontam all_assembly``
.. code-block:: shell
-h/--help: Display help.
@@ -66,6 +68,7 @@ Sunbeam Commands
--target_list: A list of targets to run successively. (DEPRECATED)
--include: List of extensions to include in run.
--exclude: List of extensions to exclude from run, use 'all' to exclude all extensions.
--skip: Either 'qc' to skip the quality control steps or 'decontam' to skip the quality control and decontamination.
--docker_tag: Tag to use for internal environment docker images. Try 'latest' if the default tag doesn't work.
<snakemake options>: You can pass further arguments to Snakemake, e.g: ``$ sunbeam run --cores 12``. See http://snakemake.readthedocs.io for more information.
59 changes: 59 additions & 0 deletions docs/dev.rst
@@ -0,0 +1,59 @@
.. _dev:

====
Dev
====

Getting involved with developing Sunbeam can be a little daunting at first. This doc will try to break down the constituent parts from a developer's perspective. For starters, check out the structure_ doc to get a sense of how the code is organized.

sunbeamlib
==========

The core of Sunbeam's configuration, setup, and execution is in the ``sunbeamlib`` module. This module is located in ``src/sunbeamlib/`` and is configured by the root ``pyproject.toml``. It contains a number of scripts, each in its own file prefixed by ``script_``, along with utility functions and classes shared by the scripts and by commonly used portions of the pipeline.

workflow
========

The core of the work done by Sunbeam is handled by the Snakemake workflow, located in ``workflow/``. Once a project is set up properly by sunbeamlib, the workflow can be run with all the reproducibility benefits of Snakemake. The core of the workflow is defined in ``workflow/Snakefile``. See the Snakemake docs for help with Snakemake concepts; they're very good. From this core Snakefile, we import more ``.smk`` files from ``workflow/rules/`` and ``extensions/sbx_*/``.

Important Variables
-------------------

Variables defined in the main Snakefile can be accessed throughout the workflow. Some important variables include:

- ``Samples``: Dict[str, Dict[str, str]] - A dictionary where keys are sample names and values are dictionaries of read pairs mapping to file paths (``Samples[sample] = {"1": r1, "2": r2}``).
- ``Pairs``: List[str] - Either ``["1", "2"]`` or ``["1"]`` depending on whether the project is paired end or not.
- ``Cfg``: Dict[str, Dict[str, str]] - The YAML config converted into dictionary form.
- ``MIN_MEM_MB``: int - The minimum number of megabytes of memory to request for each job. This only applies to jobs that rely on Sunbeam to guess their memory requirements.
- ``MIN_RUNTIME``: int - The minimum number of minutes to request for each job. This only applies to jobs that rely on Sunbeam to guess their runtime requirements.
- ``HostGenomes``: List[str] - A list of host genomes that are used for decontaminating reads.
- ``HostGenomeFiles``: List[str] - A list of files with host genomes that are used for decontaminating reads (not to be confused with ``sbx_mapping``'s ``GenomeFiles`` variable, which it uses to track reference genome files).
- ``QC_FP``: Path - The Path to the project's quality control output directory.
- ``ASSEMBLY_FP``: Path - The Path to the project's assembly output directory.
- ``CLASSIFY_FP``: Path - The Path to the project's classification output directory.
- ``MAPPING_FP``: Path - The Path to the project's mapping output directory.
- ``BENCHMARK_FP``: Path - The Path to the project's benchmarking output directory.
- ``LOG_FP``: Path - The Path to the project's log output directory.
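
To make the shapes of ``Samples`` and ``Pairs`` concrete, here is a minimal sketch. The sample names and file paths are made up, and ``derive_pairs`` is a hypothetical helper written for illustration; it is not how the Snakefile necessarily computes ``Pairs``.

```python
from typing import Dict, List

# Hypothetical sample table in the shape described above:
# Samples[sample] = {"1": r1, "2": r2}
Samples: Dict[str, Dict[str, str]] = {
    "sampleA": {"1": "/data/sampleA_1.fastq.gz", "2": "/data/sampleA_2.fastq.gz"},
    "sampleB": {"1": "/data/sampleB_1.fastq.gz", "2": "/data/sampleB_2.fastq.gz"},
}


def derive_pairs(samples: Dict[str, Dict[str, str]]) -> List[str]:
    """Return ["1", "2"] when every sample has a second read file, else ["1"].

    Illustrative assumption only; not taken from the Sunbeam source.
    """
    paired = bool(samples) and all(reads.get("2") for reads in samples.values())
    return ["1", "2"] if paired else ["1"]


Pairs = derive_pairs(Samples)

# Look up the forward read of one sample the way a rule's input function might.
forward_read = Samples["sampleA"]["1"]
```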

Environment Variables
---------------------

- ``SUNBEAM_DIR``: str - The path to the Sunbeam installation directory.
- ``SUNBEAM_VER``: str - The version of Sunbeam being run.
- ``SUNBEAM_EXTS_INCLUDE``: str - If set, will include the given extension in the workflow (and exclude the rest). This is useful for testing individual extensions.
- ``SUNBEAM_EXTS_EXCLUDE``: str - If set, will exclude the given extension from the workflow. This is useful for when namespaces between extensions collide (same rule name multiple times).
- ``SUNBEAM_SKIP``: str - If set, will skip either 'qc' or 'decontam'.
- ``SUNBEAM_DOCKER_TAG``: str - If set, will use the given tag for the Docker image instead of the default.
- ``SUNBEAM_MIN_MEM_MB``: int - If set, will override the default minimum memory value.
- ``SUNBEAM_MIN_RUNTIME``: int - If set, will override the default minimum runtime value.
- ``SUNBEAM_NO_ADAPTER``: bool - If set, will not check that the adapter template file exists.
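
As a rough sketch of how these variables might be consumed, the helper below reads a few of them from a mapping. ``read_sunbeam_env`` is a hypothetical function, not part of sunbeamlib. The default of 15 for ``SUNBEAM_MIN_RUNTIME`` and the lowercasing of ``SUNBEAM_SKIP`` mirror the Snakefile in this commit; treating ``SUNBEAM_NO_ADAPTER`` as a presence check is an assumption.

```python
import os
from typing import Mapping, Optional


def read_sunbeam_env(environ: Optional[Mapping[str, str]] = None) -> dict:
    """Collect a few Sunbeam settings from environment variables (illustrative)."""
    env = os.environ if environ is None else environ
    return {
        # The Snakefile compares the lowercased value against 'qc' / 'decontam'.
        "skip": env.get("SUNBEAM_SKIP", "").lower(),
        # None here means "fall back to the default tag".
        "docker_tag": env.get("SUNBEAM_DOCKER_TAG") or None,
        # Same default as MIN_RUNTIME in workflow/Snakefile.
        "min_runtime": int(env.get("SUNBEAM_MIN_RUNTIME", 15)),
        # Assumption: any value, even an empty one, counts as "set".
        "no_adapter": "SUNBEAM_NO_ADAPTER" in env,
    }


settings = read_sunbeam_env({"SUNBEAM_SKIP": "QC", "SUNBEAM_MIN_RUNTIME": "30"})
```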

tests
=====

All tests are located in the ``tests/`` directory. The tests are run with pytest, and the tests are organized into subdirectories based on the module they are testing.

.github
=======

The ``.github/`` directory contains the configuration for GitHub Actions, which are used to run the tests on every push to the repository and manage releases. The configuration is in ``.github/workflows/``.
13 changes: 13 additions & 0 deletions docs/examples.rst
@@ -159,6 +159,19 @@ Then you submit the job:
sbatch run_sunbeam.sh
Skipping the QC and Decontamination
===================================

This time you're coming at sunbeam with a data set that you have already run QC on and removed host reads from. You want to run the assembly pipeline on this data. Your data is paired end and lives in a directory called ``/data``. Run:

.. code-block:: bash
sunbeam extend https://github.com/sunbeam-labs/sbx_assembly
sunbeam init --data_fp /data/ /projects/my_project/
sunbeam run --profile /projects/my_project --skip decontam all_assembly
Once this run completes, you will have a directory called ``/projects/my_project/sunbeam_output/`` that contains all of the output from the run. Look in ``/projects/my_project/sunbeam_output/assembly/contigs/`` for the assembled contigs.
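
The effect of ``--skip`` can be sketched as a path mapping: per the skip rules added to ``workflow/Snakefile`` in this commit, input reads are copied to the location where the skipped stage would have written its output. ``skip_output_path`` is a hypothetical helper for illustration, and the QC directory used in the example is an assumed location.

```python
from pathlib import Path


def skip_output_path(qc_fp: Path, skip: str, sample: str, rp: str) -> Path:
    """Return where the skip rule would copy a sample's reads (illustrative)."""
    # 'qc' skips quality control only; 'decontam' skips QC and decontamination.
    subdirs = {"qc": "cleaned", "decontam": "decontam"}
    if skip not in subdirs:
        raise ValueError("skip must be either 'qc' or 'decontam'")
    return qc_fp / subdirs[skip] / f"{sample}_{rp}.fastq.gz"


# Assumed QC output directory for the project in the example above.
qc_dir = Path("/projects/my_project/sunbeam_output/qc")
target = skip_output_path(qc_dir, "decontam", "sampleA", "1")
```

Downstream rules such as assembly then find their expected inputs without QC or decontamination having run.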

Running on AWS Batch with AWS S3 Data
======================================

19 changes: 19 additions & 0 deletions docs/faqs.rst
@@ -0,0 +1,19 @@
.. _faqs:

====
FAQs
====

A collection of common questions, issues, or points of confusion.

**I'm getting ``snakemake: error: argument --executor/-e: invalid choice: '_____' (choose from 'local', 'dryrun', 'touch')``. Why can't I use the ``--executor`` option?**

You're using the executor option properly; you just haven't installed the executor plugin. Use ``pip`` to install it and you should be good to go.
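
One way to check whether a plugin is importable before running is sketched below. It assumes the usual naming convention in which a plugin installed as ``snakemake-executor-plugin-<name>`` exposes the module ``snakemake_executor_plugin_<name>``; ``executor_plugin_installed`` is a hypothetical helper, not part of Sunbeam or Snakemake.

```python
import importlib.util


def executor_plugin_installed(name: str) -> bool:
    """Return True if the module for the named executor plugin can be found."""
    # Assumed convention: dashes in the executor name become underscores.
    module = "snakemake_executor_plugin_" + name.replace("-", "_")
    return importlib.util.find_spec(module) is not None


if not executor_plugin_installed("slurm"):
    print("Plugin missing; try: pip install snakemake-executor-plugin-slurm")
```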

**I'm trying to use singularity but it keeps failing and complaining about running out of space. I know I have plenty of open disk space. Why is it running out?**

This is a known issue with singularity. It's not actually running out of disk space; the default location for the temporary directory is on a partition that is too small. You can change the location of the temporary directory by setting the ``SINGULARITY_TMPDIR`` and ``TMPDIR`` environment variables to a location with more space.
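
To see how much room the temporary directory actually has, a small check like the following can help. ``tmpdir_free_gb`` is a hypothetical helper, and the lookup order (``SINGULARITY_TMPDIR``, then ``TMPDIR``, then the system default) is an assumption made for this sketch.

```python
import os
import shutil
import tempfile
from typing import Mapping, Optional


def tmpdir_free_gb(environ: Optional[Mapping[str, str]] = None) -> float:
    """Report free space, in gigabytes, on the partition holding the temp dir."""
    env = os.environ if environ is None else environ
    tmp = env.get("SINGULARITY_TMPDIR") or env.get("TMPDIR") or tempfile.gettempdir()
    return shutil.disk_usage(tmp).free / 1e9


print(f"Free space in temporary directory: {tmpdir_free_gb():.1f} GB")
```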

**A rule keeps failing with an error like "perl: error while loading shared libraries: libcrypt.so.1: cannot open shared object file: No such file or directory". What's going on?**

This is unfortunately a common issue with conda where shared libraries are either not installed or not properly loaded for packages that depend on them. There can be many causes and many fixes. You can start by searching the exact error message and seeing if there are any suggestions for how to solve it. Often it will involve installing the missing library with conda or installing the missing library with the system package manager. For example, the solution to the example error for me running sunbeam on a standard Amazon machine image (AMI) was to install the library using ``sudo yum install libxcrypt-compat``.
1 change: 1 addition & 0 deletions docs/index.rst
@@ -37,5 +37,6 @@ EL Clarke, LJ Taylor, C Zhao *et al.* Sunbeam: an extensible pipeline for analyz
extensions.rst
examples.rst
install.rst
dev.rst
citation.rst

2 changes: 1 addition & 1 deletion docs/structure.rst
@@ -2,8 +2,8 @@

==================
Software Structure

==================

Overview
========
Sunbeam is a snakemake pipeline with a python library acting as a wrapper (``sunbeamlib``). Calling ``sunbeam run [args] [options]`` is a call to this wrapper library which then invokes the necessary snakemake commands. The main Snakefile can be found in the ``workflow/`` directory and it makes use of rules from ``workflow/rules/`` and ``extensions/``, scripts from ``workflow/scripts/``, and environments from ``workflow/envs/``. Tests are run with pytest and live in the ``tests/`` directory. Documentation lives in ``docs/`` and is served by ReadTheDocs.
3 changes: 2 additions & 1 deletion src/sunbeamlib/__init__.py
@@ -142,7 +142,8 @@ def _verify_path(fp: str) -> str:
raise ValueError("Missing filename")
path = Path(fp)
if not path.is_file():
raise ValueError("File not found")
sys.stderr.write(f"WARNING: File {str(path)} does not exist locally\n")
return str(path)
return str(path.resolve())


13 changes: 13 additions & 0 deletions src/sunbeamlib/chop.yml
@@ -0,0 +1,13 @@
# Template for running on CHOP HPC

qc:
host_fp: /mnt/isilon/microbiome/analysis/biodata/hosts
sbx_kraken:
kraken_db_fp: '/mnt/isilon/microbiome/analysis/biodata/kraken2db/standard_20200204'
sbx_gene_clusters:
genes_fp: /mnt/isilon/microbiome/analysis/biodata/diamondIndexes/v2.1.6.160
sbx_mapping:
genomes_fp: /mnt/isilon/microbiome/analysis/biodata/bwa_and_bowtie2/six_fungal_genomes
sbx_metaphlan4:
dbdir: "/mnt/isilon/microbiome/analysis/biodata/metaphlan_databases/v4"
dbname: "mpa_vOct22_CHOCOPhlAnSGB_202212"
2 changes: 1 addition & 1 deletion src/sunbeamlib/config.py
@@ -41,7 +41,7 @@ def validate_paths(cfg: Dict[str, str], root: Path) -> Dict[str, Union[str, Path
try:
v = makepath(v)
except TypeError as e:
raise TypeError(f"Missing value for key: {k}")
sys.stderr.write(f"Warning: Missing value for key: {k}")
if not v.is_absolute():
v = root / v
if k != "output_fp":
1 change: 1 addition & 0 deletions src/sunbeamlib/default_config.yml
@@ -43,6 +43,7 @@ qc:
pct_id: 0.5
frac: 0.6
host_fp: ""
host_list: []

# Taxonomic classifications
classify:
15 changes: 13 additions & 2 deletions src/sunbeamlib/script_run.py
@@ -33,7 +33,7 @@ def main(argv=sys.argv):
"-m",
"--mamba",
action="store_true",
help="Use mamba instead of conda to manage environments",
help="Use mamba instead of conda to create environments",
)
parser.add_argument(
"--target_list",
@@ -53,6 +53,11 @@ def main(argv=sys.argv):
default=[],
help="List of extensions to exclude from run, use 'all' to exclude all extensions",
)
parser.add_argument(
"--skip",
default="",
help="Workflow to skip. Either 'qc' to skip the quality control steps or 'decontam' to skip everything in sunbeam core (QC and decontamination).",
)
parser.add_argument(
"--docker_tag",
default=__version__,
@@ -80,7 +85,7 @@ def main(argv=sys.argv):
)

if args.include and args.exclude:
sys.stderr.write("Error: cannot pass both --include and --exclude\n")
sys.stderr.write("Error: cannot use both --include and --exclude\n")
sys.exit(1)

os.environ["SUNBEAM_EXTS_INCLUDE"] = ""
Expand All @@ -90,6 +95,12 @@ def main(argv=sys.argv):
if args.exclude:
os.environ["SUNBEAM_EXTS_EXCLUDE"] = ", ".join(args.exclude)

if args.skip not in ["", "qc", "decontam"]:
sys.stderr.write("Error: --skip must be either 'qc' or 'decontam'\n")
sys.exit(1)

os.environ["SUNBEAM_SKIP"] = args.skip

os.environ["SUNBEAM_DOCKER_TAG"] = args.docker_tag

snakemake_args = (
24 changes: 15 additions & 9 deletions tests/unit/sunbeamlib/test__init__.py
@@ -37,7 +37,7 @@ def test_version():
assert __version__ != "0.0.0"


def test_load_sample_list(init):
def test_load_sample_list(init, capsys):
output_dir = init
samples_fp = output_dir / "samples"
samples_fp.mkdir()
@@ -50,7 +50,10 @@ def test_load_sample_list(init):

try:
load_sample_list(sample_list_fp)
assert False
assert (
capsys.readouterr().err
== f"WARNING: File {sample1.resolve()} does not exist locally\nWARNING: File {sample2.resolve()} does not exist locally\n"
)
except ValueError as e:
pass

@@ -59,7 +62,10 @@ def test_load_sample_list(init):

try:
load_sample_list(sample_list_fp)
assert False
assert (
capsys.readouterr().err
== f"WARNING: File {sample2.resolve()} does not exist locally\n"
)
except ValueError as e:
pass

@@ -96,7 +102,7 @@ def test_guess_format_string_single_end():
assert ret == "{sample}.fastq.gz"


def test_verify_path(init):
def test_verify_path(init, capsys):
output_dir = init

try:
@@ -105,11 +111,11 @@ def test_verify_path(init):
except ValueError as e:
pass

try:
_verify_path("thisdoesnotexist")
assert False
except ValueError as e:
pass
_verify_path("thisdoesnotexist")
assert (
capsys.readouterr().err
== "WARNING: File thisdoesnotexist does not exist locally\n"
)

with open(output_dir / "test", "w") as f:
f.write(" ")
72 changes: 55 additions & 17 deletions workflow/Snakefile
@@ -51,15 +51,9 @@ MIN_RUNTIME = int(os.getenv("SUNBEAM_MIN_RUNTIME", 15))

# Check for major version compatibility
pkg_major, cfg_major = check_compatibility(config)
if pkg_major > cfg_major:
if pkg_major != cfg_major:
raise SystemExit(
"\nThis config file was created with an older version of Sunbeam"
" and may not be compatible. Create a new config file using"
"`sunbeam init` or update this one using `sunbeam config update -i /path/to/sunbeam_config.yml`\n"
)
elif pkg_major < cfg_major:
raise SystemExit(
"\nThis config file was created with an older version of Sunbeam"
"\nThis config file was created with a different version of Sunbeam"
" and may not be compatible. Create a new config file using"
"`sunbeam init` or update this one using `sunbeam config update -i /path/to/sunbeam_config.yml`\n"
)
@@ -96,9 +90,19 @@ else:
Cfg["qc"]["host_fp"]
)
)
HostGenomes = {
Path(g.name).stem: read_seq_ids(Cfg["qc"]["host_fp"] / g) for g in HostGenomeFiles
}

# Once this change has been implemented for a while we can remove the try/except
# and just use an if/else, using the try/except for now to avoid migration pains
# with old sunbeam configs being copied over
try:
if Cfg["qc"]["host_list"]:
HostGenomes = Cfg["qc"]["host_list"]
else:
raise KeyError
except KeyError:
HostGenomes = [Path(g.name).stem for g in HostGenomeFiles]
print(HostGenomes)

sys.stderr.write("done.\n")


@@ -110,20 +114,54 @@ sys.stderr.write("done.\n")
QC_FP = output_subdir(Cfg, "qc")
BENCHMARK_FP = output_subdir(Cfg, "benchmarks")
LOG_FP = output_subdir(Cfg, "logs")
# ---- DEPRECATED
# ---- BEGIN DEPRECATED
# These paths will be moved to their respective extensions in a future version
ASSEMBLY_FP = output_subdir(Cfg, "assembly")
ANNOTATION_FP = output_subdir(Cfg, "annotation")
CLASSIFY_FP = output_subdir(Cfg, "classify")
MAPPING_FP = output_subdir(Cfg, "mapping")
# ---- DEPRECATED
# ---- END DEPRECATED


# ---- Targets rules
# ---- Import rules
include: "rules/targets.smk"
# ---- Quality control rules
include: "rules/qc.smk"
include: "rules/decontaminate.smk"


# Skip QC and/or decontam
if os.environ.get("SUNBEAM_SKIP", "").lower() == "decontam":

rule skip_decontam:
input:
lambda wildcards: Samples[wildcards.sample][wildcards.rp],
output:
QC_FP / "decontam" / "{sample}_{rp}.fastq.gz",
log:
LOG_FP / "skip_decontam_{sample}_{rp}.log",
shell:
"""
cp {input} {output}
"""

elif os.environ.get("SUNBEAM_SKIP", "").lower() == "qc":

rule skip_qc:
input:
lambda wildcards: Samples[wildcards.sample][wildcards.rp],
output:
QC_FP / "cleaned" / "{sample}_{rp}.fastq.gz",
log:
LOG_FP / "skip_qc_{sample}_{rp}.log",
shell:
"""
cp {input} {output}
"""

include: "rules/decontaminate.smk"

else:

include: "rules/qc.smk"
include: "rules/decontaminate.smk"


for sbx_path, wildcards in sbxs: