Containertask #62

Open

wants to merge 101 commits into base: master
Changes from all commits (101 commits)
ff68e8f
Added containertasks
Apr 9, 2018
9dabb6d
Added engine to containerinfo
Apr 9, 2018
3c3e878
added singularity_slurm engine basics
Apr 9, 2018
bcbb8f5
Added bucket_command_wrapper.py to tools
Apr 10, 2018
8e763da
Pre pull commit
Apr 11, 2018
cccc0ac
Working bucket_command_wrapper.py
Apr 11, 2018
f771ba0
Changed BCW to allow repeated file commands and accept command as env…
Apr 12, 2018
52f0426
Added ContainerTarget
Apr 12, 2018
0347653
Made filesytem path just path component of url
Apr 12, 2018
9373740
Added ContainerTaskInfo to base import
Apr 12, 2018
09e9032
Made containertaskinfo inherit from taskinfo
Apr 12, 2018
c2e630a
Partially completed AWS batch engine
Apr 13, 2018
652df08
working on aws batch
Apr 13, 2018
a594b5f
Working towards complete AWS-batch engine
Apr 16, 2018
c141ef8
Pre work
Apr 20, 2018
7f1ae5d
Working AWS batch engine with ability to upload / download from S3 as…
Apr 25, 2018
7802356
Tidied up a bit of the logging for AWS-batch engine
Apr 25, 2018
8ef016d
Allowed for piped and && / || commands in both docker and aws_batch
Apr 25, 2018
21055c8
Moved custom mounts to containerinfo
Apr 26, 2018
3041e88
Moved mounts to a containerinfo parameter, and implemented for AWS batch
Apr 26, 2018
814ac0a
Switch from set to list
Apr 26, 2018
f532674
From string to bool for readonly
Apr 26, 2018
c3c54dd
name for volume through uuid
Apr 26, 2018
7e572c6
check not just registered job def, but active
Apr 27, 2018
23dcc64
String for name not class uuid
Apr 27, 2018
554e4c8
Fixed a few bugs, and moved over to the packaged bucket_container_wra…
Apr 27, 2018
1c62815
Fixed bug where no files are being uploaded to S3 temp
Apr 27, 2018
ab6a6cb
Docker engine working, with switch to targets rather than paths to be…
Apr 28, 2018
ec3dcd6
basic engine for docker and aws_batch done
May 1, 2018
a860c34
Effort to get basics of slurm_singularity engine working
May 1, 2018
06ba025
Fixed bug in containerinfo init
May 1, 2018
b212c62
Working on singularity_slurm code
May 2, 2018
0160a00
Generally working slurm-sciluigi version
May 2, 2018
771c5e4
Pre pull commit
May 2, 2018
1e255f4
Merge branch 'containertask' of https://github.com/jgolob/sciluigi in…
May 2, 2018
d22e6d6
Working singularity_slurm code
May 2, 2018
56ab1bd
Continued refinement of slurm_singularity engine
May 2, 2018
feb7c22
Merge branch 'containertask' of https://github.com/jgolob/sciluigi in…
May 2, 2018
eef05af
Catch ClientError for AWS Batch API calls
May 3, 2018
f617a29
Add MAX_BOTO_TRIES
May 3, 2018
8082c49
Merge branch 'catch_batch_client_exceptions' of https://github.com/jg…
May 3, 2018
a7e1669
Pre pull commit
May 3, 2018
b319c4d
Pre-pull commit
May 3, 2018
8bcb241
Switch boto_max_tries to a containerinfo variable
May 3, 2018
308c93b
Merge branch 'catch_batch_client_exceptions' into containertask
May 3, 2018
0fbf5da
Fixed a minor bug when checking batch job status fails
May 3, 2018
329452f
Add custom aws batch job name param
May 10, 2018
3e4f1fe
Change to "_prefix" to match functionality
May 10, 2018
9ae1efd
Merge pull request #3 from jgolob/custom_aws_batch_job_name
jgolob May 10, 2018
3820e9d
Mild changes
May 11, 2018
44dde95
Fixed bug involving need to make directories when batch -> local FS
May 14, 2018
38942c9
More logging for boto3 ClientError
May 17, 2018
b177a1e
Merge pull request #4 from jgolob/containerTask_error_logging
jgolob May 17, 2018
7bedecc
Parameterized aws batch job poll time.
May 25, 2018
376e837
Merge branch 'containertask' of https://github.com/jgolob/sciluigi in…
May 25, 2018
699165b
removed extra param
May 25, 2018
27dfc98
commonprefix -> commonpath to fix a bug
May 25, 2018
611f53f
Fixed bug in pol vs poll
May 25, 2018
637c576
Updated the readme to document how containers work now.
Jun 27, 2018
004dcd6
Added dependencies
Jun 27, 2018
352b63e
Permissions changes. setup.py modified to better fit actual dependencies
Aug 7, 2018
5621e16
Working version for singularity on PBS.
Aug 15, 2018
8f08f62
Many changes to optimize with PBS-singularity
Aug 28, 2018
0ad11ac
Fixes for PBS
Sep 5, 2018
0da06e5
Merge branch 'pbs' into containertask
Sep 5, 2018
feab1e6
Added example for ContainerTask as well as an example Dockerfile to m…
Sep 6, 2018
81e99c3
Working example plus new class to poll AWS
Sep 18, 2018
d63f558
Working with a few bits of fuss batch poller
Sep 18, 2018
6e9d3bf
Working consolidation of polling for job status on AWS
Sep 18, 2018
87a40af
Merge branch 'polling_aws' into containertask
Sep 18, 2018
a6d288e
Added ability to read containerinfo from a config file
Sep 28, 2018
8a1ace8
Fix of singularity run commands
Nov 7, 2018
9b12226
Merge pull request #5 from pharmbio/master
jgolob Nov 7, 2018
c56fb20
Merge pull request #6 from jgolob/master
jgolob Nov 7, 2018
75d6b76
Corrected loading of mounts from ini
Nov 15, 2018
ad3f563
Merge branch 'containertask' of https://github.com/jgolob/sciluigi in…
Nov 15, 2018
906445c
Reading of config from ini more robust
Nov 19, 2018
4e03a24
Fixed bug in readconfig
Nov 19, 2018
438c91b
Some debugging of aws mounts
Nov 20, 2018
eef7eae
Switch job def
Nov 20, 2018
6d906ff
Fixed loading of mounts from config
Nov 20, 2018
99a477a
A bit more logging around task watcher
Nov 21, 2018
e1ff25b
Moved batch_task_watcher global holder to __init__.py
Nov 21, 2018
94bf206
Moved global
Nov 21, 2018
59269c0
Moved batch watcher to module global
Nov 21, 2018
0dc340e
Change to loading of batch task watcher
Nov 27, 2018
957fa21
Made import a bit more robust in broken AWS environments
Nov 27, 2018
59ac40e
Switch to debug for task count
Nov 27, 2018
8b0e400
back to info for polling
Nov 27, 2018
613670e
Attempt to terminate running aws jobs on workflow exit
Nov 29, 2018
2010738
Added lock for singularity
Nov 29, 2018
0c0cdda
Fixed a few bugs in the slurm engine
Feb 14, 2019
b9e4348
Fixed a few bugs in slurm.
Feb 15, 2019
0df2b08
SLURM changes
Feb 17, 2019
dd6a7a7
ugly fix for interface change in luigi
Apr 19, 2019
1c5d27a
Added sample config. Bumped version to 2.0.0
Apr 23, 2019
b43a1d1
made easier defaults in config
Apr 26, 2019
79bb766
Adjustments to singularity settings.
May 2, 2019
d83ed1d
Bumped version to 2.0.1
May 2, 2019
2631917
pre pull
May 2, 2019
8a24da4
Merge branch 'containertask' of https://github.com/jgolob/sciluigi in…
May 2, 2019
Empty file modified LICENSE
100644 → 100755
Empty file.
Empty file modified MANIFEST.in
100644 → 100755
Empty file.
133 changes: 65 additions & 68 deletions README.md
100644 → 100755
@@ -1,23 +1,13 @@
![SciLuigi Logo](http://i.imgur.com/2aMT04J.png)

* ***UPDATE, Nov, 2016: A paper with the motivation and design decisions behind SciLuigi [now available](http://dx.doi.org/10.1186/s13321-016-0179-6)***
* If you use SciLuigi in your research, please cite it like this:<br>
Lampa S, Alvarsson J, Spjuth O. Towards agile large-scale predictive modelling in drug discovery with flow-based programming design principles. *J Cheminform*. 2016. doi:[10.1186/s13321-016-0179-6](http://dx.doi.org/10.1186/s13321-016-0179-6).
* ***A Virtual Machine with a realistic, runnable, example workflow in a Jupyter Notebook, is available [here](https://github.com/pharmbio/bioimg-sciluigi-casestudy)***
* ***Watch a 10 minute screencast going through the basics of using SciLuigi [here](https://www.youtube.com/watch?v=gkKUWskRbjw)***
* ***See a poster describing the motivations behind SciLuigi [here](http://dx.doi.org/10.13140/RG.2.1.1143.6246)***
# Scientific Luigi
(SciLuigi for short) is a light-weight wrapper library around [Spotify](http://spotify.com)'s [Luigi](http://github.com/spotify/luigi) workflow system that aims to make writing scientific workflows more fluent, flexible and modular.

Scientific Luigi (SciLuigi for short) is a light-weight wrapper library around [Spotify](http://spotify.com)'s [Luigi](http://github.com/spotify/luigi)
workflow system that aims to make writing scientific workflows more fluent, flexible and
modular.
Luigi is a flexible and fun-to-use library. It has turned out, though, that its default way of defining dependencies, by hard-coding them in each task's requires() function, is not optimal for some types of workflows common e.g. in bioinformatics, where multiple inputs and outputs, complex dependencies, and the need to quickly try different workflow connectivity in an explorative fashion are central to the way of working.

Luigi is a flexile and fun-to-use library. It has turned out though
that its default way of defining dependencies by hard coding them in each task's
requires() function is not optimal for some type of workflows common e.g. in bioinformatics where multiple inputs and outputs, complex dependencies,
and the need to quickly try different workflow connectivity in an explorative fashion is central to the way of working.
SciLuigi can (optionally) complete tasks by running commands in containers. This can improve reproducibility (a container can be run portably on the cloud, on private clusters, or, for lightweight tasks, on a user's computer via Docker) and ease of use (the end-user of a workflow does not have to install finicky bioinformatics software, and the problem of conflicting dependencies is avoided). SciLuigi can also run Linux-only software from a Windows or Macintosh host, and leverage cloud computing resources (AWS Batch).

SciLuigi was designed to solve some of these problems, by providing the following
"features" over vanilla Luigi:
SciLuigi was designed to solve some of these problems, by providing the following "features" over vanilla Luigi:

- Separation of dependency definitions from the tasks themselves,
for improved modularity and composability.
@@ -30,39 +20,21 @@ SciLuigi was designed to solve some of these problems, by providing the following
- Inputs and outputs are connected with an intuitive "single-assignment syntax".
- "Good default" high-level logging of workflow tasks and execution times.
- Produces an easy to read audit-report with high level information per task.
- Integration with some HPC workload managers.
(So far only [SLURM](http://slurm.schedmd.com/) though).
- Integration with some HPC workload managers, currently AWS Batch.
- Integration with cloud-bucket stores (currently AWS S3).
- When containers are used, one can prototype and test a task locally on test data
with Docker, and then run it on cloud resources (e.g. AWS Batch) against a large
dataset by changing only a single parameter, as sketched below.
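
A minimal sketch of that last point, using the `ContainerInfo` arguments shown in the usage section below (the values are placeholders, and AWS-specific settings such as the job queue come from configuration; see example-config/containerinfo.ini in this PR):

```python
import sciluigi

# Prototype locally against small test data with the docker engine...
dev_info = sciluigi.ContainerInfo(engine='docker', vcpu=1, mem=512)

# ...then run the very same task against the full dataset on AWS Batch
# by swapping only this one parameter object.
prod_info = sciluigi.ContainerInfo(engine='aws_batch', vcpu=4, mem=8192)
```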

Because of Luigi's easy-to-use API these changes have been implemented
as a very thin layer on top of luigi's own API with no changes at all to the luigi
core, which means that you can continue leveraging the work already being
put into maintaining and further developing luigi by the team at Spotify and others.

## Workflow code quick demo

***For a brief 10 minute screencast going through the basics below, see [this link](https://www.youtube.com/watch?v=gkKUWskRbjw)***

Just to give a quick feel for how a workflow definition might look like in SciLuigi, check this code example
(implementation of tasks hidden here for brevity. See Usage section further below for more details):

```python
import sciluigi as sl

class MyWorkflow(sl.WorkflowTask):
def workflow(self):
# Initialize tasks:
foowrt = self.new_task('foowriter', MyFooWriter)
foorpl = self.new_task('fooreplacer', MyFooReplacer,
replacement='bar')

# Here we do the *magic*: Connecting outputs to inputs:
foorpl.in_foo = foowrt.out_foo

# Return the last task(s) in the workflow chain.
return foorpl
```

That's it! And again, see the "usage" section just below for a more detailed description of getting to this!
* ***UPDATE, Nov, 2016: A paper with the motivation and design decisions behind SciLuigi [now available](http://dx.doi.org/10.1186/s13321-016-0179-6)***
* If you use SciLuigi in your research, please cite it like this:<br>
Lampa S, Alvarsson J, Spjuth O. Towards agile large-scale predictive modelling in drug discovery with flow-based programming design principles. *J Cheminform*. 2016. doi:[10.1186/s13321-016-0179-6](http://dx.doi.org/10.1186/s13321-016-0179-6).
* ***See a poster describing the motivations behind SciLuigi [here](http://dx.doi.org/10.13140/RG.2.1.1143.6246)***

## Support: Getting help

@@ -72,6 +44,8 @@ Please use the [issue queue](https://github.com/pharmbio/sciluigi/issues) for an

- Python 2.7 - 3.4
- Luigi 1.3.x - 2.0.1
- boto3 > 1.7.10
- docker >= 3.2.1

## Install

@@ -129,31 +103,37 @@ Then, you need to define some tasks that can be done in this workflow.

This is done by:

1. Creating a subclass of `sciluigi.Task` (or `sciluigi.SlurmTask` if you want Slurm support)
1. Creating a subclass of `sciluigi.ContainerTask`
2. Adding fields named `in_<yournamehere>` for each input, in the new task class
3. Define methods named `out_<yournamehere>()` for each output, that return `sciluigi.TargetInfo` objects. (sciluigi.TargetInfo is initialized with a reference to the task object itself - typically `self` - and a path name, where upstream tasks paths can be used).
3. Define methods named `out_<yournamehere>()` for each output, that return `sciluigi.ContainerTargetInfo` objects. A `sciluigi.ContainerTargetInfo` is initialized with a reference to the task object itself - typically `self` - and a URL. Container targets can transparently change where they are hosted, including the local filesystem (/path/to/file.txt) or a bucket (s3://bucket/key/file.txt).
4. Define luigi parameters to the task.
5. Implement the `run()` method of the task.
5. Define the container engine and the parameters with which the container will be run.
6. Implement the `run()` method of the task.

#### Example:

Let's define a simple task that just writes "foo" to a file named `foo.txt`:
##### Let's define a simple task that just writes "foo" to a file named `foo.txt`.

For this very simple task we do not need a container, so we can base the task on the sciluigi.Task class. We do, however, use the sciluigi.ContainerTargetInfo class here. The path/URL we give is on the local filesystem. If we instead gave an S3 bucket/key URL (s3://bucket/foo.txt), this class would handle the upload to (and any later download from) S3.

```python
class MyFooWriter(sciluigi.Task):
# We have no inputs here
# Define outputs:
def out_foo(self):
return sciluigi.TargetInfo(self, 'foo.txt')
return sciluigi.ContainerTargetInfo(self, 'foo.txt')
def run(self):
with self.out_foo().open('w') as foofile:
foofile.write('foo\n')
```
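
The same task could write its output to S3 instead of the local disk just by changing the URL. A minimal sketch, assuming a hypothetical bucket name:

```python
class MyFooWriterS3(sciluigi.Task):
    # Identical to MyFooWriter above, except the target now lives in a
    # bucket; ContainerTargetInfo handles the S3 upload transparently.
    def out_foo(self):
        return sciluigi.ContainerTargetInfo(self, 's3://my-bucket/foo.txt')
    def run(self):
        with self.out_foo().open('w') as foofile:
            foofile.write('foo\n')
```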

Then, let's create a task that replaces "foo" with "bar":
##### Then, let's create a task that replaces "foo" with "bar":

This task will be run in a container, in this case the Alpine Linux container. This way (say, if we are running SciLuigi on a Windows machine without sed), we can still run the command without fuss. In fact, no matter where this is hosted, the task will reliably run the same way in the Docker container.

```python
class MyFooReplacer(sciluigi.Task):
class MyFooReplacer(sciluigi.ContainerTask):
container = 'alpine:3.7'
replacement = sciluigi.Parameter() # Here, we take as a parameter
# what to replace foo with.
# Here we have one input, a "foo file":
@@ -162,24 +142,27 @@ class MyFooReplacer(sciluigi.Task):
def out_replaced(self):
# As the path to the returned target(info), we
# use the path of the foo file:
return sciluigi.TargetInfo(self, self.in_foo().path + '.bar.txt')
return sciluigi.ContainerTargetInfo(self, self.in_foo().path + '.bar.txt')
def run(self):
with self.in_foo().open() as in_f:
with self.out_replaced().open('w') as out_f:
# Here we see that we use the parameter self.replacement:
out_f.write(in_f.read().replace('foo', self.replacement))
self.ex(
command="sed 's/foo/$repl/g' $infile > $outfile",
input_targets={
'infile': self.in_foo(),
},
output_targets={
'outfile': self.out_replaced(),
},
extra_parameters={
'repl': self.replacement,
}
)
```
Several things have happened here:

The last lines, we could have instead written using the command-line `sed` utility, available in linux, by calling it on the commandline, with the built-in `ex()` method:

```python
def run(self):
# Here, we use the in-built self.ex() method, to execute commands:
self.ex("sed 's/foo/{repl}/g' {inpath} > {outpath}".format(
repl=self.replacement,
inpath=self.in_foo().path,
outpath=self.out_replaced().path))
```
- We've specified which container the command should be run in. This can be any Docker-style URI.
- The command now uses the [Python string template system](https://docs.python.org/3.5/library/string.html#string.Template) to substitute parameters, input targets, and output targets.
- We use a ContainerTargetInfo in place of a TargetInfo. This replacement target takes a URL, and can seamlessly handle
local files, S3 buckets (and, in the future, SFTP, etc).
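
Conceptually, the command templating behaves like Python's standard `string.Template`. A standalone illustration (the in-container paths below are the mount points created in the example Dockerfile; in practice the engine performs this substitution for you):

```python
from string import Template

# The $-placeholders are filled from input_targets, output_targets,
# and extra_parameters before the command runs inside the container.
command = Template("sed 's/foo/$repl/g' $infile > $outfile")
print(command.substitute(
    repl='bar',
    infile='/mnt/inputs/file/foo.txt',
    outfile='/mnt/outputs/file/foo.txt.bar.txt',
))
# sed 's/foo/bar/g' /mnt/inputs/file/foo.txt > /mnt/outputs/file/foo.txt.bar.txt
```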

### Write the workflow definition

@@ -189,17 +172,29 @@ We do this by:

1. Instantiating the tasks, using the `self.new_task(<unique_taskname>, <task_class>, *args, **kwargs)` method, of the workflow task.
2. Connect the tasks together, by pointing the right `out_*` method to the right `in_*` field.
3. Returning the last task in the chain, from the workflow method.
3. Defining a `sciluigi.ContainerInfo` object that gives the basic parameters for how, and with which container engine, the container task should be run.
4. Returning the last task in the chain, from the workflow method.

#### Example:

```python
import sciluigi
class MyWorkflow(sciluigi.WorkflowTask):
def workflow(self):
foowriter = self.new_task('foowriter', MyFooWriter)
fooreplacer = self.new_task('fooreplacer', MyFooReplacer,
replacement='bar')
foowriter = self.new_task(
'foowriter',
MyFooWriter
)
fooreplacer = self.new_task(
'fooreplacer',
MyFooReplacer,
containerinfo=sciluigi.ContainerInfo(
vcpu=1,
mem=512,
engine='docker',
),
replacement='bar'
)

# Here we do the *magic*: Connecting outputs to inputs:
fooreplacer.in_foo = foowriter.out_foo
@@ -269,6 +264,8 @@ If you run into any of these problems, you might be interested in a new workflow

Changelog
---------
- 0.9.6b7_ct
- Support for containerized tasks and `ContainerTargetInfo`
- 0.9.3b4
- Support for Python 3 (Thanks to @jeffcjohnson for contributing this!).
- Bug fixes.
Empty file modified README.rst
100644 → 100755
Empty file.
79 changes: 79 additions & 0 deletions example-config/containerinfo.ini
@@ -0,0 +1,79 @@
# Sciluigi needs to know how to run your containers
# This configuration file helps specify the options needed
[DEFAULT]
# Which container engine to use. Options include:
# docker -> docker on the hosting machine
# aws_batch -> AWS batch
# pbs -> PBS / torque via qsub
# slurm -> slurm HPC management engine, via srun
engine = docker

# How many vcpu to request (concurrent threads)
vcpu = 1

# Maximum memory, in MB
mem = 4096

# Time limit in minutes
timeout = 10080

container_working_dir = /tmp/

# Some engine specific options
# ** singularity (for slurm and pbs) **
# where should we store our singularity containers.
# Should be some shared filesystem between nodes
container_cache =

# ** slurm **
# To which partition should we submit
slurm_partition =

# ** PBS **
# Under which account should jobs be submitted
pbs_account =
# to which queue?
pbs_queue =
# Path on shared filesystem between nodes
# To use to store scripts.
pbs_scriptpath =


# ** AWS batch **
# The role ID needed for tasks to access S3
aws_jobRoleArn =
# S3 bucket to use for temporary upload / download of files
aws_s3_scratch_loc =
# To which batch job queue should jobs be submitted
aws_batch_job_queue =
# Prefix to add to jobs (human readable)
aws_batch_job_prefix =
# How often should we poll batch (secs)
aws_batch_job_poll_sec = 10
# Where can we find credentials (defaults to ~/.aws if not specified)
aws_secrets_loc =
# How many times to try submitting via boto before giving up
aws_boto_max_tries = 10

# Now specify some defaults for tasks with specific resource need types
# Overriding only the relevant options

# High memory relative to number of CPU
[highmem]
mem = 120000
vcpu = 1

# Mixed needs for moderately multithreaded tasks
[midcpu]
mem = 4096
vcpu = 4

# Big cpu and memory
[heavy]
mem = 120000
vcpu = 12

# Minimal CPU and memory needs (suitable for IO limited tasks)
[light]
vcpu = 1
mem = 1024
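
A quick way to see how the [DEFAULT] section cascades into the named profiles is Python's standard configparser. This only illustrates the ini semantics (sciluigi's own loader may differ) and assumes the file is saved as example-config/containerinfo.ini:

```python
import configparser

config = configparser.ConfigParser()
config.read('example-config/containerinfo.ini')

highmem = config['highmem']
print(highmem.getint('mem'))    # 120000, overridden in [highmem]
print(highmem.getint('vcpu'))   # 1, set explicitly in [highmem]
print(highmem.get('engine'))    # docker, inherited from [DEFAULT]
```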
20 changes: 20 additions & 0 deletions examples/Dockerfile
Original file line number Diff line number Diff line change
@@ -0,0 +1,20 @@
# sciluigi-example
#
# VERSION 0.1.0__bcw.0.3.0

FROM ubuntu:16.04
# Create some mount points in the container for use by bucket-command-wrapper
RUN mkdir -p /mnt/inputs/file && mkdir -p /mnt/outputs/file && mkdir /scratch && mkdir /working
# Install at least python3 (used by BCW). It's OK to change the specific version of python3 used.
# Note: apt-get cannot express '>=' version ranges (and an unquoted '>'
# is shell redirection); ubuntu:16.04 ships python3 3.5, which is enough.
RUN apt-get update && apt-get install -y \
    python3 \
    python3-pip
# Since we are ONLY installing python3 link to it to make it the default python
RUN ln -s /usr/bin/python3 /usr/bin/python
# Install bucket_command_wrapper via pip, along with boto3 / awscli if we want to use AWS at all
# Quote the '>=' specifiers so the shell does not treat '>' as redirection
RUN pip3 install \
    'awscli>=1.15.14' \
    'boto3>=1.7.14' \
    bucket_command_wrapper==0.3.0

# Feel free to make this more useful by installing software, etc
Empty file modified examples/clean.sh
100644 → 100755
Empty file.
Empty file modified examples/data/a.txt
100644 → 100755
Empty file.
Empty file modified examples/data/acgt.txt
100644 → 100755
Empty file.
Empty file modified examples/data/afolder/hej.txt
100644 → 100755
Empty file.
Empty file modified examples/data/c.txt
100644 → 100755
Empty file.
Empty file modified examples/data/g.txt
100644 → 100755
Empty file.
Empty file modified examples/data/t.txt
100644 → 100755
Empty file.