diff --git a/README.md b/README.md index 0974739b2..5985e5719 100644 --- a/README.md +++ b/README.md @@ -6,7 +6,8 @@ Vivarium *E. coli* (vEcoli) is a port of the Covert Lab's [E. coli Whole Cell Model](https://github.com/CovertLab/wcEcoli) (wcEcoli) -to the [Vivarium framework](https://github.com/vivarium-collective/vivarium-core). Its main benefits over the original model are: +to the [Vivarium framework](https://github.com/vivarium-collective/vivarium-core). +Its main benefits over the original model are: 1. **Modular processes:** easily add/remove processes that interact with existing or new simulation state @@ -14,11 +15,14 @@ to the [Vivarium framework](https://github.com/vivarium-collective/vivarium-core making it easy to run simulations/analyses with different options 3. **Parquet output:** simulation output is in a widely-supported columnar file format that enables fast, larger-than-RAM analytics with DuckDB +4. **Google Cloud support:** workflows too large to run on a local machine + can be easily run on Google Cloud As in wcEcoli, [raw experimental data](reconstruction/ecoli/flat) is first processed by the parameter calculator or [ParCa](reconstruction/ecoli/fit_sim_data_1.py) to calculate -model parameters (e.g. transcription probabilities). These parameters are used to configure [processes](ecoli/processes) that are linked together -into a [complete simulation](ecoli/experiments/ecoli_master_sim.py). +model parameters (e.g. transcription probabilities). These parameters are used to configure +[processes](ecoli/processes) that are linked together into a +[complete simulation](ecoli/experiments/ecoli_master_sim.py). ## Installation @@ -26,8 +30,10 @@ into a [complete simulation](ecoli/experiments/ecoli_master_sim.py). > attempt to follow the same instructions after setting up > [Windows Subsystem for Linux](https://learn.microsoft.com/en-us/windows/wsl/install). -> **Note:** The instructions to set up the model on Sherlock are different and documented -> under the "Sherlock" sub-heading in the "Workflows" documentation page. +> **Note:** Refer to the following pages for non-local setups: +> [Sherlock](https://covertlab.github.io/vEcoli/workflows.html#sherlock), +> [other HPC cluster](https://covertlab.github.io/vEcoli/workflows.html#other-hpc-clusters), +> [Google Cloud](https://covertlab.github.io/vEcoli/gcloud.html). pyenv lets you install and switch between multiple Python releases and multiple "virtual environments", each with its own pip packages. Using pyenv, create a virtual environment @@ -70,7 +76,7 @@ If any downloads failed, re-run this command until it succeeds. To test your installation, from the top-level of the cloned repository, invoke: - # Must set PYTHONPATH and OMP_NUM_THREADS for every new shell + # Must set PYTHONPATH and OMP_NUM_THREADS for every new shell (can add to .bashrc/.zshrc) export PYTHONPATH=. export OMP_NUM_THREADS=1 python runscripts/workflow.py --config ecoli/composites/ecoli_configs/test_installation.json diff --git a/doc/gcloud.rst b/doc/gcloud.rst index 170158c2c..880268351 100644 --- a/doc/gcloud.rst +++ b/doc/gcloud.rst @@ -103,7 +103,7 @@ the email address for that service account. If you are a member of the Covert La or have been granted access to the Covert Lab project, substitute ``fireworker@allen-discovery-center-mcovert.iam.gserviceaccount.com``. Otherwise, including if you edited the default service account permissions, run -the above command without the ``--service-acount`` flag. 
+the above command without the ``--service-account`` flag.

.. warning::
   Remember to stop your VM when you are done using it. You can either do this
@@ -143,6 +152,15 @@ requirements.txt for correct versions)::

Then, install Java (through SDKMAN) and Nextflow following
`these instructions `_.

+.. note::
+   The only requirements to run :mod:`runscripts.workflow` on Google Cloud
+   are Nextflow and PyArrow. The workflow steps will be run inside Docker
+   containers (see :ref:`docker-images`). The other Python requirements can be
+   omitted for a more minimal installation. You will need to use
+   :ref:`interactive containers <interactive-containers>` to run the model using
+   any interface other than :mod:`runscripts.workflow`, but this restriction
+   also helps maximize reproducibility.
+
------------------
Create Your Bucket
------------------
@@ -162,42 +171,44 @@ Once you have created your bucket, tell vEcoli to use that bucket by setting the
The URI should be in the form ``gs://{bucket name}``. Remember to remove the
``out_dir`` key under ``emitter_arg`` if present.

+.. _docker-images:
+
-------------------
Build Docker Images
-------------------

On Google Cloud, each job in a workflow (ParCa, sim 1, sim 2, etc.) is run
on its own temporary VM. To ensure reproducibility, workflows run on Google
-Cloud must be run using Docker containers. vEcoli contains scripts in the
+Cloud are run using Docker containers. vEcoli contains scripts in the
``runscripts/container`` folder to build the required Docker images from the
-current state of your repository.
+current state of your repository. The built images are automatically
+uploaded to the ``vecoli`` Artifact Registry repository of your project.

-``build-runtime.sh`` builds a base Docker image containing the Python packages
-necessary to run vEcoli as listed in ``requirements.txt``. After the build is
-finished, the Docker image should be automatically uploaded to an Artifact Registry
-repository called ``vecoli``.
-
-``build-wcm.sh`` builds on the base image created by ``build-runtime.sh`` by copying
-the files in the cloned vEcoli repository including any uncommitted changes. Note
-that files matching any entry in ``.gitignore`` are not copied. The built image is
-also uploaded to the ``vecoli`` Artifact Registry repository.
+- ``build-runtime.sh`` builds a base Docker image containing the Python packages
+  necessary to run vEcoli as listed in ``requirements.txt``
+- ``build-wcm.sh`` builds on the base image created by ``build-runtime.sh`` by copying
+  the files in the cloned vEcoli repository, honoring ``.gitignore``

.. tip::
   If you want to build these Docker images for local testing, you can run
-   these scripts locally as long as you have Docker installed.
+   these scripts locally with ``-l`` as long as you have Docker installed.

These scripts are mostly not meant to be run manually. Instead, users should let
-:py:mod:`runscripts.workflow` handle this automatically by setting the following
+:py:mod:`runscripts.workflow` handle image builds by setting the following
keys in your configuration JSON::

    {
        "gcloud": {
-            "runtime_image_name": "Name of image build-runtime.sh built/will build"
-            "build_runtime_image": Boolean, can put false if requirements.txt did not
-                change since the last time this was true,
-            "wcm_image_image": "Name of image build-wcm.sh built/will build"
-            "build_wcm_image": Boolean, can put false if nothing in repository changed
-                since the last time this was true
+            # Name of image build-runtime.sh built/will build
+            "runtime_image_name": "",
+            # Boolean, can put false if requirements.txt did not change since the last
+            # time a workflow was run with this set to true
+            "build_runtime_image": true,
+            # Name of image build-wcm.sh built/will build
+            "wcm_image_name": "",
+            # Boolean, can put false if nothing in repository changed since the
+            # last time a workflow was run with this set to true
+            "build_wcm_image": true
        }
    }

@@ -212,7 +223,7 @@ as normal to start your workflow::

Once your workflow has started, you can use press "ctrl+a d" to detach from the
virtual console then close your SSH connection to your VM. The VM must continue
-to run until the workflow is complete. You can SSH into the VM and reconnect to
+to run until the workflow is complete. You can SSH into your VM and reconnect to
the virtual terminal with ``screen -r`` to monitor progress or inspect the file
``.nextflow.log`` in the root of the cloned repository.

@@ -220,7 +231,9 @@ the virtual terminal with ``screen -r`` to monitor progress or inspect the file
  While there is no strict time limit for workflow jobs on Google Cloud, jobs
  can be preempted at any time due to the use of spot VMs. Analysis scripts that
  take more than a few hours to run should be excluded from workflow configurations
-  and manually run using :py:mod:`runscripts.analysis` afterwards.
+  and manually run using :py:mod:`runscripts.analysis` afterwards. Alternatively, if
+  you are willing to pay the significant extra cost for standard VMs, delete
+  ``google.batch.spot = true`` from ``runscripts/nextflow/config.template``.

----------------
Handling Outputs
----------------
@@ -239,6 +252,48 @@ reason, we recommend that you delete workflow output data from your bucket as soon as
you are done with your analyses. If necessary, it will likely be cheaper to
re-run the workflow to regenerate that data later than to keep it around.

+.. _interactive-containers:
+
+----------------------
+Interactive Containers
+----------------------
+
+.. warning::
+   Install
+   :ref:`Docker ` and
+   :ref:`Google Cloud Storage FUSE `
+   on your VM before continuing.
+
+Since all steps of the workflow are run inside Docker containers, it can be
+helpful to launch an interactive instance of the container for debugging.
+
+To do so, run the following command::
+
+    runscripts/container/interactive.sh -w wcm_image_name -b bucket
+
+``wcm_image_name`` should be the same ``wcm_image_name`` from the config JSON
+used to run the workflow. A copy of the config JSON should be saved to the Cloud
+Storage bucket with the other output (see :ref:`output`). ``bucket`` should be
+the Cloud Storage bucket of the output (``out_uri`` in config JSON).
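+
+As a concrete example with purely hypothetical values, if your config JSON
+contained ``"wcm_image_name": "demo-wcm-code"`` and
+``"out_uri": "gs://demo-vecoli-output"`` (bucket name ``demo-vecoli-output``),
+the command would be::
+
+    runscripts/container/interactive.sh -w demo-wcm-code -b demo-vecoli-output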
+ +Inside the container, add breakpoints to any Python files located at ``/vEcoli`` by +inserting:: + + import ipdb; ipdb.set_trace() + +Navigate to the working directory (see :ref:`troubleshooting`) of the failing +task at ``/mnt/disks/{bucket}/...``. Evoke ``bash .command.sh`` to run the +task. Execution should pause at your set breakpoints, allowing you to inspect +variables and step through the code. + +.. warning:: + Any changes that you make to the code in ``/vEcoli`` inside the container are not + persistent. For large code changes, we recommend that you navigate to ``/vEcoli`` + inside the container and run ``git init`` then + ``git remote add origin https://github.com/CovertLab/vEcoli.git``. With the + git repository initialized, you can make changes locally, push them to a + development branch on GitHub, and pull/merge them in your container. + --------------- Troubleshooting --------------- diff --git a/doc/workflows.rst b/doc/workflows.rst index 792120f8c..a60576822 100644 --- a/doc/workflows.rst +++ b/doc/workflows.rst @@ -488,60 +488,74 @@ in the worklow. Sherlock -------- +Setup +===== + .. note:: - The following information is intended for members of the Covert Lab only. + The following setup applies to members of the Covert Lab only. -After cloning the model repository to your home directory, skip the other steps -in the README until reaching the instructions to install Nextflow. After installing -Nextflow in your home directory, add the following lines to your ``~/.bash_profile``, -then close and reopen your ssh connection: +After cloning the model repository to your home directory, add the following +lines to your ``~/.bash_profile``, then close and reopen your SSH connection: .. code-block:: bash - # Legacy environment variables so old scripts work - export PI_HOME=$GROUP_HOME - export PI_SCRATCH=$GROUP_SCRATCH - - # Load group-wide settings - if [ -f "${PI_HOME}/etc/bash_profile" ]; then - . "${PI_HOME}/etc/bash_profile" - fi - - # Environment variable required by pyenv - export PYENV_ROOT="${PI_HOME}/pyenv" - - # Environment modules used by vEcoli - module load system git/2.45.1 parallel - module load wcEcoli/python3 - - # Need Java for nextflow - module load java/18.0.2 + # Load newer Git and Java for nextflow + module load system git java/21.0.4 + # Set PYTHONPATH to root of repo so imports work export PYTHONPATH="$HOME/vEcoli" + # Use one thread for OpenBLAS (better performance and reproducibility) + export OMP_NUM_THREADS=1 + # Initialize pyenv + export PYENV_ROOT="${GROUP_HOME}/pyenv" if [ -d "${PYENV_ROOT}" ]; then export PATH="${PYENV_ROOT}/bin:${PATH}" eval "$(pyenv init -)" eval "$(pyenv virtualenv-init -)" fi - export PATH=$PATH:$HOME/.local/bin +Inside the cloned repository, run ``pyenv local vEcoli``. This loads a virtual +environment with PyArrow, the only Python package required to start a workflow +with :mod:`runscripts.workflow`. Once a workflow is started, vEcoli will build +an Apptainer image with all the other model dependencies using +``runscripts/container/build-runtime.sh``. This image will then be used to start +containers to run the steps of the workflow. To run or interact with the model +without using :mod:`runscripts.workflow`, start an interactive container by +following the steps in :ref:`sherlock-interactive`. - # Use one thread for OpenBLAS (better performance and reproducibility) - export OMP_NUM_THREADS=1 +.. 
_sherlock-config: -Finally, inside the cloned repository, run ``pyenv local viv-ecoli`` -to load the Python virtual environment with all required packages installed. +Configuration +============= -For convenience, :py:mod:`runscripts.workflow` accepts a boolean top-level -configuration option ``sherlock``. If set to True, :py:mod:`runscripts.workflow` +To tell vEcoli that you are running on Sherlock, you MUST add the following +options to your configuration JSON (note the top-level ``sherlock`` key):: + + { + "sherlock": { + # Boolean, whether to build a fresh Apptainer runtime image. If requirements.txt + # did not change since your last build, you can set this to false + "build_runtime_image": true, + # Absolute path (including file name) of Apptainer runtime image to either + # build or use (if build_runtime_image is false) + "runtime_image_name": "", + } + } + +With these options in the configuration JSON, :py:mod:`runscripts.workflow` can be run on a login node to automatically submit a job that will run the Nextflow workflow orchestrator with a 7-day time limit on the lab's dedicated -partition (job should start fairly quickly and never get preempted by other -users). The workflow orchestrator will automatically submit jobs for each step +partition. This job should start fairly quickly and never get preempted by other +users. The workflow orchestrator will automatically submit jobs for each step in the workflow: one for the ParCa, one to create variants, one for each cell, and one for each analysis. +If you are trying to run a workflow that takes longer than 7 days, you can +use the resume functionality (see :ref:`fault_tolerance`). Alternatively, +consider running your workflow on Google Cloud, which has no maximum workflow +runtime (see :doc:`gcloud`). + Importantly, the emitter output directory (see description of ``emitter_arg`` in :ref:`json_config`) should be an absolute path to somewhere in your ``$SCRATCH`` directory (e.g. ``/scratch/users/{username}/out``). The path must @@ -549,19 +563,74 @@ be absolute because Nextflow does not resolve environment variables like ``$SCRATCH`` in paths. .. warning:: - Running the workflow on Sherlock sets a 2 hour limit on all jobs in the + Running the workflow on Sherlock sets a 2 hour limit on each job in the workflow, including analyses. Analysis scripts that take more than 2 hours to run should be excluded from workflow configurations and manually run using :py:mod:`runscripts.analysis` afterwards. -.. tip:: - If you have access to a different HPC cluster that also uses the SLURM - scheduler, you can use vEcoli on that cluster by simply changing - the ``process.queue`` option in ``runscripts/nextflow/config.template`` - to the correct SLURM queue. If your HPC cluster uses a different scheduler, - you will have to change many options in the ``sherlock`` configuration - profile starting with ``process.executor``. Refer to the Nextflow - `executor documentation `_. +.. _sherlock-interactive: + +Interactive Container +===================== + +To run and develop the model on Sherlock outside a workflow, run:: + + runscripts/container/interactive.sh -w runtime_image_path -a + +Replace ``runtime_image_path`` with the path of an Apptainer image built with +the latest ``requirements.txt``. 
If you are not sure if ``requirements.txt`` +changed since the last time you ran a workflow with ``build_runtime_image`` +set to true (or if you have never run a workflow), run the following to build +a runtime image, picking any path:: + + runscripts/container/build-runtime.sh -r runtime_image_path -a + +Inside the container, set the ``PYTHONPATH`` with ``export PYTHONPATH={}``, +substituting in the path to your cloned ``vEcoli`` repository. You can now run +any of the scripts in ``runscripts``. + +If you are trying to debug a failed process, add breakpoints to any Python script +in your cloned repository by inserting:: + + import ipdb; ipdb.set_trace() + +Inside the interactive container, navigate to the working directory (see +:ref:`troubleshooting`) for the task that you want to debug. By invoking +``bash .command.sh``, the task should run and pause upon reaching your +breakpoints, allowing you to inspect variables and step through the code. + +------------------ +Other HPC Clusters +------------------ + +If your HPC cluster has Apptainer (formerly known as Singularity) installed, +the only other packages necessary to run :mod:`runscripts.workflow` are Nextflow +(requires Java) and PyArrow (pip install). It would be helpful if your Apptainer +installation automatically mounts all filesystems on the cluster (see +`Apptainer docs `_). +If not, workflows should still run but you will need to manually specify mount paths +to debug with interactive containers (see :ref:`sherlock-interactive`). +This can be done using the ``-p`` argument for ``runscripts/container/interactive.sh``. + +If your HPC cluster does not have Apptainer installed, you can follow the +local setup instructions in the README assuming your pyenv installation and +virtual environments are accessible from all nodes. Then, delete the following +lines from ``runscripts/nextflow/config.template`` and always set +``build_runtime_image`` to false in your config JSONs (see :ref:`sherlock-config`):: + + process.container = 'IMAGE_NAME' + apptainer.enabled = true + +If your HPC cluster also uses the SLURM scheduler, +you can use vEcoli on that cluster by changing the ``process.queue`` option in +``runscripts/nextflow/config.template`` and all strings of the format +``--partition=QUEUE`` in :py:mod:`runscripts.workflow` to the right queue for your +cluster. + +If your HPC cluster uses a different scheduler, refer to the Nextflow +`executor documentation `_ +for more information on configuring the right executor, starting with +``process.executor`` in ``runscripts/nextflow/config.template``. .. _progress: @@ -730,12 +799,26 @@ in a workflow called ``agitated_mendel``:: nextflow log agitated_mendel -f name,stderr,workdir -F "status == 'FAILED'" -Test Fixes -========== +Make and Test Fixes +=================== + +If you need to further investigate an issue, the exact steps differ depending +on where you are debugging. + +- Google Cloud: See :ref:`instructions here ` +- Sherlock: See :ref:`instructions here ` +- Local machine: Continue below + +Add breakpoints to any Python file with the following line:: + + import ipdb; ipdb.set_trace() + +Then, navigate to the working directory (see :ref:`troubleshooting`) for a +failing process. ``bash .command.run`` should re-run the job and pause upon +reaching the breakpoints you set. You should now be in an ipdb shell which +you can use to examine variable values or step through the code. 
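+
+Once a breakpoint triggers, the ipdb prompt accepts the standard pdb commands.
+A few of the most useful ones (``some_variable`` is a placeholder for any local
+in the current frame)::
+
+    ipdb> p some_variable    # print a variable
+    ipdb> n                  # execute the next line
+    ipdb> s                  # step into a function call
+    ipdb> c                  # continue to the next breakpoint
+    ipdb> q                  # quit the debugger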
-After identifying the issue and applying fixes, you can test a failed job -in isolation by invoking ``bash .command.run`` inside the work -directory for that job. Once you have addressed all issues, -you relaunch the workflow by navigating back to the directory in which you +After fixing the issue, you can resume the workflow (avoid re-running +already successful jobs) by navigating back to the directory in which you originally started the workflow and issuing the same command with the -added ``--resume`` option (see :ref:`fault_tolerance`). +``--resume`` option (see :ref:`fault_tolerance`). diff --git a/ecoli/experiments/ecoli_master_sim.py b/ecoli/experiments/ecoli_master_sim.py index 84ad22065..4f89c34d7 100644 --- a/ecoli/experiments/ecoli_master_sim.py +++ b/ecoli/experiments/ecoli_master_sim.py @@ -41,20 +41,7 @@ from ecoli.composites.ecoli_configs import CONFIG_DIR_PATH from ecoli.library.schema import not_a_process - -LIST_KEYS_TO_MERGE = ( - "save_times", - "add_processes", - "exclude_processes", - "processes", - "engine_process_reports", - "initial_state_overrides", -) -""" -Special configuration keys that are list values which are concatenated -together when they are found in multiple sources (e.g. default JSON and -user-specified JSON) instead of being directly overriden. -""" +from runscripts.workflow import LIST_KEYS_TO_MERGE class TimeLimitError(RuntimeError): diff --git a/reconstruction/ecoli/fit_sim_data_1.py b/reconstruction/ecoli/fit_sim_data_1.py index 07710d7b5..ebcbfc119 100644 --- a/reconstruction/ecoli/fit_sim_data_1.py +++ b/reconstruction/ecoli/fit_sim_data_1.py @@ -30,7 +30,9 @@ # Fitting parameters # NOTE: This threshold is arbitrary and was relaxed from 1e-9 # to 1e-8 to fix failure to converge after scipy/scipy#20168 -FITNESS_THRESHOLD = 1e-8 +# NOTE: Relaxed from 1e-8 to 1e-7 to fix failure to converge +# on Sherlock +FITNESS_THRESHOLD = 1e-7 MAX_FITTING_ITERATIONS = 150 N_SEEDS = 10 diff --git a/runscripts/container/build-runtime.sh b/runscripts/container/build-runtime.sh index db2759c36..1972efd98 100755 --- a/runscripts/container/build-runtime.sh +++ b/runscripts/container/build-runtime.sh @@ -1,28 +1,31 @@ #!/bin/sh -# Use Google Cloud Build or local Docker install to build a personalized -# image with requirements.txt installed. If using Cloud Build, store the -# built image in the "vecoli" folder in the Google Artifact Registry. +# Use Google Cloud Build, local Docker, or HPC cluster Apptainer to build +# a personalized image with requirements.txt installed. If using Cloud Build, +# store the built image in the "vecoli" repository in Artifact Registry. # # ASSUMES: The current working dir is the vEcoli/ project root. 
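+#
+# Example invocations (the image path/tag below is illustrative):
+#   ./runscripts/container/build-runtime.sh -l                         # local Docker build
+#   ./runscripts/container/build-runtime.sh -a -r "$HOME/runtime.sif"  # Apptainer build on an HPC cluster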
set -eu RUNTIME_IMAGE="${USER}-wcm-runtime" -RUN_LOCAL='false' +RUN_LOCAL=0 +BUILD_APPTAINER=0 -usage_str="Usage: build-runtime.sh [-r RUNTIME_IMAGE] [-l]\n\ - -r: Docker tag for the wcm-runtime image to build; defaults to \ -${USER}-wcm-runtime\n\ +usage_str="Usage: build-runtime.sh [-r RUNTIME_IMAGE] [-a] [-l]\n\ + -r: Path of built Apptainer image if -a, otherwise Docker tag \ +for the wcm-runtime image to build; defaults to ${USER}-wcm-runtime\n\ + -a: Build Apptainer image (cannot use with -l).\n\ -l: Build image locally.\n" print_usage() { printf "$usage_str" } -while getopts 'r:l' flag; do +while getopts 'r:al' flag; do case "${flag}" in r) RUNTIME_IMAGE="${OPTARG}" ;; - l) RUN_LOCAL="${OPTARG}" ;; + a) (( $RUN_LOCAL )) && print_usage && exit 1 || BUILD_APPTAINER=1 ;; + l) (( $BUILD_APPTAINER )) && print_usage && exit 1 || RUN_LOCAL=1 ;; *) print_usage exit 1 ;; esac @@ -32,9 +35,12 @@ done # the project root which would upload the entire project. cp requirements.txt runscripts/container/runtime/ -if [ "$RUN_LOCAL" = true ]; then +if (( $RUN_LOCAL )); then echo "=== Locally building WCM runtime Docker Image: ${RUNTIME_IMAGE} ===" - docker build -f runscripts/container/runtime/Dockerfile -t "${WCM_RUNTIME}" . + docker build -f runscripts/container/runtime/Dockerfile -t "${RUNTIME_IMAGE}" . +elif (( $BUILD_APPTAINER )); then + echo "=== Building WCM runtime Apptainer Image: ${RUNTIME_IMAGE} ===" + apptainer build ${RUNTIME_IMAGE} runscripts/container/runtime/Singularity else echo "=== Cloud-building WCM runtime Docker Image: ${RUNTIME_IMAGE} ===" # For this script to work on a Compute Engine VM, you must diff --git a/runscripts/container/build-wcm.sh b/runscripts/container/build-wcm.sh index 83b76ed1a..41ff101b6 100755 --- a/runscripts/container/build-wcm.sh +++ b/runscripts/container/build-wcm.sh @@ -1,7 +1,7 @@ #!/bin/sh -# Use Google Cloud Build or local Docker install to build a personalized image -# with current state of the vEcoli repo. If using Cloud Build, store -# the built image in the "vecoli" folder in the Google Artifact Registry. +# Use Google Cloud Build or local Docker to build a personalized image with +# current state of the vEcoli repo. If using Cloud Build, store the built +# image in the "vecoli" repository in Artifact Registry. # # ASSUMES: The current working dir is the vEcoli/ project root. 
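+#
+# Example invocation (image tags are illustrative):
+#   ./runscripts/container/build-wcm.sh -r alice-wcm-runtime -w alice-wcm-code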
@@ -9,25 +9,24 @@ set -eu RUNTIME_IMAGE="${USER}-wcm-runtime" WCM_IMAGE="${USER}-wcm-code" -RUN_LOCAL='false' +RUN_LOCAL=0 usage_str="Usage: build-wcm.sh [-r RUNTIME_IMAGE] \ -[-w WCM_IMAGE] [-l]\n\ - -r: Docker tag for the wcm-runtime image to build FROM; defaults to \ -"$USER-wcm-runtime" (must already exist in Artifact Registry).\n\ - -w: Docker tag for the "wcm-code" image to build; defaults to \ -"$USER-wcm-code".\n\ +[-w WCM_IMAGE] [-a] [-b BIND_PATH] [-l]\n\ + -r: Docker tag of wcm-runtime image to build from; defaults to \ +"$USER-wcm-runtime" (must exist in Artifact Registry).\n\ + -w: Docker tag of wcm-code image to build; defaults to "$USER-wcm-code".\n\ -l: Build image locally.\n" print_usage() { printf "$usage_str" } -while getopts 'r:w:l' flag; do +while getopts 'r:w:abl:' flag; do case "${flag}" in r) RUNTIME_IMAGE="${OPTARG}" ;; w) WCM_IMAGE="${OPTARG}" ;; - l) RUN_LOCAL="${OPTARG}" ;; + l) RUN_LOCAL=1 ;; *) print_usage exit 1 ;; esac @@ -39,7 +38,7 @@ TIMESTAMP=$(date '+%Y%m%d.%H%M%S') mkdir -p source-info git diff HEAD > source-info/git_diff.txt -if [ "$RUN_LOCAL" = true ]; then +if (( $RUN_LOCAL )); then echo "=== Locally building WCM code Docker Image ${WCM_IMAGE} on ${RUNTIME_IMAGE} ===" echo "=== git hash ${GIT_HASH}, git branch ${GIT_BRANCH} ===" docker build -f runscripts/container/wholecell/Dockerfile -t "${WCM_IMAGE}" \ diff --git a/runscripts/container/interactive.sh b/runscripts/container/interactive.sh new file mode 100755 index 000000000..706409127 --- /dev/null +++ b/runscripts/container/interactive.sh @@ -0,0 +1,77 @@ +#!/bin/sh +# Start an interactive Docker or Apptainer container from an image. +# Supports optional bind mounts and Cloud Storage bucket mounting + +set -eu # Exit on any error or unset variable + +# Default configuration variables +WCM_IMAGE="${USER}-wcm-code" # Default image name for Docker/Apptainer +USE_APPTAINER=0 # Flag: Use Apptainer if set to 1 +BIND_MOUNTS=() # Array for bind mount paths +BIND_CWD="" # Formatted bind mount string for runtime +BUCKET="" # Cloud Storage bucket name + +# Help message string +usage_str="Usage: interactive.sh [-w WCM_IMAGE] [-a] [-b] [-p]...\n\ +Options:\n\ + -w: Path of Apptainer image if -a, otherwise name of Docker \ +image inside vecoli Artifact Repository; defaults to "$USER-wcm-code".\n\ + -a: Load Apptainer image.\n\ + -b: Name of Cloud Storage bucket to mount inside container; first mounts +bucket to VM at $HOME/bucket_mnt using gcsfuse (does not work with -a).\n\ + -p: Path(s) to mount inside container; can specify multiple with \ +\"-p path1 -p path2\"\n" + +# Function to print usage instructions +print_usage() { + printf "$usage_str" +} + +# Parse command-line options +while getopts 'w:ab:p:' flag; do + case "${flag}" in + w) WCM_IMAGE="${OPTARG}" ;; # Set custom image name + a) USE_APPTAINER=1 ;; # Enable Apptainer mode + b) BUCKET="${OPTARG}" ;; # Set the Cloud Storage bucket + p) BIND_MOUNTS+=($(realpath "${OPTARG}")) ;; # Convert path to absolute and add to array + *) print_usage # Print usage for unknown flags + exit 1 ;; + esac +done + +# Apptainer-specific logic +if (( $USE_APPTAINER )); then + # If there are bind mounts, format them for Apptainer + if [ ${#BIND_MOUNTS[@]} -ne 0 ]; then + BIND_CWD=$(printf " -B %s" "${BIND_MOUNTS[@]}") + fi + echo "=== Launching Apptainer container from ${WCM_IMAGE} ===" + # Start Apptainer container with bind mounts + apptainer shell -e --writable-tmpfs ${BIND_CWD} ${WCM_IMAGE} +else + # Docker-specific logic + # Get GCP project name and region to 
construct image path + PROJECT=$(gcloud config get project) + REGION=$(gcloud config get compute/region) + WCM_IMAGE="${REGION}-docker.pkg.dev/${PROJECT}/vecoli/${WCM_IMAGE}" + + # If there are bind mounts, format them for Docker + if [ ${#BIND_MOUNTS[@]} -ne 0 ]; then + BIND_CWD=$(printf " -v %s:%s" "${BIND_MOUNTS[@]}" "${BIND_MOUNTS[@]}") + fi + + # Mount the cloud storage bucket using gcsfuse if provided + if [ -n "$BUCKET" ]; then + echo "=== Mounting Cloud Storage bucket ${BUCKET} ===" + # Create mount point and mount bucket with gcsfuse + mkdir -p $HOME/bucket_mnt + gcsfuse --implicit-dirs $BUCKET $HOME/bucket_mnt + # Nextflow mounts bucket to /mnt/disks so we need to copy that for + # symlinks to work properly + BIND_CWD="${BIND_CWD} -v ${HOME}/bucket_mnt:/mnt/disks/${BUCKET}" + fi + + # Launch the Docker container + echo "=== Launching Docker container from ${WCM_IMAGE} ===" + docker container run -it ${BIND_CWD} ${WCM_IMAGE} bash # Start Docker container with bind mounts +fi diff --git a/runscripts/container/runtime/Dockerfile b/runscripts/container/runtime/Dockerfile index cd81997e4..69ec9389a 100644 --- a/runscripts/container/runtime/Dockerfile +++ b/runscripts/container/runtime/Dockerfile @@ -20,7 +20,7 @@ RUN echo "alias ls='ls --color=auto'" >> ~/.bashrc \ # Update and install in the same layer so it won't install from old updates. RUN apt-get update \ - && apt-get install -y swig gfortran llvm cmake nano libopenblas-dev + && apt-get install -y git swig gfortran llvm cmake nano libopenblas-dev # This gets more consistent results from openblas. ENV OPENBLAS_NUM_THREADS=1 diff --git a/runscripts/container/runtime/Singularity b/runscripts/container/runtime/Singularity new file mode 100644 index 000000000..26d7cab86 --- /dev/null +++ b/runscripts/container/runtime/Singularity @@ -0,0 +1,33 @@ +Bootstrap: docker +From: python:3.11.3 + +%environment + export OPENBLAS_NUM_THREADS=1 + +%labels + application "Whole Cell Model Runtime Environment" + email "allencentercovertlab@gmail.com" + license "https://github.com/CovertLab/vEcoli/blob/master/LICENSE" + organization "Covert Lab at Stanford" + website "https://www.covert.stanford.edu/" + +%files + requirements.txt /requirements.txt + +%post + echo "Setting up runtime environment..." + + echo "alias ls='ls --color=auto'" >> ~/.bashrc + echo "alias ll='ls -l'" >> ~/.bashrc + cp ~/.bashrc / + + apt-get update \ + && apt-get install -y git swig gfortran llvm cmake nano libopenblas-dev + + pip install --no-cache-dir --upgrade pip setuptools==73.0.1 wheel + pip install --no-cache-dir numpy==1.26.4 + pip install --no-cache-dir -r /requirements.txt + +%runscript + # This defines the default behavior when the container is executed. 
+ exec /bin/bash diff --git a/runscripts/jenkins/configs/ecoli-anaerobic.json b/runscripts/jenkins/configs/ecoli-anaerobic.json index 177588476..50b5de190 100644 --- a/runscripts/jenkins/configs/ecoli-anaerobic.json +++ b/runscripts/jenkins/configs/ecoli-anaerobic.json @@ -25,5 +25,12 @@ "variants": { "condition": {"condition": {"value": ["no_oxygen"]}} }, - "jenkins": true + "sherlock": { + "runtime_image_name": "runtime-image", + "build_runtime_image": true, + "jenkins": true + }, + "parca_options": { + "cpus": 4 + } } diff --git a/runscripts/jenkins/configs/ecoli-glucose-minimal.json b/runscripts/jenkins/configs/ecoli-glucose-minimal.json index ad86e6841..f871d903b 100644 --- a/runscripts/jenkins/configs/ecoli-glucose-minimal.json +++ b/runscripts/jenkins/configs/ecoli-glucose-minimal.json @@ -11,5 +11,12 @@ "analysis_options": { "single": {"mass_fraction_summary": {}} }, - "jenkins": true + "sherlock": { + "runtime_image_name": "runtime-image", + "build_runtime_image": true, + "jenkins": true + }, + "parca_options": { + "cpus": 4 + } } diff --git a/runscripts/jenkins/configs/ecoli-new-gene-gfp.json b/runscripts/jenkins/configs/ecoli-new-gene-gfp.json index 92290de6f..45e73da66 100644 --- a/runscripts/jenkins/configs/ecoli-new-gene-gfp.json +++ b/runscripts/jenkins/configs/ecoli-new-gene-gfp.json @@ -9,7 +9,8 @@ "out_dir": "/scratch/groups/mcovert/vecoli" }, "parca_options": { - "new_genes": "gfp" + "new_genes": "gfp", + "cpus": 4 }, "analysis_options": { "single": {"mass_fraction_summary": {}} @@ -36,5 +37,9 @@ "op": "zip" } }, - "jenkins": true + "sherlock": { + "runtime_image_name": "runtime-image", + "build_runtime_image": true, + "jenkins": true + } } diff --git a/runscripts/jenkins/configs/ecoli-no-growth-rate-control.json b/runscripts/jenkins/configs/ecoli-no-growth-rate-control.json index 806fc984e..eac43d634 100644 --- a/runscripts/jenkins/configs/ecoli-no-growth-rate-control.json +++ b/runscripts/jenkins/configs/ecoli-no-growth-rate-control.json @@ -18,5 +18,12 @@ "analysis_options": { "single": {"mass_fraction_summary": {}} }, - "jenkins": true + "sherlock": { + "runtime_image_name": "runtime-image", + "build_runtime_image": true, + "jenkins": true + }, + "parca_options": { + "cpus": 4 + } } diff --git a/runscripts/jenkins/configs/ecoli-no-operons.json b/runscripts/jenkins/configs/ecoli-no-operons.json index c452ce090..a38b2758c 100644 --- a/runscripts/jenkins/configs/ecoli-no-operons.json +++ b/runscripts/jenkins/configs/ecoli-no-operons.json @@ -9,10 +9,15 @@ "out_dir": "/scratch/groups/mcovert/vecoli" }, "parca_options": { - "operons": false + "operons": false, + "cpus": 4 }, "analysis_options": { "single": {"mass_fraction_summary": {}} }, - "jenkins": true + "sherlock": { + "runtime_image_name": "runtime-image", + "build_runtime_image": true, + "jenkins": true + } } diff --git a/runscripts/jenkins/configs/ecoli-superhelical-density.json b/runscripts/jenkins/configs/ecoli-superhelical-density.json index 16b304856..1d099fc41 100644 --- a/runscripts/jenkins/configs/ecoli-superhelical-density.json +++ b/runscripts/jenkins/configs/ecoli-superhelical-density.json @@ -12,5 +12,12 @@ "analysis_options": { "single": {"mass_fraction_summary": {}} }, - "jenkins": true + "sherlock": { + "runtime_image_name": "runtime-image", + "build_runtime_image": true, + "jenkins": true + }, + "parca_options": { + "cpus": 4 + } } diff --git a/runscripts/jenkins/configs/ecoli-with-aa.json b/runscripts/jenkins/configs/ecoli-with-aa.json index 83a680ff4..5d3a12d2f 100644 --- 
a/runscripts/jenkins/configs/ecoli-with-aa.json +++ b/runscripts/jenkins/configs/ecoli-with-aa.json @@ -15,5 +15,12 @@ "variants": { "condition": {"condition": {"value": ["with_aa"]}} }, - "jenkins": true + "sherlock": { + "runtime_image_name": "runtime-image", + "build_runtime_image": true, + "jenkins": true + }, + "parca_options": { + "cpus": 4 + } } diff --git a/runscripts/jenkins/setup-environment.sh b/runscripts/jenkins/setup-environment.sh index c8a9fb914..96d7f764c 100644 --- a/runscripts/jenkins/setup-environment.sh +++ b/runscripts/jenkins/setup-environment.sh @@ -1,14 +1,23 @@ set -e +# Load newer Git and Java for nextflow +module load system git java/21.0.4 + +# Set PYTHONPATH to root of repo so imports work export PYTHONPATH=$PWD -module load wcEcoli/python3 java/18.0.2 +# Use one thread for OpenBLAS (better performance and reproducibility) +export OMP_NUM_THREADS=1 -export PATH="${GROUP_HOME}/pyenv/bin:${PATH}" -eval "$(pyenv init -)" -eval "$(pyenv virtualenv-init -)" +# Initialize pyenv +export PYENV_ROOT="${GROUP_HOME}/pyenv" +if [ -d "${PYENV_ROOT}" ]; then + export PATH="${PYENV_ROOT}/bin:${PATH}" + eval "$(pyenv init -)" + eval "$(pyenv virtualenv-init -)" +fi ### Edit this line to make this branch use another pyenv -pyenv local viv-ecoli +pyenv local vEcoli pyenv activate make clean compile diff --git a/runscripts/nextflow/config.template b/runscripts/nextflow/config.template index 73f2fc2d2..cdbd2f426 100644 --- a/runscripts/nextflow/config.template +++ b/runscripts/nextflow/config.template @@ -75,6 +75,8 @@ profiles { process.cpus = 1 process.executor = 'slurm' process.queue = 'owners' + process.container = 'IMAGE_NAME' + apptainer.enabled = true process.time = { if ( task.exitStatus == 140 ) { 2.h * task.attempt diff --git a/runscripts/workflow.py b/runscripts/workflow.py index ef3f17fb2..dda94990f 100644 --- a/runscripts/workflow.py +++ b/runscripts/workflow.py @@ -1,14 +1,29 @@ import argparse import json import os +import time import shutil import subprocess import warnings from datetime import datetime from urllib import parse +from typing import Optional from pyarrow import fs -from ecoli.experiments.ecoli_master_sim import SimConfig + +LIST_KEYS_TO_MERGE = ( + "save_times", + "add_processes", + "exclude_processes", + "processes", + "engine_process_reports", + "initial_state_overrides", +) +""" +Special configuration keys that are list values which are concatenated +together when they are found in multiple sources (e.g. default JSON and +user-specified JSON) instead of being directly overriden. +""" CONFIG_DIR_PATH = os.path.join( os.path.dirname(os.path.dirname(os.path.abspath(__file__))), @@ -49,6 +64,118 @@ """ +def merge_dicts(a, b): + """ + Recursively merges dictionary b into dictionary a. + This mutates dictionary a. + """ + for key, value in b.items(): + if isinstance(value, dict) and key in a and isinstance(a[key], dict): + # If both values are dictionaries, recursively merge + merge_dicts(a[key], value) + else: + # Otherwise, overwrite or add the value from b to a + a[key] = value + + +def submit_job(cmd: str, sbatch_options: Optional[list] = None) -> int: + """ + Submits a job to SLURM using sbatch and waits for it to complete. + + Args: + cmd: Command to run in batch job. + sbatch_options: Additional sbatch options as a list of strings. + + Returns: + Job ID of the submitted job. 
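+
+    Example (illustrative values, mirroring the call made by
+    ``build_runtime_image`` when running on Sherlock)::
+
+        job_id = submit_job(
+            "runscripts/container/build-runtime.sh -r runtime.sif -a",
+            sbatch_options=["--time=01:00:00", "--mem=4G"],
+        )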
+ """ + sbatch_command = ["sbatch"] + if sbatch_options: + sbatch_command.extend(sbatch_options) + sbatch_command.extend(["--wrap", cmd]) + + try: + result = subprocess.run( + sbatch_command, + stdout=subprocess.PIPE, + stderr=subprocess.PIPE, + check=True, + text=True, + ) + # Extract job ID from sbatch output + output = result.stdout.strip() + # Assuming job ID is the last word in the output + job_id = int(output.split()[-1]) + print(f"Job submitted with ID: {job_id}") + return job_id + except subprocess.CalledProcessError as e: + print(f"Error submitting job: {e.stderr.strip()}") + raise + + +def wait_for_job(job_id: int, poll_interval: int = 10): + """ + Waits for a SLURM job to finish. + + Args: + job_id: SLURM job ID. + poll_interval: Time in seconds between job status checks. + """ + job_id = str(job_id) + while True: + try: + # Check job status with squeue + result = subprocess.run( + ["squeue", "--job", job_id], + stdout=subprocess.PIPE, + stderr=subprocess.PIPE, + text=True, + ) + if job_id not in result.stdout: + break + except Exception as e: + print(f"Error checking job status: {e}") + raise + time.sleep(poll_interval) + + +def check_job_status(job_id: int) -> bool: + """ + Checks the exit status of a SLURM job using sacct. + + Args: + job_id: SLURM job ID. + + Returns: + True if the job succeeded (exit code 0), False otherwise. + """ + try: + # Query job status with sacct + result = subprocess.run( + ["sacct", "-j", str(job_id), "--format=JobID,State,ExitCode", "--noheader"], + stdout=subprocess.PIPE, + stderr=subprocess.PIPE, + text=True, + ) + output = result.stdout.strip() + + for line in output.splitlines(): + fields = line.split() + # Match the job ID + if str(job_id) in fields[0]: + state = fields[1] + # Extract the numeric exit code + exit_code = fields[2].split(":")[0] + print(f"Job {job_id} - State: {state}, Exit Code: {exit_code}") + return state == "COMPLETED" and exit_code == "0" + + print(f"Job {job_id} status not found in sacct output.") + return False + except Exception as e: + print(f"Error checking job status: {e}") + raise + + def generate_colony(seeds: int): """ Create strings to import and compose Nextflow processes for colony sims. @@ -225,11 +352,31 @@ def generate_code(config): return "\n".join(run_parca), "\n".join(sim_imports), "\n".join(sim_workflow) -def build_runtime_image(image_name): +def build_runtime_image(image_name, apptainer=False): build_script = os.path.join( os.path.dirname(__file__), "container", "build-runtime.sh" ) - subprocess.run([build_script, "-r", image_name], check=True) + cmd = [build_script, "-r", image_name] + if apptainer: + print("Submitting job to build runtime image.") + cmd.append("-a") + # On Sherlock, submit job to build runtime image + job_id = submit_job( + " ".join(cmd), + sbatch_options=[ + "--time=01:00:00", + "--mem=4G", + "--cpus-per-task=1", + "--partition=mcovert", + ], + ) + wait_for_job(job_id, 30) + if check_job_status(job_id): + print("Done building runtime image.") + else: + raise RuntimeError("Job to build runtime image failed.") + else: + subprocess.run([build_script, "-r", image_name], check=True) def build_wcm_image(image_name, runtime_image_name): @@ -242,9 +389,8 @@ def build_wcm_image(image_name, runtime_image_name): 'If this is correct, add this under "gcloud" > ' '"runtime_image_name" in your config JSON.' 
) - subprocess.run( - [build_script, "-w", image_name, "-r", runtime_image_name], check=True - ) + cmd = [build_script, "-w", image_name, "-r", runtime_image_name] + subprocess.run(cmd, check=True) def copy_to_filesystem(source: str, dest: str, filesystem: fs.FileSystem): @@ -288,7 +434,16 @@ def main(): if args.config is not None: config_file = args.config with open(args.config, "r") as f: - SimConfig.merge_config_dicts(config, json.load(f)) + user_config = json.load(f) + for key in LIST_KEYS_TO_MERGE: + user_config.setdefault(key, []) + user_config[key].extend(config.get(key, [])) + if key == "engine_process_reports": + user_config[key] = [tuple(path) for path in user_config[key]] + # Ensures there are no duplicates in d2 + user_config[key] = list(set(user_config[key])) + user_config[key].sort() + merge_dicts(config, user_config) experiment_id = config["experiment_id"] if experiment_id is None: @@ -340,6 +495,8 @@ def main(): # By default, assume running on local device nf_profile = "standard" + # If not running on a local device, build container images according + # to options under gcloud or sherlock configuration keys cloud_config = config.get("gcloud", None) if cloud_config is not None: nf_profile = "gcloud" @@ -358,15 +515,30 @@ def main(): raise RuntimeError("Must supply name for runtime image.") build_runtime_image(runtime_image_name) wcm_image_name = cloud_config.get("wcm_image_name", None) + if wcm_image_name is None: + raise RuntimeError("Must supply name for WCM image.") if cloud_config.get("build_wcm_image", False): - if wcm_image_name is None: - raise RuntimeError("Must supply name for WCM image.") + if runtime_image_name is None: + raise RuntimeError("Must supply name for runtime image.") build_wcm_image(wcm_image_name, runtime_image_name) nf_config = nf_config.replace("IMAGE_NAME", image_prefix + wcm_image_name) - elif config.get("sherlock", None) is not None: - nf_profile = "sherlock" - elif config.get("jenkins", None) is not None: - nf_profile = "jenkins" + sherlock_config = config.get("sherlock", None) + if sherlock_config is not None: + if nf_profile == "gcloud": + raise RuntimeError( + "Cannot set both Sherlock and Google Cloud " + "options in the input JSON." + ) + runtime_image_name = sherlock_config.get("runtime_image_name", None) + if runtime_image_name is None: + raise RuntimeError("Must supply name for runtime image.") + if sherlock_config.get("build_runtime_image", False): + build_runtime_image(runtime_image_name, True) + nf_config = nf_config.replace("IMAGE_NAME", runtime_image_name) + if sherlock_config.get("jenkins", False): + nf_profile = "jenkins" + else: + nf_profile = "sherlock" local_config = os.path.join(local_outdir, "nextflow.config") with open(local_config, "w") as f: @@ -431,7 +603,7 @@ def main(): #SBATCH --time=7-00:00:00 #SBATCH --cpus-per-task 1 #SBATCH --mem=4GB -#SBATCH -p mcovert +#SBATCH --partition=mcovert nextflow -C {config_path} run {workflow_path} -profile {nf_profile} \ -with-report {report_path} -work-dir {workdir} {"-resume" if args.resume else ""} """)