Files to generate the sim_data object that contains all parameters needed to run simulations.
reconstruction/ecoli/...
- flat/: raw data files
- dataclasses/: classes for organizing data related to processes and states
- scripts/: scripts for processing data sources to *.tsv files in reconstruction/ecoli/flat/
- knowledge_base_raw.py: script to load raw data
- simulation_data.py: class of the sim_data object
- fit_sim_data_1.py: script to calculate parameters from raw data to produce the sim_data object required for simulations
models/ecoli/...
- processes/: files to simulate physiological processes
- listeners/: files to record data to disk during simulations
- analysis/: files to analyze simulation output
- sim/variants/: files that modify sim_data for running experiments on modified parameters
- sim/initial_conditions.py: initializes cell states before simulation
- sim/simulation.py: specifies classes that make an E. coli simulation
General tools that are not E. coli specific.
wholecell/...
- fireworks/firetasks/: handles inputs, outputs and options for the execution of tasks in a simulation workflow (eg. calculating parameters, running simulations, performing analysis)
- tests/: test scripts to ensure proper functionality
- states/: classes representing cell states used in simulations
- sim/simulation.py: main simulation file handling states, processes, listeners and updates between them
This data will only be accessed during analysis and not during simulations.
validation/ecoli/...
- flat/: raw data files
- validation_data_raw.py: script to load raw data
- validation_data.py: script to process and organize raw data
cloud/...
- docker/: Dockerfiles for building Docker containers to run simulations
- build*.sh: scripts to run Docker builds
Scripts used as entry points for executing workflows or performing analysis.
runscripts/...
- jenkins/: scripts to test codebase through continuous integration (CI)
- manual/: scripts for running portions of workflows interactively
- debug/: scripts useful for debugging issues like output differences and inspecting sim_data
- cloud/wcm.py: used to make Docker fireworks workflows to run locally or on Google Compute Engine
- fireworks/fw_queue.py: used to make fireworks workflows to run locally or on Sherlock
- tools/: development tools
Analysis plots will use sim_data generated by the parca, data from sims stored in listeners and can compare to validation data withheld from simulations. See models/ecoli/analysis/ for all the analysis scripts.
TODO: add schematic to show groupings for single/multigen/cohort/variant analysis
Analysis to be performed on a single simulation (for each variant, seed, generation and daughter). This is typically detailed information about each sim.
Analysis to be performed on multiple generations starting with a single parent (for each variant and seed). This is typically to see trends and evolution over time/many generations.
Analysis to be performed across multiple generations that were initialized with a different random seed (for each variant). This is typically to see variability from initial conditions and stochastic events.
Analysis to be performed on all sims run together in order to compare effects from sim_data modifications (only run for the entire set of sims). Each analysis script is typically created for a specific simulation variant to show the output differences from using different parameter values.
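The four analysis levels above differ mainly in which indices identify one run of an analysis script. The sketch below is only an illustration of that grouping, not code from the repo: the `sims` list and the `group_sims` helper are hypothetical, and each sim is represented as a (variant, seed, generation, daughter) tuple.

```python
from collections import defaultdict

# Hypothetical (variant, seed, generation, daughter) indices for a small set of sims.
sims = [
    (0, 0, 0, 0),
    (0, 0, 1, 0),
    (0, 1, 0, 0),
    (0, 1, 1, 0),
    (1, 0, 0, 0),
    (1, 0, 1, 0),
]

# Number of leading indices that identify one run of each analysis level.
KEY_LENGTH = {'single': 4, 'multigen': 2, 'cohort': 1, 'variant': 0}


def group_sims(sims, level):
    """Group sims by the indices that define one run of the given analysis level."""
    groups = defaultdict(list)
    for sim in sims:
        groups[sim[:KEY_LENGTH[level]]].append(sim)
    return dict(groups)


for level in ('single', 'multigen', 'cohort', 'variant'):
    print(level, len(group_sims(sims, level)))
```

With these six sims, single analysis would run six times (once per sim), multigen three times (once per variant and seed), cohort twice (once per variant) and variant analysis once for the whole set.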
Create data series for interactive exploration of data through the Fathom visualization tool.
After generating data series, display the data by running the following command from the cloned repo with the proper path for your simulation output:
python site/server.py ~/wcEcoli/out/manual/kb ~/wcEcoli/out/manual/wildtype_000000/000000/generation_000000/000000/seriesOut
Alternatively, you can set your environment variable $CAUSALITY_SERVER as the path to site/server.py in the Causality repo (TIP: add this to your .bash_profile) and run the manual runscript with the --show flag from this repo to automatically display the data generated and allow for easier selection of specific seeds/generations etc:
export CAUSALITY_SERVER="$HOME/path/to/causality/site/server.py"
python runscripts/manual/buildCausalityNetwork.py --show
Useful network topologies are saved in models/ecoli/analysis/causality_network/saved_networks. Saved topologies can be loaded and new topologies can be saved through the web interface after running the python site/server.py ... or python runscripts/manual/buildCausalityNetwork.py --show commands from above. If a newly saved topology is useful for the team, check it in with a new commit.
Analysis to be performed on raw_data, sim_data and validation_data only. This does not require any simulation output and is only run once after the parca has run to visualize raw data and processed data.
The sections below provide step-by-step guides for adding different components to the whole-cell model. After completing one of these sections, you will want to create a pull request (PR) for review before merging the code into master. Some guidelines for creating a PR are listed below:
- Follow style guidelines in our style guide for consistent code
- Create a new branch from master (or another branch), add your new commits and push it to GitHub. For help with git or GitHub, consider the following useful link: Git Book (mostly Ch. 1-3 and 6-8 for more advanced use cases).
- Create a descriptive title of the changes, similar to a commit message.
- Each PR should be a relatively concise set of changes. Related and dependent changes can be grouped together but there should not be a lot of new features in a single PR. Large, unrelated changes make it harder to review and track down bugs that are later found to be introduced with a PR.
- PRs should be tested and working - CI will run a quick test to make sure no errors pop up but output should be checked to see if it is reasonable.
- Wait for CI to pass (will show a green check mark and say all checks have passed on your PR) and give others a reasonable amount of time to review before merging into master.
- Ideally perform a 'squash and merge' to merge into master for a simpler commit history for easier debugging when reviewing past changes. This also makes it easier to see which PR led to the changes for additional information and does not insert commits in the commit history like a normal merge will do. After clicking 'squash and merge' you will be able to edit the commit message - remove empty lines and unhelpful commit messages (eg 'fix typo', 'address comments' etc). If you want to keep individual commits, performing a 'rebase and merge' is preferred to 'create a merge commit' since it will also prevent disrupting the commit history order that normal merging can cause.
- If you want to build new features on top of a pending PR, feel free to start working on another branch from that PR branch. You might need to rebase on master or merge master into your new branch once those changes are merged in especially if review comments lead to changes.
Raw data should always be annotated with the source and process used to generate it for reproducibility. The best way is to include it in the file as noted below and described in the PR that incorporates the data into the repo. Adding several data files and scripts to a runscript directory could also use a README.md if desired to point to sources and describe how to run the scripts/what output to expect.
- Add a raw data file to reconstruction/ecoli/flat/. Data is stored in a .tsv file format with special formatting handling to allow units (specified in parentheses in column headers), lists, dictionaries and comments (lines starting with #).
- Annotate where the data came from in a comment at the top of the file (URL for the data source and/or script used for processing original data - see example). If a script was required, add it to reconstruction/ecoli/scripts.
- Add the filename to LIST_OF_DICT_FILENAMES in knowledge_base_raw.py. This will cause the data to be loaded into the class when an instance of KnowledgeBaseEcoli is created.
- Access, process and store the data in the appropriate reconstruction class (eg processes or states) by accessing the raw_data attribute for the file (eg. raw_data.new_file for a file named new_file.tsv)
NOTE: if there are issues loading the new file, try saving it using tsv_writer from reconstruction/spreadsheets.py to ensure proper formatting that can be read by tsv_reader or JsonReader:
from reconstruction.spreadsheets import tsv_writer
filename = 'output.tsv'
fieldnames = ['a', 'b']
with tsv_writer(filename, fieldnames) as writer:
    writer.writerow({'a': 1, 'b': 2})  # write as many rows of data as needed
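To make the file conventions above concrete (comment lines starting with #, units in parentheses in column headers, lists in cells), here is a minimal standard-library sketch. It is not the repo's tsv_reader or JsonReader: it assumes every cell is JSON-encoded, and the file contents, the `HEADER_UNITS` pattern and the `parse_flat_tsv` helper are all hypothetical.

```python
import json
import re

# Hypothetical file contents following the conventions described above.
EXAMPLE = (
    '# Source: hypothetical URL or processing script\n'
    '"id"\t"mass (fg)"\t"synonyms"\n'
    '"A"\t1.5\t["a1", "a2"]\n'
    '"B"\t2.0\t[]\n'
)

# Matches headers such as "mass (fg)" and captures the name and the units.
HEADER_UNITS = re.compile(r'^(?P<name>.*?)\s*\((?P<units>[^()]*)\)$')


def parse_flat_tsv(text):
    """Return (rows, units) from tab-separated text, skipping '#' comment lines."""
    lines = [line for line in text.splitlines() if line and not line.startswith('#')]
    headers = [json.loads(cell) for cell in lines[0].split('\t')]

    names, units = [], {}
    for header in headers:
        match = HEADER_UNITS.match(header)
        if match:
            names.append(match.group('name'))
            units[match.group('name')] = match.group('units')
        else:
            names.append(header)

    # Assumption: each cell is valid JSON (string, number, list or dictionary).
    rows = [
        dict(zip(names, (json.loads(cell) for cell in line.split('\t'))))
        for line in lines[1:]
    ]
    return rows, units


rows, units = parse_flat_tsv(EXAMPLE)
print(units)
print(rows[0])
```

The actual loader may attach units to the values or handle quoting differently, so treat this only as a way to reason about what a well-formed file looks like.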
The steps to add validation data are very similar to those described in New raw data above, but an important distinction between raw data and validation data is that validation data will not be used to calculate parameters or be used in simulations at all. Validation data is only used to compare simulation results in analysis plots. Additional information about file formatting and annotating in New raw data should also be considered here.
- Add a validation data file to validation/ecoli/flat/.
- Annotate where the data came from in a comment at the top of the file (URL for the data source and/or script used for processing original data - see example). If a script was required, add it to reconstruction/ecoli/scripts.
- Add the filename to LIST_OF_DICT_FILENAMES in validation_data_raw.py. This will cause the data to be loaded into the class when an instance of ValidationDataRawEcoli is created.
- Access, process and store the data as an attribute in the appropriate class (or create a new class) in validation_data.py by accessing the validation_data_raw attribute for the file (eg. validation_data_raw.new_file for a file named new_file.tsv)
Each process models one part of the cell’s function, e.g. RNA polymerase elongation. They are modeled separately (modular), run in short time steps (assumed to be independent over a short time), and the results from each time step are integrated between processes before initiating the next time step.
Each process has three entry points during a simulation:
- initialize: called only once at the beginning of a simulation. Get needed parameters from the knowledge base, get views of bulk and unique molecules (bulk molecules are “indistinguishable” from each other, e.g. inactive RNAP molecules, unique molecules can be distinguished from each other, e.g. active RNAP molecules are each assigned to a location on the genome), create a view so that you can get counts, change counts, and change properties.
- calculateRequest: called at the beginning of each timestep. Request the resources that you want for that timestep (don’t request all unless you are certain that another process doesn’t need this resource as well, don’t forget about metabolism).
- evolveState: called after resources are allocated at each timestep. Perform the process, update counts, and update masses (mass must be conserved between steps).
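The request/allocate/evolve cycle described above can be sketched with a toy example. Nothing here is the wholecell API: `ToyProcess`, `partition` and `run_time_step` are hypothetical names, and the proportional split of a limited pool is an assumption chosen only to show why a process should not request more than it needs.

```python
class ToyProcess:
    """Stand-in for a process; not the wholecell Process base class."""

    def __init__(self, name, wanted):
        self.name = name
        self.wanted = wanted  # molecules this process would like each time step
        self.allocated = 0
        self.used = 0

    def calculate_request(self):
        # Ask for resources before anything is consumed.
        return self.wanted

    def evolve_state(self):
        # Consume only what was allocated to this process.
        self.used += self.allocated


def partition(available, requests):
    """Split a shared pool proportionally when total requests exceed supply."""
    total = sum(requests.values())
    if total <= available:
        return dict(requests)
    return {name: available * req // total for name, req in requests.items()}


def run_time_step(pool, processes):
    """Collect requests, allocate the pool, evolve each process, return leftovers."""
    requests = {p.name: p.calculate_request() for p in processes}
    allocation = partition(pool, requests)
    for p in processes:
        p.allocated = allocation[p.name]
        p.evolve_state()
    return pool - sum(allocation.values())


processes = [ToyProcess('transcription', 60), ToyProcess('metabolism', 90)]
remaining = run_time_step(100, processes)
print(remaining, [(p.name, p.used) for p in processes])
```

In this toy run the two processes request 150 molecules from a pool of 100, so each receives a share proportional to its request and the total (used plus remaining) still equals the starting pool.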
Adding a process involves adding data to be used by that process in reconstruction/ as well as code to model the process in models/. The steps to add a new process called new_process are outlined below:
- Add any required raw data (see 'New raw data' section above)
- Create a new file called new_process.py in reconstruction/ecoli/dataclasses/process/. This should include a class definition for NewProcess that loads data from raw_data to store as instance variables in an __init__(self, raw_data, sim_data) function. See other files in the directory for an example.
- Import the new reconstruction class and initialize an instance of it in process.py. This will make the data in the previous step accessible with sim_data.process.new_process.

    from .new_process import NewProcess
    ...
    self.new_process = NewProcess(raw_data, sim_data)
    ...

- Create a new file called new_process.py in models/ecoli/processes/. Add a class definition with the following functions as described above:

    import wholecell.processes.process

    class NewProcess(wholecell.processes.process.Process):
        """ NewProcess """

        _name = "NewProcess"

        def __init__(self):
            super(NewProcess, self).__init__()

        def initialize(self, sim, sim_data):
            super(NewProcess, self).initialize(sim, sim_data)

        def calculateRequest(self):
            pass  # request the resources needed for this time step

        def evolveState(self):
            pass  # perform the process with the allocated resources

- Import the new model class in simulation.py and add the class to one of the tuples in _processClasses. Each tuple within _processClasses represents a set of processes that will all run before updating the cell state. All processes within a tuple are assumed to be independent of each other and will request from the same pool of resources. The tuples of processes will be executed in order, so a process that requires all other processes to run first should be in the last tuple. One time step will be completed once all of the processes in each tuple have been run. In most cases, a new class will be added to the first tuple.

    from models.ecoli.processes.new_process import NewProcess

    _processClasses = (
        (
            ...
            NewProcess,
            ...
        ),
    )

- Add a function to initial_conditions.py to approximate the function of the new process at steady state so that the initial cell state is representative of the state expected after the process runs.
- (Optional) Add a new listener (see 'Add new listener' section) to save important information about the new process.
- (Optional) Add new analysis plots (see 'Add new analysis' section) to show data from the new process.
- Add documentation (.tex and .pdf) about the new process in docs/processes/.
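The steps above describe a data flow from raw_data, through the reconstruction class, to the model process. The sketch below traces that flow with stand-ins so it can run on its own: `NewProcessData`, `ToyNewProcess`, the `rate` column and the `SimpleNamespace` objects are hypothetical, and the real process class would inherit from wholecell.processes.process.Process instead.

```python
from types import SimpleNamespace


class NewProcessData:
    """Hypothetical reconstruction class that stores processed raw data."""

    def __init__(self, raw_data, sim_data):
        # Assumption: each row of the hypothetical new_file.tsv loads as a dict.
        self.rates = {row['id']: row['rate'] for row in raw_data.new_file}


class ToyNewProcess:
    """Stand-in for the model process in models/ecoli/processes/."""

    def initialize(self, sim, sim_data):
        # Parameters are read once from sim_data at the start of a simulation.
        self.rates = sim_data.process.new_process.rates


# Stand-ins for the objects the parca and simulation would normally provide.
raw_data = SimpleNamespace(new_file=[
    {'id': 'rxn1', 'rate': 0.5},
    {'id': 'rxn2', 'rate': 2.0},
])
sim_data = SimpleNamespace(process=SimpleNamespace())
sim_data.process.new_process = NewProcessData(raw_data, sim_data)

process = ToyNewProcess()
process.initialize(sim=None, sim_data=sim_data)
print(process.rates)
```

Keeping the data processing in the reconstruction class and only reading sim_data in initialize mirrors the split between reconstruction/ and models/ described above.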
Variants are used to compare changes to sim_data that cause different initialization and simulation conditions. Each variant can be thought of as an experiment with the model and can be used to test a hypothesis, analyze sensitivity, and/or get a better understanding of parameters. When running simulations, a range of variant indices can be selected with each one performing a different modification to sim_data. The effect of each index will be determined by the function added in the steps below. Variants are typically paired with one or more variant level analysis scripts in order to make the desired comparisons between changes in sim_data. The following steps outline how to add a new variant called new_variant.py.
- Create a new variant script in models/ecoli/sim/variants/ by copying the template.py file to a new filename that describes what your variant does.

    cp models/ecoli/sim/variants/template.py models/ecoli/sim/variants/new_variant.py

- Update the new file at the points labeled UPDATE:. You will want to update the docstring at the top, rename the function to match the filename, make modifications to sim_data based on the index argument, and return descriptions based on the changes made for the specific index.
- Add an import statement and variant mapping (sorted in alphabetical order) to models/ecoli/sim/variants/__init__.py:

    from .new_variant import new_variant

    nameToFunctionMapping = {
        ...
        'new_variant': new_variant,
        ...
    }

- (Optional) Create a variant analysis script (see 'New analysis' section below) to analyze results of the new variant.
To run a manual simulation with the first two indices of the new variant:
python runscripts/manual/runSim.py --variant new_variant 0 1
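As a rough picture of what a variant function does with its index, consider the sketch below. It assumes the function takes (sim_data, index) and returns a description together with the modified sim_data; check template.py for the actual signature and return format. `FACTORS`, `new_parameter` and the description keys are hypothetical.

```python
from types import SimpleNamespace

# Hypothetical scale factors, one per variant index.
FACTORS = [0.5, 1.0, 2.0]


def new_variant(sim_data, index):
    """Scale a hypothetical parameter by the factor selected with index."""
    factor = FACTORS[index]
    sim_data.new_parameter *= factor  # hypothetical sim_data attribute

    # Assumption: the description is a dict with a short name and a longer text.
    description = dict(
        shortName='{}x'.format(factor),
        desc='new_parameter scaled by {}.'.format(factor),
    )
    return description, sim_data


# Stand-in for the sim_data object produced by the parca.
sim_data = SimpleNamespace(new_parameter=10.0)
description, modified = new_variant(sim_data, 2)
print(description['shortName'], modified.new_parameter)
```

Mapping each index to one explicit modification keeps the variant reproducible, since the same index always yields the same sim_data change.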
If you want to save new values to analyze after the simulation, you must write them out via a listener. Listeners log data during the simulation and can write attributes or columns. Attributes are values that are static throughout the simulation (eg. IDs) and only written once (usually at the beginning of simulations). Columns are values that change and are written every time step (eg. number of reactions that occurred). Listeners are organized to contain data that is similar to other data within the listener, often from the same process and used in the same analysis plots. The following steps outline how to add a new listener:
- Create a new file in wcEcoli/models/ecoli/listeners/; you can use other listeners in the directory as a template. It should contain a class definition that inherits from wholecell.listeners.listener.Listener.
- Complete the initialize() function. This should save any values from sim_data or processes/states in sim that are required like IDs or number of expected attributes. This function is called after processes have initialized so process attributes that are set during process initialize() calls can be accessed here.
- Complete the allocate() function. This should initialize values and types for columns. The initial state of sims along with all listener values is written once before the evolution of a time step so these initial values will be the first entry in a column.
- Complete the tableCreate(self, tableWriter) function. This function is called once at the beginning of simulations and should define subcolumns (a dictionary that maps column name keys to values that contain an array with a corresponding ID for each entry in the column), write attributes (including subcolumns, if needed) using tableWriter.writeAttributes(), and define any columns that can be of variable length using tableWriter.set_variable_length_columns().
- (Optional) Complete the update() function. This function is called at the end of each time step before values are written and should update any instance variables that will be written to file based on the current state. Often, processes will set values to be written and this function is unnecessary.
- Complete the tableAppend(self, tableWriter) function. This function is called at the end of each time step after update() has been called and writes values for each column to file using tableWriter.append().
- Add the listener to wcEcoli/models/ecoli/sim/simulation.py. You will need to import the class at the top of the file and add the class to the _listenerClasses tuple.
- Save data during sims by calling self.writeToListener('NewListener', 'new_column', value) in a process to write a value to a column. value can be a single value (float, str, etc) or a list/array of fixed length at every time step (unless tableWriter.set_variable_length_columns() was used in tableCreate() for the given 'new_column').
- Load data during analysis plots by creating a table reader and reading the desired attribute or column, where simOutDir will be passed in to the do_plot function:

    import os

    from wholecell.io.tablereader import TableReader

    reader = TableReader(os.path.join(simOutDir, 'NewListener'))
    attribute = reader.readAttribute('new_attribute')
    column = reader.readColumn('new_column')

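The difference between attributes (written once) and columns (appended every time step) can be illustrated with an in-memory toy. `ToyTableWriter` and `ToyListener` are hypothetical stand-ins rather than the wholecell classes, and the keyword-argument style of writeAttributes() and append() is an assumption made for this sketch.

```python
class ToyTableWriter:
    """Minimal in-memory stand-in; not wholecell's table writer."""

    def __init__(self):
        self.attributes = {}
        self.rows = []

    def writeAttributes(self, **attributes):
        # Attributes are static values written once.
        self.attributes.update(attributes)

    def append(self, **columns):
        # Each call adds one entry per column, e.g. once per time step.
        self.rows.append(columns)


class ToyListener:
    """Stand-in showing the order in which listener methods are used."""

    def initialize(self, reaction_ids):
        self.reaction_ids = reaction_ids

    def allocate(self):
        # Initial values become the first entry of each column.
        self.reaction_counts = [0] * len(self.reaction_ids)

    def tableCreate(self, tableWriter):
        tableWriter.writeAttributes(reaction_ids=self.reaction_ids)

    def tableAppend(self, tableWriter):
        tableWriter.append(reaction_counts=list(self.reaction_counts))


listener = ToyListener()
listener.initialize(['rxn1', 'rxn2'])
listener.allocate()

writer = ToyTableWriter()
listener.tableCreate(writer)
listener.tableAppend(writer)  # initial state, written before the first time step

for step in range(3):
    listener.reaction_counts = [step, 2 * step]  # a process would set this value
    listener.tableAppend(writer)

print(writer.attributes)
print([row['reaction_counts'] for row in writer.rows])
```

The column ends up with one more entry than the number of time steps because the allocated initial values are written first, which matches the note in the allocate() step above.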
This outlines how to add a new single analysis plot called new_analysis.py. For other types of analysis, only the directory needs to be changed. New analysis plots might require additional simulation data to be saved to disk by adding entries to an existing listener or creating a new listener (see 'New listener' section above).
- Decide which level of analysis is appropriate (single, multigen, cohort, variant - see 'Analysis' section above for differences between each).
- Create a new file by copying the template.py file in the directory for desired analysis type (single, multigen, cohort, variant) in models/ecoli/analysis/.

    cp models/ecoli/analysis/single/template.py models/ecoli/analysis/single/new_analysis.py

- Add the new plot ("new_analysis.py") to the desired lists in __init__.py in the appropriate analysis directory (eg for single analysis).
  - Always add to ACTIVE so that continuous integration tests the new plot
  - If desired, add to CORE to run with default analysis if the new analysis is useful in most simulation circumstances
  - If desired, add to other lists in TAGS to run with groups of plots for more specific analysis (eg. METABOLISM, TRANSCRIPTION, TRANSLATION, etc.) when using the manual scripts with the -p flag. For example, to run all plots in the METABOLISM tag for a sim in out/manual:

        python runscripts/manual/analysisSingle.py out/manual -p METABOLISM

- One implicit modeling design goal is that no phenomenon be modeled in more than one place over a given time interval -- exactly one place, if the model is complete.
- We have to be careful about degrees of representation. E.g. genes with known differential expression but no associated transcription factor do not currently change their expression during an environmental shift.
- A reproducible/testable way to show what is or isn't represented is via validation against some expected behavior or withheld data set.