Fix some documentation typos
thalassemia committed Jul 20, 2024
1 parent 0519615 commit c52d1f4
Showing 13 changed files with 156 additions and 139 deletions.
6 changes: 3 additions & 3 deletions doc/composites.rst
@@ -3,7 +3,7 @@ Composites
==========

:py:class:`~ecoli.composites.ecoli_master.Ecoli` is a so-called composer
-that is responsible for aggregating all the Processes, Steps, topologies,
+that is responsible for aggregating Processes, Steps, topologies,
and the flow for the Steps into a unified "composite" model that vivarium-core
is able to run. Unlike a typical Vivarium composer which simply collects all
these pieces, the :py:class:`ecoli.composites.ecoli_master.Ecoli` composer
@@ -28,7 +28,7 @@ The :py:meth:`~ecoli.composites.ecoli_master.Ecoli.generate_processes_and_steps`
method of the :py:class:`~ecoli.composites.ecoli_master.Ecoli` composer is responsible
for creating these two Steps, the :py:class:`~ecoli.processes.allocator.Allocator` steps
sandwiched between them in each execution layer, and the
-:py:class:`~ecoli.processes.unique_update.UniqueUpdate` Steps the run at the very end
+:py:class:`~ecoli.processes.unique_update.UniqueUpdate` Steps that run at the very end
of each execution layer. It is also responsible for updating the flow to arrange
these Steps in the order described in :ref:`implementation`. As an end-user, all you
have to do to add a new partitioned process is ensure that it inherits from
Expand Down Expand Up @@ -91,7 +91,7 @@ to visualize these updates.

.. warning::
    This feature should only be turned on for debugging purposes and
-    only when using the in-memory emitter (``timeseries``).
+    only when using the in-memory emitter (see :ref:`ram_emitter`).

-------------
Initial State
52 changes: 31 additions & 21 deletions doc/experiments.rst
@@ -6,10 +6,20 @@ Experiments
interface for configuring and running single-cell simulations. We refer
to simulations as experiments, and all simulations (or batches
of simulations run in a single workflow, see :ref:`/workflows.rst`) are
-identified via a unique experiment ID. If data is being persisted to
-disk (i.e. not using ``timeseries`` in-memory emitter), then simulations
-or workflows with the same experiment ID will overwrite data from any past
-simulations or workflows with the same experiment ID.
+identified via a unique experiment ID.
+
+.. warning::
+    If data is being persisted to disk (see :ref:`parquet_emitter`), simulations
+    or workflows will overwrite data from any past simulations or workflows with
+    the same experiment ID.
+
+When running workflows with :py:mod:`runscripts.workflow` (see :ref:`/workflows.rst`),
+users are prevented from accidentally overwriting data by ``nextflow``, the software
+used to run the workflow. Specifically, nextflow generates an HTML execution report
+in the output folder for a given experiment ID (see :ref:`output`)
+and will refuse to run another workflow with the same experiment ID unless
+that execution report is renamed, moved, or deleted.


.. _sim_config:

@@ -43,25 +53,26 @@ JSON Config Files
The :py:class:`~ecoli.experiments.ecoli_master_sim.EcoliSim` class relies upon
the helper :py:class:`~ecoli.experiments.ecoli_master_sim.SimConfig` class to load
configuration options from JSON files and merge them with options specified via
-the command line. The configuration options are always loaded in the following order:
+the command line. The configuration options are always loaded in the following order,
+with options loaded later on overriding those from earlier sources:

#. The options in the default JSON config file (located at
:py:data:`~ecoli.experiments.ecoli_master_sim.SimConfig.default_config_path`)
#. The options in the JSON config file specified via ``--config``
in the command line.
-#. The other options specified via the command line.
+#. The options specified via the command line.

In most cases, configuration options that appear in more than one
-of the above sources are successively overriden. The sole exceptions
-are configuration options listed in
+of the above sources are successively overridden in their entirety. The sole
+exceptions are configuration options listed in
:py:attr:`~ecoli.experiments.ecoli_master_sim.LIST_KEYS_TO_MERGE`. These
-options hold list values that are concatenated with one another instead
-of being overriden.
+options hold lists of values that are concatenated with one another instead
+of being wholly overridden.
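
To make the merge rules concrete, here is a minimal sketch of the logic
(not the actual :py:class:`~ecoli.experiments.ecoli_master_sim.SimConfig`
implementation, and ``add_processes`` is only an illustrative list option):

.. code-block:: python

    # Later sources win outright, except list options in
    # LIST_KEYS_TO_MERGE, which are concatenated.
    LIST_KEYS_TO_MERGE = ("add_processes",)

    def merge_config(base: dict, override: dict) -> dict:
        merged = dict(base)
        for key, value in override.items():
            if key in LIST_KEYS_TO_MERGE:
                merged[key] = merged.get(key, []) + value
            else:
                merged[key] = value
        return merged

    defaults = {"experiment_id": "default", "add_processes": ["proc_a"]}
    user_config = {"experiment_id": "my_exp", "add_processes": ["proc_b"]}
    print(merge_config(defaults, user_config))
    # {'experiment_id': 'my_exp', 'add_processes': ['proc_a', 'proc_b']}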

Notice that the options in the default JSON config file are always loaded
first. This means that if you would like to run a simulation or workflow
that leaves some of these options alone, you can simply omit those options
-from the JSON config file that you create and pass to the runscript
+from the JSON config file that you create and pass to your runscript of choice
via ``--config``.

Below is an annotated copy of the default simulation-related configuration
@@ -82,7 +93,7 @@ documented in :ref:`/workflows.rst`.
# String that uniquely identifies simulation (or workflow if passed
# as input to runscripts/workflow.py). Avoid special characters as we
# quote experiment IDs using urllib.parse.quote_plus, which may make
-# experiment IDs with special characters hard to decipher.
+# experiment IDs with special characters hard to decipher later.
"experiment_id": "experiment_id_one"
# Whether to append date and time to experiment ID in the following format
# experiment_id_%d-%m-%Y_%H-%M-%S.
@@ -91,7 +102,7 @@ documented in :ref:`/workflows.rst`.
"description": "",
# Whether to display vivarium-core progress bar
"progress_bar" : true,
-# Path to pickle file output by parameter calculator (runscripts/parca.py).
+# Path to pickle file output from parameter calculator (runscripts/parca.py).
# Only used for single sim run with ecoli/experiments/ecoli_master_sim.py.
# Ignored when run with runscripts/workflow.py because each simulation is
# automatically run with the appropriate variant/baseline simulation data.
@@ -101,8 +112,8 @@ documented in :ref:`/workflows.rst`.
# to Parquet files on disk (good for workflows and more in-depth analyses)
"emitter" : "timeseries",
# If choosing "parquet" emitter, must provide "out_dir" with path (relative
-# or absolute) to output folder or "out_uri" with URI for Google Cloud Storage
-# bucket. ONLY CHOOSE ONE.
+# or absolute) to output folder OR "out_uri" with URI for Google Cloud Storage
+# bucket. Only provide one of the above.
"emitter_arg": {"out_dir": "out"},
# See API documentation on vivarium-core for vivarium.core.engine.Engine.
# Can usually leave as false.
@@ -115,7 +126,7 @@ documented in :ref:`/workflows.rst`.
"log_updates" : false,
# Controls output format for ecoli.experiments.ecoli_master_sim.EcoliSim.query.
# Should only be used if choosing "timeseries" emitter. See API documentation
-# for the function above for more information.
+# for the query function for more information.
"raw_output" : true,
# Initial seed used to generate the seeds that are used to initialize
# the pseudorandom number generators in the model. Only used for single
@@ -345,9 +356,9 @@ documented in :ref:`/workflows.rst`.
}
}
-Here are some general rules to remember when writing JSON files:
+Here are some general rules to remember when writing your own JSON config files:

-- String must be enclosed in double quotes (not single quotes)
+- Strings must be enclosed in double quotes (not single quotes)
- Booleans are lowercase
- None values are written as (unquoted) ``null``
- Trailing commas are not allowed
@@ -359,7 +370,7 @@ Output
------

If ``emitter`` was set to ``parquet``, then folders containing the simulation output are
-created as described in :ref:`/output.rst`.
+created as described in :ref:`parquet_emitter`.

If ``division`` is set to True, :py:mod:`~ecoli.experiments.ecoli_master_sim` will
save the initial states of the two daughter cells resulting from cell division
@@ -412,8 +423,7 @@ by setting ``_emit`` to ``True``.
Vivarium includes internal checks to ensure that all ports connected to a
store give the same or compatible (no conflicting keys) schemas for that store.
This means that if you would like to override the schema for a store with many
-connecting ports, you will likely need to override the ports schemas for all
-relevant ports.
+connecting ports, you will need to override the schemas for all the relevant ports.
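
A hedged sketch of this constraint (process and store names are hypothetical;
``ports_schema`` and ``next_update`` are the standard vivarium-core hooks):

.. code-block:: python

    from vivarium.core.process import Process

    class ProcessA(Process):
        def ports_schema(self):
            # Declares listeners > my_value with a default and _emit
            return {"listeners": {"my_value": {"_default": 0.0, "_emit": True}}}

        def next_update(self, timestep, states):
            return {}

    class ProcessB(Process):
        def ports_schema(self):
            # Compatible with ProcessA: identical keys and settings, so
            # the merged schema for the shared store has no conflicts
            return {"listeners": {"my_value": {"_default": 0.0, "_emit": True}}}

        def next_update(self, timestep, states):
            return {}

Changing ``_default`` (or any other schema key) in only one of these classes
would trip the compatibility check, which is why an override has to be applied
to every connecting port.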

------------------
Colony Simulations
4 changes: 2 additions & 2 deletions doc/index.rst
@@ -2,9 +2,9 @@ Welcome to Vivarium *E. coli*'s documentation!
==============================================

Vivarium *E. coli* is a port of the |text|_ to the `Vivarium framework <https://vivarium-collective.github.io>`_.
-For more scientific details about the model, refer to the
+For more scientific modeling details, refer to the
`documentation <https://github.com/CovertLab/wcEcoli/tree/master/docs/processes>`_
-for the model as well its corresponding publication
+for the original model as well as its corresponding publication
(`10.1126/science.aav3751 <https://www.science.org/doi/10.1126/science.aav3751>`_).
This website covers how the model was implemented using Vivarium and describes the user interface
for developing and running the model. We recommend new users read through the sections below in order.
40 changes: 22 additions & 18 deletions doc/output.rst
@@ -17,11 +17,11 @@ to that store. By default, we always emit data for:

- Bulk molecules store located at ``("bulk",)``: The
:py:func:`~ecoli.library.schema.numpy_schema` helper function that we use
-to create the schema for ports to bulk and unique molecule stores automatically
+to create the schema for ports to the bulk store automatically
sets ``_emit`` to True when the ``name`` argument is ``bulk``.
- Listeners located at ``("listeners",)``: The
:py:func:`~ecoli.library.schema.listener_schema` helper function that we use
-to create the schema for ports to stores located somewhere in the store hierarchy
+to create the schema for ports to stores located somewhere in the hierarchy
under the ``listener`` store automatically sets ``_emit`` to True.
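
As a hedged sketch, a process might emit both kinds of data like this (the
listener name and the exact shape of the ``listener_schema`` argument are
illustrative; ``next_update`` is omitted for brevity):

.. code-block:: python

    from vivarium.core.process import Process
    from ecoli.library.schema import listener_schema, numpy_schema

    class MyProcess(Process):
        def ports_schema(self):
            return {
                # numpy_schema sets _emit=True itself when name == "bulk"
                "bulk": numpy_schema("bulk"),
                "listeners": {
                    # listener_schema marks this hypothetical value as emitted
                    "my_listener": listener_schema({"my_value": 0.0}),
                },
            }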

.. _serializing_emits:
@@ -45,6 +45,8 @@ For details about reading data back after it has been saved, refer to
:ref:`ram_read` for the in-memory data format and :ref:`parquet_read`
for the persistent storage format.

+.. _ram_emitter:
+
-----------------
In-Memory Emitter
-----------------
@@ -90,6 +92,8 @@ of the :py:class:`~vivarium.core.registry.Serializer` instance whose
:py:meth:`~vivarium.core.registry.Serializer.can_deserialize` method returns
True on the data to deserialize.

.. _parquet_emitter:

---------------
Parquet Emitter
---------------
@@ -108,19 +112,19 @@ In Hive partitioning, certain keys in data are used to partition the data into folders
In the vEcoli Parquet emitter, the keys used for this purpose are the experiment ID,
variant index, lineage seed (initial seed for cell lineage), generation, and agent ID.
These keys uniquely identify a single cell simulation, meaning each simulation process
-will write data to its own folder in final output with a path like::
+will write data to its own folder in the final output with a path like::

experiment_id={}/variant={}/lineage_seed={}/generation={}/agent_id={}

This allows workflows that run simulations with many variant simulation data objects,
lineage seeds, generations, and agent IDs to all write data to the same main output
-folder without overwriting any data.
+folder without simulations overwriting one another.
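
For example, a Hive-aware reader like DuckDB can recover the partition keys
directly from the folder names (a hedged sketch; paths assume an ``out_dir``
of ``out`` and an experiment ID of ``exp_one``):

.. code-block:: python

    import duckdb

    # The key=value folder names (variant, lineage_seed, generation,
    # agent_id) become ordinary columns when hive_partitioning is enabled,
    # so one query can span every cell in the experiment.
    cells = duckdb.sql("""
        SELECT DISTINCT variant, lineage_seed, generation, agent_id
        FROM read_parquet(
            'out/experiment_id=exp_one/*/*/*/*/history/*.pq',
            hive_partitioning = true)
    """).df()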

Parquet Files
=============

Because Parquet is a tabular file format (think in terms of columns like a Pandas
-DataFrame), additional serialization steps must be taken after the data to save
+DataFrame), additional serialization steps must be taken after the emit data
has been converted to JSON format in accordance with :ref:`serializing_emits`.
The Parquet emitter (:py:class:`~ecoli.library.parquet_emitter.ParquetEmitter`)
first calls :py:func:`~ecoli.library.parquet_emitter.flatten_dict` in order to
@@ -200,11 +204,11 @@ Schemas constructed with the :py:func:`~ecoli.library.schema.listener_schema` helper
function can populate this metadata concisely. These metadata values are compiled for
all stores in the simulation state hierarchy by
:py:meth:`~ecoli.experiments.ecoli_master_sim.EcoliSim.get_output_metadata`. In the
-saved configuration Parquet file, the metadata values will be located under
+saved configuration Parquet file, the metadata values will be located in
columns with names equal to the double-underscore concatenated store path
prefixed by ``output_metadata__``. For convenience, the
:py:func:`~ecoli.library.parquet_emitter.get_field_metadata` function can be used in
-analysis scripts to read back this metadata.
+analysis scripts to read this metadata.

``history``
-----------
Expand All @@ -213,13 +217,13 @@ Each simulation will save Parquet files containing serialized simulation output
inside its corresponding Hive partition under the ``history`` folder. The columns in
these Parquet files come from flattening the hierarchy of emitted stores. To leverage
Parquet's columnar compression and efficient reading, we batch many time steps worth
-of emits into a temporary file before reading them into a
+of emits into a temporary newline-delimited JSON file before reading them into a
`PyArrow <https://arrow.apache.org/docs/python/index.html>`_ table where each row
-contains the column values for a single time step. This PyArrow table can then be
+contains the column values for a single time step. This PyArrow table is then
written to a Parquet file named ``{batch size * number of batches}.pq`` (e.g.
``400.pq``, ``800.pq``, etc. for a batch size of 400). The default batch size of
400 has been tuned for our current model but can be adjusted via ``emits_to_batch``
-under the ``emitter_arg`` option in configuration JSONs.
+under the ``emitter_arg`` option in a configuration JSON.
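
For example, a config that halves the default batch size might look like this
(a hedged sketch; the ``out_dir`` value is illustrative):

.. code-block:: json

    {
        "emitter": "parquet",
        "emitter_arg": {"out_dir": "out", "emits_to_batch": 200}
    }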

.. _parquet_read:

@@ -240,19 +244,19 @@ to read data using DuckDB. These include:
``config_sql`` that reads data from Parquet files with filters applied when
run using :py:mod:`runscripts.analysis`.
- :py:func:`~ecoli.library.parquet_emitter.num_cells`: Quickly get a count of
-the number of cells worth of data included in a SQL query
+the number of cells whose data is included in a SQL query
- :py:func:`~ecoli.library.parquet_emitter.skip_n_gens`: Add a filter to an SQL
query to skip the first N generations worth of data
-- :py:func:`~ecoli.library.parquet_emitter.ndlist_to_ndarray`: Convert a Parquet
+- :py:func:`~ecoli.library.parquet_emitter.ndlist_to_ndarray`: Convert a PyArrow
column of nested lists into a N-D Numpy array
- :py:func:`~ecoli.library.parquet_emitter.ndarray_to_ndlist`: Convert a N-D Numpy
-array into a Parquet column of nested lists
+array into a PyArrow column of nested lists
- :py:func:`~ecoli.library.parquet_emitter.ndidx_to_duckdb_expr`: Get a DuckDB SQL
expression which can be included in a ``SELECT`` statement that uses Numpy-style
indexing to retrieve values from a nested list Parquet column
- :py:func:`~ecoli.library.parquet_emitter.named_idx`: Get a DuckDB SQL expression
-which can be included in a ``SELECT`` statement that extracts certain indices
-of values from a nested list Parquet column and returns them as new named columns
+which can be included in a ``SELECT`` statement that extracts values at certain indices
+from each row of a nested list Parquet column and returns them as individually named columns
- :py:func:`~ecoli.library.parquet_emitter.get_field_metadata`: Read saved store
metadata (see :ref:`configuration_parquet`)
- :py:func:`~ecoli.library.parquet_emitter.get_config_value`: Read option from
@@ -264,7 +268,7 @@ to read data using DuckDB. These include:
too large to read into memory all at once).

.. warning::
-    Parquet lists are 1-indexed. The :py:func:`~ecoli.library.parquet_emitter.ndidx_to_duckdb_expr`
+    Parquet lists are 1-indexed. :py:func:`~ecoli.library.parquet_emitter.ndidx_to_duckdb_expr`
    and :py:func:`~ecoli.library.parquet_emitter.named_idx` automatically add 1 to
    user-supplied indices.
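
The offset exists because list indexing in DuckDB SQL is itself 1-based,
as this self-contained example shows:

.. code-block:: python

    import duckdb

    # list_extract(list, 1) returns the FIRST element, not the second
    print(duckdb.sql("SELECT list_extract([10, 20, 30], 1) AS first").fetchall())
    # [(10,)]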

@@ -274,7 +278,7 @@ Construct SQL Queries
The true power of DuckDB is unlocked when SQL queries are iteratively constructed. This can be
accomplished in one of two ways:

-- For simpler queries, you can wrap a complete DuckDB SQL expression in parenthesis to use as
+- For simpler queries, you can wrap a complete DuckDB SQL expression in parentheses to use as
the input table to another query. For example, to calculate the average cell and dry mass
over all time steps for all cells accessible to an analysis script:

Expand All @@ -286,7 +290,7 @@ accomplished in one of two ways:
)
)
-In this case, ``history_sql`` can be slotted in programmatically using an f-string.
+``history_sql`` can be slotted in programmatically using an f-string.
- For more advanced, multi-step queries, you can use
`common table expressions <https://duckdb.org/docs/sql/query_syntax/with.html>`_ (CTEs).
For example, you can first average over all time steps within each cell and
then aggregate those per-cell values across cells.
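
A hedged sketch of such a CTE-based query (the mass column name is an
assumption carried over from the earlier example):

.. code-block:: python

    # history_sql is provided to analysis scripts (see the helper list above)
    query = f"""
        WITH avg_per_cell AS (
            SELECT avg(listeners__mass__cell_mass) AS avg_cell_mass,
                   experiment_id, variant, lineage_seed, generation, agent_id
            FROM ({history_sql})
            GROUP BY experiment_id, variant, lineage_seed, generation, agent_id
        )
        SELECT avg(avg_cell_mass) AS mean_cell_mass FROM avg_per_cell
    """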