Fix some documentation typos
thalassemia committed Jul 20, 2024
1 parent 0519615 commit c52d1f4
Showing 13 changed files with 156 additions and 139 deletions.
6 changes: 3 additions & 3 deletions doc/composites.rst
@@ -3,7 +3,7 @@ Composites
==========

:py:class:`~ecoli.composites.ecoli_master.Ecoli` is a so-called composer
-that is responsible for aggregating all the Processes, Steps, topologies,
+that is responsible for aggregating Processes, Steps, topologies,
and the flow for the Steps into a unified "composite" model that vivarium-core
is able to run. Unlike a typical Vivarium composer which simply collects all
these pieces, the :py:class:`ecoli.composites.ecoli_master.Ecoli` composer
@@ -28,7 +28,7 @@ The :py:meth:`~ecoli.composites.ecoli_master.Ecoli.generate_processes_and_steps`
method of the :py:class:`~ecoli.composites.ecoli_master.Ecoli` composer is responsible
for creating these two Steps, the :py:class:`~ecoli.processes.allocator.Allocator` steps
sandwiched between them in each execution layer, and the
-:py:class:`~ecoli.processes.unique_update.UniqueUpdate` Steps the run at the very end
+:py:class:`~ecoli.processes.unique_update.UniqueUpdate` Steps that run at the very end
of each execution layer. It is also responsible for updating the flow to arrange
these Steps in the order described in :ref:`implementation`. As an end-user, all you
have to do to add a new partitioned process is ensure that it inherits from
Expand Down Expand Up @@ -91,7 +91,7 @@ to visualize these updates.

.. warning::
    This feature should only be turned on for debugging purposes and
-    only when using the in-memory emitter (``timeseries``).
+    only when using the in-memory emitter (see :ref:`ram_emitter`).

-------------
Initial State
52 changes: 31 additions & 21 deletions doc/experiments.rst
@@ -6,10 +6,20 @@ Experiments
interface for configuring and running single-cell simulations. We refer
to simulations as experiments, and all simulations (or batches
of simulations run in a single workflow, see :ref:`/workflows.rst`) are
-identified via a unique experiment ID. If data is being persisted to
-disk (i.e. not using ``timeseries`` in-memory emitter), then simulations
-or workflows with the same experiment ID will overwrite data from any past
-simulations or workflows with the same experiment ID.
+identified via a unique experiment ID.
+
+.. warning::
+    If data is being persisted to disk (see :ref:`parquet_emitter`), simulations
+    or workflows will overwrite data from any past simulations or workflows with
+    the same experiment ID.
+
+When running workflows with :py:mod:`runscripts.workflow` (see :ref:`/workflows.rst`),
+users are prevented from accidentally overwriting data by ``nextflow``, the software
+used to run the workflow. Specifically, nextflow generates an HTML execution report
+in the output folder for a given experiment ID (see :ref:`output`)
+and will refuse to run another workflow with the same experiment ID unless
+that execution report is renamed, moved, or deleted.


.. _sim_config:

@@ -43,25 +53,26 @@ JSON Config Files
The :py:class:`~ecoli.experiments.ecoli_master_sim.EcoliSim` class relies upon
the helper :py:class:`~ecoli.experiments.ecoli_master_sim.SimConfig` class to load
configuration options from JSON files and merge them with options specified via
-the command line. The configuration options are always loaded in the following order:
+the command line. The configuration options are always loaded in the following order,
+with options loaded later on overriding those from earlier sources:

#. The options in the default JSON config file (located at
:py:data:`~ecoli.experiments.ecoli_master_sim.SimConfig.default_config_path`)
#. The options in the JSON config file specified via ``--config``
in the command line.
-#. The other options specified via the command line.
+#. The options specified via the command line.

In most cases, configuration options that appear in more than one
-of the above sources are successively overriden. The sole exceptions
-are configuration options listed in
+of the above sources are successively overridden in their entirety. The sole
+exceptions are configuration options listed in
:py:attr:`~ecoli.experiments.ecoli_master_sim.LIST_KEYS_TO_MERGE`. These
-options hold list values that are concatenated with one another instead
-of being overriden.
+options hold lists of values that are concatenated with one another instead
+of being wholly overridden.
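
To make the merge rules concrete, here is a minimal sketch of the logic
(not the actual :py:class:`~ecoli.experiments.ecoli_master_sim.SimConfig`
implementation, and ``add_processes`` is only an illustrative list option):

.. code-block:: python

    # Later sources win outright, except list options in
    # LIST_KEYS_TO_MERGE, which are concatenated.
    LIST_KEYS_TO_MERGE = ("add_processes",)

    def merge_config(base: dict, override: dict) -> dict:
        merged = dict(base)
        for key, value in override.items():
            if key in LIST_KEYS_TO_MERGE:
                merged[key] = merged.get(key, []) + value
            else:
                merged[key] = value
        return merged

    defaults = {"experiment_id": "default", "add_processes": ["proc_a"]}
    user_config = {"experiment_id": "my_exp", "add_processes": ["proc_b"]}
    print(merge_config(defaults, user_config))
    # {'experiment_id': 'my_exp', 'add_processes': ['proc_a', 'proc_b']}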

Notice that the options in the default JSON config file are always loaded
first. This means that if you would like to run a simulation or workflow
that leaves some of these options alone, you can simply omit those options
-from the JSON config file that you create and pass to the runscript
+from the JSON config file that you create and pass to your runscript of choice
via ``--config``.

Below is an annotated copy of the default simulation-related configuration
@@ -82,7 +93,7 @@ documented in :ref:`/workflows.rst`.
# String that uniquely identifies simulation (or workflow if passed
# as input to runscripts/workflow.py). Avoid special characters as we
# quote experiment IDs using urllib.parse.quote_plus, which may make
-# experiment IDs with special characters hard to decipher.
+# experiment IDs with special characters hard to decipher later.
"experiment_id": "experiment_id_one"
# Whether to append date and time to experiment ID in the following format
# experiment_id_%d-%m-%Y_%H-%M-%S.
@@ -91,7 +102,7 @@ documented in :ref:`/workflows.rst`.
"description": "",
# Whether to display vivarium-core progress bar
"progress_bar" : true,
-# Path to pickle file output by parameter calculator (runscripts/parca.py).
+# Path to pickle file output from parameter calculator (runscripts/parca.py).
# Only used for single sim run with ecoli/experiments/ecoli_master_sim.py.
# Ignored when run with runscripts/workflow.py because each simulation is
# automatically run with the appropriate variant/baseline simulation data.
@@ -101,8 +112,8 @@ documented in :ref:`/workflows.rst`.
# to Parquet files on disk (good for workflows and more in-depth analyses)
"emitter" : "timeseries",
# If choosing "parquet" emitter, must provide "out_dir" with path (relative
-# or absolute) to output folder or "out_uri" with URI for Google Cloud Storage
-# bucket. ONLY CHOOSE ONE.
+# or absolute) to output folder OR "out_uri" with URI for Google Cloud Storage
+# bucket. Only provide one of the above.
"emitter_arg": {"out_dir": "out"},
# See API documentation on vivarium-core for vivarium.core.engine.Engine.
# Can usually leave as false.
@@ -115,7 +126,7 @@ documented in :ref:`/workflows.rst`.
"log_updates" : false,
# Controls output format for ecoli.experiments.ecoli_master_sim.EcoliSim.query.
# Should only be used if choosing "timeseries" emitter. See API documentation
-# for the function above for more information.
+# for the query function for more information.
"raw_output" : true,
# Initial seed used to generate the seeds that are used to initialize
# the pseudorandom number generators in the model. Only used for single
@@ -345,9 +356,9 @@ documented in :ref:`/workflows.rst`.
}
}
-Here are some general rules to remember when writing JSON files:
+Here are some general rules to remember when writing your own JSON config files:

-- String must be enclosed in double quotes (not single quotes)
+- Strings must be enclosed in double quotes (not single quotes)
- Booleans are lowercase
- None values are written as (unquoted) ``null``
- Trailing commas are not allowed
@@ -359,7 +370,7 @@ Output
------

If ``emitter`` was set to ``parquet``, then folders containing the simulation output are
-created as described in :ref:`/output.rst`.
+created as described in :ref:`parquet_emitter`.

If ``division`` is set to True, :py:mod:`~ecoli.experiments.ecoli_master_sim` will
save the initial states of the two daughter cells resulting from cell division
@@ -412,8 +423,7 @@ by setting ``_emit`` to ``True``.
Vivarium includes internal checks to ensure that all ports connected to a
store give the same or compatible (no conflicting keys) schemas for that store.
This means that if you would like to override the schema for a store with many
-connecting ports, you will likely need to override the ports schemas for all
-relevant ports.
+connecting ports, you will need to override the schemas for all the relevant ports.
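
A hedged sketch of this constraint (process and store names are hypothetical;
``ports_schema`` and ``next_update`` are the standard vivarium-core hooks):

.. code-block:: python

    from vivarium.core.process import Process

    class ProcessA(Process):
        def ports_schema(self):
            # Declares listeners > my_value with a default and _emit
            return {"listeners": {"my_value": {"_default": 0.0, "_emit": True}}}

        def next_update(self, timestep, states):
            return {}

    class ProcessB(Process):
        def ports_schema(self):
            # Compatible with ProcessA: identical keys and settings, so
            # the merged schema for the shared store has no conflicts
            return {"listeners": {"my_value": {"_default": 0.0, "_emit": True}}}

        def next_update(self, timestep, states):
            return {}

Changing ``_default`` (or any other schema key) in only one of these classes
would trip the compatibility check, which is why an override has to be applied
to every connecting port.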

------------------
Colony Simulations
4 changes: 2 additions & 2 deletions doc/index.rst
@@ -2,9 +2,9 @@ Welcome to Vivarium *E. coli*'s documentation!
==============================================

Vivarium *E. coli* is a port of the |text|_ to the `Vivarium framework <https://vivarium-collective.github.io>`_.
-For more scientific details about the model, refer to the
+For more scientific modeling details, refer to the
`documentation <https://github.com/CovertLab/wcEcoli/tree/master/docs/processes>`_
-for the model as well its corresponding publication
+for the original model as well as its corresponding publication
(`10.1126/science.aav3751 <https://www.science.org/doi/10.1126/science.aav3751>`_).
This website covers how the model was implemented using Vivarium and describes the user interface
for developing and running the model. We recommend new users read through the sections below in order.
40 changes: 22 additions & 18 deletions doc/output.rst
@@ -17,11 +17,11 @@ to that store. By default, we always emit data for:

- Bulk molecules store located at ``("bulk",)``: The
:py:func:`~ecoli.library.schema.numpy_schema` helper function that we use
-to create the schema for ports to bulk and unique molecule stores automatically
+to create the schema for ports to the bulk store automatically
sets ``_emit`` to True when the ``name`` argument is ``bulk``.
- Listeners located at ``("listeners",)``: The
:py:func:`~ecoli.library.schema.listener_schema` helper function that we use
-to create the schema for ports to stores located somewhere in the store hierarchy
+to create the schema for ports to stores located somewhere in the hierarchy
under the ``listener`` store automatically sets ``_emit`` to True.
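
As a hedged sketch, a process might emit both kinds of data like this (the
listener name and the exact shape of the ``listener_schema`` argument are
illustrative; ``next_update`` is omitted for brevity):

.. code-block:: python

    from vivarium.core.process import Process
    from ecoli.library.schema import listener_schema, numpy_schema

    class MyProcess(Process):
        def ports_schema(self):
            return {
                # numpy_schema sets _emit=True itself when name == "bulk"
                "bulk": numpy_schema("bulk"),
                "listeners": {
                    # listener_schema marks this hypothetical value as emitted
                    "my_listener": listener_schema({"my_value": 0.0}),
                },
            }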

.. _serializing_emits:
@@ -45,6 +45,8 @@ For details about reading data back after it has been saved, refer to
:ref:`ram_read` for the in-memory data format and :ref:`parquet_read`
for the persistent storage format.

+.. _ram_emitter:
+
-----------------
In-Memory Emitter
-----------------
@@ -90,6 +92,8 @@ of the :py:class:`~vivarium.core.registry.Serializer` instance whose
:py:meth:`~vivarium.core.registry.Serializer.can_deserialize` method returns
True on the data to deserialize.

.. _parquet_emitter:

---------------
Parquet Emitter
---------------
@@ -108,19 +112,19 @@ In Hive partitioning, certain keys in data are used to partition the data into folders
In the vEcoli Parquet emitter, the keys used for this purpose are the experiment ID,
variant index, lineage seed (initial seed for cell lineage), generation, and agent ID.
These keys uniquely identify a single cell simulation, meaning each simulation process
-will write data to its own folder in final output with a path like::
+will write data to its own folder in the final output with a path like::

experiment_id={}/variant={}/lineage_seed={}/generation={}/agent_id={}

This allows workflows that run simulations with many variant simulation data objects,
lineage seeds, generations, and agent IDs to all write data to the same main output
-folder without overwriting any data.
+folder without simulations overwriting one another.
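
For example, a Hive-aware reader like DuckDB can recover the partition keys
directly from the folder names (a hedged sketch; paths assume an ``out_dir``
of ``out`` and an experiment ID of ``exp_one``):

.. code-block:: python

    import duckdb

    # The key=value folder names (variant, lineage_seed, generation,
    # agent_id) become ordinary columns when hive_partitioning is enabled,
    # so one query can span every cell in the experiment.
    cells = duckdb.sql("""
        SELECT DISTINCT variant, lineage_seed, generation, agent_id
        FROM read_parquet(
            'out/experiment_id=exp_one/*/*/*/*/history/*.pq',
            hive_partitioning = true)
    """).df()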

Parquet Files
=============

Because Parquet is a tabular file format (think in terms of columns like a Pandas
-DataFrame), additional serialization steps must be taken after the data to save
+DataFrame), additional serialization steps must be taken after the emit data
has been converted to JSON format in accordance with :ref:`serializing_emits`.
The Parquet emitter (:py:class:`~ecoli.library.parquet_emitter.ParquetEmitter`)
first calls :py:func:`~ecoli.library.parquet_emitter.flatten_dict` in order to
@@ -200,11 +204,11 @@ Schemas constructed with the :py:func:`~ecoli.library.schema.listener_schema` helper
function can populate this metadata concisely. These metadata values are compiled for
all stores in the simulation state hierarchy by
:py:meth:`~ecoli.experiments.ecoli_master_sim.EcoliSim.get_output_metadata`. In the
-saved configuration Parquet file, the metadata values will be located under
+saved configuration Parquet file, the metadata values will be located in
columns with names equal to the double-underscore concatenated store path
prefixed by ``output_metadata__``. For convenience, the
:py:func:`~ecoli.library.parquet_emitter.get_field_metadata` function can be used in
-analysis scripts to read back this metadata.
+analysis scripts to read this metadata.

``history``
-----------
Expand All @@ -213,13 +217,13 @@ Each simulation will save Parquet files containing serialized simulation output
inside its corresponding Hive partition under the ``history`` folder. The columns in
these Parquet files come from flattening the hierarchy of emitted stores. To leverage
Parquet's columnar compression and efficient reading, we batch many time steps worth
-of emits into a temporary file before reading them into a
+of emits into a temporary newline-delimited JSON file before reading them into a
`PyArrow <https://arrow.apache.org/docs/python/index.html>`_ table where each row
-contains the column values for a single time step. This PyArrow table can then be
+contains the column values for a single time step. This PyArrow table is then
written to a Parquet file named ``{batch size * number of batches}.pq`` (e.g.
``400.pq``, ``800.pq``, etc. for a batch size of 400). The default batch size of
400 has been tuned for our current model but can be adjusted via ``emits_to_batch``
-under the ``emitter_arg`` option in configuration JSONs.
+under the ``emitter_arg`` option in a configuration JSON.
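
For example, a config that halves the default batch size might look like this
(a hedged sketch; the ``out_dir`` value is illustrative):

.. code-block:: json

    {
        "emitter": "parquet",
        "emitter_arg": {"out_dir": "out", "emits_to_batch": 200}
    }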

.. _parquet_read:

@@ -240,19 +244,19 @@ to read data using DuckDB. These include:
``config_sql`` that reads data from Parquet files with filters applied when
run using :py:mod:`runscripts.analysis`.
- :py:func:`~ecoli.library.parquet_emitter.num_cells`: Quickly get a count of
-the number of cells worth of data included in a SQL query
+the number of cells whose data is included in a SQL query
- :py:func:`~ecoli.library.parquet_emitter.skip_n_gens`: Add a filter to an SQL
query to skip the first N generations worth of data
-- :py:func:`~ecoli.library.parquet_emitter.ndlist_to_ndarray`: Convert a Parquet
+- :py:func:`~ecoli.library.parquet_emitter.ndlist_to_ndarray`: Convert a PyArrow
column of nested lists into a N-D Numpy array
- :py:func:`~ecoli.library.parquet_emitter.ndarray_to_ndlist`: Convert a N-D Numpy
-array into a Parquet column of nested lists
+array into a PyArrow column of nested lists
- :py:func:`~ecoli.library.parquet_emitter.ndidx_to_duckdb_expr`: Get a DuckDB SQL
expression which can be included in a ``SELECT`` statement that uses Numpy-style
indexing to retrieve values from a nested list Parquet column
- :py:func:`~ecoli.library.parquet_emitter.named_idx`: Get a DuckDB SQL expression
-which can be included in a ``SELECT`` statement that extracts certain indices
-of values from a nested list Parquet column and returns them as new named columns
+which can be included in a ``SELECT`` statement that extracts values at certain indices
+from each row of a nested list Parquet column and returns them as individually named columns
- :py:func:`~ecoli.library.parquet_emitter.get_field_metadata`: Read saved store
metadata (see :ref:`configuration_parquet`)
- :py:func:`~ecoli.library.parquet_emitter.get_config_value`: Read option from
@@ -264,7 +268,7 @@ to read data using DuckDB. These include:
too large to read into memory all at once).

.. warning::
-    Parquet lists are 1-indexed. The :py:func:`~ecoli.library.parquet_emitter.ndidx_to_duckdb_expr`
+    Parquet lists are 1-indexed. :py:func:`~ecoli.library.parquet_emitter.ndidx_to_duckdb_expr`
    and :py:func:`~ecoli.library.parquet_emitter.named_idx` automatically add 1 to
    user-supplied indices.
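
The offset exists because list indexing in DuckDB SQL is itself 1-based,
as this self-contained example shows:

.. code-block:: python

    import duckdb

    # list_extract(list, 1) returns the FIRST element, not the second
    print(duckdb.sql("SELECT list_extract([10, 20, 30], 1) AS first").fetchall())
    # [(10,)]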

@@ -274,7 +278,7 @@ Construct SQL Queries
The true power of DuckDB is unlocked when SQL queries are iteratively constructed. This can be
accomplished in one of two ways:

-- For simpler queries, you can wrap a complete DuckDB SQL expression in parenthesis to use as
+- For simpler queries, you can wrap a complete DuckDB SQL expression in parentheses to use as
the input table to another query. For example, to calculate the average cell and dry mass
over all time steps for all cells accessible to an analysis script:

Expand All @@ -286,7 +290,7 @@ accomplished in one of two ways:
)
)
-In this case, ``history_sql`` can be slotted in programmatically using an f-string.
+``history_sql`` can be slotted in programmatically using an f-string.
- For more advanced, multi-step queries, you can use
`common table expressions <https://duckdb.org/docs/sql/query_syntax/with.html>`_ (CTEs).
For example, you can first average over all time steps within each cell and
then aggregate those per-cell values across cells.
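
A hedged sketch of such a CTE-based query (the mass column name is an
assumption carried over from the earlier example):

.. code-block:: python

    # history_sql is provided to analysis scripts (see the helper list above)
    query = f"""
        WITH avg_per_cell AS (
            SELECT avg(listeners__mass__cell_mass) AS avg_cell_mass,
                   experiment_id, variant, lineage_seed, generation, agent_id
            FROM ({history_sql})
            GROUP BY experiment_id, variant, lineage_seed, generation, agent_id
        )
        SELECT avg(avg_cell_mass) AS mean_cell_mass FROM avg_per_cell
    """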