format rst to md, and text clean-up

caporaso-lab · Mar 11, 2024 · 8d0a786 · 8d0a786
1 parent ae44903
commit 8d0a786
Show file tree

Hide file tree

Showing 2 changed files with 138 additions and 188 deletions.
diff --git a/book/framework/explanations/formats.md b/book/framework/explanations/formats.md
@@ -1,195 +1,144 @@
 (formats-explanation)=
 # File Formats and Directory Formats
 
-```{eval-rst}
-
-:term:`Formats <Format>` are on-disk representations of :term:`semantic types
-<Semantic Type>`, and are the materialized ``data/`` saved in the
-:term:`payload <Payload>` by the :term:`framework <Framework>` when reading and
-writing :term:`artifacts <Artifact>`.
-
-QIIME 2 doesn't have much of an opinion on how data should be represented or
-stored, so long as it *can* be represented or stored in an :term:`archive`.
-
-File Formats
-------------
-
-The simplest ``Formats`` are the :class:`TextFileFormat
-<qiime2.plugin.TextFileFormat>` and the :class:`BinaryFileFormat
-<qiime2.plugin.BinaryFileFormat>`. These formats represent a single file with a
-fixed on-disk representation.
-
-Validation
-..........
-
-Both types of ``FileFormat`` support validation hooks --- this is (typically) a
-small bit of code that is run when initially loading a file from an
-:term:`archive` - it allows the framework to ensure that the the data contained
-within the :term:`archive` at least *looks* like its declared :term:`type`.
-This works very well for on-the-fly loading and saving, and goes a long way to
-preventing corrupt or invalid data from persisting. The one "gotcha" here is
-that in order to keep things quick, we typically recommend that "minimal"
-validation occurs over a limited subset of the file (e.g. the first 10 records
-in a FASTQ file). Because of this, formats allow for multiple levels of
-sniffing to be defined (currently there are two levels, ``min`` and ``max``).
-
-.. code-block:: python
-
-   class IntSequenceFormat(TextFileFormat):
-       """
-       A sequence of integers stored on new lines in a file. Since this is a
-       sequence, the integers have an order and repetition of elements is
-       allowed. Sequential values must have an inter-value distance other than
-       3 to be valid.
-       """
-       def _validate_n_ints(self, n):
-           with self.open() as fh:
-               last_val = None
-               for idx, line in enumerate(fh, 1):
-                   if n is not None and idx >= n:
-                       break
-                   try:
-                       val = int(line.rstrip('\n'))
-                   except (TypeError, ValueError):
-                       raise ValidationError("Line %d is not an integer." % idx)
-                   if last_val is not None and last_val + 3 == val:
-                       raise ValidationError("Line %d is 3 more than line %d"
-                                             % (idx, idx-1))
-                   last_val = val
-
-       # The hook is ``_validate_``, but the public method exposed by the
-       # Framework is ``validate``.
-       def _validate_(self, level):
-           record_map = {'min': 5, 'max': None}
-           self._validate_n_ints(record_map[level])
-
-   format_instance = IntSequenceFormat(temp_dir.name, mode='r')
-   format_instance.validate()  # Shouldn't error!
-
-In the fictional format example above, when ``validate`` is called with
-``level='min'``, the ``_validate_`` hook will check the first 5 records,
-otherwise, when ``level='max'``, it will check the entire file.
-
-Astute observers might notice that the method defined in the
-``IntSequenceFormat`` is called ``_validate_``, but the method called on the
-``format_instance`` was ``validate`` --- this is because defining format
-validation is optional (although highly recommended!).  Every format has a
-``validate`` method available to interfaces (for performing ad-hoc validation),
-the framework will check for the presence of a ``_validate_`` method on the
-format in question, and will include that method as part of more general
-validations that the framework will perform. The aim here is that the framework
-is capable of ensuring common basic patterns, like presence of required files,
-while the ``_validate_`` method is the place for the format developer to
-declare any special "business" logic necessary for ensuring the validity of
-their format.
-
-Text File Formats
-.................
-
-The :class:`TextFileFormat <qiime2.plugin.TextFileFormat>` is for creating
-text-based formats (e.g. FASTQ, TSV, etc.). An example of one of these formats
-is the |DNAFASTAFormat|_, used for storing FASTA data.
-
-.. note:: This format defines a ``sniff`` hook, instead of ``_validate_`` -
-   this is a now-deprecated form of validation that is being replaced with the
-   multi-level validation supported with ``_validate_``.
-
-Binary File Formats
-...................
-
-The :class:`BinaryFileFormat <qiime2.plugin.BinaryFileFormat>` is for creating
-binary formats (e.g. BIOM, gzip, etc.). An example of one of these formats is
-the |FastqGzFormat|_, the format for gzipped FASTQ files.
-
-Directory Formats
------------------
-
-While many formats can accurately be described using a single file, many
-formats exist that require the presence of more than one file present together
-as a set.  QIIME 2 allows more than one ``FileFormat`` to be combined together
-as a :class:`DirectoryFormat <qiime2.plugin.DirectoryFormat>`. The exciting
-thing about this is that all of the sniffing, validation, and type-safety of
-the individual file formats is multiplied by however many members are expected
-to be present within the :class:`DirectoryFormat
-<qiime2.plugin.DirectoryFormat>`!
-
-Fixed Layouts
-..............
-
-Some directory layouts can be accurately described with a fixed number of
-members. An example of this is the |EMPPairedEndDirFmt|_ --- this directory
-format should always be composed of three |FastqGzFormat|_ files --- one for
-the forward reads, one for the reverse reads, and one for the barcodes.  The
-|FastqGzFormat|_ is defined once (the format doesn't need to know about the
-sematic difference between biological reads and barcode reads).
-
-.. code-block:: python
-
+{term}`Formats <Format>` in QIIME 2 are on-disk representations of {term}`semantic types <Semantic Type>`. 
+They are the stored in or read from the `data/` when reading and writing {term}`artifacts <Artifact>`.
+
+QIIME 2 doesn't have much of an opinion on how data should be represented or stored, so long as it *can* be represented or stored in an {term}`archive`.
+
+## File Formats
+
+The simplest `Formats` are the `TextFileFormat` (`qiime2.plugin.TextFileFormat`) and the `BinaryFileFormat` (`qiime2.plugin.BinaryFileFormat`).
+These formats represent a single file with a fixed on-disk representation.
+
+### Validation
+
+Both types of `FileFormat` support validation.
+This is (typically) a small bit of code that is run when initially loading a file from an {term}`archive` that allows the framework to ensure that the the data contained within the {term}`archive` at least *looks* like its declared {term}`type`.
+This works very well for on-the-fly loading and saving, and goes a long way to preventing corrupt or invalid data from persisting.
+The one "gotcha" here is that in order to keep things quick, we typically recommend that "minimal" validation occurs over a limited subset of the file (e.g. the first 10 records in a FASTQ file).
+Because of this, formats allow for multiple levels of sniffing to be defined.
+As of this writing (March 2024) there currently there are two validation levels: `min` and `max`.
+
+Here we provide an example of a `TextFileFormat` definition, with a focus on the `_validate_` function.
+
+```python
+class IntSequenceFormat(TextFileFormat):
+    """
+    A sequence of integers stored on new lines in a file.
+    To make validation more interesting, values in the list can be any integer as long 
+    as that integer is not equal to the previous value plus 3
+    (i.e., `line[i] != (line[i-1]) + 3`). 
+    """
+    def _validate_n_ints(self, n):
+        with self.open() as fh:
+            previous_val = None
+            for idx, line in enumerate(fh, 1):
+                if n is not None and idx >= n:
+                    # we have passed the min validation level,
+                    # so bail out
+                    break
+                try:
+                    val = int(line.rstrip('\n'))
+                except (TypeError, ValueError):
+                    raise ValidationError(
+                        f"Line {idx} contains {val}, but must be an integer.")
+                if previous_val is not None and previous_val + 3 == val:
+                    raise ValidationError(
+                        f"Value on line {idx} is 3 more than the value on "
+                        f"line {idx-1}.")
+                previous_val = val
+
+    # `_validate_` is exposed through the public method `validate`.
+    def _validate_(self, level):
+        record_map = {'min': 5, 'max': None}
+        self._validate_n_ints(record_map[level])
+
+format_instance = IntSequenceFormat(temp_dir.name, mode='r')
+format_instance.validate()  # Shouldn't error!
+```
+
+In the `IntSequenceFormat` example, when `validate` is called with `level='min'`, `_validate_` will check the first 5 records.
+Otherwise, when `level='max'`, it will check the entire file.
+
+Astute observers might notice that the method defined in the `IntSequenceFormat` is called `_validate_`, but the method called on the `format_instance` was `validate`.
+This is because defining format validation is optional (although highly recommended!).
+
+```{warning}
+We consider skipping validation all together when defining formats to be a plugin development anti-pattern.
+For more information, see [](antipattern-skipping-validation).
+```
+
+Every format has a `validate` method available to interfaces (for performing ad-hoc validation).
+The framework will check for the presence of a `_validate_` method on the format in question, and if it exists it will include that method as part of more general validations that the framework will perform.
+The aim here is that the framework is capable of ensuring common basic patterns, like presence of required files, while the `_validate_` method is the place for the format developer to declare any special "business" logic necessary for ensuring the validity of their format.
+
+### Text File Formats
+
+The `TextFileFormat` (`qiime2.plugin.TextFileFormat`) is for creating text-based formats (e.g. FASTQ, TSV, etc.).
+An example of one of these formats is the [`DNAFASTAFormat`](https://github.com/qiime2/q2-types/blob/e25f9355958755343977e037bbe39110cfb56a63/q2_types/feature_data/_format.py#L147), used for storing FASTA data.
+
+
+### Binary File Formats
+
+The `BinaryFileFormat` (`qiime2.plugin.BinaryFileFormat`) is for creating binary formats (e.g. BIOM, gzip, etc.).
+An example of one of these formats is the [`FastqGzFormat`](https://github.com/qiime2/q2-types/blob/e25f9355958755343977e037bbe39110cfb56a63/q2_types/per_sample_sequences/_format.py#L236), a format for storing gzipped FASTQ files.
+
+## Directory Formats
+
+While many formats can accurately be described using a single file, many formats exist that require the presence of more than one file present together as a set.
+QIIME 2 allows more than one `FileFormat` to be combined together as a `DirectoryFormat` (`qiime2.plugin.DirectoryFormat`).
+
+### Fixed Layouts
+
+Some directory layouts can be accurately described with a fixed number of members.
+An example of this is the [`EMPPairedEndDirFmt`](https://github.com/qiime2/q2-demux/blob/6e9a0cc8841a9cfbb5f517a256872700c7b75732/q2_demux/_format.py#L28).
+This directory format is always composed of three [`FastqGzFormat`](https://github.com/qiime2/q2-types/blob/e25f9355958755343977e037bbe39110cfb56a63/q2_types/per_sample_sequences/_format.py#L236) files: one for the forward reads (`forward.fastq.gz`), one for the reverse reads (`reverse.fastq.gz`), and one for the barcodes (`barcodes.fastq.gz`).
+The underlying `FastqGzFormat` is defined once --- it doesn't need to know about the sematic difference between biological reads and barcode reads, unlike the `EMPPairedEndDirFmt` which must be able to differentiate these.
+
+```python
    class EMPPairedEndDirFmt(model.DirectoryFormat):
        forward = model.File(r'forward.fastq.gz', format=FastqGzFormat)
        reverse = model.File(r'reverse.fastq.gz', format=FastqGzFormat)
        barcodes = model.File(r'barcodes.fastq.gz', format=FastqGzFormat)
-
-The individual members are defined using the :class:`File
-<qiime2.plugin.model.File>` class.
-
-Variable Layouts
-................
-
-While some layouts are accurately described with a fixed set of members, others
-are highly variable, preventing formats from accurately knowing how many files
-to expect in its :term:`payload`. An example of this kind of format are any of
-the demultiplexed file formats --- when sequences are demultiplexed there is
-one (or two) files per sample, but how many samples are there? One study might
-have 5 samples, while another has 5000. For these situations the
-:class:`DirectoryFormat <qiime2.plugin.DirectoryFormat>` can be configured to
-watch for set pattern of filenames present in its :term:`payload`.
-
-.. code-block:: python
-
-   class CasavaOneEightSingleLanePerSampleDirFmt(model.DirectoryFormat):
-       sequences = model.FileCollection(
-           r'.+_.+_L[0-9][0-9][0-9]_R[12]_001\.fastq\.gz',
-           format=FastqGzFormat)
-
-       @sequences.set_path_maker
-       def sequences_path_maker(self, sample_id, barcode_id, lane_number,
-                                read_number):
-           return '%s_%s_L%03d_R%d_001.fastq.gz' % (sample_id, barcode_id,
-                                                    lane_number, read_number)
-
-Single File Directory Formats
-.............................
-
-Currently QIIME 2 requires that all formats registered to a :term:`Semantic
-Type` be a directory format, which would be a major pain in the case of the
-single file formats detailed above. For those cases, there exists a factory for
-quickly constructing directory layouts that contain *only a single file*. This
-requirement might be removed in the future, but for now it is a necessary evil
-(and also isn't too much extra work for format developers).
-
-.. code-block:: python
-
-   DNASequencesDirectoryFormat = model.SingleFileDirectoryFormat(
-       'DNASequencesDirectoryFormat', 'dna-sequences.fasta', DNAFASTAFormat)
-
-Associating Formats with a Type
--------------------------------
-
-Formats on their own aren't of much use - it is only once they are registered
-as a *representation* of a :term:`Semantic Type` that things become
-interesting.
-
-.. code-block:: python
-
-   plugin.register_formats(EMPPairedEndDirFmt)
-   # ``RawSequences`` is a Semantic Type
-   plugin.register_semantic_types(RawSequences, EMPPairedEndSequences)
-
-.. |DNAFASTAFormat| replace:: ``DNAFASTAFormat``
-.. _`DNAFASTAFormat`: https://github.com/qiime2/q2-types/blob/master/q2_types/feature_data/_format.py#L133
-.. |FastqGzFormat| replace:: ``FastqGzFormat``
-.. _`FastqGzFormat`: https://github.com/qiime2/q2-types/blob/master/q2_types/per_sample_sequences/_format.py#L106
-.. |EMPPairedEndDirFmt| replace:: ``EMPPairedEndDirFmt``
-.. _`EMPPairedEndDirFmt`: https://github.com/qiime2/q2-demux/blob/master/q2_demux/_format.py
-```
+```
+
+The component files of this `DirectoryFormat` are defined using the `File` (`qiime2.plugin.model.File`) class.
+
+### Variable Layouts
+
+While some layouts are accurately described with a fixed set of members, others are highly variable, preventing formats from accurately knowing how many files to expect in its {term}`payload`.
+An example of this kind of format are any of the demultiplexed file formats --- when sequences are demultiplexed there is one (or two) files per sample, but how many samples are there?
+One study might have 5 samples, while another has 5000.
+For these situations the `DirectoryFormat` (`qiime2.plugin.DirectoryFormat`) can be configured to watch for set pattern of filenames present in its {term}`payload`.
+An example of this is the [`CasavaOneEightSingleLanePerSampleDirFmt`](https://github.com/qiime2/q2-types/blob/e25f9355958755343977e037bbe39110cfb56a63/q2_types/per_sample_sequences/_format.py#L292) class, which stores demultiplexed sequence data in files named with a pattern used by Illumina's Casava v1.8 software.
+
+```python
+class CasavaOneEightSingleLanePerSampleDirFmt(model.DirectoryFormat):
+    sequences = model.FileCollection(
+        r'.+_.+_L[0-9][0-9][0-9]_R[12]_001\.fastq\.gz',
+        format=FastqGzFormat)
+
+    @sequences.set_path_maker
+    def sequences_path_maker(self, sample_id, barcode_id, lane_number,
+                            read_number):
+        return '%s_%s_L%03d_R%d_001.fastq.gz' % (sample_id, barcode_id,
+                                                lane_number, read_number)
+```
+
+## Single File Directory Formats
+Currently QIIME 2 requires that all formats registered to a {term}`Semantic Type` be a directory format.
+For those cases, there exists a factory for quickly constructing directory layouts that contain *only a single file*.
+This requirement might be removed in the future, but for now it is a necessary evil (and also isn't too much extra work for format developers).
+
+```python
+DNASequencesDirectoryFormat = model.SingleFileDirectoryFormat(
+    'DNASequencesDirectoryFormat', 'dna-sequences.fasta', DNAFASTAFormat)
+```
+
+## Associating Formats with a Type
+
+Formats on their own aren't of much use.
+It is only once they are registered with a {term}`Semantic Type` to define an *artifact class* that things become interesting. 
+Artifact classes define the data that can be provided as input or generated as output by QIIME 2 `Actions`.
+An example of this can be seen in the [registration of the `SampleData[PairedEndSequencesWithQuality]` artifact class](https://github.com/qiime2/q2-types/blob/e25f9355958755343977e037bbe39110cfb56a63/q2_types/per_sample_sequences/_type.py#L66). 
diff --git a/book/plugins/references/antipatterns.md b/book/plugins/references/antipatterns.md
@@ -80,6 +80,7 @@ This is essential for ensuring that workflows using your `Action(s)` will be ful
 These are two of the key benefits of making your methods accessible through QIIME 2 plugins, and are expectations of QIIME 2 users.
 The cost is going through the upfront work of associating inputs with semantic types and formats.
 
+(antipattern-skipping-validation)=
 ## Skipping format validation
 
 To save time (either during development, or at run time) plugin developers will sometimes skip implementation of format validation when they create new formats. For example: