1 parent ae44903, commit 8d0a786
Showing 2 changed files with 138 additions and 188 deletions.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,195 +1,144 @@ | ||
(formats-explanation)= | ||
# File Formats and Directory Formats | ||
|
||
```{eval-rst} | ||
:term:`Formats <Format>` are on-disk representations of :term:`semantic types | ||
<Semantic Type>`, and are the materialized ``data/`` saved in the | ||
:term:`payload <Payload>` by the :term:`framework <Framework>` when reading and | ||
writing :term:`artifacts <Artifact>`. | ||
QIIME 2 doesn't have much of an opinion on how data should be represented or | ||
stored, so long as it *can* be represented or stored in an :term:`archive`. | ||
File Formats | ||
------------ | ||
The simplest ``Formats`` are the :class:`TextFileFormat | ||
<qiime2.plugin.TextFileFormat>` and the :class:`BinaryFileFormat | ||
<qiime2.plugin.BinaryFileFormat>`. These formats represent a single file with a | ||
fixed on-disk representation. | ||
Validation | ||
.......... | ||
Both types of ``FileFormat`` support validation hooks --- this is (typically) a | ||
small bit of code that is run when initially loading a file from an | ||
:term:`archive` - it allows the framework to ensure that the the data contained | ||
within the :term:`archive` at least *looks* like its declared :term:`type`. | ||
This works very well for on-the-fly loading and saving, and goes a long way to | ||
preventing corrupt or invalid data from persisting. The one "gotcha" here is | ||
that in order to keep things quick, we typically recommend that "minimal" | ||
validation occurs over a limited subset of the file (e.g. the first 10 records | ||
in a FASTQ file). Because of this, formats allow for multiple levels of | ||
sniffing to be defined (currently there are two levels, ``min`` and ``max``). | ||
.. code-block:: python | ||
class IntSequenceFormat(TextFileFormat): | ||
""" | ||
A sequence of integers stored on new lines in a file. Since this is a | ||
sequence, the integers have an order and repetition of elements is | ||
allowed. Sequential values must have an inter-value distance other than | ||
3 to be valid. | ||
""" | ||
def _validate_n_ints(self, n): | ||
with self.open() as fh: | ||
last_val = None | ||
for idx, line in enumerate(fh, 1): | ||
if n is not None and idx >= n: | ||
break | ||
try: | ||
val = int(line.rstrip('\n')) | ||
except (TypeError, ValueError): | ||
raise ValidationError("Line %d is not an integer." % idx) | ||
if last_val is not None and last_val + 3 == val: | ||
raise ValidationError("Line %d is 3 more than line %d" | ||
% (idx, idx-1)) | ||
last_val = val | ||
# The hook is ``_validate_``, but the public method exposed by the | ||
# Framework is ``validate``. | ||
def _validate_(self, level): | ||
record_map = {'min': 5, 'max': None} | ||
self._validate_n_ints(record_map[level]) | ||
format_instance = IntSequenceFormat(temp_dir.name, mode='r') | ||
format_instance.validate() # Shouldn't error! | ||
In the fictional format example above, when ``validate`` is called with | ||
``level='min'``, the ``_validate_`` hook will check the first 5 records, | ||
otherwise, when ``level='max'``, it will check the entire file. | ||
Astute observers might notice that the method defined in the | ||
``IntSequenceFormat`` is called ``_validate_``, but the method called on the | ||
``format_instance`` was ``validate`` --- this is because defining format | ||
validation is optional (although highly recommended!). Every format has a | ||
``validate`` method available to interfaces (for performing ad-hoc validation), | ||
the framework will check for the presence of a ``_validate_`` method on the | ||
format in question, and will include that method as part of more general | ||
validations that the framework will perform. The aim here is that the framework | ||
is capable of ensuring common basic patterns, like presence of required files, | ||
while the ``_validate_`` method is the place for the format developer to | ||
declare any special "business" logic necessary for ensuring the validity of | ||
their format. | ||
Text File Formats | ||
................. | ||
The :class:`TextFileFormat <qiime2.plugin.TextFileFormat>` is for creating | ||
text-based formats (e.g. FASTQ, TSV, etc.). An example of one of these formats | ||
is the |DNAFASTAFormat|_, used for storing FASTA data. | ||
.. note:: This format defines a ``sniff`` hook, instead of ``_validate_`` - | ||
this is a now-deprecated form of validation that is being replaced with the | ||
multi-level validation supported with ``_validate_``. | ||
Binary File Formats | ||
................... | ||
The :class:`BinaryFileFormat <qiime2.plugin.BinaryFileFormat>` is for creating | ||
binary formats (e.g. BIOM, gzip, etc.). An example of one of these formats is | ||
the |FastqGzFormat|_, the format for gzipped FASTQ files. | ||
Directory Formats | ||
----------------- | ||
While many formats can accurately be described using a single file, many | ||
formats exist that require the presence of more than one file present together | ||
as a set. QIIME 2 allows more than one ``FileFormat`` to be combined together | ||
as a :class:`DirectoryFormat <qiime2.plugin.DirectoryFormat>`. The exciting | ||
thing about this is that all of the sniffing, validation, and type-safety of | ||
the individual file formats is multiplied by however many members are expected | ||
to be present within the :class:`DirectoryFormat | ||
<qiime2.plugin.DirectoryFormat>`! | ||
Fixed Layouts | ||
.............. | ||
Some directory layouts can be accurately described with a fixed number of | ||
members. An example of this is the |EMPPairedEndDirFmt|_ --- this directory | ||
format should always be composed of three |FastqGzFormat|_ files --- one for | ||
the forward reads, one for the reverse reads, and one for the barcodes. The | ||
|FastqGzFormat|_ is defined once (the format doesn't need to know about the | ||
sematic difference between biological reads and barcode reads). | ||
.. code-block:: python | ||
{term}`Formats <Format>` in QIIME 2 are on-disk representations of {term}`semantic types <Semantic Type>`. | ||
They define how data is stored in, or read from, the `data/` directory when writing and reading {term}`artifacts <Artifact>`. | ||
|
||
QIIME 2 doesn't have much of an opinion on how data should be represented or stored, so long as it *can* be represented or stored in an {term}`archive`. | ||
|
||
## File Formats | ||
|
||
The simplest `Formats` are the `TextFileFormat` (`qiime2.plugin.TextFileFormat`) and the `BinaryFileFormat` (`qiime2.plugin.BinaryFileFormat`). | ||
These formats represent a single file with a fixed on-disk representation. | ||
|
||
### Validation | ||
|
||
Both types of `FileFormat` support validation. | ||
This is (typically) a small bit of code that is run when initially loading a file from an {term}`archive`; it allows the framework to ensure that the data contained within the {term}`archive` at least *looks* like its declared {term}`type`. | ||
This works very well for on-the-fly loading and saving, and goes a long way to preventing corrupt or invalid data from persisting. | ||
The one "gotcha" here is that in order to keep things quick, we typically recommend that "minimal" validation occurs over a limited subset of the file (e.g. the first 10 records in a FASTQ file). | ||
Because of this, formats allow for multiple levels of validation to be defined. | ||
As of this writing (March 2024) there are two validation levels: `min` and `max`. | ||
|
||
Here we provide an example of a `TextFileFormat` definition, with a focus on the `_validate_` method. | ||
|
||
```python | ||
class IntSequenceFormat(TextFileFormat): | ||
    """ | ||
    A sequence of integers stored on new lines in a file. | ||
    To make validation more interesting, values in the list can be any integer as long | ||
    as that integer is not equal to the previous value plus 3 | ||
    (i.e., `line[i] != (line[i-1]) + 3`). | ||
    """ | ||
    def _validate_n_ints(self, n): | ||
        with self.open() as fh: | ||
            previous_val = None | ||
            for idx, line in enumerate(fh, 1): | ||
                if n is not None and idx > n: | ||
                    # we have checked the first n records, | ||
                    # so bail out | ||
                    break | ||
                text = line.rstrip('\n') | ||
                try: | ||
                    val = int(text) | ||
                except (TypeError, ValueError): | ||
                    raise ValidationError( | ||
                        f"Line {idx} contains {text!r}, but must be an integer.") | ||
                if previous_val is not None and previous_val + 3 == val: | ||
                    raise ValidationError( | ||
                        f"Value on line {idx} is 3 more than the value on " | ||
                        f"line {idx-1}.") | ||
                previous_val = val | ||
|
||
    # `_validate_` is exposed through the public method `validate`. | ||
    def _validate_(self, level): | ||
        record_map = {'min': 5, 'max': None} | ||
        self._validate_n_ints(record_map[level]) | ||
|
||
format_instance = IntSequenceFormat(temp_dir.name, mode='r') | ||
format_instance.validate()  # Shouldn't error! | ||
``` | ||
|
||
In the `IntSequenceFormat` example, when `validate` is called with `level='min'`, `_validate_` will check the first 5 records. | ||
Otherwise, when `level='max'`, it will check the entire file. | ||
|
||
Astute observers might notice that the method defined in the `IntSequenceFormat` is called `_validate_`, but the method called on the `format_instance` was `validate`. | ||
This is because defining format validation is optional (although highly recommended!). | ||
|
||
```{warning} | ||
We consider skipping validation altogether when defining formats to be a plugin development anti-pattern. | ||
For more information, see [](antipattern-skipping-validation). | ||
``` | ||
|
||
Every format has a `validate` method available to interfaces (for performing ad-hoc validation). | ||
The framework will check for the presence of a `_validate_` method on the format in question, and if it exists it will include that method as part of more general validations that the framework will perform. | ||
The aim here is that the framework is capable of ensuring common basic patterns, like presence of required files, while the `_validate_` method is the place for the format developer to declare any special "business" logic necessary for ensuring the validity of their format. | ||
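The two validation levels described above can be sketched without the framework at all. This is a minimal, standalone illustration rather than QIIME 2's implementation: `validate_int_lines` and the local `ValidationError` class are hypothetical stand-ins, and the limit of 5 records for `min` simply mirrors the `IntSequenceFormat` example.

```python
class ValidationError(Exception):
    """Hypothetical stand-in for the framework's validation error."""


def validate_int_lines(lines, level="max"):
    """Check that each line is an integer not equal to the previous value plus 3.

    With level='min' only the first 5 records are checked; with level='max'
    every record is checked.
    """
    limits = {"min": 5, "max": None}
    if level not in limits:
        raise ValueError(f"Unknown validation level: {level!r}")
    limit = limits[level]

    previous = None
    for idx, line in enumerate(lines, 1):
        if limit is not None and idx > limit:
            break  # the requested number of records has been checked
        text = line.rstrip("\n")
        try:
            value = int(text)
        except ValueError:
            raise ValidationError(
                f"Line {idx} contains {text!r}, but must be an integer.")
        if previous is not None and previous + 3 == value:
            raise ValidationError(
                f"Value on line {idx} is 3 more than the value on line {idx - 1}.")
        previous = value


records = ["1\n", "2\n", "4\n", "8\n", "16\n", "not-a-number\n"]

# 'min' stops after five records, so the invalid sixth line is never read.
validate_int_lines(records, level="min")

# 'max' reads every record and reports the invalid sixth line.
try:
    validate_int_lines(records, level="max")
    max_error = None
except ValidationError as err:
    max_error = str(err)
```

The trade-off is visible here: `min` is fast but can miss problems later in the file, which is why a full `max` validation is still worth running when time allows.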
|
||
### Text File Formats | ||
|
||
The `TextFileFormat` (`qiime2.plugin.TextFileFormat`) is for creating text-based formats (e.g. FASTQ, TSV, etc.). | ||
An example of one of these formats is the [`DNAFASTAFormat`](https://github.com/qiime2/q2-types/blob/e25f9355958755343977e037bbe39110cfb56a63/q2_types/feature_data/_format.py#L147), used for storing FASTA data. | ||
|
||
|
||
### Binary File Formats | ||
|
||
The `BinaryFileFormat` (`qiime2.plugin.BinaryFileFormat`) is for creating binary formats (e.g. BIOM, gzip, etc.). | ||
An example of one of these formats is the [`FastqGzFormat`](https://github.com/qiime2/q2-types/blob/e25f9355958755343977e037bbe39110cfb56a63/q2_types/per_sample_sequences/_format.py#L236), a format for storing gzipped FASTQ files. | ||
|
||
## Directory Formats | ||
|
||
While many formats can accurately be described using a single file, others require more than one file to be present together as a set. | ||
QIIME 2 allows more than one `FileFormat` to be combined together as a `DirectoryFormat` (`qiime2.plugin.DirectoryFormat`). | ||
|
||
### Fixed Layouts | ||
|
||
Some directory layouts can be accurately described with a fixed number of members. | ||
An example of this is the [`EMPPairedEndDirFmt`](https://github.com/qiime2/q2-demux/blob/6e9a0cc8841a9cfbb5f517a256872700c7b75732/q2_demux/_format.py#L28). | ||
This directory format is always composed of three [`FastqGzFormat`](https://github.com/qiime2/q2-types/blob/e25f9355958755343977e037bbe39110cfb56a63/q2_types/per_sample_sequences/_format.py#L236) files: one for the forward reads (`forward.fastq.gz`), one for the reverse reads (`reverse.fastq.gz`), and one for the barcodes (`barcodes.fastq.gz`). | ||
The underlying `FastqGzFormat` is defined once --- it doesn't need to know about the semantic difference between biological reads and barcode reads, unlike the `EMPPairedEndDirFmt` which must be able to differentiate these. | ||
|
||
```python | ||
class EMPPairedEndDirFmt(model.DirectoryFormat): | ||
    forward = model.File(r'forward.fastq.gz', format=FastqGzFormat) | ||
    reverse = model.File(r'reverse.fastq.gz', format=FastqGzFormat) | ||
    barcodes = model.File(r'barcodes.fastq.gz', format=FastqGzFormat) | ||
The individual members are defined using the :class:`File | ||
<qiime2.plugin.model.File>` class. | ||
Variable Layouts | ||
................ | ||
While some layouts are accurately described with a fixed set of members, others | ||
are highly variable, preventing formats from accurately knowing how many files | ||
to expect in its :term:`payload`. An example of this kind of format are any of | ||
the demultiplexed file formats --- when sequences are demultiplexed there is | ||
one (or two) files per sample, but how many samples are there? One study might | ||
have 5 samples, while another has 5000. For these situations the | ||
:class:`DirectoryFormat <qiime2.plugin.DirectoryFormat>` can be configured to | ||
watch for set pattern of filenames present in its :term:`payload`. | ||
.. code-block:: python | ||
class CasavaOneEightSingleLanePerSampleDirFmt(model.DirectoryFormat): | ||
sequences = model.FileCollection( | ||
r'.+_.+_L[0-9][0-9][0-9]_R[12]_001\.fastq\.gz', | ||
format=FastqGzFormat) | ||
@sequences.set_path_maker | ||
def sequences_path_maker(self, sample_id, barcode_id, lane_number, | ||
read_number): | ||
return '%s_%s_L%03d_R%d_001.fastq.gz' % (sample_id, barcode_id, | ||
lane_number, read_number) | ||
Single File Directory Formats | ||
............................. | ||
Currently QIIME 2 requires that all formats registered to a :term:`Semantic | ||
Type` be a directory format, which would be a major pain in the case of the | ||
single file formats detailed above. For those cases, there exists a factory for | ||
quickly constructing directory layouts that contain *only a single file*. This | ||
requirement might be removed in the future, but for now it is a necessary evil | ||
(and also isn't too much extra work for format developers). | ||
.. code-block:: python | ||
DNASequencesDirectoryFormat = model.SingleFileDirectoryFormat( | ||
'DNASequencesDirectoryFormat', 'dna-sequences.fasta', DNAFASTAFormat) | ||
Associating Formats with a Type | ||
------------------------------- | ||
Formats on their own aren't of much use - it is only once they are registered | ||
as a *representation* of a :term:`Semantic Type` that things become | ||
interesting. | ||
.. code-block:: python | ||
plugin.register_formats(EMPPairedEndDirFmt) | ||
# ``RawSequences`` is a Semantic Type | ||
plugin.register_semantic_types(RawSequences, EMPPairedEndSequences) | ||
.. |DNAFASTAFormat| replace:: ``DNAFASTAFormat`` | ||
.. _`DNAFASTAFormat`: https://github.com/qiime2/q2-types/blob/master/q2_types/feature_data/_format.py#L133 | ||
.. |FastqGzFormat| replace:: ``FastqGzFormat`` | ||
.. _`FastqGzFormat`: https://github.com/qiime2/q2-types/blob/master/q2_types/per_sample_sequences/_format.py#L106 | ||
.. |EMPPairedEndDirFmt| replace:: ``EMPPairedEndDirFmt`` | ||
.. _`EMPPairedEndDirFmt`: https://github.com/qiime2/q2-demux/blob/master/q2_demux/_format.py | ||
``` | ||
``` | ||
|
||
The component files of this `DirectoryFormat` are defined using the `File` (`qiime2.plugin.model.File`) class. | ||
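To make the idea of a fixed layout concrete, the sketch below checks a directory for a fixed set of required members using only the standard library. It is an illustration of the concept, not how QIIME 2 performs the check: `REQUIRED_MEMBERS` and `missing_members` are hypothetical names, and the files created are empty placeholders rather than real gzipped FASTQ data.

```python
import tempfile
from pathlib import Path

# Hypothetical stand-in for a fixed layout: the exact set of required members.
REQUIRED_MEMBERS = ("forward.fastq.gz", "reverse.fastq.gz", "barcodes.fastq.gz")


def missing_members(directory, required=REQUIRED_MEMBERS):
    """Return the required file names that are absent from `directory`."""
    directory = Path(directory)
    return [name for name in required if not (directory / name).is_file()]


with tempfile.TemporaryDirectory() as tmp:
    # Create only two of the three members (empty placeholder files).
    for name in ("forward.fastq.gz", "reverse.fastq.gz"):
        (Path(tmp) / name).touch()
    missing_before = missing_members(tmp)

    # Adding the third member completes the layout.
    (Path(tmp) / "barcodes.fastq.gz").touch()
    missing_after = missing_members(tmp)
```

In a real `DirectoryFormat`, each member would additionally be validated against its own file format (here, `FastqGzFormat`), which this sketch does not attempt.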
|
||
### Variable Layouts | ||
|
||
While some layouts are accurately described with a fixed set of members, others are highly variable, preventing a format from knowing in advance how many files to expect in its {term}`payload`. | ||
Examples of this kind of format are the demultiplexed file formats --- when sequences are demultiplexed there are one (or two) files per sample, but how many samples are there? | ||
One study might have 5 samples, while another has 5000. | ||
For these situations the `DirectoryFormat` (`qiime2.plugin.DirectoryFormat`) can be configured to watch for a set pattern of filenames present in its {term}`payload`. | ||
An example of this is the [`CasavaOneEightSingleLanePerSampleDirFmt`](https://github.com/qiime2/q2-types/blob/e25f9355958755343977e037bbe39110cfb56a63/q2_types/per_sample_sequences/_format.py#L292) class, which stores demultiplexed sequence data in files named with a pattern used by Illumina's Casava v1.8 software. | ||
|
||
```python | ||
class CasavaOneEightSingleLanePerSampleDirFmt(model.DirectoryFormat): | ||
    sequences = model.FileCollection( | ||
        r'.+_.+_L[0-9][0-9][0-9]_R[12]_001\.fastq\.gz', | ||
        format=FastqGzFormat) | ||
|
||
    @sequences.set_path_maker | ||
    def sequences_path_maker(self, sample_id, barcode_id, lane_number, | ||
                             read_number): | ||
        return '%s_%s_L%03d_R%d_001.fastq.gz' % (sample_id, barcode_id, | ||
                                                 lane_number, read_number) | ||
``` | ||
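The relationship between the filename pattern and the path maker can be checked in isolation. The sketch below reuses the regular expression and the filename template from the class above, outside of QIIME 2. The sample and barcode identifiers are made up for illustration, and the use of `re.fullmatch` is an assumption about how the pattern is applied, not a statement about the framework's internals.

```python
import re

# Same pattern and filename template as in the class above.
PATTERN = re.compile(r'.+_.+_L[0-9][0-9][0-9]_R[12]_001\.fastq\.gz')


def sequences_path_maker(sample_id, barcode_id, lane_number, read_number):
    """Build a filename from its parts, as the path maker above does."""
    return '%s_%s_L%03d_R%d_001.fastq.gz' % (sample_id, barcode_id,
                                             lane_number, read_number)


# A name produced by the path maker should satisfy the collection's pattern.
path = sequences_path_maker('sampleA', 'ACGT', 1, 2)
matches = PATTERN.fullmatch(path) is not None

# A read number other than 1 or 2 falls outside the pattern.
rejected = PATTERN.fullmatch('sampleA_ACGT_L001_R3_001.fastq.gz') is None
```

Keeping the path maker and the pattern consistent matters: a path maker that produced names the pattern rejects would write files that the format could not later recognize as members.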
|
||
## Single File Directory Formats | ||
Currently QIIME 2 requires that all formats registered to a {term}`Semantic Type` be a directory format, which would be inconvenient for the single file formats detailed above. | ||
For those cases, there exists a factory for quickly constructing directory layouts that contain *only a single file*. | ||
This requirement might be removed in the future, but for now it is a necessary evil (and also isn't too much extra work for format developers). | ||
|
||
```python | ||
DNASequencesDirectoryFormat = model.SingleFileDirectoryFormat( | ||
    'DNASequencesDirectoryFormat', 'dna-sequences.fasta', DNAFASTAFormat) | ||
``` | ||
|
||
## Associating Formats with a Type | ||
|
||
Formats on their own aren't of much use. | ||
It is only once they are registered with a {term}`Semantic Type` to define an *artifact class* that things become interesting. | ||
Artifact classes define the data that can be provided as input or generated as output by QIIME 2 `Actions`. | ||
An example of this can be seen in the [registration of the `SampleData[PairedEndSequencesWithQuality]` artifact class](https://github.com/qiime2/q2-types/blob/e25f9355958755343977e037bbe39110cfb56a63/q2_types/per_sample_sequences/_type.py#L66). |