
Roadmap: Move away from HDF5

Issues with HDF5

Deleting data

When an HDF5 dataset (the HDF5 equivalent of a numpy array) is deleted, the series is marked as deleted but the space is not reclaimed. The space can only be recovered by running the h5repack command, which takes many minutes to run on datasets the size of the Covid Symptom Study. It is therefore very easy for a user to consume their entire drive by creating and deleting temporary fields. As a result, users have to adopt unnecessarily complex workflows that categorise datasets as 'source', 'temporary', and 'destination', which adds to the cognitive load of using ExeTera.

Fragility

Any write-based interaction with an HDF5 dataset can leave the dataset in an invalid and irretrievable state if the execution of an HDF5 command is interrupted, so we have to protect all HDF5 write commands in such a way that they cannot be interrupted. We currently use threads for this purpose. This is another reason why we suggest that users treat a dataset as a read-only 'source' dataset once it has been generated.
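The following is a minimal sketch of the "protect writes with a thread" pattern described above; the function name and field names are illustrative, not ExeTera's actual API. It relies on the fact that Python delivers KeyboardInterrupt to the main thread, so a write running on a non-daemon worker thread is allowed to finish even if the user presses Ctrl-C.

```python
# Illustrative sketch only; names such as uninterruptible_write are assumptions.
import threading

import h5py
import numpy as np


def uninterruptible_write(dataset_path, field_name, values):
    """Run an HDF5 write on a worker thread so that KeyboardInterrupt,
    which Python delivers to the main thread, cannot abort it mid-write."""
    def _write():
        with h5py.File(dataset_path, 'r+') as hf:
            hf.create_dataset(field_name, data=values)

    writer = threading.Thread(target=_write)
    writer.start()
    while writer.is_alive():
        # join with a timeout so the main thread stays responsive; a Ctrl-C
        # raised here interrupts only the main thread, and the non-daemon
        # worker is still allowed to finish the write before the process exits
        writer.join(timeout=0.1)


# usage (assumes 'example.hdf5' already exists and the field does not):
# uninterruptible_write('example.hdf5', 'patients/age', np.arange(1000))
```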

Read-based concurrency

TODO: qualify this statement with the HDF5 documentation. Read-based concurrency appears to be unnecessarily limited for HDF5 files.

h5py iterator performance

Whilst iteration will always be slower than performing numpy-style operations on fields, h5py adds several further orders of magnitude of overhead to iterative access. This can certainly be improved in the upcoming replacement format.
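A minimal sketch (not from the ExeTera codebase) that illustrates the gap: element-wise iteration over an h5py dataset goes through the h5py layer on every access, whereas a single bulk read pulls the data into a numpy array once. The file and field names are illustrative.

```python
import time

import h5py
import numpy as np

with h5py.File('example.hdf5', 'w') as hf:
    hf.create_dataset('values', data=np.arange(100_000))

with h5py.File('example.hdf5', 'r') as hf:
    dset = hf['values']

    start = time.perf_counter()
    total = 0
    for v in dset:            # each element access goes through the h5py layer
        total += v
    print('iterate :', time.perf_counter() - start)

    start = time.perf_counter()
    total = dset[:].sum()     # single bulk read into a numpy array, then sum
    print('bulk    :', time.perf_counter() - start)
```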

Replacement datastore format

This is the datastore format intended to replace the HDF5-based datastore.

  • Data is stored in numpy npy/npz files.
  • Encodings are stored with the data.
  • Metadata is stored in a JSON file in the top-level folder.
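A minimal sketch, under the assumptions above, of writing a single field as an npy file with its encoding stored alongside the data. The folder layout, field names and the write_field helper are illustrative only, not a finalised API.

```python
import json
import os

import numpy as np


def write_field(root, table, name, values, encoding):
    field_dir = os.path.join(root, table, name)
    os.makedirs(field_dir, exist_ok=True)
    # the raw values go into an npy file
    np.save(os.path.join(field_dir, 'values.npy'), values)
    # the encoding (e.g. categorical key mappings) travels with the data
    with open(os.path.join(field_dir, 'encoding.json'), 'w') as f:
        json.dump(encoding, f)


write_field('datastore', 'patients', 'has_symptoms',
            np.array([0, 1, 1, 0], dtype='int8'),
            {'type': 'categorical', 'keys': {'no': 0, 'yes': 1}})
```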

DataStore metadata schema

  • options
    • centralised json document
      • store all metadata in the same document. This requires that any write to a field is reflected in the central metadata document before it is considered complete
    • decentralised json fragments
      • store each metadata item with its field. This requires that all metadata is gathered from the datastore directories as part of loading (both options are illustrated in the sketch after this list)
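Illustrative metadata payloads for the two options, written as Python dict literals; the exact keys and field names are assumptions, not a finalised schema.

```python
# centralised: one document at the top level describing every field
centralised = {
    'format_version': 1,
    'tables': {
        'patients': {
            'fields': {
                'age': {'dtype': 'int32', 'length': 4_500_000},
                'has_symptoms': {'dtype': 'int8', 'encoding': 'categorical'},
            }
        }
    }
}

# decentralised: each field directory carries only its own fragment,
# and the full picture is assembled by walking the directories at load time
age_fragment = {'dtype': 'int32', 'length': 4_500_000}
```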

DataStore field to directory mapping

  • options
    • direct mapping with folders
      • folders represent tables at the top level
    • indirect mapping with folders
      • the JSON metadata schema includes a mapping from each field's logical location to its on-disk location; fields are stored so as to optimise file system usage (see the sketch after this list)
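An illustrative indirect-mapping fragment; the keys and directory names are assumptions. Each logical field name is mapped to wherever its files actually live on disk, so the physical layout can be chosen to suit the file system.

```python
mapping = {
    'patients/age': 'data/0001/a3f2c9',
    'patients/has_symptoms': 'data/0001/b7e410',
}
```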

DataStore concurrency

In general, the DataStore is not designed to be written to (or read from) by multiple users. There is no system being proposed to allow multiple users to treat it like a concurrent access database, although some provision is made for the notion of coordinated reads / writes from multiple threads.

  • options
    • file-based locking:
      • use a file-based locking mechanism to "lock" datasets for writing. This approach may have issues because file locks are OS-specific; libraries such as https://pypi.org/project/filelock/ exist to facilitate this, but there is no 'standard' library in the ecosystem for it (see the sketch after this list)
    • cross-process synchronisation:
      • use a library based on OS synchronisation primitives
        • asyncio
        • multiprocessing
        • dask.distributed (uses asyncio)
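A minimal sketch of the file-based locking option, using the third-party filelock package mentioned above (not currently an ExeTera dependency). The lock-file path and field layout are illustrative and assume the field directory already exists.

```python
import numpy as np
from filelock import FileLock, Timeout

lock = FileLock('datastore/patients/age/.lock')

try:
    # block for up to 10 seconds waiting for other writers to release the lock
    with lock.acquire(timeout=10):
        np.save('datastore/patients/age/values.npy',
                np.arange(100, dtype='int32'))
except Timeout:
    print('another process is writing this field; try again later')
```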

Initial design thinking

Concurrency: initially, do not synchronise; assume a single process is reading from / writing to the data.

Logical <-> directory mapping: this can be implemented immediately, or deferred beyond the initial implementation. The primary concern should be an easy upgrade path. The absence of a logical mapping tag can be taken to mean direct mapping, or an identity logical mapping tag can always be written; the latter is suggested as the best approach.

Metadata schema: a centralised schema can be tried in the first instance. Decentralising the schema later should be a simple operation if it is required.

The primary consideration is how to keep the serialised metadata up to date. Changes to group contents should always be reflected in the datastore schema file, whether the schema elements are scattered or centralised. For scattered elements this is relatively easy, as each fragment can be written to the appropriate location at the appropriate point in time (i.e. when a field is created). When centralised, it is more complicated: the schema serialisation must either be constantly updated and saved, or constantly updated but saved before the process exits (including on exceptions or interruption), or created and serialised just before the process exits. All of these things can be done, but doing them properly requires use of the signal module (see the sketch below).
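A minimal sketch of flushing a centralised schema document on normal exit and on interruption, using the standard-library atexit and signal modules. The in-memory schema dict, the flush_schema helper and the file path are illustrative assumptions.

```python
import atexit
import json
import signal
import sys

schema = {'format_version': 1, 'tables': {}}


def flush_schema(path='datastore/schema.json'):
    # write the current in-memory schema so the on-disk document never lags
    with open(path, 'w') as f:
        json.dump(schema, f)


def handle_termination(signum, frame):
    flush_schema()
    sys.exit(1)


# flush on normal interpreter exit and on SIGINT/SIGTERM
atexit.register(flush_schema)
signal.signal(signal.SIGINT, handle_termination)
signal.signal(signal.SIGTERM, handle_termination)
```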