Writing to H5AD fails if AnnData contains datetime64 objects #455

lazappi · 2020-11-16T14:32:08Z

Writing an AnnData object to disk fails if it contains numpy datetime64 objects. Example:

import anndata
import numpy as np

ad = anndata.AnnData()

ad.uns['now'] = np.datetime64('now')

ad.write("test_h5py.h5ad")

ERROR MESSAGE

Traceback (most recent call last):
  File "/Users/luke.zappia/miniconda3/envs/test-h5py/lib/python3.9/site-packages/anndata/_io/utils.py", line 188, in func_wrapper
    return func(elem, key, val, *args, **kwargs)
  File "/Users/luke.zappia/miniconda3/envs/test-h5py/lib/python3.9/site-packages/anndata/_io/h5ad.py", line 144, in write_not_implemented
    raise NotImplementedError(
NotImplementedError: Failed to write value for uns/now, since a writer for type <class 'numpy.datetime64'> has not been implemented yet.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/luke.zappia/Desktop/test_h5py.py", line 8, in <module>
    ad.write("test_h5py.h5ad")
  File "/Users/luke.zappia/miniconda3/envs/test-h5py/lib/python3.9/site-packages/anndata/_core/anndata.py", line 1846, in write_h5ad
    _write_h5ad(
  File "/Users/luke.zappia/miniconda3/envs/test-h5py/lib/python3.9/site-packages/anndata/_io/h5ad.py", line 112, in write_h5ad
    write_attribute(f, "uns", adata.uns, dataset_kwargs=dataset_kwargs)
  File "/Users/luke.zappia/miniconda3/envs/test-h5py/lib/python3.9/functools.py", line 877, in wrapper
    return dispatch(args[0].__class__)(*args, **kw)
  File "/Users/luke.zappia/miniconda3/envs/test-h5py/lib/python3.9/site-packages/anndata/_io/h5ad.py", line 126, in write_attribute_h5ad
    _write_method(type(value))(f, key, value, *args, **kwargs)
  File "/Users/luke.zappia/miniconda3/envs/test-h5py/lib/python3.9/site-packages/anndata/_io/h5ad.py", line 286, in write_mapping
    write_attribute(f, f"{key}/{sub_key}", sub_value, dataset_kwargs=dataset_kwargs)
  File "/Users/luke.zappia/miniconda3/envs/test-h5py/lib/python3.9/functools.py", line 877, in wrapper
    return dispatch(args[0].__class__)(*args, **kw)
  File "/Users/luke.zappia/miniconda3/envs/test-h5py/lib/python3.9/site-packages/anndata/_io/h5ad.py", line 126, in write_attribute_h5ad
    _write_method(type(value))(f, key, value, *args, **kwargs)
  File "/Users/luke.zappia/miniconda3/envs/test-h5py/lib/python3.9/site-packages/anndata/_io/utils.py", line 191, in func_wrapper
    raise type(e)(
NotImplementedError: Failed to write value for uns/now, since a writer for type <class 'numpy.datetime64'> has not been implemented yet.

Above error raised while writing key 'uns/now' of <class 'h5py._hl.files.File'> from /.

This behaviour is from h5py rather than anndata. From v3.0.0 they suggest storing these values using opaque dtypes https://docs.h5py.org/en/stable/special.html#opaque-dtypes.

Not sure how big a problem this is but it came up here theislab/zellkonverter#24 so I just wanted to raise it. Would be nice if anndata could handle this (at least with a nicer error) but maybe not a priority.

The text was updated successfully, but these errors were encountered:

ivirshup · 2020-11-17T09:01:41Z

We are aiming to support more types, as well as an interface for registering extensions, for reading and writing. Date times seem like a good case for first party support, depending on how complicated this gets. I'm worried this is going to be complicated.

Some points on this:

pandas uses different a different datetime type than numpy, to accommodate for timezones.
Could we consolidate to a single type?
The arrow representation of times could be a useful reference
- arrow timestamps in the context of python

lazappi · 2020-11-17T09:25:06Z

Yeah, I agree it could get complicated so up to maintainers whether it is worth the effort. The datetime64 type came up as this is what reticulate converts R datetime to, but possibly not many use cases beyond that.

flying-sheep · 2020-11-17T09:53:42Z

datetime64 basically means “UTC time”, so if we use it, we should make sure to convert e.g. datetime objects with a set timezone to UTC.

Timezones are hella complicated and change all the time, so what we shouldn’t do is convert them to offsets and store them that way I think …

ivirshup · 2020-11-19T07:31:26Z

@flying-sheep, I think we would need to store them in a way that pd.Timestamp object read out exactly equal to the way they are written. We can't try to normalizes timezone due to problems like "mean expression of samples collected on tuesday" giving different results after you read the data from disk.

Looking at the arrow implementation, this might actually be straightforward. A date time column would be stored as a dataset of 64 bit integers. The attributes include the time unit (typically nanoseconds) and optionally a timezone. All times in a column must share a timezone (pandas represents these as object arrays anyways). Numpy date times are the same except there is no timezone.

Draft functions (these will need decorators for dispatch, additional metadata etc.):

import numpy as np
import pandas as pd
import h5py

def write_datetime(f, k, v, *, dataset_kwargs={}):
    d = f.create_dataset(k, data=v.view(np.int64), **dataset_kwargs)
    d.attrs["unit"] = np.datetime_data(v)[0]

def read_datetime(d: h5py.Dataset) -> np.ndarray:
    unit = d.attrs["unit"]
    return d[...].view(f"datetime[{unit}]")

def write_timestamp(f, k, v, *, dataset_kwargs={}):
    write_datetime(f, k, v, dataset_kwargs=dataset_kwargs)
    if v.tz is not None:
        d.attrs["tz"] = v.tz

def read_timestamp(d: h5py.Dataset) -> pd.arrays.DatetimeArray:
    arr = read_datetime(d)
    if "tz" in d.attrs:
        tz = d.attrs["tz"]
        unit = d.attrs["unit"]
        dtype = pd.DatetimeTZDtype(unit=unit, tz=tz)
    else:
        dtype = arr.dtype
    return pd.arrays.DatetimeArray(arr, dtype=dtype)

This should be similar for zarr, except that zarr natively supports the numpy datetime dtype, so no need for read_datetime or write_datetime.

ivirshup · 2021-12-21T17:40:26Z

So, it turns out datetimes are super complicated and contentious. To some extent we are beholden to whatever pandas does, but it looks like what pandas does is under active development (pandas-dev/pandas#40932).

Some notes on the state of time types in the our dependencies:

Currently, pandas only really supports nanosecond resolution. This is one of the main things they want to change.
Pandas also has "date ranges" with frequencies that are encoded to the dtype.
Numpy supports many time resolutions
Numpy also supports time resolutions with steps (e.g. the unit for the dtype is 5 seconds), unclear how common this is. There is no documentation for this in the "Datetimes and Timedeltas" doc page

Other libraries:

apache arrow, has a limited set of units, no steps.

ivirshup added the enhancement label Nov 17, 2020

ivirshup mentioned this issue Dec 10, 2021

Meta issue: new data types #662

Open

8 tasks

ivirshup mentioned this issue Jan 14, 2022

Datetimes #684

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Writing to H5AD fails if AnnData contains datetime64 objects #455

Writing to H5AD fails if AnnData contains datetime64 objects #455

lazappi commented Nov 16, 2020 •

edited by ivirshup

Loading

ivirshup commented Nov 17, 2020

lazappi commented Nov 17, 2020

flying-sheep commented Nov 17, 2020

ivirshup commented Nov 19, 2020

ivirshup commented Dec 21, 2021

Writing to H5AD fails if AnnData contains datetime64 objects #455

Writing to H5AD fails if AnnData contains datetime64 objects #455

Comments

lazappi commented Nov 16, 2020 • edited by ivirshup Loading

ivirshup commented Nov 17, 2020

lazappi commented Nov 17, 2020

flying-sheep commented Nov 17, 2020

ivirshup commented Nov 19, 2020

ivirshup commented Dec 21, 2021

lazappi commented Nov 16, 2020 •

edited by ivirshup

Loading