Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Writing to H5AD fails if AnnData contains datetime64 objects #455

Open
lazappi opened this issue Nov 16, 2020 · 5 comments
Open

Writing to H5AD fails if AnnData contains datetime64 objects #455

lazappi opened this issue Nov 16, 2020 · 5 comments

Comments

@lazappi
Copy link

lazappi commented Nov 16, 2020

Writing an AnnData object to disk fails if it contains numpy datetime64 objects. Example:

import anndata
import numpy as np

ad = anndata.AnnData()

ad.uns['now'] = np.datetime64('now')

ad.write("test_h5py.h5ad")
ERROR MESSAGE
Traceback (most recent call last):
  File "/Users/luke.zappia/miniconda3/envs/test-h5py/lib/python3.9/site-packages/anndata/_io/utils.py", line 188, in func_wrapper
    return func(elem, key, val, *args, **kwargs)
  File "/Users/luke.zappia/miniconda3/envs/test-h5py/lib/python3.9/site-packages/anndata/_io/h5ad.py", line 144, in write_not_implemented
    raise NotImplementedError(
NotImplementedError: Failed to write value for uns/now, since a writer for type <class 'numpy.datetime64'> has not been implemented yet.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/luke.zappia/Desktop/test_h5py.py", line 8, in <module>
    ad.write("test_h5py.h5ad")
  File "/Users/luke.zappia/miniconda3/envs/test-h5py/lib/python3.9/site-packages/anndata/_core/anndata.py", line 1846, in write_h5ad
    _write_h5ad(
  File "/Users/luke.zappia/miniconda3/envs/test-h5py/lib/python3.9/site-packages/anndata/_io/h5ad.py", line 112, in write_h5ad
    write_attribute(f, "uns", adata.uns, dataset_kwargs=dataset_kwargs)
  File "/Users/luke.zappia/miniconda3/envs/test-h5py/lib/python3.9/functools.py", line 877, in wrapper
    return dispatch(args[0].__class__)(*args, **kw)
  File "/Users/luke.zappia/miniconda3/envs/test-h5py/lib/python3.9/site-packages/anndata/_io/h5ad.py", line 126, in write_attribute_h5ad
    _write_method(type(value))(f, key, value, *args, **kwargs)
  File "/Users/luke.zappia/miniconda3/envs/test-h5py/lib/python3.9/site-packages/anndata/_io/h5ad.py", line 286, in write_mapping
    write_attribute(f, f"{key}/{sub_key}", sub_value, dataset_kwargs=dataset_kwargs)
  File "/Users/luke.zappia/miniconda3/envs/test-h5py/lib/python3.9/functools.py", line 877, in wrapper
    return dispatch(args[0].__class__)(*args, **kw)
  File "/Users/luke.zappia/miniconda3/envs/test-h5py/lib/python3.9/site-packages/anndata/_io/h5ad.py", line 126, in write_attribute_h5ad
    _write_method(type(value))(f, key, value, *args, **kwargs)
  File "/Users/luke.zappia/miniconda3/envs/test-h5py/lib/python3.9/site-packages/anndata/_io/utils.py", line 191, in func_wrapper
    raise type(e)(
NotImplementedError: Failed to write value for uns/now, since a writer for type <class 'numpy.datetime64'> has not been implemented yet.

Above error raised while writing key 'uns/now' of <class 'h5py._hl.files.File'> from /.

This behaviour is from h5py rather than anndata. From v3.0.0 they suggest storing these values using opaque dtypes https://docs.h5py.org/en/stable/special.html#opaque-dtypes.

Not sure how big a problem this is but it came up here theislab/zellkonverter#24 so I just wanted to raise it. Would be nice if anndata could handle this (at least with a nicer error) but maybe not a priority.

@ivirshup
Copy link
Member

We are aiming to support more types, as well as an interface for registering extensions, for reading and writing. Date times seem like a good case for first party support, depending on how complicated this gets. I'm worried this is going to be complicated.

Some points on this:

  • pandas uses different a different datetime type than numpy, to accommodate for timezones.
  • Could we consolidate to a single type?
  • The arrow representation of times could be a useful reference

@lazappi
Copy link
Author

lazappi commented Nov 17, 2020

Yeah, I agree it could get complicated so up to maintainers whether it is worth the effort. The datetime64 type came up as this is what reticulate converts R datetime to, but possibly not many use cases beyond that.

@flying-sheep
Copy link
Member

datetime64 basically means “UTC time”, so if we use it, we should make sure to convert e.g. datetime objects with a set timezone to UTC.

Timezones are hella complicated and change all the time, so what we shouldn’t do is convert them to offsets and store them that way I think …

@ivirshup
Copy link
Member

@flying-sheep, I think we would need to store them in a way that pd.Timestamp object read out exactly equal to the way they are written. We can't try to normalizes timezone due to problems like "mean expression of samples collected on tuesday" giving different results after you read the data from disk.


Looking at the arrow implementation, this might actually be straightforward. A date time column would be stored as a dataset of 64 bit integers. The attributes include the time unit (typically nanoseconds) and optionally a timezone. All times in a column must share a timezone (pandas represents these as object arrays anyways). Numpy date times are the same except there is no timezone.

Draft functions (these will need decorators for dispatch, additional metadata etc.):

import numpy as np
import pandas as pd
import h5py

def write_datetime(f, k, v, *, dataset_kwargs={}):
    d = f.create_dataset(k, data=v.view(np.int64), **dataset_kwargs)
    d.attrs["unit"] = np.datetime_data(v)[0]

def read_datetime(d: h5py.Dataset) -> np.ndarray:
    unit = d.attrs["unit"]
    return d[...].view(f"datetime[{unit}]")

def write_timestamp(f, k, v, *, dataset_kwargs={}):
    write_datetime(f, k, v, dataset_kwargs=dataset_kwargs)
    if v.tz is not None:
        d.attrs["tz"] = v.tz

def read_timestamp(d: h5py.Dataset) -> pd.arrays.DatetimeArray:
    arr = read_datetime(d)
    if "tz" in d.attrs:
        tz = d.attrs["tz"]
        unit = d.attrs["unit"]
        dtype = pd.DatetimeTZDtype(unit=unit, tz=tz)
    else:
        dtype = arr.dtype
    return pd.arrays.DatetimeArray(arr, dtype=dtype)

This should be similar for zarr, except that zarr natively supports the numpy datetime dtype, so no need for read_datetime or write_datetime.

@ivirshup
Copy link
Member

So, it turns out datetimes are super complicated and contentious. To some extent we are beholden to whatever pandas does, but it looks like what pandas does is under active development (pandas-dev/pandas#40932).

Some notes on the state of time types in the our dependencies:

  • Currently, pandas only really supports nanosecond resolution. This is one of the main things they want to change.
  • Pandas also has "date ranges" with frequencies that are encoded to the dtype.
  • Numpy supports many time resolutions
  • Numpy also supports time resolutions with steps (e.g. the unit for the dtype is 5 seconds), unclear how common this is. There is no documentation for this in the "Datetimes and Timedeltas" doc page

Other libraries:

@ivirshup ivirshup mentioned this issue Jan 14, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants