Datetimes #684

Draft
wants to merge 2 commits into
base: main

ivirshup
Member

Basic datetime IO support.

Currently this converts everything to numpy datetime arrays at write time. I'm not preserving pandas array types since there are multiple, seemingly overlapping ways to deal with datetimes in pandas. This implementation also does not support time zones, but that would be easy to add.
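
As a rough illustration of the write-time conversion described above (a minimal sketch, not the code in this PR; `to_numpy_datetimes` is a hypothetical helper name), the different pandas datetime containers can all be reduced to one plain numpy `datetime64[ns]` array, with time zone aware data rejected for now:

```python
import numpy as np
import pandas as pd


def to_numpy_datetimes(values) -> np.ndarray:
    """Hypothetical helper: reduce pandas datetime containers to datetime64[ns]."""
    series = pd.Series(values)
    # Time zones are out of scope here, mirroring the limitation noted above.
    if isinstance(series.dtype, pd.DatetimeTZDtype):
        raise ValueError("time zone aware datetimes are not handled in this sketch")
    return series.to_numpy(dtype="datetime64[ns]")


index = pd.DatetimeIndex(["1970-01-01", "1982-01-01"])

# The same dates held in three different pandas containers
from_index = to_numpy_datetimes(index)
from_series = to_numpy_datetimes(pd.Series(index))
from_array = to_numpy_datetimes(index.array)
```

All three results are ordinary numpy arrays with the same dtype, so whatever writes them only has to deal with one representation.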

It would be good to get someone working with time series data to try this out and see if it meets their needs.

(I thought this would solve #455, but I now see that was for datetime scalars, which this does not currently support.)

Currently converts everything to numpy datetimes when writing
@codecov

codecov bot commented Jan 14, 2022

Codecov Report

Merging #684 (7907db0) into master (a5727a5) will increase coverage by 0.04%.
The diff coverage is 93.93%.

@@            Coverage Diff             @@
##           master     #684      +/-   ##
==========================================
+ Coverage   83.12%   83.16%   +0.04%     
==========================================
  Files          34       34              
  Lines        4396     4419      +23     
==========================================
+ Hits         3654     3675      +21     
- Misses        742      744       +2     
Impacted Files                  Coverage Δ
anndata/_io/specs/methods.py    84.14% <91.30%> (+0.44%) ⬆️
anndata/_io/specs/registry.py   91.48% <100.00%> (ø)

@Zethson
Member

Zethson commented Jan 14, 2022

It would be good to get someone working with time series data to try this out and see if it meets their needs.

We do, technically. If we detect datetimes, we always just copy them to obs directly.

@ivirshup
Member Author

An example (or you giving this branch a shot) would be great.

Do you have a way of saving these AnnData's at the moment?

@Zethson
Member

Zethson commented Jan 14, 2022

@Imipenem can you help here?

@Imipenem

Do you have a way of saving these AnnData's at the moment?

At ehrapy, it just worked out of the box when writing these AnnData objects to .h5ad files. But this might be because we save columns with datetime values in obs only, not in uns or X (and, from what I've read, pandas treats these datetimes somewhat differently from numpy). So we never have any np.datetime64 values stored in the AnnData object, which (IMO) fits our needs for now. So this would not affect us currently, or am I missing something, @Zethson?
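
One quick way to check whether obs actually holds datetimes, rather than strings or categoricals that merely look like them, is to inspect the column dtypes. A minimal sketch with made-up column names and values (nothing here is ehrapy's API; `datetime_columns` is a hypothetical helper):

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype


def datetime_columns(df: pd.DataFrame) -> list:
    """Return the names of columns whose dtype is a real datetime64 dtype."""
    return [col for col in df.columns if is_datetime64_any_dtype(df[col])]


# Hypothetical obs table: the same timestamps stored three different ways
obs = pd.DataFrame(
    {
        "as_datetime": pd.to_datetime(["2101-10-20 19:08", "2101-10-20 20:00"]),
        "as_string": ["2101-10-20 19:08", "2101-10-20 20:00"],
    }
)
obs["as_category"] = obs["as_string"].astype("category")

print(datetime_columns(obs))  # ['as_datetime']
```

Only the first column is a datetime as far as pandas is concerned; the string and categorical columns would not exercise datetime IO at all.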

@Zethson
Member

Zethson commented Jan 14, 2022

Thought so as well. We didn't run into any issues.

@ivirshup
Member Author

ivirshup commented Jan 17, 2022

I'm a little confused here. If I put any sort of dates into obs, that AnnData will fail to write to h5ad in 0.7.8.

Can you make an example of this? For me:

Failing example
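import anndata as ad
import numpy as np
import pandas as pd
from vega_datasets import data

print(ad.__version__)  # printed 0.7.8 when this was run

cars = data.cars()

# "Year" is a datetime64[ns] column (see the dtype in the traceback below)
dt_array = cars["Year"]
np_dt_array = dt_array.to_numpy()

N = np_dt_array.shape[0]
adata = ad.AnnData(X=np.ones((N, N)), obs=pd.DataFrame({"dt": dt_array}))

# On anndata 0.7.8 this raises the TypeError shown below
adata.write_h5ad("test_dt.h5ad")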
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
~/github/anndata/anndata/_io/utils.py in func_wrapper(elem, key, val, *args, **kwargs)
    208         try:
--> 209             return func(elem, key, val, *args, **kwargs)
    210         except Exception as e:

~/github/anndata/anndata/_io/h5ad.py in write_array(f, key, value, dataset_kwargs)
    184         value = _to_hdf5_vlen_strings(value)
--> 185     f.create_dataset(key, data=value, **dataset_kwargs)
    186 

/usr/local/lib/python3.9/site-packages/h5py/_hl/group.py in create_dataset(self, name, shape, dtype, data, **kwds)
    148 
--> 149             dsid = dataset.make_new_dset(group, shape, dtype, data, name, **kwds)
    150             dset = dataset.Dataset(dsid)

/usr/local/lib/python3.9/site-packages/h5py/_hl/dataset.py in make_new_dset(parent, shape, dtype, data, name, chunks, compression, shuffle, fletcher32, maxshape, compression_opts, fillvalue, scaleoffset, track_times, external, track_order, dcpl, allow_unknown_filter)
     90             dtype = numpy.dtype(dtype)
---> 91         tid = h5t.py_create(dtype, logical=1)
     92 

h5py/h5t.pyx in h5py.h5t.py_create()

h5py/h5t.pyx in h5py.h5t.py_create()

h5py/h5t.pyx in h5py.h5t.py_create()

TypeError: No conversion path for dtype: dtype('<M8[ns]')

The above exception was the direct cause of the following exception:

TypeError                                 Traceback (most recent call last)
~/github/anndata/anndata/_io/utils.py in func_wrapper(elem, key, val, *args, **kwargs)
    208         try:
--> 209             return func(elem, key, val, *args, **kwargs)
    210         except Exception as e:

~/github/anndata/anndata/_io/h5ad.py in write_series(group, key, series, dataset_kwargs)
    288     else:
--> 289         write_array(group, key, series.values, dataset_kwargs=dataset_kwargs)
    290 

~/github/anndata/anndata/_io/utils.py in func_wrapper(elem, key, val, *args, **kwargs)
    211             parent = _get_parent(elem)
--> 212             raise type(e)(
    213                 f"{e}\n\n"

TypeError: No conversion path for dtype: dtype('<M8[ns]')

Above error raised while writing key 'dt' of <class 'h5py._hl.group.Group'> from /.

The above exception was the direct cause of the following exception:

TypeError                                 Traceback (most recent call last)
~/github/anndata/anndata/_io/utils.py in func_wrapper(elem, key, val, *args, **kwargs)
    208         try:
--> 209             return func(elem, key, val, *args, **kwargs)
    210         except Exception as e:

~/github/anndata/anndata/_io/h5ad.py in write_dataframe(f, key, df, dataset_kwargs)
    262     for col_name, (_, series) in zip(col_names, df.items()):
--> 263         write_series(group, col_name, series, dataset_kwargs=dataset_kwargs)
    264 

~/github/anndata/anndata/_io/utils.py in func_wrapper(elem, key, val, *args, **kwargs)
    211             parent = _get_parent(elem)
--> 212             raise type(e)(
    213                 f"{e}\n\n"

TypeError: No conversion path for dtype: dtype('<M8[ns]')

Above error raised while writing key 'dt' of <class 'h5py._hl.group.Group'> from /.

Above error raised while writing key 'dt' of <class 'h5py._hl.group.Group'> from /.

The above exception was the direct cause of the following exception:

TypeError                                 Traceback (most recent call last)
/var/folders/bd/43q20k0n6z15tdfzxvd22r7c0000gn/T/ipykernel_4792/2332825967.py in <module>
----> 1 adata.write_h5ad("test_dt.h5ad")

~/github/anndata/anndata/_core/anndata.py in write_h5ad(self, filename, compression, compression_opts, force_dense, as_dense)
   1910             filename = self.filename
   1911 
-> 1912         _write_h5ad(
   1913             Path(filename),
   1914             self,

~/github/anndata/anndata/_io/h5ad.py in write_h5ad(filepath, adata, force_dense, as_dense, dataset_kwargs, **kwargs)
    109         else:
    110             write_attribute(f, "raw", adata.raw, dataset_kwargs=dataset_kwargs)
--> 111         write_attribute(f, "obs", adata.obs, dataset_kwargs=dataset_kwargs)
    112         write_attribute(f, "var", adata.var, dataset_kwargs=dataset_kwargs)
    113         write_attribute(f, "obsm", adata.obsm, dataset_kwargs=dataset_kwargs)

/usr/local/Cellar/python@3.9/3.9.9/Frameworks/Python.framework/Versions/3.9/lib/python3.9/functools.py in wrapper(*args, **kw)
    875                             '1 positional argument')
    876 
--> 877         return dispatch(args[0].__class__)(*args, **kw)
    878 
    879     funcname = getattr(func, '__name__', 'singledispatch function')

~/github/anndata/anndata/_io/h5ad.py in write_attribute_h5ad(f, key, value, *args, **kwargs)
    128     if key in f:
    129         del f[key]
--> 130     _write_method(type(value))(f, key, value, *args, **kwargs)
    131 
    132 

~/github/anndata/anndata/_io/utils.py in func_wrapper(elem, key, val, *args, **kwargs)
    210         except Exception as e:
    211             parent = _get_parent(elem)
--> 212             raise type(e)(
    213                 f"{e}\n\n"
    214                 f"Above error raised while writing key {key!r} of {type(elem)}"

TypeError: No conversion path for dtype: dtype('<M8[ns]')

Above error raised while writing key 'dt' of <class 'h5py._hl.group.Group'> from /.

Above error raised while writing key 'dt' of <class 'h5py._hl.group.Group'> from /.

Above error raised while writing key 'obs' of <class 'h5py._hl.files.File'> from /.
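
The root cause in the traceback is that h5py has no HDF5 type to map numpy's `datetime64[ns]` (`<M8[ns]`) onto. One possible workaround, sketched here under the assumption that reinterpreting the values as integers is acceptable (this is not necessarily what this PR does, and `encode_datetimes` / `decode_datetimes` are hypothetical names), is to store the underlying int64 values together with the dtype string and reverse the view on read:

```python
import numpy as np


def encode_datetimes(arr):
    """Reinterpret datetime64[ns] values as int64, which HDF5 can store."""
    arr = np.asarray(arr, dtype="datetime64[ns]")
    # Same bytes, different dtype: nanoseconds since the Unix epoch
    return arr.view("int64"), str(arr.dtype)


def decode_datetimes(ints, dtype_str):
    """Inverse of encode_datetimes: view the stored int64 values as datetimes."""
    return np.asarray(ints, dtype="int64").view(dtype_str)


dates = np.array(["1970-01-01", "1982-01-01", "NaT"], dtype="datetime64[ns]")
ints, dtype_str = encode_datetimes(dates)
restored = decode_datetimes(ints, dtype_str)
```

In a file format, the dtype string would typically go into an attribute next to the integer dataset so that a reader knows to reverse the view. Missing values survive the round trip because `NaT` is itself a reserved int64 value.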

@Zethson
Member

Zethson commented Jan 17, 2022

Sure:

import ehrapy.api as ep

adatas = ep.dt.mimic_3_demo(encoded=False, mudata=False)
print(adatas["INPUTEVENTS_CV"].obs)
adata = adatas["INPUTEVENTS_CV"]
# This may take 5-20 minutes
ep.pp.knn_impute(adata)
adata_encoded = ep.pp.encode(adata, autodetect=True)
ep.io.write("test.h5ad", adata_encoded)

I would not be surprised if we store things differently than you somewhere, but feel free to play around with it. I have the suspicion that the datetimes are somewhere just read as strings and then mapped to categoricals. They are not real datetimes. Feedback is always appreciated!

@ivirshup
Member Author

I have the suspicion that the datetimes are somewhere just read as strings and then mapped to categoricals.

That seems to be the case.

adata_encoded.obs["charttime"].cat.categories.dtype
dtype('O')

Would it be useful if these were actual datetimes? Then you could do things like ask how far apart the times were.
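
For instance, here is a minimal sketch of what real datetimes would allow, using made-up values in place of the actual `charttime` column:

```python
import pandas as pd

# Hypothetical stand-in for a column stored as a categorical of strings
charttime = pd.Series(
    ["2101-10-20 19:08:00", "2101-10-20 21:30:00", "2101-10-20 19:08:00"],
    dtype="category",
)

# Parse the strings into real datetimes
as_datetimes = pd.to_datetime(charttime.astype(str))

# Time elapsed since the earliest record, as timedeltas
elapsed = as_datetimes - as_datetimes.min()
elapsed_seconds = elapsed.dt.total_seconds().tolist()
```

With the categorical version, the only meaningful question is whether two entries are equal; with datetimes, differences and ordering come for free.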

@Zethson
Member

Zethson commented Jan 18, 2022

I have the suspicion that the datetimes are somewhere just read as strings and then mapped to categoricals.

That seems to be the case.

adata_encoded.obs["charttime"].cat.categories.dtype
dtype('O')

Would it be useful if these were actual datetimes? Then you could do things like ask how far apart the times were.

Not surprised. Our primary motivation was the coloring of plots and things like that.

Yeah, your suggested use-case is a good one. Although, in general I am trying to reduce the dependency on real time as much as possible with ehrapy and to work more with pseudotime :)

@Zethson
Member

Zethson commented Aug 21, 2023

@ivirshup is this PR still the approach you'd follow, or has it changed since pandas 2.0 was released? Datetime support would still be great for ehrapy, especially for things like comparing datetimes.

3 participants