best practice for storing and versioning large X and obs/obsm #965

djarecka · 2023-03-31T13:59:39Z

I'm wondering if anyone has thought about best practices for using and versioning anndata files in a situation where:

X is big (e.g. 200GB or more) and doesn't change over time,
obs and obsm are much smaller but they do change (scientists changing the classifications, adding columns, etc.).

It seems to me that it is not a good idea to create a new 200GB file every time obs is modified. Right now we are thinking about creating two separate files, one that has only X, and the other that has an empty X and the current version of obs and obsm. But perhaps there are already better ways of dealing with this issue?


grst · 2023-04-03T06:39:05Z

Maybe a use-case for lamin db. @Zethson may be able to tell you more.

Zethson · 2023-04-03T10:24:30Z

@grst thank you for the ping.

When reading this I actually first thought of the partial reading/writing capabilities of slots that we're working on?

djarecka · 2023-04-05T20:44:01Z

@Zethson - I don't know the details of the work you're doing on the partial reading/writing capabilities, but our problem is also the storage and versioning of the file itself. Every time obs or obsm is updated (which is quite often), we have to create a new version of the entire file, even though the biggest part of the file, X, is unchanged.

ivirshup · 2023-04-17T14:10:56Z

Hey @djarecka.

At the moment I'm pursuing this upstream. I would like to have anndata objects defined by "manifests", similar to the idea proposed here:

Beyond consolidated metadata for V3: inspiration from Apache Iceberg zarr-developers/zarr-specs#154

There's a lot of complexity around the versioning side of things, especially being able to tell whether anything has changed, unless the data is managed using something like dask.

In the nearer term I would like to make it more possible for users to handle this manually with a merge function, so you can handle your "deltas" manually.

ad.merge function #658

In your case, if you are only updating entries in obsm and obs, I would suggest saving those separately (using the read_elem and write_elem functions). Then you could do something like:

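import anndata as ad
import h5py
from anndata.experimental import read_elem  # moved to anndata.io in newer releases

adata = ad.read_h5ad("whole_v1.h5ad")

# Swap in the separately stored v2 annotations; X comes from the v1 file
with h5py.File("obs_v2.h5", "r") as f:
    adata.obs = read_elem(f["/"])

with h5py.File("obsm_v2.h5", "r") as f:
    adata.obsm = read_elem(f["/"])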

ivirshup added enhancement question topic: backed topic: io and removed topic: backed labels Apr 17, 2023
