best practice for storing and versioning large X and obs/obsm #965

Open
djarecka opened this issue Mar 31, 2023 · 4 comments

Comments

@djarecka

I'm wondering if anyone has thought about best practices for using and versioning anndata files in a situation where:

  • X is big (e.g. 200GB or more) and doesn't change over time,
  • obs and obsm are much smaller but they do change (scientists changing the classifications, adding columns, etc.).

It seems to me that creating a new 200GB file every time obs is modified is not a good idea. Right now we are thinking about creating two separate files: one that has only X, and another that has an empty X and the current version of obs and obsm. But perhaps there are already better ways of dealing with this issue?

@grst
Contributor

grst commented Apr 3, 2023

Maybe a use-case for lamin db. @Zethson may be able to tell you more.

@Zethson
Member

Zethson commented Apr 3, 2023

@grst thank you for the ping.

When reading this I actually first thought of the partial reading/writing capabilities of slots that we're working on?

@djarecka
Author

djarecka commented Apr 5, 2023

@Zethson - I don't know the details of the work you're doing on the partial reading/writing capabilities, but our problem is also the storage/versioning of the file itself. Every time obs or obsm is updated (and that happens quite often) we have to create a new version of the entire file, even though the biggest part of the file, X, is unchanged.

@ivirshup
Member

ivirshup commented Apr 17, 2023

Hey @djarecka.

At the moment I'm pursuing this upstream. I would like to have anndata objects defined by "manifests", similar to the idea proposed here:

There's a lot of complexity around the versioning side of things, especially being able to tell whether anything has changed, unless the data is managed using something like dask.
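To make the "has anything changed" question concrete, here is a minimal sketch of a manifest that records one content hash per element file. The manifest layout and the function names (`file_digest`, `build_manifest`, `changed_elements`) are hypothetical and only for illustration; this is not an anndata API or the proposal referenced above. It also only detects byte-level changes, so a file that is rewritten with identical logical content may still be reported as changed.

```python
import hashlib


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(elements):
    """Map each element name (e.g. "X", "obs") to its path and digest.

    `elements` is a dict of element name -> file path (hypothetical layout).
    """
    return {
        name: {"path": str(path), "sha256": file_digest(path)}
        for name, path in elements.items()
    }


def changed_elements(old, new):
    """Names of elements that are new or whose digest differs."""
    return sorted(
        name
        for name in new
        if name not in old or old[name]["sha256"] != new[name]["sha256"]
    )
```

With something like this, a new version only needs a new manifest plus the files whose digests changed; an unchanged X keeps pointing at the same file.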

In the nearer term I would like to make it easier for users to handle this themselves with a merge function, so you can manage your "deltas" manually.

In your case, if you are only updating entries in obs and obsm, I would suggest saving those separately (using the read_elem and write_elem functions). Then you could do something like:

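import anndata as ad
import h5py
from anndata.experimental import read_elem

adata = ad.read_h5ad("whole_v1.h5ad")

# Swap in the newer annotations that were saved separately
with h5py.File("obs_v2.h5", "r") as f:
    adata.obs = read_elem(f["/"])

with h5py.File("obsm_v2.h5", "r") as f:
    adata.obsm = read_elem(f["/"])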
