Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Specs for all elements #554

Merged
merged 44 commits into from
Dec 25, 2021
Merged

Specs for all elements #554

merged 44 commits into from
Dec 25, 2021

Conversation

ivirshup
Copy link
Member

@ivirshup ivirshup commented Apr 15, 2021

This PR adds specs for each element of an anndata object. That is, on disk each element will be have attributes "encoding-type" and "encoding-version". For more info see this issue #555

TODO:

  • Remove old code
  • Handling of dynamic types
    • For elements like lists, we convert them to arrays. But we handle arrays differently based on the type of their values. All Most of this system is build around element-type -> on-disk-format, but these types break this a little. Currently handling this by not setting encoding-type/ encoding-version if they're already set, so we can set them for the converted value.
    • Not 100% sure what I meant by this. Writing lists works, but they are converted to arrays. Should they come back out as lists, or is that getting too python specific?
  • AnnData level element
  • Zarr backend
    • Zarr backend for new implementation
    • Zarr backwards compatibility backend
  • Warnings for ancient anndata objects
  • More tests
    • [ ] Tests for SparseDataset indexing (sometimes it wasn't returning a csr_matrix out of scope
    • Test encoding metadata for all types
  • Example cases for current formats
  • Docs
  • [ ] Don't allow unlabelled elements in more recently written objects
    • This means, all backwards compat is turned off if we know the file was written using a more current spec
    • Future task, figure out what to do with unrecognized elements.

Maybe this PR, maybe not:

I have decided: not

- [ ] Consolidate "modified" reading methods
- [ ] Figure out what the scope is for an individual resolver
- [ ] Partial IO? (probably not until during the 0.8.x release series)

This might also be the right time to cut a 0.7 maintenance branch, and make master 0.8 specific.

@ivirshup ivirshup added this to the 0.8 milestone Apr 15, 2021
@codecov
Copy link

codecov bot commented Apr 15, 2021

Codecov Report

Merging #554 (ee1e3df) into master (9520e59) will decrease coverage by 0.33%.
The diff coverage is 86.71%.

@@            Coverage Diff             @@
##           master     #554      +/-   ##
==========================================
- Coverage   83.41%   83.07%   -0.34%     
==========================================
  Files          31       34       +3     
  Lines        4148     4349     +201     
==========================================
+ Hits         3460     3613     +153     
- Misses        688      736      +48     
Impacted Files Coverage Δ
anndata/_io/__init__.py 100.00% <ø> (ø)
anndata/compat/__init__.py 85.14% <77.77%> (-1.11%) ⬇️
anndata/_io/specs/methods.py 83.28% <83.28%> (ø)
anndata/_io/h5ad.py 92.45% <90.62%> (-2.57%) ⬇️
anndata/_io/specs/registry.py 91.48% <91.48%> (ø)
anndata/_io/zarr.py 88.60% <95.83%> (-5.24%) ⬇️
anndata/__init__.py 100.00% <100.00%> (ø)
anndata/_core/sparse_dataset.py 90.19% <100.00%> (-0.36%) ⬇️
anndata/_io/specs/__init__.py 100.00% <100.00%> (ø)
anndata/_io/utils.py 72.89% <100.00%> (+2.10%) ⬆️
... and 7 more

@ivirshup
Copy link
Member Author

We now have warnings for when an element is found that doesn't have a spec. These are not user visible, but we should silence them further for the tests.

@ivirshup
Copy link
Member Author

ivirshup commented Nov 23, 2021

Working on zarr implementation. Seems mostly good, but running into the json only being able to represent a limited set of values.

Current problem: bool. Maybe I just encode these attributes as 0 and 1 for now?

Forgot about the json stdlib module not liking any numpy types.

@ivirshup
Copy link
Member Author

Huh, pretty sure I meant to call c0112fc something different. Like "start zarr support".

Still needs some work. Like most backwards compat and figuring out what I can delete.

* convert np.bool_ to bool so json module doesn't error
* actually label `str` elements as strings
* read strings back as `str` not `np.str_`
@giovp giovp mentioned this pull request Nov 29, 2021
17 tasks
@ivirshup ivirshup mentioned this pull request Dec 10, 2021
8 tasks
@flying-sheep
Copy link
Member

I think with #662, there’s no real way around encoding-type: "array", "element-type": "string", right?

@ivirshup
Copy link
Member Author

Long term yes, but I think this can go through without it. I think designing that system is worth a bit more care, and we may be able to base that design (or at least get help with it) from a "single cell data interchange" schema (will hopefully hear more on that in January).

The way around this is for now is basically just long encoding names. So "string-array", "datetime-array", "nullable-datetime-array", etc.

@ivirshup ivirshup marked this pull request as ready for review December 20, 2021 14:36
@ivirshup ivirshup enabled auto-merge (squash) December 25, 2021 17:08
@ivirshup ivirshup merged commit 664e32b into scverse:master Dec 25, 2021
@ivirshup ivirshup deleted the specs-basic branch December 25, 2021 17:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants