Fix memory consumption increase for anndata objects #363

fhausmann · 2020-05-05T08:42:30Z

Fixes the increased memory consumption when creating multiple anndata objects.
Introducing weak circular references which doesn't interfere with the standard reference count based garbage collector.
Add one additional test for checking this behaviour for future changes.

Fixes #360.

ivirshup · 2020-05-05T11:29:24Z

Thanks for the PR!

One potential issue, what should this do?

from anndata import AnnData
import numpy as np

obsm = AnnData(np.ones((10, 10))).obsm
obsm["a"] = np.zeros((10, 5))

traceback as of bf3e6bc

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-2-82d6b89026db> in <module>
      1 obsm = AnnData(np.ones((10, 10))).obsm
----> 2 obsm["a"] = np.zeros((10, 5))

~/github/anndata/anndata/_core/aligned_mapping.py in __setitem__(self, key, value)
    150 
    151     def __setitem__(self, key: str, value: V):
--> 152         value = self._validate_value(value, key)
    153         self._data[key] = value
    154 

~/github/anndata/anndata/_core/aligned_mapping.py in _validate_value(self, val, key)
    214                 f"value.index does not match parent’s axis {self.axes[0]} names"
    215             )
--> 216         return super()._validate_value(val, key)
    217 
    218 

~/github/anndata/anndata/_core/aligned_mapping.py in _validate_value(self, val, key)
     49         """Raises an error if value is invalid"""
     50         for i, axis in enumerate(self.axes):
---> 51             if self.parent.shape[axis] != val.shape[i]:
     52                 right_shape = tuple(self.parent.shape[a] for a in self.axes)
     53                 raise ValueError(

AttributeError: 'NoneType' object has no attribute 'shape'

fhausmann · 2020-05-05T12:21:00Z

I think, this is the right/exected behaviour.
For example if you do: y = dict(x=5)['x'] the dictionary is evicted from memory and you cannot access it any more. The same happens in the code above. The anndata object is evicted from memory, the weakref in AlignedMapping gets None and only obsm is kept in memory.
So I think it's the expected behaviour expect that the error message is unclear.

For example the following code works as expected.

adata = AnnData(np.ones((10, 10)))
obsm =adata.obsm
obsm["a"] = np.zeros((10, 5))

However, if you want to keep the current behaviour, I would suggest to store the parent.shape in the AlignedMapping object and update it whenever accessed (and the parent is not None). This would make AlignedMapping self-contained.

What are your thoughts on this ?

ivirshup · 2020-05-05T23:59:50Z

I think it's confusing from an API standpoint, since to users the aligned mapping classes should just be like dicts. This is what stopped me from using weakref.proxys here in the first place.

However, if you want to keep the current behaviour, I would suggest to store the parent.shape in the AlignedMapping object and update it whenever accessed (and the parent is not None).

That could work! You'd also need to keep a reference to the indexes.

As an alternative, I was thinking the relationship could be reversed. The aligned mapping gets a normal reference to the anndata object, but the anndata object wraps the underlying dict whenever the attribute is accessed. This way the reference isn't circular. For example, something like:

    @property
    def obsm(self) -> Union[AxisArrays, AxisArraysView]:
        """\
        Multi-dimensional annotation of observations
        (mutable structured :class:`~numpy.ndarray`).

        Stores for each key a two or higher-dimensional :class:`~numpy.ndarray`
        of length `n_obs`.
        Is sliced with `data` and `obs` but behaves otherwise like a :term:`mapping`.
        """
        if self.is_view:
            obsm = AxisArrays(self._adata_ref, 0, self._obsm)
            return obsm._view(self, self.obs_names)
        else:
            return AxisArrays(self, 0, self._obsm)

fhausmann · 2020-05-06T08:14:24Z

That could work! You'd also need to keep a reference to the indexes.

This is already present in self.dim_names, therefore I guess it's fine.

As an alternative, I was thinking the relationship could be reversed. The aligned mapping gets a normal reference to the anndata object, but the anndata object wraps the underlying dict whenever the attribute is accessed. This way the reference isn't circular.

Yes this would also work, but has the disadvantage to keep the anndata object in memory if only obsm/varm is accessible, which means you keep the potential memory expensive X array, even if you can't use it. Additionally, you would do these _validate_value each time you access obsm/varm, which could be expensive in terms of running time, but I'm not sure about this.

Therefore I tried the shape thing first. I'm happy to hear your comments!

ivirshup · 2020-05-06T09:16:28Z

but has the disadvantage to keep the anndata object in memory if only obsm/varm is accessible

A lot of the behavior in this packages assumes there will be a live instance of an AnnData object. I don't think these classes make much sense without a live parent object, since they largely exist to maintain copy-on-write semantics.

Additionally, you would do these _validate_value each time you access obsm/varm, which could be expensive in terms of running time, but I'm not sure about this.

I think this could be avoided. You would only need to validate items whenever they're added, and on AnnData instantiation. Otherwise, they should only be modified through the __getitem__ and __setitem__ methods.

fhausmann · 2020-05-06T11:10:05Z

Ok, I can understand that this is a valid design assumption, however a very strong one in my eyes, because it's hard to maintain a living parent for all children without inducing circular references.

Do you think your idea above is enough to have less API / behaviour changes while resolving the issue ?
Then I'm happy to implement your idea, but otherwise I think I don't have enough time to think about and implement different ideas.

ivirshup · 2020-05-06T13:39:26Z

because it's hard to maintain a living parent for all children without inducing circular references.

Ah, but until now it didn't seem like there was any downside to having circular references 😄. Looks like we've just managed to hit an edge case of python's garbage collection (a small amount of large objects)

Do you think your idea above is enough to have less API / behaviour changes while resolving the issue ?

I'm happy for whatever solution. It's very possible there would be problems with my idea that I haven't considered yet. It's just what I had thought of before you opened this PR. And my bias is towards less code that deals with state.

fhausmann · 2020-05-06T14:57:43Z

Looks like we've just managed to hit an edge case of python's garbage collection (a small amount of large objects)

Yes, it's definitely an edge case and a downside of the python garbage collection, which you hit. You lucky one 😄

And my bias is towards less code that deals with state.

Yes this is a downside of the new code. One could make a base class which handles the state, where all these objects inherit from, but still you have to manage them at least once with the current solution.

ivirshup · 2020-05-07T02:33:09Z

anndata/_core/aligned_mapping.py

+    @property
+    def parent_shape(self) -> Tuple[int, int]:
+        return self._parent.shape
+


Could the shape tuple just be an attribute storing a tuple? Previously, I avoided this to reduce redundancy, but I think there's more complexity introduced by all the getters and setters. This would also remove the need for the AnnData.__del__ method.

I was checking to see if this might break inplace subsetting, but it looks like that was already a bit broken:

from anndata.tests.helpers import gen_adata a = gen_adata((200, 100)) o1 = a.obsm a._inplace_subset_obs(slice(50)) o2 = a.obsm assert o1["array"].shape != o2["array"].shape assert o1.parent is o2.parent

I introduced this for exactly this case (the anndata object changes shape), which then should be propagated to the AxisArrays. We can drop these getters, setters and the __del__ method for sure, when propagated at all functions inducing different shapes explicitly. However, I'm not sure, that's so easy to find them all (or it is only these _inplace_subset_* functions) ?

I guess also on master it's expected that o1 is changed ?

or it is only these inplace_subset* functions

The shape of the instance should only change via the _inplace_subset_ methods.

We can drop these getters, setters and the del method for sure, when propagated at all functions inducing different shapes explicitly.

Which properties are you referring to here?

I guess also on master it's expected that o1 is changed ?

I think it would either make sense that it change, or it no longer referred to the parent of the wrong shape.

Don't feel obligated to fix things that were already broken on master in this PR. If it's not intimately related, it's probably better to open an issue and fix it in a separate PR.

Which properties are you referring to here?

@property def parent_shape(self) -> Tuple[int, int]: if self._parent: self._parent_shape = self._parent.shape return self._parent_shape @parent_shape.setter def parent_shape(self, shape: Tuple[int, int]): self._parent_shape = shape

for example. When we can ensure, that the parent does not change shape other than _inplace_subset_, like you said, we can drop these parent_shape property of AxisArrays and the __del__ method of anndata, store a tuple of the parent shape instead and update it in the _inplace_subset_ methods when necessary.

I updated the code to drop the __del__ method and the parent_shape property.
We still need it once in the AlignedMapping class otherwise, a lot of subclasses has to be changed to.
parent_shape does not need to be updated in _inplace_subsets_ because they get newly created there.

Additionally I had a closer look to your solution above.

As an alternative, I was thinking the relationship could be reversed. The aligned mapping gets a normal reference to the anndata object, but the anndata object wraps the underlying dict whenever the attribute is accessed. This way the reference isn't circular.

However, then _obsm cannot be AxisArrays, otherwise you still have the circularity and if it's not an AxisArrays it breaks a lot of other functions and would require a lot of code changes to fix it or am I wrong ?

That would be great, thanks. I created a branch here: https://github.com/fhausmann/anndata/tree/memory_fix_lazy_obsm

I think I see what's happening. We try to normalize passed indices to slice, int, array[int], or array[bool] types as soon as possible. The normalized indices are stored in view in the _oidx and _vidx attributes. AxisArray._view is only expecting to see those types as subset indices. This can be fixed by changing how the view is made in your obsm getter to:

return obsm._view(self, self._oidx) # Instead of return obsm._view(self, self.obs_names)

I fixed this, including some other small changes. However there is now another issue.
I had to introduce this https://github.com/fhausmann/anndata/blob/aa3e55b5e6251bc373a8ab94e8e11d7fc6d16dba/anndata/_core/aligned_mapping.py#L234-L237
To update the parent when obsm is modified.

However there still seem to be an issue pointed out by the following test: anndata/tests/test_base.py::test_setting_dim_index[obs]
If you create an anndata object containing obsm=dict('df'=pd.Dataframe) , copy it and create a view it looks like as all objects are referring to the same dataframe:
id(curr._obsm['df']) == id(orig._obsm['df']) # True at https://github.com/theislab/anndata/blob/58886f09b2e387c6389a2de20ed0bc7d20d1b843/anndata/tests/test_base.py#L187

I think it can be fixed with creating a copy when modifying a value in _obsm. However this leads to infinite recursion when trying to create an anndata object from a view inplace.

Additionally I think, these changes are now (not only) for fixing the original issue, but changing the anndata architecture. Should we now create a new pull request for this?

Oh, the AxisArray shouldn't have it's own version of the values. That should just be a reference to the values held by the parent. That is, I would think the __init__ should just do something like self._data = parent._obsm.

If you create an anndata object containing obsm=dict('df'=pd.Dataframe), copy it and create a view it looks like as all objects are referring to the same dataframe:
id(curr._obsm['df']) == id(orig._obsm['df']) # True at

Is the code for this like:

curr = orig[:, :].copy() # or curr = orig.copy()[:, :]

Either way, I agree this doesn't look right. But if curr is a view, I'm not sure it should have values for ._obsm. It also looks like the .copy method isn't actually making a copy of the dataframe if this is true, so that would be another thing to look into.

Additionally I think, these changes are now (not only) for fixing the original issue, but changing the anndata architecture. Should we now create a new pull request for this?

I wouldn't worry too much about "architecture changes". A lot of the work I've done on this package has been to make changing the architecture easier. It's up to you how you'd like to organize this, but I often find starting a fresh PR/ branch helpful.

Should we close this than ? Or do you think it could be worth considering if the lazy creation doesn't work out ?

- __del__ from anndata because subssetting only occur in 'inplace_subset_' methods where obsm, varm where also updated - Therefore also the parent_shape setter and getter of AxisArrays was now unused

As mentioned in scverse#363 (comment)

keithgmitchell · 2022-05-19T17:52:42Z

Any update on this? Thanks.

codecov · 2023-09-01T13:54:10Z

Codecov Report

Merging #363 (5fff6cf) into main (22f33bb) will decrease coverage by 2.02%.
The diff coverage is 100.00%.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #363      +/-   ##
==========================================
- Coverage   84.90%   82.88%   -2.02%     
==========================================
  Files          36       36              
  Lines        5133     5189      +56     
==========================================
- Hits         4358     4301      -57     
- Misses        775      888     +113

Flag	Coverage Δ
gpu-tests	`?`

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Changed	Coverage Δ
anndata/_core/aligned_mapping.py	`94.73% <100.00%> (+1.07%)`	⬆️
anndata/_core/file_backing.py	`92.47% <100.00%> (+1.00%)`	⬆️

... and 7 files with indirect coverage changes

fhausmann added 2 commits May 5, 2020 10:36

Fix memory consumption increase for anndata objects (scverse#360)

4bb72a9

Add documentation to test, Fix codacy issue

bf3e6bc

Fix parent.shape not available after anndata deletion

2363113

ivirshup reviewed May 7, 2020

View reviewed changes

fhausmann added 2 commits May 7, 2020 09:15

Include scverse#366

d79ee0a

Remove __del__ and parent_shape property

a696ba8

- __del__ from anndata because subssetting only occur in 'inplace_subset_' methods where obsm, varm where also updated - Therefore also the parent_shape setter and getter of AxisArrays was now unused

fhausmann mentioned this pull request May 14, 2020

Inplace subset of anndata object doesn't update AxisArrays #373

Open

ivirshup mentioned this pull request May 21, 2020

Shallow copies of views aren't intuitive #383

Open

Merge branch 'master' into memory_fix

e5bbad1

fhausmann added a commit to fhausmann/anndata that referenced this pull request Jun 30, 2020

Add lazy creation of obsm

ddceb40

As mentioned in scverse#363 (comment)

fhausmann mentioned this pull request Jul 21, 2020

[WIP] Make obsm/varm creation lazy #402

Closed

Merge branch 'master' into memory_fix

582b790

Merge branch 'main' into memory_fix

5fff6cf

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix memory consumption increase for anndata objects #363

Fix memory consumption increase for anndata objects #363

fhausmann commented May 5, 2020

ivirshup commented May 5, 2020

fhausmann commented May 5, 2020 •

edited

Loading

ivirshup commented May 5, 2020

fhausmann commented May 6, 2020

ivirshup commented May 6, 2020

fhausmann commented May 6, 2020

ivirshup commented May 6, 2020

fhausmann commented May 6, 2020

ivirshup May 7, 2020

fhausmann May 7, 2020 •

edited

Loading

ivirshup May 13, 2020

fhausmann May 13, 2020

fhausmann May 14, 2020

fhausmann Jun 30, 2020 •

edited

Loading

ivirshup Jul 2, 2020

fhausmann Jul 10, 2020

ivirshup Jul 13, 2020

fhausmann Jul 21, 2020

keithgmitchell commented May 19, 2022

codecov bot commented Sep 1, 2023

Fix memory consumption increase for anndata objects #363

Are you sure you want to change the base?

Fix memory consumption increase for anndata objects #363

Conversation

fhausmann commented May 5, 2020

ivirshup commented May 5, 2020

fhausmann commented May 5, 2020 • edited Loading

ivirshup commented May 5, 2020

fhausmann commented May 6, 2020

ivirshup commented May 6, 2020

fhausmann commented May 6, 2020

ivirshup commented May 6, 2020

fhausmann commented May 6, 2020

Choose a reason for hiding this comment

fhausmann May 7, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

fhausmann Jun 30, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

keithgmitchell commented May 19, 2022

codecov bot commented Sep 1, 2023

Codecov Report

fhausmann commented May 5, 2020 •

edited

Loading

fhausmann May 7, 2020 •

edited

Loading

fhausmann Jun 30, 2020 •

edited

Loading