-
Notifications
You must be signed in to change notification settings - Fork 155
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix memory consumption increase for anndata objects #363
base: main
Are you sure you want to change the base?
Conversation
Thanks for the PR! One potential issue, what should this do? from anndata import AnnData
import numpy as np
obsm = AnnData(np.ones((10, 10))).obsm
obsm["a"] = np.zeros((10, 5)) traceback as of bf3e6bc---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-2-82d6b89026db> in <module>
1 obsm = AnnData(np.ones((10, 10))).obsm
----> 2 obsm["a"] = np.zeros((10, 5))
~/github/anndata/anndata/_core/aligned_mapping.py in __setitem__(self, key, value)
150
151 def __setitem__(self, key: str, value: V):
--> 152 value = self._validate_value(value, key)
153 self._data[key] = value
154
~/github/anndata/anndata/_core/aligned_mapping.py in _validate_value(self, val, key)
214 f"value.index does not match parent’s axis {self.axes[0]} names"
215 )
--> 216 return super()._validate_value(val, key)
217
218
~/github/anndata/anndata/_core/aligned_mapping.py in _validate_value(self, val, key)
49 """Raises an error if value is invalid"""
50 for i, axis in enumerate(self.axes):
---> 51 if self.parent.shape[axis] != val.shape[i]:
52 right_shape = tuple(self.parent.shape[a] for a in self.axes)
53 raise ValueError(
AttributeError: 'NoneType' object has no attribute 'shape' |
I think, this is the right/exected behaviour. For example the following code works as expected.
However, if you want to keep the current behaviour, I would suggest to store the parent.shape in the AlignedMapping object and update it whenever accessed (and the parent is not None). This would make AlignedMapping self-contained. What are your thoughts on this ? |
I think it's confusing from an API standpoint, since to users the aligned mapping classes should just be like
That could work! You'd also need to keep a reference to the indexes. As an alternative, I was thinking the relationship could be reversed. The aligned mapping gets a normal reference to the anndata object, but the anndata object wraps the underlying dict whenever the attribute is accessed. This way the reference isn't circular. For example, something like: @property
def obsm(self) -> Union[AxisArrays, AxisArraysView]:
"""\
Multi-dimensional annotation of observations
(mutable structured :class:`~numpy.ndarray`).
Stores for each key a two or higher-dimensional :class:`~numpy.ndarray`
of length `n_obs`.
Is sliced with `data` and `obs` but behaves otherwise like a :term:`mapping`.
"""
if self.is_view:
obsm = AxisArrays(self._adata_ref, 0, self._obsm)
return obsm._view(self, self.obs_names)
else:
return AxisArrays(self, 0, self._obsm) |
This is already present in
Yes this would also work, but has the disadvantage to keep the anndata object in memory if only obsm/varm is accessible, which means you keep the potential memory expensive Therefore I tried the shape thing first. I'm happy to hear your comments! |
A lot of the behavior in this packages assumes there will be a live instance of an AnnData object. I don't think these classes make much sense without a live parent object, since they largely exist to maintain copy-on-write semantics.
I think this could be avoided. You would only need to validate items whenever they're added, and on AnnData instantiation. Otherwise, they should only be modified through the |
Ok, I can understand that this is a valid design assumption, however a very strong one in my eyes, because it's hard to maintain a living parent for all children without inducing circular references. Do you think your idea above is enough to have less API / behaviour changes while resolving the issue ? |
Ah, but until now it didn't seem like there was any downside to having circular references 😄. Looks like we've just managed to hit an edge case of python's garbage collection (a small amount of large objects)
I'm happy for whatever solution. It's very possible there would be problems with my idea that I haven't considered yet. It's just what I had thought of before you opened this PR. And my bias is towards less code that deals with state. |
Yes, it's definitely an edge case and a downside of the python garbage collection, which you hit. You lucky one 😄
Yes this is a downside of the new code. One could make a base class which handles the state, where all these objects inherit from, but still you have to manage them at least once with the current solution. |
anndata/_core/aligned_mapping.py
Outdated
@property | ||
def parent_shape(self) -> Tuple[int, int]: | ||
return self._parent.shape | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could the shape tuple just be an attribute storing a tuple? Previously, I avoided this to reduce redundancy, but I think there's more complexity introduced by all the getters and setters. This would also remove the need for the AnnData.__del__
method.
I was checking to see if this might break inplace subsetting, but it looks like that was already a bit broken:
from anndata.tests.helpers import gen_adata
a = gen_adata((200, 100))
o1 = a.obsm
a._inplace_subset_obs(slice(50))
o2 = a.obsm
assert o1["array"].shape != o2["array"].shape
assert o1.parent is o2.parent
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I introduced this for exactly this case (the anndata object changes shape), which then should be propagated to the AxisArrays
. We can drop these getters, setters and the __del__
method for sure, when propagated at all functions inducing different shapes explicitly. However, I'm not sure, that's so easy to find them all (or it is only these _inplace_subset_*
functions) ?
I guess also on master it's expected that o1
is changed ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or it is only these inplace_subset* functions
The shape of the instance should only change via the _inplace_subset_
methods.
We can drop these getters, setters and the del method for sure, when propagated at all functions inducing different shapes explicitly.
Which properties are you referring to here?
I guess also on master it's expected that o1 is changed ?
I think it would either make sense that it change, or it no longer referred to the parent of the wrong shape.
Don't feel obligated to fix things that were already broken on master in this PR. If it's not intimately related, it's probably better to open an issue and fix it in a separate PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which properties are you referring to here?
@property
def parent_shape(self) -> Tuple[int, int]:
if self._parent:
self._parent_shape = self._parent.shape
return self._parent_shape
@parent_shape.setter
def parent_shape(self, shape: Tuple[int, int]):
self._parent_shape = shape
for example. When we can ensure, that the parent does not change shape other than _inplace_subset_
, like you said, we can drop these parent_shape
property of AxisArrays and the __del__
method of anndata, store a tuple of the parent shape instead and update it in the _inplace_subset_
methods when necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated the code to drop the __del__
method and the parent_shape
property.
We still need it once in the AlignedMapping
class otherwise, a lot of subclasses has to be changed to.
parent_shape
does not need to be updated in _inplace_subsets_
because they get newly created there.
Additionally I had a closer look to your solution above.
As an alternative, I was thinking the relationship could be reversed. The aligned mapping gets a normal reference to the anndata object, but the anndata object wraps the underlying dict whenever the attribute is accessed. This way the reference isn't circular.
However, then _obsm
cannot be AxisArrays
, otherwise you still have the circularity and if it's not an AxisArrays
it breaks a lot of other functions and would require a lot of code changes to fix it or am I wrong ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would be great, thanks. I created a branch here: https://github.com/fhausmann/anndata/tree/memory_fix_lazy_obsm
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I see what's happening. We try to normalize passed indices to slice
, int
, array[int]
, or array[bool]
types as soon as possible. The normalized indices are stored in view in the _oidx
and _vidx
attributes. AxisArray._view
is only expecting to see those types as subset indices. This can be fixed by changing how the view is made in your obsm
getter to:
return obsm._view(self, self._oidx)
# Instead of
return obsm._view(self, self.obs_names)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I fixed this, including some other small changes. However there is now another issue.
I had to introduce this https://github.com/fhausmann/anndata/blob/aa3e55b5e6251bc373a8ab94e8e11d7fc6d16dba/anndata/_core/aligned_mapping.py#L234-L237
To update the parent when obsm
is modified.
However there still seem to be an issue pointed out by the following test: anndata/tests/test_base.py::test_setting_dim_index[obs]
If you create an anndata object containing obsm=dict('df'=pd.Dataframe)
, copy it and create a view it looks like as all objects are referring to the same dataframe:
id(curr._obsm['df']) == id(orig._obsm['df']) # True
at https://github.com/theislab/anndata/blob/58886f09b2e387c6389a2de20ed0bc7d20d1b843/anndata/tests/test_base.py#L187
I think it can be fixed with creating a copy when modifying a value in _obsm
. However this leads to infinite recursion when trying to create an anndata object from a view inplace.
Additionally I think, these changes are now (not only) for fixing the original issue, but changing the anndata architecture. Should we now create a new pull request for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, the AxisArray shouldn't have it's own version of the values. That should just be a reference to the values held by the parent. That is, I would think the __init__
should just do something like self._data = parent._obsm
.
If you create an anndata object containing
obsm=dict('df'=pd.Dataframe)
, copy it and create a view it looks like as all objects are referring to the same dataframe:
id(curr._obsm['df']) == id(orig._obsm['df']) # True at
Is the code for this like:
curr = orig[:, :].copy()
# or
curr = orig.copy()[:, :]
Either way, I agree this doesn't look right. But if curr
is a view, I'm not sure it should have values for ._obsm
. It also looks like the .copy
method isn't actually making a copy of the dataframe if this is true, so that would be another thing to look into.
Additionally I think, these changes are now (not only) for fixing the original issue, but changing the anndata architecture. Should we now create a new pull request for this?
I wouldn't worry too much about "architecture changes". A lot of the work I've done on this package has been to make changing the architecture easier. It's up to you how you'd like to organize this, but I often find starting a fresh PR/ branch helpful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we close this than ? Or do you think it could be worth considering if the lazy creation doesn't work out ?
- __del__ from anndata because subssetting only occur in 'inplace_subset_' methods where obsm, varm where also updated - Therefore also the parent_shape setter and getter of AxisArrays was now unused
As mentioned in scverse#363 (comment)
Any update on this? Thanks. |
Codecov Report
Additional details and impacted files@@ Coverage Diff @@
## main #363 +/- ##
==========================================
- Coverage 84.90% 82.88% -2.02%
==========================================
Files 36 36
Lines 5133 5189 +56
==========================================
- Hits 4358 4301 -57
- Misses 775 888 +113
Flags with carried forward coverage won't be shown. Click here to find out more.
|
Fixes the increased memory consumption when creating multiple anndata objects.
Introducing weak circular references which doesn't interfere with the standard reference count based garbage collector.
Add one additional test for checking this behaviour for future changes.
Fixes #360.