Fix memory consumption increase for anndata objects #363

Open: wants to merge 8 commits into base: main
74 changes: 69 additions & 5 deletions anndata/_core/aligned_mapping.py
@@ -3,6 +3,7 @@
 from typing import Union, Optional, Type, ClassVar, TypeVar  # Special types
 from typing import Iterator, Mapping, Sequence  # ABCs
 from typing import Tuple, List, Dict  # Generic base types
+import weakref

 import numpy as np
 import pandas as pd
@@ -47,8 +48,8 @@ def _ipython_key_completions_(self) -> List[str]:
     def _validate_value(self, val: V, key: str) -> V:
         """Raises an error if value is invalid"""
         for i, axis in enumerate(self.axes):
-            if self.parent.shape[axis] != val.shape[i]:
-                right_shape = tuple(self.parent.shape[a] for a in self.axes)
+            if self.parent_shape[axis] != val.shape[i]:
+                right_shape = tuple(self.parent_shape[a] for a in self.axes)
                 raise ValueError(
                     f"Value passed for key {key!r} is of incorrect shape. "
                     f"Values of {self.attrname} must match dimensions "
@@ -81,6 +82,10 @@ def is_view(self) -> bool:
     def parent(self) -> Union["anndata.AnnData", "raw.Raw"]:
         return self._parent

+    @property
+    def parent_shape(self) -> Tuple[int, int]:
+        return self._parent.shape
+
Member

Could the shape tuple just be an attribute storing a tuple? Previously, I avoided this to reduce redundancy, but I think there's more complexity introduced by all the getters and setters. This would also remove the need for the AnnData.__del__ method.

I was checking to see if this might break inplace subsetting, but it looks like that was already a bit broken:

from anndata.tests.helpers import gen_adata

a = gen_adata((200, 100))
o1 = a.obsm
a._inplace_subset_obs(slice(50))
o2 = a.obsm
assert o1["array"].shape != o2["array"].shape
assert o1.parent is o2.parent

Author

@fhausmann fhausmann May 7, 2020

I introduced this for exactly this case: when the anndata object changes shape, the change should be propagated to the AxisArrays. We can certainly drop these getters, setters and the __del__ method if the new shape is propagated explicitly in every function that changes it. However, I'm not sure it's easy to find them all (or is it only these _inplace_subset_* functions)?

I guess it's also expected on master that o1 changes?

Member

or is it only these _inplace_subset_* functions

The shape of the instance should only change via the _inplace_subset_ methods.

We can certainly drop these getters, setters and the __del__ method if the new shape is propagated explicitly in every function that changes it.

Which properties are you referring to here?

I guess it's also expected on master that o1 changes?

I think it would make sense either for it to change, or for it to no longer refer to a parent of the wrong shape.

Don't feel obligated to fix things that were already broken on master in this PR. If it's not intimately related, it's probably better to open an issue and fix it in a separate PR.

Author

Which properties are you referring to here?

    @property
    def parent_shape(self) -> Tuple[int, int]:
        if self._parent:
            self._parent_shape = self._parent.shape
        return self._parent_shape

    @parent_shape.setter
    def parent_shape(self, shape: Tuple[int, int]):
        self._parent_shape = shape

for example. If we can ensure that the parent only changes shape through the _inplace_subset_ methods, as you said, we can drop the parent_shape property of AxisArrays and the __del__ method of anndata, store a tuple of the parent shape instead, and update it in the _inplace_subset_ methods when necessary.
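
As a rough illustration of the simplification discussed in this thread, the sketch below keeps the parent's shape as a plain tuple on the mapping and re-creates the mapping during an in-place subset. `ShapeCachingMapping`, `ToyParent` and `inplace_subset_obs` are hypothetical stand-ins, not anndata classes or methods, and lists replace arrays to keep the example dependency-free.

```python
from typing import Dict, Tuple


class ShapeCachingMapping:
    """Hypothetical stand-in for AxisArrays: stores the parent's shape as a plain tuple."""

    def __init__(self, parent_shape: Tuple[int, int], axis: int):
        self.parent_shape = parent_shape  # plain attribute, no getter or setter
        self.axis = axis
        self._data: Dict[str, list] = {}

    def __setitem__(self, key: str, value: list) -> None:
        expected = self.parent_shape[self.axis]
        if len(value) != expected:
            raise ValueError(
                f"Value for key {key!r} has length {len(value)}, expected {expected}."
            )
        self._data[key] = value

    def __getitem__(self, key: str) -> list:
        return self._data[key]

    def items(self):
        return self._data.items()


class ToyParent:
    """Hypothetical stand-in for AnnData with only an obs-aligned mapping."""

    def __init__(self, n_obs: int, n_vars: int):
        self.shape = (n_obs, n_vars)
        self.obsm = ShapeCachingMapping(self.shape, axis=0)

    def inplace_subset_obs(self, index: slice) -> None:
        old = self.obsm
        n_obs = len(range(*index.indices(self.shape[0])))
        self.shape = (n_obs, self.shape[1])
        # The mapping is re-created here, so its cached shape is fresh by construction.
        self.obsm = ShapeCachingMapping(self.shape, axis=0)
        for key, value in old.items():
            self.obsm[key] = value[index]
```

With this layout a mapping obtained before the subset keeps describing the old shape, which mirrors the o1/o2 observation made earlier in the thread.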

Author

I updated the code to drop the __del__ method and the parent_shape property.
We still need the property once, in the AlignedMapping class; otherwise a lot of subclasses would have to be changed too.
parent_shape does not need to be updated in the _inplace_subset_ methods because the mappings are newly created there.

Additionally, I had a closer look at your solution above.

As an alternative, I was thinking the relationship could be reversed. The aligned mapping gets a normal reference to the anndata object, but the anndata object wraps the underlying dict whenever the attribute is accessed. This way the reference isn't circular.

However, _obsm then cannot be an AxisArrays, otherwise you still have the circularity. And if it's not an AxisArrays, it breaks a lot of other functions and would require a lot of code changes to fix, or am I wrong?
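
The reversed relationship quoted above can be sketched roughly as follows. This is only an illustration under stated assumptions: `ToyParent` and `MappingWrapper` are hypothetical names, the parent stores a plain dict, and a wrapper holding a normal reference to the parent is built on every attribute access. The immediate-release check at the end assumes CPython's reference counting.

```python
import gc
import weakref


class MappingWrapper:
    """Hypothetical wrapper: normal reference to the parent, shared dict for the data."""

    def __init__(self, parent, data: dict):
        self.parent = parent
        self._data = data  # the very same dict object the parent stores

    def __setitem__(self, key, value):
        if len(value) != self.parent.n_obs:
            raise ValueError(f"Value for key {key!r} has the wrong length.")
        self._data[key] = value

    def __getitem__(self, key):
        return self._data[key]


class ToyParent:
    """Hypothetical stand-in for AnnData."""

    def __init__(self, n_obs: int):
        self.n_obs = n_obs
        self._obsm = {}  # plain dict, so nothing here points back to the parent

    @property
    def obsm(self) -> MappingWrapper:
        # A fresh wrapper per access; the parent never stores it, so no cycle forms.
        return MappingWrapper(self, self._obsm)


gc.disable()  # rule out the cycle collector for the check below
parent = ToyParent(3)
parent.obsm["x"] = [1, 2, 3]
stored = parent.obsm["x"]
probe = weakref.ref(parent)
del parent
released_immediately = probe() is None  # expected to hold on CPython
gc.enable()
```

The cost of this design is the concern raised in the comment: any code that expects `_obsm` itself to be an AxisArrays would see a plain dict instead.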

Author

@fhausmann fhausmann Jun 30, 2020

That would be great, thanks. I created a branch here: https://github.com/fhausmann/anndata/tree/memory_fix_lazy_obsm

Member

I think I see what's happening. We try to normalize passed indices to slice, int, array[int], or array[bool] types as soon as possible. The normalized indices are stored on the view in the _oidx and _vidx attributes. AxisArray._view only expects to see those types as subset indices. This can be fixed by changing how the view is made in your obsm getter to:

return obsm._view(self, self._oidx)
# Instead of
return obsm._view(self, self.obs_names)
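
To illustrate why passing names instead of normalized indices fails, here is a small sketch of the idea. `normalize_index` and `make_view` are hypothetical helpers written for this note, not anndata functions, and plain lists stand in for arrays.

```python
from typing import List, Sequence, Union

Index = Union[slice, int, List[int], List[bool]]


def normalize_index(index, names: Sequence[str]) -> Index:
    """Hypothetical normalizer: map labels to positions, pass accepted kinds through."""
    if isinstance(index, (slice, int)):
        return index
    if isinstance(index, str):
        return list(names).index(index)
    index = list(index)
    # bool is a subclass of int, so boolean masks are checked first.
    if all(isinstance(i, bool) for i in index):
        return index
    if all(isinstance(i, int) for i in index):
        return index
    positions = {name: pos for pos, name in enumerate(names)}
    return [positions[name] for name in index]


def make_view(values: list, index: Index) -> list:
    """Hypothetical view constructor that only understands normalized indices."""
    if isinstance(index, slice):
        return values[index]
    if isinstance(index, int):
        return [values[index]]
    if index and all(isinstance(i, bool) for i in index):
        return [v for v, keep in zip(values, index) if keep]
    if all(isinstance(i, int) for i in index):
        return [values[i] for i in index]
    raise TypeError(f"Unsupported index: {index!r}")
```

Calling `make_view` with raw names raises a `TypeError`, while running the same names through `normalize_index` first works, which is the difference between the two return statements above.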

Author

I fixed this, along with some other small changes. However, there is now another issue.
I had to introduce this: https://github.com/fhausmann/anndata/blob/aa3e55b5e6251bc373a8ab94e8e11d7fc6d16dba/anndata/_core/aligned_mapping.py#L234-L237
to update the parent when obsm is modified.

However, there still seems to be an issue, pointed out by the following test: anndata/tests/test_base.py::test_setting_dim_index[obs]
If you create an anndata object containing obsm=dict(df=pd.DataFrame), copy it and create a view, it looks as if all objects refer to the same dataframe:
id(curr._obsm['df']) == id(orig._obsm['df']) # True at https://github.com/theislab/anndata/blob/58886f09b2e387c6389a2de20ed0bc7d20d1b843/anndata/tests/test_base.py#L187

I think it can be fixed by creating a copy when modifying a value in _obsm. However, this leads to infinite recursion when trying to create an anndata object from a view in place.

Additionally, I think these changes now go beyond fixing the original issue and change the anndata architecture. Should we create a new pull request for this?

Member

Oh, the AxisArray shouldn't have its own version of the values. That should just be a reference to the values held by the parent. That is, I would think the __init__ should just do something like self._data = parent._obsm.

If you create an anndata object containing obsm=dict(df=pd.DataFrame), copy it and create a view, it looks as if all objects refer to the same dataframe:
id(curr._obsm['df']) == id(orig._obsm['df']) # True at

Is the code for this like:

curr = orig[:, :].copy()
# or 
curr = orig.copy()[:, :]

Either way, I agree this doesn't look right. But if curr is a view, I'm not sure it should have values for ._obsm. It also looks like the .copy method isn't actually making a copy of the dataframe if this is true, so that would be another thing to look into.
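
The symptom described here (identical `id` values after a copy) is what a shallow copy of the underlying dict produces. The sketch below shows this with a hypothetical `ToyContainer` and nested lists in place of a DataFrame; it does not claim to reproduce what the `.copy` method in the package actually does.

```python
import copy


class ToyContainer:
    """Hypothetical container with a dict of tables, standing in for an object with ._obsm."""

    def __init__(self, obsm: dict):
        self._obsm = dict(obsm)  # new dict, but the stored values are reused

    def shallow_copy(self) -> "ToyContainer":
        # Copies the mapping only; both containers share the same value objects.
        return ToyContainer(self._obsm)

    def deep_copy(self) -> "ToyContainer":
        # Copies the mapping and every value stored in it.
        return ToyContainer(copy.deepcopy(self._obsm))


orig = ToyContainer({"df": [[1, 2], [3, 4]]})
shallow = orig.shallow_copy()
deep = orig.deep_copy()

same_object = id(shallow._obsm["df"]) == id(orig._obsm["df"])
independent = id(deep._obsm["df"]) != id(orig._obsm["df"])
```

A check like `same_object` is the kind of assertion that would expose a copy that shares values with its source.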

Additionally, I think these changes now go beyond fixing the original issue and change the anndata architecture. Should we create a new pull request for this?

I wouldn't worry too much about "architecture changes". A lot of the work I've done on this package has been to make changing the architecture easier. It's up to you how you'd like to organize this, but I often find starting a fresh PR/branch helpful.

Author

Should we close this then? Or do you think it could be worth considering if the lazy creation doesn't work out?

     def copy(self):
         d = self._actual_class(self.parent, self._axis)
         for k, v in self.items():
@@ -222,15 +227,48 @@ def __init__(
         axis: int,
         vals: Union[Mapping, AxisArraysBase, None] = None,
     ):
-        self._parent = parent
+        if isinstance(parent, anndata.AnnData):
+            self._parent_ref = weakref.ref(parent)
+            self._is_weak = True
+        else:
+            self._parent_ref = parent
+            self._is_weak = False
         if axis not in (0, 1):
             raise ValueError()
         self._axis = axis
+        self._parent_shape = parent.shape
         self.dim_names = (parent.obs_names, parent.var_names)[self._axis]
         self._data = dict()
         if vals is not None:
             self.update(vals)

+    @property
+    def _parent(self) -> Union["anndata.AnnData", "raw.Raw"]:
+        if self._is_weak:
+            return self._parent_ref()
+        return self._parent_ref
+
+    @property
+    def parent_shape(self) -> Tuple[int, int]:
+        if self._parent:
+            self._parent_shape = self._parent.shape
+        return self._parent_shape
+
+    @parent_shape.setter
+    def parent_shape(self, shape: Tuple[int, int]):
+        self._parent_shape = shape
+
+    def __getstate__(self):
+        state = self.__dict__.copy()
+        if self._is_weak:
+            state["_parent_ref"] = state["_parent_ref"]()
+        return state
+
+    def __setstate__(self, state):
+        self.__dict__ = state.copy()
+        if self._is_weak:
+            self.__dict__["_parent_ref"] = weakref.ref(state["_parent_ref"])
+

 class AxisArraysView(AlignedViewMixin, AxisArraysBase):
     def __init__(
@@ -270,11 +308,24 @@ def copy(self) -> "Layers":

 class Layers(AlignedActualMixin, LayersBase):
     def __init__(self, parent: "anndata.AnnData", vals: Optional[Mapping] = None):
-        self._parent = parent
+        self._parent_ref = weakref.ref(parent)
         self._data = dict()
         if vals is not None:
             self.update(vals)

+    @property
+    def _parent(self):
+        return self._parent_ref()
+
+    def __getstate__(self):
+        state = self.__dict__.copy()
+        state["_parent_ref"] = state["_parent_ref"]()
+        return state
+
+    def __setstate__(self, state):
+        self.__dict__ = state.copy()
+        self.__dict__["_parent_ref"] = weakref.ref(state["_parent_ref"])
+

 class LayersView(AlignedViewMixin, LayersBase):
     def __init__(
@@ -320,14 +371,27 @@ class PairwiseArrays(AlignedActualMixin, PairwiseArraysBase):
     def __init__(
         self, parent: "anndata.AnnData", axis: int, vals: Optional[Mapping] = None,
     ):
-        self._parent = parent
+        self._parent_ref = weakref.ref(parent)
         if axis not in (0, 1):
             raise ValueError()
         self._axis = axis
         self._data = dict()
         if vals is not None:
             self.update(vals)

+    @property
+    def _parent(self):
+        return self._parent_ref()
+
+    def __getstate__(self):
+        state = self.__dict__.copy()
+        state["_parent_ref"] = state["_parent_ref"]()
+        return state
+
+    def __setstate__(self, state):
+        self.__dict__ = state.copy()
+        self.__dict__["_parent_ref"] = weakref.ref(state["_parent_ref"])
+

 class PairwiseArraysView(AlignedViewMixin, PairwiseArraysBase):
     def __init__(
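
The __getstate__/__setstate__ pairs in the hunks above exist because `weakref.ref` objects cannot be pickled. The following sketch reproduces that pattern on two hypothetical classes, `Parent` and `Child` (not anndata classes), assuming they are defined at module level so that `pickle` can locate them.

```python
import pickle
import weakref


class Child:
    """Hypothetical child holding a weak back-reference to its parent."""

    def __init__(self, parent):
        self._parent_ref = weakref.ref(parent)

    @property
    def parent(self):
        return self._parent_ref()

    def __getstate__(self):
        state = self.__dict__.copy()
        # Swap the weak reference for the referent so the state can be pickled.
        state["_parent_ref"] = state["_parent_ref"]()
        return state

    def __setstate__(self, state):
        self.__dict__ = state.copy()
        # Restore the weak reference on the unpickled parent object.
        self.__dict__["_parent_ref"] = weakref.ref(state["_parent_ref"])


class Parent:
    """Hypothetical parent owning one child."""

    def __init__(self):
        self.child = Child(self)


parent = Parent()
restored = pickle.loads(pickle.dumps(parent))

# Pickling a bare weak reference fails, which is what the hooks work around.
try:
    pickle.dumps(weakref.ref(parent))
    weakref_is_picklable = True
except TypeError:
    weakref_is_picklable = False
```

One caveat of this pattern: if the parent has already been collected when the child is pickled, the state holds `None`, and `weakref.ref(None)` raises `TypeError` on unpickling, so a guard for that case may be needed.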
6 changes: 6 additions & 0 deletions anndata/_core/anndata.py
@@ -1977,3 +1977,9 @@ def _get_and_delete_multicol_field(self, a, key_multicol):
         values = getattr(self, a)[keys].values
         getattr(self, a).drop(keys, axis=1, inplace=True)
         return values
+
+    def __del__(self):
+        if isinstance(self._obsm, AxisArrays):
+            self._obsm.parent_shape = self.shape
+        if isinstance(self._varm, AxisArrays):
+            self._varm.parent_shape = self.shape
16 changes: 15 additions & 1 deletion anndata/_core/file_backing.py
@@ -1,6 +1,7 @@
 from os import PathLike
 from pathlib import Path
 from typing import Optional, Union, Iterator
+import weakref

 import h5py

@@ -18,13 +19,26 @@ def __init__(
         filename: Optional[PathLike] = None,
         filemode: Optional[Literal["r", "r+"]] = None,
     ):
-        self._adata = adata
+        self._adata_ref = weakref.ref(adata)
         self.filename = filename
         self._filemode = filemode
         self._file = None
         if filename:
             self.open()

+    def __getstate__(self):
+        state = self.__dict__.copy()
+        state["_adata_ref"] = state["_adata_ref"]()
+        return state
+
+    def __setstate__(self, state):
+        self.__dict__ = state.copy()
+        self.__dict__["_adata_ref"] = weakref.ref(state["_adata_ref"])
+
+    @property
+    def _adata(self):
+        return self._adata_ref()
+
     def __repr__(self) -> str:
         if self.filename is None:
             return "Backing file manager: no file is set."
47 changes: 47 additions & 0 deletions anndata/tests/test_base.py
@@ -1,4 +1,5 @@
 from itertools import product
+import tracemalloc

 import numpy as np
 from numpy import ma
@@ -573,3 +574,49 @@ def assert_eq_not_id(a, b):
         assert_eq_not_id(map_sprs.keys(), map_copy.keys())
         for key in map_sprs.keys():
             assert_eq_not_id(map_sprs[key], map_copy[key])
+
+
+def test_memory_usage():
+    N, M = 100, 200
+    RUNS = 10
+    obs_df = pd.DataFrame(
+        dict(
+            cat=pd.Categorical(np.arange(N, dtype=int)),
+            int=np.arange(N, dtype=int),
+            float=np.arange(N, dtype=float),
+            obj=[str(i) for i in np.arange(N, dtype=int)],
+        ),
+        index=[f"cell{i}" for i in np.arange(N, dtype=int)],
+    )
+    var_df = pd.DataFrame(
+        dict(
+            cat=pd.Categorical(np.arange(M, dtype=int)),
+            int=np.arange(M, dtype=int),
+            float=np.arange(M, dtype=float),
+            obj=[str(i) for i in np.arange(M, dtype=int)],
+        ),
+        index=[f"gene{i}" for i in np.arange(M, dtype=int)],
+    )
+
+    def get_memory(snapshot, key_type="lineno"):
+        snapshot = snapshot.filter_traces(
+            (
+                tracemalloc.Filter(False, "<frozen importlib._bootstrap>"),
+                tracemalloc.Filter(False, "<unknown>"),
+            )
+        )
+        total = sum(stat.size for stat in snapshot.statistics(key_type))
+        return total
+
+    total = np.zeros(RUNS)
+    # Instantiate the anndata object first, before measuring memory, to
+    # only look at memory changes due to deletion of such an object.
+    adata = AnnData(X=np.random.random((N, M)), obs=obs_df, var=var_df)
+    adata.X[0, 0] = 1.0  # Disable Codacy issue
+    tracemalloc.start()
+    for i in range(RUNS):
+        adata = AnnData(X=np.random.random((N, M)), obs=obs_df, var=var_df)
+        total[i] = get_memory(tracemalloc.take_snapshot())
+    tracemalloc.stop()
+    relative_increase = total[:-1] / total[1:]
+    np.testing.assert_allclose(relative_increase, 1.0, atol=0.2)
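
The test above checks that repeatedly rebinding `adata` does not make traced memory grow. The mechanism it relies on can be shown in isolation: a strong parent/child cycle survives `del` until the cycle collector runs, while a weak back-reference lets reference counting free the parent at once. The classes and the helper `freed_without_gc` below are hypothetical, and the behaviour shown assumes CPython.

```python
import gc
import weakref


class StrongChild:
    def __init__(self, parent):
        self._parent = parent  # strong back-reference: parent <-> child cycle


class WeakChild:
    def __init__(self, parent):
        self._parent_ref = weakref.ref(parent)  # weak back-reference: no cycle

    @property
    def _parent(self):
        return self._parent_ref()


class Parent:
    def __init__(self, child_cls):
        self.payload = bytearray(1_000_000)  # stands in for a large data matrix
        self.child = child_cls(self)


def freed_without_gc(child_cls) -> bool:
    """Return True if dropping the last name frees the parent without a collector pass."""
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        parent = Parent(child_cls)
        probe = weakref.ref(parent)
        del parent
        freed = probe() is None
    finally:
        if was_enabled:
            gc.enable()
        gc.collect()  # clean up whatever the strong cycle left behind
    return freed
```

On CPython, `freed_without_gc(WeakChild)` is expected to return `True` and `freed_without_gc(StrongChild)` to return `False`.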