Create abstraction for caching #458

Merged 32 commits on Dec 19, 2023

Commits (32)
ff476b9
Move existing caching module to new memcached module withing new cach…
AnesBenmerzoug Oct 30, 2023
5337451
Refactor caching into separate classes and add 2 more implementations
AnesBenmerzoug Nov 23, 2023
f5c947e
Adapt Utility to caching change
AnesBenmerzoug Nov 23, 2023
ab96b81
Adapt tests
AnesBenmerzoug Nov 23, 2023
0811ede
Remove caching section from readme
AnesBenmerzoug Nov 23, 2023
44c5f57
Rename CacheBackendBase to CacheBackend, improve docstrings
AnesBenmerzoug Nov 23, 2023
073745f
Make pymemcached an optional dependency, define new memcached extra
AnesBenmerzoug Nov 23, 2023
6b2e3a7
Add joblib documentation inventory
AnesBenmerzoug Nov 23, 2023
6b5e6ee
Update and improve installation and first-steps docs
AnesBenmerzoug Nov 23, 2023
82fa952
Add link to extras section of docs to readme
AnesBenmerzoug Nov 23, 2023
16ae640
Update changelog
AnesBenmerzoug Nov 23, 2023
fbc96cf
Fix type hints
AnesBenmerzoug Nov 24, 2023
ab4578a
Remove leftover uses of enable_cache argument
AnesBenmerzoug Nov 24, 2023
6ce679a
Use name cache_backend instead of cache
AnesBenmerzoug Nov 24, 2023
ab9a20a
Fix tests
AnesBenmerzoug Nov 24, 2023
8cb02ac
More fixes
AnesBenmerzoug Nov 24, 2023
89472bc
Handle usage of MemcachedCacheBackend when pymemcache is not installed
AnesBenmerzoug Nov 26, 2023
030763d
Fix and improve caching package's docstring
AnesBenmerzoug Nov 26, 2023
763f6ec
Fix tests
AnesBenmerzoug Nov 26, 2023
bc92356
Add test for case when pymemcache is not installed
AnesBenmerzoug Dec 9, 2023
bb69b77
Add test for cache backend serialization
AnesBenmerzoug Dec 9, 2023
486b43a
Use newly created temporary directory for DiskCacheBackend
AnesBenmerzoug Dec 9, 2023
ef8bd33
Set default value of cached_func_options to None
AnesBenmerzoug Dec 9, 2023
798c232
Merge branch 'develop' into feature/create-abstraction-for-cache
AnesBenmerzoug Dec 13, 2023
73e8e54
Set backend time_threshold to 0.3
AnesBenmerzoug Dec 13, 2023
287ca1d
Fix test of utility with cache
AnesBenmerzoug Dec 13, 2023
196b310
Add hash_prefix parameter to CachedFuncConfig, use it in utility
AnesBenmerzoug Dec 14, 2023
02da342
Please mypy
AnesBenmerzoug Dec 14, 2023
a2662f2
Use builtin hash to compute hash_prefix
AnesBenmerzoug Dec 14, 2023
6b0b60c
Merge branch 'develop' into feature/create-abstraction-for-cache
AnesBenmerzoug Dec 17, 2023
471a64e
Add suggestions from review session
schroedk Dec 18, 2023
e0f4fc5
Merge branch 'develop' into feature/create-abstraction-for-cache
AnesBenmerzoug Dec 18, 2023
5 changes: 4 additions & 1 deletion CHANGELOG.md
@@ -1,10 +1,11 @@
# Changelog


## Unreleased

### Added

- New cache backends: InMemoryCacheBackend and DiskCacheBackend
[PR #458](https://github.com/aai-institute/pyDVL/pull/458)
- New influence function interface `InfluenceFunctionModel`
- Data parallel computation with `DaskInfluenceCalculator`
[PR #26](https://github.com/aai-institute/pyDVL/issues/26)
@@ -15,6 +16,8 @@

### Changed

- Refactor and simplify caching implementation
[PR #458](https://github.com/aai-institute/pyDVL/pull/458)
- Simplify display of computation progress
[PR #466](https://github.com/aai-institute/pyDVL/pull/466)
- Improve readme and explain better the examples
95 changes: 75 additions & 20 deletions docs/getting-started/first-steps.md
@@ -9,8 +9,7 @@ alias:

!!! Warning
Make sure you have read [[installation]] before using the library.
In particular read about how caching and parallelization work,
since they might require additional setup.
In particular read about which extra dependencies you may need.

## Main concepts

@@ -23,7 +22,6 @@ should be enough to get you started.
computation and related methods.
* [[influence-values]] for instructions on how to compute influence functions.


## Running the examples

If you are somewhat familiar with the concepts of data valuation, you can start
@@ -36,23 +34,22 @@ by browsing our worked-out examples illustrating pyDVL's capabilities either:
have to install jupyter first manually since it's not a dependency of the
library.

# Advanced usage
## Advanced usage

Besides the do's and don'ts of data valuation itself, which are the subject of
Besides the dos and don'ts of data valuation itself, which are the subject of
the examples and the documentation of each method, there are two main things to
keep in mind when using pyDVL.

## Caching

pyDVL uses [memcached](https://memcached.org/) to cache the computation of the
utility function and speed up some computations (see the [installation
guide](installation.md/#setting-up-the-cache)).
### Caching

Caching of the utility function is disabled by default. When it is enabled it
takes into account the data indices passed as argument and the utility function
wrapped into the [Utility][pydvl.utils.utility.Utility] object. This means that
pyDVL can cache (memoize) the computation of the utility function
and speed up some computations for data valuation.
It is, however, disabled by default.
When it is enabled it takes into account the data indices passed as argument
and the utility function wrapped into the
[Utility][pydvl.utils.utility.Utility] object. This means that
care must be taken when reusing the same utility function with different data,
see the documentation for the [caching module][pydvl.utils.caching] for more
see the documentation for the [caching package][pydvl.utils.caching] for more
information.
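To see why care is needed, here is a self-contained toy illustration in plain Python (not pyDVL's actual classes; the per-dataset prefix below mirrors the idea behind the `hash_prefix` parameter added to `CachedFuncConfig` in this PR):

```python
# Toy memoized utility keyed only on data indices (illustrative, not pyDVL
# code). Reusing it with different data silently returns stale values; a
# per-dataset prefix in the cache key avoids the collision.

cache: dict = {}

def cached_utility(data, indices, prefix=""):
    key = (prefix, indices)
    if key not in cache:
        # Stand-in for an expensive train-and-score step on data[indices].
        cache[key] = sum(data[i] for i in indices)
    return cache[key]

data_a = [1.0, 2.0, 3.0]
data_b = [10.0, 20.0, 30.0]

u_a = cached_utility(data_a, (0, 1))        # computed: 3.0
u_stale = cached_utility(data_b, (0, 1))    # stale hit: still 3.0!
u_b = cached_utility(data_b, (0, 1), "b")   # distinct key: 30.0
```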

In general, caching won't play a major role in the computation of Shapley values
@@ -61,24 +58,82 @@ the same utility function computation, is very low. However, it can be very
useful when comparing methods that use the same utility function, or when
running multiple experiments with the same data.

pyDVL supports 3 different caching backends:

- [InMemoryCacheBackend][pydvl.utils.caching.memory.InMemoryCacheBackend]:
an in-memory cache backend that uses a dictionary to store and retrieve
cached values. This is used to share cached values between threads
in a single process.
- [DiskCacheBackend][pydvl.utils.caching.disk.DiskCacheBackend]:
a disk-based cache backend that uses pickled values written to and read from disk.
This is used to share cached values between processes in a single machine.
- [MemcachedCacheBackend][pydvl.utils.caching.memcached.MemcachedCacheBackend]:
a [Memcached](https://memcached.org/)-based cache backend that uses pickled values written to
and read from a Memcached server. This is used to share cached values
between processes across multiple machines.

**Note**: this specific backend requires optional dependencies.
See [[installation#extras]] for more information.
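The interface these backends share can be sketched in a few lines of plain Python. This is illustrative only — the class and method names below are assumptions, not pyDVL's actual API; see the [caching package][pydvl.utils.caching] for the real one:

```python
from abc import ABC, abstractmethod
from typing import Any, Callable, Optional


class CacheBackend(ABC):
    """Minimal key-value interface that every backend implements."""

    @abstractmethod
    def get(self, key: str) -> Optional[Any]: ...

    @abstractmethod
    def set(self, key: str, value: Any) -> None: ...

    def wrap(self, func: Callable) -> Callable:
        """Memoize func through this backend, keyed on its arguments."""
        def wrapper(*args):
            key = repr(args)
            cached = self.get(key)
            if cached is not None:
                return cached
            value = func(*args)
            self.set(key, value)
            return value
        return wrapper


class InMemoryBackend(CacheBackend):
    """Dict-backed backend: values shared between threads of one process."""

    def __init__(self) -> None:
        self._store: dict = {}

    def get(self, key: str) -> Optional[Any]:
        return self._store.get(key)

    def set(self, key: str, value: Any) -> None:
        self._store[key] = value
```

A disk- or Memcached-based backend then only needs to re-implement `get` and `set` against pickled files or a server, while the memoization logic stays in the base class.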

!!! tip "When is the cache really necessary?"
Crucially, semi-value computations with the
[PermutationSampler][pydvl.value.sampler.PermutationSampler] require caching
to be enabled, or they will take twice as long as the direct implementation
in [compute_shapley_values][pydvl.value.shapley.compute_shapley_values].
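The effect can be counted in a toy example (plain Python, not pyDVL code): walking one permutation with a memoized utility, each step's `U(prefix)` is a cache hit from the previous step, so the loop does roughly half the evaluations of the uncached version.

```python
evaluations = 0
cache: dict = {}

def utility(subset: frozenset) -> float:
    """Stand-in for training and scoring a model on subset."""
    global evaluations
    if subset in cache:
        return cache[subset]
    evaluations += 1
    cache[subset] = float(len(subset))
    return cache[subset]

# One permutation of 3 points: marginal of i is U(prefix | {i}) - U(prefix).
permutation = [2, 0, 1]
marginals = {}
prefix: frozenset = frozenset()
for i in permutation:
    with_i = prefix | {i}
    marginals[i] = utility(with_i) - utility(prefix)
    prefix = with_i

# Without memoization this loop would call U six times; with it, only four,
# because U(prefix) was already computed in the previous iteration.
```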

## Parallelization
!!! tip "Using the cache"
Continue reading about the cache in the documentation
for the [caching package][pydvl.utils.caching].

#### Setting up the Memcached cache

[Memcached](https://memcached.org/) is an in-memory key-value store accessible
over the network. pyDVL can use it to cache the computation of the utility function
and speed up some computations (in particular, semi-value computations with the
[PermutationSampler][pydvl.value.sampler.PermutationSampler] but other methods
may benefit as well).

You can either install it as a package or run it inside a docker container (the
simplest option). For installation instructions, refer to the [Getting
started](https://github.com/memcached/memcached/wiki#getting-started) section in
memcached's wiki. Then you can run it with:

pyDVL supports [joblib](https://joblib.readthedocs.io/en/latest/) for local
parallelization (within one machine) and [ray](https://ray.io) for distributed
parallelization (across multiple machines).
```shell
memcached -u user
```

The former works out of the box but for the latter you will need to provide a
running cluster (or run ray in local mode).
To run memcached inside a container in daemon mode instead, use:

```shell
docker container run -d --rm -p 11211:11211 memcached:latest
```
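Either way, you can check that the server is reachable on the default port (assuming `nc` is available; flags vary between netcat implementations):

```shell
printf 'version\r\nquit\r\n' | nc localhost 11211
```

The server should answer with a line like `VERSION 1.6.x`.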

### Parallelization

pyDVL uses [joblib](https://joblib.readthedocs.io/en/latest/) for local
parallelization (within one machine) and supports using
[Ray](https://ray.io) for distributed parallelization (across multiple machines).

The former works out of the box, but for the latter you will need to install
additional dependencies (see [[installation#extras]])
and provide a running cluster (or run Ray in local mode).

As of v0.7.0 pyDVL does not allow requesting resources per task sent to the
cluster, so you will need to make sure that each worker has enough resources to
handle the tasks it receives. A data valuation task using game-theoretic methods
will typically make a copy of the whole model and dataset to each worker, even
if the re-training only happens on a subset of the data. This means that you
should make sure that each worker has enough memory to handle the whole dataset.

#### Ray

Please follow the instructions in Ray's documentation to set up a cluster.
Once you have a running cluster, you can use it by passing the address
of the head node to parallel methods via [ParallelConfig][pydvl.parallel.config.ParallelConfig].

For a local ray cluster you would use:

```python
from pydvl.parallel.config import ParallelConfig
config = ParallelConfig(backend="ray")
```
123 changes: 63 additions & 60 deletions docs/getting-started/installation.md
@@ -13,75 +13,78 @@ To install the latest release use:
pip install pyDVL
```

To use all features of influence functions use instead:

```shell
pip install pyDVL[influence]
```

This includes a dependency on [PyTorch](https://pytorch.org/) (Version 2.0 and
above) and thus is left out by default.

In case that you have a supported version of CUDA installed (v11.2 to 11.8 as of
this writing), you can enable eigenvalue computations for low-rank approximations
with [CuPy](https://docs.cupy.dev/en/stable/index.html) on the GPU by using:

```shell
pip install pyDVL[cupy]
```

If you use a different version of CUDA, please install CuPy
[manually](https://docs.cupy.dev/en/stable/install.html).

In order to check the installation you can use:

```shell
python -c "import pydvl; print(pydvl.__version__)"
```

You can also install the latest development version from
[TestPyPI](https://test.pypi.org/project/pyDVL/):

```shell
pip install pyDVL --index-url https://test.pypi.org/simple/
```

## Dependencies

pyDVL requires Python >= 3.8, [Memcached](https://memcached.org/) for caching
and [Ray](https://ray.io) for parallelization in a cluster (locally it uses joblib).
Additionally, the [Influence functions][pydvl.influence] module requires PyTorch
(see [[installation]]).

ray is used to distribute workloads across nodes in a cluster (it can be used
locally as well, but for this we recommend joblib instead). Please follow the
instructions in their documentation to set up the cluster. Once you have a
running cluster, you can use it by passing the address of the head node to
parallel methods via [ParallelConfig][pydvl.utils.parallel].

## Setting up the cache

[memcached](https://memcached.org/) is an in-memory key-value store accessible
over the network. pyDVL uses it to cache the computation of the utility function
and speed up some computations (in particular, semi-value computations with the
[PermutationSampler][pydvl.value.sampler.PermutationSampler] but other methods
may benefit as well).

You can either install it as a package or run it inside a docker container (the
simplest). For installation instructions, refer to the [Getting
started](https://github.com/memcached/memcached/wiki#getting-started) section in
memcached's wiki. Then you can run it with:
In order to check the installation you can use:

```shell
memcached -u user
python -c "import pydvl; print(pydvl.__version__)"
```

To run memcached inside a container in daemon mode instead, do:

```shell
docker container run -d --rm -p 11211:11211 memcached:latest
```
## Dependencies

!!! tip "Using the cache"
Continue reading about the cache in the [First Steps](first-steps.md#caching)
and the documentation for the [caching module][pydvl.utils.caching].
pyDVL requires Python >= 3.8, [numpy](https://numpy.org/),
[scikit-learn](https://scikit-learn.org/stable/), [scipy](https://scipy.org/),
[cvxpy](https://www.cvxpy.org/) for the Core methods,
and [joblib](https://joblib.readthedocs.io/en/stable/)
for parallelization locally. Additionally, the [Influence functions][pydvl.influence]
module requires PyTorch (see [[installation#extras]]).

### Extras

pyDVL has a few [extra](https://peps.python.org/pep-0508/#extras) dependencies
that can be optionally installed:

- `influence`:

To use all features of influence functions use instead:

```shell
pip install pyDVL[influence]
```

This includes a dependency on [PyTorch](https://pytorch.org/) (Version 2.0 and
above) and thus is left out by default.

- `cupy`:

In case that you have a supported version of CUDA installed (v11.2 to 11.8 as of
this writing), you can enable eigenvalue computations for low-rank approximations
with [CuPy](https://docs.cupy.dev/en/stable/index.html) on the GPU by using:

```shell
pip install pyDVL[cupy]
```

This installs [cupy-cuda11x](https://pypi.org/project/cupy-cuda11x/).

If you use a different version of CUDA, please install CuPy
[manually](https://docs.cupy.dev/en/stable/install.html).

- `ray`:

If you want to use [Ray](https://www.ray.io/) to distribute data valuation
workloads across nodes in a cluster (it can be used locally as well,
but for this we recommend joblib instead), install pyDVL using:

```shell
pip install pyDVL[ray]
```

See [[getting-started#ray]] for more details on how to use it.

- `memcached`:

If you want to use [Memcached](https://memcached.org/) for caching
utility evaluations, use:

```shell
pip install pyDVL[memcached]
```

This additionally installs [pymemcache](https://github.com/pinterest/pymemcache).
1 change: 1 addition & 0 deletions mkdocs.yml
@@ -69,6 +69,7 @@ plugins:
- https://scikit-learn.org/stable/objects.inv
- https://pytorch.org/docs/stable/objects.inv
- https://pymemcache.readthedocs.io/en/latest/objects.inv
- https://joblib.readthedocs.io/en/stable/objects.inv
- https://docs.dask.org/en/latest/objects.inv
- https://distributed.dask.org/en/latest/objects.inv
paths: [ src ] # search packages in the src folder
1 change: 0 additions & 1 deletion requirements.txt
@@ -5,7 +5,6 @@ scikit-learn
scipy>=1.7.0
cvxpy>=1.3.0
joblib
pymemcache
cloudpickle
tqdm
matplotlib
1 change: 1 addition & 0 deletions setup.py
@@ -23,6 +23,7 @@
tests_require=["pytest"],
extras_require={
"cupy": ["cupy-cuda11x>=12.1.0"],
"memcached": ["pymemcache"],
"influence": [
"torch>=2.0.0",
"dask>=2023.5.0",