Fix and improve caching package's docstring
AnesBenmerzoug committed Nov 26, 2023
1 parent 89472bc commit 030763d
Showing 2 changed files with 43 additions and 44 deletions.
14 changes: 7 additions & 7 deletions docs/getting-started/first-steps.md
@@ -34,15 +34,15 @@ by browsing our worked-out examples illustrating pyDVL's capabilities either:
have to install jupyter first manually since it's not a dependency of the
library.

-# Advanced usage
+## Advanced usage

Besides the dos and don'ts of data valuation itself, which are the subject of
the examples and the documentation of each method, there are two main things to
keep in mind when using pyDVL.

-## Caching
+### Caching

-PyDVL can cache the computation of the utility function
+PyDVL can cache (memoize) the computation of the utility function
and speed up some computations for data valuation.
It is however disabled by default.
When it is enabled it takes into account the data indices passed as argument
@@ -58,7 +58,7 @@ the same utility function computation, is very low. However, it can be very
useful when comparing methods that use the same utility function, or when
running multiple experiments with the same data.

-pyDVL supports different caching backends:
+pyDVL supports 3 different caching backends:

- [InMemoryCacheBackend][pydvl.utils.caching.memory.InMemoryCacheBackend]:
an in-memory cache backend that uses a dictionary to store and retrieve
@@ -85,7 +85,7 @@ pyDVL supports different caching backends:
Continue reading about the cache in the documentation
for the [caching package][pydvl.utils.caching].
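The shared interface of these backends can be illustrated with a minimal, self-contained sketch. This is plain Python mirroring the description above, not pyDVL's actual classes; the method names `get`, `set` and `clear` are assumptions for illustration:

```python
from typing import Any, Dict, Optional

class TinyInMemoryBackend:
    """Illustrative dictionary-backed cache, loosely mirroring an
    in-memory backend shared between threads of one process."""

    def __init__(self) -> None:
        self._store: Dict[str, Any] = {}
        self.hits = 0
        self.misses = 0

    def get(self, key: str) -> Optional[Any]:
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        return None

    def set(self, key: str, value: Any) -> None:
        self._store[key] = value

    def clear(self) -> None:
        self._store.clear()

backend = TinyInMemoryBackend()
# Key built from the sorted sample indices passed to the utility
key = "utility:" + ",".join(map(str, sorted((3, 1, 2))))
if backend.get(key) is None:
    backend.set(key, 0.87)  # pretend this was an expensive model evaluation
value = backend.get(key)
```

Keying on the sorted sample indices is what makes permutations of the same subset hit the same cache entry.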

-### Setting up the Memcached cache
+#### Setting up the Memcached cache

[Memcached](https://memcached.org/) is an in-memory key-value store accessible
over the network. pyDVL can use it to cache the computation of the utility function
@@ -108,7 +108,7 @@ To run memcached inside a container in daemon mode instead, use:
docker container run -d --rm -p 11211:11211 memcached:latest
```

-## Parallelization
+### Parallelization

pyDVL uses [joblib](https://joblib.readthedocs.io/en/latest/) for local
parallelization (within one machine) and supports using
@@ -125,7 +125,7 @@ will typically make a copy of the whole model and dataset to each worker, even
if the re-training only happens on a subset of the data. This means that you
should make sure that each worker has enough memory to handle the whole dataset.
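The pattern can be sketched with the standard library (threads are used here only for illustration; pyDVL's actual parallelization goes through joblib, and `utility` is a hypothetical stand-in for re-training a model on a subset):

```python
from concurrent.futures import ThreadPoolExecutor

dataset = list(range(100))  # stand-in for a real dataset

def utility(indices):
    """Stand-in for 'retrain the model on a subset and score it'."""
    subset = [dataset[i] for i in indices]
    return sum(subset) / len(subset)

subsets = [(0, 1), (0, 1, 2), (10, 20, 30)]
# Threads share `dataset`; process-based backends would instead pickle
# and copy the whole dataset to every worker, hence the memory caveat above.
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(utility, subsets))
```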

-### Ray
+#### Ray

Please follow the instructions in Ray's documentation to set up a cluster.
Once you have a running cluster, you can use it by passing the address
73 changes: 36 additions & 37 deletions src/pydvl/utils/caching/__init__.py
@@ -1,6 +1,7 @@
"""Caching of functions.
-pyDVL caches (memoizes) utility values to allow reusing previously computed evaluations.
+PyDVL can cache (memoize) the computation of the utility function
+and speed up some computations for data valuation.
!!! Warning
Function evaluations are cached with a key based on the function's signature
Expand All @@ -10,67 +11,65 @@
# Configuration
-Memoization is disabled by default but can be enabled easily,
+Caching is disabled by default but can be enabled easily,
see [Setting up the cache](#setting-up-the-cache).
When enabled, it will be added to any callable used to construct a
-[Utility][pydvl.utils.utility.Utility] (done with the decorator [@memcached][pydvl.utils.caching.memcached]).
+[Utility][pydvl.utils.utility.Utility] (done with the `wrap` method of
+[CacheBackend][pydvl.utils.caching.base.CacheBackend]).
Depending on the nature of the utility you might want to
enable the computation of a running average of function values, see
[Usage with stochastic functions](#usage-with-stochastic-functions).
-You can see all configuration options under [MemcachedConfig][pydvl.utils.config.MemcachedConfig].
+You can see all configuration options under
+[CachedFuncConfig][pydvl.utils.caching.config.CachedFuncConfig].
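In plain Python, wrapping a callable with a cache backend boils down to the decorator pattern below. This is only a sketch of the idea; `wrap_with_cache` is a hypothetical helper, not pyDVL's `wrap` method, and real backends also hash the function's code:

```python
import functools

def wrap_with_cache(fun, store=None):
    """Return a memoized version of `fun`, keyed on its arguments.
    Illustrative only: real backends also hash the function's code."""
    store = {} if store is None else store

    @functools.wraps(fun)
    def wrapper(*args):
        key = (fun.__name__, args)
        if key not in store:
            store[key] = fun(*args)
        return store[key]

    wrapper.cache = store
    return wrapper

calls = []

def expensive_utility(indices):
    calls.append(indices)       # record every real evaluation
    return len(indices) * 0.1   # stand-in for training and scoring a model

cached = wrap_with_cache(expensive_utility)
a = cached((1, 2, 3))
b = cached((1, 2, 3))  # served from the cache, no second evaluation
```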
-## Default configuration
+# Supported Backends
-```python
-default_config = dict(
-    server=('localhost', 11211),
-    connect_timeout=1.0,
-    timeout=0.1,
-    # IMPORTANT! Disable small packet consolidation:
-    no_delay=True,
-    serde=serde.PickleSerde(pickle_version=PICKLE_VERSION)
-)
-```
+pyDVL supports 3 different caching backends:
-# Supported Backends
+- [InMemoryCacheBackend][pydvl.utils.caching.memory.InMemoryCacheBackend]:
+  an in-memory cache backend that uses a dictionary to store and retrieve
+  cached values. This is used to share cached values between threads
+  in a single process.
+- [DiskCacheBackend][pydvl.utils.caching.disk.DiskCacheBackend]:
+  a disk-based cache backend that uses pickled values written to and read from disk.
+  This is used to share cached values between processes in a single machine.
+- [MemcachedCacheBackend][pydvl.utils.caching.memcached.MemcachedCacheBackend]:
+  a [Memcached](https://memcached.org/)-based cache backend that uses pickled values written to
+  and read from a Memcached server. This is used to share cached values
+  between processes across multiple machines.
-- [InMemoryCacheBackend][]
-- [DiskCacheBackend][]
-- [MemcachedCacheBackend][]
**Note** This specific backend requires optional dependencies.
See [[installation#extras]] for more information.
# Usage with stochastic functions
-In addition to standard memoization, the decorator
-[memcached()][pydvl.utils.caching.memcached] can compute running average and
-standard error of repeated evaluations for the same input. This can be useful
-for stochastic functions with high variance (e.g. model training for small
-sample sizes), but drastically reduces the speed benefits of memoization.
+In addition to standard memoization, the wrapped functions
+can compute running average and standard error of repeated evaluations
+for the same input. This can be useful for stochastic functions with high variance
+(e.g. model training for small sample sizes), but drastically reduces
+the speed benefits of memoization.
-This behaviour can be activated with the argument `allow_repeated_evaluations`
-to [memcached()][pydvl.utils.caching.memcached].
+This behaviour can be activated with the option
+[allow_repeated_evaluations][pydvl.utils.caching.config.CachedFuncConfig].
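The running average and its standard error can be maintained online with Welford's algorithm. This is a generic sketch of the statistic being described, not pyDVL's internal implementation:

```python
import math

class RunningMoments:
    """Welford's online mean/variance, as one would use for
    repeated noisy evaluations of the same input."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the running mean

    def update(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def standard_error(self) -> float:
        if self.n < 2:
            return float("inf")
        variance = self._m2 / (self.n - 1)  # sample variance
        return math.sqrt(variance / self.n)

stats = RunningMoments()
for value in (0.8, 1.0, 1.2):  # three noisy evaluations of the same input
    stats.update(value)
```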
# Cache reuse
-When working directly with [memcached()][pydvl.utils.caching.memcached], it is
+When working directly with [CachedFunc][pydvl.utils.caching.base.CachedFunc], it is
essential to only cache pure functions. If they have any kind of state, either
internal or external (e.g. a closure over some data that may change), then the
cache will fail to notice this and the same value will be returned.
-When a function is wrapped with [memcached()][pydvl.utils.caching.memcached] for
-memoization, its signature (input and output names) and code are used as a key
-for the cache. Alternatively you can pass a custom value to be used as key with
-```python
-cached_fun = memcached(**asdict(cache_options))(fun, signature=custom_signature)
-```
+When a function is wrapped with [CachedFunc][pydvl.utils.caching.base.CachedFunc]
+for memoization, its signature (input and output names) and code are used as a key
+for the cache.
If you are running experiments with the same [Utility][pydvl.utils.utility.Utility]
but different datasets, this will lead to evaluations of the utility on new data
returning old values because utilities only use sample indices as arguments (so
there is no way to tell the difference between '1' for dataset A and '1' for
dataset B from the point of view of the cache). One solution is to empty the
-cache between runs, but the preferred one is to **use a different Utility
-object for each dataset**.
+cache between runs by calling the `clear` method of the cache backend instance,
+but the preferred one is to **use a different Utility object for each dataset**.
# Unexpected cache misses
@@ -79,7 +78,7 @@
run across multiple processes and some reporting arguments are added (like a
`job_id` for logging purposes), these will be part of the signature and make the
functions distinct to the eyes of the cache. This can be avoided with the use of
-[ignore_args][pydvl.utils.config.MemcachedConfig] in the configuration.
+the [ignore_args][pydvl.utils.caching.config.CachedFuncConfig] option in the configuration.
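The effect of such an option can be sketched as filtering the ignored keyword arguments out of the key before hashing. Names and behaviour here are illustrative, not pyDVL's implementation:

```python
import hashlib

def cache_key(fun_name, ignore_args=(), **kwargs) -> str:
    """Build a cache key, dropping reporting-only arguments like `job_id`."""
    relevant = {k: v for k, v in kwargs.items() if k not in ignore_args}
    payload = fun_name + repr(sorted(relevant.items()))
    return hashlib.sha256(payload.encode()).hexdigest()

# Same indices, different job ids: the keys match because job_id is ignored.
k1 = cache_key("utility", ignore_args=("job_id",), indices=(0, 1), job_id=1)
k2 = cache_key("utility", ignore_args=("job_id",), indices=(0, 1), job_id=2)
```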
"""
from .base import *
