diff --git a/docs/getting-started/first-steps.md b/docs/getting-started/first-steps.md
index dcbe54d24..403724362 100644
--- a/docs/getting-started/first-steps.md
+++ b/docs/getting-started/first-steps.md
@@ -34,15 +34,15 @@ by browsing our worked-out examples illustrating pyDVL's capabilities either:
 have to install jupyter first manually since it's not a dependency of the
 library.
 
-# Advanced usage
+## Advanced usage
 
 Besides the dos and don'ts of data valuation itself, which are the subject of
 the examples and the documentation of each method, there are two main things to
 keep in mind when using pyDVL.
 
-## Caching
+### Caching
 
-PyDVL can cache the computation of the utility function
+PyDVL can cache (memoize) the computation of the utility function
 and speed up some computations for data valuation.
 It is however disabled by default.
 When it is enabled it takes into account the data indices passed as argument
@@ -58,7 +58,7 @@ the same utility function computation, is very low.
 However, it can be very useful when comparing methods that use the same utility
 function, or when running multiple experiments with the same data.
 
-pyDVL supports different caching backends:
+pyDVL supports 3 different caching backends:
 
 - [InMemoryCacheBackend][pydvl.utils.caching.memory.InMemoryCacheBackend]:
   an in-memory cache backend that uses a dictionary to store and retrieve
@@ -85,7 +85,7 @@ pyDVL supports different caching backends:
 Continue reading about the cache in the documentation
 for the [caching package][pydvl.utils.caching].
 
-### Setting up the Memcached cache
+#### Setting up the Memcached cache
 
 [Memcached](https://memcached.org/) is an in-memory key-value store accessible over the network.
 pyDVL can use it to cache the computation of the utility function
@@ -108,7 +108,7 @@ To run memcached inside a container in daemon mode instead, use:
 docker container run -d --rm -p 11211:11211 memcached:latest
 ```
 
-## Parallelization
+### Parallelization
 
 pyDVL uses [joblib](https://joblib.readthedocs.io/en/latest/) for local
 parallelization (within one machine) and supports using
@@ -125,7 +125,7 @@ will typically make a copy of the whole model and dataset to each worker, even
 if the re-training only happens on a subset of the data. This means that you
 should make sure that each worker has enough memory to handle the whole dataset.
 
-### Ray
+#### Ray
 
 Please follow the instructions in Ray's documentation to set up a cluster.
 Once you have a running cluster, you can use it by passing the address
diff --git a/src/pydvl/utils/caching/__init__.py b/src/pydvl/utils/caching/__init__.py
index bc72741e0..1089628bc 100644
--- a/src/pydvl/utils/caching/__init__.py
+++ b/src/pydvl/utils/caching/__init__.py
@@ -1,6 +1,7 @@
 """Caching of functions.
 
-pyDVL caches (memoizes) utility values to allow reusing previously computed evaluations.
+PyDVL can cache (memoize) the computation of the utility function
+and speed up some computations for data valuation.
 
 !!! Warning
     Function evaluations are cached with a key based on the function's signature
@@ -10,67 +11,65 @@
 
 # Configuration
 
-Memoization is disabled by default but can be enabled easily,
+Caching is disabled by default but can be enabled easily,
 see [Setting up the cache](#setting-up-the-cache).
 When enabled, it will be added to any callable used to construct a
-[Utility][pydvl.utils.utility.Utility] (done with the decorator [@memcached][pydvl.utils.caching.memcached]).
+[Utility][pydvl.utils.utility.Utility] (done with the wrap method of
+[CacheBackend][pydvl.utils.caching.base.CacheBackend]).
 Depending on the nature of the utility you might want to
 enable the computation of a running average of function values, see
 [Usage with stochastic functions](#usaage-with-stochastic-functions).
-You can see all configuration options under [MemcachedConfig][pydvl.utils.config.MemcachedConfig].
+You can see all configuration options under
+[CachedFuncConfig][pydvl.utils.caching.config.CachedFuncConfig].
 
-## Default configuration
+# Supported Backends
 
-```python
-default_config = dict(
-    server=('localhost', 11211),
-    connect_timeout=1.0,
-    timeout=0.1,
-    # IMPORTANT! Disable small packet consolidation:
-    no_delay=True,
-    serde=serde.PickleSerde(pickle_version=PICKLE_VERSION)
-)
-```
+pyDVL supports 3 different caching backends:
 
-# Supported Backends
+- [InMemoryCacheBackend][pydvl.utils.caching.memory.InMemoryCacheBackend]:
+  an in-memory cache backend that uses a dictionary to store and retrieve
+  cached values. This is used to share cached values between threads
+  in a single process.
+- [DiskCacheBackend][pydvl.utils.caching.disk.DiskCacheBackend]:
+  a disk-based cache backend that uses pickled values written to and read from disk.
+  This is used to share cached values between processes in a single machine.
+- [MemcachedCacheBackend][pydvl.utils.caching.memcached.MemcachedCacheBackend]:
+  a [Memcached](https://memcached.org/)-based cache backend that uses pickled values written to
+  and read from a Memcached server. This is used to share cached values
+  between processes across multiple machines.
 
-- [InMemoryCacheBackend][]
-- [DiskCacheBackend][]
-- [MemcachedCacheBackend][]
+  **Note** This specific backend requires optional dependencies.
+  See [[installation#extras]] for more information.
 
 # Usage with stochastic functions
 
-In addition to standard memoization, the decorator
-[memcached()][pydvl.utils.caching.memcached] can compute running average and
-standard error of repeated evaluations for the same input. This can be useful
-for stochastic functions with high variance (e.g. model training for small
-sample sizes), but drastically reduces the speed benefits of memoization.
+In addition to standard memoization, the wrapped functions
+can compute running average and standard error of repeated evaluations
+for the same input. This can be useful for stochastic functions with high variance
+(e.g. model training for small sample sizes), but drastically reduces
+the speed benefits of memoization.
 
-This behaviour can be activated with the argument `allow_repeated_evaluations`
-to [memcached()][pydvl.utils.caching.memcached].
+This behaviour can be activated with the option
+[allow_repeated_evaluations][pydvl.utils.caching.config.CachedFuncConfig].
 
 # Cache reuse
 
-When working directly with [memcached()][pydvl.utils.caching.memcached], it is
+When working directly with [CachedFunc][pydvl.utils.caching.base.CachedFunc], it is
 essential to only cache pure functions. If they have any kind of state, either
 internal or external (e.g. a closure over some data that may change), then the
 cache will fail to notice this and the same value will be returned.
 
-When a function is wrapped with [memcached()][pydvl.utils.caching.memcached] for
-memoization, its signature (input and output names) and code are used as a key
-for the cache. Alternatively you can pass a custom value to be used as key with
-
-```python
-cached_fun = memcached(**asdict(cache_options))(fun, signature=custom_signature)
-```
+When a function is wrapped with [CachedFunc][pydvl.utils.caching.base.CachedFunc]
+for memoization, its signature (input and output names) and code are used as a key
+for the cache.
 
 If you are running experiments with the same [Utility][pydvl.utils.utility.Utility]
 but different datasets, this will lead to evaluations of the utility on new data
 returning old values because utilities only use sample indices as arguments (so
 there is no way to tell the difference between '1' for dataset A and '1' for
 dataset 2 from the point of view of the cache). One solution is to empty the
-cache between runs, but the preferred one is to **use a different Utility
-object for each dataset**.
+cache between runs by calling the `clear` method of the cache backend instance,
+but the preferred one is to **use a different Utility object for each dataset**.
 
 # Unexpected cache misses
 
@@ -79,7 +78,7 @@
 run across multiple processes and some reporting arguments are added (like a
 `job_id` for logging purposes), these will be part of the signature and make the
 functions distinct to the eyes of the cache. This can be avoided with the use of
-[ignore_args][pydvl.utils.config.MemcachedConfig] in the configuration.
+the [ignore_args][pydvl.utils.caching.config.CachedFuncConfig] option in the configuration.
 
 """
 from .base import *
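
Reviewer note: the cache-reuse caveat in the docstring above can be reproduced with a toy memoizer. The following is a minimal sketch in plain Python that does not use pyDVL at all. `ToyCache`, `make_utility` and the two datasets are hypothetical names, and the key is built from the sample indices alone to mimic one utility being reused with different data. pyDVL's actual `CacheBackend` and `CachedFunc` are not shown here and may differ in their details.

```python
# Hypothetical illustration only: ToyCache is NOT pyDVL's CacheBackend.
from typing import Callable, Dict, FrozenSet, Iterable, List


class ToyCache:
    """Dictionary-backed memoizer keyed on the set of sample indices only."""

    def __init__(self) -> None:
        self._store: Dict[FrozenSet[int], float] = {}
        self.hits = 0
        self.misses = 0

    def wrap(
        self, fun: Callable[[FrozenSet[int]], float]
    ) -> Callable[[Iterable[int]], float]:
        def cached(indices: Iterable[int]) -> float:
            key = frozenset(indices)  # the key carries no dataset identity
            if key in self._store:
                self.hits += 1
                return self._store[key]
            self.misses += 1
            value = fun(key)
            self._store[key] = value
            return value

        return cached

    def clear(self) -> None:
        """Drop every cached value."""
        self._store.clear()


def make_utility(data: List[float], cache: ToyCache) -> Callable[[Iterable[int]], float]:
    """Toy utility: the sum of the selected entries of `data`."""
    return cache.wrap(lambda idx: float(sum(data[i] for i in idx)))


dataset_a = [1.0, 2.0, 3.0]
dataset_b = [10.0, 20.0, 30.0]

shared = ToyCache()
u_a = make_utility(dataset_a, shared)
u_b = make_utility(dataset_b, shared)

value_a = u_a([0, 1])  # computed on dataset A
stale_b = u_b([0, 1])  # cache hit: A's value is returned for dataset B

shared.clear()         # emptying the cache between runs removes the stale entry
fresh_b = u_b([0, 1])  # recomputed on dataset B
```

Giving each dataset its own cache object, the analogue of a separate Utility per dataset, avoids the stale hit without relying on anyone remembering the `clear` call.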