Create abstraction for caching #458

Merged 32 commits on Dec 19, 2023

Commits (32)
ff476b9
Move existing caching module to new memcached module withing new cach…
AnesBenmerzoug Oct 30, 2023
5337451
Refactor caching into separate classes and add 2 more implementations
AnesBenmerzoug Nov 23, 2023
f5c947e
Adapt Utility to caching change
AnesBenmerzoug Nov 23, 2023
ab96b81
Adapt tests
AnesBenmerzoug Nov 23, 2023
0811ede
Remove caching section from readme
AnesBenmerzoug Nov 23, 2023
44c5f57
Rename CacheBackendBase to CacheBackend, improve docstrings
AnesBenmerzoug Nov 23, 2023
073745f
Make pymemcached an optional dependency, define new memcached extra
AnesBenmerzoug Nov 23, 2023
6b2e3a7
Add joblib documentation inventory
AnesBenmerzoug Nov 23, 2023
6b5e6ee
Update and improve installation and first-steps docs
AnesBenmerzoug Nov 23, 2023
82fa952
Add link to extras section of docs to readme
AnesBenmerzoug Nov 23, 2023
16ae640
Update changelog
AnesBenmerzoug Nov 23, 2023
fbc96cf
Fix type hints
AnesBenmerzoug Nov 24, 2023
ab4578a
Remove leftover uses of enable_cache argument
AnesBenmerzoug Nov 24, 2023
6ce679a
Use name cache_backend instead of cache
AnesBenmerzoug Nov 24, 2023
ab9a20a
Fix tests
AnesBenmerzoug Nov 24, 2023
8cb02ac
More fixes
AnesBenmerzoug Nov 24, 2023
89472bc
Handle usage of MemcachedCacheBackend when pymemcache is not installed
AnesBenmerzoug Nov 26, 2023
030763d
Fix and improve caching package's docstring
AnesBenmerzoug Nov 26, 2023
763f6ec
Fix tests
AnesBenmerzoug Nov 26, 2023
bc92356
Add test for case when pymemcache is not installed
AnesBenmerzoug Dec 9, 2023
bb69b77
Add test for cache backend serialization
AnesBenmerzoug Dec 9, 2023
486b43a
Use newly created temporary directory for DiskCacheBackend
AnesBenmerzoug Dec 9, 2023
ef8bd33
Set default value of cached_func_options to None
AnesBenmerzoug Dec 9, 2023
798c232
Merge branch 'develop' into feature/create-abstraction-for-cache
AnesBenmerzoug Dec 13, 2023
73e8e54
Set backend time_threshold to 0.3
AnesBenmerzoug Dec 13, 2023
287ca1d
Fix test of utility with cache
AnesBenmerzoug Dec 13, 2023
196b310
Add hash_prefix parameter to CachedFuncConfig, use it in utility
AnesBenmerzoug Dec 14, 2023
02da342
Please mypy
AnesBenmerzoug Dec 14, 2023
a2662f2
Use builtin hash to compute hash_prefix
AnesBenmerzoug Dec 14, 2023
6b0b60c
Merge branch 'develop' into feature/create-abstraction-for-cache
AnesBenmerzoug Dec 17, 2023
471a64e
Add suggestions from review session
schroedk Dec 18, 2023
e0f4fc5
Merge branch 'develop' into feature/create-abstraction-for-cache
AnesBenmerzoug Dec 18, 2023
5 changes: 4 additions & 1 deletion CHANGELOG.md
@@ -1,10 +1,11 @@
# Changelog


## Unreleased

### Added

- New cache backends: InMemoryCacheBackend and DiskCacheBackend
[PR #458](https://github.com/aai-institute/pyDVL/pull/458)
- New influence function interface `InfluenceFunctionModel`
- Data parallel computation with `DaskInfluenceCalculator`
[PR #26](https://github.com/aai-institute/pyDVL/issues/26)
@@ -15,6 +16,8 @@

### Changed

- Refactor and simplify caching implementation
[PR #458](https://github.com/aai-institute/pyDVL/pull/458)
- Simplify display of computation progress
[PR #466](https://github.com/aai-institute/pyDVL/pull/466)
- Improve readme and explain better the examples
95 changes: 75 additions & 20 deletions docs/getting-started/first-steps.md
@@ -9,8 +9,7 @@ alias:

!!! Warning
Make sure you have read [[installation]] before using the library.
In particular read about how caching and parallelization work,
since they might require additional setup.
In particular read about which extra dependencies you may need.

## Main concepts

@@ -23,7 +22,6 @@ should be enough to get you started.
computation and related methods.
* [[influence-values]] for instructions on how to compute influence functions.


## Running the examples

If you are somewhat familiar with the concepts of data valuation, you can start
@@ -36,23 +34,22 @@ by browsing our worked-out examples illustrating pyDVL's capabilities either:
have to install jupyter first manually since it's not a dependency of the
library.

# Advanced usage
## Advanced usage

Besides the do's and don'ts of data valuation itself, which are the subject of
Besides the dos and don'ts of data valuation itself, which are the subject of
the examples and the documentation of each method, there are two main things to
keep in mind when using pyDVL.

## Caching

pyDVL uses [memcached](https://memcached.org/) to cache the computation of the
utility function and speed up some computations (see the [installation
guide](installation.md/#setting-up-the-cache)).
### Caching

Caching of the utility function is disabled by default. When it is enabled it
takes into account the data indices passed as argument and the utility function
wrapped into the [Utility][pydvl.utils.utility.Utility] object. This means that
pyDVL can cache (memoize) the computation of the utility function
and speed up some computations for data valuation.
It is, however, disabled by default.
When it is enabled it takes into account the data indices passed as argument
and the utility function wrapped into the
[Utility][pydvl.utils.utility.Utility] object. This means that
care must be taken when reusing the same utility function with different data,
see the documentation for the [caching module][pydvl.utils.caching] for more
see the documentation for the [caching package][pydvl.utils.caching] for more
information.
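To see why care is needed, here is a self-contained toy illustration in plain Python (not pyDVL's actual classes; the per-dataset prefix below mirrors the idea behind the `hash_prefix` parameter added to `CachedFuncConfig` in this PR):

```python
# Toy memoized utility keyed only on data indices (illustrative, not pyDVL
# code). Reusing it with different data silently returns stale values; a
# per-dataset prefix in the cache key avoids the collision.

cache: dict = {}

def cached_utility(data, indices, prefix=""):
    key = (prefix, indices)
    if key not in cache:
        # Stand-in for an expensive train-and-score step on data[indices].
        cache[key] = sum(data[i] for i in indices)
    return cache[key]

data_a = [1.0, 2.0, 3.0]
data_b = [10.0, 20.0, 30.0]

u_a = cached_utility(data_a, (0, 1))        # computed: 3.0
u_stale = cached_utility(data_b, (0, 1))    # stale hit: still 3.0!
u_b = cached_utility(data_b, (0, 1), "b")   # distinct key: 30.0
```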

In general, caching won't play a major role in the computation of Shapley values
@@ -61,24 +58,82 @@ the same utility function computation, is very low. However, it can be very
useful when comparing methods that use the same utility function, or when
running multiple experiments with the same data.

pyDVL supports 3 different caching backends:

- [InMemoryCacheBackend][pydvl.utils.caching.memory.InMemoryCacheBackend]:
an in-memory cache backend that uses a dictionary to store and retrieve
cached values. This is used to share cached values between threads
in a single process.
- [DiskCacheBackend][pydvl.utils.caching.disk.DiskCacheBackend]:
a disk-based cache backend that uses pickled values written to and read from disk.
This is used to share cached values between processes in a single machine.
- [MemcachedCacheBackend][pydvl.utils.caching.memcached.MemcachedCacheBackend]:
a [Memcached](https://memcached.org/)-based cache backend that uses pickled values written to
and read from a Memcached server. This is used to share cached values
between processes across multiple machines.

**Note**: this specific backend requires optional dependencies.
See [[installation#extras]] for more information.
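The interface these backends share can be sketched in a few lines of plain Python. This is illustrative only — the class and method names below are assumptions, not pyDVL's actual API; see the [caching package][pydvl.utils.caching] for the real one:

```python
from abc import ABC, abstractmethod
from typing import Any, Callable, Optional


class CacheBackend(ABC):
    """Minimal key-value interface that every backend implements."""

    @abstractmethod
    def get(self, key: str) -> Optional[Any]: ...

    @abstractmethod
    def set(self, key: str, value: Any) -> None: ...

    def wrap(self, func: Callable) -> Callable:
        """Memoize func through this backend, keyed on its arguments."""
        def wrapper(*args):
            key = repr(args)
            cached = self.get(key)
            if cached is not None:
                return cached
            value = func(*args)
            self.set(key, value)
            return value
        return wrapper


class InMemoryBackend(CacheBackend):
    """Dict-backed backend: values shared between threads of one process."""

    def __init__(self) -> None:
        self._store: dict = {}

    def get(self, key: str) -> Optional[Any]:
        return self._store.get(key)

    def set(self, key: str, value: Any) -> None:
        self._store[key] = value
```

A disk- or Memcached-based backend then only needs to re-implement `get` and `set` against pickled files or a server, while the memoization logic stays in the base class.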

!!! tip "When is the cache really necessary?"
Crucially, semi-value computations with the
[PermutationSampler][pydvl.value.sampler.PermutationSampler] require caching
to be enabled, or they will take twice as long as the direct implementation
in [compute_shapley_values][pydvl.value.shapley.compute_shapley_values].
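The effect can be counted in a toy example (plain Python, not pyDVL code): walking one permutation with a memoized utility, each step's `U(prefix)` is a cache hit from the previous step, so the loop does roughly half the evaluations of the uncached version.

```python
evaluations = 0
cache: dict = {}

def utility(subset: frozenset) -> float:
    """Stand-in for training and scoring a model on subset."""
    global evaluations
    if subset in cache:
        return cache[subset]
    evaluations += 1
    cache[subset] = float(len(subset))
    return cache[subset]

# One permutation of 3 points: marginal of i is U(prefix | {i}) - U(prefix).
permutation = [2, 0, 1]
marginals = {}
prefix: frozenset = frozenset()
for i in permutation:
    with_i = prefix | {i}
    marginals[i] = utility(with_i) - utility(prefix)
    prefix = with_i

# Without memoization this loop would call U six times; with it, only four,
# because U(prefix) was already computed in the previous iteration.
```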

## Parallelization
!!! tip "Using the cache"
Continue reading about the cache in the documentation
for the [caching package][pydvl.utils.caching].

#### Setting up the Memcached cache

[Memcached](https://memcached.org/) is an in-memory key-value store accessible
over the network. pyDVL can use it to cache the computation of the utility function
and speed up some computations (in particular, semi-value computations with the
[PermutationSampler][pydvl.value.sampler.PermutationSampler] but other methods
may benefit as well).

You can either install it as a package or run it inside a docker container (the
simplest option). For installation instructions, refer to the [Getting
started](https://github.com/memcached/memcached/wiki#getting-started) section in
memcached's wiki. Then you can run it with:

pyDVL supports [joblib](https://joblib.readthedocs.io/en/latest/) for local
parallelization (within one machine) and [ray](https://ray.io) for distributed
parallelization (across multiple machines).
```shell
memcached -u user
```

The former works out of the box but for the latter you will need to provide a
running cluster (or run ray in local mode).
To run memcached inside a container in daemon mode instead, use:

```shell
docker container run -d --rm -p 11211:11211 memcached:latest
```
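Either way, you can check that the server is reachable on the default port (assuming `nc` is available; flags vary between netcat implementations):

```shell
printf 'version\r\nquit\r\n' | nc localhost 11211
```

The server should answer with a line like `VERSION 1.6.x`.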

### Parallelization

pyDVL uses [joblib](https://joblib.readthedocs.io/en/latest/) for local
parallelization (within one machine) and supports using
[Ray](https://ray.io) for distributed parallelization (across multiple machines).

The former works out of the box, but for the latter you will need to install
additional dependencies (see [[installation#extras]])
and provide a running cluster (or run Ray in local mode).

As of v0.7.0 pyDVL does not allow requesting resources per task sent to the
cluster, so you will need to make sure that each worker has enough resources to
handle the tasks it receives. A data valuation task using game-theoretic methods
will typically make a copy of the whole model and dataset to each worker, even
if the re-training only happens on a subset of the data. This means that you
should make sure that each worker has enough memory to handle the whole dataset.

#### Ray

Please follow the instructions in Ray's documentation to set up a cluster.
Once you have a running cluster, you can use it by passing the address
of the head node to parallel methods via [ParallelConfig][pydvl.parallel.config.ParallelConfig].

For a local ray cluster you would use:

```python
from pydvl.parallel.config import ParallelConfig
config = ParallelConfig(backend="ray")
```
123 changes: 63 additions & 60 deletions docs/getting-started/installation.md
@@ -13,75 +13,78 @@ To install the latest release use:
pip install pyDVL
```

To use all features of influence functions use instead:

```shell
pip install pyDVL[influence]
```

This includes a dependency on [PyTorch](https://pytorch.org/) (Version 2.0 and
above) and thus is left out by default.

In case that you have a supported version of CUDA installed (v11.2 to 11.8 as of
this writing), you can enable eigenvalue computations for low-rank approximations
with [CuPy](https://docs.cupy.dev/en/stable/index.html) on the GPU by using:

```shell
pip install pyDVL[cupy]
```

If you use a different version of CUDA, please install CuPy
[manually](https://docs.cupy.dev/en/stable/install.html).

In order to check the installation you can use:

```shell
python -c "import pydvl; print(pydvl.__version__)"
```

You can also install the latest development version from
[TestPyPI](https://test.pypi.org/project/pyDVL/):

```shell
pip install pyDVL --index-url https://test.pypi.org/simple/
```

## Dependencies

pyDVL requires Python >= 3.8, [Memcached](https://memcached.org/) for caching
and [Ray](https://ray.io) for parallelization in a cluster (locally it uses joblib).
Additionally, the [Influence functions][pydvl.influence] module requires PyTorch
(see [[installation]]).

ray is used to distribute workloads across nodes in a cluster (it can be used
locally as well, but for this we recommend joblib instead). Please follow the
instructions in their documentation to set up the cluster. Once you have a
running cluster, you can use it by passing the address of the head node to
parallel methods via [ParallelConfig][pydvl.utils.parallel].

## Setting up the cache

[memcached](https://memcached.org/) is an in-memory key-value store accessible
over the network. pyDVL uses it to cache the computation of the utility function
and speed up some computations (in particular, semi-value computations with the
[PermutationSampler][pydvl.value.sampler.PermutationSampler] but other methods
may benefit as well).

You can either install it as a package or run it inside a docker container (the
simplest). For installation instructions, refer to the [Getting
started](https://github.com/memcached/memcached/wiki#getting-started) section in
memcached's wiki. Then you can run it with:
In order to check the installation you can use:

```shell
memcached -u user
python -c "import pydvl; print(pydvl.__version__)"
```

To run memcached inside a container in daemon mode instead, do:

```shell
docker container run -d --rm -p 11211:11211 memcached:latest
```
## Dependencies

!!! tip "Using the cache"
Continue reading about the cache in the [First Steps](first-steps.md#caching)
and the documentation for the [caching module][pydvl.utils.caching].
pyDVL requires Python >= 3.8, [numpy](https://numpy.org/),
[scikit-learn](https://scikit-learn.org/stable/), [scipy](https://scipy.org/),
[cvxpy](https://www.cvxpy.org/) for the Core methods,
and [joblib](https://joblib.readthedocs.io/en/stable/)
for parallelization locally. Additionally, the [Influence functions][pydvl.influence]
module requires PyTorch (see [[installation#extras]]).

### Extras

pyDVL has a few [extra](https://peps.python.org/pep-0508/#extras) dependencies
that can be optionally installed:

- `influence`:

To use all features of influence functions use instead:

```shell
pip install pyDVL[influence]
```

This includes a dependency on [PyTorch](https://pytorch.org/) (Version 2.0 and
above) and thus is left out by default.

- `cupy`:

In case that you have a supported version of CUDA installed (v11.2 to 11.8 as of
this writing), you can enable eigenvalue computations for low-rank approximations
with [CuPy](https://docs.cupy.dev/en/stable/index.html) on the GPU by using:

```shell
pip install pyDVL[cupy]
```

This installs [cupy-cuda11x](https://pypi.org/project/cupy-cuda11x/).

If you use a different version of CUDA, please install CuPy
[manually](https://docs.cupy.dev/en/stable/install.html).

- `ray`:

If you want to use [Ray](https://www.ray.io/) to distribute data valuation
workloads across nodes in a cluster (it can be used locally as well,
but for this we recommend joblib instead), install pyDVL using:

```shell
pip install pyDVL[ray]
```

See [[getting-started#ray]] for more details on how to use it.

- `memcached`:

If you want to use [Memcached](https://memcached.org/) for caching
utility evaluations, use:

```shell
pip install pyDVL[memcached]
```

This additionally installs [pymemcache](https://github.com/pinterest/pymemcache).
1 change: 1 addition & 0 deletions mkdocs.yml
@@ -69,6 +69,7 @@ plugins:
- https://scikit-learn.org/stable/objects.inv
- https://pytorch.org/docs/stable/objects.inv
- https://pymemcache.readthedocs.io/en/latest/objects.inv
- https://joblib.readthedocs.io/en/stable/objects.inv
- https://docs.dask.org/en/latest/objects.inv
- https://distributed.dask.org/en/latest/objects.inv
paths: [ src ] # search packages in the src folder
1 change: 0 additions & 1 deletion requirements.txt
@@ -5,7 +5,6 @@ scikit-learn
scipy>=1.7.0
cvxpy>=1.3.0
joblib
pymemcache
cloudpickle
tqdm
matplotlib
1 change: 1 addition & 0 deletions setup.py
@@ -23,6 +23,7 @@
tests_require=["pytest"],
extras_require={
"cupy": ["cupy-cuda11x>=12.1.0"],
"memcached": ["pymemcache"],
"influence": [
"torch>=2.0.0",
"dask>=2023.5.0",