diff --git a/CHANGELOG.md b/CHANGELOG.md
index 46316b2d2..136136bba 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -14,6 +14,8 @@
### Changed
+- Improve README and explain the examples better
+ [PR #465](https://github.com/aai-institute/pyDVL/pull/465)
- Simplify and improve tests, add CodeCov code coverage
[PR #429](https://github.com/aai-institute/pyDVL/pull/429)
-
diff --git a/README.md b/README.md
index 2a8871307..9497bbd76 100644
--- a/README.md
+++ b/README.md
@@ -30,11 +30,251 @@
-pyDVL collects algorithms for Data Valuation and Influence Function computation.
+**pyDVL** collects algorithms for **Data Valuation** and **Influence Function** computation.
+
+**Data Valuation** for machine learning is the task of assigning a scalar
+to each element of a training set which reflects its contribution to the final
+performance or outcome of some model trained on it. Some concepts of
+value depend on a specific model of interest, while others are model-agnostic.
+pyDVL focuses on model-dependent methods.
+
+Figure: Comparison of different data valuation methods on best sample removal.
+
+The **Influence Function** is an infinitesimal measure of the effect that single
+training points have on the parameters of a model, or on any function thereof.
+In machine learning, it is also used to compute the effect of individual
+training samples on individual test points.
+
+Figure: Influences of input points with corrupted data. Highlighted points have flipped labels.
+
-Data Valuation is the task of estimating the intrinsic value of a data point
-wrt. the training set, the model and a scoring function. We currently implement
-methods from the following papers:
+# Installation
+
+To install the latest release use:
+
+```shell
+pip install pyDVL
+```
+
+You can also install the latest development version from
+[TestPyPI](https://test.pypi.org/project/pyDVL/):
+
+```shell
+pip install pyDVL --index-url https://test.pypi.org/simple/
+```
+
+pyDVL also has extra dependencies for certain functionalities (e.g. influence functions).
+
+For more instructions and information refer to
+[Installing pyDVL](https://pydvl.org/stable/getting-started/installation/)
+in the documentation.
+
+# Usage
+
+In the following subsections, we will showcase the usage of pyDVL
+for Data Valuation and Influence Functions using simple examples.
+
+For more instructions and information refer to [Getting
+Started](https://pydvl.org/stable/getting-started/first-steps/) in
+the documentation.
+We provide several examples for data valuation
+(e.g. [Shapley Data Valuation](https://pydvl.org/stable/examples/shapley_basic_spotify/))
+and for influence functions
+(e.g. [Influence Functions for Neural Networks](https://pydvl.org/stable/examples/influence_imagenet/))
+with details on the algorithms and their applications.
+
+## Influence Functions
+
+For influence computation, follow these steps:
+
+1. Import the necessary packages (the exact packages depend on your specific use case).
+
+ ```python
+ import torch
+ from torch import nn
+ from torch.utils.data import DataLoader, TensorDataset
+
+ from pydvl.influence.torch import DirectInfluence
+ from pydvl.influence.torch.util import TorchCatAggregator, TorchNumpyConverter
+ from pydvl.influence import SequentialInfluenceCalculator
+ ```
+
+2. Create PyTorch data loaders for your train and test splits.
+
+ ```python
+ input_dim = (5, 5, 5)
+ output_dim = 3
+ train_x = torch.rand((10, *input_dim))
+ train_y = torch.rand((10, output_dim))
+ test_x = torch.rand((5, *input_dim))
+ test_y = torch.rand((5, output_dim))
+
+ train_data_loader = DataLoader(TensorDataset(train_x, train_y), batch_size=2)
+ test_data_loader = DataLoader(TensorDataset(test_x, test_y), batch_size=1)
+ ```
+
+3. Instantiate your neural network model.
+
+ ```python
+ nn_architecture = nn.Sequential(
+     nn.Conv2d(in_channels=5, out_channels=3, kernel_size=3),
+     nn.Flatten(),
+     nn.Linear(27, 3),
+ )
+ ```
+
+4. Define your loss:
+
+ ```python
+ loss = nn.MSELoss()
+ ```
+
+5. Instantiate an `InfluenceFunctionModel` and fit it to the training data.
+
+ ```python
+ influence_function_model = DirectInfluence(nn_architecture, loss, hessian_regularization=0.01)
+ influence_function_model = influence_function_model.fit(train_data_loader)
+ ```
+
+6. For small input data, call the `influences` method on the fitted instance.
+
+ ```python
+ influences = influence_function_model.influences(test_x, test_y, train_x, train_y)
+ ```
+ The result is a tensor of shape `(training samples x test samples)`
+ that contains at index `(i, j)` the influence of training sample `i` on
+ test sample `j`.
+
+7. For larger data, wrap the model in a calculator and call methods on the
+ calculator.
+
+ ```python
+ sequential_influence_calculator = SequentialInfluenceCalculator(influence_function_model)
+
+ # Lazy object providing arrays batch-wise in a sequential manner
+ lazy_influences = sequential_influence_calculator.influences(test_data_loader, train_data_loader)
+
+ # Trigger computation and pull results to memory
+ influences = lazy_influences.compute(tensor_aggregator=TorchCatAggregator())
+
+ # Trigger computation and write results batch-wise to disk
+ lazy_influences.to_zarr("influences_result", TorchNumpyConverter())
+ ```
+
+
+ The higher the absolute value of the influence of a training sample
+ on a test sample, the more influential it is for the chosen test sample, model
+ and data loaders. The sign of the influence determines whether it is
+ useful (positive) or harmful (negative).
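+
+As a reference for the sign convention above, here is a sketch of the quantity
+being approximated. It assumes the up-weighting formulation of Koh and Liang
+(2017), with the sign chosen so that positive means helpful, and it assumes that
+`hessian_regularization` enters as an added multiple of the identity. The symbols
+$\hat{\theta}$ (fitted parameters), $L$ (loss), $H$ (Hessian of the training
+loss at $\hat{\theta}$) and $\lambda$ (regularization strength) are notation
+introduced here for illustration, not taken from the library:
+
+$$
+\mathcal{I}(z_i, z_{\text{test}}) =
+\nabla_\theta L(z_{\text{test}}, \hat{\theta})^\top
+(H + \lambda I)^{-1}
+\nabla_\theta L(z_i, \hat{\theta})
+$$
+
+Up-weighting the training point $z_i$ by a small $\varepsilon$ moves the
+parameters by approximately
+$-\varepsilon (H + \lambda I)^{-1} \nabla_\theta L(z_i, \hat{\theta})$,
+so the test loss changes by about
+$-\varepsilon \, \mathcal{I}(z_i, z_{\text{test}})$. Under these assumptions
+a positive influence lowers the test loss and a negative one raises it.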
+
+> **Note** pyDVL currently only supports PyTorch for Influence Functions.
+> We are planning to add support for JAX and perhaps TensorFlow or even Keras.
+
+## Data Valuation
+
+The steps required to compute data values for your samples are:
+
+1. Import the necessary packages (the exact packages depend on your specific use case).
+
+ ```python
+ import matplotlib.pyplot as plt
+ from sklearn.datasets import load_breast_cancer
+ from sklearn.linear_model import LogisticRegression
+ from pydvl.utils import Dataset, Scorer, Utility
+ from pydvl.value import (
+     compute_shapley_values,
+     ShapleyMode,
+     MaxUpdates,
+ )
+ ```
+
+2. Create a `Dataset` object with your train and test splits.
+
+ ```python
+ data = Dataset.from_sklearn(
+     load_breast_cancer(),
+     train_size=10,
+     stratify_by_target=True,
+     random_state=16,
+ )
+ ```
+
+3. Create an instance of a `SupervisedModel` (basically any sklearn-compatible
+ predictor).
+
+ ```python
+ model = LogisticRegression()
+ ```
+
+4. Create a `Utility` object to wrap the Dataset, the model and a scoring
+ function.
+
+ ```python
+ u = Utility(
+     model,
+     data,
+     Scorer("accuracy", default=0.0)
+ )
+ ```
+
+5. Use one of the methods defined in the library to compute the values.
+ In our example, we will use *Permutation Montecarlo Shapley*,
+ an approximate method for computing Data Shapley values.
+
+ ```python
+ values = compute_shapley_values(
+     u,
+     mode=ShapleyMode.PermutationMontecarlo,
+     done=MaxUpdates(100),
+     seed=16,
+     progress=True
+ )
+ ```
+ The result is an object of type `ValuationResult` that contains
+ the indices and their values as well as other attributes.
+
+ The higher the value for an index, the more important it is for the chosen
+ model, dataset and scorer.
+
+6. (Optional) Convert the valuation result to a dataframe to analyze and visualize the values.
+
+ ```python
+ df = values.to_dataframe(column="data_value")
+ ```
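+
+For reference, the quantity estimated in step 5 can be written as follows.
+This is the standard Data Shapley definition rather than pyDVL-specific
+notation: $D$ is the training set of size $n$, $u$ is the utility, $\Pi(D)$ is
+the set of permutations of $D$, and $\sigma_{:i}$ is the set of points
+preceding $i$ in a permutation $\sigma$ (all symbols are introduced here for
+illustration).
+
+$$
+v(i) = \frac{1}{n} \sum_{S \subseteq D \setminus \{i\}}
+\binom{n-1}{|S|}^{-1} \left[ u(S \cup \{i\}) - u(S) \right]
+= \frac{1}{n!} \sum_{\sigma \in \Pi(D)}
+\left[ u(\sigma_{:i} \cup \{i\}) - u(\sigma_{:i}) \right]
+$$
+
+Permutation Monte Carlo replaces the sum over all $n!$ permutations with an
+average over sampled permutations, so each sampled permutation yields one
+marginal contribution for every training point.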
+
+# Contributing
+
+Please open new issues for bugs, feature requests and extensions. You can read
+about the structure of the project, the toolchain and workflow in the [guide for
+contributions](CONTRIBUTING.md).
+
+# Papers
+
+We currently implement methods from the following papers:
+
+## Data Valuation
- Castro, Javier, Daniel Gómez, and Juan Tejada. [Polynomial Calculation of the
Shapley Value Based on Sampling](https://doi.org/10.1016/j.cor.2008.04.004).
@@ -80,8 +320,7 @@ methods from the following papers:
Thirty-Sixth Conference on Neural Information Processing Systems (NeurIPS).
New Orleans, Louisiana, USA, 2022.
-Influence Functions compute the effect that single points have on an estimator /
-model. We implement methods from the following papers:
+## Influence Functions
- Koh, Pang Wei, and Percy Liang. [Understanding Black-Box Predictions via
Influence Functions](http://proceedings.mlr.press/v70/koh17a.html). In
@@ -94,141 +333,7 @@ model. We implement methods from the following papers:
[Scaling Up Influence Functions](http://arxiv.org/abs/2112.03052).
In Proceedings of the AAAI-22. arXiv, 2021.
-# Installation
-
-To install the latest release use:
-
-```shell
-$ pip install pyDVL
-```
-
-You can also install the latest development version from
-[TestPyPI](https://test.pypi.org/project/pyDVL/):
-
-```shell
-pip install pyDVL --index-url https://test.pypi.org/simple/
-```
-
-For more instructions and information refer to [Installing pyDVL
-](https://pydvl.org/stable/getting-started/installation/) in the
-documentation.
-
-# Usage
-
-### Influence Functions
-
-For influence computation, follow these steps:
-
-1. Instantiate an `InfluenceFunctionModel`
-2. Fit the influence model to the training data.
-3. For small input data call influence method on the fitted instance. For larger data, wrap the model into a
- calculator and call methods on the calculator.
-
-```python
-import torch
-from torch import nn
-from torch.utils.data import DataLoader, TensorDataset
-
-from pydvl.influence.torch import DirectInfluence
-from pydvl.influence.torch.util import TorchCatAggregator, TorchNumpyConverter
-from pydvl.influence import SequentialInfluenceCalculator
-
-nn_architecture = nn.Sequential(
- nn.Conv2d(in_channels=5, out_channels=3, kernel_size=3),
- nn.Flatten(),
- nn.Linear(27, 3),
-)
-loss = nn.MSELoss()
-
-
-input_dim = (5, 5, 5)
-output_dim = 3
-train_x = torch.rand((10, *input_dim))
-train_y = torch.rand((10, output_dim))
-test_x = torch.rand((5, *input_dim))
-test_y = torch.rand((5, output_dim))
-
-train_data_loader = DataLoader(TensorDataset(train_x, train_y), batch_size=2)
-test_data_loader = DataLoader(TensorDataset(test_x, test_y), batch_size=1)
-
-influence_function_model = DirectInfluence(nn_architecture, loss, hessian_regularization=0.01)
-influence_function_model = influence_function_model.fit(train_data_loader)
-
-# For small data, directly call methods on the influence function model
-in_memory_influences = influence_function_model.influences(test_x, test_y, train_x, train_y)
-
-# For larger data, wrap the influence function model into a calculator
-sequential_influence_calculator = SequentialInfluenceCalculator(influence_function_model)
-
-# Lazy object providing arrays batch-wise in a sequential manner
-lazy_influences = sequential_influence_calculator.influences(test_data_loader, train_data_loader)
-
-# Trigger computation and pull results to memory
-influences = lazy_influences.compute(tensor_aggregator=TorchCatAggregator())
-
-# Trigger computation and write results batch-wise to disk
-lazy_influences.to_zarr("influences_result", TorchNumpyConverter())
-```
-
-
-### Shapley Values
-The steps required to compute values for your samples are:
-
-1. Create a `Dataset` object with your train and test splits.
-2. Create an instance of a `SupervisedModel` (basically any sklearn compatible
- predictor)
-3. Create a `Utility` object to wrap the Dataset, the model and a scoring
- function.
-4. Use one of the methods defined in the library to compute the values.
-
-This is how it looks for *Truncated Montecarlo Shapley*, an efficient method for
-Data Shapley values:
-
-```python
-from sklearn.datasets import load_breast_cancer
-from sklearn.linear_model import LogisticRegression
-from pydvl.value import *
-
-data = Dataset.from_sklearn(load_breast_cancer(), train_size=0.7)
-model = LogisticRegression()
-u = Utility(model, data, Scorer("accuracy", default=0.0))
-values = compute_shapley_values(
- u,
- mode=ShapleyMode.TruncatedMontecarlo,
- done=MaxUpdates(100) | AbsoluteStandardError(threshold=0.01),
- truncation=RelativeTruncation(u, rtol=0.01),
-)
-```
-
-For more instructions and information refer to [Getting
-Started](https://pydvl.org/stable/getting-started/first-steps/) in
-the documentation. We provide several examples for data valuation
-(e.g. [Shapley Data Valuation](https://pydvl.org/stable/examples/shapley_basic_spotify/))
-and for influence functions
-(e.g. [Influence Functions for Neural Networks](https://pydvl.org/stable/examples/influence_imagenet/))
-with details on the algorithms and their applications.
-
-## Caching
-
-pyDVL offers the possibility to cache certain results and
-speed up computation. It uses [Memcached](https://memcached.org/) For that.
-
-You can run it either locally or, using
-[Docker](https://www.docker.com/):
-
-```shell
-docker container run --rm -p 11211:11211 --name pydvl-cache -d memcached:latest
-```
-
-You can read more in the
-[documentation](https://pydvl.org/stable/getting-started/first-steps/#caching).
-
-# Contributing
-
-Please open new issues for bugs, feature requests and extensions. You can read
-about the structure of the project, the toolchain and workflow in the [guide for
-contributions](CONTRIBUTING.md).
-
+
# License
pyDVL is distributed under
diff --git a/docs/assets/influence_functions_example.png b/docs/assets/influence_functions_example.png
new file mode 100644
index 000000000..94f804e9e
Binary files /dev/null and b/docs/assets/influence_functions_example.png differ
diff --git a/src/pydvl/reporting/plots.py b/src/pydvl/reporting/plots.py
index 4e8e5afa5..7c0f19b73 100644
--- a/src/pydvl/reporting/plots.py
+++ b/src/pydvl/reporting/plots.py
@@ -270,6 +270,26 @@ def plot_shapley(
return ax
+def plot_influence_distribution(
+    influences: NDArray[np.float_], index: int, title_extra: str = ""
+) -> plt.Axes:
+    """Plots the histogram of the influence that all samples in the training set
+    have over a single test sample.
+
+    Args:
+        influences: array of influences (training samples x test samples)
+        index: Index of the test sample for which the influences
+            will be plotted.
+        title_extra: Additional text that will be appended to the title.
+
+    Returns:
+        The axes on which the histogram was drawn.
+    """
+    _, ax = plt.subplots()
+    ax.hist(influences[:, index], alpha=0.7)
+    ax.set_xlabel("Influence values")
+    ax.set_ylabel("Number of samples")
+    ax.set_title(f"Distribution of influences {title_extra}")
+    return ax
+
+
def plot_influence_distribution_by_label(
influences: NDArray[np.float_], labels: NDArray[np.float_], title_extra: str = ""
):
@@ -279,7 +299,7 @@ def plot_influence_distribution_by_label(
Args:
influences: array of influences (training samples x test samples)
labels: labels for the training set.
- title_extra:
+ title_extra: Additional text that will be appended to the title.
"""
_, ax = plt.subplots()
unique_labels = np.unique(labels)
@@ -287,6 +307,6 @@ def plot_influence_distribution_by_label(
ax.hist(influences[labels == label], label=label, alpha=0.7)
ax.set_xlabel("Influence values")
ax.set_ylabel("Number of samples")
- ax.set_title(f"Distribution of influences " + title_extra)
+ ax.set_title(f"Distribution of influences {title_extra}")
ax.legend()
plt.show()
diff --git a/src/pydvl/value/shapley/common.py b/src/pydvl/value/shapley/common.py
index c4d5db13a..eda884e6e 100644
--- a/src/pydvl/value/shapley/common.py
+++ b/src/pydvl/value/shapley/common.py
@@ -110,7 +110,13 @@ def compute_shapley_values(
):
truncation = kwargs.pop("truncation", NoTruncation())
return permutation_montecarlo_shapley( # type: ignore
- u=u, done=done, truncation=truncation, n_jobs=n_jobs, seed=seed, **kwargs
+ u=u,
+ done=done,
+ truncation=truncation,
+ n_jobs=n_jobs,
+ seed=seed,
+ progress=progress,
+ **kwargs,
)
elif mode == ShapleyMode.CombinatorialMontecarlo:
return combinatorial_montecarlo_shapley(