diff --git a/CHANGELOG.md b/CHANGELOG.md index 46316b2d2..136136bba 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -14,6 +14,8 @@ ### Changed +- Improve README and better explain the examples + [PR #465](https://github.com/aai-institute/pyDVL/pull/465) - Simplify and improve tests, add CodeCov code coverage [PR #429](https://github.com/aai-institute/pyDVL/pull/429) - diff --git a/README.md b/README.md index 2a8871307..9497bbd76 100644 --- a/README.md +++ b/README.md @@ -30,11 +30,251 @@

-pyDVL collects algorithms for Data Valuation and Influence Function computation. +**pyDVL** collects algorithms for **Data Valuation** and **Influence Function** computation. + +**Data Valuation** for machine learning is the task of assigning a scalar +to each element of a training set which reflects its contribution to the final +performance or outcome of some model trained on it. Some concepts of +value depend on a specific model of interest, while others are model-agnostic. +pyDVL focuses on model-dependent methods. + +
+ [Figure: Comparison of different data valuation methods on best sample removal.]
+
+ +The **Influence Function** is an infinitesimal measure of the effect that single +training points have over the parameters of a model, or any function thereof. +In particular, in machine learning, influence functions are also used to compute the effect +of training samples over individual test points. + +
+ [Figure: Influences of input points with corrupted data. Highlighted points have flipped labels.]
+
-Data Valuation is the task of estimating the intrinsic value of a data point -wrt. the training set, the model and a scoring function. We currently implement -methods from the following papers: +# Installation + +To install the latest release, use: + +```shell +$ pip install pyDVL +``` + +You can also install the latest development version from +[TestPyPI](https://test.pypi.org/project/pyDVL/): + +```shell +pip install pyDVL --index-url https://test.pypi.org/simple/ +``` + +pyDVL also has extra dependencies for certain functionalities (e.g. influence functions). + +For more instructions and information refer to [Installing pyDVL +](https://pydvl.org/stable/getting-started/installation/) in the +documentation. + +# Usage + +In the following subsections, we will showcase the usage of pyDVL +for Data Valuation and Influence Functions using simple examples. + +For more instructions and information refer to [Getting +Started](https://pydvl.org/stable/getting-started/first-steps/) in +the documentation. +We provide several examples for data valuation +(e.g. [Shapley Data Valuation](https://pydvl.org/stable/examples/shapley_basic_spotify/)) +and for influence functions +(e.g. [Influence Functions for Neural Networks](https://pydvl.org/stable/examples/influence_imagenet/)) +with details on the algorithms and their applications. + +## Influence Functions + +For influence computation, follow these steps: + +1. Import the necessary packages (the exact packages depend on your specific use case). + + ```python + import torch + from torch import nn + from torch.utils.data import DataLoader, TensorDataset + + from pydvl.influence.torch import DirectInfluence + from pydvl.influence.torch.util import TorchCatAggregator, TorchNumpyConverter + from pydvl.influence import SequentialInfluenceCalculator + ``` + +2. Create PyTorch data loaders for your train and test splits. 
+ + ```python + input_dim = (5, 5, 5) + output_dim = 3 + train_x = torch.rand((10, *input_dim)) + train_y = torch.rand((10, output_dim)) + test_x = torch.rand((5, *input_dim)) + test_y = torch.rand((5, output_dim)) + + train_data_loader = DataLoader(TensorDataset(train_x, train_y), batch_size=2) + test_data_loader = DataLoader(TensorDataset(test_x, test_y), batch_size=1) + ``` + +3. Instantiate your neural network model. + + ```python + nn_architecture = nn.Sequential( + nn.Conv2d(in_channels=5, out_channels=3, kernel_size=3), + nn.Flatten(), + nn.Linear(27, 3), + ) + ``` + +4. Define your loss: + + ```python + loss = nn.MSELoss() + ``` + +5. Instantiate an `InfluenceFunctionModel` and fit it to the training data. + + ```python + influence_function_model = DirectInfluence(nn_architecture, loss, hessian_regularization=0.01) + influence_function_model = influence_function_model.fit(train_data_loader) + ``` + +6. For small input data, call the `influences` method on the fitted instance. + + ```python + influences = influence_function_model.influences(test_x, test_y, train_x, train_y) + ``` + The result is a tensor of shape `(training samples x test samples)` + that contains at index `(i, j)` the influence of training sample `i` on + test sample `j`. + +7. For larger data, wrap the model into a + calculator and call methods on the calculator. 
+ ```python + sequential_influence_calculator = SequentialInfluenceCalculator(influence_function_model) + + # Lazy object providing arrays batch-wise in a sequential manner + lazy_influences = sequential_influence_calculator.influences(test_data_loader, train_data_loader) + + # Trigger computation and pull results to memory + influences = lazy_influences.compute(tensor_aggregator=TorchCatAggregator()) + + # Trigger computation and write results batch-wise to disk + lazy_influences.to_zarr("influences_result", TorchNumpyConverter()) + ``` + + + The higher the absolute value of the influence of a training sample + on a test sample, the more influential it is for the chosen test sample, model + and data loaders. The sign of the influence determines whether it is + useful (positive) or harmful (negative). + +> **Note** pyDVL currently only supports PyTorch for Influence Functions. +> We are planning to add support for JAX and perhaps TensorFlow or even Keras. + +## Data Valuation + +The steps required to compute data values for your samples are: + +1. Import the necessary packages (the exact packages depend on your specific use case). + + ```python + import matplotlib.pyplot as plt + from sklearn.datasets import load_breast_cancer + from sklearn.linear_model import LogisticRegression + from pydvl.utils import Dataset, Scorer, Utility + from pydvl.value import ( + compute_shapley_values, + ShapleyMode, + MaxUpdates, + ) + ``` + +2. Create a `Dataset` object with your train and test splits. + + ```python + data = Dataset.from_sklearn( + load_breast_cancer(), + train_size=10, + stratify_by_target=True, + random_state=16, + ) + ``` + +3. Create an instance of a `SupervisedModel` (basically any sklearn-compatible + predictor). + + ```python + model = LogisticRegression() + ``` + +4. Create a `Utility` object to wrap the Dataset, the model and a scoring + function. + + ```python + u = Utility( + model, + data, + Scorer("accuracy", default=0.0) + ) + ``` + +5. 
Use one of the methods defined in the library to compute the values. + In our example, we will use *Permutation Montecarlo Shapley*, + an approximate method for computing Data Shapley values. + + ```python + values = compute_shapley_values( + u, + mode=ShapleyMode.PermutationMontecarlo, + done=MaxUpdates(100), + seed=16, + progress=True + ) + ``` + The result is a variable of type `ValuationResult` that contains + the indices and their values as well as other attributes. + + The higher the value for an index, the more important it is for the chosen + model, dataset and scorer. + +6. (Optional) Convert the valuation result to a dataframe and analyze and visualize the values. + + ```python + df = values.to_dataframe(column="data_value") + ``` + +# Contributing + +Please open new issues for bugs, feature requests and extensions. You can read +about the structure of the project, the toolchain and workflow in the [guide for +contributions](CONTRIBUTING.md). + +# Papers + +We currently implement the following papers: + +## Data Valuation - Castro, Javier, Daniel Gómez, and Juan Tejada. [Polynomial Calculation of the Shapley Value Based on Sampling](https://doi.org/10.1016/j.cor.2008.04.004). @@ -80,8 +320,7 @@ methods from the following papers: Thirty-Sixth Conference on Neural Information Processing Systems (NeurIPS). New Orleans, Louisiana, USA, 2022. -Influence Functions compute the effect that single points have on an estimator / -model. We implement methods from the following papers: +## Influence Functions - Koh, Pang Wei, and Percy Liang. [Understanding Black-Box Predictions via Influence Functions](http://proceedings.mlr.press/v70/koh17a.html). In @@ -94,141 +333,7 @@ model. We implement methods from the following papers: [Scaling Up Influence Functions](http://arxiv.org/abs/2112.03052). In Proceedings of the AAAI-22. arXiv, 2021. 
-# Installation - -To install the latest release use: - -```shell -$ pip install pyDVL -``` - -You can also install the latest development version from -[TestPyPI](https://test.pypi.org/project/pyDVL/): - -```shell -pip install pyDVL --index-url https://test.pypi.org/simple/ -``` - -For more instructions and information refer to [Installing pyDVL -](https://pydvl.org/stable/getting-started/installation/) in the -documentation. - -# Usage - -### Influence Functions - -For influence computation, follow these steps: - -1. Instantiate an `InfluenceFunctionModel` -2. Fit the influence model to the training data. -3. For small input data call influence method on the fitted instance. For larger data, wrap the model into a - calculator and call methods on the calculator. - -```python -import torch -from torch import nn -from torch.utils.data import DataLoader, TensorDataset - -from pydvl.influence.torch import DirectInfluence -from pydvl.influence.torch.util import TorchCatAggregator, TorchNumpyConverter -from pydvl.influence import SequentialInfluenceCalculator - -nn_architecture = nn.Sequential( - nn.Conv2d(in_channels=5, out_channels=3, kernel_size=3), - nn.Flatten(), - nn.Linear(27, 3), -) -loss = nn.MSELoss() - - -input_dim = (5, 5, 5) -output_dim = 3 -train_x = torch.rand((10, *input_dim)) -train_y = torch.rand((10, output_dim)) -test_x = torch.rand((5, *input_dim)) -test_y = torch.rand((5, output_dim)) - -train_data_loader = DataLoader(TensorDataset(train_x, train_y), batch_size=2) -test_data_loader = DataLoader(TensorDataset(test_x, test_y), batch_size=1) - -influence_function_model = DirectInfluence(nn_architecture, loss, hessian_regularization=0.01) -influence_function_model = influence_function_model.fit(train_data_loader) - -# For small data, directly call methods on the influence function model -in_memory_influences = influence_function_model.influences(test_x, test_y, train_x, train_y) - -# For larger data, wrap the influence function model into a calculator 
-sequential_influence_calculator = SequentialInfluenceCalculator(influence_function_model) - -# Lazy object providing arrays batch-wise in a sequential manner -lazy_influences = sequential_influence_calculator.influences(test_data_loader, train_data_loader) - -# Trigger computation and pull results to memory -influences = lazy_influences.compute(tensor_aggregator=TorchCatAggregator()) - -# Trigger computation and write results batch-wise to disk -lazy_influences.to_zarr("influences_result", TorchNumpyConverter()) -``` - - -### Shapley Values -The steps required to compute values for your samples are: - -1. Create a `Dataset` object with your train and test splits. -2. Create an instance of a `SupervisedModel` (basically any sklearn compatible - predictor) -3. Create a `Utility` object to wrap the Dataset, the model and a scoring - function. -4. Use one of the methods defined in the library to compute the values. - -This is how it looks for *Truncated Montecarlo Shapley*, an efficient method for -Data Shapley values: - -```python -from sklearn.datasets import load_breast_cancer -from sklearn.linear_model import LogisticRegression -from pydvl.value import * - -data = Dataset.from_sklearn(load_breast_cancer(), train_size=0.7) -model = LogisticRegression() -u = Utility(model, data, Scorer("accuracy", default=0.0)) -values = compute_shapley_values( - u, - mode=ShapleyMode.TruncatedMontecarlo, - done=MaxUpdates(100) | AbsoluteStandardError(threshold=0.01), - truncation=RelativeTruncation(u, rtol=0.01), -) -``` - -For more instructions and information refer to [Getting -Started](https://pydvl.org/stable/getting-started/first-steps/) in -the documentation. We provide several examples for data valuation -(e.g. [Shapley Data Valuation](https://pydvl.org/stable/examples/shapley_basic_spotify/)) -and for influence functions -(e.g. 
[Influence Functions for Neural Networks](https://pydvl.org/stable/examples/influence_imagenet/)) -with details on the algorithms and their applications. - -## Caching - -pyDVL offers the possibility to cache certain results and -speed up computation. It uses [Memcached](https://memcached.org/) For that. - -You can run it either locally or, using -[Docker](https://www.docker.com/): - -```shell -docker container run --rm -p 11211:11211 --name pydvl-cache -d memcached:latest -``` - -You can read more in the -[documentation](https://pydvl.org/stable/getting-started/first-steps/#caching). - -# Contributing - -Please open new issues for bugs, feature requests and extensions. You can read -about the structure of the project, the toolchain and workflow in the [guide for -contributions](CONTRIBUTING.md). - + # License pyDVL is distributed under diff --git a/docs/assets/influence_functions_example.png b/docs/assets/influence_functions_example.png new file mode 100644 index 000000000..94f804e9e Binary files /dev/null and b/docs/assets/influence_functions_example.png differ diff --git a/src/pydvl/reporting/plots.py b/src/pydvl/reporting/plots.py index 4e8e5afa5..7c0f19b73 100644 --- a/src/pydvl/reporting/plots.py +++ b/src/pydvl/reporting/plots.py @@ -270,6 +270,26 @@ def plot_shapley( return ax +def plot_influence_distribution( + influences: NDArray[np.float_], index: int, title_extra: str = "" +) -> plt.Axes: + """Plots the histogram of the influence that all samples in the training set + have over a single sample index. + + Args: + influences: array of influences (training samples x test samples) + index: Index of the test sample for which the influences + will be plotted. + title_extra: Additional text that will be appended to the title. 
+ """ + _, ax = plt.subplots() + ax.hist(influences[:, index], alpha=0.7) + ax.set_xlabel("Influence values") + ax.set_ylabel("Number of samples") + ax.set_title(f"Distribution of influences {title_extra}") + return ax + + def plot_influence_distribution_by_label( influences: NDArray[np.float_], labels: NDArray[np.float_], title_extra: str = "" ): @@ -279,7 +299,7 @@ def plot_influence_distribution_by_label( Args: influences: array of influences (training samples x test samples) labels: labels for the training set. - title_extra: + title_extra: Additional text that will be appended to the title. """ _, ax = plt.subplots() unique_labels = np.unique(labels) @@ -287,6 +307,6 @@ def plot_influence_distribution_by_label( ax.hist(influences[labels == label], label=label, alpha=0.7) ax.set_xlabel("Influence values") ax.set_ylabel("Number of samples") - ax.set_title(f"Distribution of influences " + title_extra) + ax.set_title(f"Distribution of influences {title_extra}") ax.legend() plt.show() diff --git a/src/pydvl/value/shapley/common.py b/src/pydvl/value/shapley/common.py index c4d5db13a..eda884e6e 100644 --- a/src/pydvl/value/shapley/common.py +++ b/src/pydvl/value/shapley/common.py @@ -110,7 +110,13 @@ def compute_shapley_values( ): truncation = kwargs.pop("truncation", NoTruncation()) return permutation_montecarlo_shapley( # type: ignore - u=u, done=done, truncation=truncation, n_jobs=n_jobs, seed=seed, **kwargs + u=u, + done=done, + truncation=truncation, + n_jobs=n_jobs, + seed=seed, + progress=progress, + **kwargs, ) elif mode == ShapleyMode.CombinatorialMontecarlo: return combinatorial_montecarlo_shapley(