Feature/499 datainf #582

Merged
44 commits merged on Jun 6, 2024
Commits
93b6ecd
Add new module influence.types containing generic types to build up i…
schroedk May 21, 2024
db0fa07
Refactor influence.base_influence_function_model:
schroedk May 21, 2024
23fed56
Merge remote-tracking branch 'origin/develop' into feature/499-datainf
schroedk May 21, 2024
c4b7d3a
Fix linting issues in influence.types
schroedk May 21, 2024
c189848
Add class ModelInfoMixin to influence.torch.util
schroedk May 21, 2024
bda3504
Add class TorchBatch to influence.torch.util
schroedk May 21, 2024
ddec527
Add classes TorchChunkAverageAggregator and TorchPointAverageAggregat…
schroedk May 21, 2024
7472840
Add functions rank_one_mvp and inverse_rank_one_update to influence.t…
schroedk May 21, 2024
e32ae46
Add LossType (annotation), enum BlockMode and class ModelParameterDic…
schroedk May 21, 2024
d804e6b
Add dtype property to LowRankProductRepresentation
schroedk May 21, 2024
993bcfa
Add new subpackage influence.torch.operator with submodule operator.g…
schroedk May 21, 2024
3292253
Add operator subpackages batch_operation, add tests for batch_operat…
schroedk May 21, 2024
ec57282
Add submodules operator.base, operator.bilinear_form and operator.solve
schroedk May 21, 2024
5e37a4b
Add tests for influence.torch.util
schroedk May 21, 2024
29eb69a
Add implementation of generic types to influence.torch.influence_func…
schroedk May 21, 2024
07f80bd
Fix linting issues in operator subpackage
schroedk May 21, 2024
3b7289c
Fix type-checking issues
schroedk May 21, 2024
531aacc
Add and improve docstrings
schroedk May 22, 2024
1e5f18b
Simplify package structure:
schroedk May 22, 2024
964f566
Fix linting issues
schroedk May 22, 2024
3fa9a95
Move test modules
schroedk May 23, 2024
a1d4c9a
Add module description to pydvl.influence.types
schroedk May 23, 2024
545308a
Update reference bibtex file
schroedk May 24, 2024
6337a5f
Refactor influence package:
schroedk May 24, 2024
f007507
Add inverse rank-1 update approximation for dictionary input
schroedk May 27, 2024
ed5f14d
Refactor new structure:
schroedk May 29, 2024
f4f5931
Move functions from util to static methods, increase tolerance for test
schroedk May 29, 2024
01791d5
Simplify operator interface
schroedk May 29, 2024
79f7569
Simplify tensor summation
schroedk May 29, 2024
5bab8e4
Simplify Hessian operations
schroedk May 29, 2024
ae13306
Fix typo, import BlockMode in influence.torch
schroedk May 29, 2024
6cf70e5
Add documentation for InverseHarmonicMeanInfluence
schroedk May 29, 2024
7564662
Update CHANGELOG.md
schroedk May 29, 2024
983b379
Add test cases, increase tolerance
schroedk May 30, 2024
ea89feb
Fix issues
schroedk Jun 1, 2024
d21da08
Add property to base class
schroedk Jun 1, 2024
9c5ddea
Simplify gradient computation for torch, remove ABC for now, do not e…
schroedk Jun 3, 2024
8498524
Simplify gradient computation
schroedk Jun 3, 2024
5b7bbb9
Fix issue in building factors from factors dict
schroedk Jun 5, 2024
a8caf13
Fix issue in types module, due to OrderedDict from typing and collect…
schroedk Jun 5, 2024
a69ae3d
Refactor ModelParameterDictBuilder to be more readable
schroedk Jun 5, 2024
23442e4
Refactor tests in test_batch_operation to be more readable
schroedk Jun 5, 2024
f25f289
Remove abstract methods, which compute gradient dictionary outputs fr…
schroedk Jun 5, 2024
5b34406
Fix fixture usage in test
schroedk Jun 5, 2024
13 changes: 13 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,18 @@
# Changelog

## Unreleased

### Added

- New method `InverseHarmonicMeanInfluence`, an implementation of the paper
  `DataInf: Efficiently Estimating Data Influence in LoRA-tuned LLMs and
  Diffusion Models`
  [PR #582](https://github.com/aai-institute/pyDVL/pull/582)
- Add new backend implementations for influence computation
  to account for block-diagonal approximations
  [PR #582](https://github.com/aai-institute/pyDVL/pull/582)


## 0.9.2 - 🏗 Bug fixes, logging improvement

### Added
20 changes: 17 additions & 3 deletions docs/assets/pydvl.bib
@@ -122,7 +122,8 @@ @inproceedings{george_fast_2018
publisher = {Curran Associates, Inc.},
url = {https://proceedings.neurips.cc/paper/2018/hash/48000647b315f6f00f913caa757a70b3-Abstract.html},
urldate = {2024-01-12},
abstract = {Optimization algorithms that leverage gradient covariance information, such as variants of natural gradient descent (Amari, 1998), offer the prospect of yielding more effective descent directions. For models with many parameters, the covari- ance matrix they are based on becomes gigantic, making them inapplicable in their original form. This has motivated research into both simple diagonal approxima- tions and more sophisticated factored approximations such as KFAC (Heskes, 2000; Martens \& Grosse, 2015; Grosse \& Martens, 2016). In the present work we draw inspiration from both to propose a novel approximation that is provably better than KFAC and amendable to cheap partial updates. It consists in tracking a diagonal variance, not in parameter coordinates, but in a Kronecker-factored eigenbasis, in which the diagonal approximation is likely to be more effective. Experiments show improvements over KFAC in optimization speed for several deep network architectures.},
keywords = {notion}
}

@inproceedings{ghorbani_data_2019,
@@ -175,7 +176,8 @@ @inproceedings{hataya_nystrom_2023
urldate = {2024-02-26},
abstract = {The essential difficulty of gradient-based bilevel optimization using implicit differentiation is to estimate the inverse Hessian vector product with respect to neural network parameters. This paper proposes to tackle this problem by the Nyström method and the Woodbury matrix identity, exploiting the low-rankness of the Hessian. Compared to existing methods using iterative approximation, such as conjugate gradient and the Neumann series approximation, the proposed method avoids numerical instability and can be efficiently computed in matrix operations without iterations. As a result, the proposed method works stably in various tasks and is faster than iterative approximations. Throughout experiments including large-scale hyperparameter optimization and meta learning, we demonstrate that the Nyström method consistently achieves comparable or even superior performance to other approaches. The source code is available from https://github.com/moskomule/hypergrad.},
eventtitle = {International {{Conference}} on {{Artificial Intelligence}} and {{Statistics}}},
langid = {english},
keywords = {notion}
}

@article{ji_breakdownfree_2017,
@@ -292,6 +294,18 @@ @inproceedings{kwon_beta_2022
keywords = {notion}
}

@inproceedings{kwon_datainf_2023,
title = {{{DataInf}}: {{Efficiently Estimating Data Influence}} in {{LoRA-tuned LLMs}} and {{Diffusion Models}}},
shorttitle = {{{DataInf}}},
author = {Kwon, Yongchan and Wu, Eric and Wu, Kevin and Zou, James},
date = {2023-10-13},
doi = {10.48550/arXiv.2310.00902},
url = {https://openreview.net/forum?id=9m02ib92Wz},
urldate = {2023-10-27},
abstract = {Quantifying the impact of training data points is crucial for understanding the outputs of machine learning models and for improving the transparency of the AI pipeline. The influence function is a principled and popular data attribution method, but its computational cost often makes it challenging to use. This issue becomes more pronounced in the setting of large language models and text-to-image models. In this work, we propose DataInf, an efficient influence approximation method that is practical for large-scale generative AI models. Leveraging an easy-to-compute closed-form expression, DataInf outperforms existing influence computation algorithms in terms of computational and memory efficiency. Our theoretical analysis shows that DataInf is particularly well-suited for parameter-efficient fine-tuning techniques such as LoRA. Through systematic empirical evaluations, we show that DataInf accurately approximates influence scores and is orders of magnitude faster than existing methods. In applications to RoBERTa-large, Llama-2-13B-chat, and stable-diffusion-v1.5 models, DataInf effectively identifies the most influential fine-tuning examples better than other approximate influence scores. Moreover, it can help to identify which data points are mislabeled.},
eventtitle = {The {{Twelfth International Conference}} on {{Learning Representations}}}
}

@inproceedings{kwon_dataoob_2023,
title = {Data-{{OOB}}: {{Out-of-bag Estimate}} as a {{Simple}} and {{Efficient Data Value}}},
shorttitle = {Data-{{OOB}}},
@@ -303,7 +317,7 @@ @inproceedings{kwon_dataoob_2023
issn = {2640-3498},
url = {https://proceedings.mlr.press/v202/kwon23e.html},
urldate = {2023-09-06},
abstract = {Data valuation is a powerful framework for providing statistical insights into which data are beneficial or detrimental to model training. Many Shapley-based data valuation methods have shown promising results in various downstream tasks, however, they are well known to be computationally challenging as it requires training a large number of models. As a result, it has been recognized as infeasible to apply to large datasets. To address this issue, we propose Data-OOB, a new data valuation method for a bagging model that utilizes the out-of-bag estimate. The proposed method is computationally efficient and can scale to millions of data by reusing trained weak learners. Specifically, Data-OOB takes less than 2.25 hours on a single CPU processor when there are \$10\textasciicircum 6\$ samples to evaluate and the input dimension is 100. Furthermore, Data-OOB has solid theoretical interpretations in that it identifies the same important data point as the infinitesimal jackknife influence function when two different points are compared. We conduct comprehensive experiments using 12 classification datasets, each with thousands of sample sizes. We demonstrate that the proposed method significantly outperforms existing state-of-the-art data valuation methods in identifying mislabeled data and finding a set of helpful (or harmful) data points, highlighting the potential for applying data values in real-world applications.},
eventtitle = {International {{Conference}} on {{Machine Learning}}},
langid = {english},
keywords = {notion}
125 changes: 125 additions & 0 deletions docs/influence/influence_function_model.md
@@ -207,7 +207,132 @@ if_model = NystroemSketchInfluence(
if_model.fit(train_loader)
```

### Inverse Harmonic Mean

This implementation replaces the inverse Hessian matrix in the influence
computation with an approximation of the inverse Gauss-Newton vector product.
The method was proposed in [@kwon_datainf_2023].

The approximation method comprises
the following steps:

1. Replace the Hessian $H(\theta)$ with the Gauss-Newton matrix
$G(\theta)$:

\begin{equation*}
G(\theta)=n^{-1} \sum_{i=1}^n \nabla_{\theta}\ell_i\nabla_{\theta}\ell_i^T
\end{equation*}

which results in

\begin{equation*}
\mathcal{I}(z_{t}, z) \approx \nabla_{\theta} \ell(z_{t}, \theta)^T
(G(\theta) + \lambda I_d)^{-1}
\nabla_{\theta} \ell(z, \theta)
\end{equation*}

2. Simplify the problem by breaking it down into a block-diagonal structure,
where each block $G_{l}(\theta)$ corresponds to the parameters $\theta_l$
of the $l$-th parameter group (e.g. a layer):

\begin{equation*}
G_{l}(\theta) = n^{-1} \sum_{i=1}^n \nabla_{\theta_l} \ell_i
\nabla_{\theta_l} \ell_i^{T} + \lambda_l I_{d_l},
\end{equation*}

which leads to

\begin{equation*}
\mathcal{I}(z_{t}, z) \approx \nabla_{\theta} \ell(z_{t}, \theta)^T
\operatorname{diag}(G_1(\theta)^{-1},
\dots, G_L(\theta)^{-1})
\nabla_{\theta} \ell(z, \theta)
\end{equation*}

3. Substitute the arithmetic mean of the rank-$1$ updates in
$G_l(\theta)$ with the inverse harmonic mean $R_l(\theta)$ of the rank-$1$
updates, i.e. the average of the inverted updates:

\begin{align*}
G_l(\theta)^{-1} &= \left( n^{-1} \sum_{i=1}^n \nabla_{\theta_l}
\ell(z_i, \theta) \nabla_{\theta_l}
\ell(z_i, \theta)^{T} +
\lambda_l I_{d_l}\right)^{-1} \\\
R_{l}(\theta)&= n^{-1} \sum_{i=1}^n \left( \nabla_{\theta_l}
\ell(z_i, \theta) \nabla_{\theta_l} \ell(z_i, \theta)^{T}
+ \lambda_l I_{d_l} \right)^{-1}
\end{align*}

4. Use the
<a href="https://en.wikipedia.org/wiki/Sherman%E2%80%93Morrison_formula">
Sherman–Morrison formula
</a>
to get an explicit representation of the inverses in the definition of
$R_l(\theta):$

\begin{align*}
R_l(\theta) &= n^{-1} \sum_{i=1}^n \left( \nabla_{\theta_l} \ell_i
\nabla_{\theta_l} \ell_i^{T}
+ \lambda_l I_{d_l}\right)^{-1} \\\
&= n^{-1} \sum_{i=1}^n \lambda_l^{-1} \left(I_{d_l}
- \frac{\nabla_{\theta_l} \ell_i \nabla_{\theta_l}
\ell_i^{T}}{\lambda_l
+ \\|\nabla_{\theta_l} \ell_i\\|_2^2}\right)
,
\end{align*}

which means that applying $R_l(\theta)$ boils down to computing $n$
rank-$1$ updates, as illustrated in the sketch below.
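
A minimal numerical sketch (not part of the pyDVL API) of the Sherman–Morrison
expression from step 4: the vector `g` stands in for a single per-sample
gradient block and `lam` for the block regularization, both made up for
illustration.

```python
import torch

torch.manual_seed(0)
d, lam = 5, 0.1
g = torch.randn(d)  # stand-in for one per-sample gradient block

# Regularized rank-1 update and its explicit inverse
A = torch.outer(g, g) + lam * torch.eye(d)
explicit_inverse = torch.linalg.inv(A)

# Closed form from the Sherman–Morrison formula:
# (g g^T + lam I)^{-1} = (I - g g^T / (lam + ||g||^2)) / lam
closed_form = (torch.eye(d) - torch.outer(g, g) / (lam + g.dot(g))) / lam

assert torch.allclose(explicit_inverse, closed_form, atol=1e-5)
# R_l(theta) is the average of n such inverses; applying it to a vector
# therefore only requires rank-1 matrix-vector products.
```

The class `InverseHarmonicMeanInfluence` implements this scheme: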

```python
from pydvl.influence.torch import InverseHarmonicMeanInfluence, BlockMode

if_model = InverseHarmonicMeanInfluence(
model,
loss,
regularization=1e-1,
block_structure=BlockMode.LAYER_WISE
)
if_model.fit(train_loader)
```

!!! Info
This implementation is capable of using a block-matrix approximation. The
blocking structure can be specified via the `block_structure` parameter,
which can either be a
[BlockMode][pydvl.influence.torch.util.BlockMode] enum (providing
layer-wise or parameter-wise blocking) or a custom block structure defined
by an ordered dictionary that maps block identifiers (arbitrary strings)
to lists of the parameter names contained in each block.
```python
from collections import OrderedDict

block_structure = OrderedDict(
(
("custom_block1", ["0.weight", "1.bias"]),
("custom_block2", ["1.weight", "0.bias"]),
)
)
```
If you would like to apply a block-specific regularization, you can provide a
dictionary with the block names as keys and the regularization values as values.
In this case, the specification must be complete, i.e. every block must have
a positive regularization value.
```python
regularization = {
"custom_block1": 0.1,
"custom_block2": 0.2,
}
```
Accordingly, if you choose a layer-wise or parameter-wise structure
(by providing `BlockMode.LAYER_WISE` or `BlockMode.PARAMETER_WISE` for
`block_structure`), the keys must be the layer names or parameter names,
respectively.
You can retrieve the block-wise influence information from the methods
with the suffix `_by_block`. By default, `block_structure` is set to
`BlockMode.FULL`, in which case these methods return a dictionary
with the empty string as its only key.
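
As a hypothetical end-to-end sketch, the two snippets above can be combined
as follows. The variables `model`, `loss`, `train_loader` and the test/train
tensors, as well as the method name `influences_by_block`, are assumptions
based on the description above, not verified API.
```python
from collections import OrderedDict

from pydvl.influence.torch import InverseHarmonicMeanInfluence

# Custom blocks and a complete, matching per-block regularization.
block_structure = OrderedDict(
    (
        ("custom_block1", ["0.weight", "1.bias"]),
        ("custom_block2", ["1.weight", "0.bias"]),
    )
)
regularization = {"custom_block1": 0.1, "custom_block2": 0.2}

if_model = InverseHarmonicMeanInfluence(
    model,
    loss,
    regularization=regularization,
    block_structure=block_structure,
)
if_model.fit(train_loader)

# Assumed method name with the `_by_block` suffix: returns a dictionary
# keyed by the block identifiers defined above.
influences = if_model.influences_by_block(x_test, y_test, x_train, y_train)
```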

These implementations represent the calculation logic on in-memory tensors.
To scale up to large collections of data, we map these influence function
models over such collections. For a detailed discussion see the
documentation page [Scaling Computation](scaling_computation.md).


3 changes: 1 addition & 2 deletions docs/influence/scaling_computation.md
@@ -24,8 +24,7 @@ into memory.
```python
from pydvl.influence import SequentialInfluenceCalculator
from pydvl.influence.torch.util import (
TorchNumpyConverter, NestedTorchCatAggregator,
)
from pydvl.influence.torch import CgInfluence

2 changes: 1 addition & 1 deletion src/pydvl/influence/__init__.py
@@ -10,9 +10,9 @@
probably change.

"""
from .influence_calculator import (
DaskInfluenceCalculator,
DisableClientSingleThreadCheck,
SequentialInfluenceCalculator,
)
from .types import InfluenceMode
2 changes: 1 addition & 1 deletion src/pydvl/influence/array.py
@@ -28,7 +28,7 @@
from zarr.storage import StoreLike

from ..utils import log_duration
from .types import TensorType


class NumpyConverter(Generic[TensorType], ABC):