Feature/586 block diagonal direct #591

Merged
merged 10 commits into from
Jun 10, 2024
11 changes: 11 additions & 0 deletions CHANGELOG.md
@@ -11,7 +11,18 @@
- Add new backend implementations for influence computation
to account for block-diagonal approximations
[PR #582](https://github.com/aai-institute/pyDVL/pull/582)
- Extend `DirectInfluence` with block-diagonal and Gauss-Newton
approximation
[PR #591](https://github.com/aai-institute/pyDVL/pull/591)

## Changed

- **Breaking Changes**
- Rename parameter `hessian_regularization` of `DirectInfluence`
to `regularization` and change the type annotation to allow
for block-wise regularization parameters
[PR #591](https://github.com/aai-institute/pyDVL/pull/591)


## 0.9.2 - 🏗 Bug fixes, logging improvement

83 changes: 82 additions & 1 deletion docs/influence/index.md
@@ -238,7 +238,7 @@ from torch.utils.data import DataLoader
from pydvl.influence.torch import DirectInfluence

training_data_loader = DataLoader(...)
infl_model = DirectInfluence(model, loss, hessian_regularization=0.01)
infl_model = DirectInfluence(model, loss, regularization=0.01)
infl_model = infl_model.fit(training_data_loader)
```

@@ -249,6 +249,87 @@ not to corrupt the outcome too much, the parameter $\lambda$ should be as small
as possible while still allowing a reliable inversion of $H_{\hat{\theta}} +
\lambda \mathbb{I}$.
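
With a scalar regularization $\lambda$, the computed influence values then take
the form

$$\nabla_\theta L(z_{\text{test}}, \hat{\theta})^\top \
(H_{\hat{\theta}} + \lambda \mathbb{I})^{-1} \
\nabla_\theta L(z, \hat{\theta}),$$

analogous to the Gauss-Newton expression shown later in this section, with
$H_{\hat{\theta}} + \lambda \mathbb{I}$ in place of $G_{\hat{\theta}}$.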

### Block-diagonal approximation

This implementation can use a block-diagonal approximation: the full matrix
is approximated by a block-diagonal version, which reduces both time and
memory consumption.
The blocking structure is specified via the `block_structure` parameter,
which accepts either a
[BlockMode][pydvl.influence.torch.util.BlockMode] enum (providing
layer-wise or parameter-wise blocking) or a custom block structure defined
by an ordered dictionary whose keys are the block identifiers (arbitrary
strings) and whose values are lists of the parameter names contained in
each block.
```python
from collections import OrderedDict

from torch.utils.data import DataLoader
from pydvl.influence.torch import DirectInfluence, BlockMode, SecondOrderMode

training_data_loader = DataLoader(...)
# layer-wise block-diagonal approximation
infl_model = DirectInfluence(model, loss,
                             regularization=0.1,
                             block_structure=BlockMode.LAYER_WISE)

block_structure = OrderedDict((
    ("custom_block1", ["0.weight", "1.bias"]),
    ("custom_block2", ["1.weight", "0.bias"]),
))
# custom block-diagonal structure
infl_model = DirectInfluence(model, loss,
                             regularization=0.1,
                             block_structure=block_structure)
infl_model = infl_model.fit(training_data_loader)
```
If you would like to apply a block-specific regularization, you can provide a
dictionary with the block names as keys and the regularization values as values.
In this case, the specification must be complete, i.e. every block must have
a positive regularization value.
```python
regularization = {
    "custom_block1": 0.1,
    "custom_block2": 0.2,
}
infl_model = DirectInfluence(model, loss,
                             regularization=regularization,
                             block_structure=block_structure)
infl_model = infl_model.fit(training_data_loader)
```
Accordingly, if you choose a layer-wise or parameter-wise structure
(by providing `BlockMode.LAYER_WISE` or `BlockMode.PARAMETER_WISE` for
`block_structure`), the keys must be the layer names or parameter names,
respectively.
You can retrieve the block-wise influence information from the methods
with the suffix `_by_block`. By default, `block_structure` is set to
`BlockMode.FULL`, in which case these methods return a dictionary whose
only key is the empty string.
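
As a minimal sketch (assuming a fitted `infl_model` as above, test and training
tensors `x_test`, `y_test`, `x`, `y`, and that the block-wise counterpart of the
`influences` method is named `influences_by_block`):
```python
# Hypothetical usage: retrieve per-block influences for a model fitted with
# BlockMode.LAYER_WISE. The result is assumed to map block identifiers
# (here layer names) to the corresponding influence tensors.
block_wise_influences = infl_model.influences_by_block(x_test, y_test, x, y)
for block_name, values in block_wise_influences.items():
    print(block_name, values.shape)
```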

### Gauss-Newton approximation

In the computation of the influence values, the inversion of the Hessian can be
replaced by the inversion of the Gauss-Newton matrix

$$ G_{\hat{\theta}}=n^{-1} \sum_{i=1}^n \nabla_{\theta}L(z_i, \hat{\theta})
\nabla_{\theta}L(z_i, \hat{\theta})^T $$

so the computed values are of the form

$$\nabla_\theta L(z_{\text{test}}, \hat{\theta})^\top \
G_{\hat{\theta}}^{-1} \ \nabla_\theta L(z, \hat{\theta}). $$

The parameter `second_order_mode` is used to configure this approximation.
```python
from torch.utils.data import DataLoader
from pydvl.influence.torch import DirectInfluence, BlockMode, SecondOrderMode

training_data_loader = DataLoader(...)
infl_model = DirectInfluence(model, loss,
                             regularization={"layer_1": 0.1, "layer_2": 0.2},
                             block_structure=BlockMode.LAYER_WISE,
                             second_order_mode=SecondOrderMode.GAUSS_NEWTON)
infl_model = infl_model.fit(training_data_loader)
```


### Perturbation influences

The method of empirical influence computation can be selected with the
36 changes: 2 additions & 34 deletions docs/influence/influence_function_model.md
@@ -294,41 +294,9 @@ if_model = InverseHarmonicMeanInfluence(
)
if_model.fit(train_loader)
```
This implementation can use a block-matrix approximation; see
[Block-diagonal approximation](#block-diagonal-approximation).
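
A minimal sketch, assuming `InverseHarmonicMeanInfluence` accepts the same
`block_structure` parameter described in that section (the model, loss and
regularization value below are placeholders):
```python
from pydvl.influence.torch import InverseHarmonicMeanInfluence, BlockMode

# Assumed usage: block_structure is passed in the same way as for DirectInfluence.
if_model = InverseHarmonicMeanInfluence(
    model,
    loss,
    regularization=0.1,
    block_structure=BlockMode.LAYER_WISE,
)
if_model.fit(train_loader)
```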

!!! Info
This implementation is capable of using a block-matrix approximation. The
blocking structure can be specified via the `block_structure` parameter.
The `block_structure` parameter can either be a
[BlockMode][pydvl.influence.torch.util.BlockMode] enum (which provides
layer-wise or parameter-wise blocking) or a custom block structure defined
by an ordered dictionary with the keys being the block identifiers (arbitrary
strings) and the values being lists of parameter names contained in the block.
```python
block_structure = OrderedDict(
(
("custom_block1", ["0.weight", "1.bias"]),
("custom_block2", ["1.weight", "0.bias"]),
)
)
```
If you would like to apply a block-specific regularization, you can provide a
dictionary with the block names as keys and the regularization values as values.
In this case, the specification must be complete, i.e. every block must have
a positive regularization value.
```python
regularization = {
"custom_block1": 0.1,
"custom_block2": 0.2,
}
```
Accordingly, if you choose a layer-wise or parameter-wise structure
(by providing `BlockMode.LAYER_WISE` or `BlockMode.PARAMETER_WISE` for
`block_structure`) the keys must be the layer names or parameter names,
respectively.
You can retrieve the block-wise influence information from the methods
with suffix `_by_block`. By default, `block_structure` is set to
`BlockMode.FULL` and in this case these methods will return a dictionary
with the empty string being the only key.

These implementations represent the calculation logic on in-memory tensors.
To scale up to large collections of data, we map these influence function models
26 changes: 13 additions & 13 deletions notebooks/influence_synthetic.ipynb

Large diffs are not rendered by default.

104 changes: 58 additions & 46 deletions notebooks/influence_wine.ipynb

Large diffs are not rendered by default.

8 changes: 0 additions & 8 deletions src/pydvl/influence/base_influence_function_model.py
@@ -10,14 +10,6 @@
from .types import BatchType, BlockMapperType, DataLoaderType, InfluenceMode, TensorType


class UnsupportedInfluenceModeException(ValueError):
def __init__(self, mode: str):
super().__init__(
f"Provided {mode=} is not supported. Choose one of InfluenceMode.Up "
f"and InfluenceMode.Perturbation"
)


class NotFittedException(ValueError):
def __init__(self, object_type: Type):
super().__init__(
7 changes: 2 additions & 5 deletions src/pydvl/influence/influence_calculator.py
@@ -16,11 +16,8 @@
from numpy.typing import NDArray

from .array import LazyChunkSequence, NestedLazyChunkSequence, NumpyConverter
from .base_influence_function_model import (
InfluenceFunctionModel,
UnsupportedInfluenceModeException,
)
from .types import InfluenceMode, TensorType
from .base_influence_function_model import InfluenceFunctionModel
from .types import InfluenceMode, TensorType, UnsupportedInfluenceModeException

__all__ = [
"DaskInfluenceCalculator",
2 changes: 1 addition & 1 deletion src/pydvl/influence/torch/__init__.py
@@ -8,4 +8,4 @@
NystroemSketchInfluence,
)
from .pre_conditioner import JacobiPreConditioner, NystroemPreConditioner
from .util import BlockMode
from .util import BlockMode, SecondOrderMode
4 changes: 3 additions & 1 deletion src/pydvl/influence/torch/base.py
@@ -364,7 +364,7 @@ def grads_inner_prod(
def mixed_grads_inner_prod(
self,
left: TorchBatch,
right: TorchBatch,
right: Optional[TorchBatch],
gradient_provider: TorchGradientProvider,
) -> torch.Tensor:
r"""
Expand All @@ -387,6 +387,8 @@ def mixed_grads_inner_prod(
A tensor representing the inner products of the mixed per-sample gradients
"""
operator = cast(TensorDictOperator, self.operator)
if right is None:
right = left
right_grads = gradient_provider.mixed_grads(right)
left_grads = gradient_provider.grads(left)
left_grads = operator.apply_to_dict(left_grads)
87 changes: 77 additions & 10 deletions src/pydvl/influence/torch/functional.py
@@ -36,7 +36,14 @@
from torch.func import functional_call, grad, jvp, vjp
from torch.utils.data import DataLoader

from .util import align_structure, align_with_model, flatten_dimensions, to_model_device
from .util import (
BlockMode,
ModelParameterDictBuilder,
align_structure,
align_with_model,
flatten_dimensions,
to_model_device,
)

__all__ = [
"create_hvp_function",
@@ -383,6 +390,7 @@ def hessian(
data_loader: DataLoader,
use_hessian_avg: bool = True,
track_gradients: bool = False,
restrict_to: Optional[Dict[str, torch.Tensor]] = None,
) -> torch.Tensor:
"""
Computes the Hessian matrix for a given model and loss function.
@@ -397,18 +405,23 @@
If False, the empirical loss across the entire dataset is used.
track_gradients: Whether to track gradients for the resulting tensor of
the hessian vector products.
restrict_to: The parameters to restrict the second order differentiation to,
i.e. the corresponding sub-matrix of the Hessian. If None, the full Hessian
is computed.

Returns:
A tensor representing the Hessian matrix. The shape of the tensor will be
(n_parameters, n_parameters), where n_parameters is the number of trainable
parameters in the model.
"""
params = restrict_to

params = {
k: p if track_gradients else p.detach()
for k, p in model.named_parameters()
if p.requires_grad
}
if params is None:
params = {
k: p if track_gradients else p.detach()
for k, p in model.named_parameters()
if p.requires_grad
}
n_parameters = sum([p.numel() for p in params.values()])
model_dtype = next((p.dtype for p in params.values()))

@@ -424,13 +437,16 @@ def flat_input_batch_loss_function(
def flat_input_batch_loss_function(
p: torch.Tensor, t_x: torch.Tensor, t_y: torch.Tensor
):
return blf(align_with_model(p, model), t_x, t_y)
return blf(align_structure(params, p), t_x, t_y)

for x, y in iter(data_loader):
n_samples += x.shape[0]
hessian_mat += x.shape[0] * torch.func.hessian(
flat_input_batch_loss_function
)(flat_params, to_model_device(x, model), to_model_device(y, model))
batch_hessian = torch.func.hessian(flat_input_batch_loss_function)(
flat_params, to_model_device(x, model), to_model_device(y, model)
)
if not track_gradients and batch_hessian.requires_grad:
batch_hessian = batch_hessian.detach()
hessian_mat += x.shape[0] * batch_hessian

hessian_mat /= n_samples
else:
@@ -447,6 +463,57 @@ def flat_input_empirical_loss(p: torch.Tensor):
return hessian_mat


def gauss_newton(
model: torch.nn.Module,
loss: Callable[[torch.Tensor, torch.Tensor], torch.Tensor],
data_loader: DataLoader,
restrict_to: Optional[Dict[str, torch.Tensor]] = None,
):
r"""
Compute the Gauss-Newton matrix, i.e.

$$ N^{-1} \sum_{i=1}^N \nabla_{\theta}\ell(m(x_i; \theta), y_i)
\nabla_{\theta}\ell(m(x_i; \theta), y_i)^t,$$
for a loss function $\ell$ and a model $m$ with model parameters $\theta$.

Args:
model: The PyTorch model.
loss: A callable that computes the loss.
data_loader: A PyTorch DataLoader providing batches of input data and
corresponding output data.
restrict_to: The parameters to restrict the differentiation to,
i.e. the corresponding sub-matrix of the Jacobian. If None, the full
Jacobian is used.

Returns:
The Gauss-Newton matrix.
"""

per_sample_grads = create_per_sample_gradient_function(model, loss)

params = restrict_to
if params is None:
params = {k: p.detach() for k, p in model.named_parameters() if p.requires_grad}

    def generate_batch_matrices():
        for x, y in data_loader:
            grads = flatten_dimensions(
                per_sample_grads(params, x, y).values(), shape=(x.shape[0], -1)
            )
            # un-normalized Gram matrix of the per-sample gradients for this batch
            batch_mat = grads.t() @ grads
            yield batch_mat.detach(), x.shape[0]

    tensors = generate_batch_matrices()
    result, n_points = next(tensors)

    for t, batch_size in tensors:
        result += t
        n_points += batch_size

    # normalize by the number of data points, not the parameter dimension
    return result / n_points


def create_per_sample_loss_function(
model: torch.nn.Module, loss: Callable[[torch.Tensor, torch.Tensor], torch.Tensor]
) -> Callable[[Dict[str, torch.Tensor], torch.Tensor, torch.Tensor], torch.Tensor]: