
Missing move to device in influence_model.fit() #569

Closed
sleepymalc opened this issue May 1, 2024 · 5 comments · Fixed by #570
Labels: bug (Something isn't working)

@sleepymalc

When using a model on "cuda" to construct an instance of influence_model, calling influence_model.fit() raises an error indicating that tensors are on different devices.
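For illustration, this is the standard PyTorch device-mismatch pattern: the model's parameters sit on cuda:0 while the batches yielded by the DataLoader stay on the CPU. A minimal, CPU-runnable sketch of the usual fix pattern (generic PyTorch, not pyDVL internals):

```python
import torch
import torch.nn as nn

# Generic sketch: infer the model's device and move each batch onto it
# before the forward pass. This is what a fitting loop needs to do internally.
model = nn.Linear(4, 2)            # imagine this was moved with .to("cuda")
device = next(model.parameters()).device

x = torch.randn(3, 4)              # a batch from a DataLoader, still on CPU
y = model(x.to(device))            # moving the batch first avoids the error
print(tuple(y.shape))              # (3, 2)
```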

@mdbenito added the bug label on May 2, 2024
@schroedk
Collaborator

schroedk commented May 2, 2024

@sleepymalc Thank you for reporting. Could you please provide more information so that we can reproduce and fix the issue?
The easiest would be a minimal, reproducible example. Which implementation do you use (CgInfluence, ...?), and which version of pyDVL?

@sleepymalc
Author

Sure, the following is an MWE:

import torch
import torch.nn as nn
import torch.optim as optim

from torchvision import datasets, transforms
from torch.utils.data import DataLoader, Sampler
from pydvl.influence.torch import EkfacInfluence
import random
import numpy as np

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Define a simple MLP model
class MLP(nn.Module):
    def __init__(self, input_size=784, hidden_size=128, output_size=10, num_layers=2):
        super().__init__()
        self.flatten = nn.Flatten()
        self.layers = nn.ModuleList()
        self.layers.append(nn.Linear(input_size, hidden_size))
        for _ in range(num_layers - 2):
            self.layers.append(nn.Linear(hidden_size, hidden_size))
        self.layers.append(nn.Linear(hidden_size, output_size))
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.flatten(x)
        for layer in self.layers[:-1]:
            x = self.relu(layer(x))
        x = self.layers[-1](x)
        return x

    def train_with_seed(self, train_loader, epochs=30, seed=0, verbose=True):
        torch.manual_seed(seed)
        random.seed(seed)
        np.random.seed(seed)

        criterion = nn.CrossEntropyLoss()
        optimizer = optim.SGD(self.parameters(), lr=0.01, momentum=0.9)
        for epoch in range(epochs):
            running_loss = 0.0
            for images, labels in train_loader:
                images, labels = images.to(device), labels.to(device)
                optimizer.zero_grad()
                outputs = self(images)
                loss = criterion(outputs, labels)
                loss.backward()
                optimizer.step()
                running_loss += loss.item()
            if verbose:
                print(f"Epoch {epoch+1}, Loss: {running_loss/len(train_loader)}")
        print("Training complete")

    def test(self, test_loader):
        self.eval()
        correct = 0
        total = 0
        # No gradient is needed for evaluation
        with torch.no_grad():
            for images, labels in test_loader:
                images, labels = images.to(device), labels.to(device)
                outputs = self(images)
                # Get the predicted class from the maximum value in the output-list of class scores
                _, predicted = torch.max(outputs.data, 1)
                total += labels.size(0)
                correct += (predicted == labels).sum().item()
        accuracy = 100 * correct / total
        print(f'Accuracy of the model on the test set: {accuracy:.2f}%')

class SubsetSamper(Sampler):
    def __init__(self, indices):
        self.indices = indices

    def __iter__(self):
        return iter(self.indices)

    def __len__(self):
        return len(self.indices)

transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.5,), (0.5,))])
train_dataset = datasets.MNIST(root='./data', train=True, download=True, transform=transform)
test_dataset = datasets.MNIST(root='./data', train=False, download=True, transform=transform)

train_loader = DataLoader(train_dataset, batch_size=1, sampler=SubsetSamper(list(range(500))))
test_loader = DataLoader(test_dataset, batch_size=1, sampler=SubsetSamper(list(range(50))))

influence_model = EkfacInfluence(
                MLP().to(device),
                update_diagonal=True,
                hessian_regularization=0.001,
            )
influence_model = influence_model.fit(train_loader)

When running influence_model.fit(train_loader), I get the following:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[3], line 23
     16 test_loader = DataLoader(test_dataset, batch_size=1, sampler=SubsetSamper(list(range(50))))
     18 influence_model = EkfacInfluence(
     19                 MLP().to(device),
     20                 update_diagonal=True,
     21                 hessian_regularization=0.001,
     22             )
---> 23 influence_model = influence_model.fit(train_loader)

File ~/miniconda3/envs/influence/lib/python3.9/site-packages/pydvl/utils/progress.py:56, in log_duration.<locals>.decorator_log_duration.<locals>.wrapper_log_duration(*args, **kwargs)
     54 duration_logger.log(log_level, f"Function '{func_name}' is starting.")
     55 start_time = time()
---> 56 result = func(*args, **kwargs)
     57 duration = time() - start_time
     58 duration_logger.log(
     59     log_level,
     60     f"Function '{func_name}' completed. " f"Duration: {duration:.2f} sec",
     61 )

File ~/miniconda3/envs/influence/lib/python3.9/site-packages/pydvl/influence/torch/influence_function_model.py:1218, in EkfacInfluence.fit(self, data)
   1211 @log_duration(log_level=logging.INFO)
   1212 def fit(self, data: DataLoader) -> EkfacInfluence:
   1213     """
   1214     Compute the KFAC blocks for each layer of the model, using the provided data.
   1215     It then creates an EkfacRepresentation object that stores the KFAC blocks for
   1216     each layer, their eigenvalue decomposition and diagonal values.
   1217     """
-> 1218     forward_x, grad_y = self._get_kfac_blocks(data)
   1219     layers_evecs_a = {}
   1220     layers_evect_g = {}

File ~/miniconda3/envs/influence/lib/python3.9/site-packages/pydvl/influence/torch/influence_function_model.py:1198, in EkfacInfluence._get_kfac_blocks(self, data)
   1194 for x, *_ in tqdm(
   1195     data, disable=not self.progress, desc="K-FAC blocks - batch progress"
   1196 ):
   1197     data_len += x.shape[0]
-> 1198     pred_y = self.model(x)
   1199     loss = empirical_cross_entropy_loss_fn(pred_y)
   1200     loss.backward()

File ~/miniconda3/envs/influence/lib/python3.9/site-packages/torch/nn/modules/module.py:1511, in Module._wrapped_call_impl(self, *args, **kwargs)
   1509     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1510 else:
-> 1511     return self._call_impl(*args, **kwargs)

File ~/miniconda3/envs/influence/lib/python3.9/site-packages/torch/nn/modules/module.py:1520, in Module._call_impl(self, *args, **kwargs)
   1515 # If we don't have any hooks, we want to skip the rest of the logic in
   1516 # this function, and just call forward.
   1517 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1518         or _global_backward_pre_hooks or _global_backward_hooks
   1519         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1520     return forward_call(*args, **kwargs)
   1522 try:
   1523     result = None

Cell In[2], line 18
     16 x = self.flatten(x)
     17 for layer in self.layers[:-1]:
---> 18     x = self.relu(layer(x))
     19 x = self.layers[-1](x)
     20 return x

File ~/miniconda3/envs/influence/lib/python3.9/site-packages/torch/nn/modules/module.py:1511, in Module._wrapped_call_impl(self, *args, **kwargs)
   1509     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1510 else:
-> 1511     return self._call_impl(*args, **kwargs)

File ~/miniconda3/envs/influence/lib/python3.9/site-packages/torch/nn/modules/module.py:1561, in Module._call_impl(self, *args, **kwargs)
   1558     bw_hook = hooks.BackwardHook(self, full_backward_hooks, backward_pre_hooks)
   1559     args = bw_hook.setup_input_hook(args)
-> 1561 result = forward_call(*args, **kwargs)
   1562 if _global_forward_hooks or self._forward_hooks:
   1563     for hook_id, hook in (
   1564         *_global_forward_hooks.items(),
   1565         *self._forward_hooks.items(),
   1566     ):
   1567         # mark that always called hook is run

File ~/miniconda3/envs/influence/lib/python3.9/site-packages/torch/nn/modules/linear.py:116, in Linear.forward(self, input)
    115 def forward(self, input: Tensor) -> Tensor:
--> 116     return F.linear(input, self.weight, self.bias)

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument mat1 in method wrapper_CUDA_addmm)
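Until a fix is released, one workaround that should sidestep the error is to wrap the loader so every batch is moved to the model's device before pyDVL iterates it. This is my own sketch (DeviceLoader is a hypothetical helper, not part of pyDVL):

```python
import torch

class DeviceLoader:
    """Wraps an iterable of tensor batches and moves each batch to `device`."""

    def __init__(self, loader, device):
        self.loader = loader
        self.device = device

    def __iter__(self):
        for batch in self.loader:
            yield tuple(t.to(self.device) for t in batch)

    def __len__(self):
        return len(self.loader)

# CPU-only demo with a fake one-batch loader; in the MWE above one would pass
# DeviceLoader(train_loader, device) to influence_model.fit() instead.
fake_loader = [(torch.zeros(2, 3), torch.zeros(2, dtype=torch.long))]
xb, yb = next(iter(DeviceLoader(fake_loader, torch.device("cpu"))))
print(xb.device)  # cpu
```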

Hope this helps.

@schroedk
Collaborator

schroedk commented May 3, 2024

@sleepymalc thanks, that helped a lot. Please have a look at #570. To install the fix via pip:

pip install git+https://github.com/aai-institute/pyDVL.git@fix/569-missing-device-move-ekfac

To test it, I added a small snippet at the end of your file:

# I had to add this due to an old CUDA version in combination with NaN values; try it if you need it
# torch.backends.cuda.preferred_linalg_library('magma')
influence_model = influence_model.fit(train_loader)

for x_train, y_train in train_loader:
    for x_test, y_test in test_loader:
        influence_model.influences(x_test, y_test, x_train, y_train)
        fac = influence_model.influence_factors(x_test, y_test)
        influence_model.influences_from_factors(fac, x_train, y_train)
        break
    break

Please let me know if this solves the problem. Thanks! :)

@sleepymalc
Author

It seems like the problem is solved! Thanks for the quick fix.

@schroedk
Collaborator

schroedk commented May 3, 2024

@sleepymalc awesome! Please let us know if you encounter any other issues. Thanks! :)
