feat(skorch): add an inherited class from skorch.NeuralNet that is compatible with PyTorch Frame #375

Open
wants to merge 57 commits into
base: master
71fbea0
feat(skorch): add prototype of an inherited class from skorch.NeuralN…
34j Mar 11, 2024
b8e8ae4
docs: add tutorial for the last commit
34j Mar 11, 2024
df8ecc4
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 11, 2024
ca95b8f
fix: patch `skorch.utils.to_tensor()`
34j Mar 11, 2024
0b9426f
style: format code
34j Mar 16, 2024
198b749
feat: fix multiple issues, support sklearn-like datasets and predict()
34j Mar 16, 2024
d264488
chore(example): test with regression as well
34j Mar 16, 2024
691f204
Merge branch 'master' into feat/skorch-compatible
34j Mar 16, 2024
9cc4fe1
docs: add changelog
34j Mar 16, 2024
98aea5c
fix(skorch): import annotations from __future__
34j Mar 16, 2024
0f650d8
revert: revert wrong changes
34j Mar 16, 2024
95688e3
style(skorch): fix typing
34j Mar 16, 2024
7594b44
fix(skorch): use `classes` if specified
34j Mar 16, 2024
aa5484d
Merge branch 'master' into feat/skorch-compatible
34j Apr 3, 2024
cb76e8d
Merge branch 'master' into feat/skorch-compatible
34j May 2, 2024
4a7598d
Merge branch 'master' into feat/skorch-compatible
34j Jul 4, 2024
bacf31f
chore: remove comments
34j Jul 4, 2024
1c50a59
chore(skorch): add more comments
34j Jul 4, 2024
3a90392
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jul 4, 2024
474caff
test: add prototype test
34j Jul 4, 2024
cda1524
feat: add NeuralNetBinaryClassifierPytorchFrame
34j Jul 4, 2024
568a6de
test: update test
34j Jul 4, 2024
7cc2f30
fix(dataset): fix dataset
34j Jul 4, 2024
95f22c1
feat: allow creating module later
34j Jul 4, 2024
7f2ec3e
test: add binary test
34j Jul 4, 2024
098cb4f
feat: add sklearn test
34j Jul 4, 2024
d8a1ca5
style: format code
34j Jul 4, 2024
4d12972
docs: update docs
34j Jul 4, 2024
03e9d56
chore(deps): add skorch as deps
34j Jul 4, 2024
ca284f3
test_skorch.py
34j Jul 4, 2024
6052910
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jul 4, 2024
2826100
test_skorch.py
34j Jul 4, 2024
ba740ba
fix: use dict.update instead of dict | dict
34j Jul 5, 2024
a0b3f51
fix(dataset): convert indices to list
34j Jul 5, 2024
90f57d4
fix: fix staticmethod usage for < Python 310
34j Jul 5, 2024
10689d9
fix: safer patch
34j Jul 5, 2024
71d9763
fix: do not call twice
34j Jul 5, 2024
eca1905
fix: copy dataframe before adding columns
34j Jul 5, 2024
33009b6
docs: add docs to _patch_skorch_support_tenforframe
34j Jul 6, 2024
e624953
fix(skorch): wrap with functools.wraps
34j Jul 6, 2024
3276903
fix: move imports
34j Jul 6, 2024
18a50ee
chore: do not use NeuralNetClassifierPytorchFrame for regression alth…
34j Jul 6, 2024
d53061d
fix(skorch): add typing only for module
34j Jul 6, 2024
a09beb2
fix: support specifying module as class
34j Jul 6, 2024
a967b0d
docs: add docs
34j Jul 6, 2024
bc07d7b
fix: fix dtype
34j Jul 6, 2024
aa23ca0
Merge branch 'master' into feat/skorch-compatible
34j Jul 8, 2024
e6d5dfe
Merge branch 'master' into feat/skorch-compatible
34j Jul 9, 2024
2fe4f69
test: remove comment
34j Jul 11, 2024
14f0e7b
Discard changes to examples/revisiting.py
34j Jul 11, 2024
06ec88e
Discard changes to examples/tutorial.py
34j Jul 11, 2024
8d4d32d
Discard changes to README.md
34j Jul 11, 2024
947daf1
fix: use args instead of kwargs to match typing
34j Jul 11, 2024
3769f2d
feat: add example for sklearn api
34j Jul 11, 2024
05785eb
Merge branch 'master' into feat/skorch-compatible
34j Jul 24, 2024
8e46fb5
Merge branch 'master' into feat/skorch-compatible
34j Sep 16, 2024
eddecf8
Merge branch 'master' into feat/skorch-compatible
34j Sep 23, 2024
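
Several commits above ("fix: patch `skorch.utils.to_tensor()`", "fix: safer patch", "fix(skorch): wrap with functools.wraps") deal with temporarily replacing a library function. The PR's actual patch is not visible in this view, so the following is only a minimal sketch of the general pattern. `fake_lib`, `to_tensor`, `patched_to_tensor`, and `Frame` are hypothetical stand-ins, not skorch or PyTorch Frame code.

```python
import contextlib
import functools
import types

# Hypothetical stand-in for a third-party module; this is not skorch.
fake_lib = types.SimpleNamespace()


def to_tensor(x):
    """Convert an iterable of numbers to a tuple (stand-in converter)."""
    return tuple(x)


fake_lib.to_tensor = to_tensor


@contextlib.contextmanager
def patched_to_tensor(passthrough_type):
    """Temporarily let `passthrough_type` instances pass through unchanged."""
    original = fake_lib.to_tensor

    @functools.wraps(original)  # keep the original name and docstring
    def wrapper(x):
        if isinstance(x, passthrough_type):
            return x
        return original(x)

    fake_lib.to_tensor = wrapper
    try:
        yield
    finally:
        # Always restore the original, even if the body raises.
        fake_lib.to_tensor = original


class Frame:
    """Hypothetical container that the original converter cannot handle."""


frame = Frame()
with patched_to_tensor(Frame):
    inside = fake_lib.to_tensor(frame)
    inside_name = fake_lib.to_tensor.__name__
restored = fake_lib.to_tensor is to_tensor
```

Restoring the original in a `finally` block keeps the patch from leaking into unrelated code, which is one plausible reading of "safer patch".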
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -31,6 +31,7 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).

- Added `MovieLens 1M` dataset ([#397](https://github.com/pyg-team/pytorch-frame/pull/397))
- Added light-weight MLP ([#372](https://github.com/pyg-team/pytorch-frame/pull/372))
- Added an inherited class from skorch.NeuralNet that is compatible with PyTorch Frame ([#375](https://github.com/pyg-team/pytorch-frame/pull/375))
- Added R^2 metric ([#403](https://github.com/pyg-team/pytorch-frame/pull/403))

### Changed
54 changes: 54 additions & 0 deletions examples/sklearn_api.py
@@ -0,0 +1,54 @@
from typing import Any

import torch.nn as nn
from sklearn.datasets import load_diabetes
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

from torch_frame import stype
from torch_frame.data.stats import StatType
from torch_frame.nn import Trompt
from torch_frame.nn.models.trompt import Trompt
from torch_frame.utils.skorch import NeuralNetPytorchFrame

# load the diabetes dataset
X, y = load_diabetes(return_X_y=True, as_frame=True)

# split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y)


# define the function to get the module
def get_module(col_stats: dict[str, dict[StatType, Any]],
               col_names_dict: dict[stype, list[str]]) -> Trompt:
    channels = 8
    out_channels = 1
    num_prompts = 2
    num_layers = 3
    return Trompt(channels=channels, out_channels=out_channels,
                  num_prompts=num_prompts, num_layers=num_layers,
                  col_stats=col_stats, col_names_dict=col_names_dict,
                  stype_encoder_dicts=None)


# wrap the function in a NeuralNetPytorchFrame
# NeuralNetClassifierPytorchFrame and NeuralNetBinaryClassifierPytorchFrame
# are also available
net = NeuralNetPytorchFrame(
    module=get_module,
    criterion=nn.MSELoss(),
    max_epochs=10,
    verbose=1,
    lr=0.0001,
    batch_size=30,
)

# fit the model
net.fit(X_train, y_train)

# predict on the test set
y_pred = net.predict(X_test)

# calculate the mean squared error
mse = mean_squared_error(y_test, y_pred)
print(mse)
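
The example passes a function instead of a module instance because the model needs `col_stats` and `col_names_dict`, which are only known once the training data has been seen. How the wrapper does this internally is not shown in this diff. The following is a minimal plain-Python sketch of the idea, in which `LazyEstimator`, `compute_col_stats`, `TinyModel`, and `get_model` are all hypothetical names.

```python
from __future__ import annotations

from typing import Any, Callable


def compute_col_stats(rows: list[dict[str, float]]) -> dict[str, dict[str, float]]:
    """Hypothetical stand-in: per-column mean computed from the data."""
    cols = rows[0].keys()
    return {c: {"mean": sum(r[c] for r in rows) / len(rows)} for c in cols}


class TinyModel:
    """Hypothetical model that needs column statistics at construction."""

    def __init__(self, col_stats, col_names, channels: int = 8):
        self.col_stats = col_stats
        self.col_names = col_names
        self.channels = channels


class LazyEstimator:
    """Builds its model inside fit(), once the data is available."""

    def __init__(self, module: Callable[..., Any], **module_kwargs: Any):
        self.module = module  # a class or a factory function
        self.module_kwargs = module_kwargs
        self.model_ = None

    def fit(self, rows: list[dict[str, float]]) -> LazyEstimator:
        col_stats = compute_col_stats(rows)
        # Classes and factory functions are both callables, so a single
        # call covers both ways of specifying the module.
        self.model_ = self.module(col_stats, list(col_stats),
                                  **self.module_kwargs)
        return self


def get_model(col_stats, col_names):
    """Factory function, analogous in spirit to get_module above."""
    return TinyModel(col_stats, col_names, channels=4)


rows = [{"age": 30.0, "bmi": 20.0}, {"age": 50.0, "bmi": 30.0}]
from_function = LazyEstimator(get_model).fit(rows)
from_class = LazyEstimator(TinyModel, channels=16).fit(rows)
```

Deferring construction to `fit()` is what lets the caller hand over a plain DataFrame without computing any statistics up front.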
1 change: 1 addition & 0 deletions pyproject.toml
@@ -56,6 +56,7 @@ full=[
    "lightgbm",
    "datasets",
    "torchmetrics",
    "skorch",
]

[project.urls]
198 changes: 198 additions & 0 deletions test/utils/test_skorch.py
@@ -0,0 +1,198 @@
from __future__ import annotations

from typing import Any

import pandas as pd
import pytest
import torch
import torch.nn as nn
from sklearn.datasets import load_diabetes, load_iris
from sklearn.metrics import accuracy_score, mean_squared_error
from sklearn.model_selection import train_test_split

from torch_frame import TaskType, stype
from torch_frame.config.text_embedder import TextEmbedderConfig
from torch_frame.data.dataset import Dataset
from torch_frame.data.stats import StatType
from torch_frame.datasets.fake import FakeDataset
from torch_frame.nn.models.mlp import MLP
from torch_frame.testing.text_embedder import HashTextEmbedder
from torch_frame.utils.skorch import (
    NeuralNetBinaryClassifierPytorchFrame,
    NeuralNetClassifierPytorchFrame,
    NeuralNetPytorchFrame,
)


class EnsureDtypeLoss(nn.Module):
    def __init__(self, loss: nn.Module, dtype_input: torch.dtype = torch.float,
                 dtype_target: torch.dtype = torch.float):
        super().__init__()
        self.loss = loss
        self.dtype_input = dtype_input
        self.dtype_target = dtype_target

    def forward(self, input, target):
        return self.loss(
            input.to(dtype=self.dtype_input).squeeze(),
            target.to(dtype=self.dtype_target).squeeze())


@pytest.mark.parametrize('cls', ["mlp"])
@pytest.mark.parametrize(
    'stypes',
    [
        [stype.numerical],
        [stype.categorical],
        # [stype.text_embedded],
        # [stype.numerical, stype.numerical, stype.text_embedded],
Comment on lines +47 to +48
Contributor

we don't support these stypes?

Author

These stypes are not supported yet because I have not had time to work out how to use them.
However, since supporting them probably only requires changes to the arguments of the NeuralNet, extending it in the future should be straightforward.

    ])
@pytest.mark.parametrize('task_type_and_loss_cls', [
    (TaskType.REGRESSION, nn.MSELoss),
    (TaskType.BINARY_CLASSIFICATION, nn.BCEWithLogitsLoss),
    (TaskType.MULTICLASS_CLASSIFICATION, nn.CrossEntropyLoss),
])
@pytest.mark.parametrize('pass_dataset', [False, True])
@pytest.mark.parametrize('module_as_function', [False, True])
def test_skorch_torchframe_dataset(cls, stypes, task_type_and_loss_cls,
                                   pass_dataset: bool,
                                   module_as_function: bool):
    task_type, loss_cls = task_type_and_loss_cls
    loss = loss_cls()
    loss = EnsureDtypeLoss(
        loss, dtype_target=torch.long
        if task_type == TaskType.MULTICLASS_CLASSIFICATION else torch.float)

    # initialize dataset
    dataset: Dataset = FakeDataset(
        num_rows=30,
        # with_nan=True,
        stypes=stypes,
        create_split=True,
        task_type=task_type,
        col_to_text_embedder_cfg=TextEmbedderConfig(
            text_embedder=HashTextEmbedder(8)),
    )
    dataset.materialize()
    train_dataset, val_dataset, test_dataset = dataset.split()
    if not pass_dataset:
        df_train = pd.concat([train_dataset.df, val_dataset.df])
        X_train, y_train = df_train.drop(
            columns=[dataset.target_col, dataset.split_col]), df_train[
                dataset.target_col]
        df_test = test_dataset.df
        X_test, _ = df_test.drop(
            columns=[dataset.target_col, dataset.split_col]), df_test[
                dataset.target_col]

        # never use dataset again
        # we assume that only dataframes are available
        del train_dataset, val_dataset, test_dataset

    if cls == "mlp":
        if module_as_function:

            def get_module(col_stats: dict[str, dict[StatType, Any]],
                           col_names_dict: dict[stype, list[str]]) -> MLP:
                channels = 8
                out_channels = 1
                if task_type == TaskType.MULTICLASS_CLASSIFICATION:
                    out_channels = dataset.num_classes
                num_layers = 3
                return MLP(
                    channels=channels,
                    out_channels=out_channels,
                    num_layers=num_layers,
                    col_stats=col_stats,
                    col_names_dict=col_names_dict,
                    normalization="layer_norm",
                )

            module = get_module
            kwargs = {}
        else:
            module = MLP
            kwargs = {
                "channels":
                8,
                "out_channels":
                dataset.num_classes
                if task_type == TaskType.MULTICLASS_CLASSIFICATION else 1,
                "num_layers":
                3,
                "normalization":
                "layer_norm",
            }
            kwargs = {f"module__{k}": v for k, v in kwargs.items()}
    else:
        raise NotImplementedError
    kwargs.update({
        "module": module,
        "criterion": loss,
        "max_epochs": 2,
        "verbose": 1,
        "batch_size": 3,
    })

    if task_type == TaskType.REGRESSION:
        net = NeuralNetPytorchFrame(**kwargs, )
    elif task_type == TaskType.MULTICLASS_CLASSIFICATION:
        net = NeuralNetClassifierPytorchFrame(**kwargs, )
    elif task_type == TaskType.BINARY_CLASSIFICATION:
        net = NeuralNetBinaryClassifierPytorchFrame(**kwargs, )

    if pass_dataset:
        net.fit(dataset)
        _ = net.predict(test_dataset)
    else:
        net.fit(X_train, y_train)
        _ = net.predict(X_test)
Comment on lines +144 to +149
Contributor

why don't we take tensor frame? It's also weird to sometimes take dataset and sometimes take data frame.

Author

@34j Jul 12, 2024

  • The main purpose of this PR is to allow DataFrame to be DIRECTLY fitted, as shown in examples/sklearn_api.py.
  • It is unclear how to create a Dataset from a TensorFrame. Moreover, anyone who already has a TensorFrame should also have a Dataset, so there is little need to support TensorFrame input; such a user is probably familiar with deep learning and may not need skorch at all.
  • Instead of "sometimes take dataset and sometimes take data frame", both are tested.

Contributor

hmm if dataframe is directly fed, it is unclear why we need this feature within pytorch frame.
the whole point of pytorch frame is to materialize data frame into tensor frame, to be processed by pytorch.

Author

@34j Jul 12, 2024

It may not match your purpose, but my goal is to use the advanced neural networks implemented in pytorch_frame in existing sklearn pipelines.
This PR allows pytorch_frame to be used on top of existing scikit-learn code without heavy modification. Since many people use sklearn Pipeline, especially on Kaggle, it becomes easy to measure performance changes by replacing the estimator in other people's code with NeuralNetPytorchFrame, or by combining it with the existing estimators. I am convinced that this will be very valuable.



@pytest.mark.parametrize(
    'task_type', [TaskType.MULTICLASS_CLASSIFICATION, TaskType.REGRESSION])
def test_sklearn_only(task_type) -> None:
    if task_type == TaskType.MULTICLASS_CLASSIFICATION:
        X, y = load_iris(return_X_y=True, as_frame=True)
        num_classes = 3
    else:
        X, y = load_diabetes(return_X_y=True, as_frame=True)

    X_train, X_test, y_train, y_test = train_test_split(X, y)

    def get_module(col_stats: dict[str, dict[StatType, Any]],
                   col_names_dict: dict[stype, list[str]]) -> MLP:
        channels = 8
        out_channels = 1
        if task_type == TaskType.MULTICLASS_CLASSIFICATION:
            out_channels = num_classes
        num_layers = 3
        return MLP(
            channels=channels,
            out_channels=out_channels,
            num_layers=num_layers,
            col_stats=col_stats,
            col_names_dict=col_names_dict,
            normalization="layer_norm",
        )

    net = NeuralNetClassifierPytorchFrame(
        module=get_module,
        criterion=nn.CrossEntropyLoss()
        if task_type == TaskType.MULTICLASS_CLASSIFICATION else nn.MSELoss(),
        max_epochs=2,
        verbose=1,
        lr=0.0001,
        batch_size=3,
    )
    net.fit(X_train, y_train)
    y_pred = net.predict(X_test)

    if task_type == TaskType.MULTICLASS_CLASSIFICATION:
        assert y_pred.shape == (len(y_test), num_classes)
        acc = accuracy_score(y_test, y_pred.argmax(-1))
        print(acc)
    else:
        assert y_pred.shape == (len(y_test), 1)
        mse = mean_squared_error(y_test, y_pred)
        print(mse)
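
The first test above prefixes module arguments with `module__` (for example `module__channels`), which is skorch's convention for routing keyword arguments to the module. The sketch below only illustrates the prefix-splitting idea in plain Python. `split_prefixed` is a hypothetical helper, not skorch's implementation.

```python
from __future__ import annotations

from typing import Any


def split_prefixed(kwargs: dict[str, Any],
                   prefix: str) -> tuple[dict[str, Any], dict[str, Any]]:
    """Split kwargs into (routed, remaining) by a `<prefix>__` key prefix."""
    marker = prefix + "__"
    routed = {
        k[len(marker):]: v
        for k, v in kwargs.items() if k.startswith(marker)
    }
    remaining = {k: v for k, v in kwargs.items() if not k.startswith(marker)}
    return routed, remaining


module_kwargs = {"channels": 8, "out_channels": 1, "num_layers": 3}
kwargs = {f"module__{k}": v for k, v in module_kwargs.items()}
# dict.update works on every Python 3 version; the `|` operator for
# dicts only exists from Python 3.9 onwards.
kwargs.update({"max_epochs": 2, "batch_size": 3})

routed, remaining = split_prefixed(kwargs, "module")
```

Here `routed` holds the arguments meant for the module and `remaining` holds the ones meant for the wrapper itself.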
5 changes: 3 additions & 2 deletions torch_frame/data/dataset.py
@@ -7,6 +7,7 @@
 from collections import defaultdict
 from typing import Any

+import numpy as np
 import pandas as pd
 import torch
 from torch import Tensor
@@ -733,8 +734,8 @@ def get_split(self, split: str) -> Dataset:
         if split not in ["train", "val", "test"]:
             raise ValueError(f"The split named '{split}' is not available. "
                              f"Needs to be either 'train', 'val', or 'test'.")
-        indices = self.df.index[self.df[self.split_col] ==
-                                SPLIT_TO_NUM[split]].tolist()
+        indices = np.where(
+            self.df[self.split_col] == SPLIT_TO_NUM[split])[0].tolist()
         return self[indices]

     def split(self) -> tuple[Dataset, Dataset, Dataset]:
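
The change to `get_split` replaces a label-based lookup (`self.df.index[...]`) with a positional one (`np.where(...)[0]`). Presumably this matters when the DataFrame index is not the default 0..n-1, for example after a shuffled `train_test_split`; that motivation is inferred from the diff and the commit title "fix(dataset): convert indices to list", not stated in them. A plain-Python analogy, with made-up labels and an assumed split encoding:

```python
# Row labels as they might look after shuffling (hypothetical values).
index_labels = [42, 7, 19, 3]
rows = ["a", "b", "c", "d"]
split_col = [0, 1, 0, 2]  # assumed encoding: 0 = train, 1 = val, 2 = test
TRAIN = 0

# Analogue of `df.index[mask].tolist()`: yields row labels.
label_based = [lab for lab, s in zip(index_labels, split_col) if s == TRAIN]
# Analogue of `np.where(mask)[0].tolist()`: yields row positions.
positional = [pos for pos, s in enumerate(split_col) if s == TRAIN]

# Positions are always valid for positional indexing; labels need not be.
train_rows = [rows[i] for i in positional]
try:
    [rows[i] for i in label_based]
    labels_usable = True
except IndexError:
    labels_usable = False
```

With a default index the two lists coincide, which would explain why the difference only shows up for DataFrames passed in from sklearn-style code.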