GridSearch and skdag #32

Open
TonciG opened this issue Feb 13, 2024 · 2 comments

Comments

TonciG commented Feb 13, 2024

Hi,

First of all, I think your library is a great add-on to sklearn, especially since it addresses limitations of Pipeline.

Having said that, I tried to use skdag with sklearn's GridSearchCV but ran into a problem. I used one of the examples from the library docs (https://skdag.readthedocs.io/en/latest/quick_start.html) to run a grid search for optimal hyperparameter values. To your code I only added the following:
from sklearn.model_selection import GridSearchCV
params = {'blood__n_components': [1,2,3,4]}
grid = GridSearchCV(estimator = dag2, param_grid = params, scoring = 'accuracy')
grid.fit(X_train, y_train)

However, when I try to fit the model, I get the following error:
ValueError: Found input variables with inconsistent numbers of samples: [61, 2]

I would really appreciate it if you could tell me what is going on here.
Regards,
Tonci

Collaborator

big-o commented Jun 15, 2024

Hi, can you share your full code? There's no X_train in the quick start guide, so it's hard to recreate the issue from what you've included here.

Author

TonciG commented Jun 17, 2024

Hi,
Thank you for your response. Below is the full code:

from sklearn import datasets
from sklearn.model_selection import train_test_split
from skdag import DAGBuilder
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression

X, y = datasets.load_diabetes(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

dag = (
    DAGBuilder(infer_dataframe=True)
    .add_step("impute", SimpleImputer())
    .add_step("vitals", "passthrough", deps={"impute": ["age", "sex", "bmi", "bp"]})
    .add_step("blood", PCA(n_components=2, random_state=0), deps={"impute": slice(4, 10)})
    .add_step("lr", LogisticRegression(random_state=0), deps=["blood", "vitals"])
    .make_dag()
)

from sklearn.ensemble import RandomForestClassifier
cal = DAGBuilder(infer_dataframe=True).from_pipeline(
    [("rf", RandomForestClassifier(random_state=0))]
).make_dag()
dag2 = dag.join(cal, edges=[("blood", "rf"), ("vitals", "rf")])

y_pred = dag2.fit_predict(X_train, y_train)
type(y_pred)

from sklearn.model_selection import GridSearchCV

params = {'blood__n_components': [1,2,3,4]}
grid = GridSearchCV(estimator = dag2, param_grid = params, scoring = 'accuracy')
grid.fit(X_train, y_train)

Regards,
Tonci
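
A possible reading of the error, not confirmed by the maintainers in this thread: dag2 ends in two leaf steps ("lr" and "rf"), so its predict output may be a container with one entry per leaf rather than a single array. If that assumption holds, the 'accuracy' scorer would compare the 61 true labels of a validation fold against an object of length 2, which matches the [61, 2] in the message. The sketch below uses only sklearn and numpy; the dict is a hypothetical stand-in for the DAG output, not skdag's actual return type.

```python
import numpy as np
from sklearn.utils import check_consistent_length

# 61 true labels, as in one validation fold from the error message.
y_true = np.zeros(61)

# ASSUMPTION: a stand-in for a two-leaf DAG's predictions, keyed by step
# name. This is illustrative only and does not call skdag.
y_pred = {"lr": np.zeros(61), "rf": np.zeros(61)}

# sklearn metrics run this length check before scoring. len(y_pred) is 2
# (two keys), not 61, so the check fails.
try:
    check_consistent_length(y_true, y_pred)
    message = None
except ValueError as exc:
    message = str(exc)

print(message)
```

If this is indeed the cause, one option might be to pass GridSearchCV a callable scorer with the signature scorer(estimator, X, y) that selects a single leaf's predictions before computing accuracy.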
