GridSearch and skdag #32

TonciG · 2024-02-13T14:31:58Z

Hi,

First of all, I think your library is a great add-on to sklearn, especially since it addresses limitations of Pipeline.

Having said that, I tried to use skdag with sklearn's GridSearchCV but ran into a problem. I used one of the examples from the library docs (https://skdag.readthedocs.io/en/latest/quick_start.html) to run a grid search for the optimal hyperparameter values. To your code I added only the following:
from sklearn.model_selection import GridSearchCV
params = {'blood__n_components': [1,2,3,4]}
grid = GridSearchCV(estimator = dag2, param_grid = params, scoring = 'accuracy')
grid.fit(X_train, y_train)

However, when I try to fit the model, I get the following error:
ValueError: Found input variables with inconsistent numbers of samples: [61, 2]

I would really appreciate it if you could tell me what is going on here.
Regards,
Tonci

big-o · 2024-06-15T20:34:58Z

Hi, can you share your full code? There's no X_train in the quick start guide, so it's hard to recreate the issue from what you've included here.

TonciG · 2024-06-17T15:35:22Z

Hi,
Thank you for your response. Below is the full code:

from sklearn import datasets
from sklearn.model_selection import train_test_split
from skdag import DAGBuilder
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression

X, y = datasets.load_diabetes(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

dag = (
    DAGBuilder(infer_dataframe=True)
    .add_step("impute", SimpleImputer())
    .add_step("vitals", "passthrough", deps={"impute": ["age", "sex", "bmi", "bp"]})
    .add_step("blood", PCA(n_components=2, random_state=0), deps={"impute": slice(4, 10)})
    .add_step("lr", LogisticRegression(random_state=0), deps=["blood", "vitals"])
    .make_dag()
)

from sklearn.ensemble import RandomForestClassifier
cal = DAGBuilder(infer_dataframe=True).from_pipeline(
    [("rf", RandomForestClassifier(random_state=0))]
).make_dag()
dag2 = dag.join(cal, edges=[("blood", "rf"), ("vitals", "rf")])

y_pred = dag2.fit_predict(X_train, y_train)
type(y_pred)

from sklearn.model_selection import GridSearchCV

params = {'blood__n_components': [1,2,3,4]}
grid = GridSearchCV(estimator = dag2, param_grid = params, scoring = 'accuracy')
grid.fit(X_train, y_train)

Regards,
Tonci
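
A possible reading of the error, offered as a sketch rather than a confirmed diagnosis: dag2 ends in two leaf steps ("lr" and "rf"), so its prediction may be a mapping with one entry per leaf instead of a single array. If that is the case, the 'accuracy' scorer would compare the 61 true labels of a validation fold against a container of length 2. The snippet below does not use skdag at all; the plain dict is a hypothetical stand-in for such a two-output prediction, and it only shows how sklearn's length check treats that kind of input.

```python
import numpy as np
from sklearn.utils import check_consistent_length

# Stand-in for the true labels of one validation fold (61 samples).
y_true = np.zeros(61)

# Hypothetical stand-in for the prediction of a DAG with two leaf steps:
# a mapping of leaf name -> predictions rather than one array.
# This is an assumption about the output shape, not skdag's confirmed API.
y_pred = {"lr": np.zeros(61), "rf": np.zeros(61)}

try:
    # sklearn metrics run a length check like this before scoring.
    check_consistent_length(y_true, y_pred)
    message = None
except ValueError as exc:
    # The mapping is measured by its number of keys, not its samples.
    message = str(exc)

print(message)
```

If this is indeed the cause, searching over a DAG that ends in a single estimator, or passing a custom scorer that picks one of the outputs, might avoid the mismatch. Neither option has been verified against skdag here.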
