In order to reach a good score and solve a complex machine learning problem, multiple angles and attempts are often necessary. Here, I hope to document my trials on the journey to find an optimal solution for the Santander Customer Satisfaction competition on Kaggle. Although it's a competition from 7 years ago, I believe there's much to learn from trying to crack it.
Important info: In Kaggle competitions, there are two types of scores: the Public Score, used to rank the public leaderboard, and the Private Score, used to order the final positions of the competition. Since other competitors' private scores aren't readily available for comparison, I'll focus on the public score instead.
Attempt | Model | Public Score | Private Score |
---|---|---|---|
1st | DecisionTreeClassifier | 0.65662 | 0.6621 |
2nd (Oversampling) | RandomForestClassifier | 0.69402 | 0.6884 |
2nd (Undersampling) | DecisionTreeClassifier | 0.68533 | 0.68632 |
3rd (Oversampling) | VotingClassifier | 0.71566 | 0.7067 |
3rd (Undersampling) | VotingClassifier | 0.74544 | 0.73255 |
4th | VotingClassifier | 0.74268 | 0.72855 |
5th | Keras.Sequential() | 0.73499 | 0.72331 |
6th (Oversampling) | Keras.Sequential() | 0.71362 | 0.68998 |
6th (Undersampling) | Keras.Sequential() | 0.71995 | 0.71407 |
7th (Stacking) | StackingClassifier() | 0.74151 | 0.73156 |
7th (Voting), Best Public Score | VotingClassifier() | 0.74636 | 0.73214 |
7th (Voting), Best Private Score | VotingClassifier() | 0.74544 | 0.7351 |
Overall, there are some key elements of the dataset that need to be addressed at every new attempt: its high dimensionality (371 columns) and the major imbalance of the Target variable.
To address the second issue, I used the SMOTE oversampling technique. As for the first issue, I decided to try a combination of feature selection techniques and algorithms to choose the best set of columns.
Four approaches were used:
- I calculated the correlation between each feature and the Target, and made 5 lists with the 10, 20, 30, 40 and 50 most correlated columns, respectively.
- After applying ``MinMaxScaler()`` to the dataset, I used the ``VarianceThreshold`` algorithm to make two lists: one with the features with more than 5% variance, and another with the features above 1%.
- I used the ``SelectKBest`` algorithm to select the 10, 20, 30, 40 and 50 best features with the chi2 method.
- I used the ``SelectFromModel`` class to return the features selected by the LogisticRegression and DecisionTree algorithms.
As a result, I was left with 14 sets of features, which I submitted to a function that returned the columns that appeared most frequently, creating, in the end, five sets of features with the columns that appeared in 3, 5, 7, 9 and 11 of the methods, respectively.
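To make that procedure concrete, here's a rough sketch of how the 14 candidate lists and the vote-count filter could be built (a sketch, assuming ``X``/``y`` are the cleaned features and the Target; thresholds and helper names are illustrative, not the exact notebook code):

```python
from collections import Counter

import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import VarianceThreshold, SelectKBest, chi2, SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

def build_feature_sets(X, y):
    """Builds the 14 candidate feature lists described above (illustrative)."""
    feature_sets = []

    # 1) Top-k features by absolute correlation with the Target (5 lists).
    corr = X.corrwith(y).abs().sort_values(ascending=False)
    for k in (10, 20, 30, 40, 50):
        feature_sets.append(set(corr.head(k).index))

    # 2) VarianceThreshold on the MinMax-scaled data, with 1% and 5% cutoffs (2 lists).
    X_scaled = pd.DataFrame(MinMaxScaler().fit_transform(X), columns=X.columns)
    for threshold in (0.01, 0.05):
        vt = VarianceThreshold(threshold=threshold).fit(X_scaled)
        feature_sets.append(set(X.columns[vt.get_support()]))

    # 3) SelectKBest with chi2, which needs non-negative inputs, hence the scaled data (5 lists).
    for k in (10, 20, 30, 40, 50):
        skb = SelectKBest(chi2, k=k).fit(X_scaled, y)
        feature_sets.append(set(X.columns[skb.get_support()]))

    # 4) SelectFromModel with LogisticRegression and DecisionTree (2 lists).
    for estimator in (LogisticRegression(max_iter=1000), DecisionTreeClassifier()):
        sfm = SelectFromModel(estimator).fit(X_scaled, y)
        feature_sets.append(set(X.columns[sfm.get_support()]))

    return feature_sets

def vote_features(feature_sets, min_votes):
    """Keeps the columns that appear in at least `min_votes` of the candidate lists."""
    votes = Counter(col for fs in feature_sets for col in fs)
    return [col for col, n in votes.items() if n >= min_votes]

# Five final sets: columns present in 3, 5, 7, 9 and 11 of the 14 lists.
final_sets = {n: vote_features(build_feature_sets(X, y), n) for n in (3, 5, 7, 9, 11)}
```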
These sets were then submitted to various scikit-learn algorithms, using the ``cross_validate`` function.
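The evaluation loop, roughly (the candidate models shown are illustrative, ``final_sets`` comes from the sketch above, and scoring is assumed to be ROC AUC, the competition metric):

```python
from sklearn.model_selection import cross_validate
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

candidates = {
    "DecisionTree": DecisionTreeClassifier(random_state=42),
    "RandomForest": RandomForestClassifier(random_state=42),
}

# Cross-validate every candidate model on every voted feature set.
for n_votes, columns in final_sets.items():
    for name, model in candidates.items():
        scores = cross_validate(model, X[columns], y, cv=5, scoring="roc_auc")
        print(f"{n_votes} votes, {name}: {scores['test_score'].mean():.5f}")
```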
The best result found was a ``DecisionTreeClassifier``, which, once submitted, reached a public score of 0.65662 and a private score of 0.6621.
In my second attempt, the goal was to study the effects of Undersampling and Oversampling on the result. After cleaning the dataset of the already known constant columns, I used the ``SelectKBest`` algorithm to generate 5 different sets of columns. Since larger sets showed better performance in the first attempt, this time I created sets of 25, 50, 75, 100 and 150 columns, and separated them into 10 datasets, one oversampled and one undersampled for each set.
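A condensed sketch of how those feature sets could be generated (the SelectKBest score function is an assumption here):

```python
from sklearn.feature_selection import SelectKBest, f_classif

# Five candidate feature sets of increasing size.
feature_sets = {}
for k in (25, 50, 75, 100, 150):
    selector = SelectKBest(f_classif, k=k).fit(X, y)
    feature_sets[k] = X.columns[selector.get_support()]
# Each of the five sets is then resampled twice (oversampled and undersampled),
# producing the 10 training datasets mentioned above.
```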
From training, a few tentative conclusions could be drawn.
- Tree algorithms (RandomForest, ExtraTrees, DecisionTree...) showed great promise in both attempts so far.
- Larger sets of columns had a positive impact in some cases, albeit a slight one.
- Undersampled datasets showed significantly worse performance in all cases.
In the end, the best results came from the RandomForest algorithm, with the 100-column set for oversampling and the 150-column set for undersampling.
Both answers were then compared and submitted: the Undersampling answer resulted in a private score of 0.68632, while the Oversampling answer resulted in a private score of 0.6884.
The third attempt aimed at discovering the impact of feature selection, Undersampling and Oversampling on the dataset. This time, the only columns removed from the dataset were the constant ones (whose values didn't change); otherwise, all columns were used. Two datasets were built from this with different approaches: one with Oversampling done with ``SMOTE``, and another with Undersampling done with ``RandomUnderSampler``.
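The preprocessing for this attempt boils down to dropping the constant columns and building the two resampled training sets. A minimal sketch, assuming the imbalanced-learn defaults:

```python
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# Drop constant columns: values that never change carry no information.
constant_cols = [col for col in X.columns if X[col].nunique() == 1]
X_clean = X.drop(columns=constant_cols)

# Oversampled dataset: synthetic minority-class samples created by SMOTE.
X_over, y_over = SMOTE(random_state=42).fit_resample(X_clean, y)

# Undersampled dataset: the majority class is randomly reduced to the minority size.
X_under, y_under = RandomUnderSampler(random_state=42).fit_resample(X_clean, y)
```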
The best results were achieved with the ``VotingClassifier``, which seeks to compensate each classifier's mistakes with classifiers of a different kind.
The results were great. The Oversampling method resulted in a private score of 0.7067, while the Undersampling method resulted in a private score of 0.73255.
These results suggest that Undersampling is seemingly the better approach, that Feature Selection isn't needed for performance, and that the VotingClassifier is promising.
Seeking to confirm these findings, the fourth attempt was an effort to create an "Ultimate ``VotingClassifier``" on an undersampled dataset with all non-constant columns. In theory, this approach, making use of classifiers optimized with ``RandomizedSearchCV``, should yield better results.
The chosen classifiers were as follows:
- A KNN Classifier
- A LogisticRegression Classifier
- An SVC classifier with an RBF kernel
- A bagging ensemble (RandomForestClassifier)
- A boosting ensemble (GradientBoostingClassifier)
- A MLPClassifier
All of them were optimized and joined in a single classifier. The end result, however, wasn't as expected: a private score of 0.72855, a downgrade from the previous result, perhaps hinting at some degree of overfitting.
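A compressed sketch of that pipeline, shown for one of the estimators (the search space and values are illustrative, not the ones actually used):

```python
from scipy.stats import loguniform
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

# Example for one estimator: tune an RBF-kernel SVC with RandomizedSearchCV.
svc_search = RandomizedSearchCV(
    SVC(kernel="rbf", probability=True, random_state=42),  # probability=True enables soft voting
    param_distributions={"C": loguniform(1e-2, 1e2), "gamma": loguniform(1e-4, 1e0)},
    n_iter=20,
    scoring="roc_auc",
    cv=5,
    random_state=42,
)
svc_search.fit(X_under, y_under)

# The other estimators (KNN, LogisticRegression, RandomForest, GradientBoosting, MLP)
# are tuned the same way, and the best versions are combined with soft voting.
ultimate = VotingClassifier(
    estimators=[("SVC", svc_search.best_estimator_)],  # ...plus the other tuned estimators
    voting="soft",
)
ultimate.fit(X_under, y_under)
```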
In the fifth attempt, I wanted to test the performance of a dedicated neural network on the data. So, I created a simple ``Keras`` classification network and later optimized its parameters with the ``KerasClassifier`` wrapper for scikit-learn from the TensorFlow library.
After only ten iterations (and 16 minutes of runtime), the result was the following network.
I used the network to predict the results, ending with a private score of 0.72331, showing that neural networks exhibit great potential for future attempts.
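To give a feel for this setup, here's a minimal sketch of a Keras binary classifier wrapped for scikit-learn-style random search. The layer sizes and search space are assumptions, not the tuned architecture, and in recent TensorFlow versions the wrapper lives in the separate scikeras package:

```python
import tensorflow as tf
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier  # scikeras.wrappers in newer TF
from sklearn.model_selection import RandomizedSearchCV

def build_model(units=128, learning_rate=1e-3):
    # X_train: the processed training matrix loaded earlier.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(units, activation="relu", input_shape=(X_train.shape[1],)),
        tf.keras.layers.Dense(units // 2, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),  # binary Target
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
        loss="binary_crossentropy",
        metrics=[tf.keras.metrics.AUC(name="auc")],
    )
    return model

wrapped = KerasClassifier(build_fn=build_model, epochs=20, batch_size=256, verbose=0)
search = RandomizedSearchCV(
    wrapped,
    param_distributions={"units": [64, 128, 256, 512], "learning_rate": [1e-2, 1e-3, 1e-4]},
    n_iter=10,  # "only ten iterations", as mentioned above
    scoring="roc_auc",
    cv=3,
)
search.fit(X_train, y_train)
```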
In the sixth attempt, I tried to experiment further with neural networks, exploring their parameters and results. Plotting the performance of the earlier model (5th attempt) alongside the results of the earlier models in the notebook, it was clear that the neural network was having major difficulty generalizing.
Because of these results, multiple approaches were tried: adjusting the learning rate and batch size, using ``BatchNormalization`` and ``Dropout``, and training on Oversampling and Undersampling datasets. All of these attempts yielded poor results.
In the end, using the ``keras_tuner`` library, the best parameters found were a learning rate of 0.001 with 512 neurons. This improved performance on the validation set, and by switching to the ``selu`` activation function later on, the results were improved once again.
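For reference, a minimal ``keras_tuner`` search over those hyperparameters could look like this (a sketch, not the exact tuner configuration used):

```python
import keras_tuner as kt
import tensorflow as tf

def build_model(hp):
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(
            hp.Choice("units", [128, 256, 512]),
            activation=hp.Choice("activation", ["relu", "selu"]),
            input_shape=(X_train.shape[1],),
        ),
        tf.keras.layers.Dropout(hp.Float("dropout", 0.0, 0.5, step=0.1)),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(hp.Choice("learning_rate", [1e-2, 1e-3, 1e-4])),
        loss="binary_crossentropy",
        metrics=[tf.keras.metrics.AUC(name="auc")],
    )
    return model

# X_val, y_val: a held-out validation split.
tuner = kt.RandomSearch(build_model, objective=kt.Objective("val_auc", direction="max"), max_trials=10)
tuner.search(X_train, y_train, validation_data=(X_val, y_val), epochs=20, batch_size=256)
best_model = tuner.get_best_models(num_models=1)[0]
```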
Even so, the results were poor. The neural network trained on the Oversampling dataset resulted in a private score of 0.68998, slightly better than the ``RandomForestClassifier`` of the second attempt. The same network, trained on the Undersampling dataset, resulted in a private score of 0.71407.
This suggests that an ensemble approach with multiple classifiers, like the ``VotingClassifier``, is likely to be the best approach.
Since the ensemble learners had the best performance so far, and the brief dive into neural networks didn't yield the same results (at least for now), I went back to the ``scikit-learn`` library, this time to test the ``VotingClassifier`` with as many different estimators as possible, as well as to try the ``StackingClassifier``.
In the end, I created and submitted 9 ``VotingClassifier`` and 4 ``StackingClassifier`` models to assess their performance. A few conclusions can be drawn:
- The ``VotingClassifier`` models showcased better performance.
- Among the ensembles using estimators with optimized hyperparameters, the best ``VotingClassifier`` was the one with 7 classifiers.
- The ensemble learners whose estimators were NOT optimized still yielded the best results.
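For comparison with the voting models below, a ``StackingClassifier`` along the lines tested here could look like the following (the base estimators and the meta-learner are assumptions):

```python
import xgboost as xgb
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

stacking = StackingClassifier(
    estimators=[
        ("Xgboost", xgb.XGBClassifier(random_state=SEED)),
        ("RandomForest", RandomForestClassifier(random_state=SEED)),
        ("KNN", KNeighborsClassifier()),
    ],
    # The meta-learner combines the base estimators' cross-validated predictions.
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
```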
In the end, I reached new records for both the public and the private score, with the following models.
New Public Record
```python
import xgboost as xgb
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# SEED is the random seed defined earlier in the notebook.
classificador = VotingClassifier(estimators = [
    ('Xgboost', xgb.XGBClassifier(random_state = SEED)),
    ('RandomForest', RandomForestClassifier(random_state = SEED)),
    ('Rede Neural', MLPClassifier(random_state = SEED, max_iter = 1000)),
    ('KNN', KNeighborsClassifier()),
    ('Regressão Logística', LogisticRegression(random_state = SEED, max_iter = 1000)),
    ('Máquina de Vetor', SVC(random_state = SEED, probability = True))  # probability=True is required for soft voting
], voting = "soft")
```
New Private Record
```python
# Same imports and SEED as above; this ensemble drops the SVC estimator.
classificador = VotingClassifier(estimators=[
    ('Xgboost', xgb.XGBClassifier(random_state = SEED)),
    ('RandomForest', RandomForestClassifier(random_state = SEED)),
    ('Rede Neural', MLPClassifier(random_state = SEED, max_iter = 1000)),
    ('KNN', KNeighborsClassifier()),
    ('Regressão Logística', LogisticRegression(random_state = SEED, max_iter = 1000)),
], voting='soft')
```
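To generate a submission, whichever record model is chosen can be fit on the resampled training data and its positive-class probabilities written out. A minimal sketch, assuming ``X_under``/``y_under`` is the processed training data, ``X_test`` the processed test features, and ``test_ids`` the test IDs (the ``ID``/``TARGET`` column names follow the competition's sample submission):

```python
import pandas as pd

classificador.fit(X_under, y_under)
probas = classificador.predict_proba(X_test)[:, 1]  # probability of the positive class (TARGET = 1)

submission = pd.DataFrame({"ID": test_ids, "TARGET": probas})
submission.to_csv("submission.csv", index=False)
```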