For this project, I will be applying machine learning techniques to predict the outcome in a given dataset. I will be using a dataset from kaggle that is composed of 6 million of money transactions. Here the goal is to oredict if a given transaction is a Fraud or not.
There is a lack of public available datasets on financial services and specially in the emerging mobile money transactions domain. Financial datasets are important to many researchers and in particular to us performing research in the domain of fraud detection. Part of the problem is the intrinsically private nature of financial transactions, that leads to no publicly available datasets.
This dataset was generated using the simulator called PaySim as an approach to such a problem. PaySim uses aggregated data from the private dataset to generate a synthetic dataset that resembles the normal operation of transactions and injects malicious behaviour to later evaluate the performance of fraud detection methods.
PaySim simulates mobile money transactions based on a sample of real transactions extracted from one month of financial logs from a mobile money service implemented in an African country. The original logs were provided by a multinational company, who is the provider of the mobile financial service which is currently running in more than 14 countries all around the world.
This is a sample of 1 row with headers explanation:
step | type | amount | nameOrig | oldbalanceOrg | newbalanceOrig | nameDest | oldbalanceDest | newbalanceDest | isFraud | isFlaggedFraud |
---|---|---|---|---|---|---|---|---|---|---|
1 | PAYMENT | 9839.64 | C1231006815 | 170136 | 160296.4 | M1979787155 | 0 | 0 | 0 | 0 |
Column | Description |
---|---|
step | maps a unit of time in the real world. In this case 1 step is 1 hour of time. Total steps 744 (30 days simulation) |
type | CASH-IN, CASH-OUT, DEBIT, PAYMENT and TRANSFER. |
amount | amount of the transaction in local currency. |
nameOrig | customer who started the transaction. |
oldbalanceOrg | initial balance before the transaction. |
newbalanceOrig | new balance after the transaction. |
nameDest | customer who is the recipient of the transaction. |
oldbalanceDest | initial balance recipient before the transaction. Note that there is not information for customers that start with M (Merchants). |
newbalanceDest | new balance recipient after the transaction. Note that there is not information for customers that start with M (Merchants). |
isFraud | This is the transactions made by the fraudulent agents inside the simulation. In this specific dataset the fraudulent behavior of the agents aims to profit by taking control or customers accounts and try to empty the funds by transferring to another account and then cashing out of the system. |
isFlaggedFraud | The business model aims to control massive transfers from one account to another and flags illegal attempts. An illegal attempt in this dataset is an attempt to transfer more than 200.000 in a single transaction. |
ABCDEFGHIJ0123456789 |
step <dbl> | type <chr> | amount <dbl> | nameOrig <chr> | oldbalanceOrg <dbl> | newbalanceOrig <dbl> | nameDest <chr> | oldbalanceDest <dbl> | newbalanceDest <dbl> | isFraud <dbl> | |
---|---|---|---|---|---|---|---|---|---|---|
1 | PAYMENT | 9839.64 | C1231006815 | 170136 | 160296.36 | M1979787155 | 0 | 0 | 0 | |
1 | PAYMENT | 1864.28 | C1666544295 | 21249 | 19384.72 | M2044282225 | 0 | 0 | 0 | |
1 | TRANSFER | 181.00 | C1305486145 | 181 | 0.00 | C553264065 | 0 | 0 | 1 | |
1 | CASH_OUT | 181.00 | C840083671 | 181 | 0.00 | C38997010 | 21182 | 0 | 1 | |
1 | PAYMENT | 11668.14 | C2048537720 | 41554 | 29885.86 | M1230701703 | 0 | 0 | 0 | |
1 | PAYMENT | 7817.71 | C90045638 | 53860 | 46042.29 | M573487274 | 0 | 0 | 0 |
Let´s look at our data distribuition when grouped by Fraud or not Fraud transactions.
data %>%
mutate(isFraud = ifelse(isFraud == 1, "Fraud", "Not Fraud")) %>%
ggplot(aes(isFraud, fill = isFraud)) +
geom_bar() +
ggtitle("Number of Fraud and not Fraud Transactions") +
theme_fivethirtyeight() +
theme(legend.position = "none",
plot.title = element_text(hjust = 0.5))
data %>%
summarize(Fraud = sum(isFraud == 1), Not_Fraud = sum(isFraud == 0))
ABCDEFGHIJ0123456789 |
Fraud <int> | Not_Fraud <int> | |||
---|---|---|---|---|
8213 | 6354407 |
As we can see there are a lot of transactions that weren´t fraud and just a few that are fraud. This can cause or model to always predict for not fraud and always get a high accuracy. To solve this we will split the data into a set of 50% fraud and 50% not fraud.
set.seed(0)
size <- dim(data[data$isFraud == 1,])[1]
temp_df_fraud <- data[data$isFraud == 1,]
temp_df_not_fraud <- data[data$isFraud == 0,][sample(seq(1, size), size),]
df <- full_join(temp_df_not_fraud, temp_df_fraud)
Joining, by = c("step", "type", "amount", "nameOrig", "oldbalanceOrg", "newbalanceOrig", "nameDest", "oldbalanceDest", "newbalanceDest", "isFraud", "isFlaggedFraud")
rm(temp_df_fraud, temp_df_not_fraud)
As we can see now we have a dataser with equal proportion of fraud and not fraud transactions.
df %>%
mutate(isFraud = ifelse(isFraud == 1, "Fraud", "Not Fraud")) %>%
ggplot(aes(isFraud, fill = isFraud)) +
geom_bar() +
ggtitle("Number of Fraud and not Fraud Transactions") +
theme_fivethirtyeight() +
theme(legend.position = "none",
plot.title = element_text(hjust = 0.5))
Now we can start constructing the prediction model.
Our goal is to predict wether a certain transaction is a fraud or not, given tha value of the predictors. First let´s drop all the non-numeric columns, as they don´t apport too much to the data.
df_num <- df %>%
select(-c(nameOrig, nameDest, type))
head(df_num)
ABCDEFGHIJ0123456789 |
step <dbl> | amount <dbl> | oldbalanceOrg <dbl> | newbalanceOrig <dbl> | oldbalanceDest <dbl> | newbalanceDest <dbl> | isFraud <dbl> | isFlaggedFraud <dbl> |
---|---|---|---|---|---|---|---|
7 | 1041.96 | 70789.0 | 69747.04 | 0.0 | 0 | 0 | 0 |
1 | 107440.87 | 140757.4 | 33316.54 | 266398.1 | 373839 | 0 | 0 |
2 | 48183.10 | 120039.9 | 71856.76 | 1412484.1 | 1791002 | 0 | 0 |
4 | 248011.57 | 0.0 | 0.00 | 9941904.2 | 12362817 | 0 | 0 |
7 | 6356.01 | 10787.0 | 4430.99 | 0.0 | 0 | 0 | 0 |
1 | 12429.59 | 16231.0 | 3801.41 | 0.0 | 0 | 0 | 0 |
Now let´s create the training and testing datasets (which are mutal exclusive). This will make our model don´t be overfitted to the trainig data and will make our predictions more accurate to the ones of the real values for any given dataset.
set.seed(0)
test_index <- createDataPartition(df_num$isFraud, times = 1, p = 0.2, list = FALSE) # first create the indexes for the test set
# select our test set
test_x <- select(df_num, -isFraud)[test_index,]
test_y <- df_num$isFraud[test_index]
# select the reamining indexes for our train set
train_x <- select(df_num, -isFraud)[-test_index,]
train_y <- df_num$isFraud[-test_index]
# change de training data as factor because we will only have to values
train_y <- as.factor(train_y)
Now, with our train and test data set. We can train a machine learning model that predicts wether a transactions is a Fraud or not given the numeric predictors.
Is a grouping method, which aims at dividing a set of n observations into k groups in which each observation belongs to the group whose mean value is closest.
predict_kmeans <- function(x, k) {
centers <- k$centers # extract cluster centers
# calculate distance to cluster centers
distances <- sapply(1:nrow(x), function(i){
apply(centers, 1, function(y) dist(rbind(x[i,], y)))
})
max.col(-t(distances)) # select cluster with min distance to center
}
set.seed(0)
k <- kmeans(train_x, centers = 2)
kmeans_preds <- ifelse(predict_kmeans(test_x, k) == 1, 0, 1)
# mean(kmeans_preds == test_y)
Results
results <- data_frame(method = "K means", accuracy = mean(kmeans_preds == test_y)) # save the results in the data frame
results %>% knitr::kable()
method | accuracy |
---|---|
K means | 0.5003043 |
GLM will keep the weighted sum of all the features, but allow non-Gaussian outcome distributions (not like a linear regression model) and connect the expected mean of this distribution and the weighted sum through a possibly nonlinear function.
set.seed(0)
# Train the model
train_glm <- train(train_x, train_y,
method = "glm")
# Predict for the test set
glm_preds <- predict(train_glm, test_x)
# mean(glm_preds == test_y)
Results
results <- bind_rows(results, # add accuracy to the df
data_frame(method="Logistic regression",
accuracy = mean(glm_preds == test_y)))
results %>% knitr::kable()
method | accuracy |
---|---|
K means | 0.5103469 |
Logistic regression | 0.9923920 |
Linear Discriminant Analysis (LDA) is a method used to find a linear combination of features that characterize or separate two or more kinds of objects or events. The resulting combination can be used as a linear classifier.
set.seed(0)
# Train the model
train_lda <- train(train_x, train_y,
method = "lda")
# Predict for the test set
lda_preds <- predict(train_lda, test_x)
# mean(lda_preds == test_y)
Results
results <- bind_rows(results, # add accuracy to the df
data_frame(method="LDA",
accuracy = mean(lda_preds == test_y)))
results %>% knitr::kable()
method | accuracy |
---|---|
K means | 0.4952055 |
Logistic regression | 0.9933049 |
LDA | 0.8889227 |
It is a method that simply searches the closest observations to the one you are trying to predict and classifies the point of interest based on most of the data that surrounds it.
set.seed(0)
# Train the model
train_knn <- train(train_x, train_y,
method = "knn",
tuneGrid = data.frame(k = seq(1.95, 2, 0.01)))
# Predict for the test set
knn_preds <- predict(train_knn, test_x)
# mean(knn_preds == test_y)
# train_knn$bestTune %>% pull()
Results
results <- bind_rows(results, # add accuracy to the df
data_frame(method="KNN",
accuracy = mean(knn_preds == test_y),
tune = train_knn$bestTune %>% pull()))
results %>% knitr::kable()
method | accuracy | tune |
---|---|---|
K means | 0.4952055 | NA |
Logistic regression | 0.9933049 | NA |
LDA | 0.8889227 | NA |
KNN | 0.9793061 | 1.99 |
The best accuracy is given a k of 1.99, as it is ilustrated in the next plot.
train_knn$results %>%
ggplot(aes(x = k, y = Accuracy)) +
geom_line(color = "blue") +
ggtitle("Accuracy of the knn model") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5))
The random forest is a classification algorithm consisting of many decisions trees. It uses bagging and feature randomness when building each individual tree to try to create an uncorrelated forest of trees whose prediction by committee is more accurate than that of any individual tree.
set.seed(0)
# Train the model
train_rf <- train(train_x, train_y,
method = "rf",
tuneGrid = data.frame(mtry = seq(5.4,5.6,0.1)),
importance = TRUE)
# Predict for the test set
rf_preds <- predict(train_rf, test_x)
# mean(rf_preds == test_y)
# train_rf$bestTune %>% pull()
Results
results <- bind_rows(results, # add accuracy to the df
data_frame(method="RF",
accuracy = mean(rf_preds == test_y),
tune = train_rf$bestTune %>% pull()))
results %>% knitr::kable()
method | accuracy | tune |
---|---|---|
K means | 0.4952055 | NA |
Logistic regression | 0.9933049 | NA |
LDA | 0.8889227 | NA |
KNN | 0.9793061 | 1.99 |
RF | 0.9975654 | 5.50 |
The best accuracy is given a mtry value of 5.5, as it is ilustrated in the next plot.
train_rf$results %>%
ggplot(aes(x = mtry, y = Accuracy)) +
geom_smooth() +
ggtitle("Accuracy of the RF model") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5))
We have created a few machine learning models with different accuracy values. But may be we can improve this further by combining all of this models into one.
For this we will use all of the models prediction and use a mayority vote to decide wich vale is the correct one. By doing this we can expect a better accuracy than the average model accuracy.
#kmeans_preds_num <- ifelse(kmeans_preds == "B", 1, 2)
Warning messages:
1: In readChar(file, size, TRUE) : truncating string with embedded nuls
2: In readChar(file, size, TRUE) : truncating string with embedded nuls
3: In readChar(file, size, TRUE) : truncating string with embedded nuls
4: In readChar(file, size, TRUE) : truncating string with embedded nuls
5: In readChar(file, size, TRUE) : truncating string with embedded nuls
models <- matrix(c(kmeans_preds, glm_preds, lda_preds, knn_preds, rf_preds), ncol = 5)
ensemble_preds <- ifelse(rowMedians(models) == 1, 0, 1)
models <- c("K means", "Logistic regression", "LDA", "K nearest neighbors", "Random forest", "Ensemble")
accuracy <- c(mean(kmeans_preds == test_y),
mean(glm_preds == test_y),
mean(lda_preds == test_y),
mean(knn_preds == test_y),
mean(rf_preds == test_y),
mean(ensemble_preds == test_y))
data.frame(Model = models, Accuracy = accuracy) %>% knitr::kable()
Model | Accuracy |
---|---|
K means | 0.5003043 |
Logistic regression | 0.9933049 |
LDA | 0.8889227 |
K nearest neighbors | 0.9793061 |
Random forest | 0.9975654 |
Ensemble | 0.9908704 |
The ensemble model got a very good accuracy, however it´s still behind the Logistic Regression and Random Forest models, as seen in the next plot.
data.frame(Model = models, Accuracy = accuracy) %>%
ggplot(aes(x = Model, y = Accuracy, size = 8 , color = Model, alpha = 0.5)) +
geom_point() +
theme_fivethirtyeight() +
ggtitle("Model comparison") +
theme(legend.position = "none",
plot.title = element_text(hjust = 0.5)) +
scale_y_continuous(trans = "logit")
We could improve it a little bit by just using the models with an accuracy above 90%.
#kmeans_preds_num <- ifelse(kmeans_preds == "B", 1, 2)
models <- matrix(c(glm_preds, knn_preds, rf_preds), ncol = 3)
ensemble_preds <- ifelse(rowMedians(models) == 1, 0, 1)
models <- c("Logistic regression", "K nearest neighbors", "Random forest", "Ensemble")
accuracy <- c(mean(glm_preds == test_y),
mean(knn_preds == test_y),
mean(rf_preds == test_y),
mean(ensemble_preds == test_y))
data.frame(Model = models, Accuracy = accuracy) %>% knitr::kable()
Model | Accuracy |
---|---|
Logistic regression | 0.9905660 |
K nearest neighbors | 0.9793061 |
Random forest | 0.9975654 |
Ensemble | 0.9960438 |
results <- bind_rows(results, # add accuracy to the df
data_frame(method="Ensemble",
accuracy = mean(ensemble_preds == test_y)))
Now the Ensemble model improve a lot, with an Accuracy of 0.996. However it´s still second to the accuracy of the Random Forest model, as seen in the next plot.
data.frame(Model = models, Accuracy = accuracy) %>%
ggplot(aes(x = Model, y = Accuracy, size = 8 , color = Model, alpha = 0.5)) +
geom_point() +
theme_fivethirtyeight() +
ggtitle("Model comparison") +
theme(legend.position = "none",
plot.title = element_text(hjust = 0.5)) +
scale_y_continuous(trans = "logit")
The best model is the Random Forest with an accuracy of 0.9975654, that´s pretty high!
At the end we got 6 different models, included one that combined the best models in the list. The highest accuracy model was the one created with the Random Forest method and the one that had the lowest accuracy was the one created with the K Means method.
results %>% knitr::kable()
method | accuracy |
---|---|
K means | 0.5003043 |
Ensemble | 0.9960438 |
Ensemble | 0.9960438 |
The models accuracy can be better comprehended with the next plot.
results %>%
ggplot(aes(x = method, y = accuracy, size = 8 , color = method, alpha = 0.5)) +
geom_point() +
theme_fivethirtyeight() +
ggtitle("Model comparison") +
theme(legend.position = "none",
plot.title = element_text(hjust = 0.5)) +
scale_y_continuous(trans = "logit")
This dataset was very clean and ready for a machine learning algorithm, however it still needed some modifications in order to make the predictions more accurate. Then, after analizing the data and using some visualization tools, a prediction model for the data was created. Altought the first approach wasn’t the best one, the ones created after were really accurate. At the end I got an accuracy of 0.9975654 for the Random Forest method, this accuracy is even better than the accuracy of all the methods combined. Therefore I think this model can be used in the real life for Fraud detection because of the very low error margin. However there’s still room for improvement, for example I could use the origin and destination data in addtion to all the predictors that I used to try and improve the algorithm, or create a Desicion Tree with all the predictors and see if the overall accuracy is better than the one that I got