diff --git a/docs/articles/AutoML24Workshop & MScThesis/01_MSc_multiclass_datasets_selection.Rmd b/docs/articles/AutoML24Workshop & MScThesis/01_MSc_multiclass_datasets_selection.Rmd deleted file mode 100644 index 5576e34..0000000 --- a/docs/articles/AutoML24Workshop & MScThesis/01_MSc_multiclass_datasets_selection.Rmd +++ /dev/null @@ -1,119 +0,0 @@ ---- -title: "MSc multiclass datasets selection.Rmd" -author: "Hubert Ruczyński" -date: "`r Sys.Date()`" -output: - html_document: - toc: yes - toc_float: yes - toc_collapsed: yes - theme: lumen - toc_depth: 3 - number_sections: yes - code_folding: hide - latex_engine: xelatex ---- - -```{css, echo=FALSE} -body .main-container { - max-width: 1820px !important; - width: 1820px !important; -} -body { - max-width: 1820px !important; - width: 1820px !important; - font-family: Helvetica !important; - font-size: 16pt !important; -} -h1,h2,h3,h4,h5,h6{ - font-size: 24pt !important; -} -``` - -# Imports - -We import the forester package to use the check_data function. - -```{r message=FALSE, warning=FALSE} -library(forester) -``` - -# Loading data - -We load the data from the RData files. - -```{r} -CC18 <- readRDS("CC18.RData") -regression <- readRDS("regression_bench.RData") -``` - -# Multiclass tasks selection - -First we check which CC18 datasets are multiclass, as the benchmark consists of both binary and multiclass classification tasks. - -```{r} -multiclass_indexes <-c() -for (i in 1:length(CC18)) { - if (length(levels(CC18[[i]]$class)) > 2) { - multiclass_indexes <- c(multiclass_indexes, i) - } -} -multiclass_indexes -multiclass_CC18 <- CC18[multiclass_indexes] -``` - -# Selecting a subset of tasks - -Later, we analyse the sizes of the multiclass datasets, and ensure that we choose representatives with different characteristics and reasonable sizes (not too big or too small). 
- -```{r} -for (i in 1:length(multiclass_CC18)) { - cat('Dataset Index:', i, '\n Name:', names(multiclass_CC18)[i], '\n Dimensionality:', dim(multiclass_CC18[[i]]), '\n') -} -``` - -Eventually we end up with the following selection of datasets: - -```{r} -small_idx <- c(2, 3, 4, 5, 7, 10, 15, 16, 17, 21) -multiclass_CC18_small <- multiclass_CC18[small_idx] -multiclass_CC18 <- multiclass_CC18_small[c(1, 4, 6, 7, 10)] -for (i in 1:length(multiclass_CC18)) { - cat('Dataset Index:', i, '\n Name:', names(multiclass_CC18)[i], '\n Dimensionality:', dim(multiclass_CC18[[i]]), '\n') -} -``` - -# Adding wine_quality task - -However, one of the datasets in the regression benchmark, `wine_quality`, is actually a multiclass task that was included there by mistake, so we also add it here. - -```{r} -multiclass_CC18[[6]] <- regression[[2]] -names(multiclass_CC18)[6] <- "wine_quality" -for (i in 1:length(multiclass_CC18)) { - cat('Dataset Index:', i, '\n Name:', names(multiclass_CC18)[i], '\n Dimensionality:', dim(multiclass_CC18[[i]]), '\n') -} -``` - -# Data check - -Additionally, we run forester's data check for all tasks. 
- -```{r warning=FALSE} -for (i in 1:length(multiclass_CC18)) { - cat('Dataset Index:', i, '\n Name:', names(multiclass_CC18)[i], '\n Dimensionality:', dim(multiclass_CC18[[i]]), '\n\n') - if (i != 6) { - check_data(multiclass_CC18[[i]], 'class') - } else { - check_data(multiclass_CC18[[i]], 'quality') - } -} -``` - - -# Saving the outcomes - -```{r} -saveRDS(multiclass_CC18, "multiclass_CC18.RData") -``` - diff --git a/docs/articles/AutoML24Workshop & MScThesis/02_MSc_altered_datasets_creation.Rmd b/docs/articles/AutoML24Workshop & MScThesis/02_MSc_altered_datasets_creation.Rmd deleted file mode 100644 index 1172fb0..0000000 --- a/docs/articles/AutoML24Workshop & MScThesis/02_MSc_altered_datasets_creation.Rmd +++ /dev/null @@ -1,138 +0,0 @@ ---- -title: "MSc_altered_datasets_creation" -author: "Hubert Ruczyński" -date: "`r Sys.Date()`" -output: - html_document: - toc: yes - toc_float: yes - toc_collapsed: yes - theme: lumen - toc_depth: 3 - number_sections: yes - code_folding: hide - latex_engine: xelatex ---- - -```{css, echo=FALSE} -body .main-container { - max-width: 1820px !important; - width: 1820px !important; -} -body { - max-width: 1820px !important; - width: 1820px !important; - font-family: Helvetica !important; - font-size: 16pt !important; -} -h1,h2,h3,h4,h5,h6{ - font-size: 24pt !important; -} -``` - -# Imports - -We import the forester package to use the check_data function. - -```{r message=FALSE, warning=FALSE} -library(forester) -``` - -# Loading data - -We load the data from the RData files. - -```{r} -MSc_binary_CC18 <- readRDS("binary_CC18.RData") -MSc_regression_bench <- readRDS("regression_bench.RData") -MSc_multiclass_CC18 <- readRDS("multiclass_CC18.RData") - -MSc_regression_bench <- MSc_regression_bench[c(1, 3, 4, 6, 7)] -``` - -# Altering function - -As the original datasets come from the benchmarks, their quality is much higher than in the case of regular ML tasks. 
Thus we alter the datasets in order to create lower-quality, more realistic examples. - -To achieve that we introduce the following changes: - -- Adding an ID column, - -- Adding static columns, - -- Duplicating existing columns, - -- Introducing missing values into 3 columns, where 5%, 10%, and 15% of observations are missing, respectively. - -```{r} -alter_df <- function(df) { - org_df_id <- 1:(ncol(df) - 1) - # Adding ID column - df$ID <- 1:nrow(df) - # Adding static columns - df$static_obvious <- rep(1, nrow(df)) - df$static_less_obvious <- c(rep('a', as.integer(nrow(df) * 0.995)), rep('b', (nrow(df) - as.integer(nrow(df) * 0.995)))) - # Duplicating existing columns - set.seed(123) - id <- sample(org_df_id, 2) - df$duplicate_1 <- df[, id[1]] - df$duplicate_2 <- df[, id[2]] - # Introducing the missing values - row_idx <- 1:nrow(df) - set.seed(234) - id <- sample(org_df_id[!org_df_id %in% id], 3) - set.seed(345) - miss_idx <- sample(row_idx, nrow(df) * 0.05) - df[miss_idx, id[1]] <- NA - set.seed(456) - miss_idx <- sample(row_idx, nrow(df) * 0.1) - df[miss_idx, id[2]] <- NA - set.seed(567) - miss_idx <- sample(row_idx, nrow(df) * 0.15) - df[miss_idx, id[3]] <- NA - - return(df) -} -``` - -# Alteration - -We alter the selected subset of datasets from all tasks. - -```{r} -MSc_binary_CC18$`credit-g-mod` <- alter_df(MSc_binary_CC18$`credit-g`) -MSc_binary_CC18$`phoneme-mod` <- alter_df(MSc_binary_CC18$phoneme) -MSc_regression_bench$`elevators-mod` <- alter_df(MSc_regression_bench$elevators) -MSc_regression_bench$`kin8nm-mod` <- alter_df(MSc_regression_bench$kin8nm) -MSc_multiclass_CC18$`satimage-mod` <- alter_df(MSc_multiclass_CC18$satimage) -MSc_multiclass_CC18$`car-mod` <- alter_df(MSc_multiclass_CC18$car) -MSc_binary_CC18 <- MSc_binary_CC18[c(1, 2, 3, 4, 5, 19, 25, 26, 36, 37)] -``` - -# Data check - -Additionally, we run forester's data check for all modified tasks. 
- -```{r} -cat('Credit-g-mod\n') -s <- check_data(MSc_binary_CC18$`credit-g-mod`, 'class') -cat('Phoneme-g-mod\n') -s <- check_data(MSc_binary_CC18$phoneme, 'Class') -cat('Elevators-mod\n') -s <- check_data(MSc_regression_bench$elevators, 'Goal') -cat('Kin8nm-mod\n') -s <- check_data(MSc_regression_bench$kin8nm, 'y') -cat('Satimage-mod\n') -s <- check_data(MSc_multiclass_CC18$satimage, 'class') -cat('Car-mod\n') -s <- check_data(MSc_multiclass_CC18$car, 'class') -``` - -# Saving altered datasets - -```{r} -MSc_binary_CC18 <- MSc_binary_CC18[c(1, 2, 3, 4, 5, 19, 25, 26, 36, 37)] -saveRDS(MSc_binary_CC18, "MSc_binary_CC18.RData") -saveRDS(MSc_regression_bench, "MSc_regression_bench.RData") -saveRDS(MSc_multiclass_CC18, "MSc_multiclass_CC18.RData") -``` diff --git a/docs/articles/AutoML24Workshop & MScThesis/03_MSc_preprocessing.Rmd b/docs/articles/AutoML24Workshop & MScThesis/03_MSc_preprocessing.Rmd deleted file mode 100644 index 8940176..0000000 --- a/docs/articles/AutoML24Workshop & MScThesis/03_MSc_preprocessing.Rmd +++ /dev/null @@ -1,491 +0,0 @@ ---- -title: "Masters Thesis forester: Preprocessing" -author: "Hubert Ruczyński" -date: "`r Sys.Date()`" -output: - html_document: - toc: yes - toc_float: yes - toc_collapsed: yes - theme: lumen - toc_depth: 3 - number_sections: yes - code_folding: hide - latex_engine: xelatex ---- - -```{css, echo=FALSE} -body .main-container { - max-width: 1820px !important; - width: 1820px !important; -} -body { - max-width: 1820px !important; - width: 1820px !important; - font-family: Helvetica !important; - font-size: 16pt !important; -} -h1,h2,h3,h4,h5,h6{ - font-size: 24pt !important; -} -``` - -# Downloads - -The necessary downloads required for the forester package to work properly, if downloaded, the user can skip this part. 
- -```{r eval = FALSE} -install.packages("devtools") -devtools::install_github("ModelOriented/forester") -devtools::install_url('https://github.com/catboost/catboost/releases/download/v1.2.2/catboost-R-Windows-1.2.2.tgz', INSTALL_opts = c("--no-multiarch", "--no-test-load")) -devtools::install_github('ricardo-bion/ggradar', dependencies = TRUE) -install.packages('tinytex') -tinytex::install_tinytex() -``` - -# Imports - -Importing the necessary libraries. - -```{r warning=FALSE, message=FALSE} -library(forester) -``` - -# Data summary - -This short analysis presents the datasets used for the experiments, describing their basic characteristic as well as issues detected with `check_data()`. - -```{r eval = FALSE} -MSc_binary_CC18 <- readRDS("MSc_binary_CC18.RData") -MSc_binary_CC18 <- MSc_binary_CC18[c(1, 2, 3, 4, 5, 19, 25, 26, 36, 37)] -MSc_multiclass_CC18 <- readRDS("MSc_multiclass_CC18.RData") -MSc_regression_bench <- readRDS("MSc_regression_bench.RData") -``` - -The results of this summary were created manually mostly based on the script `ablation_study_datasets_info.Rmd`. As seen below they are saved in the file named `data_issues_summary.csv`. - -```{r eval = FALSE} -data_summary <- read.csv('data_issues_summary.csv', sep = ';') -rmarkdown::paged_table(data_summary) -``` - -# Single experiment - -This section contains the code necessary for running the preprocessing for a single dataset. - -## Removal settings - -Firstly we define three removal patterns, as testing all possible combinations is unnecessary and incredibly time-consuming. We call those options: removal_min, removal_med, and removal_max. The first option runs a minimal preprocessing pipeline where we remove only observations that have no target value. The second one additionally removes duplicate, id-like, static, and sparse columns removing corrupted rows with too many missing values. The last option additionally includes the removal of highly correlated columns. 
- -```{r eval = FALSE} -removal_min <- list(active_modules = c( - duplicate_cols = FALSE, id_like_cols = FALSE, static_cols = FALSE, - sparse_cols = FALSE, corrupt_rows = FALSE, correlated_cols = FALSE - ), - id_names = c(''), - static_threshold = 1, - sparse_columns_threshold = 1, - sparse_rows_threshold = 1, - high_correlation_threshold = 1 - ) - -removal_med <- list(active_modules = c( - duplicate_cols = TRUE, id_like_cols = TRUE, static_cols = TRUE, - sparse_cols = TRUE, corrupt_rows = TRUE, correlated_cols = FALSE - ), - id_names = c(''), - static_threshold = 0.99, - sparse_columns_threshold = 0.3, - sparse_rows_threshold = 0.3, - high_correlation_threshold = 1 - ) - -removal_max <- list(active_modules = c( - duplicate_cols = TRUE, id_like_cols = TRUE, static_cols = TRUE, - sparse_cols = TRUE, corrupt_rows = TRUE, correlated_cols = TRUE - ), - id_names = c(''), - static_threshold = 0.99, - sparse_columns_threshold = 0.3, - sparse_rows_threshold = 0.3, - high_correlation_threshold = 0.7 - ) -``` - -## Parameters preparation - -To run the preprocessing we prepare lists and vectors of parameters that will be used in the `custom_preprocessing()` function. We want to test various scenarios where: - -- RM - tests all removal strategies, where other modules are set to the most basic options: median-other for imputation, and none for feature selection. (3) -- IMP - tests all imputation methods, where other modules are set to the most basic options: min for removals, and none for feature selection. (4) -- FS - tests all feature selection methods, where other modules are set to the most basic options: min for removals, and median-other for imputation. (4) -- RM + IMP - tests med and max removal options combined with all imputation methods. (8) -- RM + FS - tests med and max removal options combined with all feature selection methods. (8) -- IMP + FS - tests all imputation methods with MI and Boruta feature selection. 
(8) -- RM + IMP + FS - tests med and max removal options combined with median-frequency, median-other and KNN imputation, and MI and Boruta feature selection. (12) - -Finally we end up with 38 different sets preprocessing for a single dataset. - -```{r eval = FALSE} -removals <- list(removal_min, removal_med, removal_max, # RM - removal_min, removal_min, removal_min, # Imp - removal_min, removal_min, removal_min, removal_min, # FS - removal_med, removal_med, removal_med, # RM + Imp - removal_max, removal_max, removal_max, # RM + Imp - removal_med, removal_med, removal_med, removal_med, # RM + FS - removal_max, removal_max, removal_max, removal_max, # RM + FS - removal_min, removal_min, removal_min, # Imp + FS - removal_min, removal_min, removal_min, # Imp + FS - removal_med, removal_med, removal_med, removal_med, # RM + Imp + FS - removal_max, removal_max, removal_max, removal_max) # RM + Imp + FS - -imp_method <- c('median-other', 'median-other', 'median-other', # RM - 'median-frequency', 'knn', 'mice', # Imp - 'median-other', 'median-other', 'median-other', 'median-other', # FS - 'median-frequency', 'knn', 'mice', # RM + Imp - 'median-frequency', 'knn', 'mice', # RM + Imp - 'median-other', 'median-other', 'median-other', 'median-other', # RM + FS - 'median-other', 'median-other', 'median-other', 'median-other', # RM + FS - 'median-frequency', 'knn', 'mice', # Imp + FS - 'median-frequency', 'knn', 'mice', # Imp + FS - 'median-frequency', 'knn', 'median-frequency', 'knn', # RM + Imp + FS - 'median-frequency', 'knn', 'median-frequency', 'knn') # RM + Imp + FS - -fs_method <- c('none', 'none', 'none', # RM - 'none', 'none', 'none', # Imp - 'VI', 'MCFS', 'MI', 'BORUTA', # FS - 'none', 'none', 'none', 'none', # RM + Imp - 'none', 'none', 'none', 'none', # RM + Imp - 'VI', 'MCFS', 'MI', 'BORUTA', # RM + FS - 'VI', 'MCFS', 'MI', 'BORUTA', # RM + FS - 'MI', 'MI', 'MI', # Imp + FS - 'BORUTA', 'BORUTA', 'BORUTA', # Imp + FS - 'MI', 'MI', 'BORUTA', 'BORUTA', # RM + Imp + FS 
- 'MI', 'MI', 'BORUTA', 'BORUTA') # RM + Imp + FS - -rmv_names <- c('removal_min', 'removal_med', 'removal_max', # RM - 'removal_min', 'removal_min', 'removal_min', # Imp - 'removal_min', 'removal_min', 'removal_min', 'removal_min', # FS - 'removal_med', 'removal_med', 'removal_med', # RM + Imp - 'removal_max', 'removal_max', 'removal_max', # RM + Imp - 'removal_med', 'removal_med', 'removal_med', 'removal_med', # RM + FS - 'removal_max', 'removal_max', 'removal_max', 'removal_max', # RM + FS - 'removal_min', 'removal_min', 'removal_min', # Imp + FS - 'removal_min', 'removal_min', 'removal_min', # Imp + FS - 'removal_med', 'removal_med', 'removal_med', 'removal_med', # RM + Imp + FS - 'removal_max', 'removal_max', 'removal_max', 'removal_max') # RM + Imp + FS -``` - -## Experiment function - -This is the main experiment function working for a singular dataset. To make it work we need to provide the data, specify the target, provide a string name of the dataset, include previously defined parameter settings (`removals`, `imp_method`, `fs_method`, `rmv_names`), specify if the dataset is binary classification, and choose MI method (in the end it was `estevez` for binary classification, and `peng` for regression). The function can be run in silent or informative mode. - -The function prepares particular datasets one by one, and the results are saved after into the sub-directories in order to enable stopping and resuming whole process. It creates a folder `preprocessing_data` in the working directory where the results are held. Inside we will find a sub-directory for each dataset where we hold 38 RData files with preprocessed data, as well as 2 helper RData files with `list_times`, and `names_times`. When the process is finished we will also obtain another RData file with list of all preprocessed versions of the dataset in the main `preprocessing_data` directory. Full analysis consists of 25 sub-directories and final RData files. 
- -```{r eval = FALSE} -preprocessing_experiment <- function(data, y, removals, imp_method, fs_method, - dataset_name, rmv_names, task = 'binary', verbose = TRUE, - mi_method = 'estevez') { - list_times <- c() - names_times <- c() - prep_data <- list() - dir.create('MSc_preprocessing_data', showWarnings = FALSE) - - if (task == 'binary') { - dir.create(paste0('MSc_preprocessing_data/binary_exp_', dataset_name), showWarnings = FALSE) - tryCatch({ - suppressWarnings(list_times <- readRDS(paste0(getwd(), '/MSc_preprocessing_data/binary_exp_', - dataset_name, '/list_times.RData'))) - suppressWarnings(names_times <- readRDS(paste0(getwd(), '/MSc_preprocessing_data/binary_exp_', - dataset_name, '/names_times.RData'))) - print('Loaded times list.') - }, error = function(cond) { - print('No times list.') - }) - } else if (task == 'regression') { - dir.create(paste0('MSc_preprocessing_data/regression_exp_', dataset_name), showWarnings = FALSE) - tryCatch({ - suppressWarnings(list_times <- readRDS(paste0(getwd(), '/MSc_preprocessing_data/regression_exp_', - dataset_name, '/list_times.RData'))) - suppressWarnings(names_times <- readRDS(paste0(getwd(), '/MSc_preprocessing_data/regression_exp_', - dataset_name, '/names_times.RData'))) - print('Loaded times list.') - }, error = function(cond) { - print('No times list.') - }) - } else if (task == 'multiclass') { - dir.create(paste0('MSc_preprocessing_data/multiclass_exp_', dataset_name), showWarnings = FALSE) - tryCatch({ - suppressWarnings(list_times <- readRDS(paste0(getwd(), '/MSc_preprocessing_data/multiclass_exp_', - dataset_name, '/list_times.RData'))) - suppressWarnings(names_times <- readRDS(paste0(getwd(), '/MSc_preprocessing_data/multiclass_exp_', - dataset_name, '/names_times.RData'))) - print('Loaded times list.') - }, error = function(cond) { - print('No times list.') - }) - } - - for (i in 1:38) { - verbose_cat('\n Iteration:', i, 'Removal:', rmv_names[i], 'Imputation:', - imp_method[i], 'Feature Selection:', 
fs_method[i], '\n', verbose = verbose) - - - if (!file.exists(paste0('MSc_preprocessing_data/binary_exp_', dataset_name, '/', i, '.RData')) && - !file.exists(paste0('MSc_preprocessing_data/regression_exp_', dataset_name, '/', i, '.RData')) && - !file.exists(paste0('MSc_preprocessing_data/multiclass_exp_', dataset_name, '/', i, '.RData'))) { - start <- as.numeric(Sys.time()) - names_times[i] <- paste0('Training: ', dataset_name, ' Removal: ', rmv_names[i], - ' Imputation: ', imp_method[i], ' Feature Selection: ', fs_method[i]) - script_wd <- getwd() - prep <- custom_preprocessing(data = data, - y = y, - na_indicators = c(''), - removal_parameters = removals[[i]], - imputation_parameters = list( - imputation_method = imp_method[[i]], - k = 10, - m = 5 - ), - feature_selection_parameters = list( - feature_selection_method = fs_method[[i]], - max_features = 'default', - nperm = 1, - cutoffPermutations = 20, - threadsNumber = NULL, - method = mi_method - ), - verbose = TRUE) - setwd(script_wd) - stop <- as.numeric(Sys.time()) - list_times[i] <- round(stop - start, 1) - prep_data[[i]] <- prep - - if (task == 'binary') { - saveRDS(prep, paste0(getwd(), '/MSc_preprocessing_data/binary_exp_', dataset_name, '/', i, '.RData')) - saveRDS(list_times, paste0(getwd(), '/MSc_preprocessing_data/binary_exp_', dataset_name, '/list_times.RData')) - saveRDS(names_times, paste0(getwd(), '/MSc_preprocessing_data/binary_exp_', dataset_name, '/names_times.RData')) - } else if (task == 'regression') { - saveRDS(prep, paste0(getwd(), '/MSc_preprocessing_data/regression_exp_', dataset_name, '/', i, '.RData')) - saveRDS(list_times, paste0(getwd(), '/MSc_preprocessing_data/regression_exp_', dataset_name, '/list_times.RData')) - saveRDS(names_times, paste0(getwd(), '/MSc_preprocessing_data/regression_exp_', dataset_name, '/names_times.RData')) - } else if (task == 'multiclass') { - saveRDS(prep, paste0(getwd(), '/MSc_preprocessing_data/multiclass_exp_', dataset_name, '/', i, '.RData')) - 
saveRDS(list_times, paste0(getwd(), '/MSc_preprocessing_data/multiclass_exp_', dataset_name, '/list_times.RData')) - saveRDS(names_times, paste0(getwd(), '/MSc_preprocessing_data/multiclass_exp_', dataset_name, '/names_times.RData')) - } - } else if (task == 'binary') { - prep_data[[i]] <- readRDS(paste0('MSc_preprocessing_data/binary_exp_', dataset_name, '/', i, '.RData')) - } else if (task == 'regression') { - prep_data[[i]] <- readRDS(paste0('MSc_preprocessing_data/regression_exp_', dataset_name, '/', i, '.RData')) - } else if (task == 'multiclass') { - prep_data[[i]] <- readRDS(paste0('MSc_preprocessing_data/multiclass_exp_', dataset_name, '/', i, '.RData')) - } - - } - names(prep_data) <- names_times - time_df <- data.frame(name = names_times, duration = list_times) - - outcome <- list( - time_df = time_df, - prep_data = prep_data - ) - - if (task == 'binary') { - saveRDS(outcome, paste0('MSc_preprocessing_data/binary_exp_', dataset_name, '.RData')) - } else if (task == 'regression') { - saveRDS(outcome, paste0('MSc_preprocessing_data/regression_exp_', dataset_name, '.RData')) - } else if (task == 'multiclass') { - saveRDS(outcome, paste0('MSc_preprocessing_data/multiclass_exp_', dataset_name, '.RData')) - } -} -``` - -# All datasets experiments - -In this section we present the function conducting multiple experiments for different datasets. - -## Parameters preparation - -Firstly we prepare lists of additional parameters describing the datasets. 
- -```{r eval = FALSE} -targets_binary <- c('class', 'Class', 'class', 'class', 'class', 'Class', 'Class', 'Class', 'class', 'Class') -targets_regression <- c('rej', 'quality', 'y', 'y', 'foo', 'y', 'Goal', 'Goal', 'y') -targets_multiclass <- c('class', 'class', 'class', 'class', 'class', 'class', 'class', 'class', 'class') -names_binary <- c('kr-vs-kp', 'breast-w', 'credit-approval', 'credit-g', 'diabetes', - 'phoneme', 'banknote-authentication', 'blood-transfusion-service-center', - 'credit-g-mod', 'phoneme-mod') -names_regression <- c('bank32nh', 'wine_quality', 'Mercedes_Benz_Greener_Manufacturing', - 'kin8nm', 'pol', '2dplanes', 'elevators', 'elevators-mod', 'kin8nm-mod') -names_multiclass <- c("balance-scale", "mfeat-karhunen", "mfeat-zernike", "satimage", "car", "segment", "dna", "satimage-mod", "car-mod" ) -``` - -## Multiple experiments function - -The function is similar to the previous one, and it helps with executing the preprocessing of multiple datasets at once. - -```{r eval = FALSE} -multiple_experiments <- function(datasets, targets, removals, imp_method, fs_method, - dataset_names, rmv_names, task = 'binary', verbose = 'part', - mi_method = 'estevez') { - - if (verbose == 'part') { - text_verbose <- TRUE - exp_verbose <- FALSE - } else if (verbose == 'all') { - text_verbose <- TRUE - exp_verbose <- TRUE - } else if (verbose == 'none') { - text_verbose <- FALSE - exp_verbose <- FALSE - } - - for (i in 1:length(targets)) { - if (!file.exists(paste0('MSc_preprocessing_data/binary_exp_', dataset_names[i], '.RData')) && - !file.exists(paste0('MSc_preprocessing_data/regression_exp_', dataset_names[i], '.RData')) && - !file.exists(paste0('MSc_preprocessing_data/multiclass_exp_', dataset_names[i], '/', i, '.RData'))) { - - verbose_cat('The file for the', i, 'dataset, called', dataset_names[i], - 'does not exist, proceeding with preprocessing.\n', verbose = text_verbose) - - preprocessing_experiment(data = datasets[[i]], - y = targets[i], - removals = 
removals, - imp_method = imp_method, - fs_method = fs_method, - dataset_name = dataset_names[i], - rmv_names = rmv_names, - task = task, - verbose = exp_verbose, - mi_method = mi_method) - } else { - verbose_cat('The file for the', i, 'dataset, called', dataset_names[i], - 'exists, skipping the preprocessing.\n', verbose = text_verbose) - } - } -} -``` - -# Preprocessing execution - -The code executing the preprocessing for both binary classification and regression tasks. - -## Binary classification - -```{r eval = FALSE} -idx <- 1:10 -multiple_experiments(datasets = MSc_binary_CC18[idx], - targets = targets_binary[idx], - removals = removals, - imp_method = imp_method, - fs_method = fs_method, - dataset_names = names_binary[idx], - rmv_names = rmv_names, - task = 'binary', - verbose = 'part', - mi_method = 'estevez') -``` - -## Regression - -```{r eval = FALSE} -idx <- 1:9 -multiple_experiments(datasets = MSc_regression_bench[idx], - targets = targets_regression[idx], - removals = removals, - imp_method = imp_method, - fs_method = fs_method, - dataset_names = names_regression[idx], - rmv_names = rmv_names, - task = 'regression', - verbose = 'part', - mi_method = 'peng') -``` - -## Multiclass - -```{r eval = FALSE} -idx <- 1:9 -multiple_experiments(datasets = MSc_multiclass_CC18[idx], - targets = targets_multiclass[idx], - removals = removals, - imp_method = imp_method, - fs_method = fs_method, - dataset_names = names_multiclass[idx], - rmv_names = rmv_names, - task = 'multiclass', - verbose = 'part', - mi_method = 'peng') -``` - -# Short time analysis - -```{r} -times <- list() -full_times <- c() -binary_names <- paste0('binary_exp_', names_binary) -regression_names <- paste0('regression_exp_', names_regression) -multiclass_names <- paste0('multiclass_exp_', names_multiclass) -all_names <- c(binary_names, regression_names, multiclass_names) -for (i in 1:25) { - times[[i]] <- readRDS(paste0(getwd(), '/MSc_preprocessing_data/', all_names[i], '/list_times.RData')) - 
full_times <- c(full_times, sum(times[[i]])) -} - - -``` - -```{r} -cat('Full preprocessing times in seconds:\n', full_times, '\n\n') -cat('Full preprocessing times in minutes:\n', round(full_times/60, 2), '\n\n') -cat('Full preprocessing times in hours:\n', round(full_times/3600, 2), '\n\n') -``` - -```{r} -files <- list.files('MSc_preprocessing_data', pattern = 'RData') -data <- list() -for (i in 1:length(files)) { - data[[i]] <- readRDS(paste0('MSc_preprocessing_data/', files[i])) -} -``` - -```{r} -dataset <- c() -removal <- c() -imputation <- c() -feature_selection <- c() -duration <- c() -task_type <- c() -for (i in 1:25) { - for (j in 1:38) { - stringsplt <- strsplit(data[[i]]$time_df$name[[j]], ':')[[1]] - dataset <- c(dataset, substr(stringsplt[2], 2, nchar(stringsplt[2]) - 7)) - removal <- c(removal, substr(stringsplt[3], 1, nchar(stringsplt[3]) - 10)) - imputation <- c(imputation, substr(stringsplt[4], 1, nchar(stringsplt[4]) - 17)) - feature_selection <- c(feature_selection, stringsplt[5]) - duration <- c(duration, data[[i]]$time_df$duration[j]) - if (i <= 10) { - task_type <- c(task_type, 'binary') - } else if (i > 18) { - task_type <- c(task_type, 'regression') - } else { - task_type <- c(task_type, 'multiclass') - } - } -} -duration_df <- data.frame(Dataset = dataset, Removal = removal, Imputation = imputation, - Feature_selection = feature_selection, Task_type = task_type, Duration = duration) -rmarkdown::paged_table(duration_df) -``` -```{r} -sum(duration_df$Duration[duration_df$Task_type == 'binary'])/3600 -sum(duration_df$Duration[duration_df$Task_type == 'regression'])/3600 -sum(duration_df$Duration[duration_df$Task_type == 'multiclass'])/3600 -``` - -```{r} -dir.create('MSc_processed_results', showWarnings = FALSE) -#saveRDS(duration_df, 'MSc_processed_results/preprocessing_duration.RData') -``` - diff --git a/docs/articles/AutoML24Workshop & MScThesis/04_MSc_training.Rmd b/docs/articles/AutoML24Workshop & MScThesis/04_MSc_training.Rmd deleted 
file mode 100644 index 5ebce10..0000000 --- a/docs/articles/AutoML24Workshop & MScThesis/04_MSc_training.Rmd +++ /dev/null @@ -1,324 +0,0 @@ ---- -title: "Masters Thesis forester: Model training" -author: "Hubert Ruczyński" -date: "`r Sys.Date()`" -output: - html_document: - toc: yes - toc_float: yes - toc_collapsed: yes - theme: lumen - toc_depth: 3 - number_sections: yes - code_folding: hide - latex_engine: xelatex ---- - -```{css, echo=FALSE} -body .main-container { - max-width: 1820px !important; - width: 1820px !important; -} -body { - max-width: 1820px !important; - width: 1820px !important; - font-family: Helvetica !important; - font-size: 16pt !important; -} -h1,h2,h3,h4,h5,h6{ - font-size: 24pt !important; -} -``` - -# Downloads - -The necessary downloads required for the forester package to work properly, if downloaded, the user can skip this part. - -```{r eval = FALSE} -install.packages("devtools") -devtools::install_github("ModelOriented/forester") -devtools::install_github('catboost/catboost', subdir = 'catboost/R-package') -devtools::install_github('ricardo-bion/ggradar', dependencies = TRUE) -install.packages('tinytex') -install.packages('RhpcBLASctl') -tinytex::install_tinytex() -``` - -# Imports - -Importing the necessary libraries. - -```{r warning=FALSE, message=FALSE} -library(forester) -``` - -```{r} -setwd('/net/ascratch/people/plghubertruczynski/forester') -getwd() -``` - - -# Import of outcomes - -At this step we import the outcomes obtained by the `ablation_study_preprocessing` . - -```{r} -files <- list.files('MSc_preprocessing_data', pattern = 'RData') -data <- list() -for (i in 1:length(files)) { - data[[i]] <- readRDS(paste0('MSc_preprocessing_data/', files[i])) -} -``` - -# Training - -## Parameters preparation - -We copy the parameters vectors from the `ablation_study_preprocessing` to use them for naming new outcomes. 
- -```{r} -imp_method <- c('median-other', 'median-other', 'median-other', # RM - 'median-frequency', 'knn', 'mice', # Imp - 'median-other', 'median-other', 'median-other', 'median-other', # FS - 'median-frequency', 'knn', 'mice', # RM + Imp - 'median-frequency', 'knn', 'mice', # RM + Imp - 'median-other', 'median-other', 'median-other', 'median-other', # RM + FS - 'median-other', 'median-other', 'median-other', 'median-other', # RM + FS - 'median-frequency', 'knn', 'mice', # Imp + FS - 'median-frequency', 'knn', 'mice', # Imp + FS - 'median-frequency', 'knn', 'median-frequency', 'knn', # RM + Imp + FS - 'median-frequency', 'knn', 'median-frequency', 'knn') # RM + Imp + FS - -fs_method <- c('none', 'none', 'none', # RM - 'none', 'none', 'none', # Imp - 'VI', 'MCFS', 'MI', 'BORUTA', # FS - 'none', 'none', 'none', 'none', # RM + Imp - 'none', 'none', 'none', 'none', # RM + Imp - 'VI', 'MCFS', 'MI', 'BORUTA', # RM + FS - 'VI', 'MCFS', 'MI', 'BORUTA', # RM + FS - 'MI', 'MI', 'MI', # Imp + FS - 'BORUTA', 'BORUTA', 'BORUTA', # Imp + FS - 'MI', 'MI', 'BORUTA', 'BORUTA', # RM + Imp + FS - 'MI', 'MI', 'BORUTA', 'BORUTA') # RM + Imp + FS - -rmv_names <- c('removal_min', 'removal_med', 'removal_max', # RM - 'removal_min', 'removal_min', 'removal_min', # Imp - 'removal_min', 'removal_min', 'removal_min', 'removal_min', # FS - 'removal_med', 'removal_med', 'removal_med', # RM + Imp - 'removal_max', 'removal_max', 'removal_max', # RM + Imp - 'removal_med', 'removal_med', 'removal_med', 'removal_med', # RM + FS - 'removal_max', 'removal_max', 'removal_max', 'removal_max', # RM + FS - 'removal_min', 'removal_min', 'removal_min', # Imp + FS - 'removal_min', 'removal_min', 'removal_min', # Imp + FS - 'removal_med', 'removal_med', 'removal_med', 'removal_med', # RM + Imp + FS - 'removal_max', 'removal_max', 'removal_max', 'removal_max') # RM + Imp + FS -``` - -## Training function for a single dataset - -This function performs a training for a single major dataset like 
`banknote-authentication`, so it results in 38 training runs, one for each preprocessed dataset. The function's parameters are similar to those of the preprocessing function in the `ablation_study_preprocessing` script, and it follows the same saving pattern. - -```{r} -single_dataset_training <- function(data, y, imp_method, fs_method, rmv_names, dataset_name, task = 'binary', verbose = 'part') { - list_times <- c() - names_times <- c() - out_data <- list() - - if (verbose == 'part') { - text_verbose <- TRUE - exp_verbose <- FALSE - } else if (verbose == 'all') { - text_verbose <- TRUE - exp_verbose <- TRUE - } else if (verbose == 'none') { - text_verbose <- FALSE - exp_verbose <- FALSE - } - - # Create the directory for the training results of the ablation study; if it exists, - # nothing happens. - dir.create('MSc_results', showWarnings = FALSE) - - # Create subdirectories for separate tasks, and attempt to read the lists of - # durations spent on the training. If an error arises, it means we have no proper - # files, thus we create them from scratch. 
- if (task == 'binary') { - dir.create(paste0('MSc_results/binary_', dataset_name), showWarnings = FALSE) - tryCatch({ - suppressWarnings(list_times <- readRDS(paste0(getwd(), '/MSc_results/binary_', dataset_name, '/list_times.RData'))) - suppressWarnings(names_times <- readRDS(paste0(getwd(), '/MSc_results/binary_', dataset_name, '/names_times.RData'))) - print('Loaded times list.') - }, error = function(cond) { - print('No times list.') - }) - } else if (task == 'regression') { - dir.create(paste0('MSc_results/regression_', dataset_name), showWarnings = FALSE) - tryCatch({ - suppressWarnings(list_times <- readRDS(paste0(getwd(), '/MSc_results/regression_', dataset_name, '/list_times.RData'))) - suppressWarnings(names_times <- readRDS(paste0(getwd(), '/MSc_results/regression_', dataset_name, '/names_times.RData'))) - print('Loaded times list.') - }, error = function(cond) { - print('No times list.') - }) - } else if (task == 'multiclass') { - dir.create(paste0('MSc_results/multiclass_', dataset_name), showWarnings = FALSE) - tryCatch({ - suppressWarnings(list_times <- readRDS(paste0(getwd(), '/MSc_results/multiclass_', dataset_name, '/list_times.RData'))) - suppressWarnings(names_times <- readRDS(paste0(getwd(), '/MSc_results/multiclass_', dataset_name, '/names_times.RData'))) - print('Loaded times list.') - }, error = function(cond) { - print('No times list.') - }) - } - # Iterate through the differently prepared datasets and train the forester on them. - for (i in 1:38) { - verbose_cat('\n Iteration:', i, 'Removal:', rmv_names[i], 'Imputation:', - imp_method[i], 'Feature Selection:', fs_method[i], '\n', verbose = text_verbose) - - # We train new models only if we don't have an outcome for the provided dataset.
- if (!file.exists(paste0('MSc_results/binary_', dataset_name, '/', i, '.RData')) && - !file.exists(paste0('MSc_results/regression_', dataset_name, '/', i, '.RData')) && - !file.exists(paste0('MSc_results/multiclass_', dataset_name, '/', i, '.RData'))) { - # Calculate start and stop times for each training. - start <- as.numeric(Sys.time()) - names_times[i] <- paste0('Training: ', dataset_name, ' Removal: ', rmv_names[i], - ' Imputation: ', imp_method[i], ' Feature Selection: ', fs_method[i]) - script_wd <- getwd() - outcomes <- train(data = data$prep_data[[i]]$data, - y = y, - engine = c('ranger', 'xgboost', 'decision_tree', 'lightgbm', 'catboost'), - verbose = exp_verbose, - check_correlation = FALSE, - train_test_split = c(0.6, 0.2, 0.2), - split_seed = 123, - bayes_iter = 0, - random_evals = 20, - parallel = FALSE, - custom_preprocessing = data$prep_data[[i]]) - - setwd(script_wd) - stop <- as.numeric(Sys.time()) - list_times[i] <- round(stop - start, 1) - - # Save new outcomes as a new file and re-save both list_times and names_times.
- if (task == 'binary') { - saveRDS(outcomes, paste0(getwd(), '/MSc_results/binary_', dataset_name, '/', i, '.RData')) - saveRDS(list_times, paste0(getwd(), '/MSc_results/binary_', dataset_name, '/list_times.RData')) - saveRDS(names_times, paste0(getwd(), '/MSc_results/binary_', dataset_name, '/names_times.RData')) - } else if (task == 'regression'){ - saveRDS(outcomes, paste0(getwd(), '/MSc_results/regression_', dataset_name, '/', i, '.RData')) - saveRDS(list_times, paste0(getwd(), '/MSc_results/regression_', dataset_name, '/list_times.RData')) - saveRDS(names_times, paste0(getwd(), '/MSc_results/regression_', dataset_name, '/names_times.RData')) - } else if (task == 'multiclass') { - saveRDS(outcomes, paste0(getwd(), '/MSc_results/multiclass_', dataset_name, '/', i, '.RData')) - saveRDS(list_times, paste0(getwd(), '/MSc_results/multiclass_', dataset_name, '/list_times.RData')) - saveRDS(names_times, paste0(getwd(), '/MSc_results/multiclass_', dataset_name, '/names_times.RData')) - } - } - } -} - -``` - -# All datasets training - -In this section we present the function conducting multiple training for different datasets. - -## Parameters preparation - -Firstly we prepare lists of additional parameters describing the datasets. 
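These lists are used as parallel vectors: element `i` of one describes the same dataset (or preprocessing variant) as element `i` of the others. A length check is a cheap safeguard against silent misalignment. The sketch below uses a hypothetical helper, `check_parallel_lengths`, which is not part of forester, and toy values rather than the real parameters.

```r
# Hypothetical helper (not part of forester): stop early when vectors that
# are meant to be indexed in parallel differ in length.
check_parallel_lengths <- function(...) {
  vectors <- list(...)
  lens <- vapply(vectors, length, integer(1))
  if (length(unique(lens)) > 1) {
    stop('Parallel vectors differ in length: ',
         paste(names(vectors), lens, sep = ' = ', collapse = ', '))
  }
  invisible(TRUE)
}

# Toy values for illustration only.
toy_targets <- c('Class', 'class', 'y')
toy_names   <- c('dataset-a', 'dataset-b', 'dataset-c')
check_parallel_lengths(targets = toy_targets, names = toy_names)
```

The same call could be applied to the method-description vectors before a long training run, so that a mismatch surfaces immediately instead of in the saved labels.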
- -```{r} -targets_binary <- c('Class', 'Class', 'Class', 'class', 'class', 'class', 'class', 'class', 'Class', 'Class') -targets_regression <- c('y', 'rej', 'Goal', 'Goal', 'y', 'y', 'y') -targets_multiclass <- c('class', 'class', 'class', 'class', 'class', 'class', 'class', 'quality') - -names_binary <- c('banknote-authentication', 'blood-transfusion-service-center', - 'breast-w', 'credit-approval', 'credit-g-mod', 'credit-g', 'diabetes', - 'kr-vs-kp', 'phoneme-mod', 'phoneme') -names_regression <- c('2dplanes', 'bank32nh', 'elevators-mod', 'elevators', 'kin8nm-mod', 'kin8nm', - 'Mercedes_Benz_Greener_Manufacturing') -names_multiclass <- c("balance-scale", "car-mod", "car", "dna", "mfeat-karhunen", "satimage-mod", "satimage", 'wine_quality') -``` - -## Multiple training - -The function is similar to the previous one, and it helps with executing the training of multiple datasets at once. - -```{r} -multiple_training <- function(data, targets, imp_method, fs_method, rmv_names, dataset_names, task = 'binary', verbose = 'part') { - - if (verbose == 'part') { - text_verbose <- TRUE - } else if (verbose == 'all') { - text_verbose <- TRUE - } else if (verbose == 'none') { - text_verbose <- FALSE - } - - for (i in 1:length(data)) { - if (!file.exists(paste0('MSc_results/binary_', dataset_names[i], '.RData')) && - !file.exists(paste0('MSc_results/regression_', dataset_names[i], '.RData')) && - !file.exists(paste0('MSc_results/multiclass_', dataset_names[i], '.RData'))) { - - verbose_cat('The results file for the', i, 'dataset, called', dataset_names[i], - 'does not exist, proceeding with training.\n', verbose = text_verbose) - - single_dataset_training(data = data[[i]], - y = targets[i], - imp_method = imp_method, - fs_method = fs_method, - rmv_names = rmv_names, - dataset_name = dataset_names[i], - task = task, - verbose = verbose) - } else { - verbose_cat('The results file for the', i, 'dataset, called', dataset_names[i], - 'exists, skipping the training.\n',
verbose = text_verbose) - } - } -} -``` - -# Training execution - -The code executing the training for binary classification, multiclass classification, and regression tasks. - -## Binary - -```{r eval = FALSE} -multiple_training(data = data[1:10], - targets = targets_binary, - imp_method = imp_method, - fs_method = fs_method, - rmv_names = rmv_names, - dataset_names = names_binary, - task = 'binary', - verbose = 'part') -``` -## Multiclass - -```{r eval = FALSE} -multiple_training(data = data[11:18], - targets = targets_multiclass, - imp_method = imp_method, - fs_method = fs_method, - rmv_names = rmv_names, - dataset_names = names_multiclass, - task = 'multiclass', - verbose = 'part') -``` - -## Regression - -```{r eval = FALSE} -multiple_training(data = data[19:25], - targets = targets_regression, - imp_method = imp_method, - fs_method = fs_method, - rmv_names = rmv_names, - dataset_names = names_regression, - task = 'regression', - verbose = 'part') -``` - diff --git a/docs/articles/AutoML24Workshop & MScThesis/05_MSc_results_preparation.Rmd b/docs/articles/AutoML24Workshop & MScThesis/05_MSc_results_preparation.Rmd deleted file mode 100644 index d06f93b..0000000 --- a/docs/articles/AutoML24Workshop & MScThesis/05_MSc_results_preparation.Rmd +++ /dev/null @@ -1,324 +0,0 @@ ---- -title: "Masters Thesis forester: Results preparation" -author: "Hubert Ruczyński" -date: "`r Sys.Date()`" -output: - html_document: - toc: yes - toc_float: yes - toc_collapsed: yes - theme: lumen - toc_depth: 3 - number_sections: yes - code_folding: hide - latex_engine: xelatex ---- - -```{css, echo=FALSE} -body .main-container { - max-width: 1820px !important; - width: 1820px !important; -} -body { - max-width: 1820px !important; - width: 1820px !important; - font-family: Helvetica !important; - font-size: 16pt !important; -} -h1,h2,h3,h4,h5,h6{ - font-size: 24pt !important; -} -``` - -# Preprocessing data preparation - -In this notebook we will process the data from training and preprocessing in order to
enable easier analysis of the results, and reduction of data size. - -```{r} -setwd('/net/ascratch/people/plghubertruczynski/forester') -getwd() -``` - -```{r warning=FALSE} -files <- list.files('MSc_preprocessing_data', pattern = 'RData') -data <- list() -for (i in 1:length(files)) { - data[[i]] <- readRDS(paste0('MSc_preprocessing_data/', files[i])) -} -``` - -```{r eval=FALSE} -dataset <- c() -removal <- c() -imputation <- c() -feature_selection <- c() -duration <- c() -task_type <- c() -for (i in 1:25) { - for (j in 1:38) { - stringsplt <- strsplit(data[[i]]$time_df$name[[j]], ':')[[1]] - dataset <- c(dataset, substr(stringsplt[2], 2, nchar(stringsplt[2]) - 7)) - removal <- c(removal, substr(stringsplt[3], 1, nchar(stringsplt[3]) - 10)) - imputation <- c(imputation, substr(stringsplt[4], 1, nchar(stringsplt[4]) - 17)) - feature_selection <- c(feature_selection, stringsplt[5]) - duration <- c(duration, data[[i]]$time_df$duration[j]) - if (i <= 10) { - task_type <- c(task_type, 'binary') - } else if (i > 18) { - task_type <- c(task_type, 'regression') - } else { - task_type <- c(task_type, 'multiclass') - } - } -} -duration_df <- data.frame(Dataset = dataset, Removal = removal, Imputation = imputation, - Feature_selection = feature_selection, Task_type = task_type, Duration = duration) -rmarkdown::paged_table(duration_df) -``` - -```{r eval=FALSE} -dir.create('MSc_processed_results', showWarnings = FALSE) -saveRDS(duration_df, 'MSc_processed_results/preprocessing_duration.RData') -``` - -# Training data preparation - -## Time analysis - -```{r} -directories <- list.dirs('MSc_results')[2:26] -dataset <- c() -removal <- c() -imputation <- c() -feature_selection <- c() -duration <- c() -task_type <- c() -for (i in 1:25) { - names <- readRDS(paste0(directories[i], '/names_times.RData')) - times <- readRDS(paste0(directories[i], '/list_times.RData')) - for (j in 1:38) { - stringsplt <- strsplit(names[j], ' ')[[1]] - dataset <- c(dataset, stringsplt[2]) - removal <- 
c(removal, stringsplt[4]) - imputation <- c(imputation, stringsplt[6]) - feature_selection <- c(feature_selection, stringsplt[9]) - duration <- c(duration, times[j]) - if (i <= 10) { - task_type <- c(task_type, 'binary') - } else if (i > 18) { - task_type <- c(task_type, 'regression') - } else { - task_type <- c(task_type, 'multiclass') - } - } -} -duration_train_df <- data.frame(Dataset = dataset, Removal = removal, Imputation = imputation, - Feature_selection = feature_selection, Task_type = task_type, Duration = duration) -``` - -```{r eval=FALSE} -dir.create('MSc_processed_results', showWarnings = FALSE) -saveRDS(duration_train_df, 'MSc_processed_results/training_duration.RData') -``` - -```{r} -duration_train_df <- readRDS('MSc_processed_results/training_duration.RData') -rmarkdown::paged_table(duration_train_df) -``` - -## Scores and summaries - -As analysing the results of every single model is time consuming and would not provide much interesting information, we've decided to summarize those results by calculating the maximum, mean, median, and minimum value of each metric. Moreover, we've divided the summaries by engine, although a summary over all engines is also provided. In fact, the latter was used during the results analysis.
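As a minimal illustration of the aggregation described above, the sketch below summarizes a single metric per engine and adds an `all` row computed over every model. The score table, the engine names in it, and the helper `summarise_metric` are made up for this example and are not taken from the study's results.

```r
# Toy score table; engine names and values are made up for illustration.
scores <- data.frame(
  engine   = c('ranger', 'ranger', 'xgboost', 'xgboost'),
  accuracy = c(0.90, 0.80, 0.70, 0.95)
)

# Hypothetical helper: summarise one metric for a subset of models.
summarise_metric <- function(values, label) {
  data.frame(Engine = label, Max = max(values), Mean = mean(values),
             Median = median(values), Min = min(values))
}

# One row per engine, plus an 'all' row computed over every model.
per_engine <- lapply(unique(scores$engine), function(e) {
  summarise_metric(scores$accuracy[scores$engine == e], e)
})
toy_summary <- rbind(do.call(rbind, per_engine),
                     summarise_metric(scores$accuracy, 'all'))
toy_summary
```

The `all` row is not an average of the per-engine rows; it is computed directly on the pooled models, which matters whenever engines contribute different numbers of models.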
- -```{r} -res <- readRDS(paste0(directories[1], '/1.RData')) -res2 <- readRDS(paste0(directories[15], '/1.RData')) -res3 <- readRDS(paste0(directories[18], '/1.RData')) -``` - -```{r} -summarize_results <- function(data, type) { - summary_df <- data.frame() - engines <- c(unique(data$engine), 'all') - if (type == 'binary') { - metrics <- c('accuracy', 'auc', 'f1') - } else if (type == 'regression') { - metrics <- c('rmse', 'mse', 'r2', 'mae') - } else { - metrics <- c('accuracy', 'weighted_precision', 'weighted_recall', 'weighted_f1') - } - - - for (i in 1:length(engines)) { - # Choose the engine - if (engines[i] == 'all') { - df <- data - } else { - df <- data[data$engine == engines[i], ] - } - for (j in 1:length(metrics)) { - metric <- df[[metrics[j]]] - summ <- summary(metric) - record <- data.frame(Engine = engines[i], Metric = metrics[j], Max = summ[[6]], - Mean = summ[[4]], Median = summ[[3]], Min = summ[[1]], - Range = (summ[[6]] - summ[[1]])) - summary_df <- rbind(summary_df, record) - } - } - - return(summary_df) -} -``` - -```{r} -rmarkdown::paged_table(summarize_results(res3$score_test, 'regression')) -``` - -### Summary - -Here we actually use the function defined before and calculate the results. 
- -```{r eval=FALSE} -directories <- list.dirs('MSc_results')[2:26] -training_summary <- list() -task_type <- NULL -dir.create('MSc_processed_results/training_summary', showWarnings = FALSE) - -names_binary <- c('banknote-authentication', 'blood-transfusion-service-center', - 'breast-w', 'credit-approval', 'credit-g', 'credit-g-mod', 'diabetes', - 'kr-vs-kp', 'phoneme', 'phoneme-mod') -names_multiclass <- c("balance-scale", "car", "car-mod", "dna", "mfeat-karhunen", "satimage", "satimage-mod", 'wine_quality') -names_regression <- c('2dplanes', 'bank32nh', 'elevators','elevators-mod','kin8nm', 'kin8nm-mod', - 'Mercedes_Benz_Greener_Manufacturing') - -names_all <- c(names_binary, names_multiclass, names_regression) - -for (i in 1:25) { - training_summary[[i]] <- list() - - if (i <= 10) { - task_type <- 'binary' - } else if (i > 18) { - task_type <- 'regression' - } else { - task_type <- 'multiclass' - } - - if (!file.exists(paste0('MSc_processed_results/training_summary/binary_', names_all[i], '.RData')) && - !file.exists(paste0('MSc_processed_results/training_summary/regression_', names_all[i], '.RData')) && - !file.exists(paste0('MSc_processed_results/training_summary/multiclass_', names_all[i], '.RData'))) { - cat('Iteration:', i, '\n', 'The results for', task_type, names_all[i], 'does not exist. Proceeding with calculations. 
\n') - task_summary <- list() - for (j in 1:38) { - cat('Managing models from iteration:', j, '\n') - training_summary[[i]][[j]] <- list() - - results <- readRDS(paste0(directories[i], '/', j, '.RData')) - data <- results$data - data_dim <- dim(results$data) - score_test <- results$score_test - score_train <- results$score_train - score_valid <- results$score_valid - test_summary <- summarize_results(results$score_test, task_type) - train_summary <- summarize_results(results$score_train, task_type) - valid_summary <- summarize_results(results$score_valid, task_type) - - obs <- list(data = data, data_dim = data_dim, score_test = score_test, - score_train = score_train, score_valid = score_valid, - test_summary = test_summary, train_summary = train_summary, - valid_summary = valid_summary) - - suppressWarnings(task_summary[[j]] <- obs) - } - names <- readRDS(paste0(directories[i], '/names_times.RData')) - names(task_summary) <- names - - # Store the per-task list directly, so that both branches yield the same - # structure: training_summary[[i]][[j]]. - suppressWarnings(training_summary[[i]] <- task_summary) - - saveRDS(task_summary, paste0(getwd(), '/MSc_processed_results/training_summary/', task_type, '_', names_all[i], '.RData')) - } else { - cat('Iteration:', i, '\n', 'The results for', task_type, names_all[i], 'are already present. Skipping their preparation. \n') - task_summary <- readRDS(paste0(getwd(), '/MSc_processed_results/training_summary/', task_type, '_', names_all[i],'.RData')) - - suppressWarnings(training_summary[[i]] <- task_summary) - } -} - -names(training_summary) <- directories -``` - - -```{r eval=FALSE} -saveRDS(training_summary, 'MSc_processed_results/training_summary.RData') -``` - -# Extended training summary table - -To make the data easier to use, we've decided to create one big table combining all results with the preprocessing strategy and the duration of each stage. The resulting file is used in the results analysis.
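Building this table requires repeating each duration row once per summary row, and the repetition factors of 18 and 24 used in this notebook follow from the shape of the summaries. The sketch below reproduces that arithmetic on toy data. It assumes that every run contains models from all five engines, so that each summary has six engine entries (five engines plus `all`); the toy table and its values are hypothetical.

```r
# Assumed counts, read off the code in this notebook: five engines plus the
# aggregated 'all' entry, three metrics for binary tasks, four otherwise.
n_engines <- 5 + 1
rows_per_run <- c(binary = n_engines * 3, multiclass = n_engines * 4,
                  regression = n_engines * 4)
rows_per_run

# Toy duration table with one row per (dataset, preprocessing) run.
toy_durations <- data.frame(Dataset = c('a', 'b'),
                            Task_type = c('binary', 'regression'),
                            Duration = c(10.5, 42.0))

# Repeat each run's row once per summary row, so that the result can be
# column-bound with the stacked summaries.
reps <- rows_per_run[as.character(toy_durations$Task_type)]
toy_extended <- toy_durations[rep(seq_len(nrow(toy_durations)), times = reps), ]
rownames(toy_extended) <- NULL
table(toy_extended$Dataset)
```

If an engine failed to produce any model for some run, its summary would have fewer rows and the `cbind` of durations and summaries would no longer line up, so the assumption is worth checking on the real results.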
- -```{r eval=FALSE} -training_summary <- readRDS(paste0(getwd(), '/MSc_processed_results/training_summary.RData')) -duration_train_df <- readRDS(paste0(getwd(), '/MSc_processed_results/training_duration.RData')) -duration_preprocessing <- readRDS(paste0(getwd(), '/MSc_processed_results/preprocessing_duration.RData')) -duration_df <- duration_train_df -full_duration <- duration_preprocessing$Duration + duration_df$Duration - -duration_df$Preprocessing_duration <- duration_preprocessing$Duration -duration_df$Preprocessing_duration_fraction <- round(duration_df$Preprocessing_duration / full_duration, 3) -duration_df$Full_duration <- full_duration - -``` - -```{r eval=FALSE} -merged_train <- data.frame() -merged_test <- data.frame() -merged_valid <- data.frame() -merged_dim <- data.frame() - -#extended_summary <- duration_df[rep(1:nrow(duration_df), each = 18), ] -bin_summary <- duration_df[rep(1:(38*10), each = 18), ] -mcl_summary <- duration_df[rep((38*10+1):(38*18), each = 24), ] -reg_summary <- duration_df[rep((38*18+1):nrow(duration_df), each = 24), ] -extended_summary <- rbind(bin_summary, mcl_summary, reg_summary) - -for (i in 1:25) { - for (j in 1:38) { - merged_train <- rbind(merged_train, training_summary[[i]][[j]]$train_summary) - merged_test <- rbind(merged_test, training_summary[[i]][[j]]$test_summary) - merged_valid <- rbind(merged_valid, training_summary[[i]][[j]]$valid_summary) - dim <- training_summary[[i]][[j]]$data_dim - dim_df <- data.frame(Rows = dim[1], Columns = dim[2]) - if (i > 10) { - each <- 24 - } else { - each <- 18 - } - dim_df <- dim_df[rep(1:nrow(dim_df), each = each), ] - merged_dim <- rbind(merged_dim, dim_df) - } -} -``` - -```{r eval=FALSE} -extended_training_summary <- cbind(extended_summary, merged_dim, merged_train) -rownames(extended_training_summary) <- NULL -saveRDS(extended_training_summary, 'MSc_processed_results/training_summary_table.RData') -extended_training_summary <- 
readRDS('MSc_processed_results/training_summary_table.RData') -rmarkdown::paged_table(extended_training_summary) -``` - -```{r eval=FALSE} -extended_testing_summary <- cbind(extended_summary, merged_dim, merged_test) -rownames(extended_testing_summary) <- NULL -saveRDS(extended_testing_summary, 'MSc_processed_results/testing_summary_table.RData') -extended_testing_summary <- readRDS('MSc_processed_results/testing_summary_table.RData') -rmarkdown::paged_table(extended_testing_summary) -``` - -```{r} -extended_validation_summary <- cbind(extended_summary, merged_dim, merged_valid) -rownames(extended_validation_summary) <- NULL -saveRDS(extended_validation_summary, 'MSc_processed_results/validation_summary_table.RData') -extended_validation_summary <- readRDS('MSc_processed_results/validation_summary_table.RData') -rmarkdown::paged_table(extended_validation_summary) -``` diff --git a/docs/articles/AutoML24Workshop & MScThesis/06_MSc_preliminary_results_analysis.Rmd b/docs/articles/AutoML24Workshop & MScThesis/06_MSc_preliminary_results_analysis.Rmd deleted file mode 100644 index 15296db..0000000 --- a/docs/articles/AutoML24Workshop & MScThesis/06_MSc_preliminary_results_analysis.Rmd +++ /dev/null @@ -1,1651 +0,0 @@ ---- -title: "Masters Thesis forester: Preliminary results analysis" -author: "Hubert Ruczyński" -date: "`r Sys.Date()`" -output: - html_document: - toc: yes - toc_float: yes - toc_collapsed: yes - theme: lumen - toc_depth: 3 - number_sections: yes - code_folding: hide - latex_engine: xelatex ---- - -```{css, echo=FALSE} -body .main-container { - max-width: 1820px !important; - width: 1820px !important; -} -body { - max-width: 1820px !important; - width: 1820px !important; - font-family: Helvetica !important; - font-size: 16pt !important; -} -h1,h2,h3,h4,h5,h6{ - font-size: 24pt !important; -} -``` - -# Important note - -In this notebook you will find preliminary results of our study, however none of these plots are used in the Master's Thesis. 
This supplementary material provides additional detail for readers who prefer non-aggregated results. To see the final results, please refer to notebooks 07 and 08. - -# Imports and settings - -```{r, warning=FALSE, message=FALSE} -library(ggplot2) -library(patchwork) -library(scales) -library(dplyr) -library(forcats) -``` - - -# Data import - -```{r} -duration_train_df <- readRDS('MSc_processed_results/training_duration.RData') -duration_preprocessing <- readRDS('MSc_processed_results/preprocessing_duration.RData') -training_summary_table <- readRDS('MSc_processed_results/training_summary_table.RData') -testing_summary_table <- readRDS('MSc_processed_results/testing_summary_table.RData') -validation_summary_table <- readRDS('MSc_processed_results/validation_summary_table.RData') -``` - -# Time analysis - -An important aspect of our analysis is the time cost of the different approaches, as an extended preprocessing module leads to more time-consuming computations, and that time could otherwise be spent, for example, on training the models. On the other hand, a thorough preparation step might remove many unnecessary columns, so the models should be able to learn faster. Besides the absolute preprocessing time, another important aspect is its duration relative to the training time. For example, if the training takes 1000 seconds, then preprocessing lasting 100 seconds matters much less than when the training takes only 100 seconds. We will work on a slightly modified data frame, presented below.
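Before turning to the data frame, the relative-duration argument can be made concrete with the timings from the example in the text. The numbers are illustrative only, and the formula mirrors the fraction of the full run spent on preprocessing.

```r
# Hypothetical timings in seconds, taken from the example in the text.
preprocessing <- c(100, 100)
training      <- c(1000, 100)

# Share of the full run (preprocessing + training) spent on preprocessing.
fraction <- round(preprocessing / (preprocessing + training), 3)
fraction
```

The same 100 seconds of preprocessing amounts to roughly 9% of the run in the first case and 50% in the second, which is why both the absolute and the relative durations are reported.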
- -```{r} -duration_df <- duration_train_df -full_duration <- duration_preprocessing$Duration + duration_df$Duration -duration_df$Preprocessing_duration <- duration_preprocessing$Duration -duration_df$Preprocessing_duration_fraction <- round(duration_df$Preprocessing_duration / full_duration, 3) -duration_df$Full_duration <- full_duration -rmarkdown::paged_table(duration_df) -``` - -## Training time - -```{r, echo=FALSE} -column_fractions <- c() -max_fields_num <- c() -task_type <- c() -Columns <- c() -datasets <- unique(training_summary_table$Dataset) -for (i in 1:length(unique(training_summary_table$Dataset))) { - cols <- training_summary_table[training_summary_table$Dataset == datasets[i], 'Columns'] - rows <- training_summary_table[training_summary_table$Dataset == datasets[i], 'Rows'] - Columns <- c(Columns, max(cols)) - column_fractions <- c(column_fractions, round(min(cols) / max(cols), 2)) - max_fields_num <- c(max_fields_num, max(rows) * max(cols)) - if (i <= 10) { - task_type <- c(task_type, 'binary') - } else if (i > 18) { - task_type <- c(task_type, 'regression') - } else { - task_type <- c(task_type, 'multiclass') - } -} -left_columns <- data.frame(Dataset = datasets, Column_fraction = column_fractions, Columns = Columns, - Max_fields_number = max_fields_num, Task_type = task_type) -``` - -```{r} -left_columns_binary <- left_columns[left_columns$Task_type == 'binary', ] -#left_columns_binary$Dataset <- droplevels(left_columns_binary$Dataset) -left_columns_binary$Dataset <- fct_reorder(left_columns_binary$Dataset, left_columns_binary$Max_fields_number) - -left_columns_multiclass <- left_columns[left_columns$Task_type == 'multiclass', ] -#left_columns_multiclass$Dataset <- droplevels(left_columns_multiclass$Dataset) -left_columns_multiclass$Dataset <- fct_reorder(left_columns_multiclass$Dataset, left_columns_multiclass$Max_fields_number) - -left_columns_regression <- left_columns[left_columns$Task_type == 'regression', ] -#left_columns_regression$Dataset 
<- droplevels(left_columns_regression$Dataset) -left_columns_regression$Dataset <- fct_reorder(left_columns_regression$Dataset, left_columns_regression$Max_fields_number) - -left_columns <- rbind(left_columns_binary, left_columns_multiclass, left_columns_regression) -``` - -```{r} -paper_theme <- function() { - theme_minimal() + - theme(plot.title = element_text(colour = 'black', size = 20), - plot.subtitle = element_text(colour = 'black', size = 16), - axis.title.x = element_text(colour = 'black', size = 14), - axis.title.y = element_text(colour = 'black', size = 14), - axis.text.y = element_text(colour = "black", size = 12), - axis.text.x = element_text(colour = "black", size = 12), - strip.background = element_rect(fill = "white", color = "white"), - strip.text = element_text(size = 6 ), - strip.text.y.right = element_text(angle = 0), - legend.title = element_text(colour = 'black', size = 14), - legend.text = element_text(colour = "black", size = 12)) -} -``` - - -```{r fig.height=8, fig.width=18, echo=FALSE} -a <- ggplot(data = left_columns, aes(color = Task_type, fill = Task_type)) + - geom_segment(aes(x = Columns * column_fractions, xend = Columns, y = Dataset, yend = Dataset)) + - geom_point(aes(x = Columns * column_fractions, y = Dataset), size = 3) + - geom_point(aes(x = Columns, y = Dataset), size = 3) + - labs(title = 'Columns range', - subtitle = 'full columns vs maximal reduction', - x = 'Number of columns', - y = '', - color = 'Task type', - fill = 'Task type') + - scale_x_continuous(trans = log2_trans(), breaks = trans_breaks('log2', function(x) 2^x)) + - annotation_logticks(base = 2, scaled = TRUE) + - paper_theme() + - scale_color_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - scale_fill_manual(values = c("#afc968","#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - theme(axis.text.y = element_blank(), - legend.position = "none") - -b <- ggplot(data = duration_df, aes(x = Duration, y = Dataset, color = Task_type, fill = 
Task_type)) + - geom_boxplot(alpha = 0.5) + - labs(title = 'Training time', - subtitle = 'for different ML tasks', - x = 'Duration [s]', - y = 'Dataset', - color = 'Task type', - fill = 'Task type') + - scale_x_continuous(trans = log2_trans(), breaks = trans_breaks('log2', function(x) 2^x)) + - annotation_logticks(base = 2, scaled = TRUE) + - paper_theme() + - scale_color_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - scale_fill_manual(values = c("#afc968","#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - theme(legend.position = "bottom") - -(b | a) + plot_layout(widths = c(3, 1)) -``` - -## Preprocessing time - -```{r fig.height=8, fig.width=18, echo=FALSE} -c <- ggplot(data = left_columns, aes(x = Max_fields_number, y = Dataset, color = Task_type, fill = Task_type)) + - geom_col(alpha = 0.5) + - labs(title = 'Number of initial fields', - subtitle = '', - x = 'Number of fields', - y = '', - color = 'Task_type', - fill = 'Task_type') + - scale_x_continuous(trans = log2_trans(), - breaks = trans_breaks('log2', function(x) 2^x), - labels = trans_format('log2', math_format(2^.x))) + - annotation_logticks(base = 2, scaled = TRUE) + - paper_theme() + - scale_color_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - scale_fill_manual(values = c("#afc968","#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - theme(axis.text.y = element_blank(), - legend.position = "none") - -d <- ggplot(data = duration_df, aes(x = Preprocessing_duration, y = Dataset, color = Task_type, fill = Task_type)) + - geom_boxplot(alpha = 0.5) + - labs(title = 'Preprocessing time', - subtitle = 'for different ML tasks', - x = 'Duration [s]', - y = 'Dataset', - color = 'Task type', - fill = 'Task type') + - scale_x_continuous(trans = log2_trans(), breaks = trans_breaks('log2', function(x) 2^x)) + - annotation_logticks(base = 2, scaled = TRUE) + - paper_theme() + - scale_color_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) 
+ - scale_fill_manual(values = c("#afc968","#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - theme(legend.position = "bottom") -(d | a | c) + plot_layout(widths = c(3, 1, 1)) -``` - -## Combined time - -```{r fig.height=8, fig.width=18, echo=FALSE} -e <- ggplot(data = duration_df, aes(x = Full_duration, y = Dataset, color = Task_type, fill = Task_type)) + - geom_boxplot(alpha = 0.5) + - labs(title = 'Combined preprocessing and training time', - subtitle = 'for different ML tasks', - x = 'Duration [s]', - y = 'Dataset', - color = 'Task type', - fill = 'Task type') + - scale_x_continuous(trans = log2_trans(), breaks = trans_breaks('log2', function(x) 2^x)) + - annotation_logticks(base = 2, scaled = TRUE) + - paper_theme() + - scale_color_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - scale_fill_manual(values = c("#afc968","#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - theme(legend.position = "bottom") - -(e | a | c) + plot_layout(widths = c(3, 1, 1)) -``` - -```{r fig.height=8, fig.width=18, echo=FALSE} -f <- ggplot(data = duration_df, aes(x = Preprocessing_duration_fraction, y = Dataset, color = Task_type, fill = Task_type)) + - geom_boxplot(alpha = 0.5) + - labs(title = 'Preprocessing time fraction', - subtitle = ' in comparison to full process, for different ML tasks', - x = 'Fraction of preprocessing time', - y = 'Dataset', - color = 'Task type', - fill = 'Task type') + - xlim(0, 1) + - paper_theme() + - scale_color_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - scale_fill_manual(values = c("#afc968","#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - theme(legend.position = "bottom") - -(f | a | c) + plot_layout(widths = c(3, 1, 1)) -``` - -## Preprocessing components analysis - -### Feature selection impact - -```{r fig.height=8, fig.width=18, echo=FALSE} -bool_fs <- duration_preprocessing -bool_fs[bool_fs$Feature_selection != ' none', 'Feature_selection'] <- 'yes' - -g <- ggplot(data = bool_fs, aes(x 
= Duration, y = Dataset, color = factor(Feature_selection), fill = factor(Feature_selection))) + - geom_boxplot(alpha = 0.5) + - theme_minimal() + - labs(title = 'Preprocessing time', - subtitle = 'for different ML tasks, divided by presence of feature selection', - x = 'Duration [s]', - y = 'Dataset', - color = 'Feature selection method', - fill = 'Feature selection method') + - scale_x_continuous(trans = log2_trans(), breaks = trans_breaks('log2', function(x) 2^x)) + - annotation_logticks(base = 2, scaled = TRUE) + - paper_theme() + - scale_color_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - scale_fill_manual(values = c("#afc968","#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - theme(legend.position = "bottom") - -g -``` - -### No feature selection removal strategies - - -```{r fig.height=8, fig.width=18, echo=FALSE} -no_fs <- duration_preprocessing[duration_preprocessing$Feature_selection == ' none', ] - -h <- ggplot(data = no_fs, aes(x = Duration, y = Dataset, color = factor(Removal), fill = factor(Removal))) + - geom_boxplot(alpha = 0.5) + - labs(title = 'Preprocessing time', - subtitle = 'for different ML tasks, divided by removal strategy', - x = 'Duration [s]', - y = 'Dataset', - color = 'Removal strategy', - fill = 'Removal strategy') + - scale_x_continuous(trans = log2_trans(), breaks = trans_breaks('log2', function(x) 2^x)) + - annotation_logticks(base = 2, scaled = TRUE) + - paper_theme() + - scale_color_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - scale_fill_manual(values = c("#afc968","#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - theme(legend.position = "bottom") - -h -``` - -### No feature selection Imputation methods - -```{r fig.height=8, fig.width=18, echo=FALSE} -no_fs_imp <- no_fs[no_fs$Dataset %in% c('breast-w ', 'credit-approval ', "credit-g-mod ", "phoneme-mod ", "car-mod ", "satimage-mod ", "elevators-mod ", "kin8nm-mod "), ] -i <- ggplot(data = no_fs_imp, aes(x = Duration, 
y = Dataset, color = factor(Imputation), fill = factor(Imputation))) + - geom_boxplot(alpha = 0.5) + - labs(title = 'Preprocessing time', - subtitle = 'for different ML tasks, divided by imputation strategy', - x = 'Duration [s]', - y = 'Dataset', - color = 'Imputation strategy', - fill = 'Imputation strategy') + - scale_x_continuous(trans = log2_trans(), breaks = trans_breaks('log2', function(x) 2^x)) + - annotation_logticks(base = 2, scaled = TRUE) + - paper_theme() + - scale_color_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - scale_fill_manual(values = c("#afc968","#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - theme(legend.position = "bottom") - -j <- ggplot(data = no_fs, aes(x = Duration, y = Dataset, color = factor(Imputation), fill = factor(Imputation))) + - geom_boxplot(alpha = 0.5) + - labs(title = 'Preprocessing time comparison with forester', - subtitle = 'for different ML tasks, divided by imputation strategy', - x = 'Duration [s]', - y = 'Dataset', - color = 'Imputation strategy', - fill = 'Imputation strategy') + - scale_x_continuous(trans = log2_trans(), breaks = trans_breaks('log2', function(x) 2^x)) + - annotation_logticks(base = 2, scaled = TRUE) + - paper_theme() + - scale_color_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - scale_fill_manual(values = c("#afc968","#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - theme(legend.position = "bottom") - -i -``` - -```{r fig.height=10, fig.width=18, echo=FALSE} -j -``` - -### Different feature selection methods - -```{r fig.height=8, fig.width=18, echo=FALSE} -only_fs <- duration_preprocessing[duration_preprocessing$Feature_selection != ' none', ] -only_fs_niche <- only_fs[only_fs$Feature_selection %in% c(' MI', ' MCFS'), ] -only_fs_top <- only_fs[only_fs$Feature_selection %in% c(' VI', ' BORUTA'), ] - -k <- ggplot(data = only_fs, aes(x = Duration, y = Dataset, color = factor(Feature_selection), fill = factor(Feature_selection))) + - 
geom_boxplot(alpha = 0.5) + - labs(title = 'Preprocessing time', - subtitle = 'for different ML tasks, divided by feature selection method', - x = 'Duration [s]', - y = 'Dataset', - color = 'Feature selection method', - fill = 'Feature selection method') + - scale_x_continuous(trans = log2_trans(), breaks = trans_breaks('log2', function(x) 2^x), limits = c(NA, 4100)) + - annotation_logticks(base = 2, scaled = TRUE) + - paper_theme() + - scale_color_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - scale_fill_manual(values = c("#afc968","#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - theme(legend.position = "bottom") -k -``` - -```{r fig.height=8, fig.width=18, echo=FALSE} -datasets <- unique(only_fs$Dataset) -VI <- c() -MCFS <- c() -MI <- c() -BORUTA <- c() - -for (i in unique(only_fs$Dataset)) { - ds <- only_fs[only_fs$Dataset == i, ] - VI <- c(VI, median(ds[ds$Feature_selection == ' VI', 'Duration'])) - MCFS <- c(MCFS, median(ds[ds$Feature_selection == ' MCFS', 'Duration'])) - MI <- c(MI, median(ds[ds$Feature_selection == ' MI', 'Duration'])) - BORUTA <- c(BORUTA, median(ds[ds$Feature_selection == ' BORUTA', 'Duration'])) -} - -median_fs <- data.frame(Dataset = datasets, VI = VI, MCFS = MCFS, BORUTA = BORUTA, MI = MI) -long_median_fs <- reshape(median_fs, varying = c('MI' ,'VI', 'MCFS', 'BORUTA'), v.names = c('Duration'), - times = c('MI' ,'VI', 'MCFS', 'BORUTA'), direction = 'long') -long_median_fs <- long_median_fs[, 1:3] - -rownames(long_median_fs) <- NULL -colnames(long_median_fs) <- c('Dataset', 'Method', 'Duration') - -l <- ggplot(data = long_median_fs, aes(x = Duration, y = Dataset, color = factor(Method), fill = factor(Method))) + - geom_point(size = 5, alpha = 0.5) + - labs(title = 'Preprocessing median time', - subtitle = 'for different ML tasks, divided by feature selection method', - x = 'Duration [s]', - y = 'Dataset', - color = 'Feature selection method', - fill = 'Feature selection method') + - scale_x_continuous(trans 
= log2_trans(), breaks = trans_breaks('log2', function(x) 2^x)) + - annotation_logticks(base = 2, scaled = TRUE) + - paper_theme() + - scale_color_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - scale_fill_manual(values = c("#afc968","#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - theme(legend.position = "bottom") -l -``` - -The visualization above clearly indicates that the feature selection methods in the forester package divide into slow and fast ones, where MCFS and MI are in the first group, whereas VI and Boruta are in the second one. In order to analyse them thoroughly, let's create two subplots that separate those two groups. - -```{r fig.height=8, fig.width=18, echo=FALSE} -long_median_fs_slow <- long_median_fs[long_median_fs$Method %in% c('VI', 'BORUTA'), ] -long_median_fs_fast <- long_median_fs[long_median_fs$Method %in% c('MCFS', 'MI'), ] - -m <- ggplot(data = long_median_fs_slow, aes(x = Duration, y = Dataset, color = factor(Method), fill = factor(Method))) + - geom_point(size = 5, alpha = 0.5) + - labs(title = 'Preprocessing median time', - subtitle = 'for slow feature selection methods', - x = 'Duration [s]', - y = 'Dataset', - color = 'Feature selection method', - fill = 'Feature selection method') + - scale_x_continuous(trans = log2_trans(), breaks = trans_breaks('log2', function(x) 2^x)) + - annotation_logticks(base = 2, scaled = TRUE) + - paper_theme() + - scale_color_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - scale_fill_manual(values = c("#afc968","#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - theme(legend.position = "bottom") - -n <- ggplot(data = long_median_fs_fast, aes(x = Duration, y = Dataset, color = factor(Method), fill = factor(Method))) + - geom_point(size = 5, alpha = 0.5) + - labs(title = 'Preprocessing median time', - subtitle = 'for fast feature selection methods', - x = 'Duration [s]', - y = 'Dataset', - color = 'Feature selection method', - fill = 'Feature 
selection method') + - scale_x_continuous(trans = log2_trans(), breaks = trans_breaks('log2', function(x) 2^x)) + - annotation_logticks(base = 2, scaled = TRUE) + - paper_theme() + - scale_color_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - scale_fill_manual(values = c("#afc968","#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - theme(axis.text.y = element_blank(), - axis.title.y = element_blank(), - legend.position = "bottom") -m | n -``` - -This time we can easily distinguish which preprocessing methods are faster and which are slower within each pair. Among the less time-demanding ones, presented on the right plot, the MI method is always faster than BORUTA, and in some cases the differences are significant, as they can reach up to a 16-fold difference. For the slow methods it is less clear which one is more demanding, as sometimes VI is faster and sometimes MCFS. We could say that the slowest algorithm is the VI method, as there are 5 datasets where MCFS is incredibly fast, whereas VI is much slower on them. - -Summing up, the order from fastest to slowest feature selection method is: MI, BORUTA, MCFS, VI. - - -# Performance - -Now, let's analyse the performance of the models obtained in our experiment. 
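As a rough illustration of the baseline comparison used throughout this section, the sketch below builds a small made-up table and checks, per dataset, where the baseline preprocessing falls relative to all strategies. Only the column names and the baseline settings (`removal_min`, `median-other`, no feature selection) come from this report; the `toy` table, its values, and the `other_a` / `other_b` removal labels are hypothetical placeholders.

```r
# Toy stand-in for validation_summary_table; all values are invented.
toy <- data.frame(
  Dataset           = rep(c('A', 'B'), each = 3),
  # 'other_a' and 'other_b' are hypothetical labels, not real strategy names.
  Removal           = rep(c('removal_min', 'other_a', 'other_b'), times = 2),
  Imputation        = 'median-other',
  Feature_selection = 'none',
  Metric            = 'accuracy',
  Max               = c(0.93, 0.92, 0.88, 0.70, 0.72, 0.75),
  stringsAsFactors  = FALSE
)

# Baseline rows: the same three-condition filter as in the report.
is_baseline <- toy$Removal == 'removal_min' &
  toy$Imputation == 'median-other' &
  toy$Feature_selection == 'none'
baseline <- toy[is_baseline, c('Dataset', 'Max')]
names(baseline)[2] <- 'Baseline_max'

# Median of Max over all preprocessing strategies, per dataset.
overall <- aggregate(Max ~ Dataset, data = toy, FUN = median)
names(overall)[2] <- 'Median_max'

# One row per dataset: baseline value next to the overall median.
comparison <- merge(baseline, overall, by = 'Dataset')
comparison$Baseline_above_median <- comparison$Baseline_max > comparison$Median_max
print(comparison)
```

The plots in this section show the same relation graphically: the X-marks are the baseline rows and the boxplots summarise all strategies.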
- -```{r, echo=FALSE} -all_engines <- validation_summary_table[validation_summary_table$Engine == 'all', ] - -all_engines_bin <- all_engines[all_engines$Task_type == 'binary', ] -all_engines_mcl <- all_engines[all_engines$Task_type == 'multiclass', ] -all_engines_reg <- all_engines[all_engines$Task_type == 'regression', ] - -all_engines_bin_baselines <- all_engines_bin[which(all_engines_bin$Removal =='removal_min' & - all_engines_bin$Imputation =='median-other' & - all_engines_bin$Feature_selection =='none'), ] -all_engines_mcl_baselines <- all_engines_mcl[which(all_engines_mcl$Removal =='removal_min' & - all_engines_mcl$Imputation =='median-other' & - all_engines_mcl$Feature_selection =='none'), ] -all_engines_reg_baselines <- all_engines_reg[which(all_engines_reg$Removal =='removal_min' & - all_engines_reg$Imputation =='median-other' & - all_engines_reg$Feature_selection =='none'), ] -``` - -## Comparison to baseline preprocessing - -### Binary classification - -```{r fig.height=12, fig.width=18, echo=FALSE, warning=FALSE} -o <- ggplot(data = all_engines_bin, aes(x = Max, y = Dataset, color = factor(Metric), fill = factor(Metric))) + - geom_boxplot(alpha = 0.5) + - geom_point(data = all_engines_bin_baselines, size = 4, shape = 4, - position = position_jitterdodge(), - aes(x = Max, y = Dataset, color = factor(Metric), fill = factor(Metric))) + - labs(title = 'Max metrics values', - subtitle = 'for binary classification tasks, X-mark stands for baseline preprocessing', - x = 'Value', - y = 'Dataset', - color = 'Metric', - fill = 'Metric') + - xlim(0.35, 1) + - paper_theme() + - scale_color_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - scale_fill_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - theme(axis.title.x = element_blank(), - axis.title.y = element_blank(), - legend.position = "none") - -p <- ggplot(data = all_engines_bin, aes(x = Mean, y = Dataset, color = factor(Metric), fill = factor(Metric))) + 
- geom_boxplot(alpha = 0.5) + - geom_point(data = all_engines_bin_baselines, size = 4, shape = 4, - position = position_jitterdodge(), - aes(x = Mean, y = Dataset, color = factor(Metric), fill = factor(Metric))) + - labs(title = 'Mean metrics values', - subtitle = 'for binary classification tasks, X-mark stands for baseline preprocessing', - x = 'Value', - y = 'Dataset', - color = 'Metric', - fill = 'Metric') + - xlim(0.35, 1) + - paper_theme() + - scale_color_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - scale_fill_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - theme(plot.subtitle = element_blank(), - axis.title.x = element_blank(), - legend.position = "none") - -r <- ggplot(data = all_engines_bin, aes(x = Median, y = Dataset, color = factor(Metric), fill = factor(Metric))) + - geom_boxplot(alpha = 0.5) + - geom_point(data = all_engines_bin_baselines, size = 4, shape = 4, - position = position_jitterdodge(), - aes(x = Median, y = Dataset, color = factor(Metric), fill = factor(Metric))) + - labs(title = 'Median metrics values', - subtitle = 'for binary classification tasks, X-mark stands for baseline preprocessing', - x = 'Value', - y = 'Dataset', - color = 'Metric', - fill = 'Metric') + - xlim(0.35, 1) + - paper_theme() + - scale_color_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - scale_fill_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - theme(plot.subtitle = element_blank(), - axis.title.y = element_blank(), - legend.position = "bottom") - -o / p / r -``` - - -### Multiclass classification - -```{r fig.height=12, fig.width=18, echo=FALSE, warning=FALSE} -o <- ggplot(data = all_engines_mcl, aes(x = Max, y = Dataset, color = factor(Metric), fill = factor(Metric))) + - geom_boxplot(alpha = 0.5) + - geom_point(data = all_engines_mcl_baselines, size = 4, shape = 4, - position = position_jitterdodge(), - aes(x = Max, y = Dataset, color = 
factor(Metric), fill = factor(Metric))) + - labs(title = 'Max metrics values', - subtitle = 'for multiclass classification tasks, X-mark stands for baseline preprocessing', - x = 'Value', - y = 'Dataset', - color = 'Metric', - fill = 'Metric') + - xlim(0.25, 1) + - paper_theme() + - scale_color_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - scale_fill_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - theme(axis.title.y = element_blank(), - axis.title.x = element_blank(), - legend.position = "none") - -p <- ggplot(data = all_engines_mcl, aes(x = Mean, y = Dataset, color = factor(Metric), fill = factor(Metric))) + - geom_boxplot(alpha = 0.5) + - geom_point(data = all_engines_mcl_baselines, size = 4, shape = 4, - position = position_jitterdodge(), - aes(x = Mean, y = Dataset, color = factor(Metric), fill = factor(Metric))) + - labs(title = 'Mean metrics values', - subtitle = 'for multiclass classification tasks, X-mark stands for baseline preprocessing', - x = 'Value', - y = 'Dataset', - color = 'Metric', - fill = 'Metric') + - xlim(0.25, 1) + - paper_theme() + - scale_color_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - scale_fill_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - theme(plot.subtitle = element_blank(), - axis.title.x = element_blank(), - legend.position = "none") - -r <- ggplot(data = all_engines_mcl, aes(x = Median, y = Dataset, color = factor(Metric), fill = factor(Metric))) + - geom_boxplot(alpha = 0.5) + - geom_point(data = all_engines_mcl_baselines, size = 4, shape = 4, - position = position_jitterdodge(), - aes(x = Median, y = Dataset, color = factor(Metric), fill = factor(Metric))) + - labs(title = 'Median metrics values', - subtitle = 'for multiclass classification tasks, X-mark stands for baseline preprocessing', - x = 'Value', - y = 'Dataset', - color = 'Metric', - fill = 'Metric') + - xlim(0.25, 1) + - paper_theme() + - 
scale_color_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - scale_fill_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - theme(plot.subtitle = element_blank(), - axis.title.y = element_blank(), - legend.position = "bottom") - -o / p / r -``` - -### Regression - -```{r echo=FALSE} -all_engines_reg_min_med <- all_engines_reg[, c(1, 2, 3, 4, 5, 13, 16, 17)] -median <- all_engines_reg_min_med[, 1:7] -names(median) <- c(names(median)[1:6], 'Value') -min <- all_engines_reg_min_med[, c(1:6, 8)] -names(min) <- c(names(min)[1:6], 'Value') -all_engines_reg_min_med <- rbind(median, min) -all_engines_reg_min_med$Aggregation <- rep(c('Median', 'Min'), each = nrow(all_engines_reg)) -``` - -```{r echo=FALSE} -all_engines_reg_baselines_min_med <- all_engines_reg_baselines[, c(1, 2, 3, 4, 5, 13, 16, 17)] -median <- all_engines_reg_baselines_min_med[, 1:7] -names(median) <- c(names(median)[1:6], 'Value') -min <- all_engines_reg_baselines_min_med[, c(1:6, 8)] -names(min) <- c(names(min)[1:6], 'Value') -all_engines_reg_baselines_min_med <- rbind(median, min) -all_engines_reg_baselines_min_med$Aggregation <- rep(c('Median', 'Min'), each = nrow(all_engines_reg_baselines)) -``` - -```{r fig.height=10, fig.width=18, echo=FALSE, warning=FALSE} -metric <- 'mse' -s <- ggplot(data = all_engines_reg_min_med[which(all_engines_reg_min_med$Metric == metric), ], - aes(x = Value, y = Dataset, color = factor(Aggregation), fill = factor(Aggregation))) + - geom_boxplot(alpha = 0.5) + - geom_point(data = all_engines_reg_baselines_min_med[which(all_engines_reg_baselines_min_med$Metric == metric), ], - size = 3, shape = 4,position = position_jitterdodge(), - aes(x = Value, y = Dataset, color = factor(Aggregation), fill = factor(Aggregation))) + - labs(title = 'Aggregated MSE values (Magnified)', - subtitle = 'for regression tasks, X-mark stands for baseline preprocessing', - x = 'Value', - y = 'Dataset', - color = 'Aggregation', - fill = 
'Aggregation') + - coord_cartesian(xlim = c(0, 2)) + - paper_theme() + - scale_color_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - scale_fill_manual(values = c("#afc968","#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - theme(legend.position = "none", - axis.title.y = element_blank(), - axis.title.x = element_blank()) - -t <- ggplot(data = all_engines_reg_min_med[all_engines_reg_min_med$Metric == metric, ], - aes(x = Value, y = Dataset, color = factor(Aggregation), fill = factor(Aggregation))) + - geom_boxplot(alpha = 0.5) + - geom_point(data = all_engines_reg_baselines_min_med[which(all_engines_reg_baselines_min_med$Metric == metric), ], - size = 3, shape = 4,position = position_jitterdodge(), - aes(x = Value, y = Dataset, color = factor(Aggregation), fill = factor(Aggregation))) + - labs(title = 'Aggregated MSE values (All)', - x = 'Value',) + - coord_cartesian(xlim = c(0, NA)) + - paper_theme() + - scale_color_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - scale_fill_manual(values = c("#afc968","#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - theme(legend.position = "none", - plot.subtitle = element_blank(), - axis.title.x = element_blank(), - axis.title.y = element_blank(), - axis.text.y = element_blank()) - -metric <- 'mae' -u <- ggplot(data = all_engines_reg_min_med[which(all_engines_reg_min_med$Metric == metric), ], - aes(x = Value, y = Dataset, color = factor(Aggregation), fill = factor(Aggregation))) + - geom_boxplot(alpha = 0.5) + - geom_point(data = all_engines_reg_baselines_min_med[which(all_engines_reg_baselines_min_med$Metric == metric), ], - size = 3, shape = 4,position = position_jitterdodge(), - aes(x = Value, y = Dataset, color = factor(Aggregation), fill = factor(Aggregation))) + - labs(title = 'Aggregated MAE values (Magnified)', - x = 'Value', - y = 'Dataset', - color = 'Aggregation', - fill = 'Aggregation') + - coord_cartesian(xlim = c(0, 2)) + - paper_theme() + - 
scale_color_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - scale_fill_manual(values = c("#afc968","#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - theme(legend.position = "none", - plot.subtitle = element_blank(), - axis.title.x = element_blank()) - -v <- ggplot(data = all_engines_reg_min_med[all_engines_reg_min_med$Metric == metric, ], - aes(x = Value, y = Dataset, color = factor(Aggregation), fill = factor(Aggregation))) + - geom_boxplot(alpha = 0.5) + - geom_point(data = all_engines_reg_baselines_min_med[which(all_engines_reg_baselines_min_med$Metric == metric), ], - size = 3, shape = 4,position = position_jitterdodge(), - aes(x = Value, y = Dataset, color = factor(Aggregation), fill = factor(Aggregation))) + - labs(title = 'Aggregated MAE values (All)', - x = 'Value',) + - coord_cartesian(xlim = c(0, NA)) + - paper_theme() + - scale_color_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - scale_fill_manual(values = c("#afc968","#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - theme(legend.position = "none", - plot.subtitle = element_blank(), - axis.title.x = element_blank(), - axis.title.y = element_blank(), - axis.text.y = element_blank()) - -metric <- 'rmse' -w <- ggplot(data = all_engines_reg_min_med[which(all_engines_reg_min_med$Metric == metric), ], - aes(x = Value, y = Dataset, color = factor(Aggregation), fill = factor(Aggregation))) + - geom_boxplot(alpha = 0.5) + - geom_point(data = all_engines_reg_baselines_min_med[which(all_engines_reg_baselines_min_med$Metric == metric), ], - size = 3, shape = 4,position = position_jitterdodge(), - aes(x = Value, y = Dataset, color = factor(Aggregation), fill = factor(Aggregation))) + - labs(title = 'Aggregated RMSE values (Magnified)', - x = 'Value', - y = 'Dataset', - color = 'Aggregation', - fill = 'Aggregation') + - coord_cartesian(xlim = c(0, 2)) + - paper_theme() + - scale_color_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + 
- scale_fill_manual(values = c("#afc968","#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - theme(legend.position = "bottom", - plot.subtitle = element_blank(), - axis.title.y = element_blank()) - -x <- ggplot(data = all_engines_reg_min_med[all_engines_reg_min_med$Metric == metric, ], - aes(x = Value, y = Dataset, color = factor(Aggregation), fill = factor(Aggregation))) + - geom_boxplot(alpha = 0.5) + - geom_point(data = all_engines_reg_baselines_min_med[which(all_engines_reg_baselines_min_med$Metric == metric), ], - size = 3, shape = 4,position = position_jitterdodge(), - aes(x = Value, y = Dataset, color = factor(Aggregation), fill = factor(Aggregation))) + - labs(title = 'Aggregated RMSE values (All)', - x = 'Value',) + - coord_cartesian(xlim = c(0, NA)) + - paper_theme() + - scale_color_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - scale_fill_manual(values = c("#afc968","#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - theme(legend.position = "none", - plot.subtitle = element_blank(), - axis.title.y = element_blank(), - axis.text.y = element_blank()) - -(s | t) / (u | v) / (w | x) -``` - -## Feature Selection Impact - -### Binary classification - -```{r echo=FALSE} -all_engines_bin_fs <- all_engines_bin -all_engines_bin_fs <- all_engines_bin_fs[all_engines_bin_fs$Metric == 'accuracy', ] -all_engines_bin_fs$Feature_selection <- ifelse(all_engines_bin_fs$Feature_selection != 'none', 'Yes', 'No') - -``` - -```{r fig.height=10, fig.width=18, echo=FALSE, warning=FALSE} -a1 <- ggplot(data = all_engines_bin_fs, aes(x = Max, y = Dataset, color = factor(Feature_selection), fill = factor(Feature_selection))) + - geom_boxplot(alpha = 0.5) + - labs(title = 'Max Accuracy', - subtitle = 'for binary classification tasks, depending on whether FS methods were used', - x = 'Value', - y = 'Dataset', - color = 'Are FS method used?', - fill = 'Are FS method used?') + - xlim(0.7, 1) + - paper_theme() + - scale_color_manual(values = c("#afc968", "#74533d", "#7C843C", 
"#B1805B", '#D6E29C')) + - scale_fill_manual(values = c("#afc968","#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - theme(legend.position = "none", - axis.title.y = element_blank(), - axis.title.x = element_blank()) - -b1 <- ggplot(data = all_engines_bin_fs, aes(x = Mean, y = Dataset, color = factor(Feature_selection), fill = factor(Feature_selection))) + - geom_boxplot(alpha = 0.5) + - labs(title = 'Mean Accuracy', - x = 'Value', - y = 'Dataset', - color = 'Are FS method used?', - fill = 'Are FS method used?') + - xlim(0.7, 1) + - paper_theme() + - scale_color_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - scale_fill_manual(values = c("#afc968","#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - theme(legend.position = "none", - plot.subtitle = element_blank(), - axis.title.x = element_blank()) - -c1 <- ggplot(data = all_engines_bin_fs, aes(x = Median, y = Dataset, color = factor(Feature_selection), fill = factor(Feature_selection))) + - geom_boxplot(alpha = 0.5) + - labs(title = 'Median Accuracy', - x = 'Value', - y = 'Dataset', - color = 'Are FS method used?', - fill = 'Are FS method used?') + - xlim(0.7, 1) + - paper_theme() + - scale_color_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - scale_fill_manual(values = c("#afc968","#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - theme(legend.position = "bottom", - plot.subtitle = element_blank(), - axis.title.y = element_blank()) - -a1 / c1 - -``` - -### Multiclass classification - -```{r echo=FALSE} -all_engines_mcl_fs <- all_engines_mcl -all_engines_mcl_fs <- all_engines_mcl_fs[all_engines_mcl_fs$Metric == 'accuracy', ] -all_engines_mcl_fs$Feature_selection <- ifelse(all_engines_mcl_fs$Feature_selection != 'none', 'Yes', 'No') - -``` - -```{r fig.height=10, fig.width=18, echo=FALSE, warning=FALSE} -a1 <- ggplot(data = all_engines_mcl_fs, aes(x = Max, y = Dataset, color = factor(Feature_selection), fill = factor(Feature_selection))) + - geom_boxplot(alpha = 
0.5) + - labs(title = 'Max Accuracy', - subtitle = 'for multiclass classification tasks, depending on whether FS methods were used', - x = 'Value', - y = 'Dataset', - color = 'Are FS method used?', - fill = 'Are FS method used?') + - xlim(0.5, 1) + - paper_theme() + - scale_color_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - scale_fill_manual(values = c("#afc968","#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - theme(legend.position = "none", - axis.title.y = element_blank(), - axis.title.x = element_blank()) - -b1 <- ggplot(data = all_engines_mcl_fs, aes(x = Mean, y = Dataset, color = factor(Feature_selection), fill = factor(Feature_selection))) + - geom_boxplot(alpha = 0.5) + - labs(title = 'Mean Accuracy', - x = 'Value', - y = 'Dataset', - color = 'Are FS method used?', - fill = 'Are FS method used?') + - xlim(0.5, 1) + - paper_theme() + - scale_color_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - scale_fill_manual(values = c("#afc968","#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - theme(legend.position = "none", - plot.subtitle = element_blank(), - axis.title.x = element_blank()) - -c1 <- ggplot(data = all_engines_mcl_fs, aes(x = Median, y = Dataset, color = factor(Feature_selection), fill = factor(Feature_selection))) + - geom_boxplot(alpha = 0.5) + - labs(title = 'Median Accuracy', - x = 'Value', - y = 'Dataset', - color = 'Are FS method used?', - fill = 'Are FS method used?') + - xlim(0.5, 1) + - paper_theme() + - scale_color_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - scale_fill_manual(values = c("#afc968","#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - theme(legend.position = "bottom", - plot.subtitle = element_blank(), - axis.title.y = element_blank()) - -a1 / c1 - -``` - -### Regression - -```{r echo=FALSE} -all_engines_reg_min_med_fs <- all_engines_reg_min_med -all_engines_reg_min_med_fs <- all_engines_reg_min_med_fs[all_engines_reg_min_med_fs$Metric == 'rmse', ] 
-all_engines_reg_min_med_fs$Feature_selection <- ifelse(all_engines_reg_min_med_fs$Feature_selection != 'none', 'Yes', 'No') -``` - -```{r fig.height=10, fig.width=18, echo=FALSE, warning=FALSE} -d1 <- ggplot(data = all_engines_reg_min_med_fs[all_engines_reg_min_med_fs$Aggregation == 'Min', ], - aes(x = Value, y = Dataset, color = factor(Feature_selection), fill = factor(Feature_selection))) + - geom_boxplot(alpha = 0.5) + - labs(title = 'Minimal RMSE values (Magnified)', - subtitle = 'for different regression tasks and preprocessing strategies', - x = 'RMSE', - y = 'Dataset', - color = 'Are FS method used?', - fill = 'Are FS method used?') + - coord_cartesian(xlim = c(0, 0.25)) + - paper_theme() + - scale_color_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - scale_fill_manual(values = c("#afc968","#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - theme(legend.position = "none", - plot.subtitle = element_blank(), - axis.title.x = element_blank(), - axis.title.y = element_blank()) - -e1 <- ggplot(data = all_engines_reg_min_med_fs[all_engines_reg_min_med_fs$Aggregation == 'Min', ], - aes(x = Value, y = Dataset, color = factor(Feature_selection), fill = factor(Feature_selection))) + - geom_boxplot(alpha = 0.5) + - labs(title = 'Minimal RMSE values (All)', - x = 'RMSE',) + - coord_cartesian(xlim = c(0, NA)) + - paper_theme() + - scale_color_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - scale_fill_manual(values = c("#afc968","#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - theme(legend.position = "none", - plot.subtitle = element_blank(), - axis.title.y = element_blank(), - axis.text.y = element_blank(), - axis.title.x = element_blank()) - -f1 <- ggplot(data = all_engines_reg_min_med_fs[all_engines_reg_min_med_fs$Aggregation == 'Median', ], - aes(x = Value, y = Dataset, color = factor(Feature_selection), fill = factor(Feature_selection))) + - geom_boxplot(alpha = 0.5) + - labs(title = 'Median RMSE values 
(Magnified)', - x = 'RMSE', - y = 'Dataset', - color = 'Are FS method used?', - fill = 'Are FS method used?') + - coord_cartesian(xlim = c(0, 0.25)) + - paper_theme() + - scale_color_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - scale_fill_manual(values = c("#afc968","#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - theme(legend.position = "bottom", - plot.subtitle = element_blank(), - axis.title.y = element_blank()) - -g1 <- ggplot(data = all_engines_reg_min_med_fs[all_engines_reg_min_med_fs$Aggregation == 'Median', ], - aes(x = Value, y = Dataset, color = factor(Feature_selection), fill = factor(Feature_selection))) + - geom_boxplot(alpha = 0.5) + - labs(title = 'Median RMSE values (All)', - x = 'RMSE', - color = 'Are FS method used?', - fill = 'Are FS method used?') + - coord_cartesian(xlim = c(0, NA)) + - paper_theme() + - scale_color_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - scale_fill_manual(values = c("#afc968","#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - theme(legend.position = "bottom", - plot.subtitle = element_blank(), - axis.title.y = element_blank(), - axis.text.y = element_blank()) - -(d1 | e1) / (f1 | g1) - -``` - -## Lack of Feature Selection - -### Binary Classification - -```{r echo=FALSE} -all_engines_bin_no_fs <- all_engines_bin_fs[all_engines_bin_fs$Feature_selection == 'No', ] -``` - -```{r fig.height=10, fig.width=18, echo=FALSE, warning=FALSE} -h1 <- ggplot(data = all_engines_bin_no_fs, aes(x = Max, y = Dataset, color = factor(Removal), fill = factor(Removal))) + - geom_boxplot(alpha = 0.5) + - labs(title = 'Max Accuracy', - subtitle = 'for binary classification tasks, without FS, depending on removal strategy', - x = 'Value', - y = 'Dataset', - color = 'Removal strategy', - fill = 'Removal strategy') + - xlim(0.7, 1) + - paper_theme() + - scale_color_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - scale_fill_manual(values = c("#afc968", 
"#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - theme(legend.position = "none", - axis.title.x = element_blank(), - axis.title.y = element_blank()) - -i1 <- ggplot(data = all_engines_bin_no_fs, aes(x = Median, y = Dataset, color = factor(Removal), fill = factor(Removal))) + - geom_boxplot(alpha = 0.5) + - labs(title = 'Median Accuracy', - x = 'Value', - y = 'Dataset', - color = 'Removal strategy', - fill = 'Removal strategy') + - xlim(0.7, 1) + - paper_theme() + - scale_color_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - scale_fill_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - theme(legend.position = "bottom", - axis.title.y = element_blank(), - plot.subtitle = element_blank()) - -h1 / i1 -``` - -### Multiclass classification - -```{r echo=FALSE} -all_engines_mcl_no_fs <- all_engines_mcl_fs[all_engines_mcl_fs$Feature_selection == 'No', ] -``` - -```{r fig.height=10, fig.width=18, echo=FALSE, warning=FALSE} -h1 <- ggplot(data = all_engines_mcl_no_fs, aes(x = Max, y = Dataset, color = factor(Removal), fill = factor(Removal))) + - geom_boxplot(alpha = 0.5) + - labs(title = 'Max Accuracy', - subtitle = 'for multiclass classification tasks, without FS, depending on removal strategy', - x = 'Value', - y = 'Dataset', - color = 'Removal strategy', - fill = 'Removal strategy') + - xlim(0.5, 1) + - paper_theme() + - scale_color_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - scale_fill_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - theme(legend.position = "none", - axis.title.x = element_blank(), - axis.title.y = element_blank()) - -i1 <- ggplot(data = all_engines_mcl_no_fs, aes(x = Median, y = Dataset, color = factor(Removal), fill = factor(Removal))) + - geom_boxplot(alpha = 0.5) + - labs(title = 'Median Accuracy', - x = 'Value', - y = 'Dataset', - color = 'Removal strategy', - fill = 'Removal strategy') + - xlim(0.5, 1) + - paper_theme() + - 
scale_color_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - scale_fill_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - theme(legend.position = "bottom", - axis.title.y = element_blank(), - plot.subtitle = element_blank()) - -h1 / i1 -``` - -### Regression - -```{r echo=FALSE} -all_engines_reg_no_fs <- all_engines_reg_min_med_fs[all_engines_reg_min_med_fs$Feature_selection == 'No', ] -``` - -```{r fig.height=10, fig.width=18, echo=FALSE, warning=FALSE} -l1 <- ggplot(data = all_engines_reg_no_fs[all_engines_reg_no_fs$Aggregation == 'Median', ], - aes(x = Value, y = Dataset, color = factor(Removal), fill = factor(Removal))) + - geom_boxplot(alpha = 0.5) + - labs(title = 'Median RMSE (Magnified)', - subtitle = 'for regression tasks, without FS, depending on removal strategy', - x = 'RMSE', - y = 'Dataset', - color = 'Removal strategy', - fill = 'Removal strategy') + - xlim(0, 0.25) + - paper_theme() + - scale_color_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - scale_fill_manual(values = c("#afc968","#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - theme(legend.position = "none", - axis.title.y = element_blank(), - axis.title.x = element_blank()) - -m1 <- ggplot(data = all_engines_reg_no_fs[all_engines_reg_no_fs$Aggregation == 'Median', ], - aes(x = Value, y = Dataset, color = factor(Removal), fill = factor(Removal))) + - geom_boxplot(alpha = 0.5) + - labs(title = 'Median RMSE', - x = 'RMSE', - y = 'Dataset', - color = 'Removal strategy', - fill = 'Removal strategy') + - xlim(0, NA) + - paper_theme() + - scale_color_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - scale_fill_manual(values = c("#afc968","#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - theme(legend.position = "none", - axis.title.y = element_blank(), - axis.title.x = element_blank(), - axis.text.y = element_blank(), - plot.subtitle = element_blank()) - -n1 <- ggplot(data = 
all_engines_reg_no_fs[all_engines_reg_no_fs$Aggregation == 'Min', ], - aes(x = Value, y = Dataset, color = factor(Removal), fill = factor(Removal))) + - geom_boxplot(alpha = 0.5) + - labs(title = 'Minimal (Magnified)', - x = 'RMSE', - y = 'Dataset', - color = 'Removal strategy', - fill = 'Removal strategy') + - xlim(0, 0.25) + - paper_theme() + - scale_color_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - scale_fill_manual(values = c("#afc968","#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - theme(legend.position = "bottom", - axis.title.y = element_blank(), - plot.subtitle = element_blank()) - -o1 <- ggplot(data = all_engines_reg_no_fs[all_engines_reg_no_fs$Aggregation == 'Min', ], - aes(x = Value, y = Dataset, color = factor(Removal), fill = factor(Removal))) + - geom_boxplot(alpha = 0.5) + - labs(title = 'Minimal RMSE', - x = 'RMSE', - y = 'Dataset', - color = 'Removal strategy', - fill = 'Removal strategy') + - xlim(0, NA) + - paper_theme() + - scale_color_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - scale_fill_manual(values = c("#afc968","#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - theme(legend.position = "bottom", - axis.title.y = element_blank(), - axis.text.y = element_blank(), - plot.subtitle = element_blank()) - -(l1 | m1) / (n1 | o1) - -``` - -## Feature selection only - -### Binary classification - -```{r echo=FALSE} -all_engines_bin_fs_only <- all_engines_bin_fs[all_engines_bin_fs$Feature_selection == 'Yes', ] -``` - -```{r fig.height=10, fig.width=18, echo=FALSE, warning=FALSE} -p1 <- ggplot(data = all_engines_bin_fs_only, aes(x = Max, y = Dataset, color = factor(Removal), fill = factor(Removal))) + - geom_boxplot(alpha = 0.5) + - geom_point(data = all_engines_bin_baselines[all_engines_bin_baselines$Metric == 'accuracy', ], - size = 5, shape = 4, aes(x = Max, y = Dataset), color = '#B1805B', fill = '#B1805B') + - labs(title = 'Max Accuracy', - subtitle = 'for binary classification 
tasks, with FS, depending on removal strategy', - x = 'Value', - y = 'Dataset', - color = 'Removal strategy', - fill = 'Removal strategy') + - xlim(0.7, 1) + - paper_theme() + - scale_color_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - scale_fill_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - theme(legend.position = "none", - axis.title.y = element_blank(), - axis.title.x = element_blank()) - -r1 <- ggplot(data = all_engines_bin_fs_only, aes(x = Median, y = Dataset, color = factor(Removal), fill = factor(Removal))) + - geom_boxplot(alpha = 0.5) + - geom_point(data = all_engines_bin_baselines[all_engines_bin_baselines$Metric == 'accuracy', ], - size = 5, shape = 4, aes(x = Median, y = Dataset), color = '#B1805B', fill = '#B1805B') + - labs(title = 'Median Accuracy', - x = 'Value', - y = 'Dataset', - color = 'Removal strategy', - fill = 'Removal strategy') + - xlim(0.7, 1) + - paper_theme() + - scale_color_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - scale_fill_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - theme(legend.position = "bottom", - axis.title.y = element_blank(), - plot.subtitle = element_blank()) - -p1 / r1 -``` - -### Multiclass classification - -```{r echo=FALSE} -all_engines_mcl_fs_only <- all_engines_mcl_fs[all_engines_mcl_fs$Feature_selection == 'Yes', ] -``` - -```{r fig.height=10, fig.width=18, echo=FALSE, warning=FALSE} -p1 <- ggplot(data = all_engines_mcl_fs_only, aes(x = Max, y = Dataset, color = factor(Removal), fill = factor(Removal))) + - geom_boxplot(alpha = 0.5) + - geom_point(data = all_engines_mcl_baselines[all_engines_mcl_baselines$Metric == 'accuracy', ], - size = 5, shape = 4, aes(x = Max, y = Dataset), color = '#B1805B', fill = '#B1805B') + - labs(title = 'Max Accuracy', - subtitle = 'for multiclass classification tasks, with FS, depending on removal strategy', - x = 'Value', - y = 'Dataset', - color = 
'Removal strategy', - fill = 'Removal strategy') + - xlim(0.5, 1) + - paper_theme() + - scale_color_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - scale_fill_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - theme(legend.position = "none", - axis.title.y = element_blank(), - axis.title.x = element_blank()) - -r1 <- ggplot(data = all_engines_mcl_fs_only, aes(x = Median, y = Dataset, color = factor(Removal), fill = factor(Removal))) + - geom_boxplot(alpha = 0.5) + - geom_point(data = all_engines_mcl_baselines[all_engines_mcl_baselines$Metric == 'accuracy', ], - size = 5, shape = 4, aes(x = Median, y = Dataset), color = '#B1805B', fill = '#B1805B') + - labs(title = 'Median Accuracy', - x = 'Value', - y = 'Dataset', - color = 'Removal strategy', - fill = 'Removal strategy') + - xlim(0.5, 1) + - paper_theme() + - scale_color_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - scale_fill_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - theme(legend.position = "bottom", - axis.title.y = element_blank(), - plot.subtitle = element_blank()) - -p1 / r1 -``` - -### Regression - -```{r echo=FALSE} -all_engines_reg_fs_only <- all_engines_reg_min_med_fs[all_engines_reg_min_med_fs$Feature_selection == 'Yes', ] -``` - -```{r fig.height=10, fig.width=18, echo=FALSE, warning=FALSE} -u1 <- ggplot(data = all_engines_reg_fs_only[all_engines_reg_fs_only$Aggregation == 'Median', ], - aes(x = Value, y = Dataset, color = factor(Removal), fill = factor(Removal))) + - geom_boxplot(alpha = 0.5) + - geom_point(data = all_engines_reg_baselines_min_med[which(all_engines_reg_baselines_min_med$Metric == 'rmse' & - all_engines_reg_baselines_min_med$Aggregation == 'Median'), ], - size = 4, shape = 4, aes(x = Value, y = Dataset), color = '#B1805B', fill = '#B1805B') + - labs(title = 'Median RMSE (Magnified)', - subtitle = 'with FS used for different binary classification tasks', - x = 
'RMSE', - y = 'Dataset', - color = 'Removal strategy', - fill = 'Removal strategy') + - xlim(0, 0.25) + - paper_theme() + - scale_color_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - scale_fill_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - theme(legend.position = "none", - axis.title.y = element_blank(), - axis.title.x = element_blank()) - -v1 <- ggplot(data = all_engines_reg_fs_only[all_engines_reg_fs_only$Aggregation == 'Median', ], - aes(x = Value, y = Dataset, color = factor(Removal), fill = factor(Removal))) + - geom_boxplot(alpha = 0.5) + - geom_point(data = all_engines_reg_baselines_min_med[which(all_engines_reg_baselines_min_med$Metric == 'rmse' & - all_engines_reg_baselines_min_med$Aggregation == 'Median'), ], - size = 4, shape = 4, aes(x = Value, y = Dataset), color = '#B1805B', fill = '#B1805B') + - labs(title = 'Median RMSE', - subtitle = 'without FS used for different binary classification tasks', - x = 'RMSE', - y = 'Dataset', - color = 'Removal strategy', - fill = 'Removal strategy') + - xlim(0, NA) + - paper_theme() + - scale_color_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - scale_fill_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - theme(legend.position = "none", - axis.title.y = element_blank(), - axis.title.x = element_blank(), - axis.text.y = element_blank()) - -w1 <- ggplot(data = all_engines_reg_fs_only[all_engines_reg_fs_only$Aggregation == 'Min', ], - aes(x = Value, y = Dataset, color = factor(Removal), fill = factor(Removal))) + - geom_boxplot(alpha = 0.5) + - geom_point(data = all_engines_reg_baselines_min_med[which(all_engines_reg_baselines_min_med$Metric == 'rmse' & - all_engines_reg_baselines_min_med$Aggregation == 'Min'), ], - size = 4, shape = 4, aes(x = Value, y = Dataset), color = '#B1805B', fill = '#B1805B') + - labs(title = 'Minimal (Magnified)', - subtitle = 'for different binary classification tasks', - 
x = 'RMSE', - y = 'Dataset', - color = 'Removal strategy', - fill = 'Removal strategy') + - xlim(0, 0.25) + - paper_theme() + - scale_color_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - scale_fill_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - theme(legend.position = "bottom", - axis.title.y = element_blank(), - plot.subtitle = element_blank()) - -x1 <- ggplot(data = all_engines_reg_fs_only[all_engines_reg_fs_only$Aggregation == 'Min', ], - aes(x = Value, y = Dataset, color = factor(Removal), fill = factor(Removal))) + - geom_boxplot(alpha = 0.5) + - geom_point(data = all_engines_reg_baselines_min_med[which(all_engines_reg_baselines_min_med$Metric == 'rmse' & - all_engines_reg_baselines_min_med$Aggregation == 'Min'), ], - size = 4, shape = 4, aes(x = Value, y = Dataset), color = '#B1805B', fill = '#B1805B') + - labs(title = 'Minimal RMSE', - subtitle = 'for different binary classification tasks', - x = 'RMSE', - y = 'Dataset', - color = 'Removal strategy', - fill = 'Removal strategy') + - xlim(0, NA) + - paper_theme() + - scale_color_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - scale_fill_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - theme(legend.position = "bottom", - axis.title.y = element_blank(), - axis.text.y = element_blank(), - plot.subtitle = element_blank()) - -(u1 | v1) / (w1 | x1) - -``` - -## Feature Selection methods and performance - -```{r} -all_engines_bin_fs_methods <- all_engines_bin[all_engines_bin$Feature_selection != 'none' & all_engines_bin$Metric == 'accuracy', ] -all_engines_mcl_fs_methods <- all_engines_mcl[all_engines_mcl$Feature_selection != 'none' & all_engines_mcl$Metric == 'accuracy', ] -all_engines_reg_fs_methods <- all_engines_reg[all_engines_reg$Feature_selection != 'none' & all_engines_reg$Metric == 'rmse', ] -``` - -### Binary classification - -```{r fig.height=10, fig.width=18, echo=FALSE, warning=FALSE} 
-p1 <- ggplot(data = all_engines_bin_fs_methods, aes(x = Max, y = Dataset, color = factor(Feature_selection), fill = factor(Feature_selection))) + - geom_boxplot(alpha = 0.5) + - labs(title = 'Max Accuracy', - subtitle = 'for binary classification tasks, depending on feature selection method', - x = 'Value', - y = 'Dataset', - color = 'Removal strategy', - fill = 'Removal strategy') + - xlim(0.7, 1) + - paper_theme() + - scale_color_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - scale_fill_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - theme(legend.position = "none", - axis.title.y = element_blank(), - axis.title.x = element_blank()) - -r1 <- ggplot(data = all_engines_bin_fs_methods, aes(x = Median, y = Dataset, color = factor(Feature_selection), fill = factor(Feature_selection))) + - geom_boxplot(alpha = 0.5) + - labs(title = 'Median Accuracy', - subtitle = 'for binary classification tasks, depending on feature selection method', - x = 'Value', - y = 'Dataset', - color = 'Removal strategy', - fill = 'Removal strategy') + - xlim(0.7, 1) + - paper_theme() + - scale_color_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - scale_fill_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - theme(legend.position = "bottom", - axis.title.y = element_blank(), - axis.title.x = element_blank(), - plot.subtitle = element_blank()) - -p1 / r1 -``` - -### Multiclass classification - -```{r fig.height=10, fig.width=18, echo=FALSE, warning=FALSE} -p1 <- ggplot(data = all_engines_mcl_fs_methods, aes(x = Max, y = Dataset, color = factor(Feature_selection), fill = factor(Feature_selection))) + - geom_boxplot(alpha = 0.5) + - labs(title = 'Max Accuracy', - subtitle = 'for multiclass classification tasks, depending on feature selection method', - x = 'Value', - y = 'Dataset', - color = 'Removal strategy', - fill = 'Removal strategy') + - xlim(0.5, 1) + - paper_theme() + - 
scale_color_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - scale_fill_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - theme(legend.position = "none", - axis.title.y = element_blank(), - axis.title.x = element_blank()) - -r1 <- ggplot(data = all_engines_mcl_fs_methods, aes(x = Median, y = Dataset, color = factor(Feature_selection), fill = factor(Feature_selection))) + - geom_boxplot(alpha = 0.5) + - labs(title = 'Median Accuracy', - subtitle = 'for multiclass classification tasks, depending on feature selection method', - x = 'Value', - y = 'Dataset', - color = 'Removal strategy', - fill = 'Removal strategy') + - xlim(0.5, 1) + - paper_theme() + - scale_color_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - scale_fill_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - theme(legend.position = "bottom", - axis.title.y = element_blank(), - axis.title.x = element_blank(), - plot.subtitle = element_blank()) - -p1 / r1 -``` - -### Regression - -```{r fig.height=10, fig.width=18, echo=FALSE, warning=FALSE} -u1 <- ggplot(data = all_engines_reg_fs_methods, aes(x = Min, y = Dataset, color = factor(Feature_selection), fill = factor(Feature_selection))) + - geom_boxplot(alpha = 0.5) + - labs(title = 'Median RMSE (Magnified)', - subtitle = 'for regression tasks, depending on feature selection method', - x = 'RMSE', - y = 'Dataset') + - xlim(0, 0.25) + - paper_theme() + - scale_color_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - scale_fill_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - theme(legend.position = "none", - axis.title.y = element_blank(), - axis.title.x = element_blank()) -v1 <- ggplot(data = all_engines_reg_fs_methods, aes(x = Min, y = Dataset, color = factor(Feature_selection), fill = factor(Feature_selection))) + - geom_boxplot(alpha = 0.5) + - labs(title = 'Median RMSE', - x = 'RMSE', - 
y = 'Dataset') + - xlim(0, NA) + - paper_theme() + - scale_color_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - scale_fill_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - theme(legend.position = "none", - axis.title.y = element_blank(), - axis.title.x = element_blank(), - axis.text.y = element_blank()) - -w1 <- ggplot(data = all_engines_reg_fs_methods, aes(x = Median, y = Dataset, color = factor(Feature_selection), fill = factor(Feature_selection))) + - geom_boxplot(alpha = 0.5) + - labs(title = 'Minimal (Magnified)', - x = 'RMSE', - y = 'Dataset', - color = 'Feature selection method', - fill = 'Feature selection method') + - xlim(0, 0.25) + - paper_theme() + - scale_color_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - scale_fill_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - theme(legend.position = "bottom", - axis.title.y = element_blank(), - plot.subtitle = element_blank()) - -x1 <- ggplot(data = all_engines_reg_fs_methods, aes(x = Median, y = Dataset, color = factor(Feature_selection), fill = factor(Feature_selection))) + - geom_boxplot(alpha = 0.5) + - labs(title = 'Minimal RMSE', - x = 'RMSE', - y = 'Dataset', - color = 'Feature selection method', - fill = 'Feature selection method') + - xlim(0, NA) + - paper_theme() + - scale_color_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - scale_fill_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - theme(legend.position = "bottom", - axis.title.y = element_blank(), - axis.text.y = element_blank(), - plot.subtitle = element_blank()) - -(u1 | v1) / (w1 | x1) - -``` - - -## Imputation impact - -```{r echo=FALSE} -all_engines_bin_imp_no_fs <- all_engines_bin[all_engines_bin$Dataset %in% c('breast-w', 'credit-approval', "credit-g-mod", "phoneme-mod", "car-mod", "satimage-mod", "elevators-mod", "kin8nm-mod"), ] -all_engines_bin_imp_no_fs <- 
all_engines_bin_imp_no_fs[all_engines_bin_imp_no_fs$Feature_selection == 'none', ] - -all_engines_mcl_imp_no_fs <- all_engines_mcl[all_engines_mcl$Dataset %in% c('breast-w', 'credit-approval', "credit-g-mod", "phoneme-mod", "car-mod", "satimage-mod", "elevators-mod", "kin8nm-mod"), ] -all_engines_mcl_imp_no_fs <- all_engines_mcl_imp_no_fs[all_engines_mcl_imp_no_fs$Feature_selection == 'none', ] - -all_engines_reg_imp_no_fs <- all_engines_reg[all_engines_reg$Dataset %in% c('breast-w', 'credit-approval', "credit-g-mod", "phoneme-mod", "car-mod", "satimage-mod", "elevators-mod", "kin8nm-mod"), ] -all_engines_reg_imp_no_fs <- all_engines_reg_imp_no_fs[all_engines_reg_imp_no_fs$Feature_selection == 'none', ] -``` - - -```{r fig.height=10, fig.width=20, echo=FALSE, warning=FALSE} -p1 <- ggplot(data = all_engines_bin_imp_no_fs, aes(x = Max, y = Dataset, color = factor(Imputation), fill = factor(Imputation))) + - geom_boxplot(alpha = 0.5) + - labs(title = 'Max Accuracy', - subtitle = 'for binary classification tasks, without FS,\ndepending on imputation method', - x = 'Value', - y = 'Dataset', - color = 'Imputation method', - fill = 'Imputation method') + - xlim(0.3, 1) + - paper_theme() + - scale_color_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - scale_fill_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - theme(legend.position = "none", - axis.title.y = element_blank(), - axis.title.x = element_blank()) - -r1 <- ggplot(data = all_engines_bin_imp_no_fs, aes(x = Median, y = Dataset, color = factor(Imputation), fill = factor(Imputation))) + - geom_boxplot(alpha = 0.5) + - labs(title = 'Median Accuracy', - x = 'Value', - y = 'Dataset', - color = 'Imputation method', - fill = 'Imputation method') + - xlim(0.3, 1) + - paper_theme() + - scale_color_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - scale_fill_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - 
theme(legend.position = "none", - axis.title.y = element_blank(), - plot.subtitle = element_blank()) - -s1 <- ggplot(data = all_engines_mcl_imp_no_fs, aes(x = Max, y = Dataset, color = factor(Imputation), fill = factor(Imputation))) + - geom_boxplot(alpha = 0.5) + - labs(title = 'Max Accuracy', - subtitle = 'for multiclass classification tasks', - x = 'Value', - y = 'Dataset', - color = 'Imputation method', - fill = 'Imputation method') + - xlim(0.3, 1) + - paper_theme() + - scale_color_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - scale_fill_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - theme(legend.position = "none", - axis.title.y = element_blank(), - axis.title.x = element_blank()) - -t1 <- ggplot(data = all_engines_mcl_imp_no_fs, aes(x = Median, y = Dataset, color = factor(Imputation), fill = factor(Imputation))) + - geom_boxplot(alpha = 0.5) + - labs(title = 'Median Accuracy', - x = 'Value', - y = 'Dataset', - color = 'Imputation method', - fill = 'Imputation method') + - xlim(0.3, 1) + - paper_theme() + - scale_color_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - scale_fill_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - theme(legend.position = "bottom", - axis.title.y = element_blank(), - plot.subtitle = element_blank()) - -u1 <- ggplot(data = all_engines_reg_imp_no_fs[all_engines_reg_imp_no_fs$Metric == 'rmse', ], - aes(x = Min, y = Dataset, color = factor(Imputation), fill = factor(Imputation))) + - geom_boxplot(alpha = 0.5) + - labs(title = 'Min RMSE', - subtitle = 'for regression tasks', - x = 'Value', - y = 'Dataset', - color = 'Imputation method', - fill = 'Imputation method') + - xlim(0, NA) + - paper_theme() + - scale_color_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - scale_fill_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - theme(legend.position = "none", - 
axis.title.y = element_blank(), - axis.title.x = element_blank()) - -w1 <- ggplot(data = all_engines_reg_imp_no_fs[all_engines_reg_imp_no_fs$Metric == 'rmse', ], - aes(x = Median, y = Dataset, color = factor(Imputation), fill = factor(Imputation))) + - geom_boxplot(alpha = 0.5) + - labs(title = 'Median RMSE', - x = 'Value', - y = 'Dataset', - color = 'Imputation method', - fill = 'Imputation method') + - xlim(0, NA) + - paper_theme() + - scale_color_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - scale_fill_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - theme(legend.position = "none", - axis.title.y = element_blank(), - plot.subtitle = element_blank()) - -(p1 / r1) | (s1 / t1) | (u1 / w1) -``` - -## Engines comparison - -```{r} -tree_engines <- validation_summary_table[validation_summary_table$Engine != 'all', ] - -tree_engines_bin <- tree_engines[tree_engines$Task_type == 'binary' & tree_engines$Metric == 'accuracy', ] -tree_engines_mcl <- tree_engines[tree_engines$Task_type == 'multiclass' & tree_engines$Metric == 'accuracy', ] -tree_engines_reg <- tree_engines[tree_engines$Task_type == 'regression' & tree_engines$Metric == 'rmse', ] - -tree_engines_bin_baselines <- tree_engines_bin[which(tree_engines_bin$Removal =='removal_min' & - tree_engines_bin$Imputation =='median-other' & - tree_engines_bin$Feature_selection =='none'), ] -tree_engines_mcl_baselines <- tree_engines_mcl[which(tree_engines_mcl$Removal =='removal_min' & - tree_engines_mcl$Imputation =='median-other' & - tree_engines_mcl$Feature_selection =='none'), ] -tree_engines_reg_baselines <- tree_engines_reg[which(tree_engines_reg$Removal =='removal_min' & - tree_engines_reg$Imputation =='median-other' & - tree_engines_reg$Feature_selection =='none'), ] -``` - -### Classification tasks - -```{r fig.height=12, fig.width=18, echo=FALSE, warning=FALSE} -o <- ggplot(data = tree_engines_bin, aes(x = Max, y = Dataset, color = factor(Engine), fill = 
factor(Engine))) + - geom_boxplot(alpha = 0.5) + - geom_point(data = tree_engines_bin_baselines, size = 4, shape = 4, - position = position_jitterdodge(), - aes(x = Max, y = Dataset, color = factor(Engine), fill = factor(Engine))) + - labs(title = 'Max Accuracy', - subtitle = 'for binary classification tasks, X-mark stands for baseline preprocessing', - x = 'Value', - y = 'Dataset', - color = 'Engine', - fill = 'Engine') + - xlim(0, 1) + - paper_theme() + - scale_color_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - scale_fill_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - theme(axis.title.x = element_blank(), - axis.title.y = element_blank(), - legend.position = "none") - -p <- ggplot(data = tree_engines_mcl, aes(x = Max, y = Dataset, color = factor(Engine), fill = factor(Engine))) + - geom_boxplot(alpha = 0.5) + - geom_point(data = tree_engines_mcl_baselines, size = 4, shape = 4, - position = position_jitterdodge(), - aes(x = Max, y = Dataset, color = factor(Engine), fill = factor(Engine))) + - labs(title = 'Max Accuracy', - subtitle = 'for multiclass classification tasks', - x = 'Value', - y = 'Dataset', - color = 'Engine', - fill = 'Engine') + - xlim(0, 1) + - paper_theme() + - scale_color_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - scale_fill_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - theme(axis.title.y = element_blank(), - legend.position = "bottom") -o / p -``` - -### Regression - -```{r fig.height=12, fig.width=24, echo=FALSE, warning=FALSE, message=FALSE} -r <- ggplot(data = tree_engines_reg, aes(x = Max, y = Dataset, color = factor(Engine), fill = factor(Engine))) + - geom_boxplot(alpha = 0.5) + - geom_point(data = tree_engines_reg_baselines, size = 6, shape = 4, - position = position_jitterdodge(), - aes(x = Max, y = Dataset, color = factor(Engine), fill = factor(Engine))) + - labs(title = 'Max RMSE', - subtitle = 'for 
regression tasks, with 5 different zooms, X-mark stands for baseline preprocessing', - x = 'RMSE', - y = 'Dataset', - color = 'Engine', - fill = 'Engine') + - xlim(0.0045, 0.007) + - paper_theme() + - scale_color_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - scale_fill_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - theme(axis.title.y = element_blank(), - legend.position = "none") -s <- r + - xlim(0.09, 0.14) + - theme(axis.title.y = element_blank(), - axis.text.y = element_blank(), - plot.subtitle = element_blank(), - plot.title = element_blank()) - -t <- s + - xlim(0.2, 0.26) + - theme(legend.position = "bottom") - -u <- t + - xlim(0.95, 4.5) + - theme(legend.position = "none") - -w <- u + - xlim(6, 17) - -r | s | t | u | w -``` diff --git a/docs/articles/AutoML24Workshop & MScThesis/07_MSc_performance_analysis.Rmd b/docs/articles/AutoML24Workshop & MScThesis/07_MSc_performance_analysis.Rmd deleted file mode 100644 index 75d970f..0000000 --- a/docs/articles/AutoML24Workshop & MScThesis/07_MSc_performance_analysis.Rmd +++ /dev/null @@ -1,1250 +0,0 @@ ---- -title: "Masters Thesis forester: Performance analysis" -author: "Hubert Ruczyński" -date: "`r Sys.Date()`" -output: - html_document: - toc: yes - toc_float: yes - toc_collapsed: yes - theme: lumen - toc_depth: 3 - number_sections: yes - code_folding: hide - latex_engine: xelatex ---- - -```{css, echo=FALSE} -body .main-container { - max-width: 1820px !important; - width: 1820px !important; -} -body { - max-width: 1820px !important; - width: 1820px !important; - font-family: Helvetica !important; - font-size: 16pt !important; -} -h1,h2,h3,h4,h5,h6{ - font-size: 24pt !important; -} -``` - -# Imports and settings - -```{r, warning=FALSE, message=FALSE} -library(ggplot2) -library(patchwork) -library(scales) -library(dplyr) -library(forcats) -library(kableExtra) -library(knitr) -library(DT) -library(GGally) -library(tidyr) -``` - -# Data import - -```{r} 
-duration_train_df <- readRDS('MSc_processed_results/training_duration.RData') -training_summary_table <- readRDS('MSc_processed_results/training_summary_table.RData') -testing_summary_table <- readRDS('MSc_processed_results/testing_summary_table.RData') -validation_summary_table <- readRDS('MSc_processed_results/validation_summary_table.RData') -``` - -# Name changes - -As the data comes from the `forester` package in a raw form, in order to prepare plots for the Thesis/paper we rename some values, so they look nicer on plots. - -```{r} -change_factors <- function(dataset, score = FALSE) { - dataset$Task_type <- as.factor(dataset$Task_type) - dataset$Task_type <- fct_recode(dataset$Task_type, 'Binary' = 'binary', 'Multiclass' = 'multiclass', 'Regression' = 'regression') - dataset$Feature_selection <- as.factor(dataset$Feature_selection) - dataset$Feature_selection <- fct_recode(dataset$Feature_selection, 'None' = 'none', 'VI' = 'VI', 'MCFS' = 'MCFS', 'MI' = 'MI', 'Boruta' = 'BORUTA') - dataset$Imputation <- as.factor(dataset$Imputation) - dataset$Imputation <- fct_recode(dataset$Imputation, 'Median-other' = 'median-other', 'Median-frequency' = 'median-frequency', 'KNN' = 'knn', 'MICE' = 'mice') - dataset$Removal <- as.factor(dataset$Removal) - dataset$Removal <- fct_recode(dataset$Removal, 'Min' = 'removal_min', 'Med' = 'removal_med', 'Max' = 'removal_max') - if (score) { - dataset$Engine <- as.factor(dataset$Engine) - dataset$Engine <- fct_recode(dataset$Engine, 'LightGBM' = 'lightgbm', 'CatBoost' = 'catboost', 'Random forest' = 'ranger', 'XGBoost' = 'xgboost', 'Decision tree' = 'decision_tree', 'All' = 'all') - } - return(dataset) -} - -duration_train_df <- change_factors(duration_train_df) -training_summary_table <- change_factors(training_summary_table, TRUE) -testing_summary_table <- change_factors(testing_summary_table, TRUE) -validation_summary_table <- change_factors(validation_summary_table, TRUE) -``` - -# Paper theme - -We define the custom theme for 
this work, so the adjustments are easier to make. - - ```{r} -paper_theme <- function() { - theme_minimal() + - theme(plot.title = element_text(colour = 'black', size = 26), - plot.subtitle = element_text(colour = 'black', size = 16), - axis.title.x = element_text(colour = 'black', size = 18), - axis.title.y = element_text(colour = 'black', size = 16), - axis.text.y = element_text(colour = "black", size = 16), - axis.text.x = element_text(colour = "black", size = 16), - strip.background = element_rect(fill = "white", color = "white"), - strip.text = element_text(size = 6 ), - strip.text.y.right = element_text(angle = 0), - legend.title = element_text(colour = 'black', size = 18), - legend.text = element_text(colour = "black", size = 16), - strip.text.y.left = element_text(size = 16, angle = 0, hjust = 1)) -} -``` - -# Results - -## Data preparation - -In this section we prepare the data for most of the analysis. We focus on validation results only, and the aggregations are computed over all engines. We also calculate the baselines, defined as the preprocessing strategy with minimal removal, median-other imputation, and no feature selection. 
- -```{r, echo=FALSE} -all_engines <- validation_summary_table[validation_summary_table$Engine == 'All', ] - -all_engines_bin <- all_engines[all_engines$Task_type == 'Binary' & all_engines$Metric == 'accuracy', ] -all_engines_mcl <- all_engines[all_engines$Task_type == 'Multiclass' & all_engines$Metric == 'accuracy', ] -all_engines_reg <- all_engines[all_engines$Task_type == 'Regression' & all_engines$Metric == 'r2', ] - -all_engines_bin_baselines <- all_engines_bin[which(all_engines_bin$Removal =='Min' & - all_engines_bin$Imputation =='Median-other' & - all_engines_bin$Feature_selection =='None'), ] -all_engines_mcl_baselines <- all_engines_mcl[which(all_engines_mcl$Removal =='Min' & - all_engines_mcl$Imputation =='Median-other' & - all_engines_mcl$Feature_selection =='None'), ] -all_engines_reg_baselines <- all_engines_reg[which(all_engines_reg$Removal =='Min' & - all_engines_reg$Imputation =='Median-other' & - all_engines_reg$Feature_selection =='None'), ] -``` - -```{r} -all_engines_bin_baselines$Fields <- all_engines_bin_baselines$Rows * all_engines_bin_baselines$Columns -all_engines_mcl_baselines$Fields <- all_engines_mcl_baselines$Rows * all_engines_mcl_baselines$Columns -all_engines_reg_baselines$Fields <- all_engines_reg_baselines$Rows * all_engines_reg_baselines$Columns - -all_engines_bin_baselines$Dataset <- as.factor(all_engines_bin_baselines$Dataset) -all_engines_bin_baselines$Dataset <- fct_reorder(all_engines_bin_baselines$Dataset, all_engines_bin_baselines$Fields) - -all_engines_mcl_baselines$Dataset <- as.factor(all_engines_mcl_baselines$Dataset) -all_engines_mcl_baselines$Dataset <- fct_reorder(all_engines_mcl_baselines$Dataset, all_engines_mcl_baselines$Fields) - -all_engines_reg_baselines$Dataset <- as.factor(all_engines_reg_baselines$Dataset) -all_engines_reg_baselines$Dataset <- fct_reorder(all_engines_reg_baselines$Dataset, all_engines_reg_baselines$Fields) - -all_engines_baselines <- rbind(all_engines_bin_baselines, 
all_engines_mcl_baselines, all_engines_reg_baselines) -``` - -## Automated preprocessibility function - -The next step is to define the function that calculates the preprocessibility value. Preprocessibility measures how much preprocessing strategies change the outcomes in comparison to the baseline. We divide it into positive and negative preprocessibility, where the first describes the positive impact and the second the negative one. The positive one is calculated as `max(Maximum - Baseline, 0)`, whereas the negative one as `min(Minimum - Baseline, 0)`. - -The function also counts how many times we get scores better than, equal to, or worse than the baseline. It can be used in 3 different scenarios, which differ in grouping complexity. - -```{r message=FALSE, warning=FALSE} -calculate_preprocessibility <- function(data, baselines, grouping, grouping2 = NULL) { - if (is.null(grouping)) { - preprocessibility <- data %>% - left_join(baselines, by = 'Dataset') %>% - rename(Baseline = Max.y, Max = Max.x, Min = Min.x, Task_type = Task_type.x) %>% - mutate(Win = ifelse(Max > Baseline, 1, 0)) %>% - mutate(Lose = ifelse(Max < Baseline, 1, 0)) %>% - mutate(Tie = ifelse(Max == Baseline, 1, 0)) %>% - group_by(Dataset) %>% - summarise(Minimum = round(min(Max), 4), - Maximum = round(max(Max), 4), - Wins = sum(Win), - Loses = sum(Lose), - Ties = sum(Tie), - Baseline = round(mean(Baseline), 4), - Win = round(min(max((Wins / (Wins + Loses + Ties)), 0), 1), 5), - Loss = round(min(max((Loses / (Wins + Loses + Ties)), 0), 1), 5), - Tie = round(min(max((Ties / (Wins + Loses + Ties)), 0), 1), 5), - Postive_preprocessibility = round(max(Maximum - Baseline, 0), 5), - Negative_preprocessibility = round(min(Minimum - Baseline, 0), 5)) %>% - select(Dataset, Maximum, Minimum, Baseline, Postive_preprocessibility, Negative_preprocessibility, Wins, Loses, Ties, Win, Loss, Tie) %>% - left_join(baselines, by = 'Dataset') %>% - select(Dataset, Task_type, 
Maximum, Minimum, Baseline, Postive_preprocessibility, Negative_preprocessibility, Wins, Loses, Ties, Win, Loss, Tie, Fields) - preprocessibility$Dataset <- as.factor(preprocessibility$Dataset) - preprocessibility$Dataset <- fct_reorder(preprocessibility$Dataset, preprocessibility$Fields) - } else if (is.null(grouping2)) { - preprocessibility <- data %>% - left_join(baselines, by = 'Dataset') %>% - rename(Baseline = Max.y, Max = Max.x, Min = Min.x, Task_type = Task_type.x, - grouping = paste0(grouping, '.x')) %>% - mutate(Win = ifelse(Max > Baseline, 1, 0)) %>% - mutate(Lose = ifelse(Max < Baseline, 1, 0)) %>% - mutate(Tie = ifelse(Max == Baseline, 1, 0)) %>% - group_by(Dataset, grouping) %>% - summarise(Minimum = round(min(Max), 4), - Maximum = round(max(Max), 4), - Wins = sum(Win), - Loses = sum(Lose), - Ties = sum(Tie), - Baseline = round(mean(Baseline), 4), - Win = round(min(max((Wins / (Wins + Loses + Ties)), 0), 1), 5), - Loss = round(min(max((Loses / (Wins + Loses + Ties)), 0), 1), 5), - Tie = round(min(max((Ties / (Wins + Loses + Ties)), 0), 1), 5), - Postive_preprocessibility = round(max(Maximum - Baseline, 0), 5), - Negative_preprocessibility = round(min(Minimum - Baseline, 0), 5)) %>% - select(Dataset, Maximum, Minimum, Baseline, Postive_preprocessibility, Negative_preprocessibility, Wins, Loses, Ties, Win, Loss, Tie, grouping) %>% - left_join(baselines, by = 'Dataset') %>% - select(Dataset, Task_type, Maximum, Minimum, Baseline, Postive_preprocessibility, Negative_preprocessibility, Wins, Loses, Ties, Win, Loss, Tie, Fields, grouping) - preprocessibility$Dataset <- as.factor(preprocessibility$Dataset) - preprocessibility$Dataset <- fct_reorder(preprocessibility$Dataset, preprocessibility$Fields) - } else { - preprocessibility <- data %>% - left_join(baselines, by = 'Dataset') %>% - rename(Baseline = Max.y, Max = Max.x, Min = Min.x, Task_type = Task_type.x, - grouping = paste0(grouping, '.x'), grouping2 = paste0(grouping2, '.x')) %>% - mutate(Win = 
ifelse(Max > Baseline, 1, 0)) %>% - mutate(Lose = ifelse(Max < Baseline, 1, 0)) %>% - mutate(Tie = ifelse(Max == Baseline, 1, 0)) %>% - group_by(Dataset, grouping, grouping2) %>% - summarise(Minimum = round(min(Max), 4), - Maximum = round(max(Max), 4), - Wins = sum(Win), - Loses = sum(Lose), - Ties = sum(Tie), - Baseline = round(mean(Baseline), 4), - Win = round(min(max((Wins / (Wins + Loses + Ties)), 0), 1), 5), - Loss = round(min(max((Loses / (Wins + Loses + Ties)), 0), 1), 5), - Tie = round(min(max((Ties / (Wins + Loses + Ties)), 0), 1), 5), - Postive_preprocessibility = round(max(Maximum - Baseline, 0), 5), - Negative_preprocessibility = round(min(Minimum - Baseline, 0), 5)) %>% - select(Dataset, Maximum, Minimum, Baseline, Postive_preprocessibility, Negative_preprocessibility, Wins, Loses, Ties, Win, Loss, Tie, grouping, grouping2) %>% - left_join(baselines, by = 'Dataset') %>% - select(Dataset, Task_type, Maximum, Minimum, Baseline, Postive_preprocessibility, Negative_preprocessibility, Wins, Loses, Ties, Win, Loss, Tie, Fields, grouping, grouping2) - preprocessibility$Dataset <- as.factor(preprocessibility$Dataset) - preprocessibility$Dataset <- fct_reorder(preprocessibility$Dataset, preprocessibility$Fields) - } - return(preprocessibility) -} -datatable(calculate_preprocessibility(all_engines_mcl, all_engines_mcl_baselines, 'Imputation', 'Removal'), width = '100%', options = list(scrollX = TRUE, paging = TRUE, pageLength = 5)) -``` - -## All Models - -In this case we calculate the results for all models. 
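As a small illustration of the preprocessibility definition, the sketch below computes the positive and negative values and the win/tie/loss counts for a single dataset. The scores and the baseline are hypothetical numbers chosen for readability, not results from the benchmark, and the variable names are assumptions of this example.

```r
# Toy scores for one dataset: the best metric value reached by each
# preprocessing strategy (hypothetical numbers, not benchmark results).
strategy_scores <- c(0.81, 0.84, 0.84, 0.79, 0.86)
baseline_score  <- 0.84

# Positive preprocessibility: gain of the best strategy over the baseline, floored at 0.
positive_preprocessibility <- max(max(strategy_scores) - baseline_score, 0)
# Negative preprocessibility: drop of the worst strategy below the baseline, capped at 0.
negative_preprocessibility <- min(min(strategy_scores) - baseline_score, 0)

# Number of strategies that beat, match, or fall below the baseline.
wins  <- sum(strategy_scores > baseline_score)
ties  <- sum(strategy_scores == baseline_score)
loses <- sum(strategy_scores < baseline_score)

round(c(positive = positive_preprocessibility, negative = negative_preprocessibility), 5)
c(wins = wins, ties = ties, loses = loses)
```

With these toy values, one strategy beats the baseline, two tie with it, and two fall below it, so both preprocessibility values are non-zero.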
- -```{r echo=FALSE, warning=FALSE, message=FALSE} -preprocessibility_bin <- calculate_preprocessibility(all_engines_bin, all_engines_bin_baselines, NULL) -preprocessibility_mcl <- calculate_preprocessibility(all_engines_mcl, all_engines_mcl_baselines, NULL) -preprocessibility_reg <- calculate_preprocessibility(all_engines_reg, all_engines_reg_baselines, NULL) - -preprocessibility <- rbind(preprocessibility_bin, preprocessibility_mcl, preprocessibility_reg) -datatable(preprocessibility, width = '100%', options = list(scrollX = TRUE, paging = TRUE, pageLength = 5)) -``` - -### Detailed plot - -```{r fig.height=12, fig.width=20, echo=FALSE, warning=FALSE} -a <- ggplot(data = preprocessibility, aes(color = Task_type, fill = Task_type)) + - geom_segment(aes(x = Maximum, xend = Minimum, y = Dataset, yend = Dataset), size = 1) + - geom_point(aes(x = Maximum, y = Dataset), size = 4) + - geom_point(aes(x = Minimum, y = Dataset), size = 4) + - geom_point(aes(x = Baseline, y = Dataset), shape = 13, size = 7) + - labs(title = 'Performance range', - subtitle = 'X-mark stands for baseline preprocessing \nAccuracy for clasification, R2 for regression', - x = 'Accuracy / R2', - y = 'Dataset', - color = 'Task type', - fill = 'Task type') + - paper_theme() + - scale_color_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - scale_fill_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - theme(axis.title.y = element_blank(), - legend.position = "bottom") - -b <- ggplot(data = preprocessibility, aes(color = Task_type, fill = Task_type)) + - geom_segment(aes(x = Postive_preprocessibility, xend = Negative_preprocessibility, y = Dataset, yend = Dataset), size = 1) + - geom_point(aes(x = Postive_preprocessibility, y = Dataset), size = 4) + - geom_point(aes(x = Negative_preprocessibility, y = Dataset), size = 4) + - labs(title = 'Preprocessibility', - subtitle = 'Left marker for negative \nRight for positive', - x = 'Preprocessibility', - 
y = 'Dataset') + - paper_theme() + - scale_color_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - scale_fill_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - theme(axis.title.y= element_blank(), - axis.text.y = element_blank(), - legend.position = "none") - -c <- ggplot(data = tidyr::pivot_longer(preprocessibility, cols = c(Loss, Tie, Win)), aes(color = name, fill = name)) + - geom_bar(aes(x = value, y = Dataset), stat = 'identity') + - labs(title = 'Comparison to baseline', - subtitle = 'For each preprocessing strategy', - x = 'Fraction', - y = 'Dataset', - color = 'Result', - fill = 'Result') + - paper_theme() + - #scale_color_manual(values = c("#f97c7c", "#e2e2e2", "#ededaf")) + - #scale_fill_manual(values = c("#f97c7c", "#e2e2e2", "#fcfcda")) + - scale_color_manual(values = c("#FF6B6B", "#889696", "#C3EB78")) + - scale_fill_manual(values = c("#FF6B6B", "#889696", "#C3EB78")) + - theme(axis.title.y = element_blank(), - axis.text.y = element_blank(), - legend.position = "bottom") - -a | b | c -``` - -### Stats - -```{r echo=FALSE, warning=FALSE, message=FALSE} -calcualte_stats <- function(preprocessibility, verbose = FALSE) { - percentage_of_wins <- round(sum(preprocessibility$Wins) / (sum(preprocessibility$Wins) + sum(preprocessibility$Loses) + sum(preprocessibility$Ties)), 3) - percentage_of_ties <- round(sum(preprocessibility$Ties) / (sum(preprocessibility$Wins) + sum(preprocessibility$Loses) + sum(preprocessibility$Ties)), 3) - percentage_of_loses <- round(sum(preprocessibility$Loses) / (sum(preprocessibility$Wins) + sum(preprocessibility$Loses) + sum(preprocessibility$Ties)), 3) - - postive_preprocessibility_count <- sum(preprocessibility$Postive_preprocessibility > 0) - neutral_preprocessibility_count <- nrow(preprocessibility[preprocessibility$Postive_preprocessibility == 0 & - preprocessibility$Negative_preprocessibility == 0, ]) - negative_preprocessibility_count <- 
sum(preprocessibility$Negative_preprocessibility < 0) - - postive_preprocessibility_mean <- round(mean(abs(preprocessibility$Postive_preprocessibility)), 3) - negative_preprocessibility_mean <- round(mean(abs(preprocessibility$Negative_preprocessibility)), 3) - - if (verbose) { - cat('Percentage of wins: ', percentage_of_wins, '\n') - cat('Percentage of ties: ', percentage_of_ties, '\n') - cat('Percentage of loses:', percentage_of_loses, '\n\n') - - cat('Positive preprocessibility count:', postive_preprocessibility_count, '\n') - cat('Neutral preprocessibility count:', neutral_preprocessibility_count, '\n') - cat('Negative preprocessibility count:', negative_preprocessibility_count, '\n\n') - - cat('Positive preprocessibility mean:', postive_preprocessibility_mean, '\n') - cat('Negative preprocessibility mean:', negative_preprocessibility_mean, '\n') - } - - df <- data.frame(Percentage_of_wins = percentage_of_wins, - Percentage_of_ties = percentage_of_ties, - Percentage_of_loses = percentage_of_loses, - Postive_preprocessibility_count = postive_preprocessibility_count, - Neutral_preprocessibility_count = neutral_preprocessibility_count, - Negative_preprocessibility_count = negative_preprocessibility_count, - Postive_preprocessibility_mean = postive_preprocessibility_mean, - Negative_preprocessibility_mean = negative_preprocessibility_mean) - return(df) -} - -stats_preprocessibility <- calcualte_stats(preprocessibility, TRUE) -``` - -Across all preprocessing strategies, preprocessing in most cases doesn't lead to any improvement. - -Performance improved in 15.5% of cases and diminished in 28% of cases in comparison to the baseline. - -The improvements were also smaller than the degradations (0.01 vs 0.05). - -It shows that the majority (57%) of preprocessing strategies seem to be redundant, yet the rest (43%) have an impact on the outcomes, thus we should investigate them further.
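The percentages above are pooled: `calcualte_stats` sums the win, tie and loss counts over all datasets before dividing. A minimal sketch of that pooling on made-up counts follows; the data frame `toy` and its values are assumptions of this example only.

```r
# Hypothetical per-dataset counts of strategies that beat, tied with,
# or lost to the baseline (illustrative values only).
toy <- data.frame(
  Dataset = c("A", "B", "C"),
  Wins    = c(10, 0, 5),
  Ties    = c(30, 38, 20),
  Loses   = c(0, 2, 15)
)

# Pool the counts over all datasets before dividing, so that every
# strategy run carries the same weight in the final share.
total <- sum(toy$Wins) + sum(toy$Ties) + sum(toy$Loses)
pooled_shares <- round(c(
  wins  = sum(toy$Wins)  / total,
  ties  = sum(toy$Ties)  / total,
  loses = sum(toy$Loses) / total
), 3)
pooled_shares
```

Pooling differs from averaging per-dataset fractions whenever datasets contribute different numbers of strategy runs.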
- -### Aggregated plot - -```{r fig.height=8, fig.width=18, echo=FALSE, warning=FALSE} -a <- ggplot(data = preprocessibility) + - geom_boxplot(aes(x = Postive_preprocessibility, y = Task_type), alpha = 0.5) + - geom_point(aes(x = Postive_preprocessibility, y = Task_type), size = 3) + - labs(x = 'Value', - y = 'Dataset', - color = 'Is FS used?', - fill = 'Is FS used?') + - paper_theme() + - scale_color_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - scale_fill_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - theme(axis.title.x = element_blank(), - axis.title.y = element_blank(), - axis.text.y = element_blank(), - legend.position = "bottom") - -b <- ggplot(data = preprocessibility) + - geom_boxplot(aes(x = Negative_preprocessibility, y = Task_type), alpha = 0.5) + - geom_point(aes(x = Negative_preprocessibility, y = Task_type), size = 3) + - labs(title = 'Tunability aggregated by task type', - subtitle = 'calculated for all strategies, divided by FS usage', - x = 'Value', - y = 'Dataset', - color = 'Is FS used?', - fill = 'Is FS used?') + - paper_theme() + - scale_color_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - scale_fill_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - theme(axis.title.x = element_blank(), - axis.title.y = element_blank(), - legend.position = "bottom") - -b | a -``` - -## Feature Selection Impact - -In this case we calculate the results depending on whether feature selection methods are used or not.
- -```{r echo=FALSE, warning=FALSE, message=FALSE} -all_engines_bin_fs <- all_engines_bin -all_engines_bin_fs$Feature_selection <- ifelse(all_engines_bin_fs$Feature_selection != 'None', 'Yes', 'No') -all_engines_mcl_fs <- all_engines_mcl -all_engines_mcl_fs$Feature_selection <- ifelse(all_engines_mcl_fs$Feature_selection != 'None', 'Yes', 'No') -all_engines_reg_fs <- all_engines_reg -all_engines_reg_fs$Feature_selection <- ifelse(all_engines_reg_fs$Feature_selection != 'None', 'Yes', 'No') - -preprocessibility_bin_fs <- calculate_preprocessibility(all_engines_bin_fs, all_engines_bin_baselines, 'Feature_selection') -preprocessibility_mcl_fs <- calculate_preprocessibility(all_engines_mcl_fs, all_engines_mcl_baselines, 'Feature_selection') -preprocessibility_reg_fs <- calculate_preprocessibility(all_engines_reg_fs, all_engines_reg_baselines, 'Feature_selection') -preprocessibility_fs <- rbind(preprocessibility_bin_fs, preprocessibility_mcl_fs, preprocessibility_reg_fs) -datatable(preprocessibility_fs, width = '100%', options = list(scrollX = TRUE, paging = TRUE, pageLength = 5)) -``` - -### Detailed plot - -```{r fig.height=16, fig.width=20, echo=FALSE, warning=FALSE} -detailed_plot <- function(data, grouping_name = 'grouping') { - a <- ggplot(data = data, aes(color = grouping, fill = grouping)) + - geom_segment(aes(x = Maximum, xend = Minimum, y = grouping, yend = grouping), size = 1) + - geom_point(aes(x = Maximum, y = grouping), size = 4) + - geom_point(aes(x = Minimum, y = grouping), size = 4) + - geom_point(aes(x = Baseline, y = grouping), shape = 13, size = 7) + - labs(title = 'Performance range', - subtitle = 'X-mark stands for baseline preprocessing \nAccuracy for clasification, R2 for regression', - x = 'Accuracy / R2', - y = 'Dataset', - color = grouping_name, - fill = grouping_name) + - paper_theme() + - scale_color_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - scale_fill_manual(values = c("#afc968", "#74533d", "#7C843C", 
"#B1805B", '#D6E29C')) + - theme(axis.title.y = element_blank(), - axis.text.y = element_blank(), - legend.position = "bottom") + - facet_wrap(~fct_rev(Dataset), ncol = 1, switch = 'y', dir = 'v', strip.position = 'right') + - theme(panel.spacing = unit(0.5, "lines")) - - b <- ggplot(data = data, aes(color = grouping, fill = grouping)) + - geom_segment(aes(x = Postive_preprocessibility, xend = Negative_preprocessibility, y = grouping, yend = grouping), size = 1) + - geom_point(aes(x = Postive_preprocessibility, y = grouping), size = 4) + - geom_point(aes(x = Negative_preprocessibility, y = grouping), size = 4) + - labs(title = 'Preprocessibility', - subtitle = 'Left marker for negative \nRigth for positive', - x = 'Preprocessibility', - y = 'Dataset', - color = grouping_name, - fill = grouping_name) + - paper_theme() + - scale_color_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - scale_fill_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - theme(axis.title.y= element_blank(), - axis.text.y = element_blank(), - legend.position = "none") + - facet_wrap(~fct_rev(Dataset), ncol = 1, switch = 'y', dir = 'v', strip.position = 'right') + - theme(strip.text.y.left = element_blank(), - panel.spacing = unit(0.5, "lines")) - - c <- ggplot(data = tidyr::pivot_longer(data, cols = c(Loss, Tie, Win)), aes(color = name, fill = name)) + - geom_bar(aes(x = value, y = grouping), stat = 'identity') + - labs(title = 'Comparison to baseline', - subtitle = 'For each preprocessing strategy', - x = 'Fraction', - y = 'Dataset', - color = 'Result', - fill = 'Result') + - paper_theme() + - #scale_color_manual(values = c("#f97c7c", "#e2e2e2", "#ededaf")) + - #scale_fill_manual(values = c("#f97c7c", "#e2e2e2", "#fcfcda")) + - scale_color_manual(values = c("#FF6B6B", "#889696", "#C3EB78")) + - scale_fill_manual(values = c("#FF6B6B", "#889696", "#C3EB78")) + - theme(axis.title.x = element_blank(), - axis.title.y = element_blank(), - 
axis.text.y = element_blank(), - legend.position = "bottom") + - facet_wrap(~fct_rev(Dataset), ncol = 1, switch = 'y', dir = 'v', strip.position = 'right') + - theme(strip.text.y.left = element_blank(), - panel.spacing = unit(0.5, "lines")) - - a | b | c -} - -detailed_plot(preprocessibility_fs, 'Is Feature Selection used?') - -``` - -### Stats - -```{r echo=FALSE, warning=FALSE, message=FALSE} -stats_preprocessibility_fs_yes <- calcualte_stats(preprocessibility_fs[preprocessibility_fs$grouping == 'Yes', ]) -stats_preprocessibility_fs_no <- calcualte_stats(preprocessibility_fs[preprocessibility_fs$grouping == 'No', ]) -stats_preprocessibility_fs <- rbind(stats_preprocessibility_fs_yes, stats_preprocessibility_fs_no) -rownames(stats_preprocessibility_fs) <- c('FS', 'No FS') -kable(t(stats_preprocessibility_fs)) -``` - -The most influential aspect of the preprocessing is the feature selection. - -If we use such methods, we achieve ties only 49% of the time, whereas without them it is 71%. - -However, it doesn't necessarily mean that the changes are for the better, as the percentage of wins grows from 12% to 17%, but the percentage of degradations grows from 17% to 34%. - -Additionally, the mean positive preprocessibility grows from 0.006 to 0.009, but the negative one grows from 0.013 to 0.47. - -It shows that feature selection is a powerful tool, but it should be used with caution.
- -### Aggregated plot - -```{r fig.height=8, fig.width=18, echo=FALSE, warning=FALSE} - -aggregated_plot <- function(data, grouping_name = 'grouping', aggregation = 'aggregation', line = 0.05) { - a <- ggplot(data = data, aes(color = grouping, fill = grouping)) + - geom_boxplot(aes(x = Negative_preprocessibility, y = Task_type), alpha = 0.7, outlier.size = 5) + - geom_vline(xintercept = -line, linetype = 'dashed', color = '#7C843C', linewidth = 1) + - labs(title = 'Preprocessibility distribution ', - subtitle = paste0('aggregated by task type, and grouped by ', aggregation), - x = 'Negative preprocessibility', - y = 'Dataset', - color = grouping_name, - fill = grouping_name) + - paper_theme() + - scale_color_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - scale_fill_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - theme(axis.title.y = element_blank(), - legend.position = "bottom") - - b <- ggplot(data = data, aes(color = grouping, fill = grouping)) + - geom_boxplot(aes(x = Postive_preprocessibility, y = Task_type), alpha = 0.7, outlier.size = 5) + - geom_vline(xintercept = line, linetype = 'dashed', color = '#7C843C', linewidth = 1) + - labs(x = 'Positive preprocessibility', - y = 'Dataset', - color = grouping_name, - fill = grouping_name) + - paper_theme() + - scale_color_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - scale_fill_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - xlim(0, line) + - theme(axis.title.y = element_blank(), - axis.text.y = element_blank(), - legend.position = "bottom") - - - (a | b) / guide_area() + plot_layout(guides = "collect") + plot_layout(height = c(1, 0.05)) -} - -aggregated_plot(preprocessibility_fs, 'Is Feature selection used?', 'feature selection usage') - -``` - -## Removal Impact - -In this case we calculate the results depending on which removal strategy we use.
- -```{r echo=FALSE, warning=FALSE, message=FALSE} -preprocessibility_bin_rm <- calculate_preprocessibility(all_engines_bin, all_engines_bin_baselines, 'Removal') -preprocessibility_mcl_rm <- calculate_preprocessibility(all_engines_mcl, all_engines_mcl_baselines, 'Removal') -preprocessibility_reg_rm <- calculate_preprocessibility(all_engines_reg, all_engines_reg_baselines, 'Removal') -preprocessibility_rm <- rbind(preprocessibility_bin_rm, preprocessibility_mcl_rm, preprocessibility_reg_rm) -datatable(preprocessibility_rm, width = '100%', options = list(scrollX = TRUE, paging = TRUE, pageLength = 5)) -``` - -### Detailed plot - -```{r fig.height=16, fig.width=20, echo=FALSE, warning=FALSE} -detailed_plot(preprocessibility_rm, 'Removal strategy') -``` - -### Stats - -```{r echo=FALSE, warning=FALSE, message=FALSE} -stats_preprocessibility_rm_min <- calcualte_stats(preprocessibility_rm[preprocessibility_rm$grouping == 'Min', ]) -stats_preprocessibility_rm_med <- calcualte_stats(preprocessibility_rm[preprocessibility_rm$grouping == 'Med', ]) -stats_preprocessibility_rm_max <- calcualte_stats(preprocessibility_rm[preprocessibility_rm$grouping == 'Max', ]) - -stats_preprocessibility_rm <- rbind(stats_preprocessibility_rm_min, stats_preprocessibility_rm_med, stats_preprocessibility_rm_max) -rownames(stats_preprocessibility_rm) <- c('Min', 'Med', 'Max') -knitr::kable(t(stats_preprocessibility_rm)) -``` - -The general comparison of removal options on all datasets shows only that the maximal strategy, which removes highly correlated features, is the least neutral one, but its effects are in most cases negative. The minimal and medium approaches seem to be much better.
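The stats chunks in this document call `calcualte_stats` once per group and bind the rows by hand. The same pattern could be written as a loop over the group labels; the sketch below shows this on toy data. The helper `summarise_group` is a simplified, hypothetical stand-in for `calcualte_stats` (it returns only the three pooled shares), and the counts are invented for illustration.

```r
# Hypothetical stand-in for a preprocessibility table:
# one row per dataset and removal strategy (illustrative counts only).
toy <- data.frame(
  grouping = c("Min", "Med", "Max", "Min", "Med", "Max"),
  Wins     = c(1, 2, 3, 0, 1, 2),
  Ties     = c(8, 6, 2, 9, 7, 3),
  Loses    = c(1, 2, 5, 1, 2, 5)
)

# Simplified summary, an assumption of this sketch: only the pooled shares.
summarise_group <- function(df) {
  total <- sum(df$Wins) + sum(df$Ties) + sum(df$Loses)
  data.frame(
    Percentage_of_wins  = round(sum(df$Wins)  / total, 3),
    Percentage_of_ties  = round(sum(df$Ties)  / total, 3),
    Percentage_of_loses = round(sum(df$Loses) / total, 3)
  )
}

# Apply the summary to each strategy in a fixed order and stack the rows.
strategies <- c("Min", "Med", "Max")
stats_by_strategy <- do.call(rbind, lapply(strategies, function(s) {
  summarise_group(toy[toy$grouping == s, ])
}))
rownames(stats_by_strategy) <- strategies
t(stats_by_strategy)
```

Keeping the group labels in one vector means the row names cannot drift out of sync with the order in which the groups were summarised.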
- -### Aggregated plot - -```{r fig.height=8, fig.width=18, echo=FALSE, warning=FALSE} -aggregated_plot(preprocessibility_rm, 'Removal strategy', 'removal strategy') -``` - -## Removal impact without FS - -In this case we calculate the results depending on which removal strategy we use, and we include only the strategies without feature selection methods. - -```{r echo=FALSE, warning=FALSE, message=FALSE} -preprocessibility_bin_rm_no_fs <- calculate_preprocessibility(all_engines_bin[all_engines_bin$Feature_selection == 'None', ], all_engines_bin_baselines, 'Removal') -preprocessibility_mcl_rm_no_fs <- calculate_preprocessibility(all_engines_mcl[all_engines_mcl$Feature_selection == 'None', ], all_engines_mcl_baselines, 'Removal') -preprocessibility_reg_rm_no_fs <- calculate_preprocessibility(all_engines_reg[all_engines_reg$Feature_selection == 'None', ], all_engines_reg_baselines, 'Removal') -preprocessibility_rm_no_fs <- rbind(preprocessibility_bin_rm_no_fs, preprocessibility_mcl_rm_no_fs, preprocessibility_reg_rm_no_fs) -datatable(preprocessibility_rm_no_fs, width = '100%', options = list(scrollX = TRUE, paging = TRUE, pageLength = 5)) -``` - -### Detailed plot - -```{r fig.height=20, fig.width=20, echo=FALSE, warning=FALSE} -detailed_plot(preprocessibility_rm_no_fs, 'Removal strategy') -``` - -### Stats - -```{r} -stats_preprocessibility_rm_no_fs_min <- calcualte_stats(preprocessibility_rm_no_fs[preprocessibility_rm_no_fs$grouping == 'Min', ]) -stats_preprocessibility_rm_no_fs_med <- calcualte_stats(preprocessibility_rm_no_fs[preprocessibility_rm_no_fs$grouping == 'Med', ]) -stats_preprocessibility_rm_no_fs_max <- calcualte_stats(preprocessibility_rm_no_fs[preprocessibility_rm_no_fs$grouping == 'Max', ]) - -stats_preprocessibility_rm_no_fs <- rbind(stats_preprocessibility_rm_no_fs_min, stats_preprocessibility_rm_no_fs_med, stats_preprocessibility_rm_no_fs_max) -rownames(stats_preprocessibility_rm_no_fs) <- c('Min', 'Med', 'Max') 
-knitr::kable(t(stats_preprocessibility_rm_no_fs)) -``` - -The comparison of results without feature selection is more interesting. - -Firstly, it shows that removal options are relatively safe, as the percentage of ties is the highest, and the percentage of losses is the lowest. - -We can also observe that the more complex the strategy we use, the fewer ties we get, and the more wins and losses. - -Additionally, we can see that the maximal strategy leads to a massive number of poor results, as both its negative preprocessibility and its percentage of losses are the highest. It indicates that we should not remove highly correlated features. - -### Aggregated plot - -```{r fig.height=8, fig.width=18, echo=FALSE, warning=FALSE} -aggregated_plot(preprocessibility_rm_no_fs, 'Removal strategy', 'removal strategy') -``` - -## Removal impact with only FS - -In this case we calculate the results depending on which removal strategy we use, and we include only the strategies that use feature selection. - -```{r echo=FALSE, warning=FALSE, message=FALSE} -preprocessibility_bin_rm_fs <- calculate_preprocessibility(all_engines_bin[all_engines_bin$Feature_selection != 'None', ], all_engines_bin_baselines, 'Removal') -preprocessibility_mcl_rm_fs <- calculate_preprocessibility(all_engines_mcl[all_engines_mcl$Feature_selection != 'None', ], all_engines_mcl_baselines, 'Removal') -preprocessibility_reg_rm_fs <- calculate_preprocessibility(all_engines_reg[all_engines_reg$Feature_selection != 'None', ], all_engines_reg_baselines, 'Removal') -preprocessibility_rm_fs <- rbind(preprocessibility_bin_rm_fs, preprocessibility_mcl_rm_fs, preprocessibility_reg_rm_fs) -datatable(preprocessibility_rm_fs, width = '100%', options = list(scrollX = TRUE, paging = TRUE, pageLength = 5)) -``` - -### Detailed plot - -```{r fig.height=16, fig.width=20, echo=FALSE, warning=FALSE} -detailed_plot(preprocessibility_rm_fs, 'Removal strategy') -``` - -### Stats - -```{r echo=FALSE, warning=FALSE, message=FALSE} 
-stats_preprocessibility_rm_fs_min <- calcualte_stats(preprocessibility_rm_fs[preprocessibility_rm_fs$grouping == 'Min', ]) -stats_preprocessibility_rm_fs_med <- calcualte_stats(preprocessibility_rm_fs[preprocessibility_rm_fs$grouping == 'Med', ]) -stats_preprocessibility_rm_fs_max <- calcualte_stats(preprocessibility_rm_fs[preprocessibility_rm_fs$grouping == 'Max', ]) - -stats_preprocessibility_rm_fs <- rbind(stats_preprocessibility_rm_fs_min, stats_preprocessibility_rm_fs_med, stats_preprocessibility_rm_fs_max) -rownames(stats_preprocessibility_rm_fs) <- c('Min', 'Med', 'Max') -knitr::kable(t(stats_preprocessibility_rm_fs)) -``` - -If we include only strategies with feature selection, we can observe that the number of ties is greatly diminished. - -Once more we can see that it doesn't improve the results, as it is mostly the number of losses that grows. - -Another important finding is that adding the feature selection methods affected the minimal strategy the most negatively, whereas the medium one benefits the most from it.
- -### Aggregated plot - -```{r fig.height=8, fig.width=18, echo=FALSE, warning=FALSE} -aggregated_plot(preprocessibility_rm_fs, 'Removal strategy', 'removal strategy') -``` - -## Feature Selection Methods Impact - -```{r echo=FALSE, warning=FALSE, message=FALSE} -preprocessibility_bin_fs_only <- calculate_preprocessibility(all_engines_bin, all_engines_bin_baselines, 'Feature_selection') -preprocessibility_mcl_fs_only <- calculate_preprocessibility(all_engines_mcl, all_engines_mcl_baselines, 'Feature_selection') -preprocessibility_reg_fs_only <- calculate_preprocessibility(all_engines_reg, all_engines_reg_baselines, 'Feature_selection') -preprocessibility_fs_only <- rbind(preprocessibility_bin_fs_only, preprocessibility_mcl_fs_only, preprocessibility_reg_fs_only) -datatable(preprocessibility_fs_only, width = '100%', options = list(scrollX = TRUE, paging = TRUE, pageLength = 5)) -``` - -### Detailed plot - -```{r fig.height=26, fig.width=20, echo=FALSE, warning=FALSE} -detailed_plot(preprocessibility_fs_only, 'Feature selection method') -``` - -### Stats - -```{r echo=FALSE, warning=FALSE, message=FALSE} -stats_preprocessibility_fs_only_BORUTA <- calcualte_stats(preprocessibility_fs_only[preprocessibility_fs_only$grouping == 'Boruta', ]) -stats_preprocessibility_fs_only_MCFS <- calcualte_stats(preprocessibility_fs_only[preprocessibility_fs_only$grouping == 'MCFS', ]) -stats_preprocessibility_fs_only_MI <- calcualte_stats(preprocessibility_fs_only[preprocessibility_fs_only$grouping == 'MI', ]) -stats_preprocessibility_fs_only_VI <- calcualte_stats(preprocessibility_fs_only[preprocessibility_fs_only$grouping == 'VI', ]) -stats_preprocessibility_fs_only_none <- calcualte_stats(preprocessibility_fs_only[preprocessibility_fs_only$grouping == 'None', ]) - -stats_preprocessibility_fs_only <- rbind(stats_preprocessibility_fs_only_BORUTA, stats_preprocessibility_fs_only_MCFS, stats_preprocessibility_fs_only_MI, - stats_preprocessibility_fs_only_VI, 
stats_preprocessibility_fs_only_none) -rownames(stats_preprocessibility_fs_only) <- c('Boruta', 'MCFS', 'MI', 'VI', 'None') -knitr::kable(t(stats_preprocessibility_fs_only)) -``` - -The comparison of feature selection methods shows that although most methods seem not worth considering, because of a high percentage of losses (MI 45%, VI 41%) or a generally low impact (MCFS 76% of ties), Boruta is an interesting exception. - -It achieves similar win and loss percentages (22.5% vs 24.5%), and its ratio of positive to negative preprocessibility is the best among all methods (0.008 vs 0.013), beating even the option without FS (0.006 vs 0.013). - -Unfortunately, the preprocessing time of Boruta is among the longest ones. - -### Aggregated plot - -```{r fig.height=8, fig.width=18, echo=FALSE, warning=FALSE} -aggregated_plot(preprocessibility_fs_only, 'Feature selection method', 'feature selection method') -``` - -## Imputation impact - -In this case we calculate the results depending on which imputation method we use.
- -```{r echo=FALSE, warning=FALSE, message=FALSE} -all_engines_bin_imp_no_fs <- all_engines_bin[all_engines_bin$Dataset %in% c('breast-w', 'credit-approval', "credit-g-mod", "phoneme-mod", - "car-mod", "satimage-mod", "elevators-mod", "kin8nm-mod"), ] -all_engines_bin_imp_no_fs <- all_engines_bin_imp_no_fs[all_engines_bin_imp_no_fs$Feature_selection == 'None', ] - -all_engines_mcl_imp_no_fs <- all_engines_mcl[all_engines_mcl$Dataset %in% c('breast-w', 'credit-approval', "credit-g-mod", "phoneme-mod", - "car-mod", "satimage-mod", "elevators-mod", "kin8nm-mod"), ] -all_engines_mcl_imp_no_fs <- all_engines_mcl_imp_no_fs[all_engines_mcl_imp_no_fs$Feature_selection == 'None', ] - -all_engines_reg_imp_no_fs <- all_engines_reg[all_engines_reg$Dataset %in% c('breast-w', 'credit-approval', "credit-g-mod", "phoneme-mod", - "car-mod", "satimage-mod", "elevators-mod", "kin8nm-mod"), ] -all_engines_reg_imp_no_fs <- all_engines_reg_imp_no_fs[all_engines_reg_imp_no_fs$Feature_selection == 'None', ] - -all_engines_bin_imp_no_fs <- calculate_preprocessibility(all_engines_bin_imp_no_fs, all_engines_bin_baselines, 'Imputation') -all_engines_mcl_imp_no_fs <- calculate_preprocessibility(all_engines_mcl_imp_no_fs, all_engines_mcl_baselines, 'Imputation') -all_engines_reg_imp_no_fs <- calculate_preprocessibility(all_engines_reg_imp_no_fs, all_engines_reg_baselines, 'Imputation') -preprocessibility_imp_no_fs <- rbind(all_engines_bin_imp_no_fs, all_engines_mcl_imp_no_fs, all_engines_reg_imp_no_fs) -datatable(preprocessibility_imp_no_fs, width = '100%', options = list(scrollX = TRUE, paging = TRUE, pageLength = 5)) -``` - -### Detailed plot - -```{r fig.height=10, fig.width=20, echo=FALSE, warning=FALSE} -detailed_plot(preprocessibility_imp_no_fs, 'Imputation method') -``` - -### Stats - -```{r echo=FALSE, warning=FALSE, message=FALSE} -stats_preprocessibility_imp_no_fs_knn <- calcualte_stats(preprocessibility_imp_no_fs[preprocessibility_imp_no_fs$grouping == 'KNN', ]) 
-stats_preprocessibility_imp_no_fs_median_frequency <- calcualte_stats(preprocessibility_imp_no_fs[preprocessibility_imp_no_fs$grouping == 'Median-frequency', ]) -stats_preprocessibility_imp_no_fs_median_other <- calcualte_stats(preprocessibility_imp_no_fs[preprocessibility_imp_no_fs$grouping == 'Median-other', ]) -stats_preprocessibility_imp_no_fs_mice <- calcualte_stats(preprocessibility_imp_no_fs[preprocessibility_imp_no_fs$grouping == 'MICE', ]) - -stats_preprocessibility_imp_no_fs <- rbind(stats_preprocessibility_imp_no_fs_knn, stats_preprocessibility_imp_no_fs_median_frequency, - stats_preprocessibility_imp_no_fs_median_other, stats_preprocessibility_imp_no_fs_mice) -rownames(stats_preprocessibility_imp_no_fs) <- c('KNN', 'Median-frequency', 'Median-other', 'MICE') -knitr::kable(t(stats_preprocessibility_imp_no_fs)) -``` - -The comparison of imputation methods shows that KNN is the best option, with the highest percentage of wins (58%) and the lowest percentage of losses (17%). Additionally, it yields the best ratio of positive to negative preprocessibility (0.007 vs 0.011). - -Quite surprisingly, MICE, which is also an advanced imputation algorithm, seems to be the worst, and loses to the basic median-based approaches. - -### Aggregated plot - -```{r fig.height=8, fig.width=18, echo=FALSE, warning=FALSE} -aggregated_plot(preprocessibility_imp_no_fs, 'Imputation method', 'imputation method') -``` - -## Models comparison - -In this case we calculate the results depending on the tree-based models used. - -### New function - -As this time we always group by the Dataset and Engine, we have to make the following changes to the `calculate_preprocessibility` function.
- -```{r} -calculate_preprocessibility_2 <- function(data, baselines, grouping) { - preprocessibility <- data %>% - left_join(baselines, by = c('Dataset', grouping)) %>% - rename(Baseline = Max.y, Max = Max.x, Min = Min.x, Task_type = Task_type.x) %>% - mutate(Win = ifelse(Max > Baseline, 1, 0)) %>% - mutate(Lose = ifelse(Max < Baseline, 1, 0)) %>% - mutate(Tie = ifelse(Max == Baseline, 1, 0)) %>% - group_by(Dataset, Engine) %>% - summarise(Minimum = round(min(Max), 4), - Maximum = round(max(Max), 4), - Wins = sum(Win), - Loses = sum(Lose), - Ties = sum(Tie), - Baseline = round(mean(Baseline), 4), - Win = round(min(max((Wins / (Wins + Loses + Ties)), 0), 1), 5), - Loss = round(min(max((Loses / (Wins + Loses + Ties)), 0), 1), 5), - Tie = round(min(max((Ties / (Wins + Loses + Ties)), 0), 1), 5), - Postive_preprocessibility = round(max(Maximum - Baseline, 0), 5), - Negative_preprocessibility = round(min(Minimum - Baseline, 0), 5)) %>% - select(Dataset, Maximum, Minimum, Baseline, Postive_preprocessibility, Negative_preprocessibility, Wins, Loses, Ties, Win, Loss, Tie, grouping, Engine) %>% - left_join(baselines, by = c('Dataset', 'Engine')) %>% - select(Dataset, Task_type, Maximum, Minimum, Baseline, Postive_preprocessibility, Negative_preprocessibility, Wins, Loses, Ties, Win, Loss, Tie, Fields, grouping) %>% - rename(grouping = Engine) - preprocessibility$Dataset <- as.factor(preprocessibility$Dataset) - - return(preprocessibility) -} -``` - -### Data preparation - -We prepare the data for engine analysis.
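The key difference in `calculate_preprocessibility_2` is that the baseline is joined on both `Dataset` and `Engine`, so every model is compared with its own baseline rather than a shared one. The sketch below illustrates this on toy data with base R `merge` instead of `dplyr`; the data frames, column values and scores are hypothetical.

```r
# Hypothetical scores: two engines on one dataset, two preprocessing strategies each.
scores <- data.frame(
  Dataset = "toy",
  Engine  = c("Decision tree", "Decision tree", "XGBoost", "XGBoost"),
  Max     = c(0.70, 0.74, 0.85, 0.83)
)
# One baseline row per (Dataset, Engine) pair.
baselines <- data.frame(
  Dataset  = "toy",
  Engine   = c("Decision tree", "XGBoost"),
  Baseline = c(0.72, 0.84)
)

# Joining on both keys pairs every score with the baseline of its own engine.
joined <- merge(scores, baselines, by = c("Dataset", "Engine"))
joined$Diff <- joined$Max - joined$Baseline

# Positive and negative preprocessibility per engine.
positive <- tapply(joined$Diff, joined$Engine, function(d) max(max(d), 0))
negative <- tapply(joined$Diff, joined$Engine, function(d) min(min(d), 0))
round(cbind(positive, negative), 5)
```

Joining on `Dataset` alone would instead pair each score with the baselines of all engines, which would mix models of very different strength.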
- -```{r echo=FALSE, warning=FALSE, message=FALSE} -tree_engines <- validation_summary_table[validation_summary_table$Engine != 'All', ] - -tree_engines_bin <- tree_engines[tree_engines$Task_type == 'Binary' & tree_engines$Metric == 'accuracy', ] -tree_engines_mcl <- tree_engines[tree_engines$Task_type == 'Multiclass' & tree_engines$Metric == 'accuracy', ] -tree_engines_reg <- tree_engines[tree_engines$Task_type == 'Regression' & tree_engines$Metric == 'r2', ] - -tree_engines_bin_baselines <- tree_engines_bin[which(tree_engines_bin$Removal =='Min' & - tree_engines_bin$Imputation =='Median-other' & - tree_engines_bin$Feature_selection =='None'), ] -tree_engines_mcl_baselines <- tree_engines_mcl[which(tree_engines_mcl$Removal =='Min' & - tree_engines_mcl$Imputation =='Median-other' & - tree_engines_mcl$Feature_selection =='None'), ] -tree_engines_reg_baselines <- tree_engines_reg[which(tree_engines_reg$Removal =='Min' & - tree_engines_reg$Imputation =='Median-other' & - tree_engines_reg$Feature_selection =='None'), ] - -tree_engines_bin_baselines$Fields <- tree_engines_bin_baselines$Rows * tree_engines_bin_baselines$Columns -tree_engines_mcl_baselines$Fields <- tree_engines_mcl_baselines$Rows * tree_engines_mcl_baselines$Columns -tree_engines_reg_baselines$Fields <- tree_engines_reg_baselines$Rows * tree_engines_reg_baselines$Columns - -tree_engines_bin_baselines$Dataset <- as.factor(tree_engines_bin_baselines$Dataset) -tree_engines_bin_baselines$Dataset <- fct_reorder(tree_engines_bin_baselines$Dataset, tree_engines_bin_baselines$Fields) - -tree_engines_mcl_baselines$Dataset <- as.factor(tree_engines_mcl_baselines$Dataset) -tree_engines_mcl_baselines$Dataset <- fct_reorder(tree_engines_mcl_baselines$Dataset, tree_engines_mcl_baselines$Fields) - -tree_engines_reg_baselines$Dataset <- as.factor(tree_engines_reg_baselines$Dataset) -tree_engines_reg_baselines$Dataset <- fct_reorder(tree_engines_reg_baselines$Dataset, tree_engines_reg_baselines$Fields) - 
-tree_engines_baselines <- rbind(tree_engines_bin_baselines, tree_engines_mcl_baselines, tree_engines_reg_baselines) -``` - -```{r echo=FALSE, warning=FALSE, message=FALSE} -tree_engines_bin_preprocessibility <- calculate_preprocessibility_2(tree_engines_bin, tree_engines_bin_baselines, 'Engine') -tree_engines_mcl_preprocessibility <- calculate_preprocessibility_2(tree_engines_mcl, tree_engines_mcl_baselines, 'Engine') -tree_engines_reg_preprocessibility <- calculate_preprocessibility_2(tree_engines_reg, tree_engines_reg_baselines, 'Engine') -tree_engines_preprocessibility <- rbind(tree_engines_bin_preprocessibility, tree_engines_mcl_preprocessibility, tree_engines_reg_preprocessibility) -datatable(tree_engines_preprocessibility, width = '100%', options = list(scrollX = TRUE, paging = TRUE, pageLength = 5)) -``` - -### Detailed plot - -```{r fig.height=26, fig.width=20, echo=FALSE, warning=FALSE} -detailed_plot(tree_engines_preprocessibility, 'Tree-based model') -``` - -### Stats - -```{r echo=FALSE, warning=FALSE, message=FALSE} -stats_tree_preprocessibility_decision_tree <- calcualte_stats(tree_engines_preprocessibility[tree_engines_preprocessibility$grouping == 'Decision tree', ]) -stats_tree_preprocessibility_ranger <- calcualte_stats(tree_engines_preprocessibility[tree_engines_preprocessibility$grouping == 'Random forest', ]) -stats_tree_preprocessibility_xgboost <- calcualte_stats(tree_engines_preprocessibility[tree_engines_preprocessibility$grouping == 'XGBoost', ]) -stats_tree_preprocessibility_lightgbm <- calcualte_stats(tree_engines_preprocessibility[tree_engines_preprocessibility$grouping == 'LightGBM', ]) -stats_tree_preprocessibility_catboost <- calcualte_stats(tree_engines_preprocessibility[tree_engines_preprocessibility$grouping == 'CatBoost', ]) - -stats_tree_preprocessibility <- rbind(stats_tree_preprocessibility_decision_tree, stats_tree_preprocessibility_ranger, stats_tree_preprocessibility_xgboost, - stats_tree_preprocessibility_lightgbm, 
 stats_tree_preprocessibility_catboost) -rownames(stats_tree_preprocessibility) <- c('Decision tree', 'Random forest', 'XGBoost', 'LightGBM', 'CatBoost') -knitr::kable(t(stats_tree_preprocessibility)) -``` - -```{r} -tree_engines_preprocessibility %>% - group_by(grouping) %>% - summarise(Maximum = mean(Maximum)) - -``` - -The model comparison is particularly interesting. Random forest benefits the most in terms of mean positive preprocessibility, yet it is also the weakest model in terms of performance (0.694). - -The second-best option is XGBoost, as it has the second-best performance (0.845) and good positive and negative preprocessibilities (0.01 vs 0.043). - -The best one, however, is CatBoost, as it achieves the highest scores (0.866), the largest share of wins (0.22), and a good positive preprocessibility value (0.01 vs 0.07). - -```{r fig.height=8, fig.width=18, echo=FALSE, warning=FALSE} -aggregated_plot(tree_engines_preprocessibility, 'Tree-based model', 'tree-based model') -``` - -## Models comparison fs only - -In this case we calculate the results depending on the tree-based model used, and focus on the cases where feature selection was used. 
- -```{r echo=FALSE, warning=FALSE, message=FALSE} -tree_engines_bin_fs_only <- tree_engines_bin[tree_engines_bin$Feature_selection != 'None', ] -tree_engines_mcl_fs_only <- tree_engines_mcl[tree_engines_mcl$Feature_selection != 'None', ] -tree_engines_reg_fs_only <- tree_engines_reg[tree_engines_reg$Feature_selection != 'None', ] -``` - -```{r echo=FALSE, warning=FALSE, message=FALSE} -tree_engines_bin_fs_only <- calculate_preprocessibility_2(tree_engines_bin_fs_only, tree_engines_bin_baselines, 'Engine') -tree_engines_mcl_fs_only <- calculate_preprocessibility_2(tree_engines_mcl_fs_only, tree_engines_mcl_baselines, 'Engine') -tree_engines_reg_fs_only <- calculate_preprocessibility_2(tree_engines_reg_fs_only, tree_engines_reg_baselines, 'Engine') -tree_preprocessibility_fs_only <- rbind(tree_engines_bin_fs_only, tree_engines_mcl_fs_only, tree_engines_reg_fs_only) -datatable(tree_preprocessibility_fs_only, width = '100%', options = list(scrollX = TRUE, paging = TRUE, pageLength = 5)) -``` - -### Detailed plot - -```{r fig.height=20, fig.width=20, echo=FALSE, warning=FALSE} -detailed_plot(tree_preprocessibility_fs_only, 'Tree-based model') -``` - -### Stats - -```{r echo=FALSE, warning=FALSE, message=FALSE} -stats_tree_preprocessibility_fs_only_decision_tree <- calcualte_stats(tree_preprocessibility_fs_only[tree_preprocessibility_fs_only$grouping == 'Decision tree', ]) -stats_tree_preprocessibility_fs_only_ranger <- calcualte_stats(tree_preprocessibility_fs_only[tree_preprocessibility_fs_only$grouping == 'Random forest', ]) -stats_tree_preprocessibility_fs_only_xgboost <- calcualte_stats(tree_preprocessibility_fs_only[tree_preprocessibility_fs_only$grouping == 'XGBoost', ]) -stats_tree_preprocessibility_fs_only_lightgbm <- calcualte_stats(tree_preprocessibility_fs_only[tree_preprocessibility_fs_only$grouping == 'LightGBM', ]) -stats_tree_preprocessibility_fs_only_catboost <- calcualte_stats(tree_preprocessibility_fs_only[tree_preprocessibility_fs_only$grouping == 
 'CatBoost', ]) - -stats_tree_preprocessibility_fs_only <- rbind(stats_tree_preprocessibility_fs_only_decision_tree, stats_tree_preprocessibility_fs_only_ranger, stats_tree_preprocessibility_fs_only_xgboost, stats_tree_preprocessibility_fs_only_lightgbm, stats_tree_preprocessibility_fs_only_catboost) -rownames(stats_tree_preprocessibility_fs_only) <- c('Decision tree', 'Random forest', 'XGBoost', 'LightGBM', 'CatBoost') -knitr::kable(t(stats_tree_preprocessibility_fs_only)) -``` - -When we look only at the cases where feature selection is used, the results are not very different, although we tend to lose to the baselines more often than before. - -### Aggregated plot - -```{r fig.height=8, fig.width=18, echo=FALSE, warning=FALSE} -aggregated_plot(tree_preprocessibility_fs_only, 'Tree-based model', 'tree-based model') -``` - -## Models comparison no fs - -In this case we calculate the results depending on the tree-based model used, and focus on the cases where feature selection was not used. - -```{r echo=FALSE, warning=FALSE, message=FALSE} -tree_engines_bin_fs_only <- tree_engines_bin[tree_engines_bin$Feature_selection == 'None', ] -tree_engines_mcl_fs_only <- tree_engines_mcl[tree_engines_mcl$Feature_selection == 'None', ] -tree_engines_reg_fs_only <- tree_engines_reg[tree_engines_reg$Feature_selection == 'None', ] -``` - -```{r echo=FALSE, warning=FALSE, message=FALSE} -tree_engines_bin_fs_only <- calculate_preprocessibility_2(tree_engines_bin_fs_only, tree_engines_bin_baselines, 'Engine') -tree_engines_mcl_fs_only <- calculate_preprocessibility_2(tree_engines_mcl_fs_only, tree_engines_mcl_baselines, 'Engine') -tree_engines_reg_fs_only <- calculate_preprocessibility_2(tree_engines_reg_fs_only, tree_engines_reg_baselines, 'Engine') -tree_preprocessibility_fs_only <- rbind(tree_engines_bin_fs_only, tree_engines_mcl_fs_only, tree_engines_reg_fs_only) -datatable(tree_preprocessibility_fs_only, width = '100%', options = list(scrollX = TRUE, paging = TRUE, pageLength = 5)) 
-``` - -### Detailed plot - -```{r fig.height=20, fig.width=20, echo=FALSE, warning=FALSE} -detailed_plot(tree_preprocessibility_fs_only, 'Tree-based model') -``` - -### Stats - -```{r echo=FALSE, warning=FALSE, message=FALSE} -stats_tree_preprocessibility_fs_only_decision_tree <- calcualte_stats(tree_preprocessibility_fs_only[tree_preprocessibility_fs_only$grouping == 'Decision tree', ]) -stats_tree_preprocessibility_fs_only_ranger <- calcualte_stats(tree_preprocessibility_fs_only[tree_preprocessibility_fs_only$grouping == 'Random forest', ]) -stats_tree_preprocessibility_fs_only_xgboost <- calcualte_stats(tree_preprocessibility_fs_only[tree_preprocessibility_fs_only$grouping == 'XGBoost', ]) -stats_tree_preprocessibility_fs_only_lightgbm <- calcualte_stats(tree_preprocessibility_fs_only[tree_preprocessibility_fs_only$grouping == 'LightGBM', ]) -stats_tree_preprocessibility_fs_only_catboost <- calcualte_stats(tree_preprocessibility_fs_only[tree_preprocessibility_fs_only$grouping == 'CatBoost', ]) - -stats_tree_preprocessibility_fs_only <- rbind(stats_tree_preprocessibility_fs_only_decision_tree, stats_tree_preprocessibility_fs_only_ranger, stats_tree_preprocessibility_fs_only_xgboost, stats_tree_preprocessibility_fs_only_lightgbm, stats_tree_preprocessibility_fs_only_catboost) -rownames(stats_tree_preprocessibility_fs_only) <- c('Decision tree', 'Random forest', 'XGBoost', 'LightGBM', 'CatBoost') -knitr::kable(t(stats_tree_preprocessibility_fs_only)) -``` - -If no feature selection is used, the number of losses diminishes drastically, and both the positive and negative preprocessibility values shrink. - -### Aggregated plot - -```{r fig.height=8, fig.width=18, echo=FALSE, warning=FALSE} -aggregated_plot(tree_preprocessibility_fs_only, 'Tree-based model', 'tree-based model') -``` - -## Parallel plots - -We additionally want to check whether anything interesting can be seen on the parallel plots; however, they do not give us any meaningful results. 
- -```{r echo=FALSE, warning=FALSE, message=FALSE} -tree_engines_bin_preprocessibility_imputation <- calculate_preprocessibility(tree_engines_bin, tree_engines_bin_baselines, 'Engine', 'Imputation') -tree_engines_mcl_preprocessibility_imputation <- calculate_preprocessibility(tree_engines_mcl, tree_engines_mcl_baselines, 'Engine', 'Imputation') -tree_engines_reg_preprocessibility_imputation <- calculate_preprocessibility(tree_engines_reg, tree_engines_reg_baselines, 'Engine', 'Imputation') -tree_preprocessibility_imputation <- rbind(tree_engines_bin_preprocessibility_imputation, tree_engines_mcl_preprocessibility_imputation, tree_engines_reg_preprocessibility_imputation) -tree_preprocessibility_imputation <- tree_preprocessibility_imputation[rep(c(rep(FALSE, 4), TRUE), 500), ] -parcoord_1 <- tree_preprocessibility_imputation[, c(1, 6, 7, 15, 16)] %>% pivot_wider(names_from = grouping2, values_from = Postive_preprocessibility) -parcoord_4 <- tree_preprocessibility_imputation[, c(1, 6, 7, 15, 16)] %>% pivot_wider(names_from = grouping2, values_from = Negative_preprocessibility) - -tree_engines_bin_preprocessibility_imputation <- calculate_preprocessibility(tree_engines_bin, tree_engines_bin_baselines, 'Engine', 'Removal') -tree_engines_mcl_preprocessibility_imputation <- calculate_preprocessibility(tree_engines_mcl, tree_engines_mcl_baselines, 'Engine', 'Removal') -tree_engines_reg_preprocessibility_imputation <- calculate_preprocessibility(tree_engines_reg, tree_engines_reg_baselines, 'Engine', 'Removal') -tree_preprocessibility_imputation <- rbind(tree_engines_bin_preprocessibility_imputation, tree_engines_mcl_preprocessibility_imputation, tree_engines_reg_preprocessibility_imputation) -tree_preprocessibility_imputation <- tree_preprocessibility_imputation[rep(c(rep(FALSE, 4), TRUE), 375), ] -parcoord_2 <- tree_preprocessibility_imputation[, c(1, 6, 7, 15, 16)] %>% pivot_wider(names_from = grouping2, values_from = Postive_preprocessibility) -parcoord_5 <- 
tree_preprocessibility_imputation[, c(1, 6, 7, 15, 16)] %>% pivot_wider(names_from = grouping2, values_from = Negative_preprocessibility) - -tree_engines_bin_preprocessibility_imputation <- calculate_preprocessibility(tree_engines_bin, tree_engines_bin_baselines, 'Engine', 'Feature_selection') -tree_engines_mcl_preprocessibility_imputation <- calculate_preprocessibility(tree_engines_mcl, tree_engines_mcl_baselines, 'Engine', 'Feature_selection') -tree_engines_reg_preprocessibility_imputation <- calculate_preprocessibility(tree_engines_reg, tree_engines_reg_baselines, 'Engine', 'Feature_selection') -tree_preprocessibility_imputation <- rbind(tree_engines_bin_preprocessibility_imputation, tree_engines_mcl_preprocessibility_imputation, tree_engines_reg_preprocessibility_imputation) -tree_preprocessibility_imputation <- tree_preprocessibility_imputation[rep(c(rep(FALSE, 4), TRUE), 625), ] -parcoord_3 <- tree_preprocessibility_imputation[, c(1, 6, 7, 15, 16)] %>% pivot_wider(names_from = grouping2, values_from = Postive_preprocessibility) -parcoord_6 <- tree_preprocessibility_imputation[, c(1, 6, 7, 15, 16)] %>% pivot_wider(names_from = grouping2, values_from = Negative_preprocessibility) -datatable(parcoord_3, width = '100%', options = list(scrollX = TRUE, paging = TRUE, pageLength = 5)) -``` - -```{r fig.height=12, fig.width=24, echo=FALSE, warning=FALSE} -a <- ggparcoord(parcoord_1, columns = c(4:7), groupColumn = 3, scale = "globalminmax", showPoints = TRUE) + - labs(title = 'Parallel plot', - subtitle = 'for imputation methods', - x = 'Imputation method', - y = 'Positive preprocessibility', - color = 'Tree-based model', - fill = 'Tree-based model') + - ylim(0, 0.255) + - paper_theme() + - scale_color_manual(values = c("#74533d", "#7C843C", "#afc968", "#B1805B", '#D6E29C')) + - scale_fill_manual(values = c("#74533d", "#7C843C", "#afc968", "#B1805B", '#D6E29C')) + - theme(axis.title.x = element_blank(), - axis.text.x = element_blank(), - legend.position = "none") - 
-b <- ggparcoord(parcoord_2, columns = c(4:6), groupColumn = 3, scale = "globalminmax", showPoints = TRUE) + - labs(title = 'Parallel plot', - subtitle = 'for removal strategies', - x = 'Removal strategy', - y = 'Positive preprocessibility', - color = 'Tree-based model', - fill = 'Tree-based model') + - ylim(0, 0.255) + - paper_theme() + - scale_color_manual(values = c("#74533d", "#7C843C", "#afc968", "#B1805B", '#D6E29C')) + - scale_fill_manual(values = c("#74533d", "#7C843C", "#afc968", "#B1805B", '#D6E29C')) + - theme(axis.title.y = element_blank(), - axis.text.y = element_blank(), - axis.title.x = element_blank(), - axis.text.x = element_blank(), - legend.position = "none") - -c <- ggparcoord(parcoord_3, columns = c(4:8), groupColumn = 3, scale = "globalminmax", showPoints = TRUE) + - labs(title = 'Parallel plot', - subtitle = 'for feature selection methods', - x = 'Feature selection method', - y = 'Positive preprocessibility', - color = 'Tree-based model', - fill = 'Tree-based model') + - ylim(0, 0.255) + - paper_theme() + - scale_color_manual(values = c("#74533d", "#7C843C", "#afc968", "#B1805B", '#D6E29C')) + - scale_fill_manual(values = c("#74533d", "#7C843C", "#afc968", "#B1805B", '#D6E29C')) + - theme(axis.title.y = element_blank(), - axis.text.y = element_blank(), - axis.title.x = element_blank(), - axis.text.x = element_blank(), - legend.position = "none") - -a1 <- ggparcoord(parcoord_4, columns = c(4:7), groupColumn = 3, scale = "globalminmax", showPoints = TRUE) + - labs(title = 'Parallel plot', - subtitle = 'for imputation methods', - x = 'Imputation method', - y = 'Negative preprocessibility', - color = 'Tree-based model', - fill = 'Tree-based model') + - ylim(-1, 0) + - paper_theme() + - scale_color_manual(values = c("#74533d", "#7C843C", "#afc968", "#B1805B", '#D6E29C')) + - scale_fill_manual(values = c("#74533d", "#7C843C", "#afc968", "#B1805B", '#D6E29C')) + - theme(plot.title = element_blank(), - plot.subtitle = element_blank(), - 
 legend.position = "none") - -b1 <- ggparcoord(parcoord_5, columns = c(4:6), groupColumn = 3, scale = "globalminmax", showPoints = TRUE) + - labs(title = 'Parallel plot', - subtitle = 'for removal strategies', - x = 'Removal strategy', - y = 'Negative preprocessibility', - color = 'Tree-based model', - fill = 'Tree-based model') + - ylim(-1, 0) + - paper_theme() + - scale_color_manual(values = c("#74533d", "#7C843C", "#afc968", "#B1805B", '#D6E29C')) + - scale_fill_manual(values = c("#74533d", "#7C843C", "#afc968", "#B1805B", '#D6E29C')) + - theme(plot.title = element_blank(), - plot.subtitle = element_blank(), - axis.title.y = element_blank(), - axis.text.y = element_blank(), - legend.position = "bottom") - -c1 <- ggparcoord(parcoord_6, columns = c(4:8), groupColumn = 3, scale = "globalminmax", showPoints = TRUE) + - labs(title = 'Parallel plot', - subtitle = 'for feature selection methods', - x = 'Feature selection method', - y = 'Negative preprocessibility', - color = 'Tree-based model', - fill = 'Tree-based model') + - ylim(-1, 0) + - paper_theme() + - scale_color_manual(values = c("#74533d", "#7C843C", "#afc968", "#B1805B", '#D6E29C')) + - scale_fill_manual(values = c("#74533d", "#7C843C", "#afc968", "#B1805B", '#D6E29C')) + - theme(plot.title = element_blank(), - plot.subtitle = element_blank(), - axis.title.y = element_blank(), - axis.text.y = element_blank(), - legend.position = "none") - -(a | b | c) / (a1 | b1 | c1) -``` - -# Summary - -Now, let's try to summarise all the information we have gathered so far: - -5.3 - All models - -- Preprocessing strategies in most cases (56.5%) have no impact on the performance of tree-based models, -- This does not, however, mean that they don't need preprocessing, as in 15.5% of cases it has a positive impact, and in 28% a negative one, -- We should note that the average improvements are much smaller than the degradations (preprocessibility: 0.01 vs 0.05). 
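The win, tie, and loss shares and the two preprocessibility values quoted in this summary can be illustrated on a toy example. The sketch below mirrors the logic of `calculate_preprocessibility_2` in base R for a single dataset and model; the scores and the baseline are made-up values, not results from this study.

```r
# Toy sketch (made-up numbers): scores of several preprocessing strategies
# for one dataset/model pair, compared with the baseline strategy's score.
scores   <- c(0.80, 0.83, 0.79, 0.80, 0.85)
baseline <- 0.80

# Share of strategies that beat, match, or fall below the baseline.
win_share  <- mean(scores > baseline)
tie_share  <- mean(scores == baseline)
loss_share <- mean(scores < baseline)

# Positive preprocessibility: best gain over the baseline, floored at 0.
positive_preprocessibility <- round(max(max(scores) - baseline, 0), 5)

# Negative preprocessibility: worst drop below the baseline, capped at 0.
negative_preprocessibility <- round(min(min(scores) - baseline, 0), 5)

print(c(win = win_share, tie = tie_share, loss = loss_share))
print(c(positive = positive_preprocessibility, negative = negative_preprocessibility))
```

The floor and cap at zero mean that a model which never beats its baseline has a positive preprocessibility of exactly 0 rather than a negative number, which keeps the two measures separate.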
- -5.4 - Feature selection - -- Feature selection has the biggest impact on the performance of tree-based models, as the share of ties is 48.5% with FS and 70.5% without it, -- Unfortunately, the impact is mostly negative, as the share of wins grows from 12% to 17%, but the share of degradations grows from 17% to 34%, -- Furthermore, the mean positive preprocessibility grows from 0.006 to 0.009, but the negative one grows from 0.013 to 0.47, -- This shows that feature selection is a powerful tool, but it should be used with caution. - -5.5 - Removal - all models - -- Removal strategies have the smallest impact on the performance of tree-based models, as the shares of ties for the min, med, and max strategies are 63%, 61%, and 45% respectively, -- It is also the most balanced strategy, as the positive preprocessibilities are 0.008, 0.008, and 0.007, and the negative ones are 0.04, 0.036, and 0.039, -- We can also notice that trees deal pretty well with highly correlated columns, as the max strategy has the highest share of degradations (38%), whereas the other approaches stay at 24% and 23%. - -5.6 - Removal - no feature selection - -- Without FS, we get many more ties than before (85%, 71%, and 55%), which shows that removal strategies are the most neutral approach, -- Once more the max strategy seems to be the worst, as it loses much more often than the others (5%, 15%, 32%), and wins similarly often (10%, 13%, 13%), -- We should probably use the min or med approaches, as they show a positive impact on the final results (0.005 vs 0.002, 0.005 vs 0.003, 0.005 vs 0.011). - -5.7 - Removal - feature selection only - -- FS methods once more prove to lower the performance, as each removal strategy loses more often than it wins (14.5% vs 31%, 19% vs 30%, 19.5% vs 41%), -- However, this time the medium approach benefits the most from feature selection, thus it should be used when we also include FS. 
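The with/without feature selection comparisons above boil down to a cross-tabulation of outcomes. The sketch below shows one way to compute such shares in base R; the data frame, the `Outcome` column, and the `Uses_fs` flag are hypothetical and serve only as an illustration.

```r
# Toy sketch (made-up rows): one row per evaluated preprocessing strategy,
# with the outcome of its comparison against the baseline.
toy <- data.frame(
  Feature_selection = c('None', 'None', 'None', 'None', 'Boruta', 'Boruta', 'MI', 'VI'),
  Outcome           = c('Tie', 'Tie', 'Tie', 'Win', 'Tie', 'Loss', 'Loss', 'Win')
)

# Flag the strategies that use any feature selection method.
toy$Uses_fs <- toy$Feature_selection != 'None'

# Row-wise shares of losses, ties, and wins with and without feature selection.
outcome_shares <- prop.table(table(Uses_fs = toy$Uses_fs, Outcome = toy$Outcome), margin = 1)
print(round(outcome_shares, 3))
```

Normalising by row (`margin = 1`) makes the two groups comparable even though they contain different numbers of strategies.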
- -5.8 - Feature selection methods impact - -- Different FS methods contribute to the performance differently, -- The two worst methods are MI and VI, as they have many more losses than wins (14.5% vs 35%, 22.5% vs 41%), -- MCFS seems to balance the results even more than the lack of FS, as the number of ties is higher (76%, 70.5%), and the preprocessibility ratios are smaller (0.001 vs 0.01, 0.006 vs 0.13), however it is still a rather negative outcome, -- The best method, beating the lack of FS, is Boruta, as it has a higher positive preprocessibility (0.008 vs 0.013). It also wins almost as often as it loses (22.5% vs 24.5%). - -5.9 - Imputation impact - -- The KNN algorithm is the strongest imputation method, as it wins in the majority of cases (58% vs 16.5%). Furthermore, it has the highest positive impact (0.017), and the lowest negative one (0.011), -- The MICE algorithm, on the other hand, is the worst, as it has a low win ratio and the highest number of losses (29% vs 58%), -- The median approaches are similar, and relatively neutral (29% vs 29%, 27.5% vs 32.5%). - -5.10 - Models comparison - -- Random forest benefits the most in terms of mean positive preprocessibility, yet it is also the weakest model in terms of performance (0.694). -- The second-best option is XGBoost, as it has the second-best performance (0.845) and good positive and negative preprocessibilities (0.01 vs 0.043). -- The best one is CatBoost, as it achieves the highest scores (0.866), the largest share of wins (0.22), and a good positive preprocessibility value (0.01 vs 0.07). -- The other models do not benefit from preprocessing, and lose more often. 
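The per-model performance figures in the comparison above come from averaging the best score of each dataset within every model. A base R equivalent of that `group_by`/`summarise` step can be sketched as follows; the scores in the toy data frame are made up.

```r
# Toy sketch (made-up scores): best score per dataset for three models.
toy <- data.frame(
  grouping = c('CatBoost', 'CatBoost', 'XGBoost', 'XGBoost', 'Random forest', 'Random forest'),
  Maximum  = c(0.90, 0.84, 0.88, 0.80, 0.70, 0.68)
)

# Mean of the best scores within each model.
mean_maximum <- aggregate(Maximum ~ grouping, data = toy, FUN = mean)

# Order the models from the strongest to the weakest.
mean_maximum <- mean_maximum[order(-mean_maximum$Maximum), ]
print(mean_maximum)
```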
- -5.11 and 5.12 - nothing new - -Thus, theoretically, if we want to get the best possible results, we should choose lack of preprocessing or Boruta, random forest, XGBoost, or CatBoost models, medium data preprocessing, and KNN as an imputation method - -# Experimental validation - -Eventually we want to validate our findings thus we check if lack of preprocessing or Boruta, random forest, XGBoost, or CatBoost models, medium data preprocessing, and knn as an imputation method will lead us to the best improvements possible. - -# Data preparation - -```{r} -tree_engines_top <- validation_summary_table[validation_summary_table$Engine %in% c('CatBoost', 'XGBoost', 'Random forest') & - validation_summary_table$Imputation %in% c('KNN') & - validation_summary_table$Feature_selection %in% c('Boruta', 'None') & - validation_summary_table$Removal %in% c('Med'), ] - -tree_engines_top_bin <- tree_engines_top[tree_engines_top$Task_type == 'Binary' & tree_engines_top$Metric == 'accuracy', ] -tree_engines_top_mcl <- tree_engines_top[tree_engines_top$Task_type == 'Multiclass' & tree_engines_top$Metric == 'accuracy', ] -tree_engines_top_reg <- tree_engines_top[tree_engines_top$Task_type == 'Regression' & tree_engines_top$Metric == 'r2', ] - - - -tree_engines_top_baselines <- validation_summary_table[validation_summary_table$Engine %in% c('CatBoost', 'XGBoost', 'Random forest') & - validation_summary_table$Removal =='Min' & - validation_summary_table$Imputation =='Median-other' & - validation_summary_table$Feature_selection =='None', ] - -tree_engines_top_bin_baselines <- tree_engines_top_baselines[tree_engines_top_baselines$Task_type == 'Binary' & tree_engines_top_baselines$Metric == 'accuracy', ] -tree_engines_top_mcl_baselines <- tree_engines_top_baselines[tree_engines_top_baselines$Task_type == 'Multiclass' & tree_engines_top_baselines$Metric == 'accuracy', ] -tree_engines_top_reg_baselines <- tree_engines_top_baselines[tree_engines_top_baselines$Task_type == 'Regression' & 
tree_engines_top_baselines$Metric == 'r2', ] - -tree_engines_top_bin_baselines$Fields <- tree_engines_top_bin_baselines$Rows * tree_engines_top_bin_baselines$Columns -tree_engines_top_mcl_baselines$Fields <- tree_engines_top_mcl_baselines$Rows * tree_engines_top_mcl_baselines$Columns -tree_engines_top_reg_baselines$Fields <- tree_engines_top_reg_baselines$Rows * tree_engines_top_reg_baselines$Columns - -tree_engines_top_bin_baselines$Dataset <- as.factor(tree_engines_top_bin_baselines$Dataset) -tree_engines_top_bin_baselines$Dataset <- fct_reorder(tree_engines_top_bin_baselines$Dataset, tree_engines_top_bin_baselines$Fields) - -tree_engines_top_mcl_baselines$Dataset <- as.factor(tree_engines_top_mcl_baselines$Dataset) -tree_engines_top_mcl_baselines$Dataset <- fct_reorder(tree_engines_top_mcl_baselines$Dataset, tree_engines_top_mcl_baselines$Fields) - -tree_engines_top_reg_baselines$Dataset <- as.factor(tree_engines_top_reg_baselines$Dataset) -tree_engines_top_reg_baselines$Dataset <- fct_reorder(tree_engines_top_reg_baselines$Dataset, tree_engines_top_reg_baselines$Fields) - -tree_engines_top_baselines <- rbind(tree_engines_top_bin_baselines, tree_engines_top_mcl_baselines, tree_engines_top_reg_baselines) - -tree_engines_top_bin_preprocessibility <- calculate_preprocessibility_2(tree_engines_top_bin, tree_engines_top_bin_baselines, 'Engine') -tree_engines_top_mcl_preprocessibility <- calculate_preprocessibility_2(tree_engines_top_mcl, tree_engines_top_mcl_baselines, 'Engine') -tree_engines_top_reg_preprocessibility <- calculate_preprocessibility_2(tree_engines_top_reg, tree_engines_top_reg_baselines, 'Engine') -tree_engines_top_preprocessibility <- rbind(tree_engines_top_bin_preprocessibility, tree_engines_top_mcl_preprocessibility, tree_engines_top_reg_preprocessibility) -datatable(tree_engines_top_preprocessibility, width = '100%', options = list(scrollX = TRUE, paging = TRUE, pageLength = 5)) -``` - -## Detailed plot - -```{r fig.height=18, fig.width=20, 
 echo=FALSE, warning=FALSE} -detailed_plot(tree_engines_top_preprocessibility, 'Tree-based model') -``` - -## Stats - -```{r} -stats_tree_engines_top_preprocessibility_1 <- calcualte_stats(tree_engines_top_preprocessibility, FALSE) -stats_tree_engines_top_preprocessibility_2 <- calcualte_stats(tree_engines_top_preprocessibility[tree_engines_top_preprocessibility$grouping == 'CatBoost', ], FALSE) -stats_tree_engines_top_preprocessibility_3 <- calcualte_stats(tree_engines_top_preprocessibility[tree_engines_top_preprocessibility$grouping == 'XGBoost', ], FALSE) -stats_tree_engines_top_preprocessibility_4 <- calcualte_stats(tree_engines_top_preprocessibility[tree_engines_top_preprocessibility$grouping == 'Random forest', ], FALSE) - -stats_tree_engines_top_preprocessibility <- rbind(stats_tree_engines_top_preprocessibility_1, stats_tree_engines_top_preprocessibility_2, stats_tree_engines_top_preprocessibility_3, stats_tree_engines_top_preprocessibility_4) -rownames(stats_tree_engines_top_preprocessibility) <- c('All', 'CatBoost', 'XGBoost', 'Random forest') -knitr::kable(t(stats_tree_engines_top_preprocessibility)) -``` - -The results show that, after such filtering, the models win more often than they lose, and their positive preprocessibilities are a few times higher than the negative ones. This shows that our guide to a good preprocessing strategy works. 
- -```{r} -tree_engines_top_preprocessibility %>% - group_by(grouping) %>% - summarise(Maximum = mean(Maximum)) -``` - -## Aggregated plot - -```{r fig.height=8, fig.width=18, echo=FALSE, warning=FALSE} -aggregated_plot(tree_engines_top_preprocessibility, 'Tree-based model', 'tree-based model') -``` diff --git a/docs/articles/AutoML24Workshop & MScThesis/08_MSc_time_analysis.Rmd b/docs/articles/AutoML24Workshop & MScThesis/08_MSc_time_analysis.Rmd deleted file mode 100644 index 6f81fba..0000000 --- a/docs/articles/AutoML24Workshop & MScThesis/08_MSc_time_analysis.Rmd +++ /dev/null @@ -1,606 +0,0 @@ ---- -title: "Masters Thesis forester: Time analysis" -author: "Hubert Ruczyński" -date: "`r Sys.Date()`" -output: - html_document: - toc: yes - toc_float: yes - toc_collapsed: yes - theme: lumen - toc_depth: 3 - number_sections: yes - code_folding: hide - latex_engine: xelatex ---- - -```{css, echo=FALSE} -body .main-container { - max-width: 1820px !important; - width: 1820px !important; -} -body { - max-width: 1820px !important; - width: 1820px !important; - font-family: Helvetica !important; - font-size: 16pt !important; -} -h1,h2,h3,h4,h5,h6{ - font-size: 24pt !important; -} -``` - -# Imports and settings - -```{r, warning=FALSE, message=FALSE} -library(ggplot2) -library(patchwork) -library(scales) -library(dplyr) -library(forcats) -library(kableExtra) -library(knitr) -library(DT) -library(GGally) -library(tidyr) -``` - -# Data import - -```{r} -duration_train_df <- readRDS('MSc_processed_results/training_duration.RData') -duration_preprocessing <- readRDS('MSc_processed_results/preprocessing_duration.RData') -training_summary_table <- readRDS('MSc_processed_results/training_summary_table.RData') -testing_summary_table <- readRDS('MSc_processed_results/testing_summary_table.RData') -validation_summary_table <- readRDS('MSc_processed_results/validation_summary_table.RData') -``` - -## Name changes - -As the data comes from the `forester` package in a raw form, in 
order to prepare plots for the Thesis/paper we rename some values, so they look nicer on plots. - -```{r} -change_factors <- function(dataset, score = FALSE) { - dataset$Task_type <- as.factor(dataset$Task_type) - dataset$Task_type <- fct_recode(dataset$Task_type, 'Binary' = 'binary', 'Multiclass' = 'multiclass', 'Regression' = 'regression') - dataset$Feature_selection <- as.factor(dataset$Feature_selection) - dataset$Feature_selection <- fct_recode(dataset$Feature_selection, 'None' = 'none', 'VI' = 'VI', 'MCFS' = 'MCFS', 'MI' = 'MI', 'Boruta' = 'BORUTA') - dataset$Imputation <- as.factor(dataset$Imputation) - dataset$Imputation <- fct_recode(dataset$Imputation, 'Median-other' = 'median-other', 'Median-frequency' = 'median-frequency', 'KNN' = 'knn', 'MICE' = 'mice') - dataset$Removal <- as.factor(dataset$Removal) - dataset$Removal <- fct_recode(dataset$Removal, 'Min' = 'removal_min', 'Med' = 'removal_med', 'Max' = 'removal_max') - if (score) { - dataset$Engine <- as.factor(dataset$Engine) - dataset$Engine <- fct_recode(dataset$Engine, 'LightGBM' = 'lightgbm', 'CatBoost' = 'catboost', 'Random forest' = 'ranger', 'XGBoost' = 'xgboost', 'Decision tree' = 'decision_tree', 'All' = 'all') - } - return(dataset) -} - -duration_train_df <- change_factors(duration_train_df) -training_summary_table <- change_factors(training_summary_table, TRUE) -testing_summary_table <- change_factors(testing_summary_table, TRUE) -validation_summary_table <- change_factors(validation_summary_table, TRUE) -``` - -# Time analysis - -An important aspect of our analysis is the time complexity of the different approaches, as an extended preprocessing module leads to more time-consuming computations, and that time could be spent, for example, on training the models. On the other hand, a thorough preparation step might remove lots of unnecessary columns, so the model should be able to learn faster. Besides the absolute preprocessing time, another important aspect is its duration relative to the training time. 
For example, if training takes 1000 seconds, then preprocessing that lasts 100 seconds matters less than when training takes only 100 seconds. We will work on the slightly modified data frame presented below. - -```{r} -duration_df <- duration_train_df -full_duration <- duration_preprocessing$Duration + duration_df$Duration -duration_df$Preprocessing_duration <- duration_preprocessing$Duration -duration_df$Preprocessing_duration_fraction <- round(duration_df$Preprocessing_duration / full_duration, 3) -duration_df$Full_duration <- full_duration -rmarkdown::paged_table(duration_df) -``` - -## Training time preparation - -In this section we prepare the data for the analysis of training time. - -```{r, echo=FALSE} -column_fractions <- c() -max_fields_num <- c() -task_type <- c() -Columns <- c() -datasets <- unique(training_summary_table$Dataset) -for (i in 1:length(unique(training_summary_table$Dataset))) { - cols <- training_summary_table[training_summary_table$Dataset == datasets[i], 'Columns'] - rows <- training_summary_table[training_summary_table$Dataset == datasets[i], 'Rows'] - Columns <- c(Columns, max(cols)) - column_fractions <- c(column_fractions, round(min(cols) / max(cols), 2)) - max_fields_num <- c(max_fields_num, max(rows) * max(cols)) - if (i <= 10) { - task_type <- c(task_type, 'Binary') - } else if (i > 18) { - task_type <- c(task_type, 'Regression') - } else { - task_type <- c(task_type, 'Multiclass') - } -} -left_columns <- data.frame(Dataset = datasets, Column_fraction = column_fractions, Columns = Columns, - Max_fields_number = max_fields_num, Task_type = task_type) -``` - -```{r} -left_columns_binary <- left_columns[left_columns$Task_type == 'Binary', ] -left_columns_binary$Dataset <- as.factor(left_columns_binary$Dataset) -left_columns_binary$Dataset <- fct_reorder(left_columns_binary$Dataset, left_columns_binary$Max_fields_number) - -left_columns_multiclass <- left_columns[left_columns$Task_type == 'Multiclass', ] -left_columns_multiclass$Dataset <- 
as.factor(left_columns_multiclass$Dataset) -left_columns_multiclass$Dataset <- fct_reorder(left_columns_multiclass$Dataset, left_columns_multiclass$Max_fields_number) - -left_columns_regression <- left_columns[left_columns$Task_type == 'Regression', ] -left_columns_regression$Dataset <- as.factor(left_columns_regression$Dataset) -left_columns_regression$Dataset <- fct_reorder(left_columns_regression$Dataset, left_columns_regression$Max_fields_number) - -left_columns <- rbind(left_columns_binary, left_columns_multiclass, left_columns_regression) - -duration_df$Dataset <- factor(duration_df$Dataset, levels = levels(left_columns$Dataset)) -``` - -## Duration preprocessing fixes - -Unfortunately, some mistakes occurred during the data preparation phase, so we have to clean the data: we strip stray spaces from the identifier columns and restore the dataset ordering. - -```{r} -duration_preprocessing$Dataset <- gsub(' ', '', duration_preprocessing$Dataset) -duration_preprocessing$Feature_selection <- gsub(' ', '', duration_preprocessing$Feature_selection) -duration_preprocessing$Imputation <- gsub(' ', '', duration_preprocessing$Imputation) -duration_preprocessing$Removal <- gsub(' ', '', duration_preprocessing$Removal) -duration_preprocessing$Dataset <- factor(duration_preprocessing$Dataset, levels = levels(left_columns$Dataset)) -duration_preprocessing <- change_factors(duration_preprocessing) -``` - -## Plots theme - -We define a custom theme for this work so that adjustments are easier to make.
- -```{r} -paper_theme <- function() { - theme_minimal() + - theme(plot.title = element_text(colour = 'black', size = 26), - plot.subtitle = element_text(colour = 'black', size = 16), - axis.title.x = element_text(colour = 'black', size = 18), - axis.title.y = element_text(colour = 'black', size = 16), - axis.text.y = element_text(colour = "black", size = 16), - axis.text.x = element_text(colour = "black", size = 16), - strip.background = element_rect(fill = "white", color = "white"), - strip.text = element_text(size = 6 ), - strip.text.y.right = element_text(angle = 0), - legend.title = element_text(colour = 'black', size = 18), - legend.text = element_text(colour = "black", size = 16), - strip.text.y.left = element_text(size = 16, angle = 0, hjust = 1)) -} -``` - -## Training time - -```{r fig.height=12, fig.width=20, echo=FALSE} -a <- ggplot(data = left_columns, aes(color = Task_type, fill = Task_type)) + - geom_segment(aes(x = Columns * column_fractions, xend = Columns, y = Dataset, yend = Dataset)) + - geom_point(aes(x = Columns * column_fractions, y = Dataset), size = 3) + - geom_point(aes(x = Columns, y = Dataset), size = 3) + - labs(title = 'Columns range', - subtitle = 'full vs minimal columns', - x = 'Number of columns', - y = '', - color = 'Task type', - fill = 'Task type') + - scale_x_continuous(trans = log2_trans(), breaks = trans_breaks('log2', function(x) 2^x)) + - annotation_logticks(base = 2, scaled = TRUE) + - paper_theme() + - scale_color_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - scale_fill_manual(values = c("#afc968","#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - theme(axis.text.y = element_blank(), - legend.position = "none") - -b <- ggplot(data = duration_df, aes(x = Duration, y = Dataset, color = Task_type, fill = Task_type)) + - geom_boxplot(alpha = 0.7, outlier.size = 3) + - labs(title = 'Training time', - subtitle = 'for different ML tasks', - x = 'Duration [s]', - y = 'Dataset', - color = 'Task type', 
- fill = 'Task type') + - scale_x_continuous(trans = log2_trans(), breaks = trans_breaks('log2', function(x) 2^x)) + - annotation_logticks(base = 2, scaled = TRUE) + - paper_theme() + - scale_color_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - scale_fill_manual(values = c("#afc968","#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - theme(axis.title.y = element_blank(), - legend.position = "bottom") - -c <- ggplot(data = left_columns, aes(x = Max_fields_number, y = Dataset, color = Task_type, fill = Task_type)) + - geom_col(alpha = 0.7) + - labs(title = 'Dataset size', - subtitle = 'described as number of fields', - x = 'Number of fields', - y = '', - color = 'Task type', - fill = 'Task type') + - scale_x_continuous(trans = log2_trans(), - breaks = trans_breaks('log2', function(x) 2^x), - labels = trans_format('log2', math_format(2^.x))) + - annotation_logticks(base = 2, scaled = TRUE) + - paper_theme() + - scale_color_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - scale_fill_manual(values = c("#afc968","#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - theme(axis.text.y = element_blank(), - legend.position = "none") - -(b | a | c) + plot_layout(widths = c(5, 2, 2)) -``` - -The visualization above presents training duration box-plots for different ML tasks. Each box-plot is based on 38 different preprocessing strategies. The intention behind this analysis is to find out whether training times differ significantly depending on the preprocessing strategy applied beforehand. - -The middle plot presents the difference between the initial number of columns (right dot) and the minimal number of columns after the harshest preprocessing (left dot). - -The x scale of the plots is log2-transformed, which makes it easy to see whether the maximal and minimal values (excluding outliers) differ by more than a factor of two. We will say that the training times differ significantly if this max-to-min ratio is greater than 2.
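The significance criterion described above can also be checked numerically instead of being read off the plot. The chunk below is only a minimal sketch under stated assumptions: the data frame `toy_durations` and the helper `whisker_ratio` are hypothetical names with made-up values, and `boxplot.stats()` uses hinges that may differ slightly from the quartiles drawn by `geom_boxplot()`.

```{r}
# Hypothetical toy data: training durations in seconds for two made-up datasets.
toy_durations <- data.frame(
  Dataset  = rep(c("A", "B"), each = 5),
  Duration = c(10, 11, 12, 13, 14, 10, 15, 20, 25, 30)
)

# Ratio of the upper to the lower whisker, i.e. max / min after dropping
# the points that a box-plot would mark as outliers (1.5 * IQR rule).
whisker_ratio <- function(x) {
  s <- boxplot.stats(x)$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
  s[5] / s[1]
}

ratios <- vapply(split(toy_durations$Duration, toy_durations$Dataset),
                 whisker_ratio, numeric(1))

# A dataset counts as "significantly different" when the ratio exceeds 2.
significant <- ratios > 2
```

In the real analysis the same helper could be applied to `duration_df$Duration` split by `duration_df$Dataset`, assuming those columns are present as in the chunks above.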
- -Under this definition, training times differ significantly in 9 of 25 datasets. - -This happens for the datasets with the largest reduction in the number of columns. - -Interestingly, it is not strongly correlated with a large initial number of columns. - -Moreover, we can notice that the training time does not depend much on the initial number of fields. Of course, bigger tasks (e.g. the multiclass ones) tend to train longer, but the relationship is not clear-cut. - -## Preprocessing time - -```{r fig.height=12, fig.width=20, echo=FALSE} -d <- ggplot(data = duration_df, aes(x = Preprocessing_duration, y = Dataset, color = Task_type, fill = Task_type)) + - geom_boxplot(alpha = 0.7, outlier.size = 3) + - labs(title = 'Preprocessing time', - subtitle = 'for different ML tasks', - x = 'Duration [s]', - y = 'Dataset', - color = 'Task type', - fill = 'Task type') + - scale_x_continuous(trans = log2_trans(), breaks = trans_breaks('log2', function(x) 2^x)) + - annotation_logticks(base = 2, scaled = TRUE) + - paper_theme() + - scale_color_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - scale_fill_manual(values = c("#afc968","#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - theme(axis.title.y = element_blank(), - legend.position = "bottom") -(d | a | c) + plot_layout(widths = c(5, 2, 2)) -``` - -We can notice that the preprocessing of regression tasks lasted the longest, multiclass tasks were roughly in the middle, and binary classification was the quickest. - -We can observe that the preprocessing time is highly dependent on the dimensionality of the considered dataset. - -Once more, larger differences in preprocessing time are visible when the number of columns is reduced significantly. - -Additionally, we can see that the modified datasets (with the suffix -mod) are less time-consuming than their original counterparts.
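The comparison between the -mod datasets and their originals can be made explicit by pairing them by name. The following is a rough sketch only: `toy_medians` and its values are invented for illustration and are not results from this study; the only assumption carried over from the text is that modified datasets share the original name plus the `-mod` suffix.

```{r}
# Made-up median preprocessing times (seconds); NOT results from the benchmark.
toy_medians <- data.frame(
  Dataset  = c("x", "x-mod", "y", "y-mod"),
  Duration = c(12, 9, 20, 16)
)

# Split into originals and modified variants, keyed by the base dataset name.
is_mod    <- grepl("-mod$", toy_medians$Dataset)
base_name <- sub("-mod$", "", toy_medians$Dataset)
original  <- setNames(toy_medians$Duration[!is_mod], base_name[!is_mod])
modified  <- setNames(toy_medians$Duration[is_mod], base_name[is_mod])

# Keep only datasets that have both variants and compare them side by side.
common <- intersect(names(original), names(modified))
comparison <- data.frame(
  Dataset  = common,
  Original = unname(original[common]),
  Modified = unname(modified[common])
)
comparison$Mod_is_faster <- comparison$Modified < comparison$Original
```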
- -## Combined time - -```{r fig.height=12, fig.width=20, echo=FALSE} -e <- ggplot(data = duration_df, aes(x = Full_duration, y = Dataset, color = Task_type, fill = Task_type)) + - geom_boxplot(alpha = 0.7, outlier.size = 3) + - labs(title = 'Combined preprocessing and training time', - subtitle = 'for different ML tasks', - x = 'Duration [s]', - y = 'Dataset', - color = 'Task type', - fill = 'Task type') + - scale_x_continuous(trans = log2_trans(), breaks = trans_breaks('log2', function(x) 2^x)) + - annotation_logticks(base = 2, scaled = TRUE) + - paper_theme() + - scale_color_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - scale_fill_manual(values = c("#afc968","#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - theme(axis.title.y = element_blank(), - legend.position = "bottom") - -(e | a | c) + plot_layout(widths = c(3, 1, 1)) -``` - -Finally, we want to analyse the combined time of preprocessing and training. It is crucial, as preparing the data and training the models always go together. The plot shows that the whole process was shorter for smaller tasks, which were mostly the binary classification ones. - -We can clearly see that multiclass tasks are the most time-consuming, regression tasks are in the middle, and binary ones are relatively fast.
- -## Preprocessing + Training - -```{r fig.height=12, fig.width=20, echo=FALSE} -(d | b + theme(axis.text.y = element_blank(), axis.title.y = element_blank())) / guide_area() + plot_layout(guides = "collect") + plot_layout(height = c(1, 0.05)) -``` - -## Time fraction - -```{r fig.height=12, fig.width=20, echo=FALSE} -f <- ggplot(data = duration_df, aes(x = Preprocessing_duration_fraction, y = Dataset, color = Task_type, fill = Task_type)) + - geom_boxplot(alpha = 0.7, outlier.size = 3) + - labs(title = 'Preprocessing time fraction', - subtitle = ' in comparison to full process, for different ML tasks', - x = 'Fraction of preprocessing time', - y = 'Dataset', - color = 'Task type', - fill = 'Task type') + - xlim(0, 1) + - paper_theme() + - scale_color_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - scale_fill_manual(values = c("#afc968","#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - theme(axis.title.y = element_blank(), - legend.position = "bottom") - -(f | a | c) + plot_layout(widths = c(3, 1, 1)) -``` - -Probably even more insight comes from the fraction of the full process time that is spent on preprocessing. Intuitively, the further to the left an observation lies, the shorter the relative preprocessing time. - -For almost every dataset, some preprocessing options are disproportionately time-consuming relative to the training time, so we always have to be careful when choosing preprocessing methods. - -Quite interestingly, the fractions do not depend much on the initial size of the dataset alone, but on the combination of that size and the number of deleted columns.
The most interesting cases, however, are the multiclass tasks: they tend to get heavily reduced while remaining large, so their preprocessing should take more time; yet because model training for those tasks is incredibly long, the preprocessing time is not so significant. - -## Preprocessing components analysis - -It is also extremely important to analyse the execution times of different preprocessing strategies. Those times are not only crucial for the evaluation of different preprocessing steps but, more importantly, they let us build an intuition about which steps are time-consuming and which ones are almost cost-free. - -### Feature selection impact - -```{r fig.height=12, fig.width=20, echo=FALSE} -bool_fs <- duration_df -bool_fs$Feature_selection <- ifelse(bool_fs$Feature_selection != 'None', 'Yes', 'No') - -g <- ggplot(data = bool_fs, aes(x = Preprocessing_duration, y = Dataset, color = factor(Feature_selection), fill = factor(Feature_selection))) + - geom_boxplot(alpha = 0.7, outlier.size = 3) + - theme_minimal() + - labs(title = 'Preprocessing time', - subtitle = 'for different ML tasks, divided by presence of feature selection', - x = 'Duration [s]', - y = 'Dataset', - color = 'Is feature selection used?', - fill = 'Is feature selection used?') + - scale_x_continuous(trans = log2_trans(), breaks = trans_breaks('log2', function(x) 2^x)) + - annotation_logticks(base = 2, scaled = TRUE) + - paper_theme() + - scale_color_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - scale_fill_manual(values = c("#afc968","#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - theme(axis.title.y = element_blank(), - legend.position = "bottom") - - -g1 <- ggplot(data = bool_fs, aes(x = Duration, y = Dataset, color = factor(Feature_selection), fill = factor(Feature_selection))) + - geom_boxplot(alpha = 0.7, outlier.size = 3) + - theme_minimal() + - labs(title = 'Training time', - x = 'Duration [s]', - y = 'Dataset', - color = 'Is feature
selection used?', - fill = 'Is feature selection used?') + - scale_x_continuous(trans = log2_trans(), breaks = trans_breaks('log2', function(x) 2^x)) + - annotation_logticks(base = 2, scaled = TRUE) + - paper_theme() + - scale_color_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - scale_fill_manual(values = c("#afc968","#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - theme(axis.title.y = element_blank(), - axis.text.y = element_blank(), - legend.position = "none") - -(g | g1) / guide_area() + plot_layout(guides = "collect") + plot_layout(height = c(1, 0.05)) -``` - -We observe that for the majority of datasets, the preprocessing was much cheaper than the model training, however in some cases it was far from the truth. In the case of the largest datasets, especially from the regression tasks, we can observe a high impact of preprocessing in overall time complexity, which in some cases outlasted the training time a few times. Interestingly, the fractions do not depend so much on the initial size of the dataset, but on the combination of both this and the number of deleted columns. It shows that ML enthusiasts should be careful while designing their preprocessing pipeline. 
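To make the notion of the preprocessing fraction concrete, the chunk below reworks the 1000 s versus 100 s training example from the start of the time analysis. It is a small sketch: the helper name `preprocessing_fraction` and the third, preprocessing-dominated case are assumptions added for illustration, while the formula mirrors the `Preprocessing_duration_fraction` column computed earlier.

```{r}
# Fraction of the full process (preprocessing + training) spent on preprocessing,
# rounded to 3 digits like Preprocessing_duration_fraction above.
preprocessing_fraction <- function(preprocessing, training) {
  round(preprocessing / (preprocessing + training), 3)
}

# Illustrative durations in seconds (hypothetical values):
#   1) 100 s preprocessing, 1000 s training
#   2) 100 s preprocessing,  100 s training
#   3) 300 s preprocessing,  100 s training
fractions <- preprocessing_fraction(preprocessing = c(100, 100, 300),
                                    training      = c(1000, 100, 100))

# Preprocessing outlasts training exactly when the fraction exceeds 0.5.
dominates <- fractions > 0.5
```

The same 100 seconds of preprocessing accounts for roughly 9% of the process in the first case and for half of it in the second, which is why the fraction is reported alongside the absolute times.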
- -### No feature selection removal strategies - -```{r fig.height=12, fig.width=20, echo=FALSE} -no_fs <- duration_preprocessing[duration_preprocessing$Feature_selection == 'None', ] - -h <- ggplot(data = no_fs, aes(x = Duration, y = Dataset, color = factor(Removal), fill = factor(Removal))) + - geom_boxplot(alpha = 0.7, outlier.size = 3) + - labs(title = 'Preprocessing time', - subtitle = 'for different ML tasks, divided by removal strategy', - x = 'Duration [s]', - y = 'Dataset', - color = 'Removal strategy', - fill = 'Removal strategy') + - scale_x_continuous(trans = log2_trans(), breaks = trans_breaks('log2', function(x) 2^x)) + - annotation_logticks(base = 2, scaled = TRUE) + - paper_theme() + - scale_color_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - scale_fill_manual(values = c("#afc968","#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - theme(axis.title.y = element_blank(), - legend.position = "bottom") - -h -``` - -The results indicate that for each task, the differences between the three strategies are in most cases insignificant. The medium approach is the most time-consistent option, whereas the other two tend to differ a lot. Interestingly, the maximal approach is not always the longest-lasting one, even though it adds more methods than other strategies. We can see, however, that it always lasts longer than the medium variant, which is due to the presence of highly correlated column removal. 
- -### No feature selection Imputation methods - -```{r fig.height=10, fig.width=20, echo=FALSE} -no_fs_imp <- no_fs[no_fs$Dataset %in% c('breast-w', 'credit-approval', "credit-g-mod", "phoneme-mod", "car-mod", "satimage-mod", "elevators-mod", "kin8nm-mod"), ] -i <- ggplot(data = no_fs_imp, aes(x = Duration, y = Dataset, color = factor(Imputation), fill = factor(Imputation))) + - geom_boxplot(alpha = 0.7, outlier.size = 3) + - labs(title = 'Preprocessing time', - subtitle = 'for different ML tasks, divided by imputation strategy', - x = 'Duration [s]', - y = 'Dataset', - color = 'Imputation method', - fill = 'Imputation method') + - scale_x_continuous(trans = log2_trans(), breaks = trans_breaks('log2', function(x) 2^x)) + - annotation_logticks(base = 2, scaled = TRUE) + - paper_theme() + - scale_color_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - scale_fill_manual(values = c("#afc968","#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - theme(axis.title.y = element_blank(), - legend.position = "bottom") - -j <- ggplot(data = no_fs, aes(x = Duration, y = Dataset, color = factor(Imputation), fill = factor(Imputation))) + - geom_boxplot(alpha = 0.7, outlier.size = 3) + - labs(title = 'Preprocessing time comparison', - subtitle = 'for different ML tasks, divided by imputation strategy', - x = 'Duration [s]', - y = 'Dataset', - color = 'Imputation method', - fill = 'Imputation method') + - scale_x_continuous(trans = log2_trans(), breaks = trans_breaks('log2', function(x) 2^x)) + - annotation_logticks(base = 2, scaled = TRUE) + - paper_theme() + - scale_color_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - scale_fill_manual(values = c("#afc968","#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - theme(axis.title.y = element_blank(), - legend.position = "bottom") - -i -``` - -```{r} -no_fs_imp_average <- no_fs_imp %>% group_by(Imputation) %>% summarise(Average_duration = round(mean(Duration))) -no_fs_imp_average -``` 
- -Another factor to analyse is the impact of imputation methods on the preprocessing times. We have a wide range of data, as the -mod variants are the original datasets with missing values introduced completely at random. - -The plot above shows that the slowest method is definitely KNN (22s), the second slowest is MICE (16s), and the two fastest are median-other (6s) and median-frequency (5s). It is worth noting that MICE can lead to preprocessing times over 8 times longer than the fastest method. - -```{r fig.height=16, fig.width=20, echo=FALSE} -j -``` - -Additionally, we include the plot for all datasets, as it is interesting to see that the differences are not so significant for the tasks without missing values. - -### Different feature selection methods - -```{r fig.height=16, fig.width=20, echo=FALSE} -only_fs <- duration_preprocessing[duration_preprocessing$Feature_selection != 'None', ] -only_fs_niche <- only_fs[only_fs$Feature_selection %in% c('MI', 'MCFS'), ] -only_fs_top <- only_fs[only_fs$Feature_selection %in% c('VI', 'Boruta'), ] - -k <- ggplot(data = only_fs, aes(x = Duration, y = Dataset, color = factor(Feature_selection), fill = factor(Feature_selection))) + - geom_boxplot(alpha = 0.7, outlier.size = 3) + - labs(title = 'Preprocessing time', - subtitle = 'for different ML tasks, divided by feature selection method', - x = 'Duration [s]', - y = 'Dataset', - color = 'Feature selection method', - fill = 'Feature selection method') + - scale_x_continuous(trans = log2_trans(), breaks = trans_breaks('log2', function(x) 2^x), limits = c(NA, 4100)) + - annotation_logticks(base = 2, scaled = TRUE) + - paper_theme() + - scale_color_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - scale_fill_manual(values = c("#afc968","#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - theme(axis.title.y = element_blank(), - legend.position = "bottom") -k -``` - -We can notice significant differences between the
execution times of the methods. - -Moreover, in general, the duration does not differ much within a single FS method. - -We use this observation to compare all methods in a more readable way, through their medians, as the abundance of colors and box-plots is hard to read here. - -```{r fig.height=12, fig.width=20, echo=FALSE} -datasets <- unique(only_fs$Dataset) -VI <- c() -MCFS <- c() -MI <- c() -Boruta <- c() - -for (i in unique(only_fs$Dataset)) { - ds <- only_fs[only_fs$Dataset == i, ] - VI <- c(VI, median(ds[ds$Feature_selection == 'VI', 'Duration'])) - MCFS <- c(MCFS, median(ds[ds$Feature_selection == 'MCFS', 'Duration'])) - MI <- c(MI, median(ds[ds$Feature_selection == 'MI', 'Duration'])) - Boruta <- c(Boruta, median(ds[ds$Feature_selection == 'Boruta', 'Duration'])) -} - -median_fs <- data.frame(Dataset = datasets, VI = VI, MCFS = MCFS, Boruta = Boruta, MI = MI) -long_median_fs <- reshape(median_fs, varying = c('MI' ,'VI', 'MCFS', 'Boruta'), v.names = c('Duration'), - times = c('MI' ,'VI', 'MCFS', 'Boruta'), direction = 'long') -long_median_fs <- long_median_fs[, 1:3] - -rownames(long_median_fs) <- NULL -colnames(long_median_fs) <- c('Dataset', 'Method', 'Duration') - -l <- ggplot(data = long_median_fs, aes(x = Duration, y = Dataset, color = factor(Method), fill = factor(Method))) + - geom_point(size = 5, alpha = 0.7) + - labs(title = 'Preprocessing median time', - subtitle = 'for different ML tasks, divided by feature selection method', - x = 'Duration [s]', - y = 'Dataset', - color = 'Feature selection method', - fill = 'Feature selection method') + - scale_x_continuous(trans = log2_trans(), breaks = trans_breaks('log2', function(x) 2^x)) + - annotation_logticks(base = 2, scaled = TRUE) + - paper_theme() + - scale_color_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - scale_fill_manual(values = c("#afc968","#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - theme(axis.title.y =
element_blank(), - legend.position = "bottom") -l -``` - -The visualization above clearly indicates that the feature selection methods in the forester package divide into slow and fast ones: VI and Boruta belong to the first group, whereas MCFS and MI belong to the second. In order to analyse them thoroughly, let's create two subplots that separate the two groups. - -```{r fig.height=12, fig.width=20, echo=FALSE} -long_median_fs_slow <- long_median_fs[long_median_fs$Method %in% c('VI', 'Boruta'), ] -long_median_fs_fast <- long_median_fs[long_median_fs$Method %in% c('MCFS', 'MI'), ] - -m <- ggplot(data = long_median_fs_slow, aes(x = Duration, y = Dataset, color = factor(Method), fill = factor(Method))) + - geom_point(size = 5, alpha = 0.7) + - labs(title = 'Preprocessing median time', - subtitle = 'for slow feature selection methods', - x = 'Duration [s]', - y = 'Dataset', - color = 'Feature selection method', - fill = 'Feature selection method') + - scale_x_continuous(trans = log2_trans(), breaks = trans_breaks('log2', function(x) 2^x)) + - annotation_logticks(base = 2, scaled = TRUE) + - paper_theme() + - scale_color_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - scale_fill_manual(values = c("#afc968","#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - theme(axis.title.y = element_blank(), - legend.position = "bottom") - -n <- ggplot(data = long_median_fs_fast, aes(x = Duration, y = Dataset, color = factor(Method), fill = factor(Method))) + - geom_point(size = 5, alpha = 0.7) + - labs(title = 'Preprocessing median time', - subtitle = 'for fast feature selection methods', - x = 'Duration [s]', - y = 'Dataset', - color = 'Feature selection method', - fill = 'Feature selection method') + - scale_x_continuous(trans = log2_trans(), breaks = trans_breaks('log2', function(x) 2^x)) + - annotation_logticks(base = 2, scaled = TRUE) + - paper_theme() + - scale_color_manual(values = c("#afc968", "#74533d", "#7C843C", "#B1805B",
'#D6E29C')) + - scale_fill_manual(values = c("#afc968","#74533d", "#7C843C", "#B1805B", '#D6E29C')) + - theme(axis.text.y = element_blank(), - axis.title.y = element_blank(), - legend.position = "bottom") -m | n -``` - -This time we can easily distinguish which preprocessing method is faster and which is slower within each pair. - -Among the less time-demanding methods, shown in the right plot, MCFS is almost always faster than MI, and in some cases the difference is significant, reaching up to a factor of 16. - -Among the slow methods, VI is always the most time-consuming and Boruta is the second; the differences between them are significant every time. - -Summing up, the order from fastest to slowest feature selection method is: MCFS, MI, Boruta, VI with average preprocessing times equal to 11s, 16s, 133s (\~2 mins), and 480s (\~8 mins). - -```{r} -only_fs_average <- only_fs %>% group_by(Feature_selection) %>% summarise(Average_duration = round(mean(Duration))) -only_fs_average -``` - -## Summary - -1. The major time differences depend on the number of columns after preprocessing, not the initial number of columns, -2. Training time does not depend much on the dataset size, -3. Preprocessing time is highly dependent on the dimensionality of the considered dataset, -4. Multiclass tasks are the most time-consuming, regression tasks are in the middle, and binary ones are relatively fast, -5. Although for the majority of datasets the preprocessing time is not so demanding, for the largest tasks the preprocessing can be disproportionately time-consuming, -6. The most time-consuming part is feature selection, as in some cases preprocessing with it may last even 32 times longer than without it, -7. The order from fastest to slowest feature selection method is: MCFS, MI, Boruta, VI with average preprocessing times equal to 11s, 16s, 133s ($\sim$ 2 minutes), and 480s ($\sim$ 8 minutes), -8.
With the right choices made, feature selection can be reasonably fast, -9. We can ignore differences in preprocessing times for removal strategies, as the results are fairly similar and short, -10. The slowest method is KNN, the second one is MICE, and the two fastest ones are median-other and median-frequency. diff --git a/docs/articles/AutoML24Workshop & MScThesis/MSc_binary_CC18.RData b/docs/articles/AutoML24Workshop & MScThesis/MSc_binary_CC18.RData deleted file mode 100644 index c29d533..0000000 Binary files a/docs/articles/AutoML24Workshop & MScThesis/MSc_binary_CC18.RData and /dev/null differ diff --git a/docs/articles/AutoML24Workshop & MScThesis/MSc_multiclass_CC18.RData b/docs/articles/AutoML24Workshop & MScThesis/MSc_multiclass_CC18.RData deleted file mode 100644 index c6442ad..0000000 Binary files a/docs/articles/AutoML24Workshop & MScThesis/MSc_multiclass_CC18.RData and /dev/null differ diff --git a/docs/articles/AutoML24Workshop & MScThesis/MSc_processed_results/preprocessing_duration.RData b/docs/articles/AutoML24Workshop & MScThesis/MSc_processed_results/preprocessing_duration.RData deleted file mode 100644 index efc84c0..0000000 Binary files a/docs/articles/AutoML24Workshop & MScThesis/MSc_processed_results/preprocessing_duration.RData and /dev/null differ diff --git a/docs/articles/AutoML24Workshop & MScThesis/MSc_processed_results/testing_summary_table.RData b/docs/articles/AutoML24Workshop & MScThesis/MSc_processed_results/testing_summary_table.RData deleted file mode 100644 index 09f394c..0000000 Binary files a/docs/articles/AutoML24Workshop & MScThesis/MSc_processed_results/testing_summary_table.RData and /dev/null differ diff --git a/docs/articles/AutoML24Workshop & MScThesis/MSc_processed_results/training_duration.RData b/docs/articles/AutoML24Workshop & MScThesis/MSc_processed_results/training_duration.RData deleted file mode 100644 index 250303a..0000000 Binary files a/docs/articles/AutoML24Workshop & 
MScThesis/MSc_processed_results/training_duration.RData and /dev/null differ diff --git a/docs/articles/AutoML24Workshop & MScThesis/MSc_processed_results/training_summary/binary_banknote-authentication.RData b/docs/articles/AutoML24Workshop & MScThesis/MSc_processed_results/training_summary/binary_banknote-authentication.RData deleted file mode 100644 index 437e748..0000000 Binary files a/docs/articles/AutoML24Workshop & MScThesis/MSc_processed_results/training_summary/binary_banknote-authentication.RData and /dev/null differ diff --git a/docs/articles/AutoML24Workshop & MScThesis/MSc_processed_results/training_summary/binary_blood-transfusion-service-center.RData b/docs/articles/AutoML24Workshop & MScThesis/MSc_processed_results/training_summary/binary_blood-transfusion-service-center.RData deleted file mode 100644 index d732bcd..0000000 Binary files a/docs/articles/AutoML24Workshop & MScThesis/MSc_processed_results/training_summary/binary_blood-transfusion-service-center.RData and /dev/null differ diff --git a/docs/articles/AutoML24Workshop & MScThesis/MSc_processed_results/training_summary/binary_breast-w.RData b/docs/articles/AutoML24Workshop & MScThesis/MSc_processed_results/training_summary/binary_breast-w.RData deleted file mode 100644 index af26121..0000000 Binary files a/docs/articles/AutoML24Workshop & MScThesis/MSc_processed_results/training_summary/binary_breast-w.RData and /dev/null differ diff --git a/docs/articles/AutoML24Workshop & MScThesis/MSc_processed_results/training_summary/binary_credit-approval.RData b/docs/articles/AutoML24Workshop & MScThesis/MSc_processed_results/training_summary/binary_credit-approval.RData deleted file mode 100644 index 735201f..0000000 Binary files a/docs/articles/AutoML24Workshop & MScThesis/MSc_processed_results/training_summary/binary_credit-approval.RData and /dev/null differ diff --git a/docs/articles/AutoML24Workshop & MScThesis/MSc_processed_results/training_summary/binary_credit-g-mod.RData 
b/docs/articles/AutoML24Workshop & MScThesis/MSc_processed_results/training_summary/binary_credit-g-mod.RData deleted file mode 100644 index 2959c6f..0000000 Binary files a/docs/articles/AutoML24Workshop & MScThesis/MSc_processed_results/training_summary/binary_credit-g-mod.RData and /dev/null differ diff --git a/docs/articles/AutoML24Workshop & MScThesis/MSc_processed_results/training_summary/binary_credit-g.RData b/docs/articles/AutoML24Workshop & MScThesis/MSc_processed_results/training_summary/binary_credit-g.RData deleted file mode 100644 index bb6911d..0000000 Binary files a/docs/articles/AutoML24Workshop & MScThesis/MSc_processed_results/training_summary/binary_credit-g.RData and /dev/null differ diff --git a/docs/articles/AutoML24Workshop & MScThesis/MSc_processed_results/training_summary/binary_diabetes.RData b/docs/articles/AutoML24Workshop & MScThesis/MSc_processed_results/training_summary/binary_diabetes.RData deleted file mode 100644 index 68c4b69..0000000 Binary files a/docs/articles/AutoML24Workshop & MScThesis/MSc_processed_results/training_summary/binary_diabetes.RData and /dev/null differ diff --git a/docs/articles/AutoML24Workshop & MScThesis/MSc_processed_results/training_summary/binary_kr-vs-kp.RData b/docs/articles/AutoML24Workshop & MScThesis/MSc_processed_results/training_summary/binary_kr-vs-kp.RData deleted file mode 100644 index bbcc8ae..0000000 Binary files a/docs/articles/AutoML24Workshop & MScThesis/MSc_processed_results/training_summary/binary_kr-vs-kp.RData and /dev/null differ diff --git a/docs/articles/AutoML24Workshop & MScThesis/MSc_processed_results/training_summary/binary_phoneme-mod.RData b/docs/articles/AutoML24Workshop & MScThesis/MSc_processed_results/training_summary/binary_phoneme-mod.RData deleted file mode 100644 index a76fa51..0000000 Binary files a/docs/articles/AutoML24Workshop & MScThesis/MSc_processed_results/training_summary/binary_phoneme-mod.RData and /dev/null differ diff --git a/docs/articles/AutoML24Workshop 
& MScThesis/MSc_processed_results/training_summary/binary_phoneme.RData b/docs/articles/AutoML24Workshop & MScThesis/MSc_processed_results/training_summary/binary_phoneme.RData deleted file mode 100644 index cd08d61..0000000 Binary files a/docs/articles/AutoML24Workshop & MScThesis/MSc_processed_results/training_summary/binary_phoneme.RData and /dev/null differ diff --git a/docs/articles/AutoML24Workshop & MScThesis/MSc_processed_results/training_summary/multiclass_balance-scale.RData b/docs/articles/AutoML24Workshop & MScThesis/MSc_processed_results/training_summary/multiclass_balance-scale.RData deleted file mode 100644 index 33a52d1..0000000 Binary files a/docs/articles/AutoML24Workshop & MScThesis/MSc_processed_results/training_summary/multiclass_balance-scale.RData and /dev/null differ diff --git a/docs/articles/AutoML24Workshop & MScThesis/MSc_processed_results/training_summary/multiclass_car-mod.RData b/docs/articles/AutoML24Workshop & MScThesis/MSc_processed_results/training_summary/multiclass_car-mod.RData deleted file mode 100644 index 3a1d285..0000000 Binary files a/docs/articles/AutoML24Workshop & MScThesis/MSc_processed_results/training_summary/multiclass_car-mod.RData and /dev/null differ diff --git a/docs/articles/AutoML24Workshop & MScThesis/MSc_processed_results/training_summary/multiclass_car.RData b/docs/articles/AutoML24Workshop & MScThesis/MSc_processed_results/training_summary/multiclass_car.RData deleted file mode 100644 index 2fc2e0c..0000000 Binary files a/docs/articles/AutoML24Workshop & MScThesis/MSc_processed_results/training_summary/multiclass_car.RData and /dev/null differ diff --git a/docs/articles/AutoML24Workshop & MScThesis/MSc_processed_results/training_summary/multiclass_dna.RData b/docs/articles/AutoML24Workshop & MScThesis/MSc_processed_results/training_summary/multiclass_dna.RData deleted file mode 100644 index a1a744a..0000000 Binary files a/docs/articles/AutoML24Workshop & 
MScThesis/MSc_processed_results/training_summary/multiclass_dna.RData and /dev/null differ diff --git a/docs/articles/AutoML24Workshop & MScThesis/MSc_processed_results/training_summary/multiclass_satimage-mod.RData b/docs/articles/AutoML24Workshop & MScThesis/MSc_processed_results/training_summary/multiclass_satimage-mod.RData deleted file mode 100644 index d94bcc5..0000000 Binary files a/docs/articles/AutoML24Workshop & MScThesis/MSc_processed_results/training_summary/multiclass_satimage-mod.RData and /dev/null differ diff --git a/docs/articles/AutoML24Workshop & MScThesis/MSc_processed_results/training_summary/multiclass_satimage.RData b/docs/articles/AutoML24Workshop & MScThesis/MSc_processed_results/training_summary/multiclass_satimage.RData deleted file mode 100644 index b19f3da..0000000 Binary files a/docs/articles/AutoML24Workshop & MScThesis/MSc_processed_results/training_summary/multiclass_satimage.RData and /dev/null differ diff --git a/docs/articles/AutoML24Workshop & MScThesis/MSc_processed_results/training_summary/multiclass_wine_quality.RData b/docs/articles/AutoML24Workshop & MScThesis/MSc_processed_results/training_summary/multiclass_wine_quality.RData deleted file mode 100644 index 2277d3b..0000000 Binary files a/docs/articles/AutoML24Workshop & MScThesis/MSc_processed_results/training_summary/multiclass_wine_quality.RData and /dev/null differ diff --git a/docs/articles/AutoML24Workshop & MScThesis/MSc_processed_results/training_summary/regression_2dplanes.RData b/docs/articles/AutoML24Workshop & MScThesis/MSc_processed_results/training_summary/regression_2dplanes.RData deleted file mode 100644 index 1915ec9..0000000 Binary files a/docs/articles/AutoML24Workshop & MScThesis/MSc_processed_results/training_summary/regression_2dplanes.RData and /dev/null differ diff --git a/docs/articles/AutoML24Workshop & MScThesis/MSc_processed_results/training_summary/regression_Mercedes_Benz_Greener_Manufacturing.RData b/docs/articles/AutoML24Workshop & 
MScThesis/MSc_processed_results/training_summary/regression_Mercedes_Benz_Greener_Manufacturing.RData deleted file mode 100644 index 2c2b593..0000000 Binary files a/docs/articles/AutoML24Workshop & MScThesis/MSc_processed_results/training_summary/regression_Mercedes_Benz_Greener_Manufacturing.RData and /dev/null differ diff --git a/docs/articles/AutoML24Workshop & MScThesis/MSc_processed_results/training_summary/regression_elevators-mod.RData b/docs/articles/AutoML24Workshop & MScThesis/MSc_processed_results/training_summary/regression_elevators-mod.RData deleted file mode 100644 index f4e893e..0000000 Binary files a/docs/articles/AutoML24Workshop & MScThesis/MSc_processed_results/training_summary/regression_elevators-mod.RData and /dev/null differ diff --git a/docs/articles/AutoML24Workshop & MScThesis/MSc_processed_results/training_summary/regression_elevators.RData b/docs/articles/AutoML24Workshop & MScThesis/MSc_processed_results/training_summary/regression_elevators.RData deleted file mode 100644 index 2d95614..0000000 Binary files a/docs/articles/AutoML24Workshop & MScThesis/MSc_processed_results/training_summary/regression_elevators.RData and /dev/null differ diff --git a/docs/articles/AutoML24Workshop & MScThesis/MSc_processed_results/training_summary_table.RData b/docs/articles/AutoML24Workshop & MScThesis/MSc_processed_results/training_summary_table.RData deleted file mode 100644 index e6ef794..0000000 Binary files a/docs/articles/AutoML24Workshop & MScThesis/MSc_processed_results/training_summary_table.RData and /dev/null differ diff --git a/docs/articles/AutoML24Workshop & MScThesis/MSc_processed_results/validation_summary_table.RData b/docs/articles/AutoML24Workshop & MScThesis/MSc_processed_results/validation_summary_table.RData deleted file mode 100644 index 8833a02..0000000 Binary files a/docs/articles/AutoML24Workshop & MScThesis/MSc_processed_results/validation_summary_table.RData and /dev/null differ diff --git a/docs/articles/AutoML24Workshop & 
MScThesis/MSc_regression_bench.RData b/docs/articles/AutoML24Workshop & MScThesis/MSc_regression_bench.RData
deleted file mode 100644
index b7efeef..0000000
Binary files a/docs/articles/AutoML24Workshop & MScThesis/MSc_regression_bench.RData and /dev/null differ
diff --git a/docs/articles/AutoML24Workshop & MScThesis/README.md b/docs/articles/AutoML24Workshop & MScThesis/README.md
deleted file mode 100644
index e4183f6..0000000
--- a/docs/articles/AutoML24Workshop & MScThesis/README.md
+++ /dev/null
@@ -1,25 +0,0 @@
-# Description
-
-This folder contains the source code, initial datasets, and aggregated results of [@HubertR21](https://github.com/HubertR21)'s Master's Thesis, 'The impact of data preprocessing on the quality of tree-based models with forester package'. The same results underlie the poster and the paper 'Do tree-based models need data preprocessing?'. The experiments were conducted with forester version 1.6.1.
-
-# Structure
-
-- RData files contain the datasets used in the study.
-- Rmd notebooks are prefixed with 01-08, which indicates their execution order. Running them in that order reproduces the computations of this study. The files perform the following operations:
-  - 01: Dataset selection for multiclass tasks from the CC-18 benchmark.
-  - 02: Creation of 6 datasets with artificially diminished data quality.
-  - 03: Preparation of preprocessed dataset scenarios.
-  - 04: Training the models on each preprocessed dataset.
-  - 05: Preparation of aggregated results.
-  - 06: Preliminary analysis of the results in their raw form.
-  - 07: In-depth results analysis.
-  - 08: Time complexity analysis.
-- The MSc_processed_results directory contains the final results after step 05. It includes the following files:
-  - The preprocessing_duration.RData file: Duration in seconds for each preprocessing strategy.
-  - The testing_summary_table.RData file: Performance of each model evaluated on the testing subset.
-  - The training_duration.RData file: Training duration in seconds for each dataset.
-  - The training_summary_table.RData file: Performance of each model evaluated on the training subset.
-  - The validation_summary_table.RData file: Performance of each model evaluated on the validation subset.
-  - The training_summary directory contains RData files with raw results for each dataset separately. *WARNING* Some files are too big for Git, so they are available on Google Drive only.
-
-As some files are too big for Git, they are available on Google Drive only: https://drive.google.com/drive/folders/1sQkzZE9sjqdIQTSQ2bKB4z0YfR-8hXm7?usp=sharing.