diff --git a/.env.example b/.env.example new file mode 100644 index 0000000..d28a0ee --- /dev/null +++ b/.env.example @@ -0,0 +1 @@ +PYTHONPATH=src \ No newline at end of file diff --git a/.gitignore b/.gitignore index 297a10f..ecf8921 100644 --- a/.gitignore +++ b/.gitignore @@ -1,16 +1,51 @@ +# Data files *.pkl *.csv !ceas_08.csv !phishing_email.csv *.eml *.npz + +# Environment files *.env +env/ +venv/ +ENV/ + +# Log files *.log + +# Compiled files +*.pyc +*.pyo +*.pyd + +# Python cache __pycache__/ + +# Project directories backup/ trash/ others/ +downloads/ + +# Configuration files config.json +# Python build files +build/ +develop-eggs/ +dist/ +eggs/ +.eggs/ +lib/ +lib64/ +parts/ +var/ +*.egg-info/ +.installed.cfg +*.egg + +# Specific scripts email_generator.py -shap_analysis.py \ No newline at end of file +shap_analysis.py diff --git a/README.md b/README.md index b970213..dd47ec2 100644 --- a/README.md +++ b/README.md @@ -12,96 +12,14 @@ This project leverages advanced machine learning algorithms to detect and classi Our solution applies a combination of processes such as data preprocessing, feature engineering, and model training techniques to identify spam and phishing emails. The project addresses real-world challenges like imbalanced datasets by utilizing SpamAssassin and CEAS datasets for training and evaluation, ultimately enhancing the model's ability to filter phishing and spam emails effectively. -## Key Technologies +### Key Technologies -- **BERT for Feature Extraction**: Enhances contextual understanding of email content. -- **Stacked Ensemble Learning**: Combines XGBoost, Bagged SVM, and Logistic Regression for robust detection. -- **Optuna for Hyperparameter Tuning**: Optimizes model performance by fine-tuning key parameters. -- **Flask**: Provides a web interface for real-time email classification. - -## Installation - -To set up the project, clone the repository and install the necessary dependencies: - -```bash -git clone https://github.com/Koon-Kiat/Detecting-Spam-and-Phishing-Emails-Using-Machine-Learning -cd Detecting-Spam-and-Phishing-Emails-Using-Machine-Learning -conda create --name python=3.8.20 -conda activate -conda env update --file environment.yaml --prune -``` - -Once the dependencies are installed, you can run the phishing email detection program using the following command: - -```bash -python main.py -``` - -## Data - -The project utilizes merged datasets from SpamAssassin (Hugging Face) and CEAS (Kaggle) to enhance email threat detection: - -- **SpamAssassin**: Contains real-world spam and legitimate emails. -- **CEAS 2008**: Specially curated for anti-spam research, with a focus on phishing examples. - -## Merging Datasets - -TThe project integrates the **[Spam Assassin](https://huggingface.co/datasets/talby/spamassassin)** and **[CEAS 2008](https://www.kaggle.com/datasets/naserabdullahalam/phishing-email-dataset?select=CEAS_08.csv)** datasets, aligning them by columns and ensuring label consistency. This creates a robust, well-labeled dataset that improves phishing and spam detection accuracy. 
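For illustration, the column-alignment step described above can be sketched in a few lines of pandas; the file names and the `body`/`label` schema here are assumptions for the example, not the project's exact code:

```python
import pandas as pd

# Hypothetical sketch of the dataset merge: select a shared column set from
# both corpora, concatenate, and de-duplicate. Column and file names are
# illustrative assumptions, not the project's exact schema.
spam_assassin = pd.read_csv("data/spam_assassin.csv")[["body", "label"]]
ceas_08 = pd.read_csv("data/ceas_08.csv")[["body", "label"]]

combined_df = (
    pd.concat([spam_assassin, ceas_08], ignore_index=True)
    .dropna(subset=["body"])
    .drop_duplicates(subset="body")
)
combined_df["label"] = combined_df["label"].astype(int)  # keep labels consistent
```

Selecting a common column set before `pd.concat` is what aligns the two corpora by columns, and the final integer cast enforces label consistency across sources.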
- -## File Structure for Storing Results - -```plaintext -project_root/ -├── additional_model_training/ -│ ├── base_model_optuna.py -│ ├── base_model.py -├── config.json -├── data_pipeline/ -│ ├── data_integration/ -│ ├── data_preprocessing/ -│ ├── noise_injection/ -│ ├── data_splitting/ -│ ├── feature_engineering/ -│ ├── feature_extraction/ -│ └── models_and_parameters/ -├── datasets/ -├── evaluation_on_third_dataset.py -├── evaluationonthirddataset/ -├── flask_app.py -├── logs/ -├── main.py -├── manifest_python.xml -├── multi_model_evaluation/ -├── README.md -├── requirements.txt -├── single_model_evaluation/ -├── spamandphishingdetection/ -├── static/ -├── templates/ -├── third_dataset_evaluation/ -``` - -## Technology Stack - -### Programming Languages - -- **Python** - -### Libraries and Frameworks - -- **Machine Learning**: scikit-learn, TensorFlow, transformers, imbalanced-learn +- **Programming Language**: Python +- **ML/DL Libraries**: scikit-learn, TensorFlow, transformers, imbalanced-learn - **NLP**: NLTK -- **Data Handling**: pandas, numpy -- **Web Framework**: Flask +- **Data Processing**: pandas, numpy +- **Development Tools**: Git, Anaconda - **Optimization**: Optuna - -### Tools - -- **Version Control**: Git -- **Environment Management**: Anaconda - -### Additional Technologies - - **Feature Extraction**: BERT - **Ensemble Learning**: XGBoost, Bagged SVM, Logistic Regression, Stacked Ensemble Learning - **Data Preprocessing**: One-Hot Encoding, Standard Scaling, Imputation, Rare Category Removal, Noise Injection @@ -110,8 +28,21 @@ project_root/ - **Noise Injection**: Adding controlled random variations to features to improve model generalization and reduce overfitting - **Stacked Ensemble Learning**: Combining multiple models for robust detection +## Features + +- **Advanced Spam and Phishing Detection**: Utilizes sophisticated algorithms to accurately identify malicious emails. +- **Support for Handling Imbalanced Datasets**: Implements techniques to manage and balance skewed data distributions. +- **Automated Model Training and Evaluation**: Streamlines the process of training and assessing machine learning models. + ## Methodologies +### Data Sources + +The project utilizes merged datasets from SpamAssassin (Hugging Face) and CEAS (Kaggle) to enhance email threat detection: + +- **SpamAssassin**: Contains real-world spam and legitimate emails. +- **CEAS 2008**: Specially curated for anti-spam research, with a focus on phishing examples. + ### Data Preprocessing - **Cleaning**: Removing duplicates, handling missing values, and correcting errors. @@ -146,7 +77,7 @@ project_root/ - **Confusion Matrix**: Displaying the performance of each model. - **Learning Curves**: Visualizing model performance as a function of training data size. -These results are stored in the `data_pipeline` folder. +These results are stored in the `output` folder. ## Evaluation @@ -155,6 +86,25 @@ These results are stored in the `data_pipeline` folder. - **Confusion Matrix**: Displays the performance of each model in predicting "Safe" vs. "Not Safe" emails. - **Learning Curve**: A plot showing model performance (accuracy/loss) as a function of training data size, helping to visualize overfitting, underfitting, and the effectiveness of adding more training data. 
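For reference, a learning curve of this kind can be produced with scikit-learn's `learning_curve` utility; this is a minimal sketch on synthetic data, not the project's own `plot_learning_curve` helper, which may differ in details:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic stand-in for the extracted email features and labels.
X, y = make_classification(n_samples=600, n_features=20, random_state=42)

train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    cv=3, train_sizes=np.linspace(0.1, 1.0, 5), scoring="accuracy",
)
plt.plot(train_sizes, train_scores.mean(axis=1), marker="o", label="training")
plt.plot(train_sizes, val_scores.mean(axis=1), marker="o", label="validation")
plt.xlabel("Training set size")
plt.ylabel("Accuracy")
plt.legend()
plt.show()
```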
+ +## Installation + +To set up the project, clone the repository and install the necessary dependencies: + +```bash +git clone https://github.com/Koon-Kiat/Detecting-Spam-and-Phishing-Emails-Using-Machine-Learning +cd Detecting-Spam-and-Phishing-Emails-Using-Machine-Learning +conda create --name <env-name> python=3.8.20 +conda activate <env-name> +conda env update --file config/enviornment.yaml --prune +``` + +Once the dependencies are installed, you can run the phishing email detection program using the following command: + +```bash +python main.py +``` + ### Example Output ``` @@ -184,26 +134,12 @@ Classification Report for Test Data: weighted avg 0.XX 0.XX 0.XX XX ``` -## Flask Application - -The accompanying Flask application provides a user-friendly interface where users can input email content for real-time spam and phishing detection. The system returns an analysis of whether an email is "Safe" or "Not Safe." - -### Key Features: - -- **User Interface**: - - - The main interface is provided by `index.html` and `taskpane.html` located in the `templates` folder. - - Users can upload or paste email content for evaluation. - -- **Instant Feedback**: - - - The `/evaluateEmail` endpoint processes the email content and returns immediate results, flagging malicious content. - - This endpoint utilizes the `single_model_evaluation` module for classification. +## License -- **Integration**: - - The Flask app communicates with the machine learning model backend for classification. - - Static assets such as icons are served from the `static/assets` folder. +This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details. -### Example Usage: +## Acknowledgments -To evaluate an email, users can navigate to the main interface, input the email content, and submit it for evaluation. The system will process the input and provide instant feedback on whether the email is "Safe" or "Not Safe."
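The accuracy, confusion matrix, and classification report shown in the example output come from scikit-learn's standard metrics; here is a minimal sketch with placeholder predictions, assuming label 0 = "Safe" and 1 = "Not Safe":

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Placeholder fold results; in the project these come from the ensemble's
# predictions on a held-out fold.
y_test = [0, 1, 1, 0, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 0, 1, 1]

print(f"Test Accuracy: {accuracy_score(y_test, y_pred):.2f}")
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred, target_names=["Safe", "Not Safe"]))
```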
+ +- SpamAssassin Public Corpus +- CEAS 2008 Dataset Contributors +- Open Source ML Community diff --git a/enviornment.yaml b/config/enviornment.yaml similarity index 81% rename from enviornment.yaml rename to config/enviornment.yaml index 8ae7651..ceef7a0 100644 --- a/enviornment.yaml +++ b/config/enviornment.yaml @@ -28,6 +28,11 @@ dependencies: - tabulate=0.9.0 - pip - pip: - - -r requirements.txt + - matplotlib==3.7.5 + - seaborn==0.13.2 + - wordcloud==1.9.3 + - contractions==0.1.73 + - optuna==4.0.0 + - pyspellchecker==0.8.1 - datasets==3.0.2 --upgrade - https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.0.0/en_core_web_sm-3.0.0.tar.gz \ No newline at end of file diff --git a/datasets/ceas_08.csv b/data/ceas_08.csv similarity index 100% rename from datasets/ceas_08.csv rename to data/ceas_08.csv diff --git a/datasets/phishing_email.csv b/data/phishing_email.csv similarity index 100% rename from datasets/phishing_email.csv rename to data/phishing_email.csv diff --git a/manifest_python.xml b/extensions/manifest_python.xml similarity index 100% rename from manifest_python.xml rename to extensions/manifest_python.xml diff --git a/main.py b/main.py index 635b9b5..cc401c6 100644 --- a/main.py +++ b/main.py @@ -16,8 +16,8 @@ from imblearn.over_sampling import SMOTE # Handling imbalanced data import tensorflow as tf # TensorFlow library from bs4 import MarkupResemblesLocatorWarning # HTML and XML parsing -from datasets import load_dataset # Load datasets -from spamandphishingdetection import ( +from datasets import load_dataset # Load datasets +from src.spamandphishingdetection import ( initialize_environment, load_config, get_file_paths, @@ -53,7 +53,7 @@ def main(): nlp, loss_fn = initialize_environment(__file__) - config = load_config("config.json") + config = load_config("config/config.json") file_paths = get_file_paths(config) # Load the datasets @@ -200,7 +200,7 @@ def main(): # ************************* # logging.info(f"Beginning Data Cleaning ['body']...") df_clean_body = load_or_clean_data( - 'Merged Dataframe', combined_df, 'body', "data_pipeline/data_cleaning/cleaned_data_frame.csv", data_cleaning) + 'Merged Dataframe', combined_df, 'body', "output/main_model_evaluation/data_cleaning/cleaned_data_frame.csv", data_cleaning) # Verifying the Cleaned Combine DataFrame # Concatenate the Cleaned DataFrame with the Merged DataFrame @@ -218,7 +218,7 @@ def main(): # ***************************** # logging.info(f"Beginning Noise Injection...") noisy_df = generate_noisy_dataframe( - df_cleaned_combined, 'data_pipeline/noise_injection/noisy_data_frame.csv') + df_cleaned_combined, 'output/main_model_evaluation/noise_injection/noisy_data_frame.csv') logging.info(f"Noise Injection completed.\n") # ************************* # @@ -285,7 +285,7 @@ def main(): y_train=y_train, y_test=y_test, pipeline=pipeline, - dir='data_pipeline/feature_extraction', + dir='output/main_model_evaluation/feature_extraction', ) logging.info( f"Data for Fold {fold_idx} has been processed or loaded successfully.\n") diff --git a/additional_model_training/stacked_models/params/XGB_ADA_LG_Best_Params_Fold_1.json b/output/additional_models/stacked_models/params/XGB_ADA_LG_Best_Params_Fold_1.json similarity index 100% rename from additional_model_training/stacked_models/params/XGB_ADA_LG_Best_Params_Fold_1.json rename to output/additional_models/stacked_models/params/XGB_ADA_LG_Best_Params_Fold_1.json diff --git a/additional_model_training/stacked_models/params/XGB_KNN_LG_Best_Params_Fold_1.json 
b/output/additional_models/stacked_models/params/XGB_KNN_LG_Best_Params_Fold_1.json similarity index 100% rename from additional_model_training/stacked_models/params/XGB_KNN_LG_Best_Params_Fold_1.json rename to output/additional_models/stacked_models/params/XGB_KNN_LG_Best_Params_Fold_1.json diff --git a/additional_model_training/stacked_models/params/XGB_LightGB_LG_Best_Params_Fold_1.json b/output/additional_models/stacked_models/params/XGB_LightGB_LG_Best_Params_Fold_1.json similarity index 100% rename from additional_model_training/stacked_models/params/XGB_LightGB_LG_Best_Params_Fold_1.json rename to output/additional_models/stacked_models/params/XGB_LightGB_LG_Best_Params_Fold_1.json diff --git a/additional_model_training/stacked_models/params/XGB_RF_LG_Best_Params_Fold_1.json b/output/additional_models/stacked_models/params/XGB_RF_LG_Best_Params_Fold_1.json similarity index 100% rename from additional_model_training/stacked_models/params/XGB_RF_LG_Best_Params_Fold_1.json rename to output/additional_models/stacked_models/params/XGB_RF_LG_Best_Params_Fold_1.json diff --git a/data_pipeline/models_and_parameters/Best_Parameter_Fold_1.json b/output/main_model_evaluation/models_and_parameters/Best_Parameter_Fold_1.json similarity index 100% rename from data_pipeline/models_and_parameters/Best_Parameter_Fold_1.json rename to output/main_model_evaluation/models_and_parameters/Best_Parameter_Fold_1.json diff --git a/requirements.txt b/requirements.txt index ad956eb..ef3d959 100644 Binary files a/requirements.txt and b/requirements.txt differ diff --git a/additional_model_training/base_model.py b/scripts/base_model.py similarity index 97% rename from additional_model_training/base_model.py rename to scripts/base_model.py index cfcda28..05bb0d8 100644 --- a/additional_model_training/base_model.py +++ b/scripts/base_model.py @@ -60,7 +60,9 @@ def main(): nlp, loss_fn = initialize_environment(__file__) - config = load_config("config.json") + config_path = os.path.normpath(os.path.join( + os.path.dirname(__file__), '..', 'config', 'config.json')) + config = load_config(config_path) file_paths = get_file_paths(config) # Load the datasets @@ -207,7 +209,7 @@ def main(): # ************************* # logging.info(f"Beginning Data Cleaning ['body']...") df_clean_body = load_or_clean_data( - 'Merged Dataframe', combined_df, 'body', "data_pipeline/data_cleaning/cleaned_data_frame.csv", data_cleaning) + 'Merged Dataframe', combined_df, 'body', "output/main_model_evaluation/data_cleaning/cleaned_data_frame.csv", data_cleaning) # Verifying the Cleaned Combine DataFrame # Concatenate the Cleaned DataFrame with the Merged DataFrame @@ -225,7 +227,7 @@ def main(): # ***************************** # logging.info(f"Beginning Noise Injection...") noisy_df = generate_noisy_dataframe( - df_cleaned_combined, 'data_pipeline/noise_injection/noisy_data_frame.csv') + df_cleaned_combined, 'output/main_model_evaluation/noise_injection/noisy_data_frame.csv') logging.info(f"Noise Injection completed.\n") # ************************* # @@ -292,7 +294,7 @@ def main(): y_train=y_train, y_test=y_test, pipeline=pipeline, - dir='data_pipeline/feature_extraction', + dir='output/main_model_evaluation/feature_extraction', ) logging.info( f"Data for Fold {fold_idx} has been processed or loaded successfully.\n") diff --git a/additional_model_training/base_model_optuna.py b/scripts/base_model_optuna.py similarity index 97% rename from additional_model_training/base_model_optuna.py rename to scripts/base_model_optuna.py index 
b55b6ed..9d4b3df 100644 --- a/additional_model_training/base_model_optuna.py +++ b/scripts/base_model_optuna.py @@ -59,7 +59,9 @@ # Main processing function def main(): nlp, loss_fn = initialize_environment(__file__) - config = load_config("config.json") + config_path = os.path.normpath(os.path.join( + os.path.dirname(__file__), '..', 'config', 'config.json')) + config = load_config(config_path) file_paths = get_file_paths(config) # Load the datasets @@ -206,7 +208,7 @@ def main(): # ************************* # logging.info(f"Beginning Data Cleaning ['body']...") df_clean_body = load_or_clean_data( - 'Merged Dataframe', combined_df, 'body', "data_pipeline/data_cleaning/cleaned_data_frame.csv", data_cleaning) + 'Merged Dataframe', combined_df, 'body', "output/main_model_evaluation/data_cleaning/cleaned_data_frame.csv", data_cleaning) # Verifying the Cleaned Combine DataFrame # Concatenate the Cleaned DataFrame with the Merged DataFrame @@ -224,7 +226,7 @@ def main(): # ***************************** # logging.info(f"Beginning Noise Injection...") noisy_df = generate_noisy_dataframe( - df_cleaned_combined, 'data_pipeline/noise_injection/noisy_data_frame.csv') + df_cleaned_combined, 'output/main_model_evaluation/noise_injection/noisy_data_frame.csv') logging.info(f"Noise Injection completed.\n") # ************************* # @@ -291,7 +293,7 @@ def main(): y_train=y_train, y_test=y_test, pipeline=pipeline, - dir='data_pipeline/feature_extraction', + dir='output/main_model_evaluation/feature_extraction', ) logging.info( f"Data for Fold {fold_idx} has been processed or loaded successfully.\n") diff --git a/evaluation_on_third_dataset.py b/scripts/evaluation_on_third_dataset.py similarity index 97% rename from evaluation_on_third_dataset.py rename to scripts/evaluation_on_third_dataset.py index 18d96f3..b72dc84 100644 --- a/evaluation_on_third_dataset.py +++ b/scripts/evaluation_on_third_dataset.py @@ -10,7 +10,8 @@ import re # Regular expressions from tqdm import tqdm # Progress bar import joblib # Joblib library -from sklearn.metrics import accuracy_score, confusion_matrix, classification_report # Evaluation metrics +# Evaluation metrics +from sklearn.metrics import accuracy_score, confusion_matrix, classification_report from tabulate import tabulate # Pretty-print tabular data from sklearn.compose import ColumnTransformer from sklearn.impute import SimpleImputer @@ -30,7 +31,7 @@ from typing import Dict, List, Union # Type hints import pickle # Serialization library from sklearn.decomposition import PCA # Dimensionality reduction -from spamandphishingdetection import ( +from src.spamandphishingdetection import ( initialize_environment, DatasetProcessor, count_urls, @@ -40,7 +41,7 @@ BERTFeatureExtractor, BERTFeatureTransformer, ) -from evaluationonthirddataset import ( +from src.evaluationonthirddataset import ( load_config, get_file_paths, load_or_extract_headers, @@ -49,7 +50,6 @@ ) - def main(): nlp, loss_fn = initialize_environment(__file__) config = load_config() @@ -134,8 +134,10 @@ def main(): else: logging.info( "The number of rows in the Merge Evaluation Dataframe matches the Processed Evaluation Dataframe.") - merged_evaluation.to_csv(file_paths['merged_evaluation_file'], index=False) - logging.info(f"Data successfully saved to: {file_paths['merged_evaluation_file']}") + merged_evaluation.to_csv( + file_paths['merged_evaluation_file'], index=False) + logging.info( + f"Data successfully saved to: {file_paths['merged_evaluation_file']}") logging.info("Data Integration completed.\n") # 
************************* # diff --git a/additional_model_training/xgb_ada_lg.py b/scripts/xgb_ada_lg.py similarity index 95% rename from additional_model_training/xgb_ada_lg.py rename to scripts/xgb_ada_lg.py index 8de42c9..7b7a7b0 100644 --- a/additional_model_training/xgb_ada_lg.py +++ b/scripts/xgb_ada_lg.py @@ -53,8 +53,9 @@ def main(): nlp, loss_fn = initialize_environment(__file__) - - config = load_config("config.json") + config_path = os.path.normpath(os.path.join( + os.path.dirname(__file__), '..', 'config', 'config.json')) + config = load_config(config_path) file_paths = get_file_paths(config) # Load the datasets @@ -201,7 +202,7 @@ def main(): # ************************* # logging.info(f"Beginning Data Cleaning ['body']...") df_clean_body = load_or_clean_data( - 'Merged Dataframe', combined_df, 'body', "data_pipeline/data_cleaning/cleaned_data_frame.csv", data_cleaning) + 'Merged Dataframe', combined_df, 'body', "output/main_model_evaluation/data_cleaning/cleaned_data_frame.csv", data_cleaning) # Verifying the Cleaned Combine DataFrame # Concatenate the Cleaned DataFrame with the Merged DataFrame @@ -219,7 +220,7 @@ def main(): # ***************************** # logging.info(f"Beginning Noise Injection...") noisy_df = generate_noisy_dataframe( - df_cleaned_combined, 'data_pipeline/noise_injection/noisy_data_frame.csv') + df_cleaned_combined, 'output/main_model_evaluation/noise_injection/noisy_data_frame.csv') logging.info(f"Noise Injection completed.\n") # ************************* # @@ -286,7 +287,7 @@ def main(): y_train=y_train, y_test=y_test, pipeline=pipeline, - dir='data_pipeline/feature_extraction', + dir='output/main_model_evaluation/feature_extraction', ) logging.info( f"Data for Fold {fold_idx} has been processed or loaded successfully.\n") @@ -295,13 +296,12 @@ def main(): # ***************************************** # logging.info( f"Beginning Model Training and Evaluation for Fold {fold_idx}...") - with open(os.path.join(os.path.dirname(__file__), '..', 'config.json')) as config_file: - config = json.load(config_file) - base_dir = config['base_dir'] + + base_dir = config['base_dir'] # Train the model and evaluate the performance for each fold model_path = os.path.join( - base_dir, 'additional_model_training', 'stacked_models', f'XGB_ADA_LG_Fold_{fold_idx}.pkl') - params_path = os.path.join(base_dir, 'additional_model_training', 'stacked_models', + base_dir, 'output', 'additional_models', 'stacked_models', f'XGB_ADA_LG_Fold_{fold_idx}.pkl') + params_path = os.path.join(base_dir, 'output', 'additional_models', 'stacked_models', 'params', f'XGB_ADA_LG_Best_Params_Fold_{fold_idx}.json') ensemble_model, test_accuracy = xgb_ada_lg_model_training( X_train_balanced, diff --git a/additional_model_training/xgb_knn_lg.py b/scripts/xgb_knn_lg.py similarity index 95% rename from additional_model_training/xgb_knn_lg.py rename to scripts/xgb_knn_lg.py index 2fbf14a..cfd58c3 100644 --- a/additional_model_training/xgb_knn_lg.py +++ b/scripts/xgb_knn_lg.py @@ -53,8 +53,9 @@ def main(): nlp, loss_fn = initialize_environment(__file__) - - config = load_config("config.json") + config_path = os.path.normpath(os.path.join( + os.path.dirname(__file__), '..', 'config', 'config.json')) + config = load_config(config_path) file_paths = get_file_paths(config) # Load the datasets @@ -201,7 +202,7 @@ def main(): # ************************* # logging.info(f"Beginning Data Cleaning ['body']...") df_clean_body = load_or_clean_data( - 'Merged Dataframe', combined_df, 'body', 
"data_pipeline/data_cleaning/cleaned_data_frame.csv", data_cleaning) + 'Merged Dataframe', combined_df, 'body', "output/main_model_evaluation/data_cleaning/cleaned_data_frame.csv", data_cleaning) # Verifying the Cleaned Combine DataFrame # Concatenate the Cleaned DataFrame with the Merged DataFrame @@ -219,7 +220,7 @@ def main(): # ***************************** # logging.info(f"Beginning Noise Injection...") noisy_df = generate_noisy_dataframe( - df_cleaned_combined, 'data_pipeline/noise_injection/noisy_data_frame.csv') + df_cleaned_combined, 'output/main_model_evaluation/noise_injection/noisy_data_frame.csv') logging.info(f"Noise Injection completed.\n") # ************************* # @@ -286,7 +287,7 @@ def main(): y_train=y_train, y_test=y_test, pipeline=pipeline, - dir='data_pipeline/feature_extraction', + dir='output/main_model_evaluation/feature_extraction', ) logging.info( f"Data for Fold {fold_idx} has been processed or loaded successfully.\n") @@ -296,13 +297,12 @@ def main(): # ***************************************** # logging.info( f"Beginning Model Training and Evaluation for Fold {fold_idx}...") - with open(os.path.join(os.path.dirname(__file__), '..', 'config.json')) as config_file: - config = json.load(config_file) - base_dir = config['base_dir'] + + base_dir = config['base_dir'] # Train the model and evaluate the performance for each fold model_path = os.path.join( - base_dir, 'additional_model_training', 'stacked_models', f'XGB_KNN_LG_Fold_{fold_idx}.pkl') - params_path = os.path.join(base_dir, 'additional_model_training', 'stacked_models', + base_dir, 'output', 'additional_models', 'stacked_models', f'XGB_KNN_LG_Fold_{fold_idx}.pkl') + params_path = os.path.join(base_dir, 'output', 'additional_models', 'stacked_models', 'params', f'XGB_KNN_LG_Best_Params_Fold_{fold_idx}.json') ensemble_model, test_accuracy = xgb_knn_lg_model_training( X_train_balanced, diff --git a/additional_model_training/xgb_lightgb_lg.py b/scripts/xgb_lightgb_lg.py similarity index 95% rename from additional_model_training/xgb_lightgb_lg.py rename to scripts/xgb_lightgb_lg.py index 70d87e1..58ecdbe 100644 --- a/additional_model_training/xgb_lightgb_lg.py +++ b/scripts/xgb_lightgb_lg.py @@ -53,8 +53,9 @@ def main(): nlp, loss_fn = initialize_environment(__file__) - - config = load_config("config.json") + config_path = os.path.normpath(os.path.join( + os.path.dirname(__file__), '..', 'config', 'config.json')) + config = load_config(config_path) file_paths = get_file_paths(config) # Load the datasets @@ -201,7 +202,7 @@ def main(): # ************************* # logging.info(f"Beginning Data Cleaning ['body']...") df_clean_body = load_or_clean_data( - 'Merged Dataframe', combined_df, 'body', "data_pipeline/data_cleaning/cleaned_data_frame.csv", data_cleaning) + 'Merged Dataframe', combined_df, 'body', "output/main_model_evaluation/data_cleaning/cleaned_data_frame.csv", data_cleaning) # Verifying the Cleaned Combine DataFrame # Concatenate the Cleaned DataFrame with the Merged DataFrame @@ -219,7 +220,7 @@ def main(): # ***************************** # logging.info(f"Beginning Noise Injection...") noisy_df = generate_noisy_dataframe( - df_cleaned_combined, 'data_pipeline/noise_injection/noisy_data_frame.csv') + df_cleaned_combined, 'output/main_model_evaluation/noise_injection/noisy_data_frame.csv') logging.info(f"Noise Injection completed.\n") # ************************* # @@ -286,7 +287,7 @@ def main(): y_train=y_train, y_test=y_test, pipeline=pipeline, - dir='data_pipeline/feature_extraction', + 
dir='output/main_model_evaluation/feature_extraction', ) logging.info( f"Data for Fold {fold_idx} has been processed or loaded successfully.\n") @@ -297,12 +298,11 @@ def main(): logging.info( f"Beginning Model Training and Evaluation for Fold {fold_idx}...") # Train the model and evaluate the performance for each fold - with open(os.path.join(os.path.dirname(__file__), '..', 'config.json')) as config_file: - config = json.load(config_file) - base_dir = config['base_dir'] + + base_dir = config['base_dir'] model_path = os.path.join( - base_dir, 'additional_model_training', 'stacked_models', f'XGB_LightGB_LG_Fold_{fold_idx}.pkl') - params_path = os.path.join(base_dir, 'additional_model_training', 'stacked_models', + base_dir, 'output', 'additional_models', 'stacked_models', f'XGB_LightGB_LG_Fold_{fold_idx}.pkl') + params_path = os.path.join(base_dir, 'output', 'additional_models', 'stacked_models', 'params', f'XGB_LightGB_LG_Best_Params_Fold_{fold_idx}.json') ensemble_model, test_accuracy = xgb_lightgb_lg_model_training( X_train_balanced, diff --git a/additional_model_training/xgb_rf_lg.py b/scripts/xgb_rf_lg.py similarity index 95% rename from additional_model_training/xgb_rf_lg.py rename to scripts/xgb_rf_lg.py index 496e88a..2bf208b 100644 --- a/additional_model_training/xgb_rf_lg.py +++ b/scripts/xgb_rf_lg.py @@ -53,8 +53,9 @@ def main(): nlp, loss_fn = initialize_environment(__file__) - - config = load_config("config.json") + config_path = os.path.normpath(os.path.join( + os.path.dirname(__file__), '..', 'config', 'config.json')) + config = load_config(config_path) file_paths = get_file_paths(config) # Load the datasets @@ -201,7 +202,7 @@ def main(): # ************************* # logging.info(f"Beginning Data Cleaning ['body']...") df_clean_body = load_or_clean_data( - 'Merged Dataframe', combined_df, 'body', "data_pipeline/data_cleaning/cleaned_data_frame.csv", data_cleaning) + 'Merged Dataframe', combined_df, 'body', "output/main_model_evaluation/data_cleaning/cleaned_data_frame.csv", data_cleaning) # Verifying the Cleaned Combine DataFrame # Concatenate the Cleaned DataFrame with the Merged DataFrame @@ -219,7 +220,7 @@ def main(): # ***************************** # logging.info(f"Beginning Noise Injection...") noisy_df = generate_noisy_dataframe( - df_cleaned_combined, 'data_pipeline/noise_injection/noisy_data_frame.csv') + df_cleaned_combined, 'output/main_model_evaluation/noise_injection/noisy_data_frame.csv') logging.info(f"Noise Injection completed.\n") # ************************* # @@ -286,7 +287,7 @@ def main(): y_train=y_train, y_test=y_test, pipeline=pipeline, - dir='data_pipeline/feature_extraction', + dir='output/main_model_evaluation/feature_extraction', ) logging.info( f"Data for Fold {fold_idx} has been processed or loaded successfully.\n") @@ -298,12 +299,11 @@ def main(): f"Beginning Model Training and Evaluation for Fold {fold_idx}...") # Train the model and evaluate the performance for each fold # Train the model and evaluate the performance for each fold - with open(os.path.join(os.path.dirname(__file__), '..', 'config.json')) as config_file: - config = json.load(config_file) - base_dir = config['base_dir'] + + base_dir = config['base_dir'] model_path = os.path.join( - base_dir, base_dir, 'additional_model_training', 'stacked_models', f'XGB_RF_LG_Fold_{fold_idx}.pkl') - params_path = os.path.join(base_dir, base_dir, 'additional_model_training', 'stacked_models', + base_dir, 'output', 'additional_models', 'stacked_models', 
f'XGB_RF_LG_Fold_{fold_idx}.pkl') + params_path = os.path.join(base_dir, 'output', 'additional_models', 'stacked_models', 'params', f'XGB_RF_LG_Best_Params_Fold_{fold_idx}.json') ensemble_model, test_accuracy = xgb_rf_lg_model_training( X_train_balanced, diff --git a/spamandphishingdetection/file_operations.py b/spamandphishingdetection/file_operations.py deleted file mode 100644 index 7ab2a1d..0000000 --- a/spamandphishingdetection/file_operations.py +++ /dev/null @@ -1,49 +0,0 @@ -import json -import os - - -def load_config(config_path='config.json'): - with open(config_path, 'r') as config_file: - config = json.load(config_file) - return config - - -def ensure_directory_exists(path): - if not os.path.exists(path): - os.makedirs(path) - - -def get_file_paths(config): - base_dir = config['base_dir'] - file_paths = { - 'ceas_08_dataset': os.path.join(base_dir, 'datasets', 'ceas_08.csv'), - 'preprocessed_spam_assassin_file': os.path.join(base_dir, 'data_pipeline', 'data_preprocessing', 'preprocessed_spam_assassin.csv'), - 'preprocessed_ceas_file': os.path.join(base_dir, 'data_pipeline', 'data_preprocessing', 'preprocessed_ceas_08.csv'), - 'extracted_spam_assassin_email_header_file': os.path.join(base_dir, 'data_pipeline', 'feature_engineering', 'spam_assassin_extracted_email_header.csv'), - 'extracted_ceas_email_header_file': os.path.join(base_dir, 'data_pipeline', 'feature_engineering', 'ceas_extracted_email_header.csv'), - 'merged_spam_assassin_file': os.path.join(base_dir, 'data_pipeline', 'data_integration', 'merged_spam_assassin.csv'), - 'merged_ceas_file': os.path.join(base_dir, 'data_pipeline', 'data_integration', 'merged_ceas_08.csv'), - 'merged_data_frame': os.path.join(base_dir, 'data_pipeline', 'data_integration', 'merged_data_frame.csv'), - 'cleaned_data_frame': os.path.join(base_dir, 'data_pipeline', 'data_cleaning', 'cleaned_data_frame.csv'), - 'cleaned_ceas_headers': os.path.join(base_dir, 'data_pipeline', 'data_cleaning', 'cleaned_ceas_headers.csv'), - 'merged_cleaned_ceas_headers': os.path.join(base_dir, 'data_pipeline', 'data_cleaning', 'merged_cleaned_ceas_headers.csv'), - 'merged_cleaned_data_frame': os.path.join(base_dir, 'data_pipeline', 'data_cleaning', 'merged_cleaned_data_frame.csv'), - 'noisy_data_frame': os.path.join(base_dir, 'data_pipeline', 'noise_injection', 'noisy_data_frame.csv'), - 'pipeline_path': os.path.join(base_dir, 'data_pipeline', 'feature_extraction') - } - - # Ensure directories exist - for path in file_paths.values(): - ensure_directory_exists(os.path.dirname(path)) - - return file_paths - - -def get_model_path(config, fold_idx): - base_dir = config['base_dir'] - return os.path.join(base_dir, 'data_pipeline', 'models_and_parameters', f'Ensemble_Model_Fold_{fold_idx}.pkl') - - -def get_params_path(config, fold_idx): - base_dir = config['base_dir'] - return os.path.join(base_dir, 'data_pipeline', 'models_and_parameters', f'Best_Parameter_Fold_{fold_idx}.json') diff --git a/evaluationonthirddataset/__init__.py b/src/evaluationonthirddataset/__init__.py similarity index 100% rename from evaluationonthirddataset/__init__.py rename to src/evaluationonthirddataset/__init__.py diff --git a/evaluationonthirddataset/feature_engineering.py b/src/evaluationonthirddataset/feature_engineering.py similarity index 100% rename from evaluationonthirddataset/feature_engineering.py rename to src/evaluationonthirddataset/feature_engineering.py diff --git a/evaluationonthirddataset/file_operations.py b/src/evaluationonthirddataset/file_operations.py similarity 
index 91% rename from evaluationonthirddataset/file_operations.py rename to src/evaluationonthirddataset/file_operations.py index b14f496..0b09d9f 100644 --- a/evaluationonthirddataset/file_operations.py +++ b/src/evaluationonthirddataset/file_operations.py @@ -16,7 +16,7 @@ def ensure_directory_exists(path): def get_file_paths(config): base_dir = config['base_dir'] file_paths = { - 'dataset': os.path.join(base_dir, 'datasets', 'phishing_email.csv'), + 'dataset': os.path.join(base_dir, 'data', 'phishing_email.csv'), 'preprocessed_evaluation_dataset': os.path.join( base_dir, 'third_dataset_evaluation', 'data_preprocessing', 'preprocessed_evaluation_dataset.csv'), 'extracted_evaluation_header_file': os.path.join( @@ -27,7 +27,7 @@ def get_file_paths(config): base_dir, 'third_dataset_evaluation', 'data_integration', 'merged_evaluation.csv'), 'merged_cleaned_data_frame': os.path.join( base_dir, 'third_dataset_evaluation', 'data_cleaning', 'merged_cleaned_data_frame.csv'), - 'main_model': os.path.join(base_dir, 'data_pipeline', 'models_and_parameters'), + 'main_model': os.path.join(base_dir, 'output', 'main_model_evaluation', 'models_and_parameters'), 'base_model': os.path.join( base_dir, 'additional_model_training', 'base_models'), 'base_model_optuna': os.path.join( diff --git a/evaluationonthirddataset/pipeline.py b/src/evaluationonthirddataset/pipeline.py similarity index 92% rename from evaluationonthirddataset/pipeline.py rename to src/evaluationonthirddataset/pipeline.py index 3ef4ca4..e2b2533 100644 --- a/evaluationonthirddataset/pipeline.py +++ b/src/evaluationonthirddataset/pipeline.py @@ -5,13 +5,13 @@ import joblib -def save_data_pipeline(data, labels, data_path, labels_path): +def save_output(data, labels, data_path, labels_path): np.savez(data_path, data=data) with open(labels_path, 'wb') as f: pickle.dump(labels, f) -def load_data_pipeline(data_path, labels_path): +def load_output(data_path, labels_path): data = np.load(data_path)['data'] with open(labels_path, 'rb') as f: labels = pickle.load(f) @@ -76,7 +76,7 @@ def run_pipeline_or_load(data, labels, pipeline, dir): # Save the preprocessed data logging.info("Saving processed data...") - save_data_pipeline(data_combined, labels, data_path, labels_path) + save_output(data_combined, labels, data_path, labels_path) else: # Load the preprocessor logging.info(f"Loading preprocessor from {preprocessor_path}...") @@ -84,7 +84,7 @@ def run_pipeline_or_load(data, labels, pipeline, dir): # Load the preprocessed data logging.info("Loading preprocessed data...") - data_combined, labels = load_data_pipeline(data_path, labels_path) + data_combined, labels = load_output(data_path, labels_path) return data_combined, labels diff --git a/spamandphishingdetection/__init__.py b/src/spamandphishingdetection/__init__.py similarity index 99% rename from spamandphishingdetection/__init__.py rename to src/spamandphishingdetection/__init__.py index 509c8d4..8fd1606 100644 --- a/spamandphishingdetection/__init__.py +++ b/src/spamandphishingdetection/__init__.py @@ -35,7 +35,6 @@ from .pipeline import run_pipeline_or_load from .learning_curve import plot_learning_curve - from .modeltraining.base_model import model_training as base_model_training from .modeltraining.main_model import model_training as main_model_training from .modeltraining.base_model_optuna import model_training as base_model_training_optuna diff --git a/spamandphishingdetection/bert.py b/src/spamandphishingdetection/bert.py similarity index 100% rename from spamandphishingdetection/bert.py rename to 
src/spamandphishingdetection/bert.py diff --git a/spamandphishingdetection/data_cleaning.py b/src/spamandphishingdetection/data_cleaning.py similarity index 100% rename from spamandphishingdetection/data_cleaning.py rename to src/spamandphishingdetection/data_cleaning.py diff --git a/spamandphishingdetection/data_cleaning_headers.py b/src/spamandphishingdetection/data_cleaning_headers.py similarity index 100% rename from spamandphishingdetection/data_cleaning_headers.py rename to src/spamandphishingdetection/data_cleaning_headers.py diff --git a/spamandphishingdetection/data_integration.py b/src/spamandphishingdetection/data_integration.py similarity index 100% rename from spamandphishingdetection/data_integration.py rename to src/spamandphishingdetection/data_integration.py diff --git a/spamandphishingdetection/data_splitting.py b/src/spamandphishingdetection/data_splitting.py similarity index 98% rename from spamandphishingdetection/data_splitting.py rename to src/spamandphishingdetection/data_splitting.py index 22ade18..ff9d064 100644 --- a/spamandphishingdetection/data_splitting.py +++ b/src/spamandphishingdetection/data_splitting.py @@ -4,7 +4,7 @@ from sklearn.model_selection import StratifiedKFold -def stratified_k_fold_split(df, n_splits=3, random_state=42, output_dir='data_pipeline/data_splitting'): +def stratified_k_fold_split(df, n_splits=3, random_state=42, output_dir='output/main_model_evaluation/data_splitting'): """ Performs Stratified K-Fold splitting on the DataFrame. diff --git a/spamandphishingdetection/dataset_processor.py b/src/spamandphishingdetection/dataset_processor.py similarity index 100% rename from spamandphishingdetection/dataset_processor.py rename to src/spamandphishingdetection/dataset_processor.py diff --git a/spamandphishingdetection/feature_engineering.py b/src/spamandphishingdetection/feature_engineering.py similarity index 100% rename from spamandphishingdetection/feature_engineering.py rename to src/spamandphishingdetection/feature_engineering.py diff --git a/src/spamandphishingdetection/file_operations.py b/src/spamandphishingdetection/file_operations.py new file mode 100644 index 0000000..51038d1 --- /dev/null +++ b/src/spamandphishingdetection/file_operations.py @@ -0,0 +1,49 @@ +import json +import os + + +def load_config(config_path='config.json'): + with open(config_path, 'r') as config_file: + config = json.load(config_file) + return config + + +def ensure_directory_exists(path): + if not os.path.exists(path): + os.makedirs(path) + + +def get_file_paths(config): + base_dir = config['base_dir'] + file_paths = { + 'ceas_08_dataset': os.path.join(base_dir, 'data', 'ceas_08.csv'), + 'preprocessed_spam_assassin_file': os.path.join(base_dir, 'output', 'main_model_evaluation', 'data_preprocessing', 'preprocessed_spam_assassin.csv'), + 'preprocessed_ceas_file': os.path.join(base_dir, 'output', 'main_model_evaluation', 'data_preprocessing', 'preprocessed_ceas_08.csv'), + 'extracted_spam_assassin_email_header_file': os.path.join(base_dir, 'output', 'main_model_evaluation', 'feature_engineering', 'spam_assassin_extracted_email_header.csv'), + 'extracted_ceas_email_header_file': os.path.join(base_dir, 'output', 'main_model_evaluation', 'feature_engineering', 'ceas_extracted_email_header.csv'), + 'merged_spam_assassin_file': os.path.join(base_dir, 'output', 'main_model_evaluation', 'data_integration', 'merged_spam_assassin.csv'), + 'merged_ceas_file': os.path.join(base_dir, 'output', 'main_model_evaluation', 'data_integration', 'merged_ceas_08.csv'), + 
'merged_data_frame': os.path.join(base_dir, 'output', 'main_model_evaluation', 'data_integration', 'merged_data_frame.csv'), + 'cleaned_data_frame': os.path.join(base_dir, 'output', 'main_model_evaluation', 'data_cleaning', 'cleaned_data_frame.csv'), + 'cleaned_ceas_headers': os.path.join(base_dir, 'output', 'main_model_evaluation', 'data_cleaning', 'cleaned_ceas_headers.csv'), + 'merged_cleaned_ceas_headers': os.path.join(base_dir, 'output', 'main_model_evaluation', 'data_cleaning', 'merged_cleaned_ceas_headers.csv'), + 'merged_cleaned_data_frame': os.path.join(base_dir, 'output', 'main_model_evaluation', 'data_cleaning', 'merged_cleaned_data_frame.csv'), + 'noisy_data_frame': os.path.join(base_dir, 'output', 'main_model_evaluation', 'noise_injection', 'noisy_data_frame.csv'), + 'pipeline_path': os.path.join(base_dir, 'output', 'main_model_evaluation', 'feature_extraction') + } + + # Ensure directories exist + for path in file_paths.values(): + ensure_directory_exists(os.path.dirname(path)) + + return file_paths + + +def get_model_path(config, fold_idx): + base_dir = config['base_dir'] + return os.path.join(base_dir, 'output', 'main_model_evaluation', 'models_and_parameters', f'Ensemble_Model_Fold_{fold_idx}.pkl') + + +def get_params_path(config, fold_idx): + base_dir = config['base_dir'] + return os.path.join(base_dir, 'output', 'main_model_evaluation', 'models_and_parameters', f'Best_Parameter_Fold_{fold_idx}.json') diff --git a/spamandphishingdetection/label_processing.py b/src/spamandphishingdetection/label_processing.py similarity index 100% rename from spamandphishingdetection/label_processing.py rename to src/spamandphishingdetection/label_processing.py diff --git a/spamandphishingdetection/learning_curve.py b/src/spamandphishingdetection/learning_curve.py similarity index 100% rename from spamandphishingdetection/learning_curve.py rename to src/spamandphishingdetection/learning_curve.py diff --git a/spamandphishingdetection/missing_values.py b/src/spamandphishingdetection/missing_values.py similarity index 100% rename from spamandphishingdetection/missing_values.py rename to src/spamandphishingdetection/missing_values.py diff --git a/spamandphishingdetection/modeltraining/__init__.py b/src/spamandphishingdetection/modeltraining/__init__.py similarity index 100% rename from spamandphishingdetection/modeltraining/__init__.py rename to src/spamandphishingdetection/modeltraining/__init__.py diff --git a/spamandphishingdetection/modeltraining/base_model.py b/src/spamandphishingdetection/modeltraining/base_model.py similarity index 87% rename from spamandphishingdetection/modeltraining/base_model.py rename to src/spamandphishingdetection/modeltraining/base_model.py index c43df41..a5f835b 100644 --- a/spamandphishingdetection/modeltraining/base_model.py +++ b/src/spamandphishingdetection/modeltraining/base_model.py @@ -5,12 +5,14 @@ import json from sklearn.metrics import accuracy_score, confusion_matrix, classification_report -with open(os.path.join(os.path.dirname(__file__), '..', '..', 'config.json')) as config_file: +with open(os.path.join(os.path.dirname(__file__), '..', '..', '..', 'config', 'config.json')) as config_file: config = json.load(config_file) base_dir = config['base_dir'] + # Define model_path -model_path = os.path.join(base_dir, 'additional_model_training', 'base_models') +model_path = os.path.join(base_dir, 'output', 'additional_models', 'base_models') +os.makedirs(model_path, exist_ok=True) def model_training(X_train, y_train, X_test, y_test, model, model_name): diff 
--git a/spamandphishingdetection/modeltraining/base_model_optuna.py b/src/spamandphishingdetection/modeltraining/base_model_optuna.py similarity index 96% rename from spamandphishingdetection/modeltraining/base_model_optuna.py rename to src/spamandphishingdetection/modeltraining/base_model_optuna.py index 667b4ec..01d1a59 100644 --- a/spamandphishingdetection/modeltraining/base_model_optuna.py +++ b/src/spamandphishingdetection/modeltraining/base_model_optuna.py @@ -11,15 +11,17 @@ from sklearn.ensemble import RandomForestClassifier from sklearn.neighbors import KNeighborsClassifier from lightgbm import LGBMClassifier +from sklearn.metrics import accuracy_score, confusion_matrix, classification_report - -with open(os.path.join(os.path.dirname(__file__), '..', '..', 'config.json')) as config_file: +with open(os.path.join(os.path.dirname(__file__), '..', '..', '..', 'config', 'config.json')) as config_file: config = json.load(config_file) base_dir = config['base_dir'] # Define model_path and param_path -model_path = os.path.join(base_dir, 'additional_model_training', 'base_models_optuna') -param_path = os.path.join(base_dir, 'additional_model_training', 'base_models_optuna') +model_path = os.path.join( + base_dir, 'output', 'additional_models', 'base_models_optuna') +param_path = os.path.join( + base_dir, 'output', 'additional_models', 'base_models_optuna') def conduct_optuna_study(X_train, y_train, model_name): diff --git a/spamandphishingdetection/modeltraining/main_model.py b/src/spamandphishingdetection/modeltraining/main_model.py similarity index 100% rename from spamandphishingdetection/modeltraining/main_model.py rename to src/spamandphishingdetection/modeltraining/main_model.py diff --git a/spamandphishingdetection/modeltraining/xgb_ada_lg_model.py b/src/spamandphishingdetection/modeltraining/xgb_ada_lg_model.py similarity index 100% rename from spamandphishingdetection/modeltraining/xgb_ada_lg_model.py rename to src/spamandphishingdetection/modeltraining/xgb_ada_lg_model.py diff --git a/spamandphishingdetection/modeltraining/xgb_knn_lg_model.py b/src/spamandphishingdetection/modeltraining/xgb_knn_lg_model.py similarity index 100% rename from spamandphishingdetection/modeltraining/xgb_knn_lg_model.py rename to src/spamandphishingdetection/modeltraining/xgb_knn_lg_model.py diff --git a/spamandphishingdetection/modeltraining/xgb_lightgb_lg_model.py b/src/spamandphishingdetection/modeltraining/xgb_lightgb_lg_model.py similarity index 100% rename from spamandphishingdetection/modeltraining/xgb_lightgb_lg_model.py rename to src/spamandphishingdetection/modeltraining/xgb_lightgb_lg_model.py diff --git a/spamandphishingdetection/modeltraining/xgb_rf_lg_model.py b/src/spamandphishingdetection/modeltraining/xgb_rf_lg_model.py similarity index 100% rename from spamandphishingdetection/modeltraining/xgb_rf_lg_model.py rename to src/spamandphishingdetection/modeltraining/xgb_rf_lg_model.py diff --git a/spamandphishingdetection/noise_injection.py b/src/spamandphishingdetection/noise_injection.py similarity index 100% rename from spamandphishingdetection/noise_injection.py rename to src/spamandphishingdetection/noise_injection.py diff --git a/spamandphishingdetection/pipeline.py b/src/spamandphishingdetection/pipeline.py similarity index 82% rename from spamandphishingdetection/pipeline.py rename to src/spamandphishingdetection/pipeline.py index 65d5bee..158ed63 100644 --- a/spamandphishingdetection/pipeline.py +++ b/src/spamandphishingdetection/pipeline.py @@ -122,10 +122,10 @@ def 
run_pipeline_or_load(fold_idx, X_train, X_test, y_train, y_test, pipeline, d # Save the preprocessed data logging.info(f"Saving processed data for fold {fold_idx}...") - save_data_pipeline(X_train_balanced, y_train_balanced, - train_data_path, train_labels_path) - save_data_pipeline(X_test_combined, y_test, - test_data_path, test_labels_path) + save_output(X_train_balanced, y_train_balanced, + train_data_path, train_labels_path) + save_output(X_test_combined, y_test, + test_data_path, test_labels_path) else: # Load the preprocessor logging.info(f"Loading preprocessor from {preprocessor_path}...") @@ -133,15 +133,15 @@ def run_pipeline_or_load(fold_idx, X_train, X_test, y_train, y_test, pipeline, d # Load the preprocessed data logging.info(f"Loading preprocessed data for fold {fold_idx}...") - X_train_balanced, y_train_balanced = load_data_pipeline( + X_train_balanced, y_train_balanced = load_output( train_data_path, train_labels_path) - X_test_combined, y_test = load_data_pipeline( + X_test_combined, y_test = load_output( test_data_path, test_labels_path) return X_train_balanced, X_test_combined, y_train_balanced, y_test -def save_data_pipeline(data, labels, data_path, labels_path): +def save_output(data, labels, data_path, labels_path): """ Save the data and labels to specified file paths. @@ -163,7 +163,7 @@ def save_data_pipeline(data, labels, data_path, labels_path): dump(labels, labels_path) -def load_data_pipeline(data_path, labels_path): +def load_output(data_path, labels_path): """ Load the data and labels from specified file paths. @@ -188,7 +188,7 @@ def load_data_pipeline(data_path, labels_path): return data, labels -def get_fold_paths(fold_idx, base_dir='feature_extraction'): +def get_fold_paths(fold_idx, base_dir): """ Generates file paths for the train and test data and labels for the specified fold. @@ -204,13 +204,27 @@ def get_fold_paths(fold_idx, base_dir='feature_extraction'): tuple The file paths for the train data, test data, train labels, test labels, and preprocessor. 
""" - train_data_path = os.path.join(base_dir, f"Fold_{fold_idx}_Train_Data.npz") - test_data_path = os.path.join(base_dir, f"Fold_{fold_idx}_Test_Data.npz") - train_labels_path = os.path.join( - base_dir, f"Fold_{fold_idx}_Train_Labels.pkl") - test_labels_path = os.path.join( - base_dir, f"Fold_{fold_idx}_Test_Labels.pkl") - preprocessor_path = os.path.join( - base_dir, f"Fold_{fold_idx}_Preprocessor.pkl") + train_data_path = os.path.normpath(os.path.join( + base_dir, f"Fold_{fold_idx}_Train_Data.npz")) + test_data_path = os.path.normpath(os.path.join( + base_dir, f"Fold_{fold_idx}_Test_Data.npz")) + train_labels_path = os.path.normpath(os.path.join( + base_dir, f"Fold_{fold_idx}_Train_Labels.pkl")) + test_labels_path = os.path.normpath(os.path.join( + base_dir, f"Fold_{fold_idx}_Test_Labels.pkl")) + preprocessor_path = os.path.normpath(os.path.join( + base_dir, f"Fold_{fold_idx}_Preprocessor.pkl")) + + # Check if the files exist + if not os.path.exists(train_data_path): + logging.error(f"Train data file not found: {train_data_path}") + if not os.path.exists(test_data_path): + logging.error(f"Test data file not found: {test_data_path}") + if not os.path.exists(train_labels_path): + logging.error(f"Train labels file not found: {train_labels_path}") + if not os.path.exists(test_labels_path): + logging.error(f"Test labels file not found: {test_labels_path}") + if not os.path.exists(preprocessor_path): + logging.error(f"Preprocessor file not found: {preprocessor_path}") return train_data_path, test_data_path, train_labels_path, test_labels_path, preprocessor_path diff --git a/spamandphishingdetection/rare_category_remover.py b/src/spamandphishingdetection/rare_category_remover.py similarity index 100% rename from spamandphishingdetection/rare_category_remover.py rename to src/spamandphishingdetection/rare_category_remover.py diff --git a/spamandphishingdetection/setup.py b/src/spamandphishingdetection/setup.py similarity index 100% rename from spamandphishingdetection/setup.py rename to src/spamandphishingdetection/setup.py