smasoudrezvani/ArabicFraudDetection_LLM

ArabicFraudDetection

A machine learning project to detect fraudulent comments in ride-hailing services using Arabic text data. The project leverages pre-trained language models like AraBERT to classify user comments as fraudulent or non-fraudulent.


Overview

Detecting fraudulent activities in ride-hailing services is critical for ensuring reliability and customer trust. This project focuses on processing Arabic text data, fine-tuning a pre-trained language model, and deploying an effective system for fraud detection.

Key Features:

  • Custom Preprocessing: Automatically label comments based on fraud-related keywords.
  • Fine-Tuning Pre-trained Models: Fine-tune AraBERT for binary classification tasks.
  • Evaluation Metrics: Calculate metrics like accuracy, precision, recall, and F1-score.
  • Modular Code: Organized and reusable Python modules for data processing, training, and evaluation.
  • Docker Support: Easily run the project in a containerized environment.
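The keyword-based labeling mentioned in the first feature could look roughly like the sketch below. The keyword list, the `normalize` helper, and the `label_comment` function are hypothetical stand-ins; the project's actual keywords and logic live in src/preprocessing.py and are not shown in this README.

```python
import re

# Hypothetical fraud-related keywords ("fraud", "scam", "theft");
# the project's real list is not shown in this README.
FRAUD_KEYWORDS = ("احتيال", "نصب", "سرقة")

# Arabic diacritics (tashkeel) plus the tatweel elongation character.
_DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")


def normalize(text: str) -> str:
    """Strip diacritics/tatweel and map common alef variants to a bare alef."""
    text = _DIACRITICS.sub("", text)
    text = re.sub("[أإآ]", "ا", text)
    return text.strip()


def label_comment(comment: str) -> int:
    """Return 1 if any keyword occurs in the normalized comment, else 0."""
    cleaned = normalize(comment)
    return int(any(keyword in cleaned for keyword in FRAUD_KEYWORDS))
```

Plain substring matching is the simplest option and will also fire on longer words that happen to contain a keyword, so labels produced this way are best treated as noisy.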

Directory Structure

ArabicFraudDetection/
├── data/                    # Store data files or sample datasets
├── src/                     # Python scripts for preprocessing, training, evaluation, etc.
│   ├── preprocessing.py     # Data cleaning and preparation
│   ├── train.py             # Model fine-tuning script
│   ├── evaluate.py          # Evaluation and metrics calculation
│   └── utils.py             # Helper functions
├── models/                  # Save trained models and checkpoints
├── notebooks/               # Jupyter notebooks for exploratory work
├── tests/                   # Unit and integration tests
├── tmp_trainer/             # Temporary output from the Hugging Face Trainer
├── requirements.txt         # Python dependencies
├── Dockerfile               # Dockerfile to containerize the project
├── README.md                # Documentation
├── LICENSE                  # License file
├── .gitignore               # Ignore unnecessary files (e.g., data, logs)
└── setup.py                 # Package installation script

Getting Started

Prerequisites

Ensure you have the following installed:

  • Python 3.8+
  • Pip
  • Docker (if using containerized setup)
  • Virtual environment (optional but recommended)

Installation

Option 1: Run Locally

  1. Clone the repository:

    git clone https://github.com/smasoudrezvani/ArabicFraudDetection_LLM.git
    cd ArabicFraudDetection_LLM
  2. Install dependencies:

    pip install -r requirements.txt
  3. Prepare your dataset:

    • Place your raw dataset (e.g., df_rating&comment.xlsx) in the data/ folder.

Option 2: Run with Docker

  1. Build the Docker image:

    docker build -t arabic-fraud-detection .
  2. Run the Docker container:

    docker run --rm -it arabic-fraud-detection

Usage (Without Docker)

1. Preprocess Data

Run the preprocessing script to clean and label the data:

python3 ./src/preprocessing.py --input "data/df_rating&comment.xlsx" --output "data/processed"
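The training step reads data/processed/train.json and data/processed/test.json, so preprocessing presumably ends with a train/test split. A minimal standard-library sketch of that last step follows; the `split_and_save` name, the record schema (`text`, `label`), the 80/20 ratio, and the JSON-array file layout are all assumptions rather than the project's documented behavior.

```python
import json
import random
from pathlib import Path


def split_and_save(records, output_dir, test_ratio=0.2, seed=42):
    """Shuffle records, split them, and write train.json / test.json.

    `records` is a list of dicts such as {"text": ..., "label": 0 or 1};
    this schema and the default ratio are assumptions, not the project's spec.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_test = max(1, int(len(shuffled) * test_ratio))
    splits = {"test": shuffled[:n_test], "train": shuffled[n_test:]}

    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    for name, rows in splits.items():
        # ensure_ascii=False keeps the Arabic text human-readable on disk.
        with open(out / f"{name}.json", "w", encoding="utf-8") as fh:
            json.dump(rows, fh, ensure_ascii=False)
    return len(splits["train"]), len(splits["test"])
```

Because the labels come from keywords and fraud is likely the minority class, a stratified split may be a better choice than the plain shuffle shown here.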

2. Train the Model

Fine-tune the pre-trained language model:

python3 ./src/train.py --model aubmindlab/bert-base-arabertv2 --train data/processed/train.json --test data/processed/test.json --output models/fraud_detector
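For orientation, a script accepting these flags might be structured as in the sketch below, using the Hugging Face `transformers` and `datasets` libraries. This is not the contents of src/train.py: the `parse_args` and `fine_tune` names, the `text`/`label` record fields, and the hyperparameter values are assumptions chosen for illustration.

```python
import argparse


def parse_args(argv=None):
    """Parse the same four flags used in the command above."""
    parser = argparse.ArgumentParser(description="Fine-tune an Arabic BERT classifier")
    parser.add_argument("--model", default="aubmindlab/bert-base-arabertv2")
    parser.add_argument("--train", required=True)
    parser.add_argument("--test", required=True)
    parser.add_argument("--output", required=True)
    return parser.parse_args(argv)


def fine_tune(args):
    # Imported lazily so argument parsing works without the heavy dependencies.
    from datasets import load_dataset
    from transformers import (
        AutoModelForSequenceClassification,
        AutoTokenizer,
        Trainer,
        TrainingArguments,
    )

    # Assumes each JSON record has "text" and "label" fields.
    data = load_dataset("json", data_files={"train": args.train, "test": args.test})
    tokenizer = AutoTokenizer.from_pretrained(args.model)

    def tokenize(batch):
        # Fixed-length padding keeps batching simple; 128 is an illustrative cap.
        return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

    data = data.map(tokenize, batched=True)

    # Two output labels: fraudulent vs. non-fraudulent.
    model = AutoModelForSequenceClassification.from_pretrained(args.model, num_labels=2)
    training_args = TrainingArguments(
        output_dir=args.output,
        num_train_epochs=3,               # illustrative, not tuned
        per_device_train_batch_size=16,   # illustrative, not tuned
    )
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=data["train"],
        eval_dataset=data["test"],
    )
    trainer.train()
    trainer.save_model(args.output)
    tokenizer.save_pretrained(args.output)
```

Calling `fine_tune(parse_args())` from a `__main__` guard would run the training. Saving the tokenizer next to the model lets a later evaluation step load both from the same directory.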

3. Evaluate the Model

Evaluate the model's performance on the test set:

python3 ./src/evaluate.py --model ./models/fraud_detector --test ./data/processed/test.json

Usage (With Docker)

1. Run Preprocessing with Docker

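Override the image's default command and mount the data/ folder so the processed files are written back to the host. Keep the input path quoted; an unquoted & would make the shell run the command in the background:

docker run -v "$(pwd)/data:/app/data" arabic-fraud-detection python src/preprocessing.py --input "data/df_rating&comment.xlsx" --output data/processed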

2. Train the Model with Docker

docker run -v "$(pwd)/data:/app/data" -v "$(pwd)/models:/app/models" arabic-fraud-detection python src/train.py --model aubmindlab/bert-base-arabertv2 --train data/processed/train.json --test data/processed/test.json --output models/fraud_detector

Results

Metric       Value
Accuracy     0.92
Precision    0.90
Recall       0.89
F1 Score     0.89
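For reference, the four metrics can be derived from binary predictions as sketched below. The function names are hypothetical and this is not the code in src/evaluate.py. The `f1_from` helper also gives a quick consistency check on the table: a precision of 0.90 and a recall of 0.89 yield an F1 of about 0.895, in line with the reported 0.89.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for a binary task."""
    pairs = list(zip(y_true, y_pred))
    if not pairs:
        raise ValueError("at least one prediction is required")

    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    correct = sum(t == p for t, p in pairs)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1_from(precision, recall),
    }


def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0
```

On imbalanced fraud data, precision and recall for the fraudulent class are usually more informative than accuracy alone.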

Advanced Features

  • Hyperparameter Tuning: Use the Trainer API's hyperparameter search functionality to optimize the model.
  • Data Augmentation: Extend the dataset using techniques like back-translation or synonym replacement.
  • Deployment: Deploy the model as a REST API using FastAPI or a user-friendly interface with Streamlit.
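As a rough illustration of the hyperparameter tuning item, the sketch below defines a search space for `Trainer.hyperparameter_search` with the Optuna backend. The `hp_space` and `run_search` names and all ranges are assumptions for illustration; Optuna must be installed, and the `Trainer` has to be constructed with `model_init=` (a function returning a fresh model) rather than `model=`, along with an evaluation dataset.

```python
def hp_space(trial):
    """Search space for the Optuna backend; the ranges are illustrative only."""
    return {
        "learning_rate": trial.suggest_float("learning_rate", 1e-5, 5e-5, log=True),
        "num_train_epochs": trial.suggest_int("num_train_epochs", 2, 4),
        "per_device_train_batch_size": trial.suggest_categorical(
            "per_device_train_batch_size", [8, 16, 32]
        ),
    }


def run_search(trainer, n_trials=10):
    """Run the search on a Trainer that was built with `model_init=`.

    direction="minimize" assumes the objective is the evaluation loss,
    which is the case when the Trainer has no compute_metrics function.
    """
    return trainer.hyperparameter_search(
        hp_space=hp_space,
        backend="optuna",
        direction="minimize",
        n_trials=n_trials,
    )
```

The returned object describes the best run, including its hyperparameters, which can then be fed into a final training run.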

Contributing

Contributions are welcome! To contribute:

  1. Fork the repository.
  2. Create a feature branch: git checkout -b feature-name.
  3. Commit changes: git commit -m 'Add feature-name'.
  4. Push to the branch: git push origin feature-name.
  5. Open a pull request.

License

This project is licensed under the Apache License. See the LICENSE file for details.


Acknowledgments

Special thanks to Hugging Face for providing the tools and pre-trained models that make this project possible.

