Updated notebook files and readme.md file
advaithsrao committed Dec 6, 2023
1 parent b537dbb commit 9706338
Showing 7 changed files with 879 additions and 36 deletions.
90 changes: 82 additions & 8 deletions README.md
@@ -1,13 +1,19 @@

# Fraud-Detector
Fraud Detection Package with fine-tuned RoBERTa model, equipped with ethical considerations such as Differential Privacy and SMPC
Fraud Detection Package with fine-tuned ML and DL models, equipped with ethical considerations such as Differential Privacy and Homomorphic Encryption

- [Fraud-Detector](#fraud-detector)
- [Team](#team)
- [CI/CD Pipeline Status](#cicd-pipeline-status)
- [Installation and Run Instructions](#installation-and-run-instructions)
- [Abstract](#abstract)
- [Dataset Description](#dataset-description)
- [Data Splits](#data-splits)
- [Training Label Split](#training-label-split)
- [DATA ANNOTATION](#data-annotation)
- [1. AUTOMATED ML LABELING](#1-automated-ml-labeling)
- [2. EMAIL SIGNALS](#2-email-signals)
- [3. MANUAL INSPECTION](#3-manual-inspection)
- [Proposed Methodology](#proposed-methodology)
- [References](#references)
- [Citations](#citations)
@@ -45,37 +51,105 @@ Fraud Detection Package with fine-tuned RoBERTa model, equipped with ethical con
***

## Abstract
In today's data-driven landscape, the detection of fraud emails within corporate communications is critical. With email communication still being the most used mode of communication in organizations, hackers over time have found creative ways to bypass several security layers. In 2022 alone, email-based scams have led to losses of over $2.7 billion.
In today's data-driven landscape, the detection of fraud emails within corporate communications is critical. With email communication still being the most used mode of communication in organizations, hackers over time have found creative ways to bypass several security layers. In 2022 alone, email-based scams have led to losses of over $2.7 billion.

Over the last few years, transformer-based models have enabled remarkable advancements in natural language understanding, making them a great choice for tasks such as text classification and generation. However, with deeper neural network-based architectures and models pre-trained on huge amounts of data, privacy concerns loom larger, making it imperative to ensure data protection while maintaining the integrity of the analysis.
Over the last few years, Transformer-based models have enabled remarkable advancements in Natural Language Understanding, making them a great choice for tasks such as text classification and generation. However, with deeper neural network-based architectures and models pre-trained on huge amounts of data, privacy concerns loom larger, making it imperative to ensure data protection while maintaining the integrity of the analysis.

The goal of the project is to explore Ethics-Driven Machine Learning by building a Fraud Detector model on top of a pre-trained RoBERTa model and adding ethical safeguards through techniques such as Differential Privacy, Secure Multi-Party Computation, Federated Learning, and Homomorphic Encryption.

***

## Dataset Description
We make use of a rich source of public email communication, the Enron email dataset. In 2000, Enron was one of the largest companies in the United States. By 2002, it had collapsed into bankruptcy due to widespread corporate fraud. The data has been made public and presents a diverse set of email information ranging from internal, marketing emails to spam and fraud attempts.
The project makes use of a rich source of public email communication, the Enron email dataset (https://www.cs.cmu.edu/~enron/). In 2000, Enron was one of the largest companies in the United States. By 2002, it had collapsed into bankruptcy due to widespread corporate fraud. The data has been made public and presents a diverse set of email information, ranging from internal and marketing emails to spam and fraud attempts.

In the early 2000s, Leslie Kaelbling at MIT purchased the dataset and noted that, though the dataset contained scam emails, it also had a number of integrity problems. The dataset was updated later, but it becomes key to ensure privacy in the data while it is used to train a deep neural network model.

Though the Enron Email Dataset contains over 500K emails, one of its problems is the scarcity of labeled fraud emails. Label annotation is done with the goal of accurately detecting a broad umbrella of fraud emails. Since fraud emails fall into several types, such as Phishing, Financial, Romance, Subscription, and Nigerian Prince scams, multiple heuristics are needed to label all types of fraudulent emails effectively.

To tackle this problem, we label the Enron data corpus with heuristics based on email signals, and we also perform automated labeling using simple ML models trained on other, smaller email datasets available online. These fraud annotation techniques are discussed in detail in the Data Annotation section below.

To perform fraud annotation on the Enron dataset and to provide more fraud examples for modeling, the project uses two more fraud data sources:
1. Phishing Email Dataset: https://www.kaggle.com/dsv/6090437
2. Social Engineering Dataset: http://aclweb.org/aclwiki

To perform high-quality testing, the project uses two gold label sets:
1. Gold Fraud Set: Contains `1000` curated fraud emails from the phishing and social engineering datasets. On this set, we test recall: how many fraud emails the model misses.
2. Sanity Set: Contains `250000` curated internal emails exchanged between employees at Enron. On this set, we test precision: whether the model flags any non-fraud email as fraud.
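
The two checks above can be sketched as follows. This is a minimal illustration, not the project's evaluation code: `toy_predict` and the sample emails are hypothetical stand-ins, and the classifier is assumed to return `1` for fraud and `0` for non-fraud. Because the sanity set contains only non-fraud emails, the sketch reports the false-positive rate on it.

```python
from typing import Callable, Iterable


def recall_on_gold_fraud(predict: Callable[[str], int], fraud_emails: Iterable[str]) -> float:
    """Fraction of known-fraud emails that the model flags as fraud (label 1)."""
    emails = list(fraud_emails)
    if not emails:
        raise ValueError("gold fraud set is empty")
    caught = sum(1 for email in emails if predict(email) == 1)
    return caught / len(emails)


def false_positive_rate_on_sanity(predict: Callable[[str], int], clean_emails: Iterable[str]) -> float:
    """Fraction of known non-fraud emails that the model wrongly flags as fraud."""
    emails = list(clean_emails)
    if not emails:
        raise ValueError("sanity set is empty")
    flagged = sum(1 for email in emails if predict(email) == 1)
    return flagged / len(emails)


def toy_predict(email: str) -> int:
    """Hypothetical stand-in classifier: flags emails that mention a wire transfer."""
    return 1 if "wire transfer" in email.lower() else 0


# Tiny made-up samples standing in for the two gold sets.
gold_fraud = ["Urgent: wire transfer needed today", "You won a prize, click here"]
sanity = ["Meeting moved to 3pm", "Please review the attached report"]

gold_recall = recall_on_gold_fraud(toy_predict, gold_fraud)
sanity_fpr = false_positive_rate_on_sanity(toy_predict, sanity)
```

A low recall on the gold fraud set means missed fraud, while a non-zero rate on the sanity set means ordinary internal mail is being flagged.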

In the early 2000s, Leslie Kaelbling at MIT purchased the dataset and noted that, though the dataset contained scam emails, it also had a number of integrity problems. The dataset was updated later, but it becomes key to ensure privacy in the data while we use it to train a deep neural network model.
Below is a short summary of the label distribution across the different data sources (rows) and labels (columns).

**Data Splits:**
| | Fraud | Non-Fraud |
| -- | -- | -- |
| Enron Dataset | 2327 | 445090 |
| Phishing Dataset | 4976 | 12515 |
| Social Engineering Dataset | 4160 | 6475 |


To tackle data imbalance, the project also performs data augmentation, creating 9 synthetic emails for every 1 fraud email. The augmentation process makes use of techniques such as:
1. Synonym Replacement
2. Stopword Removal
3. Swapping Noun Phrases
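
A minimal sketch of the first two techniques is shown below, assuming whitespace tokenization. The `SYNONYMS` and `STOPWORDS` tables and the `augment` helper are hypothetical and exist only for illustration; noun-phrase swapping is left out because it needs a syntactic parser.

```python
import random

# Hypothetical lookup tables, for illustration only.
SYNONYMS = {
    "urgent": ["immediate", "pressing"],
    "money": ["funds", "cash"],
    "send": ["transfer", "wire"],
}
STOPWORDS = {"the", "a", "an", "to", "of", "is", "please"}


def synonym_replacement(tokens, rng):
    """Replace every token that has a known synonym with a randomly chosen one."""
    return [rng.choice(SYNONYMS[t.lower()]) if t.lower() in SYNONYMS else t for t in tokens]


def stopword_removal(tokens, rng, drop_prob=0.5):
    """Drop each stopword with probability drop_prob; never return an empty email."""
    kept = [t for t in tokens if t.lower() not in STOPWORDS or rng.random() > drop_prob]
    return kept or tokens


def augment(email, n_copies=9, seed=0):
    """Create n_copies synthetic variants of one fraud email."""
    rng = random.Random(seed)
    tokens = email.split()
    operations = [synonym_replacement, stopword_removal]
    return [" ".join(rng.choice(operations)(tokens, rng)) for _ in range(n_copies)]


variants = augment("Please send the money urgent")
```

With `n_copies=9`, each fraud email yields nine synthetic variants, matching the 9-to-1 ratio described above. Variants produced this way can repeat, so a real pipeline would likely deduplicate them.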

### Data Splits

| Set | Emails |
| --- | --- |
| Train | 224543 |
| Sanity | 250000 |
| Gold Fraud | 1000 |

**Training Label Split:**
### Training Label Split

These are the training label splits before annotation:

| Label | Emails |
| --- | --- |
| 0 | 214080 |
| 1 | 10463 |


***

## DATA ANNOTATION

### 1. AUTOMATED ML LABELING

The following heuristics are used to annotate labels for the Enron email data using the other two data sources:
1. Phishing Model Annotation: We are annotating mails from the Enron dataset using a high-precision model trained on the Phishing mails dataset.
2. Social Engineering Model Annotation: We are annotating mails from the Enron dataset using a high-precision model trained on the Social Engineering mails dataset.

The two ML Annotator models embed the input text with Term Frequency-Inverse Document Frequency (TF-IDF) and classify it with SVM models that use a Gaussian kernel.
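
To make the two building blocks concrete, the sketch below computes TF-IDF vectors and a Gaussian (RBF) kernel value in plain Python. It is an illustration under assumptions, not the annotators' code: whitespace tokenization, a common smoothed IDF variant, L2 normalization, and the `gamma` value are all choices made here, and a real SVM would be trained with a library on top of this kernel.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, L2-normalized TF-IDF vectors) for a list of documents."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({token for tokens in tokenized for token in tokens})
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(token for tokens in tokenized for token in set(tokens))
    # Smoothed IDF (an assumed variant): rare terms get larger weights.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vec = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        vectors.append([v / norm for v in vec])
    return vocab, vectors


def gaussian_kernel(x, y, gamma=1.0):
    """Gaussian (RBF) kernel: exp(-gamma * ||x - y||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


# Made-up example emails.
docs = ["verify your account now", "quarterly budget meeting", "verify your password now"]
vocab, vectors = tfidf_vectors(docs)
similar = gaussian_kernel(vectors[0], vectors[2])
dissimilar = gaussian_kernel(vectors[0], vectors[1])
```

Emails that share weighted terms end up close in TF-IDF space and get a kernel value near 1, while emails with no terms in common get a smaller value.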

### 2. EMAIL SIGNALS

Email Signal based heuristics are used to specifically filter and target suspicious emails for fraud labeling. The signals used are:
1. Person Of Interest: There is a publicly available list of email addresses of employees who were liable for the massive data leak at Enron. These user mailboxes can have a higher chance of containing quality fraud emails.
2. Suspicious Folders: The Enron data is dumped into several folders for every employee. Folders consist of inbox, deleted_items, junk, calendar, etc. We define a set of folders that have a higher chance of containing fraud emails, such as Deleted Items and Junk.
3. Sender Type: The sender type is categorized as `Internal` or `External` based on the sender's email address.
4. Low Communication: We defined a threshold of `4` emails on the basis of the table below. A user qualifies as a low-comm sender if they have sent fewer emails than this threshold. Mails sent from low-comm senders are assigned a high probability of being fraud.
5. Contains Replies and Forwards: If an email contains forwards or replies, we assign it a low probability of being fraud.
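
The signals above can be sketched as simple per-email flags. Everything below is a hypothetical illustration: the `Email` fields, the `enron.com` domain test for internal senders, and the subject-prefix check for replies and forwards are assumptions, not the project's implementation. Only the threshold of `4` and the Deleted Items and Junk folders come from the text.

```python
from dataclasses import dataclass

SUSPICIOUS_FOLDERS = {"deleted_items", "junk"}  # folders named in the text
LOW_COMM_THRESHOLD = 4                          # threshold stated in the text
INTERNAL_DOMAIN = "enron.com"                   # assumption: internal senders use this domain


@dataclass
class Email:
    sender: str          # sender email address
    folder: str          # mailbox folder the email was found in
    subject: str
    mailbox_owner: str   # address of the employee whose mailbox holds the email


def sender_type(address: str) -> str:
    """Classify a sender as Internal or External from the address domain."""
    return "Internal" if address.lower().endswith("@" + INTERNAL_DOMAIN) else "External"


def is_low_comm(sender: str, sent_counts: dict) -> bool:
    """A sender is low-comm if they sent fewer emails than the threshold."""
    return sent_counts.get(sender, 0) < LOW_COMM_THRESHOLD


def is_reply_or_forward(subject: str) -> bool:
    """Assumed check: reply/forward detected from common subject prefixes."""
    return subject.strip().lower().startswith(("re:", "fw:", "fwd:"))


def email_signals(email: Email, sent_counts: dict, persons_of_interest: set) -> dict:
    """Collect all signal flags for one email."""
    folder = email.folder.lower().replace(" ", "_")
    return {
        "person_of_interest": email.mailbox_owner in persons_of_interest,
        "suspicious_folder": folder in SUSPICIOUS_FOLDERS,
        "sender_type": sender_type(email.sender),
        "low_comm": is_low_comm(email.sender, sent_counts),
        "reply_or_forward": is_reply_or_forward(email.subject),
    }


# Made-up example data.
example = Email(sender="promo@example.com", folder="Junk",
                subject="You won!", mailbox_owner="owner@enron.com")
flags = email_signals(example, {"promo@example.com": 1}, {"owner@enron.com"})
```

How the flags are combined into a final label or probability is not specified in the text, so the sketch stops at computing them.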

The table below shows the distribution of the length of email bodies in terms of words.

| Statistic | Value |
| -- | -- |
| count | 20131 |
| mean | 12.3 |
| std | 104.9 |
| min | 1 |
| 25% | 1 |
| 50% | 1 |
| 75% | 4 |
| max | 5486 |

### 3. MANUAL INSPECTION

To ensure high-quality labels, we manually inspect the examples where the ML annotations disagree and relabel the Enron dataset accordingly.

***

## Proposed Methodology
We fine-tune a pre-trained RoBERTa model for our fraud detection task. To ensure complete privacy in our dataset, we will experiment and explore techniques such as Differential Privacy, Secure Multi-Party Computation (SMPC), Homomorphic Encryption, Federated Learning, and Data Masking. Through our experiments, we will attempt to investigate these techniques and find the optimal way to ensure the right amount of privacy, without losing out on the performance of our fraud classifier model.
We fine-tune ML and DL models for our fraud detection task. To ensure complete privacy in our dataset, we will experiment and explore techniques such as Differential Privacy, Secure Multi-Party Computation (SMPC), Homomorphic Encryption, Federated Learning, and Data Masking. Through our experiments, we will attempt to investigate these techniques and find the optimal way to ensure the right amount of privacy, without losing out on the performance of our fraud classifier model.

***

2 changes: 1 addition & 1 deletion ethics/differential_privacy.py
@@ -485,7 +485,7 @@ def train(
TARGET_EPSILON = 7.5
TARGET_DELTA = 1/total_steps

self.model, optimizer, _ = self.privacy_engine.make_private_with_epsilon(
self.model, optimizer, train_dataloader = self.privacy_engine.make_private_with_epsilon(
module=self.model,
optimizer=optimizer,
data_loader=train_dataloader,
91 changes: 91 additions & 0 deletions notebooks/bert_differential_privacy.ipynb
@@ -621,6 +621,97 @@
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Epoch: 1 | Step: 5000 | Train loss: 0.705 | Eval loss: 0.651 | Eval accuracy: 0.908 | ɛ: 12.03\n"
]
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "fe800fa4170e46d29adcf7aa711d7787",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
" 0%| | 0/5908 [00:00<?, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Epoch: 2 | Step: 5000 | Train loss: 0.552 | Eval loss: 0.543 | Eval accuracy: 0.925 | ɛ: 16.06\n"
]
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "891ac8f2cd4a473eb104cef89a048e2e",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
" 0%| | 0/5908 [00:00<?, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Epoch: 3 | Step: 5000 | Train loss: 0.536 | Eval loss: 0.524 | Eval accuracy: 0.930 | ɛ: 18.98\n"
]
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "db775f4a89d74bd897b91963448049bc",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
" 0%| | 0/5908 [00:00<?, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Epoch: 4 | Step: 5000 | Train loss: 0.524 | Eval loss: 0.598 | Eval accuracy: 0.921 | ɛ: 21.45\n"
]
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "abb2e52af35449fb80b1ccf523e9ca48",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
" 0%| | 0/5908 [00:00<?, ?it/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Epoch: 5 | Step: 5000 | Train loss: 0.536 | Eval loss: 0.534 | Eval accuracy: 0.929 | ɛ: 23.66\n"
]
}
],
"source": [