Fraud Detection Package with fine-tuned ML and DL models, equipped with ethical considerations such as Differential Privacy and Homomorphic Encryption
Members |
---|
Advaith Shyamsunder Rao |
Falgun Malhotra |
Hsiao-Chun Hung |
Vanshita Gupta |
All helper functions and run steps can be found in the wiki pages.
Helper | Page |
---|---|
Setup Environment and Integrations | Wiki |
W&B Model tracking and Logging for our experiments | Wiki |
How to: Standalone - Processing and Labeler | Wiki |
How to: Model Training Pipeline | Wiki |
In today's data-driven landscape, detecting fraud emails within corporate communications is critical. With email still the most widely used mode of communication in organizations, hackers have over time found creative ways to bypass several security layers. In 2022 alone, email-based scams led to losses of over $2.7 billion.
Over the last few years, Transformer-based models have enabled remarkable advancements in Natural Language Understanding, making them a great choice for tasks such as text classification and generation. However, with deeper neural network-based architectures and models pre-trained on huge amounts of data, privacy concerns loom larger, making it imperative to ensure data protection while maintaining the integrity of the analysis.
The goal of the project is to explore Ethics-Driven Machine Learning: building a fraud detector on top of a pre-trained RoBERTa model, with ethical safeguards applied through techniques such as Differential Privacy, Secure Multi-Party Computation, Federated Learning, and Homomorphic Encryption.
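As a rough illustration of the modeling starting point, below is a minimal sketch of loading a pre-trained RoBERTa binary classifier with Hugging Face Transformers. The `roberta-base` checkpoint and the example email are illustrative assumptions, not project artifacts:

```python
import torch
from transformers import RobertaForSequenceClassification, RobertaTokenizer

# Load a pre-trained RoBERTa checkpoint with a 2-class classification head
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaForSequenceClassification.from_pretrained("roberta-base", num_labels=2)
model.eval()

# Hypothetical fraud email; logits[:, 1] is the fraud score before softmax
inputs = tokenizer("You have won a lottery. Share your bank details to claim the prize.",
                   truncation=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
fraud_prob = torch.softmax(logits, dim=-1)[0, 1].item()
print(f"P(fraud) = {fraud_prob:.3f}")  # classification head is untrained, so near chance
```

Fine-tuning this head on the labeled corpus described below is what turns the pre-trained encoder into the actual fraud detector.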
The project makes use of a rich source of public email communication, the Enron email dataset (https://www.cs.cmu.edu/~enron/). In 2000, Enron was one of the largest companies in the United States. By 2002, it had collapsed into bankruptcy due to widespread corporate fraud. The data has since been made public and presents a diverse set of emails, ranging from internal and marketing emails to spam and fraud attempts.
In the early 2000s, Leslie Kaelbling at MIT purchased the dataset and noted that, though it contained scam emails, it also had a number of integrity problems. The dataset was later updated, but it remains key to ensure privacy in the data when it is used to train a deep neural network model.
Though the Enron Email Dataset contains over 500K emails, one of its problems is the scarcity of labeled fraud examples. Label annotation is done with the goal of accurately detecting an umbrella of fraud emails. Since fraud emails fall into several types, such as Phishing, Financial, Romance, Subscription, and Nigerian Prince scams, multiple heuristics must be used to effectively label all types of fraudulent emails.
To tackle this problem, we use heuristics based on email signals to label the Enron data corpus, and also perform automated labeling using simple ML models trained on other, smaller email datasets available online. These fraud annotation techniques are discussed in detail in section 4 below.
To perform fraud annotation on the Enron dataset, as well as to provide more fraud examples for modeling, the project uses two more fraud data sources:
- Phishing Email Dataset: https://www.kaggle.com/dsv/6090437
- Social Engineering Dataset: http://aclweb.org/aclwiki
To perform high-quality testing, the project uses two gold label sets:
- Gold Fraud Set: contains 1,000 curated fraud emails from the phishing and social engineering datasets. On this set, we test recall: how many fraud emails does our model miss?
- Sanity Set: contains 250,000 curated internal emails between employees at Enron. On this set, we test precision: does our model flag any non-fraud email that it is not supposed to flag as fraud?
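A minimal sketch of how these two checks reduce to simple statistics is shown below; the prediction arrays are random placeholders standing in for real classifier outputs (1 = fraud):

```python
import numpy as np

rng = np.random.default_rng(0)
# Placeholder predictions (1 = fraud); in practice these come from the classifier
preds_gold = rng.integers(0, 2, size=1_000)      # Gold Fraud Set: every email is truly fraud
preds_sanity = rng.integers(0, 2, size=250_000)  # Sanity Set: every email is truly non-fraud

recall = preds_gold.mean()                 # fraction of known frauds the model catches
false_positive_rate = preds_sanity.mean()  # fraction of clean emails wrongly flagged
print(f"gold-set recall: {recall:.3f}, sanity-set FP rate: {false_positive_rate:.3f}")
```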
Below is a short data summary of the label distribution across the different sources (rows) and labels (columns).
Source | Fraud | Non-Fraud |
---|---|---|
Enron Dataset | 2327 | 445090 |
Phishing Dataset | 4976 | 12515 |
Social Engineering Dataset | 4160 | 6475 |
To tackle data imbalance, the project also performs data augmentation, creating 9 synthetic emails for every fraud email (a minimal sketch follows the list below). The augmentation process makes use of techniques such as:
- Synonym Replacement
- Stopword Removal
- Swapping Noun Phrases
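Below is a hedged sketch of the first two techniques, using NLTK's WordNet for synonym lookup. The function names, the choice of NLTK, and the alternation scheme are assumptions; swapping noun phrases would additionally need a chunker and is omitted:

```python
import random
import nltk
from nltk.corpus import stopwords, wordnet

nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

STOPWORDS = set(stopwords.words("english"))

def synonym_replacement(text: str, n: int = 3) -> str:
    """Replace up to n non-stopword tokens with a random WordNet synonym."""
    words = text.split()
    candidates = [i for i, w in enumerate(words) if w.lower() not in STOPWORDS]
    random.shuffle(candidates)
    replaced = 0
    for i in candidates:
        lemmas = {l.name().replace("_", " ")
                  for s in wordnet.synsets(words[i]) for l in s.lemmas()}
        lemmas.discard(words[i])
        if lemmas:
            words[i] = random.choice(sorted(lemmas))
            replaced += 1
        if replaced >= n:
            break
    return " ".join(words)

def stopword_removal(text: str) -> str:
    """Drop stopwords to produce a terser variant of the email."""
    return " ".join(w for w in text.split() if w.lower() not in STOPWORDS)

# Produce 9 synthetic variants per fraud email, matching the 9:1 ratio above
def augment(email: str, k: int = 9) -> list:
    return [stopword_removal(synonym_replacement(email)) if i % 2
            else synonym_replacement(email) for i in range(k)]
```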
Set | Emails |
---|---|
Train | 224543 |
Sanity | 250000 |
Gold Fraud | 1000 |
These are the training label splits before augmentation:
Label | Emails |
---|---|
0 | 214080 |
1 | 10463 |
The following heuristics are used to annotate labels for the Enron email data using the other two data sources:
- Phishing Model Annotation: we annotate emails from the Enron dataset using a high-precision model trained on the Phishing email dataset.
- Social Engineering Model Annotation: we annotate emails from the Enron dataset using a high-precision model trained on the Social Engineering email dataset.
The two ML annotator models embed the input text with Term Frequency-Inverse Document Frequency (TF-IDF) and use SVMs with a Gaussian (RBF) kernel.
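A minimal scikit-learn sketch of such an annotator is shown below. The toy training texts, n-gram range, and the margin cutoff for keeping only confident positives are illustrative assumptions, not the project's tuned values:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy stand-in training data; in the project these come from the Phishing dataset
phishing_texts = ["verify your account now", "meeting moved to 3pm",
                  "you won a lottery prize", "attached is the quarterly report"]
phishing_labels = [1, 0, 1, 0]

# TF-IDF embedding followed by an SVM with a Gaussian (RBF) kernel
annotator = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True),
    SVC(kernel="rbf", C=1.0),
)
annotator.fit(phishing_texts, phishing_labels)

# Annotate unseen Enron emails, keeping only confident positives for precision
enron_texts = ["please confirm your password to keep your mailbox active"]
scores = annotator.decision_function(enron_texts)
enron_fraud_labels = (scores > 0.5).astype(int)  # assumed margin cutoff
```

Thresholding well above the decision boundary is what makes the annotator high-precision: uncertain emails are simply left unlabeled rather than risked as false positives.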
Email-signal-based heuristics are used to specifically filter and target suspicious emails for fraud labeling (a sketch combining these signals follows the table below). The signals used are:

- Person of Interest: there is a publicly available list of email addresses of employees who were held liable in the Enron fraud case. These user mailboxes have a higher chance of containing quality fraud emails.
- Suspicious Folders: the Enron data is dumped into several folders for every employee, consisting of inbox, deleted_items, junk, calendar, etc. We define a set of folders that have a higher chance of containing fraud emails, such as Deleted Items and Junk.
- Sender Type: the sender is categorized as `Internal` or `External` based on their email address.
- Low Communication: we define a threshold of `4` sent emails on the basis of the table below. A user qualifies as a low-comm sender if they have sent fewer emails than this threshold; emails from low-comm senders are assigned a high probability of being fraud.
- Contains Replies and Forwards: if an email contains forwards or replies, we assign it a low probability of being fraud.
The table below shows the distribution of the number of emails sent per sender, which motivates the low-communication threshold of 4.
Statistic | Value |
---|---|
count | 20131 |
mean | 12.3 |
std | 104.9 |
min | 1 |
25% | 1 |
50% | 1 |
75% | 4 |
max | 5486 |
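Below is a hedged sketch of how these signals could be combined into a single fraud prior; the field names and additive weights are illustrative assumptions, not the project's actual scoring:

```python
SUSPICIOUS_FOLDERS = {"deleted_items", "junk"}  # folders flagged above
LOW_COMM_THRESHOLD = 4                          # sent-mail threshold from the table above

def fraud_prior(email: dict, sent_counts: dict, poi_mailboxes: set) -> float:
    """Combine the email signals into a rough prior probability of fraud.

    `email` is assumed to carry `mailbox_owner`, `folder`, `sender`,
    `sender_type`, and `has_reply_or_forward`; weights are illustrative."""
    score = 0.1
    if email["mailbox_owner"] in poi_mailboxes:
        score += 0.2  # person-of-interest mailbox
    if email["folder"] in SUSPICIOUS_FOLDERS:
        score += 0.3  # suspicious folder
    if email["sender_type"] == "External":
        score += 0.1  # external senders are riskier
    if sent_counts.get(email["sender"], 0) < LOW_COMM_THRESHOLD:
        score += 0.3  # low-communication sender
    if email["has_reply_or_forward"]:
        score -= 0.3  # replies/forwards lower the fraud prior
    return min(max(score, 0.0), 1.0)
```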
To ensure high-quality labels, we manually inspect the mismatched examples from the ML annotation and relabel the Enron dataset accordingly.
We fine-tune ML and DL models for our fraud detection task. To ensure complete privacy in our dataset, we experiment with techniques such as Differential Privacy, Secure Multi-Party Computation (SMPC), Homomorphic Encryption, Federated Learning, and Data Masking. Through these experiments, we investigate how to ensure the right amount of privacy without losing out on the performance of our fraud classifier model.
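As one concrete example, below is a minimal DP-SGD training sketch using Opacus, a PyTorch differential-privacy library. Opacus is not among the tools listed below (the project names TensorFlow Privacy), so treat this as an illustrative alternative; the tiny linear head and random tensors stand in for RoBERTa features and real data:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from opacus import PrivacyEngine

# Toy stand-in: a linear head over 768-dim features (e.g., RoBERTa embeddings)
model = torch.nn.Linear(768, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = torch.nn.CrossEntropyLoss()

# Random placeholder data in place of the real email features and labels
features = torch.randn(256, 768)
labels = torch.randint(0, 2, (256,))
loader = DataLoader(TensorDataset(features, labels), batch_size=32)

# DP-SGD: clip per-example gradients, then add calibrated Gaussian noise
privacy_engine = PrivacyEngine()
model, optimizer, loader = privacy_engine.make_private(
    module=model,
    optimizer=optimizer,
    data_loader=loader,
    noise_multiplier=1.1,  # more noise -> stronger privacy, lower utility
    max_grad_norm=1.0,     # per-example gradient clipping bound
)

for x, y in loader:
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()

# The spent privacy budget (epsilon) quantifies the privacy/utility trade-off
print(f"epsilon = {privacy_engine.get_epsilon(delta=1e-5):.2f}")
```

The `noise_multiplier` and `max_grad_norm` knobs are exactly where the privacy/performance trade-off mentioned above is negotiated.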
- Enron Email Dataset
- Differential Privacy
- Secure Multi-Party Computation
- Homomorphic Encryption
- Federated Learning
- RoBERTa Transformer Model
- PySyft for Federated Learning
- Tensorflow Privacy
- ML Privacy Meter
- Microsoft CrypTFlow for SMPC
- Facebook Crypten for SMPC
- Phishing Dataset
- Social Engineering Dataset
Phishing Email Detection Dataset
- Title: Phishing Email Detection
- Author: Subhadeep Chakraborty
- Publisher: Kaggle
- Year: 2023
- URL: https://www.kaggle.com/dsv/6090437
- DOI: 10.34740/KAGGLE/DSV/6090437

CLAIR Fraud Email Collection
- Title: CLAIR collection of fraud email
- Author: Radev, D.
- Year: 2008
- URL: http://aclweb.org/aclwiki