Commit

Update README.md
advaithsrao authored Dec 28, 2023
1 parent c6479f8 commit acc79cb
Showing 1 changed file with 4 additions and 4 deletions.
8 changes: 4 additions & 4 deletions README.md
@@ -111,17 +111,17 @@ These are the training label splits before annotation

***

-## DATA ANNOTATION
+## Data Annotation

-### 1. AUTOMATED ML LABELING
+### 1. Automated ML Labeling

The following heuristics are used to annotate labels for the Enron email data using the other two data sources:
1. Phishing Model Annotation: We are annotating mails from the Enron dataset using a high-precision model trained on the Phishing mails dataset.
2. Social Engineering Model Annotation: We are annotating mails from the Enron dataset using a high-precision model trained on the Social Engineering mails dataset.
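
The README does not say how the annotator models are made "high-precision". One common approach, shown here purely as an assumption, is to score held-out labeled emails and keep the lowest decision threshold that still meets a target precision. The function name `pick_threshold`, the 0.95 target, and the toy scores are all hypothetical.

```python
def pick_threshold(scores, labels, target_precision=0.95):
    """Return the lowest score threshold whose precision meets the target.

    scores: model scores, higher means more likely positive (assumed).
    labels: 1 for a true positive-class email, 0 otherwise.
    Returns None if no threshold reaches the target precision.
    """
    best = None
    # Walk candidate thresholds from strictest to most permissive.
    for thr in sorted(set(scores), reverse=True):
        flagged = [y for s, y in zip(scores, labels) if s >= thr]
        precision = sum(flagged) / len(flagged)
        if precision >= target_precision:
            best = thr  # keep lowering while the target still holds
    return best


# Toy validation scores and ground-truth labels (illustrative only).
val_scores = [0.95, 0.90, 0.80, 0.60, 0.40, 0.20]
val_labels = [1, 1, 1, 0, 1, 0]
threshold = pick_threshold(val_scores, val_labels, target_precision=0.95)
```

Under this scheme, only Enron emails scoring at or above `threshold` would receive an automatic label, and the rest would stay unlabeled, trading recall for precision.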

The two ML annotator models embed the input text with Term Frequency-Inverse Document Frequency (TF-IDF) and classify it with SVM models using a Gaussian (RBF) kernel.
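
As a rough illustration of the two building blocks named above, the sketch below computes TF-IDF vectors and evaluates the Gaussian kernel that an SVM would use to compare two emails. It is not the project's implementation. The whitespace tokenizer, the smoothed IDF formula, `gamma=1.0`, and the sample emails are all assumptions made for the example, and a real pipeline would more likely rely on an ML library than on hand-written code.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one L2-normalized {term: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Term frequency times a smoothed inverse document frequency.
        vec = {
            t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1)
            for t, c in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def rbf_kernel(u, v, gamma=1.0):
    """Gaussian kernel exp(-gamma * ||u - v||^2) on sparse dict vectors."""
    terms = set(u) | set(v)
    sq_dist = sum((u.get(t, 0.0) - v.get(t, 0.0)) ** 2 for t in terms)
    return math.exp(-gamma * sq_dist)


emails = [
    "verify your account password now",
    "please verify your password immediately",
    "meeting agenda for the quarterly review",
]
vecs = tfidf_vectors(emails)
sim_related = rbf_kernel(vecs[0], vecs[1])    # share several terms
sim_unrelated = rbf_kernel(vecs[0], vecs[2])  # share no terms
```

Emails with overlapping vocabulary get a kernel value closer to 1, which is the similarity signal an RBF-kernel SVM builds its decision boundary from.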

-### 2. EMAIL SIGNALS
+### 2. Email Signals

Email-signal-based heuristics are used to filter and target suspicious emails for fraud labeling. The signals used are:
Person Of Interest: There is a publicly available list of email addresses of employees who were liable for the massive data leak at Enron. These user mailboxes can have a higher chance of containing quality fraud emails.
@@ -143,7 +143,7 @@ The below table represents the distribution of the length of email bodies in ter
| 75% | 4 |
| max | 5486 |

-### 3. MANUAL INSPECTION
+### 3. Manual Inspection

To ensure high-quality labels, we manually inspect the mismatched examples from ML annotation and relabel the Enron dataset.

