This repository leverages data science and machine learning to answer the question: which vulnerabilities are likely to be exploited in the wild?
Want more detail? Check out the Medium article!
Cybersecurity is an amazingly broad field, and its data is vast and scattered. The sources used here are:
- National Vulnerability Database (NVD)
- Exploit DB
- Symantec
- CISA
- Google searches here and there :)
This is an imbalanced classification problem, as illustrated in the Venn diagram below.
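To give a feel for the imbalance, here is an illustrative sketch (not the repository's code): the file path and the `exploited` label column are assumptions, and the resulting ratio is one common way to weight the minority class in XGBoost.

```python
import pandas as pd

# Hypothetical merged dataset; the "exploited" column (1 = exploited in the wild) is an assumption
df = pd.read_csv("data/cve_dataset.csv")

n_pos = (df["exploited"] == 1).sum()
n_neg = (df["exploited"] == 0).sum()
print(f"exploited: {n_pos}, not exploited: {n_neg}")

# Ratio of negatives to positives, usable as XGBoost's scale_pos_weight
scale_pos_weight = n_neg / n_pos
print(f"scale_pos_weight ~ {scale_pos_weight:.1f}")
```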
The Jupyter notebooks in the repository are sequentially numbered to indicate the order in which to proceed. The workflow is summarized in the figure below.
For the ML model I used XGBoost, and for processing the text data I used a TF-IDF vectorizer. Note that we need a pipeline since we are dealing with heterogeneous data; it also prevents data leakage during the cross-validated grid search. A sketch of this setup is shown below.
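A minimal sketch of such a pipeline, assuming the text lives in a `description` column and that `cvss_score` and `cwe_count` stand in for the numeric features (these names are illustrative, not necessarily the features used in the notebooks):

```python
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from xgboost import XGBClassifier

# TF-IDF on the CVE description text, scaling on the numeric columns
preprocess = ColumnTransformer(
    transformers=[
        ("text", TfidfVectorizer(max_features=5000, stop_words="english"), "description"),
        ("num", StandardScaler(), ["cvss_score", "cwe_count"]),
    ]
)

model = Pipeline(
    steps=[
        ("preprocess", preprocess),
        # scale_pos_weight from the imbalance check above could be passed here
        ("clf", XGBClassifier(eval_metric="logloss")),
    ]
)
```

Because the TF-IDF vectorizer sits inside the pipeline, it is refit only on the training folds during cross-validation, which is what keeps the held-out folds leak-free.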
We can achieve a decent ROC AUC of 0.87. However, the right operating point depends on what we are looking for, as different organizations have different goals and capabilities.
Precision answers the question: how many selected items are relevant? Better precision means fewer false positives. An efficient model therefore has high precision and allocates time and resources only to vulnerabilities that actually require patching.
Recall answers the question: how many relevant items are selected? A perfect recall of 1.0 implies there are no false negatives. In the context of this project, recall measures how many of the vulnerabilities that should be remediated were actually flagged.
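To make the two metrics concrete, here is a toy computation with made-up labels (purely illustrative, not results from this project):

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Toy labels: 1 = exploited in the wild, 0 = not exploited
y_true = [1, 0, 0, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 1, 0, 0, 0, 1, 1, 0]

print(precision_score(y_true, y_pred))  # TP / (TP + FP): how many flagged CVEs really needed patching
print(recall_score(y_true, y_pred))     # TP / (TP + FN): how many exploitable CVEs were actually flagged
print(f1_score(y_true, y_pred))         # harmonic mean of precision and recall
```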
We run a cross-validated grid search to optimize for a set of success metrics. The results are summarized in the table and plots below.
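A hedged sketch of what such a search could look like on top of the pipeline above; the parameter grid and the choice of F1 as the refit metric are assumptions, not necessarily the settings used in the notebooks:

```python
from sklearn.model_selection import GridSearchCV

param_grid = {
    "preprocess__text__max_features": [2000, 5000],
    "clf__max_depth": [3, 5, 7],
    "clf__n_estimators": [100, 300],
}

search = GridSearchCV(
    model,                                   # the Pipeline defined earlier
    param_grid,
    scoring=["roc_auc", "precision", "recall", "f1"],
    refit="f1",                              # select the best model by F1
    cv=5,
    n_jobs=-1,
)
# search.fit(X_train, y_train)
# print(search.best_params_, search.best_score_)
```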
Moreover, I conducted a bootstrapping experiment to see how reliable the results are: 95% of the time, precision falls in (0.149, 0.234), recall in (0.333, 0.550), and F1 in (0.180, 0.337).
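One way such intervals can be obtained is by resampling the held-out predictions with replacement and recomputing each metric on every resample; a sketch under that assumption (the `y_test`/`X_test` names in the usage comment are placeholders):

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

def bootstrap_ci(y_true, y_pred, metric, n_boot=1000, seed=0):
    """Percentile 95% confidence interval for a metric via bootstrap resampling."""
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))  # sample indices with replacement
        scores.append(metric(y_true[idx], y_pred[idx]))
    return np.percentile(scores, [2.5, 97.5])

# Usage, assuming a fitted model and a held-out test set:
# y_pred = search.predict(X_test)
# for name, m in [("precision", precision_score), ("recall", recall_score), ("f1", f1_score)]:
#     print(name, bootstrap_ci(y_test, y_pred, m))
```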
It's crucial to state that the findings here might be challenged in the future as new data becomes available. This project also omits any existing defense mechanisms organizations might have and only takes a bird's-eye view of things. To put it plainly: your mileage may vary, depending on how secure your defenses are and on the jackpot in your safe. Furthermore, this project does not account for the effect of time. Some CVEs might already have an existing patch, which may or may not have been applied. It is also possible that, with more sophisticated tools and greater computing power, previously innocuous CVEs could wreak havoc on an organization.

Still, with a rather simple workflow, critical vulnerabilities can be identified with relatively high performance. This means that as new CVEs are released on the NVD, an ML model can be used to evaluate whether they are likely to be exploited in the wild.