Version 1
Received: 18 October 2024 / Approved: 18 October 2024 / Online: 18 October 2024 (11:18:03 CEST)
How to cite:
Meléndez, R.; Ptaszynski, M.; Fumito, M. Comparative Investigation of Traditional Machine Learning Models and Transformer Models for Phishing Email Detection. Preprints 2024, 2024101467. https://doi.org/10.20944/preprints202410.1467.v1
APA Style
Meléndez, R., Ptaszynski, M., & Fumito, M. (2024). Comparative Investigation of Traditional Machine Learning Models and Transformer Models for Phishing Email Detection. Preprints. https://doi.org/10.20944/preprints202410.1467.v1
Chicago/Turabian Style
Meléndez, R., Michal Ptaszynski and Masui Fumito. 2024. "Comparative Investigation of Traditional Machine Learning Models and Transformer Models for Phishing Email Detection." Preprints. https://doi.org/10.20944/preprints202410.1467.v1
Abstract
Phishing emails pose a significant threat to cybersecurity worldwide. Tools already exist that mitigate the impact of these emails by filtering them, but they are only as reliable as their ability to detect new formats and techniques for creating phishing emails. In this paper, we investigate how traditional machine learning models and transformer models perform on the task of classifying whether an email is phishing or not. We found that transformer models, in particular DistilBERT, BERT, and RoBERTa, performed significantly better than traditional models such as Logistic Regression, Random Forest, Support Vector Machine (SVM), and Naive Bayes.
The process consisted of using a large and robust dataset of emails and applying preprocessing and optimization techniques to obtain the best possible results. RoBERTa showed an outstanding capacity to identify phishing emails, achieving the highest accuracy, 0.9943. Although they were also successful, traditional models performed marginally worse; SVM performed best among them, with an accuracy of 0.9854. The results emphasize the value of sophisticated text processing methods and the potential of transformer models to improve email security by thwarting phishing attempts.
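As a rough illustration of the traditional side of this comparison, the sketch below implements a small multinomial Naive Bayes text classifier in plain Python and scores it by accuracy. It is not the authors' pipeline: the class name `NaiveBayesTextClassifier`, the `tokenize` and `accuracy` helpers, the add-one smoothing, and the four toy emails are all assumptions made for this example, and the toy data says nothing about the reported results.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayesTextClassifier:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing (illustrative)."""

    def fit(self, texts, labels):
        # Count documents per class and word occurrences per class.
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            counts = self.word_counts[label]
            total_words = sum(counts.values())
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(doc_count / total_docs)
            for word in tokenize(text):
                if word not in self.vocab:
                    continue  # skip words never seen during training
                score += math.log((counts[word] + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def accuracy(model, texts, labels):
    """Fraction of texts whose predicted label matches the true label."""
    correct = sum(model.predict(t) == y for t, y in zip(texts, labels))
    return correct / len(labels)


# Hypothetical toy emails, invented purely for illustration.
train_texts = [
    "Verify your account now to avoid suspension",
    "Urgent: confirm your password at this link",
    "Meeting notes attached for tomorrow's review",
    "Lunch on Friday with the project team",
]
train_labels = ["phishing", "phishing", "legitimate", "legitimate"]

model = NaiveBayesTextClassifier().fit(train_texts, train_labels)
```

Calling `model.predict("Please confirm your account password")` then returns a label for an unseen message, and `accuracy(model, texts, labels)` gives the same kind of score the abstract reports, here only over whatever labelled examples are passed in. A real comparison would use a held-out test split of a large dataset rather than the training examples.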
Keywords
Phishing detection; Phishing emails; Machine Learning; Transformer Models; Traditional Models; Supervised Learning; Text Classification; Cyber Threat Mitigation; Cybersecurity
Subject
Computer Science and Mathematics, Information Systems
Copyright:
This is an open access article distributed under the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.