2.1. Natural Language Processing (NLP)
NLP-based models exploit the existing relationships between sentences, words, and sub-word units of a language in a given text dataset. This motivated us to explore the possibility of extracting text from an uploaded video on a phishing webpage to feed our neural network model. NLP architectures involve data preprocessing, feature extraction, and modeling:
Data preprocessing: Text in a given dataset must be preprocessed into a form that the model can easily understand, because preprocessing turns every character and word in the dataset into a format from which the machine learning classifier can extract useful patterns or learn. Since algorithms learn from data, the quality of the dataset used in training an ML model directly impacts the performance of that model; AI is therefore data-centric, and priority is given to data preprocessing during NLP.
NLP Stemming and Lemmatization
Stemming and lemmatization are the two major data preprocessing tasks in natural language processing. Stemming iterates over each word in the dataset and truncates it to its base form, for example mapping "university" to "univers" and "calamity" to "calam", while lemmatization uses a word's morphology and a vocabulary dictionary to find its corresponding root.
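The distinction can be illustrated with a toy suffix-stripping stemmer and a small lookup-based lemmatizer. Both are illustrative sketches only: the suffix list and the lemma dictionary below are our own assumptions, standing in for a real algorithm such as Porter stemming and a real morphological vocabulary such as WordNet.

```python
# Toy suffix-stripping stemmer: crude rule-based truncation, far simpler
# than (but in the spirit of) the Porter algorithm.
SUFFIXES = ("ities", "ity", "ness", "ing", "ed", "es", "s")

def stem(word: str) -> str:
    for suffix in SUFFIXES:  # longest suffixes are tried first
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Toy lemmatizer: a dictionary lookup standing in for a real vocabulary
# of word morphologies.
LEMMAS = {"better": "good", "was": "be", "universities": "university"}

def lemmatize(word: str) -> str:
    return LEMMAS.get(word, word)

print(stem("university"))   # -> "univers"
print(stem("calamity"))     # -> "calam"
print(lemmatize("better"))  # -> "good"
```

Note that the stemmer can produce non-words ("univers"), whereas the lemmatizer always returns a valid dictionary form; this is the essential difference between the two techniques.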
The final preprocessing stage of NLP is sentence segmentation, which breaks large text into linguistically meaningful sentences. During stop-word removal, trivial words such as "an," "the," and "a," which add little meaning or information to the text, are removed. Next, tokenization splits every text into words and fragments; the result is a word index together with the tokenized text, which can be represented as numerical tokens before being fed to a deep learning or machine learning model for prediction.
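The stages above (segmentation, stop-word removal, tokenization, and numerical encoding) can be sketched end to end as follows. This is a minimal illustration: the stop-word list is our own, and splitting sentences on full stops is a simplifying assumption that a real tokenizer would not make.

```python
STOP_WORDS = {"a", "an", "the", "is", "of", "to"}

def preprocess(text: str):
    """Segment into sentences, drop stop words, tokenize, and map each
    distinct token to a numerical index for an ML/DL model."""
    # Sentence segmentation: naive split on full stops.
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    tokens = []
    for sentence in sentences:
        for word in sentence.lower().split():
            word = word.strip(",;:!?\"'")        # shed punctuation
            if word and word not in STOP_WORDS:  # stop-word removal
                tokens.append(word)
    # Word index: each distinct token gets a numerical id (1-based).
    vocab = {w: i + 1 for i, w in enumerate(dict.fromkeys(tokens))}
    return [vocab[w] for w in tokens], vocab

ids, vocab = preprocess("The site is a phishing page. Verify the URL.")
print(vocab)  # {'site': 1, 'phishing': 2, 'page': 3, 'verify': 4, 'url': 5}
print(ids)    # [1, 2, 3, 4, 5]
```

The resulting list of integer ids is the "numerical token" representation that a deep learning model such as an LSTM consumes.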
2.2. Long Short-Term Memory (LSTM)
For this research, we opted for Long Short-Term Memory, a variant of the recurrent neural network (RNN), because it effectively addresses the vanishing and exploding gradients that cause long-term dependency problems in RNNs. The most important functioning part of an LSTM network is the cell state, which serves as the network's memory and enables it to remember the past; hence its suitability for capturing long-term dependencies and for sequence prediction problems [16]. An LSTM network has an input gate, a forget gate, and an output gate, each of which is a sigmoid activation function with an output value between 0 and 1.
The sigmoid function is a natural choice of gate because it outputs only positive values between 0 and 1, giving a direct answer on whether a particular feature should be kept or discarded.
In an LSTM network, the input gate determines what new information is to be stored in the cell state, the forget gate determines what information is to be discarded from the cell state, and the output gate provides the activation for the output, enabling more accurate prediction. During this activation, which occurs after the cell state is filtered, the output passes through the activation function that determines the portion of the output to be predicted; the current LSTM block then passes through a softmax layer to predict the value for the current block.
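A single time step of the standard LSTM cell, with the three sigmoid gates acting on the cell state as described above, can be sketched with NumPy as follows. The weight shapes and the dictionary layout of the parameters are our own illustrative choices, not part of any particular library's API.

```python
import numpy as np

def sigmoid(x):
    """Squashes to (0, 1): values near 1 keep a feature, near 0 discard it."""
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step. W (input weights), U (recurrent weights), and
    b (biases) each hold parameters for the input (i), forget (f), and
    output (o) gates and the candidate cell update (g)."""
    i = sigmoid(W["i"] @ x + U["i"] @ h_prev + b["i"])   # input gate
    f = sigmoid(W["f"] @ x + U["f"] @ h_prev + b["f"])   # forget gate
    o = sigmoid(W["o"] @ x + U["o"] @ h_prev + b["o"])   # output gate
    g = np.tanh(W["g"] @ x + U["g"] @ h_prev + b["g"])   # candidate values
    c = f * c_prev + i * g      # cell state: forget old, store new
    h = o * np.tanh(c)          # output: filtered view of the cell state
    return h, c
```

The cell state `c` carries information across arbitrarily many time steps, which is what lets the network retain long-term dependencies; in a classifier, the final hidden state `h` would feed a softmax layer for prediction.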
To mitigate the effect of phishing attacks, several methods and frameworks have been proposed for phishing attack detection, with varying results. These methods can be classified by approach as non-machine-learning, machine learning (Bayesian-based and non-Bayesian-based), and deep-learning-based. As attackers continue to probe potential vulnerabilities in existing phishing detection solutions, they are beginning to rely on images and uploaded videos rather than traditional text to evade detection; the inability of existing machine-learning-based models to detect such phishing sites is a peculiar limitation of existing AI-based solutions. Palla Yaswanth and V. Nagaraju [
17] used the Huang and Premaratne dataset from the Kaggle repository, with an equal number of phishing and legitimate records, to build a novel network for phishing prediction, achieving an accuracy of 95% for naive Bayes and 94.67% for random forest after parameter tuning. However, in comparing the performance of naive Bayes [18] and random forest for detecting phishing sites in a network, the model was never tested against sophisticated forms of phishing attack, and the causes of the 5% failure rate of naive Bayes were not investigated.
Abdul Karim et al. [
19] proposed a hybrid model that combines logistic regression, support vector machine, and decision tree with soft and hard voting. The model used grid-search hyperparameter optimization, k-fold cross-validation, and a canopy feature selection method to select relevant features from the dataset, achieving an accuracy of 98.2% using only the attribute properties of the uniform resource locator. This sole reliance on URL attributes makes the approach extremely vulnerable to URL manipulation: any attacker with even a little experience in web technology can serve a malicious webpage behind a friendly URL to fool the model.
Ishwarya et al. [
20] proposed a phishing detection method comprising the naive Bayes, SVM, KNN, and random forest algorithms, and evaluated the performance of each of the four classifiers in detecting phishing email. Naive Bayes achieved the highest accuracy, 98.2%; however, the use of an imbalanced dataset of 87% ham and 13% spam introduces bias into the proposed model, and the problem of Bayesian poisoning was not addressed.
Kamal Omari [
21] used the UCI phishing domains dataset to propose a machine-learning-based model investigating logistic regression (LR), k-nearest neighbors (KNN), support vector machine (SVM), naive Bayes (NB), decision tree (DT), random forest (RF), and gradient boosting for the phishing detection task. We therefore believe that the 98.1% accuracy for the phishing detection task obtained from the naive Bayes classifier by Ishwarya et al. (2023) [
20] was due to the massive, unaddressed imbalance in the dataset (87% ham and 13% spam); moreover, the proposed model does not address detection evasion through videos and images uploaded to a phishing webpage.
Ann Zeky et al. [
22] proposed an extraction-based naive Bayes model for phishing detection, with emphasis on extracting relevant features such as unusual characters, spelling mistakes, domain names, and URL analysis from unseen web pages to classify a website as malicious or benign. The model was trained on a relatively balanced dataset of 7000 records (54% malicious and 46% benign), achieving an accuracy of 99.1%. Because it combines content extraction with URL analysis, we believe the proposed model would not be vulnerable to malicious URLs: even if an attacker used a friendly URL to deceive the model, the model does not rely on URL properties alone but also performs background webpage extraction, so it would still classify webpages correctly. An attacker could, however, still employ Bayesian poisoning.
Nishitha et al. [
23] compared the performance of machine learning and deep learning algorithms for phishing detection, implementing KNN, decision tree, random forest, and logistic regression as machine learning algorithms, and a convolutional neural network (CNN) and a recurrent neural network as deep learning models. Logistic regression and the CNN performed best, with accuracies of 95% and 96% respectively; however, the proposed model uses only URL properties and so cannot handle a sophisticated phishing attack that relies on image and video content.
Twana and Murat [24], assuming that no single solution can detect most phishing attacks, investigated the impact of feature selection on the naive Bayes model. They [24] developed six naive-Bayes-based models, each using a single feature selection (FS) technique chosen from individual FS, forward FS, backward FS, plus-l take-away-r FS, AR1, and All. In their experiment, the naive Bayes model with plus-l take-away-r feature selection performed best, with an accuracy of 93.39%, while the naive Bayes classifier with the individual feature selection technique performed worst, with an accuracy of 92.05%, leading to the conclusion that feature selection has a direct impact on the accuracy of phishing detection.
Jaya T et al. [
25] explored the prospect of using unsupervised learning to cluster spam and ham messages in mail based on the frequency weighting of words in the message content, essentially a natural language processing task, and compared the performance of the random forest, logistic regression, random tree, Bayes net, and naive Bayes algorithms against an LSTM algorithm for phishing detection. In the experiment, the deep-learning-based LSTM showed the most encouraging performance, followed by random forest.
One limitation common to all previously proposed models, frameworks, and approaches is that they can detect only text-based and URL-based phishing webpages and URLs, as they are trained solely on text and on properties of the uniform resource locator. Current machine learning and deep learning models are not trained to detect more complex and increasingly sophisticated phishing attacks that rely heavily on SEO-friendly URLs, text placed on images, and deepfake AI-generated video to evade detection; hence their vulnerability to these complex forms of phishing attack.