1. Introduction
Multiple sclerosis (MS) is a chronic and often disabling neurological condition that primarily affects the central nervous system [1], comprising the brain and spinal cord. It is considered an autoimmune disease, which means that the immune system mistakenly targets and damages the protective covering of nerve fibers called myelin [2]. This damage disrupts the normal flow of electrical impulses along the nerves, leading to a wide range of neurological symptoms [3].
The symptoms of MS can vary significantly from person to person, making it a highly unpredictable condition. Common symptoms include fatigue, muscle weakness, difficulty walking, numbness or tingling sensations, problems with coordination and balance, and visual disturbances [4]. These symptoms can come and go, or they may persist and worsen over time. In some cases, individuals with MS may experience more severe symptoms, such as difficulty with speech, cognitive impairments [5], and issues with bladder and bowel control [6].
While the exact cause of MS remains unknown, it is believed to involve a complex interplay of genetic and environmental factors. There is currently no cure for multiple sclerosis [7], but various treatments are available to manage symptoms, reduce the frequency of relapses, and slow the progression of the disease. These treatments often include medications, physical therapy, and lifestyle modifications to improve overall well-being and quality of life for individuals living with MS [8].
Traditional MS recognition methods have several shortcomings, which have led to the development and adoption of more advanced diagnostic techniques. The main shortcomings [9] of traditional MS recognition methods include: (i) Reliance on clinical symptoms: traditional MS diagnosis often relies on clinical symptoms alone. Since MS symptoms can be nonspecific and mimic other conditions [10], this approach may lead to delayed diagnosis or misdiagnosis, as many other diseases can produce similar symptoms [11]. This delay can hinder early intervention and treatment, potentially allowing the disease to progress. (ii) Lack of objectivity: clinical assessment of MS is subjective and can vary from one healthcare provider to another. The lack of objective, quantifiable measures [12] makes it challenging to establish a definitive diagnosis and monitor disease progression accurately. (iii) Invasive procedures: historically, diagnosing MS required invasive procedures such as a lumbar puncture (spinal tap) [13] to analyze cerebrospinal fluid for abnormalities. These procedures can be uncomfortable and carry some risks.
Recently, scholars have tended to use machine learning methods to recognize MS. Machine learning methods [14] are a subset of artificial intelligence techniques that enable computers to learn from data and make predictions or decisions without being explicitly programmed [15]. These methods involve the development of algorithms that can analyze and interpret patterns in large datasets, automatically adjusting their parameters to improve performance over time. Machine learning encompasses various approaches, including supervised learning [16] (where models are trained on labeled data to make predictions), unsupervised learning [17] (for discovering patterns and structures in data), and reinforcement learning [18] (for decision-making in dynamic environments). These methods have wide-ranging applications, from natural language processing and image recognition to autonomous robotics and recommendation systems, and are fundamental in enabling computers to perform tasks that require learning from experience. Lopez [19] employed the Haar wavelet transform and the logistic regression (LR) method. Han et al. [20] used an adaptive genetic algorithm (AGA) to detect MS. Zhao et al. [21] proposed the Dirichlet mixture of Gaussian processes with split-kernel (DMGPS) to identify MS. Han et al. [22] proposed using Hu moment invariants (HMI) to classify MS.
This paper proposes a novel method that combines wavelet entropy (WE) and a particle swarm optimization-based neural network (PSONN). The main contribution of our study is the combination of WE-based feature extraction with PSO-optimized neural network training for MS recognition, evaluated with 10-fold cross-validation.
2. Dataset
The dataset is from Ref. [20]. Its demographic description is listed in Table 1, where two categories exist: (i) MS and (ii) healthy control (HC). Figure 1 shows one sample of MS.
Data harmonization [25] is the process of bringing together and standardizing data from different sources or formats to make them consistent and compatible for analysis, integration, or other purposes. The goal of data harmonization is to create a unified and cohesive dataset that can be used for various analytical, reporting, or decision-making tasks. Histogram stretching (HS) [26], also known as contrast stretching or intensity stretching, is an image-processing technique that improves the contrast and visibility of an image by expanding the range of pixel intensities. It linearly scales the intensity values of an image so that the minimum and maximum values span the entire available range, typically from 0 to 255 in an 8-bit image [27]. HS enhances the visual appearance of an image, making details more distinguishable, and it is often used in image enhancement and preprocessing to make images more suitable for subsequent analysis or visualization. HS is used in our method due to its ease of implementation. Suppose $a(x,y)$ denotes the original brain slice and $b(x,y)$ stands for the processed brain slice, where $(x,y)$ denotes the coordinates. The HS operation is defined via:
$$b(x,y)=\frac{a(x,y)-a_{\min}}{a_{\max}-a_{\min}}\times 255,$$
where $a_{\min}$ and $a_{\max}$ stand for the lowest and highest grayscale intensity values of $a$, respectively.
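For concreteness, a minimal NumPy sketch of this HS step is given below. The function name and the assumption that the slice is an 8-bit grayscale image stored as a NumPy array are illustrative, not part of the original pipeline description.

```python
import numpy as np

def histogram_stretch(a: np.ndarray) -> np.ndarray:
    """Linearly rescale a grayscale slice so its intensities span [0, 255]."""
    a = a.astype(np.float64)
    a_min, a_max = a.min(), a.max()            # lowest / highest grayscale values
    if a_max == a_min:                         # guard against a constant image
        return np.zeros_like(a, dtype=np.uint8)
    b = (a - a_min) / (a_max - a_min) * 255.0  # stretch to the full 8-bit range
    return b.astype(np.uint8)
```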
3. Methodology
3.1. Wavelet Entropy
Wavelet entropy [28] is a mathematical and statistical measure used in signal processing and data analysis to characterize the complexity or irregularity of a signal or time series. It is derived from the wavelet transform, a mathematical technique for decomposing a signal into different frequency components and analyzing their variations over time [29]. Wavelet entropy quantifies the amount of information or disorder within these frequency components and is particularly useful in fields like neuroscience [30], finance, and environmental science for analyzing complex and non-stationary signals. To compute wavelet entropy, one typically follows these steps:
Step 1: Wavelet Transform. Apply the wavelet transform to the time series data [31]. This transform breaks the signal down into different scales, allowing for the analysis of both low- and high-frequency components.
Step 2: Probability Distribution. After obtaining the wavelet coefficients, create a probability distribution that represents the magnitude of each scale. This distribution characterizes how the signal's energy is distributed across different scales or frequencies.
Step 3: Entropy Calculation. Wavelet entropy [32] is then calculated from the probability distribution. It measures the unpredictability or randomness within the signal: higher wavelet entropy values indicate greater complexity, while lower values suggest more regular or predictable patterns.
Shannon entropy is used since it is the originator of modern information theory. Suppose a variable $R$ is discrete and random, and it falls within the value set $\mathcal{R}$ with probability mass function $P(R)$. The Shannon entropy is defined as:
$$H(R)=\mathbb{E}\left[-\log_{b} P(R)\right],$$
where $H$ is the entropy and $\mathbb{E}$ represents the expected-value operation. If the set of possible values of $R$ is finite, the formula above is transformed as:
$$H(R)=-\sum_{r\in\mathcal{R}} P(r)\log_{b} P(r),$$
where $b$ is the base of the log function, determining the unit of Shannon entropy.
There are seven coefficient matrices (Figure 2) after a 2-level 2D-DWT [33]: four of size 64×64 and three of size 128×128. By using this method, the dimension of the feature vector is effectively reduced from the original 256×256 = 65,536 to only seven.
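The sketch below combines the 2-level 2D-DWT with the Shannon entropy formula above to produce the 7-dimensional feature vector. It uses the PyWavelets package (`pywt.wavedec2`); the choice of the Haar wavelet, the energy-based normalization of each subband, and the small stabilizing constant are assumptions for illustration, as the exact formulation may differ from the one used in this study.

```python
import numpy as np
import pywt

def wavelet_entropy_features(slice_2d: np.ndarray, wavelet: str = "haar") -> np.ndarray:
    """Return one Shannon entropy per subband of a 2-level 2D-DWT (7 values in total)."""
    coeffs = pywt.wavedec2(slice_2d.astype(np.float64), wavelet=wavelet, level=2)
    subbands = [coeffs[0]]              # LL2 approximation (64x64 for a 256x256 slice)
    for detail_level in coeffs[1:]:     # (LH, HL, HH) detail triplets per level
        subbands.extend(detail_level)
    features = []
    for band in subbands:
        energy = band.ravel() ** 2
        p = energy / (energy.sum() + 1e-12)       # probability distribution of coefficient energy
        p = p[p > 0]
        features.append(-np.sum(p * np.log2(p)))  # Shannon entropy in bits
    return np.asarray(features)                   # shape (7,)
```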
Overall, wavelet entropy offers advantages over traditional entropy measures, especially for analyzing signals with varying frequency content and non-stationary behavior. It can capture intricate patterns and fluctuations in time series data, making it a valuable tool for understanding complex systems and processes in various scientific and engineering disciplines.
3.2. Feedforward Neural Network
A feedforward neural network (FNN) [34], often simply called a feedforward network, is a fundamental type of neural network architecture used in machine learning and deep learning. It represents the simplest form of neural network, where information flows in one direction [35], from input to output, without any feedback loops or recurrent connections [36]. A more detailed explanation is given in the following four paragraphs.
An FNN consists of multiple layers of interconnected nodes or neurons (Figure 3). These layers typically include an input layer, one or more hidden layers, and an output layer [37]. Each neuron in a layer is connected to every neuron in the subsequent layer, but there are no connections that loop back within the same layer or to previous layers. Each connection between neurons has an associated weight, which is adjusted during training to enable the network to learn and make predictions [38].
The operation of an FNN involves a simple forward propagation of information. Input data are fed into the input layer and propagate through the network layer by layer [39]. Neurons in each layer perform a weighted sum of their inputs and pass the result through an activation function. This output then becomes the input for the next layer, and the final output is obtained at the output layer [40]. This process is deterministic and does not involve any recurrent or feedback connections.
Activation functions play a crucial role in feedforward networks. They introduce nonlinearity into the model, allowing the network to learn complex relationships in data [41]. Common activation functions include the sigmoid, hyperbolic tangent (tanh), and rectified linear unit (ReLU) functions [42]. These functions determine the output of a neuron based on its weighted input sum and introduce the concept of thresholds and saturation points [43].
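As a rough illustration of the forward propagation and activation functions just described, the NumPy sketch below passes a vector through a small fully connected network. The layer sizes (7 inputs matching the WE feature vector, 10 hidden neurons, 2 outputs for MS vs. HC), the random weights, and the default sigmoid activation are illustrative assumptions only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # squashes values into (0, 1)

def relu(z):
    return np.maximum(0.0, z)         # zero below the threshold, linear above

def forward(x, weights, biases, activation=sigmoid):
    """Forward propagation through a feedforward network.
    weights[i] has shape (n_out, n_in); biases[i] has shape (n_out,)."""
    a = x
    for W, b in zip(weights, biases):
        z = W @ a + b                 # weighted sum of the previous layer's outputs
        a = activation(z)             # element-wise nonlinearity
    return a                          # activations of the output layer

# Illustrative sizes: 7 wavelet-entropy features -> 10 hidden neurons -> 2 classes (MS vs. HC)
rng = np.random.default_rng(0)
weights = [rng.standard_normal((10, 7)), rng.standard_normal((2, 10))]
biases = [np.zeros(10), np.zeros(2)]
output = forward(rng.standard_normal(7), weights, biases)
```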
FNNs are trained using various supervised learning algorithms, with the most popular being backpropagation. During training, the network learns to adjust its weights to minimize the difference between its predictions (outputs) and the target values in the training data. This is typically done by computing gradients of an error or loss function with respect to the network's weights and then adjusting the weights [44] in the direction that reduces the error. This process is repeated iteratively until the network's performance on the training data reaches a satisfactory level [45].
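For completeness, a minimal sketch of one backpropagation update for a single-hidden-layer sigmoid network trained on squared error is shown below; the learning rate and network shape are illustrative, and this study replaces gradient-based training with PSO in Section 3.3.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, t, W1, b1, W2, b2, lr=0.1):
    """One gradient-descent update for a 1-hidden-layer sigmoid network
    minimizing 0.5 * ||y - t||^2 between output y and target t."""
    # Forward pass
    h = sigmoid(W1 @ x + b1)
    y = sigmoid(W2 @ h + b2)
    # Backward pass: error signals at the output and hidden layers
    delta_out = (y - t) * y * (1 - y)
    delta_hid = (W2.T @ delta_out) * h * (1 - h)
    # Adjust weights in the direction that reduces the error
    W2 -= lr * np.outer(delta_out, h); b2 -= lr * delta_out
    W1 -= lr * np.outer(delta_hid, x); b1 -= lr * delta_hid
    return W1, b1, W2, b2
```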
3.3. PSO-Based Neural Network
A particle swarm optimization-based neural network (PSONN) is a combination of two distinct computational techniques: particle swarm optimization (PSO) [46] and feedforward neural networks (FNNs).
PSO [47] is a heuristic optimization algorithm inspired by the social behavior of bird flocking or fish schooling, where individual particles adjust their positions in search of an optimal solution. FNNs, on the other hand, are a class of machine learning models inspired by the human brain that can learn and make predictions from data. Combining PSO with FNNs aims to improve the training and optimization process of neural networks [48].
PSO is an optimization algorithm that simulates the behavior of a swarm of particles in a multidimensional search space. Each particle represents a potential solution to the optimization problem. These particles iteratively update their positions and velocities based on their own experiences and the experiences of their peers in the swarm. PSO is particularly useful for finding global optima in complex, high-dimensional search spaces. In the context of neural networks, PSO can be applied to optimize the weights and biases of the network to minimize a specific cost or error function [49].
In a PSONN, the PSO algorithm [50] is used to optimize the weights and biases of the neural network. Each particle in the PSO swarm corresponds to a set of neural network weights and biases. The objective is to find the combination of weights and biases that minimizes the error or cost function associated with the neural network's performance on a specific task. PSO guides the search in weight space [51], helping the neural network converge to a more optimal solution. The fitness function used in PSO is often related to the neural network's error on a training dataset.
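To make this concrete, the sketch below searches the flattened weight space of an FNN with a standard PSO loop. The swarm size, inertia weight `w`, acceleration coefficients `c1`/`c2`, iteration count, initialization range, and the user-supplied loss function are illustrative assumptions, not the settings used in this study.

```python
import numpy as np

def pso_train(loss, dim, n_particles=30, iters=100, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimize loss(vector), where vector is a flattened set of FNN weights and biases."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1, 1, (n_particles, dim))    # particle positions = candidate weight vectors
    v = np.zeros_like(x)                          # particle velocities
    pbest = x.copy()                              # each particle's best position so far
    pbest_val = np.array([loss(p) for p in x])
    gbest = pbest[pbest_val.argmin()].copy()      # best position found by the whole swarm
    for _ in range(iters):
        r1, r2 = rng.random(x.shape), rng.random(x.shape)
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
        x = x + v
        vals = np.array([loss(p) for p in x])
        improved = vals < pbest_val
        pbest[improved] = x[improved]
        pbest_val[improved] = vals[improved]
        gbest = pbest[pbest_val.argmin()].copy()
    return gbest                                  # weight vector with the lowest fitness value
```

The returned vector would then be reshaped into the network's weight matrices and biases; in practice, the loss would be the classification error of the FNN on the training folds.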
PSONNs offer advantages in optimizing neural network architectures, making them more efficient and capable of solving complex problems. They are used in various applications, including pattern recognition, time series prediction, feature selection, and optimization of deep neural networks. By combining the strengths of PSO's global search capabilities with the learning and generalization abilities of FNNs, this approach can enhance the performance and efficiency of neural network models, making them more suitable for real-world applications. In the future, we shall try more advanced and recent optimization methods [52].
3.4. 10-Fold Cross Validation
$k$-fold cross-validation is a widely used technique in machine learning and model evaluation that helps assess the performance of a predictive model while maximizing the use of available data [53]. It involves splitting the dataset into $k$ subsets, or "folds," where $k$ is typically a positive integer such as 10 (here, $k = 10$). The process can be summarized in four key aspects:
Step 1: Dataset Splitting. The first step in $k$-fold cross-validation is to divide the dataset into $k$ roughly equal-sized parts or folds. Each fold represents a subset of the data, and these subsets are often randomly selected to ensure that the cross-validation process is not biased by the order of the data.
Step 2: Training and Testing. $k$-fold cross-validation then proceeds through $k$ iterations. During each iteration, one of the folds is held out as the test set, while the remaining $k-1$ folds are used as the training set. This means that every data point in the dataset is used for testing exactly once.
Step 3: Model Training and Evaluation. In each iteration, a predictive model (e.g., a machine learning algorithm) is trained on the training set, and its performance is evaluated on the corresponding test set. Common performance metrics, such as accuracy, precision, recall, or mean squared error, are computed to assess how well the model generalizes to unseen data [54].
Step 4: Cross-Validation Results. After completing all $k$ iterations, $k$ separate performance metrics are obtained, one for each fold. These metrics are typically averaged to produce a single, overall performance measure that provides a more robust estimate of how well the model is likely to perform on new, unseen data. This average performance metric is often used to compare different models or tuning parameters [55].
Figure 4 shows the diagram of $k$-fold cross-validation.
$k$-fold cross-validation is a valuable technique for several reasons. It helps in assessing a model's performance more reliably than a single train-test split because it uses all available data for both training and testing [56]. It also provides insights into the model's stability and generalization performance by examining its performance on different subsets of the data. Additionally, $k$-fold cross-validation is particularly useful when the dataset is limited in size, as it allows for more efficient and effective use of the available data for model evaluation. Overall, it is a key tool for model selection, hyperparameter tuning, and assessing a model's expected performance in real-world applications.
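A minimal sketch of this procedure is given below using scikit-learn's `StratifiedKFold` with $k = 10$; the stratified splitter and the placeholder callables `train_fn` / `predict_fn` are assumptions for illustration rather than the exact partitioning used in this study.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def cross_validate(features, labels, train_fn, predict_fn, k=10, seed=0):
    """Run k-fold cross-validation and return the per-fold accuracies."""
    skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=seed)
    accuracies = []
    for train_idx, test_idx in skf.split(features, labels):
        model = train_fn(features[train_idx], labels[train_idx])  # fit on k-1 folds
        preds = predict_fn(model, features[test_idx])             # test on the held-out fold
        accuracies.append(np.mean(preds == labels[test_idx]))
    return np.asarray(accuracies)  # one accuracy per fold; the mean is the CV estimate
```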
3.5. Measures on Runs
Suppose we carry out $k$-fold cross-validation for $W$ runs, and the confusion matrix over the $w$-th run ($w=1,\dots,W$) is
$$C^{(w)}=\begin{pmatrix} \mathrm{TP}^{(w)} & \mathrm{FN}^{(w)} \\ \mathrm{FP}^{(w)} & \mathrm{TN}^{(w)} \end{pmatrix},$$
where the four elements represent TP, FN, FP, and TN, respectively. Here P means the positive category (MS), and N is the negative category (HC).
The sensitivity [57] (symbolized as $S_{\mathrm{en}}$), specificity [58] (symbolized as $S_{\mathrm{pc}}$), precision (symbolized as $P_{\mathrm{rc}}$), and accuracy (symbolized as $A_{\mathrm{cc}}$) of the $w$-th run are defined as:
$$S_{\mathrm{en}}^{(w)}=\frac{\mathrm{TP}^{(w)}}{\mathrm{TP}^{(w)}+\mathrm{FN}^{(w)}},\quad S_{\mathrm{pc}}^{(w)}=\frac{\mathrm{TN}^{(w)}}{\mathrm{TN}^{(w)}+\mathrm{FP}^{(w)}},\quad P_{\mathrm{rc}}^{(w)}=\frac{\mathrm{TP}^{(w)}}{\mathrm{TP}^{(w)}+\mathrm{FP}^{(w)}},\quad A_{\mathrm{cc}}^{(w)}=\frac{\mathrm{TP}^{(w)}+\mathrm{TN}^{(w)}}{\mathrm{TP}^{(w)}+\mathrm{FN}^{(w)}+\mathrm{FP}^{(w)}+\mathrm{TN}^{(w)}}.$$
The F1 score [59] of the $w$-th run, $F_{1}^{(w)}$, is defined as
$$F_{1}^{(w)}=\frac{2\,P_{\mathrm{rc}}^{(w)}\,S_{\mathrm{en}}^{(w)}}{P_{\mathrm{rc}}^{(w)}+S_{\mathrm{en}}^{(w)}}.$$
The Matthews correlation coefficient (MCC) [60] of the $w$-th run, $M^{(w)}$, is:
$$M^{(w)}=\frac{\mathrm{TP}^{(w)}\,\mathrm{TN}^{(w)}-\mathrm{FP}^{(w)}\,\mathrm{FN}^{(w)}}{\sqrt{\left(\mathrm{TP}^{(w)}+\mathrm{FP}^{(w)}\right)\left(\mathrm{TP}^{(w)}+\mathrm{FN}^{(w)}\right)\left(\mathrm{TN}^{(w)}+\mathrm{FP}^{(w)}\right)\left(\mathrm{TN}^{(w)}+\mathrm{FN}^{(w)}\right)}}.$$
The Fowlkes–Mallows index (FMI) [61] of the $w$-th run, $F^{(w)}$, is defined as:
$$F^{(w)}=\sqrt{\frac{\mathrm{TP}^{(w)}}{\mathrm{TP}^{(w)}+\mathrm{FP}^{(w)}}\cdot\frac{\mathrm{TP}^{(w)}}{\mathrm{TP}^{(w)}+\mathrm{FN}^{(w)}}}.$$
After all $W$ runs are completed, the mean and standard deviation (MSD, symbolized as $\mu\pm\sigma$) of each measure $m$ are defined as:
$$\mu=\frac{1}{W}\sum_{w=1}^{W}m^{(w)},\qquad \sigma=\sqrt{\frac{1}{W-1}\sum_{w=1}^{W}\left(m^{(w)}-\mu\right)^{2}}.$$
The receiver operating characteristic (ROC) curve and the area under the curve (AUC) are also reported based on the $W$ runs.
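The sketch below computes these per-run measures from TP, FN, FP, and TN and aggregates them as mean ± standard deviation over the runs. The function names and the representation of each confusion matrix as a (TP, FN, FP, TN) tuple are illustrative assumptions.

```python
import numpy as np

def run_measures(tp, fn, fp, tn):
    """Per-run measures derived from one confusion matrix (positive class = MS)."""
    tp, fn, fp, tn = map(float, (tp, fn, fp, tn))
    sen = tp / (tp + fn)                   # sensitivity
    spc = tn / (tn + fp)                   # specificity
    prc = tp / (tp + fp)                   # precision
    acc = (tp + tn) / (tp + fn + fp + tn)  # accuracy
    f1 = 2 * prc * sen / (prc + sen)       # F1 score
    mcc = (tp * tn - fp * fn) / np.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))  # Matthews correlation coefficient
    fmi = np.sqrt(prc * sen)               # Fowlkes-Mallows index
    return np.array([sen, spc, prc, acc, f1, mcc, fmi])

def mean_std_over_runs(confusion_matrices):
    """Aggregate the measures over W runs as mean and sample standard deviation."""
    table = np.array([run_measures(*cm) for cm in confusion_matrices])
    return table.mean(axis=0), table.std(axis=0, ddof=1)
```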
5. Conclusions
In conclusion, this study demonstrates a promising approach to improving the accuracy of multiple sclerosis diagnosis. By leveraging wavelet entropy for feature extraction and optimizing neural network parameters with a particle swarm optimization (PSO) algorithm, the research shows the potential for more precise and reliable identification of this complex neurological condition. The combination of advanced signal processing [65,66] techniques and machine learning methods offers a valuable contribution to the field of medical image analysis [67,68], paving the way for enhanced early detection and management of multiple sclerosis, ultimately benefiting patients and healthcare providers alike.
Further research and validation of this approach hold the potential to advance our understanding and treatment of multiple sclerosis. We shall explore advanced feature selection techniques [69] to identify the most relevant wavelet entropy features for multiple sclerosis recognition; this could involve employing machine learning algorithms or domain-specific knowledge to optimize feature extraction [70]. We shall also conduct larger-scale clinical studies to validate the WE-PSONN model's performance on diverse patient populations; gathering data from multiple healthcare institutions can help assess the model's generalizability and robustness.