1. Introduction
Machine learning models can be simply defined as a mapping from input data to output data. This mapping must be usable and robust for unseen samples; thus, it cannot be a simple tabular lookup. Instead, outputs must be computed from the input data. Hence, machine learning models are usually mathematical functions [1].
Two major steps in the development of machine learning models are model selection and training by optimization. Neural networks are a well-known family of models that have shown high capability in facilitating both steps [2].
In theory, shallow neural networks, such as a two-layer multilayer perceptron (MLP) with non-linear activation functions, can approximate any input-output relationship or function, provided they have a sufficient number of trainable parameters and sufficient non-linearity [3].
Deep learning models have been shown to need far fewer trainable parameters while preserving performance similar to that of large shallow networks [4]. Fewer parameters mean shorter training time and higher trainability, since a better optimum can be found within a fixed running time. Many different architectures and building blocks are deployed in deep learning models, such as convolutional neural networks (CNNs) [2], recurrent neural networks (RNNs) [5], and transformer neural networks (TNNs) [6]. Both CNNs and TNNs are used in image processing and analysis applications, and depending on dataset size and model size, one can be preferred over the other: TNNs are usually better suited to large datasets, while CNNs perform better on small datasets [7,8].
Medical image processing encompasses a wide range of applications and domains, including areas such as radiology and pathology. However, despite the vast potential of this field, only a limited number of these applications have access to large, well-curated datasets [9]. Hence, while transformers have shown slightly better performance on large cohorts, deep CNNs remain more favorable in medical image processing because they perform better on smaller cohorts [7,10].
Two well-known deep CNNs for image segmentation are fully convolutional neural networks (FCNNs) [11] and U-Nets [12]. However, both FCNN and U-Net are deep neural networks, and their functions cannot be explained clearly [13]. Although many of the methods discussed in [14,15,16] can be used for explainability, only a few of them are widely accepted in clinical interpretation, since direct visualization of their features is not human-interpretable. This opacity in decision making introduces challenges in clinical setups, where the ability to audit a model's decision-making process is crucial. Explainable models provide transparency by making the model's reasoning accessible, which allows clinicians to validate AI outputs in a manner aligned with their expertise. In the next section, we elaborate on human-interpretable and manually engineered features. At the same time, we develop a theoretical model from the ground up, chronologically, to include the most essential hand-crafted and machine learning-based features in an explainable shallow convolutional neural network. ExShall-CNN and its source code are publicly available at:
https://github.com/MLBC-lab/ExShall-CNN
2. Theory and Calculation
Otsu [17] suggested global thresholding by minimizing the intra-class variance, which is equivalent to maximizing the inter-class variance, for background/foreground segmentation of images:

$$t^{*}=\arg\max_{1\le t<L}\; P(t)\bigl(1-P(t)\bigr)\left[\frac{\sum_{i=1}^{t} i\,p(i)}{P(t)}-\frac{\sum_{i=t+1}^{L} i\,p(i)}{1-P(t)}\right]^{2} \qquad (1)$$

where L is the length of the pixel range (e.g., 256 for grayscale images), and p and P are the probability density and cumulative probability functions of the pixel values, respectively.
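A minimal sketch of how Eq. 1 can be evaluated from an image histogram is given below, assuming NumPy and an integer grayscale image with values in [0, L); the helper name otsu_threshold is ours and not part of any cited library:

```python
import numpy as np

def otsu_threshold(image: np.ndarray, L: int = 256) -> int:
    """Return the threshold maximizing the between-class variance of Eq. 1."""
    hist = np.bincount(image.ravel(), minlength=L).astype(float)
    p = hist / hist.sum()                      # probability density of gray levels
    P = np.cumsum(p)                           # cumulative probability P(t)
    levels = np.arange(L)
    cum_mean = np.cumsum(levels * p)           # cumulative first moment
    global_mean = cum_mean[-1]
    # Between-class variance for every candidate threshold t:
    #   P(t) * (1 - P(t)) * (mu0(t) - mu1(t))^2
    with np.errstate(divide="ignore", invalid="ignore"):
        mu0 = cum_mean / P
        mu1 = (global_mean - cum_mean) / (1.0 - P)
        sigma_b2 = P * (1.0 - P) * (mu0 - mu1) ** 2
    sigma_b2 = np.nan_to_num(sigma_b2)         # ignore empty classes
    return int(np.argmax(sigma_b2))

# Usage: mask = image > otsu_threshold(image)
```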
Eq. 1 is usually solved by iterating over all candidate thresholds, and its computation is very fast even for very large images, since it only needs the frequencies of the colors or shades of gray to calculate the threshold value. Although this method is highly efficient, it determines a single threshold for the entire image without taking the spatial distribution of pixel values into account. This means it may overlook important variations and contextual information present in different regions of the image, potentially affecting the accuracy of segmentation or analysis. By not considering how pixel values are distributed spatially, the method may not capture the nuances needed for more complex image processing tasks. To address the disparity between local and global pixel value distributions, Sauvola [18] proposed using a local threshold:

$$T(x,y)=\mu(x,y)\left[1+k\left(\frac{\sigma(x,y)}{r}-1\right)\right] \qquad (2)$$
where n is the diameter of the local neighborhood, μ(x,y) and σ(x,y) are the mean and standard deviation of the local neighborhood around pixel (x,y), respectively, and k and r are two coefficients determined by the application. It is common to set r to half of the maximum pixel value. Hence, two values, n and k, must be determined to obtain acceptable performance. If segmentation labels are available, these two values can be optimized adaptively for the dataset.
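Similarly, a minimal sketch of Sauvola's local threshold in Eq. 2 is shown below, assuming SciPy's uniform_filter for the local statistics; the function name and its default parameter values for n, k, and r are illustrative only:

```python
import numpy as np
from scipy.ndimage import uniform_filter

def sauvola_mask(image: np.ndarray, n: int = 15, k: float = 0.2, r: float = 128.0) -> np.ndarray:
    """Binarize an image with the local threshold of Eq. 2."""
    img = image.astype(float)
    mean = uniform_filter(img, size=n)                  # local mean mu(x, y)
    sq_mean = uniform_filter(img ** 2, size=n)          # local mean of squares
    std = np.sqrt(np.maximum(sq_mean - mean ** 2, 0))   # local std sigma(x, y)
    threshold = mean * (1.0 + k * (std / r - 1.0))
    return img > threshold
```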
It can be inferred that the mean and standard deviation serve as local features, with Eq. 2 acting as a classification threshold. Thus, the advantages of the Sauvola method compared to Otsu highlight that local features effectively differentiate between background and foreground in classification tasks. Consequently, a general approach can focus on extracting local features followed by classification. The only raw data available are the pixel values and their spatial arrangement, and all other features must be derived from this foundational information. Mean and standard deviation are examples of features created from this raw data, which can be further expanded.
Raw data comes from two sources: the pixel values themselves and the values of adjacent pixels. Features can be defined as any transformation of these raw values into a new domain that simplifies classification, such as linear mappings. Keeping this in mind, various mappings have been proposed, referred to as kernel methods [19]. Some common kernels are depicted in Table 1 [20].
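As a reminder of the standard definitions (not necessarily the exact entries of Table 1), the kernels referred to later in the text, namely the linear, polynomial, sigmoid, cosine, RBF, Laplacian, and chi-squared kernels, are commonly written as

$$k_{\mathrm{lin}}(\mathbf{x},\mathbf{y})=\mathbf{x}^{\top}\mathbf{y}, \qquad k_{\mathrm{poly}}(\mathbf{x},\mathbf{y})=(\gamma\,\mathbf{x}^{\top}\mathbf{y}+c)^{d}, \qquad k_{\mathrm{sig}}(\mathbf{x},\mathbf{y})=\tanh(\gamma\,\mathbf{x}^{\top}\mathbf{y}+c),$$
$$k_{\cos}(\mathbf{x},\mathbf{y})=\frac{\mathbf{x}^{\top}\mathbf{y}}{\lVert\mathbf{x}\rVert\,\lVert\mathbf{y}\rVert}, \qquad k_{\mathrm{RBF}}(\mathbf{x},\mathbf{y})=\exp\bigl(-\gamma\lVert\mathbf{x}-\mathbf{y}\rVert^{2}\bigr), \qquad k_{\mathrm{Lap}}(\mathbf{x},\mathbf{y})=\exp\bigl(-\gamma\lVert\mathbf{x}-\mathbf{y}\rVert_{1}\bigr),$$
$$k_{\chi^{2}}(\mathbf{x},\mathbf{y})=\exp\Bigl(-\gamma\sum_{i}\frac{(x_{i}-y_{i})^{2}}{x_{i}+y_{i}}\Bigr).$$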
Kernel methods fully encompass the Otsu and Sauvola methods, as both rely on two key quantities, the mean and the standard deviation, which are linear or second-order combinations of adjacent pixel values. Hence, if a more complicated model can realize all of the given kernel mappings, it can also cover the Otsu and Sauvola methods. All of these kernels are built from a few fundamental mathematical operations, such as summation, subtraction, multiplication, division, power, and exponentiation.
Otsu, Sauvola, and kernel methods are notable for their explainability, particularly in visual terms. With these methods, the classifications directly reflect the visual content. Additionally, they are mathematically transparent, illustrating the relationships between neighboring pixels in the classification process.
We will show that the Otsu, Sauvola, and kernel-based methods can all be assimilated into a shallow CNN with appropriate activation functions. Convolution provides a weighted summation of the raw data, and the activation functions can supply the kernel mappings. To assimilate the fundamental operations, i.e., summation/subtraction, multiplication/division, power, and exponent, we propose the element-wise logarithmic and exponential activation functions

$$f_{\log}(\mathbf{x})=\log(\mathbf{x}), \qquad f_{\exp}(\mathbf{x})=\exp(\mathbf{x}),$$

which, wrapped around a convolution, yield

$$f_{\exp}\bigl(\mathbf{k}^{\top} f_{\log}(\mathbf{x})\bigr)=\exp\Bigl(\sum_{i} k_{i}\log x_{i}\Bigr)=\prod_{i} x_{i}^{\,k_{i}},$$

where x is the input vector and k is the convolution kernel vector; thus, the length of both vectors equals the kernel size. To prove that all basic mathematical operations can be implemented with these activation functions, a simple two-dimensional input is assumed; higher dimensions are only an extension of this simplification. The resulting constructions, presented in Table 2, are implemented in a shallow convolutional neural network.
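For the two-dimensional input (x1, x2) assumed above, constructions of this kind follow from standard log/exp identities; the equalities below are illustrative, written with explicit convolution weights and the log/exp activations introduced above rather than the exact layout of Table 2:

$$x_{1}+x_{2}=(1)\,x_{1}+(1)\,x_{2}, \qquad x_{1}-x_{2}=(1)\,x_{1}+(-1)\,x_{2},$$
$$x_{1}x_{2}=\exp(\log x_{1}+\log x_{2}), \qquad \frac{x_{1}}{x_{2}}=\exp(\log x_{1}-\log x_{2}),$$
$$x_{1}^{a}=\exp(a\log x_{1}), \qquad e^{x_{1}}=\exp\bigl((1)\,x_{1}+(0)\,x_{2}\bigr).$$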
As can be seen in Table 2, all required operations are implementable with shallow convolutional neural networks equipped with the given activation functions, provided there are enough residual connections. Residual connections, a well-known contribution to CNNs, were first proposed in [21] and have since been used in almost all successful CNNs. It is easy to envision that, with the right combinations and configurations of these operations, all the kernels listed in Table 1 can be incorporated, along with many more complex combinations that have yet to be categorized into existing kernel types. An important consideration for these equations (especially the logarithms) is that negative numbers do not have real-valued logarithms. Hence, we designed complex-valued convolutional neural networks (CCNNs) instead of real-valued CNNs.
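As a minimal numerical check of this construction (assuming PyTorch; not taken from the ExShall-CNN repository), the snippet below evaluates exp(sum_i k_i log x_i) in the complex domain, reproducing products, divisions, and powers of pixel values even when the convolution weights are negative:

```python
import torch

def log_conv_exp(x: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
    """Evaluate exp(sum_i k_i * log(x_i)) = prod_i x_i ** k_i in the complex domain."""
    xc = x.to(torch.complex64)              # complex log is defined for negative inputs
    kc = k.to(torch.complex64)
    return torch.exp((kc * torch.log(xc)).sum())

x = torch.tensor([2.0, 4.0])                # two neighboring pixel values
k = torch.tensor([1.0, -0.5])               # negative weight acts as division / root
out = log_conv_exp(x, k)                    # 2 * 4**(-0.5) = 1.0
print(out.real.item(), out.imag.item())     # ~1.0, ~0.0
```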
Moreover, it is known that the receptive fields of CNNs are enlarged through depth [22,23]. Since we use a shallow CNN for its higher explainability, we cannot rely on depth to obtain large receptive fields and must compensate with the size of the kernels. In deep CNNs, kernel sizes of 3, 5, and sometimes 7 are the most common. If we assume that the kernel is center-aligned, a large kernel of size 7 includes three neighboring pixels in every direction. After many layers in a deep CNN, this receptive field grows significantly, in proportion to the kernel sizes, strides, and dilations of the hidden layers. In a shallow CNN, the receptive field does not grow with depth, so a wide range of kernel sizes and dilations takes on the responsibility of achieving receptive field sizes similar to those of deep CNNs. Larger kernel sizes cause the non-linear equations to converge more slowly through backpropagation; hence, such models usually need more training time, although their number of parameters is usually far smaller than in deep CNNs.
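To make this trade-off concrete, the following sketch computes the receptive field of a stack of convolutional layers using the standard recurrence; the layer configurations in the usage lines are hypothetical and not those of a specific model:

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride, dilation) tuples, input to output."""
    rf, jump = 1, 1                      # receptive field and cumulative stride ("jump")
    for k, s, d in layers:
        rf += (k - 1) * d * jump
        jump *= s
    return rf

# Hypothetical deep stack: five 3x3 convolutions with stride 1 -> RF = 11
print(receptive_field([(3, 1, 1)] * 5))
# Shallow alternative: a single kernel of size 25 with dilation 1 -> RF = 25
print(receptive_field([(25, 1, 1)]))
```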
3. Material and Methods
In this study, we used the Retina Blood Vessel segmentation dataset, a widely used benchmark for medical image segmentation [24]. This dataset contains 100 color images with their binary masks. The dataset is divided into two subsets: 80 images and their corresponding masks are designated for training, while the remaining 20 images and masks are reserved for testing. The dataset provides a well-curated collection of retinal fundus images, each with precise annotations for blood vessel segmentation. Accurate segmentation of blood vessels is essential in ophthalmology, as it assists in the early diagnosis and treatment of retinal diseases, including diabetic retinopathy and macular degeneration [25].
The structure of our proposed shallow CNN model is shown in Figure 1. As shown in this figure, there are three different modules, namely Conv, Log-Conv-Exp (LCE), and Conv-Log-Conv-Exp (CLCE), each instantiated with different kernel sizes. The Conv module computes weighted summations and subtractions of juxtaposed pixels and assimilates a linear kernel (Table 2, row 1). The LCE module computes multiplications and divisions of neighboring pixels and assimilates cosine, polynomial, and, to some extent, sigmoid kernels (Table 2, rows 2 and 3). Finally, the CLCE module approximates RBF, Laplacian, and chi-squared kernels (Table 2, row 4). Since the weights and biases of the convolutional layers can be positive or negative, the inputs to the logarithms can be negative and the resulting logarithm values are complex numbers. Hence, the model is implemented entirely in the complex domain, except for the last layer, which aggregates real numbers. Unlike the other layers, the Conv and Aggregate layers do not have non-linear activation functions. Here we use kernel sizes 1, 3, 5, 9, 13, 17, 21, and 25; as a result, we have 8 × 3 = 24 modules.
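A rough sketch of how such a bank of modules over several kernel sizes could be wired up is given below, assuming PyTorch; only the Conv and LCE branches are shown (the CLCE branch would add a convolution before the logarithm), the class and parameter names are hypothetical, and this is not the published ExShall-CNN code:

```python
import torch
import torch.nn as nn

class LCE(nn.Module):
    """Log-Conv-Exp module: exp(Conv(log(x))), with real weights on complex features."""
    def __init__(self, channels: int, kernel_size: int):
        super().__init__()
        # Bias omitted so the complex arithmetic below stays exact.
        self.conv = nn.Conv2d(channels, channels, kernel_size,
                              padding=kernel_size // 2, bias=False)

    def _cconv(self, z: torch.Tensor) -> torch.Tensor:
        # A real-weight convolution applied to a complex tensor, part by part.
        return torch.complex(self.conv(z.real), self.conv(z.imag))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = torch.log(x.to(torch.complex64))   # complex log handles negative (nonzero) inputs
        return torch.exp(self._cconv(z))

class ShallowBankSketch(nn.Module):
    """A bank of Conv and LCE modules over several kernel sizes, aggregated to one map."""
    def __init__(self, in_ch: int = 3, kernel_sizes=(1, 3, 5, 9, 13, 17, 21, 25)):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv2d(in_ch, in_ch, k, padding=k // 2) for k in kernel_sizes)
        self.lces = nn.ModuleList(LCE(in_ch, k) for k in kernel_sizes)
        n_feats = 2 * len(kernel_sizes) * in_ch
        self.aggregate = nn.Conv2d(n_feats, 1, kernel_size=1)  # real-valued output head

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = [m(x) for m in self.convs] + [m(x).real for m in self.lces]
        return self.aggregate(torch.cat(feats, dim=1))
```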
We compare our explainable shallow CNN to two well-known models for image segmentation, namely FCNN [11] and U-Net [12]. The former has an architecture similar to our proposed model but is much deeper, and the latter is the current state of the art in medical image segmentation. Both models have two important differences from our proposed shallow CNN: they are deep networks with several hidden layers, and they use the typical ReLU activation function. All three models are trained and validated on the same training set and evaluated on the same testing set.