MixCFormer: A CNN-Transformer Hybrid with Mixup Augmentation for Enhanced Finger Vein Attack Detection


Abstract

Finger vein recognition has gained significant attention for its importance in enhancing security, safeguarding privacy, and ensuring reliable liveness detection. As a foundation of vein recognition systems, vein detection faces challenges including low feature extraction efficiency, limited robustness, and a heavy reliance on real-world data. Additionally, environmental variability and advancements in spoofing technologies further exacerbate data privacy and security concerns. To address these challenges, this paper proposes MixCFormer, a hybrid CNN-Transformer architecture that incorporates Mixup data augmentation to improve the accuracy of finger vein liveness detection and reduce dependency on large-scale real datasets. First, the MixCFormer model applies baseline drift elimination, morphological filtering, and Butterworth filtering techniques to minimize the impact of background noise and illumination variations, thereby enhancing the clarity and recognizability of vein features. Next, finger vein video data is transformed into feature sequences, optimizing feature extraction and matching efficiency, effectively capturing dynamic time-series information and improving discrimination between live and forged samples. Furthermore, Mixup data augmentation is used to expand sample diversity and decrease dependency on extensive real datasets, thereby enhancing the model's ability to recognize forged samples across diverse attack scenarios. Finally, the hybrid CNN-Transformer architecture leverages both local and global feature extraction capabilities to capture vein feature correlations and dependencies. Residual connections improve feature propagation, enhancing the stability of feature representations in liveness detection. Rigorous experimental evaluations demonstrate that MixCFormer achieves a detection accuracy of 99.51% on finger vein datasets, significantly outperforming existing methods.


1. Introduction

Biometrics [1] is a method of identity recognition or verification based on an individual’s inherent biological characteristics. Compared to traditional authentication methods, such as passwords, identification cards, and physical keys, biometric features offer distinct advantages: they are not only challenging to replicate but also immune to being forgotten, thus significantly enhancing both security and convenience. Currently, biometric authentication techniques are broadly classified into two categories: physiological features [2] and behavioral patterns [3]. Physiological features include static biometric identifiers, such as fingerprint recognition [4], face recognition [5], vein recognition [6], and iris recognition [7], whereas behavioral patterns encompass dynamic characteristics related to user behavior, such as gait [8], eye movement [9], signature [10], and voice [11].
Vein recognition technology is increasingly gaining prominence in the field of biometric identification due to its unique physiological characteristics and resistance to forgery. Venous blood vessels, which are located beneath the skin, exhibit high connectivity and are difficult to observe under visible light [12]. To capture vein patterns, infrared light with a wavelength of approximately 850 nm is typically employed, as it can penetrate the skin and reveal the underlying vascular structures. This characteristic confers a significant advantage in terms of the stability and biological viability of vein-based biometric traits. However, with the widespread adoption of biometric systems, there has been a concurrent rise in the sophistication of attacks targeting these systems [13]. A growing concern is the threat of spoofing, in which attackers attempt to bypass authentication mechanisms using forged, printed, or electronically reproduced vein patterns. Such attacks pose a substantial risk to the security and integrity of biometric systems. Consequently, ensuring the robustness of these systems—particularly in terms of preventing spoofing and implementing reliable liveness detection—has become an urgent and critical challenge.

1.1. Related Work

The primary objective of finger vein liveness detection is to confirm that the user is a living physiological entity and to ensure that the biometric traits being presented originate from a living individual, rather than from a static image or synthetic material. Currently, methods for biometric in vivo detection can be broadly classified into two categories: traditional methods and deep learning-based methods.
Traditional methods mainly rely on feature analysis techniques and fall into three main categories: manual feature extraction, machine learning algorithms, and biophysical feature detection. (1) Manual feature extraction techniques, such as edge detection [14] and texture analysis [15], have been widely employed in vein recognition. For example, in [16], Gabor filters and Local Binary Patterns (LBP) are utilized to differentiate between live and forged vein samples by extracting features from vein images. While these methods have demonstrated promising results in terms of recognition accuracy, they are often sensitive to variations in lighting conditions and image quality, and their robustness remains a challenge. (2) Machine learning algorithms, including Support Vector Machines (SVM) and Random Forests, are commonly applied for vein feature classification and recognition. In [17], SVM was used to enhance the accuracy of vein liveness detection, while [3] investigated the impact of feature fusion on detection performance. However, these traditional machine learning approaches still face limitations in terms of adaptability and generalization when handling complex datasets, particularly struggling with diverse attack scenarios. (3) Biophysical detection methods, such as blood flow monitoring and temperature change analysis, have been explored to improve vein liveness detection. In [18], a real-time blood flow monitoring technique was proposed to enhance the reliability of liveness detection, while [19] introduced a method based on temperature variations for vein liveness recognition. Although these approaches improve security to some extent, the challenge of effectively integrating multiple detection mechanisms to address emerging attack strategies remains a pressing issue in practical applications.
Deep learning-based methods have demonstrated exceptional performance in finger vein liveness detection, leveraging a range of advanced algorithms, including Convolutional Neural Networks (CNNs) [20], Long Short-Term Memory Networks (LSTMs) [21], Transformers [22], and multimodal learning approaches [23], among others. Researchers have capitalized on the robust image processing capabilities of CNNs to automatically extract features from vein images through multiple layers of convolution and pooling operations, significantly enhancing recognition accuracy [24]. Meanwhile, LSTMs are well-suited for capturing dynamic features within sequential vein images, owing to their ability to process time-series data, thus improving the system's adaptability to complex, real-world environments [21]. Building on this, studies such as [25,26] have combined CNNs and LSTMs to further boost recognition performance by utilizing CNNs to extract spatial features and LSTMs to handle temporal dependencies. Additionally, the Transformer architecture has been employed in [27] to capture global feature information using a self-attention mechanism, making it particularly effective for processing high-dimensional data and yielding superior recognition results. For instance, Qin et al. [28] introduced a label-enhanced multiscale Vision Transformer for palm vein recognition, while an Attention-based Label Enhancement (ALE) scheme, combined with an Interactive Vein Transformer (IVT) [22], was proposed to learn label distributions for vein classification tasks. Wang et al. [29] also proposed a hybrid deep learning model that integrates the strengths of both CNNs and Transformers, further improving vein recognition performance. Multimodal learning, on the other hand, strengthens the robustness of the system by fusing multimodal data, such as vein images, fingerprints, and facial features. For example, [30,31] proposed a multimodal framework based on CNNs, which significantly enhances recognition accuracy by combining vein and fingerprint features. Similarly, [32] improved the system's capability to handle complex environments by integrating vein images with electrocardiogram (ECG) data using a multi-channel CNN. Moreover, the deep fusion framework developed by Alay et al. [33] demonstrated a substantial improvement in both security and accuracy by jointly analyzing vein images, iris, and facial features. Additionally, Tao et al. [34] applied transfer learning in multimodal settings, which reduced the dependence on large training datasets, thereby enhancing the overall recognition performance.

1.2. Motivation

Based on the analysis above, we observe that traditional manual feature extraction techniques, such as edge detection and texture analysis, can enhance recognition accuracy to some extent. However, their robustness is compromised by variations in lighting conditions and image quality, which results in insufficient reliability in practical applications. Moreover, machine learning algorithms exhibit limited adaptability and generalization ability when dealing with complex data, particularly when confronted with diverse attack scenarios. Biophysical feature detection methods, while providing an additional layer of judgment based on physiological characteristics, still face limitations in terms of real-time processing and adaptability. In contrast, deep learning methods, including CNNs, LSTMs, and Transformers, demonstrate superior capabilities in feature extraction and dynamic behavior analysis. However, these approaches are not without their shortcomings. Although CNNs excel at feature extraction, they are highly sensitive to noise and lighting variations, which can lead to lower recognition rates [35]. Furthermore, CNNs are prone to overfitting when trained on smaller datasets, especially when data diversity is limited. In such cases, the model may learn unrepresentative features, negatively affecting its accuracy in real-world applications [36]. While Generative Adversarial Networks (GANs) can augment data, they tend to be unstable when generating small sample sizes or specific features (e.g., finger vein patterns), leading to the production of samples with significant deviations that reduce the model's generalization ability [37]. LSTMs, which are designed to capture dynamic features in time-series data, face challenges related to high computational complexity and the issue of vanishing gradients when dealing with long sequences [38]. Although Transformers can capture global features through a self-attention mechanism, they require large amounts of training data to perform optimally and exhibit poor performance in small-sample environments [39]. Additionally, due to the relatively subtle nature of finger vein features and the limited sample size, the Transformer model may struggle to fully extract local features, which negatively impacts recognition accuracy [12,40]. In multimodal learning, the fusion of multiple biological features may result in information conflict due to the inherent heterogeneity of the features, which can affect the overall recognition performance [41]. Furthermore, multimodal models typically require large amounts of labeled data for adequate training, which presents challenges in real-world applications, particularly in terms of privacy and data collection [23]. Moreover, these models often exhibit high computational complexity and poor real-time performance, making them less suitable for scenarios that require rapid responses.
In summary, existing deep learning methods for finger vein liveness detection face several key challenges: (1) Insufficient Robustness: Although deep learning techniques excel at feature extraction, their ability to withstand biometric attacks, such as forgery or spoofing, has not been fully validated. This lack of robustness makes the system vulnerable to various types of attacks, compromising its security and reliability. (2) High Data Dependency: Deep learning models typically require large quantities of high-quality labeled data for effective training. However, in the context of finger vein liveness detection, obtaining such data is often hindered by privacy concerns and difficulties in data collection. This heavy reliance on extensive datasets limits the applicability of deep learning models in small-sample scenarios, making it challenging to achieve satisfactory performance in real-world applications where data may be limited. (3) High Computational Complexity and Poor Real-Time Performance: Deep learning models are computationally intensive, and overfitting can occur, particularly when training data is sparse. Overfitting reduces the model's ability to generalize to unseen data, thereby diminishing detection accuracy. This is especially problematic when the model encounters novel attack samples, as it may struggle to accurately identify new or previously unseen threats. Furthermore, the high computational demands of deep learning models can result in poor real-time performance, which is critical in time-sensitive applications, further limiting their effectiveness in practical deployment.

1.3. Our Work

To address the shortcomings of existing finger vein liveness detection methods, we propose a hybrid CNN-Transformer architecture based on Mixup data augmentation, named MixCFormer, which improves the robustness and generalization ability of the model through the introduction of residual connections. The MixCFormer architecture consists of three main modules. First, a preprocessing method involving baseline drift elimination and morphological filtering is used to extract features from finger vein videos. This approach reduces the interference of background noise and illumination changes, enhances the prominence of vein features, and improves the model's resistance to forgery and overall robustness. Second, sample augmentation is performed by the Mixup data augmentation module, which alleviates the issue of insufficient training data and significantly improves the model's adaptability to diverse attack samples. Finally, a complementary feature extraction architecture that combines Convolutional Neural Networks (CNNs) and Transformers is adopted. Deeper fusion of local and global features is facilitated through residual connections, where the CNN branch extracts local, detailed features and the Transformer branch captures global dependencies, resulting in a more comprehensive and robust feature representation. The contributions of our work are summarized as follows:
  • MixCFormer Architecture: We propose MixCFormer, a CNN-Transformer hybrid architecture with residual connections, which combines the local feature extraction capabilities of CNNs with the global context modeling of Transformers. The CNN branch captures local vein texture features, while the Transformer branch integrates global information to capture long-range dependencies. Residual connections enhance the efficiency of feature transfer, improving the stability of feature representation. This architectural synergy enables MixCFormer to achieve higher accuracy and robustness in the complex task of finger vein liveness detection.
  • Mixup Data Enhancement: We introduce the Mixup data augmentation technique to improve the generalization ability of the model, reduce reliance on large-scale real datasets, and enhance the recognition accuracy for forged samples. Additionally, we construct a novel dataset that includes real live finger vein data as well as three types of attack samples (two live attacks and one non-live attack). This dataset enriches the diversity of training samples and provides a comprehensive validation foundation, enhancing the model's ability to recognize and resist various attack scenarios.
  • Feature Sequence Processing: We propose an innovative approach that converts finger vein video data into feature sequences for more efficient processing. This method optimizes feature extraction and matching by capturing dynamically changing temporal information, which enhances the discriminative power between live and forged vein samples. As a result, the model's real-time performance and recognition speed are improved.
  • Noise and Light Variation Suppression Techniques: For the first time, we apply a combination of baseline drift cancellation, morphological filtering, and Butterworth filtering to mitigate the impact of noise and light variation on finger vein liveness detection. Baseline drift cancellation eliminates low-frequency noise, morphological filtering optimizes image structure and accentuates vein features, and Butterworth filtering reduces high-frequency noise. The integration of these three techniques significantly enhances the model's robustness, maintaining excellent detection performance under complex lighting conditions and noisy environments, thereby improving the overall reliability and practicality of the system.
  • Experimental Validation and Performance Enhancement: Rigorous experimental evaluations demonstrate that MixCFormer outperforms current state-of-the-art methods in terms of detection accuracy on finger vein datasets. This performance validation underscores the effectiveness and innovation of the proposed architecture, highlighting MixCFormer’s potential for enhanced performance and broader application in finger vein liveness detection tasks.

2. The Proposed Approach

2.1. MixCFormer Model

The architecture of the MixCFormer algorithm is illustrated in Figure 1. The process begins with the preprocessing of a vein image that contains both live vein data and three types of attack data (two live attacks and one non-live attack). To mitigate the effects of background noise and illumination variations, three techniques—baseline drift elimination, morphological filtering, and Butterworth filtering—are applied. These methods effectively reduce unwanted noise and simultaneously optimize the feature structure of the vein image, thereby enhancing the clarity and distinctiveness of the vein patterns. Following this, two vein images are randomly selected from each training batch and combined using the Mixup technique. The images are linearly fused based on weighted coefficients, generating a new sample with blended features. The weights are drawn randomly from a Beta distribution, which ensures a diverse set of samples, thus improving the generalization capacity and robustness of the model. Finally, a hybrid CNN-Transformer architecture is employed to fully exploit the strengths of both Convolutional Neural Networks (CNNs) and Transformer models for efficient feature extraction and global modeling of vein characteristics. The CNN is responsible for extracting local features from the finger vein images, preserving fine details and texture information. Meanwhile, the Transformer's self-attention mechanism models global vein features by capturing dependencies across different regions, thereby enhancing the ability to distinguish between live and fake vein patterns. Moreover, residual connections are utilized to facilitate the exchange of features between the CNN and Transformer modules, promoting the stability of feature representations and improving the transfer of depth information across layers. This dual-module approach ensures both local detail preservation and comprehensive global context understanding, leading to a more robust and discriminative model for vein recognition.

2.2. Data Acquisition and Processing

2.2.1. Acquisition of Attack Data

The schematic diagram of the finger vein image acquisition process is shown in Figure 3. A finger vein acquisition device and a Finger Clip Pulse Oximeter, developed by our team (Figure 2), were used to construct a large-scale, refined finger vein video dataset. The dataset comprises both real vein samples and three types of attack sample video data, totaling 10,476 video samples. The specific data collection methods are as follows:
(1) Acquisition of Real Human Vein Data: The heart rate of each subject was first measured using a finger-clip pulse oximeter to confirm their liveness. Subsequently, the finger vein acquisition device was used to capture video data of the veins from the index, middle, and ring fingers of both hands. This process was repeated six times for each subject, with vein data collected from all six fingers. Before each acquisition, the heart rate was re-measured to confirm that all six sets of video samples were from live subjects, thus providing high-quality finger vein data.
(2) Acquisition of Heart Rate-based Attack Data: To increase the diversity of attack samples, two types of heart rate-based attack data were designed:
  • Attack Type I: The subject wore thin gloves with disturbance patterns (Figure 4a), simulating surface disturbances on the finger veins. The data collection process was identical to that of real human vein data, with the same procedure applied to all six fingers.
  • Attack Type II: The subject wore thick gloves (Figure 4b) with disturbance patterns drawn on the glove surfaces, adding further intrusion to the detection algorithm. The acquisition method was the same as for real vein data.
(3) Acquisition of Heart Rate-free Attack Data: A prosthetic finger made from colored clay (Figure 4c) was used to simulate attacks without heart rate. Each colored clay prosthesis was modeled to resemble the index, middle, and ring fingers of both hands. The finger vein data of these prostheses were recorded using the same video acquisition method to create samples of heart rate-free attack data.
Finally, for the 6-second finger vein video data captured, frame extraction was performed at a sampling frequency of 30 frames per second to generate the corresponding finger vein images. The finger vein images corresponding to the three attack scenarios are shown in Figure 5. The thoughtful design of this attack dataset is evident in its diversity and relevance: by incorporating a range of attack samples with and without heart rate, the dataset effectively simulates various interference scenarios that may occur in real-world applications, thereby enhancing the robustness and generalization ability of the model in distinguishing between camouflage and prosthesis attacks. The heart rate-based attack data improves the model’s capability to recognize interference in genuine live conditions, while the heart rate-free attack data enhances the model’s effectiveness in identifying non-live attacks. Additionally, the diverse nature of this dataset provides a solid foundation for adversarial training, further bolstering the security and adaptability of the system. This comprehensive dataset thus offers valuable support for the practical deployment of finger vein recognition technology.

2.2.2. Generating the Sequence Signal

The finger vein attack dataset is generated through a series of processing steps applied to video images, including frame extraction, image cropping, grayscale conversion, feature extraction, baseline drift correction [42], morphological filtering [43], and bandpass filtering [44]. These stages convert video frames into time-series data that captures the unique characteristics of finger vein patterns. Initially, video frames are extracted and cropped to focus on the region of interest (ROI) containing the veins. The images are then converted to grayscale to enhance contrast and highlight vein structures. Feature extraction isolates key vein patterns, which are corrected for baseline drift to address sensor or environmental variations. Morphological filtering refines the image, removing noise and enhancing vein clarity. Finally, bandpass filtering retains relevant frequency components for vein analysis while eliminating noise. This multi-step pipeline results in time-series data that accurately represents vein patterns for further analysis and recognition. Next, the process is implemented in the following three steps:
Step 1: Preprocessing: For each finger vein video, 180 frames were extracted sequentially. Each frame underwent region cropping and grayscale conversion to focus on key vein regions, reducing background interference and enhancing the legibility of vein features. The grayscale conversion utilized a weighted average method to transform RGB three-channel pixel values into grayscale values, as expressed in Equation 1. During the cropping operation, the target region was confined to a predefined range, and the image size was standardized to 390 × 110 pixels. This standardization improves processing efficiency and ensures effective alignment of features across images.
$G_k = 0.2989 \times R + 0.5870 \times G + 0.1140 \times B$    (1)
Where G_k represents the grayscale image of the k-th frame, and R, G, and B denote the pixel values of the vein image in the red, green, and blue channels, respectively. This transformation reduces the image to a two-dimensional grayscale matrix, facilitating subsequent processing.
To further enhance the visibility of vein features, each frame of the grayscale image is processed through frame-by-frame accumulation and averaging, yielding a mean grayscale matrix I_mean (shown in Equation 2), where N denotes the number of frames (N = 180). The resulting mean grayscale image minimizes inter-frame noise, amplifies vein detail and edge information, and achieves a more stable distribution of vein features. This refined image provides a clearer and more reliable foundation for subsequent feature detection.
$I_{mean} = \frac{1}{N} \sum_{k=1}^{N} G_k$    (2)
For each grayscale image G_k, the total grayscale value S_k of the frame is computed as the sum of all pixel grayscale values. These values are then arranged chronologically to form a time series {S_k}. This signal sequence captures temporal variations in the finger vein features, offering dynamic information critical for vein detection. The formula for calculating the summed grayscale value S_k is expressed as follows:
$S_k = \sum_{i=1}^{m} \sum_{j=1}^{n} G_k(i, j)$    (3)
Here, m and n denote the number of rows and columns in the grayscale image, respectively. The time series {S_k} represents the temporal fluctuations in the intensity of the vein features across the video. By combining the grayscale values from all 180 frames, a comprehensive data sequence is generated. This sequence reflects the dynamic behavior of vein features over time. The original waveform of the time series is depicted in Figure 6.
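As an illustration, the following Python/NumPy sketch reproduces this step; the frame-loading and cropping stages are assumed to have already produced an array of 180 cropped 390 × 110 RGB frames, and the function names are illustrative rather than taken from the paper:

import numpy as np

def to_grayscale(frame_rgb):
    # Equation 1: weighted combination of the R, G, and B channels
    r, g, b = frame_rgb[..., 0], frame_rgb[..., 1], frame_rgb[..., 2]
    return 0.2989 * r + 0.5870 * g + 0.1140 * b

def frames_to_sequence(frames_rgb):
    # frames_rgb: array of shape (N, H, W, 3) with N = 180 cropped frames
    gray = np.stack([to_grayscale(f) for f in frames_rgb])   # G_k, shape (N, H, W)
    i_mean = gray.mean(axis=0)                               # Equation 2: mean grayscale image
    s = gray.sum(axis=(1, 2))                                # Equation 3: summed grayscale values S_k
    return i_mean, s                                         # s is the 180-point time series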
Step 2: Baseline Drift Correction and Morphological Filtering. The extracted time series signals often exhibit baseline drift caused by environmental changes or finger movement, leading to signal instability. To address this, we apply morphological filtering to remove low-frequency baseline drift. First, a linear structuring element (SE) is defined. Morphological opening and closing operations are then applied to the signal to remove high-frequency noise and low-frequency drift, respectively. The opening operation (f_open) smooths spikes in the signal, while the closing operation (f_close) fills in deep valleys, effectively removing baseline drift from the original signal. The equation for baseline drift correction is as follows:
$S_k' = S_k - \frac{f_{open}(S_k) + f_{close}(S_k)}{2}$    (4)
Where SE is defined as a linear structuring element with a length of 15 and an angle of 0, and f_open and f_close represent the morphological opening and closing operations, respectively. The corrected time series is denoted as S'_k. As illustrated in Figure 7, the application of these operations effectively removes the low-frequency baseline drift from the signal, resulting in a more stable signal while preserving the vein eigenfrequency.
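The correction can be sketched with SciPy's grayscale morphology as follows; the structuring-element length of 15 follows the text, while applying grey_opening and grey_closing directly to the 1-D series is an assumption about the exact operators used:

from scipy.ndimage import grey_opening, grey_closing

def remove_baseline_drift(s, se_length=15):
    # Equation 4: subtract the averaged opening/closing baseline from S_k
    opened = grey_opening(s, size=se_length)   # f_open: suppresses positive spikes
    closed = grey_closing(s, size=se_length)   # f_close: fills deep valleys
    baseline = (opened + closed) / 2.0         # estimated low-frequency baseline drift
    return s - baseline                        # corrected series S'_k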
Step 3: Butterworth Filter Design. Building upon the previous step, a Butterworth bandpass filter is designed to effectively remove high-frequency noise and residual low-frequency interference from the corrected signal S'_k. The Butterworth filter is particularly suited for this task due to its smooth frequency response in the passband, which minimizes amplitude distortion while maintaining the desired frequency characteristics. Its rapid attenuation in the stopband ensures efficient suppression of high-frequency noise, making it an ideal choice for processing biological signals. The design process consists of two key stages: (1) parameter setting and filter design, and (2) filtering of the vein signal.
(1) Parameter Setting and Filter Design
To design the filter, it is essential to select the appropriate passband and stopband frequencies to effectively extract the vein signal characteristics. Based on the spectral properties of the finger vein signal, the passband frequency (f_p) is set between 0.7 Hz and 3.5 Hz to retain the main low-frequency components of the vein signal. The stopband frequency (f_s) is set between 0.5 Hz and 5 Hz to ensure that high-frequency noise is adequately suppressed. This frequency selection is guided by an analysis of the finger vein signal spectrum, ensuring that the filter preserves the core information of the vein signal while removing unwanted noise.
Additionally, the normalized passband frequency (ω_p) and stopband frequency (ω_s) are calculated based on the set passband frequency f_p, stopband frequency f_s, and the sampling frequency F_s. These calculations are used to further define the design parameters of the filter, as outlined in Equations 5 and 6.
$\omega_p = 2\pi f_p / F_s$    (5)
$\omega_s = 2\pi f_s / F_s$    (6)
Here, the sampling frequency (F_s) is 30 Hz. To ensure signal fidelity within the passband and effectively attenuate noise in the stopband, the minimum filter order n is calculated (as shown in Equation 7) based on the specified passband and stopband frequencies, along with the corresponding attenuation requirements. Specifically, the maximum attenuation in the passband (r_p) is set to 3 dB, and the minimum attenuation in the stopband (r_s) is set to 18 dB.
$n = \frac{\log_{10}\left[\left(10^{r_s/10} - 1\right) / \left(10^{r_p/10} - 1\right)\right]}{2 \log_{10}\left(\omega_s / \omega_p\right)}$    (7)
(2) Vein Signal Filtering
During the filtering stage, the vein signal, after baseline drift elimination, is input into the designed Butterworth filter. The filter removes high-frequency noise, leaving the low-frequency components intact and resulting in a smooth signal curve that effectively eliminates noise interference. The specific formula is as follows:
$S_k'' = \mathrm{filter}(b, a, S_k')$    (8)
Here, S'_k represents the vein signal after baseline drift correction, and S''_k is the smoothed signal after filtering. The filtering effect is illustrated in Figure 8. Compared to the pre-filtered signal S'_k, the filtered vein signal S''_k exhibits a smoother curve, with more stable fluctuations and no noticeable high-frequency noise components. This smooth signal provides a high-quality dataset for subsequent feature extraction and matching, improving the accuracy and stability of the biometric system. Consequently, the processed feature signal S''_k is stored in an Excel table, ensuring a reliable data foundation for further analysis and model training.
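A possible SciPy realization of this filter is sketched below. Note that SciPy normalizes frequencies to the Nyquist frequency (F_s/2) rather than using the 2πf/F_s convention of Equations 5 and 6; buttord computes the minimum order of Equation 7 from the same passband/stopband specifications, and the single-pass lfilter call mirrors the filter(b, a, ·) operation of Equation 8:

import numpy as np
from scipy.signal import buttord, butter, lfilter

FS = 30.0                                  # sampling frequency in Hz (30 frames per second)
WP = np.array([0.7, 3.5]) / (FS / 2)       # passband edges f_p, normalized to Nyquist
WS = np.array([0.5, 5.0]) / (FS / 2)       # stopband edges f_s, normalized to Nyquist

def butterworth_bandpass(s_corrected, rp=3, rs=18):
    # rp / rs: maximum passband and minimum stopband attenuation in dB
    n, wn = buttord(WP, WS, gpass=rp, gstop=rs)   # minimum filter order (Equation 7)
    b, a = butter(n, wn, btype='bandpass')        # bandpass coefficients b, a
    return lfilter(b, a, s_corrected)             # Equation 8: S''_k = filter(b, a, S'_k)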

2.3. Mixup Data Augmentation

We adopt the Mixup [45] data augmentation method to generate new samples by linearly combining the original data samples, aiming to improve the generalization ability of the model, reduce overfitting, and enhance its ability to recognize diverse attack samples. In the finger vein liveness detection task, the diversity of the dataset and the number of samples are limited, especially in complex environments where the collection of real samples is both difficult and costly. Therefore, the Mixup method effectively extends the training set by generating new samples to alleviate the problem of insufficient data.
The core idea of the Mixup [46] method is to generate new sample pairs by linearly interpolating the training samples. In the finger vein liveness detection task, the input signals are finger vein video sequences, and these sequences are expressed as time-series data formed by the pixel values of each image frame over time. Specifically, given two vein signal sequences X_i and X_j with corresponding labels y_i and y_j, Mixup generates new signal sequences and label pairs by the following formulas:
$X' = \lambda X_i + (1 - \lambda) X_j$    (9)
$y' = \lambda y_i + (1 - \lambda) y_j$    (10)
Here, λ is a weighting factor sampled from a Beta distribution, taking values in the interval [0, 1]. This factor controls the mixing ratio between the samples. By applying linear interpolation, we expand the original set of 2,000 sequence samples to a total of 4,000 samples. These augmented samples enhance the diversity of the training data while preserving temporal consistency, thereby helping the model capture the underlying patterns and variations in the venous signals more effectively.
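A minimal sketch of this augmentation for a pair of vein sequences is given below; the Beta-distribution parameter alpha = 0.2 is an illustrative choice, not a value reported in this paper:

import numpy as np

def mixup(x_i, x_j, y_i, y_j, alpha=0.2):
    # Equations 9 and 10: linear interpolation of two sequences and their labels
    lam = np.random.beta(alpha, alpha)       # mixing weight lambda in [0, 1]
    x_new = lam * x_i + (1.0 - lam) * x_j    # blended 180-point vein sequence
    y_new = lam * y_i + (1.0 - lam) * y_j    # soft label between 0 (forged) and 1 (live)
    return x_new, y_new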

2.4. CNN-Transformer Hybrid Model

The CNN-Transformer hybrid model used for finger vein liveness detection is designed to effectively extract and classify vein features. The model begins by processing the input data through a series of convolutional layers to capture local features. Specifically, it consists of two convolutional layers that expand the input feature dimensions from one channel to 16 and 64 channels, respectively. This is complemented by batch normalization and ReLU activation functions, which ensure a nonlinear representation of the features. Simultaneously, a max pooling layer is employed to reduce the dimensionality of the feature map, enhancing the efficiency of feature representation. Next, the model incorporates a Transformer layer to capture global feature dependencies using a multi-head self-attention mechanism. The data processed by the convolutional layers is reshaped to suit the Transformer input format and passed through three Transformer encoder layers for feature extraction. A residual connection is applied to the Transformer output to facilitate efficient information flow and improve model stability. Finally, the extracted features are processed by a fully connected layer, and the output layer generates the final classification results, distinguishing between live and forged samples. The overall architectural design (shown in Figure 9) effectively combines the local feature extraction capabilities of the convolutional network with the global modeling strengths of the Transformer, providing robust support for finger vein liveness detection.

2.4.1. CNN Feature Extraction

In the finger vein liveness detection task, the inputs consist of finger vein sequences, with each sequence capturing the structural features of the finger veins. The CNN module is designed to extract local information from these sequences and convert it into high-level features, thus providing a rich representation of the local characteristics for the Transformer encoder. The module first processes the vein image through a series of convolution operations, which include two convolutional layers and a max pooling layer. Let the input sequence be denoted as X ∈ R^(N×C×L), where N represents the batch size (initial value 64), C denotes the number of channels (initial value 1), and L is the sequence length (initial value 180). The first convolutional layer is responsible for extracting the initial local features of the vein signal. After this layer, the number of output channels is increased to 16. The output feature at the l-th layer, denoted as X^(l), can be expressed as Equation 11.
$X^{(l)} = \mathrm{ReLU}\left(\mathrm{Conv}\left(X^{(l-1)}, W^{(l)}\right)\right)$    (11)
Where W^(l) represents the convolution kernel of the l-th layer, with a kernel size of 3, a stride of 1, and a padding of 1. The ReLU activation function is applied to introduce nonlinearity into the model. Through successive convolution layers, the model is able to capture local information from the input image, resulting in a series of feature maps that highlight important characteristics, such as edges and textures, which are crucial for liveness detection.
Subsequently, a Max Pooling operation is applied to extract the maximum value within each local region, which serves to downscale the feature maps while enhancing the robustness of the extracted features. The pooling operation can be expressed as Equation 12.
$X_{pool}^{(l)} = \mathrm{MaxPool}\left(X^{(l)}\right)$    (12)
The pooling operation reduces the feature size while preserving key features, thereby lowering computational complexity and improving the model's generalization ability. After the two layers of convolution and pooling, the CNN module outputs a feature map of size X_cnn ∈ R^(N×F×L'), where F is the final number of channels (64) and L' is the reduced sequence length (45) after pooling.
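A PyTorch sketch of this front end is given below; the channel progression (1 → 16 → 64), kernel size 3, stride 1, and padding 1 follow the text, while splitting the downsampling into two max-pooling stages of size 2 (reducing the length from 180 to 45) is an assumption:

import torch
import torch.nn as nn

class CNNFrontEnd(nn.Module):
    # Local feature extractor: 1 -> 16 -> 64 channels, sequence length 180 -> 45
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm1d(16), nn.ReLU(),
            nn.MaxPool1d(2),                                   # 180 -> 90
            nn.Conv1d(16, 64, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm1d(64), nn.ReLU(),
            nn.MaxPool1d(2),                                   # 90 -> 45
        )

    def forward(self, x):            # x: (N, 1, 180) batch of vein sequences
        return self.features(x)      # X_cnn: (N, 64, 45)

x = torch.randn(64, 1, 180)
print(CNNFrontEnd()(x).shape)        # torch.Size([64, 64, 45])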

2.4.2. Transformer Coding Module

The Transformer module takes the local features extracted by the convolutional layers and uses them to capture global dependencies in the finger vein signals. Initially, the input sequence is processed through an embedding layer and a positional encoding layer. This step is essential for effectively capturing the temporal variation patterns of the finger vein features, allowing the model to learn the time-dependent information of the signal. This, in turn, provides a solid foundation for subsequent feature extraction and dependency modeling. Once the signals are embedded and position-encoded, they are passed into the Transformer encoder, which consists of multiple stacked encoder layers, as shown in Figure 10. Each encoder layer contains two main components: a self-attention mechanism [47] and a feed-forward neural network (FFN) [48]. The self-attention mechanism is used to identify the dependencies between elements within the input sequence. Following this, the feed-forward neural network processes the output of the self-attention mechanism, enhancing the model's ability to capture nonlinear relationships. Each encoder layer is followed by a residual connection and layer normalization. These components are designed to facilitate gradient flow, accelerate the training process, and ensure efficient information propagation through the deep network. Notably, the input and output dimensions of each encoder layer are kept consistent, preventing issues related to missing information or dimensional mismatches.
In our model, we use three stacked encoder layers, each with an input dimension of 64, which corresponds to the length of the feature vector at each time step. Each encoder layer employs 4 attention heads, and a dropout rate of 0.3 is set to effectively prevent overfitting. The main components of the Transformer encoder include the self-attention mechanism, multi-head attention mechanism, feed-forward neural network (FFN), as well as residual connections and layer normalization. The specific process can be divided into the following three steps:
Step 1: Self-Attention Mechanism and Multi-Head Attention Mechanism
The core of the Transformer module is the self-attention mechanism, which aims to capture the global dependencies between time steps in the finger vein signal sequence. For the input sequence X_cnn, let X_cnn = {x_p1, x_p2, ..., x_pN}, where x_pi is the input feature vector at the i-th time step with feature dimension 45. The first step involves computing the query (Q), key (K), and value (V) matrices through linear transformations as follows:
$Q = X_{cnn} W_Q$    (13)
$K = X_{cnn} W_K$    (14)
$V = X_{cnn} W_V$    (15)
Where W_Q, W_K, and W_V are learned parameter matrices that correspond to the linear mappings of the queries, keys, and values, respectively. The output of the self-attention mechanism is computed using Scaled Dot-Product Attention. To do so, we first calculate the dot product between the query and key matrices, then divide the result by √d, where d is the dimension of the query and key vectors. This scaling factor helps to stabilize the gradients during training. Next, the softmax function is applied to the scaled result to obtain the attention weights. Finally, the attention weights are multiplied by the value matrix V to compute the output of the self-attention mechanism, as shown in Equation 16. This process efficiently captures the correlations between different time steps in the sequence of finger vein signals, enabling the model to better understand the temporal evolution of vein image features over time.
$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^T}{\sqrt{d}}\right) V$    (16)
To capture features from different subspaces within the input sequence, the Transformer model employs the Multi-Head Attention mechanism. By computing multiple attention heads in parallel, the model is able to extract information from various subspaces, thereby enhancing its learning capacity. The formula is as follows:
$\mathrm{MSA}(X_{cnn}) = \mathrm{Concat}(\mathrm{head}_1, \mathrm{head}_2, \ldots, \mathrm{head}_h) W^O$    (17)
where W^O is the linear projection matrix and head_i denotes the output of the i-th head, specifically:
$\mathrm{head}_i = \mathrm{Attention}(Q_i, K_i, V_i)$    (18)
Step 2: Residual Connection and Layer Normalization
After the multi-head self-attention output, residual connections and layer normalization are applied. The input X_cnn is added to the output of the multi-head self-attention, effectively addressing the vanishing gradient problem. Layer normalization ensures consistency in feature distribution, which further enhances the stability of model training. This residual connection structure is used repeatedly within the Transformer encoder module to facilitate effective gradient propagation throughout the deep network. The specific operation is denoted as:
$Z_{MSA} = \mathrm{LayerNorm}\left(X_{cnn} + \mathrm{MSA}(X_{cnn})\right)$    (19)
Step 3: Feed-Forward Network (FFN)
The next component is the feed-forward neural network (FFN), which consists of two fully connected layers and an activation function (typically ReLU) to enhance the nonlinear feature representation. The calculation formula for the FFN is as follows:
$\mathrm{FFN}(Z_{MSA}) = \mathrm{ReLU}(Z_{MSA} W_1 + b_1) W_2 + b_2$    (20)
where W_1 and W_2 are the weight matrices, and b_1 and b_2 are the bias terms. The output of the FFN is then passed through a residual connection followed by layer normalization, producing the final output Z of the encoder layer, as denoted by:
$Z = \mathrm{LayerNorm}\left(Z_{MSA} + \mathrm{FFN}(Z_{MSA})\right)$    (21)
The output Z is then used as the input to the next encoder layer or the subsequent classification module. In the context of finger vein liveness detection, the encoder module, stacked with multiple layers, effectively captures the global, time-dependent features of the input signal. This enables the extraction of rich feature representations, which are essential for the subsequent classification module to distinguish between live and forged samples.
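The stacked encoder described above can be realized with PyTorch's built-in modules, as in the sketch below; d_model = 64, 4 attention heads, a dropout of 0.3, and three layers follow the text, whereas the width of the feed-forward sub-layer is an assumption:

import torch
import torch.nn as nn

encoder_layer = nn.TransformerEncoderLayer(
    d_model=64, nhead=4, dim_feedforward=128,   # FFN width assumed
    dropout=0.3, batch_first=True)              # one layer implements Equations 13-21
encoder = nn.TransformerEncoder(encoder_layer, num_layers=3)

x_cnn = torch.randn(64, 45, 64)   # (batch, time steps, feature dimension) from the CNN module
z = encoder(x_cnn)                # (64, 45, 64): globally contextualized features Z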

2.4.3. Fully Connected Network

The fully connected classification network [49] plays a crucial role in performing the final classification of the features extracted through the convolution, pooling, and Transformer encoding processes. This module comprises multiple fully connected layers, denoted as fc, fc1, and fc2, as shown in Figure 11. The Fully Connected Layer Network (FCLN), also referred to as a Multi-Layer Perceptron (MLP), is a standard neural network structure consisting of several fully connected layers. Each fully connected layer contains multiple neurons, with each neuron connected to every neuron in the preceding layer. Through adjusting the connection weights and bias terms, the neurons are capable of learning the features of the input data. This process involves performing a nonlinear transformation via an activation function, allowing the network to capture the complex relationships inherent in the data. In the MixCFormer model, the fully connected layers are responsible for mapping and transforming the extracted vein features into the output space of the target task. Specifically, the input layer of the fully connected network receives 64 output features from the Transformer. The output layer consists of two neurons for the binary classification task, corresponding to the two classes: live (1) or forged (0), indicating the probability or confidence that the input data belongs to each category. The detailed structure of the network is shown in Figure 11. The output features from the Transformer are first linearly transformed by the fully connected layer fc to produce a feature representation of size 64. This layer performs a weighted combination of the features extracted at each time step by the Transformer, effectively fusing them and extracting high-level features for classification. The resulting feature vector is then passed through the fully connected layer fc1, which reduces the feature dimension to 32. This compression helps eliminate unnecessary noise or redundant features while retaining important information. The fc1 layer is followed by a ReLU activation function, enabling the network to learn nonlinear relationships more effectively and enhancing its capacity to represent complex patterns. Finally, after passing through the fc2 layer, the output features are mapped to the target category dimensions, yielding the binary classification output: either live (1) or forged (0). This final output indicates the model's confidence in the classification of the input data.
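A sketch of this classification head is shown below (fc: 64 → 64, fc1: 64 → 32 with ReLU, fc2: 32 → 2, as described above); averaging the encoder output over time steps before fc is an assumption about how the per-time-step features are fused:

import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(64, 64)    # fuse Transformer features into a 64-d representation
        self.fc1 = nn.Linear(64, 32)   # compress to 32 dimensions
        self.fc2 = nn.Linear(32, 2)    # two outputs: forged (0) vs. live (1)
        self.relu = nn.ReLU()

    def forward(self, z):              # z: (N, 45, 64) encoder output
        pooled = z.mean(dim=1)         # assumed fusion over time steps -> (N, 64)
        h = self.relu(self.fc1(self.fc(pooled)))
        return self.fc2(h)             # (N, 2) class scores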

2.5. Model Training and Optimization

Model training and optimization are crucial steps in developing deep learning models, aiming to improve their predictive performance on specific tasks by minimizing the loss function. In this study, the goal is to optimize the model's performance by adjusting its parameters to minimize the loss function. For the finger vein liveness detection task, a binary classification problem, Cross-Entropy Loss [50] is used as the loss function. This loss function is commonly applied in classification tasks as it measures the difference between the predicted probability distribution and the true label distribution. The Cross-Entropy Loss is defined as shown in Equation 22:
$L = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]$    (22)
where N is the number of samples, y_i is the true label (0 or 1), and ŷ_i is the predicted probability for the corresponding category. By minimizing this loss function, the model can effectively adjust its weights so that the output probabilities closely align with the true labels.
To optimize the loss function, we use the Adam optimizer [51]. Adam is a gradient-based first-order optimization method that combines the momentum approach with an adaptive learning rate strategy. This allows the learning rate for each parameter to be adjusted dynamically during the training process, thereby accelerating convergence and enhancing training stability. The update formula for the Adam optimizer is as follows:
$m_t = \beta_1 m_{t-1} + (1 - \beta_1) \nabla_\theta L(\theta)$    (23)
$v_t = \beta_2 v_{t-1} + (1 - \beta_2) \left(\nabla_\theta L(\theta)\right)^2$    (24)
$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}$    (25)
$\hat{v}_t = \frac{v_t}{1 - \beta_2^t}$    (26)
$\theta_t = \theta_{t-1} - \alpha \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$    (27)
Here, m_t and v_t are the momentum and variance estimates of the gradient, respectively, β_1 and β_2 are the decay coefficients, α is the learning rate (initially set to 0.0001), and ϵ is a small smoothing term that prevents division by zero. By dynamically adjusting the learning rate, the Adam optimizer accelerates the training process, reduces training time, and mitigates issues such as vanishing or exploding gradients, which are common challenges in traditional gradient descent methods.
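Under these settings, a condensed training loop might look as follows; the assembled MixCFormer model and the DataLoader are placeholders, the learning rate of 0.0001 and batch size of 64 follow the text, and integer class labels (rather than Mixup soft labels) are assumed for simplicity:

import torch
import torch.nn as nn

# `model` and `train_loader` are assumed to be defined elsewhere.
criterion = nn.CrossEntropyLoss()                          # cross-entropy loss of Equation 22
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # Adam updates of Equations 23-27

for epoch in range(500):
    for sequences, labels in train_loader:   # sequences: (64, 1, 180); labels: 0 = forged, 1 = live
        optimizer.zero_grad()
        logits = model(sequences)            # (64, 2) class scores
        loss = criterion(logits, labels)
        loss.backward()
        optimizer.step()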

3. Experimental Results

To evaluate the proposed approach, we conducted algorithm validation using a finger vein dataset and performed fair comparison experiments with a baseline algorithm. All experiments were carried out on a system running the Windows 11 operating system, equipped with an Intel(R) Core (TM) i7-14650HX processor. The programming environment was Python 3.10.14, with PyTorch 2.4.0 as the deep learning framework, and CUDA 12.6 serving as the parallel computing platform. The experiments were executed on NVIDIA GeForce RTX 4060 GPUs. During training, the model was configured with an epoch size of 500, a batch size of 64, an initial learning rate of 0.0001, an image resolution of 390×110 pixels, and a sequence dimension of 180. All other hyperparameters were set to their default values.

3.1. Dataset Description

The finger vein video dataset consists of 10,476 videos, including 2,556 real-life finger vein videos and three types of attack videos. The attack videos comprise 2,592 real-life thin-gloved finger vein videos, 2,772 real-life thick-gloved finger vein videos, and 2,556 heart-rate-free finger vein videos created using colored clay. Each video is 6 seconds long. Figure 12 presents screenshots from some of the videos in the dataset, where (a) shows a screenshot of a real person's finger vein video; (b) depicts a screenshot of a real person's thin-gloved finger vein video; (c) shows a screenshot of a real person's thick-gloved finger vein video; and (d) illustrates a screenshot of a colored clay finger prosthesis vein video.
During data collection, a finger vein acquisition device was used to capture video of the vein patterns from the index, middle, and ring fingers of both the left and right hands of each subject. Video data were collected from all six fingers (left and right index, middle, and ring fingers) for each subject, with six repetitions per finger. Before each data acquisition session, the subject's heart rate was re-measured to ensure that all six consecutive video sets were collected from living subjects, thereby ensuring the quality of the finger vein video samples.
We processed 1,000 real human finger vein videos and 1,000 prosthetic finger vein videos in a MATLAB environment. Each video was converted into 180 frames, and the finger vein information was extracted from these frames by applying the baseline drift elimination and filtering algorithms described above. Each set of 180 frames from a video was represented as a feature vector, which was then written to an Excel file. Since the data in each frame is time-dependent, the sequence of 180 data points can be treated as a time series. Next, we labeled the time series of the 1,000 real finger vein videos as '1' and the time series of the 1,000 prosthetic finger vein videos as '0', resulting in a dataset with 2,000 columns and 180 rows of finger vein time series data. Subsequently, we applied Mixup data augmentation to expand the dataset, generating a total of 4,000 finger vein sequence samples.
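Although the sequence extraction itself was performed in MATLAB, a Python sketch of how the resulting table could be loaded and split for training is given below; the file name, the placement of the labels in an extra row beneath the 180 feature rows, and the 8:2 split (see Section 3.3) are assumptions made for illustration:

import pandas as pd
from sklearn.model_selection import train_test_split

# Each column holds one 180-point vein sequence; an assumed extra row stores the label
# (1 = real finger, 0 = prosthetic finger).
df = pd.read_excel("finger_vein_sequences.xlsx", header=None)
X = df.iloc[:180, :].T.values    # (num_samples, 180) time series, one row per sample
y = df.iloc[180, :].values       # label per column

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)   # 8:2 train/test split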

3.2. Evaluation Metrics

In the context of finger vein recognition systems, particularly for the task of finger vein liveness detection, evaluating the performance of the model is of paramount importance. To this end, we employ several evaluation metrics, including Training Loss, Test Loss, Training Accuracy, Test Accuracy, Precision, Recall, and F1 Score. These metrics enable a comprehensive assessment of the model's performance across different stages and provide insights into its classification capabilities from multiple perspectives. Below is a detailed description of each of these evaluation metrics:
(1) Train Loss and Test Loss. Train Loss and Test Loss are key metrics used to evaluate the model's prediction error on the training set and test set, respectively. Train Loss indicates how well the model fits the training data, while Test Loss assesses the model's ability to generalize to unseen data. In the context of the finger vein recognition task, Training Loss provides insight into how effectively the model optimizes its parameters during the training process. A lower training loss generally indicates better fitting to the training data. On the other hand, Test Loss reflects the model's performance on data it has not seen during training, serving as an indicator of its generalization capability. For this binary classification task, the loss function used is Cross-Entropy Loss, which quantifies the difference between the predicted probability distribution and the true labels. It is computed as follows:
$\mathrm{Train/Test\ Loss} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{Loss}\left(y_i, \hat{y}_i\right)$    (28)
Where Loss(y_i, ŷ_i) represents the loss for the i-th sample, and N is the total number of samples in the training or test set. A smaller loss value indicates a lower prediction error, which signifies better model performance. If the training loss continues to decrease over time, it suggests that the model is improving its accuracy on the training data. Conversely, a high training loss indicates that the model has not sufficiently learned from the data, potentially signaling an underfitting issue. In terms of test loss, a lower value suggests that the model generalizes well and is capable of making accurate predictions on unseen data. However, if the test loss is significantly higher than the training loss, this may indicate overfitting, where the model performs well on the training data but fails to generalize to new, unseen examples.
(2) Train Accuracy and Test Accuracy. Training Accuracy and Test Accuracy measure the proportion of correct predictions made by the model on the training set and the test set, respectively. Training Accuracy reflects how well the model has learned from the training data, while Test Accuracy assesses the model's ability to generalize to unseen data. Accuracy is defined as the ratio of correctly classified samples to the total number of samples, and is calculated using the following formula:
$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$    (29)
where TP is the number of true positives, TN is the number of true negatives, FP is the number of false positives, and FN is the number of false negatives. A high training accuracy that approaches 100% coupled with a lower test accuracy may indicate overfitting, where the model performs well on the training data but fails to generalize to new, unseen examples. Ideally, both training and test accuracies should be closely aligned. A significant disparity between the two, particularly if the test accuracy is much lower than the training accuracy, suggests that the model is not generalizing well enough and may require further optimization or regularization.
(3) Precision. Precision is a metric that measures the proportion of samples predicted by the model as belonging to the positive class that are actually true positives. In the context of finger vein recognition for attack detection, precision quantifies the model's ability to correctly identify attack samples (i.e., positive class instances). A higher precision value indicates that most of the samples predicted as attacks by the model are indeed actual attacks, resulting in fewer false positives. Precision is calculated as follows:
$\mathrm{Precision} = \frac{TP}{TP + FP}$    (30)
In the case of finger vein attack detection, high precision is particularly important because it minimizes the occurrence of false alarms. A system with high precision ensures that when an attack is predicted, it is more likely to be a true attack, thereby enhancing the reliability and trustworthiness of the system.
(4) Recall. Recall, also known as sensitivity or true positive rate, measures the proportion of actual positive samples that are correctly identified by the model. Specifically, it quantifies the model's ability to detect all instances of the positive class (e.g., attack samples). In the context of finger vein recognition for attack detection, recall is crucial because it reflects how effectively the model can identify potential attacks, even if some false positives occur. The calculation formula for recall is presented in Equation 31.
$\mathrm{Recall} = \frac{TP}{TP + FN}$    (31)
A higher recall value indicates that the model is successful in detecting more true attack samples. In finger vein recognition, maximizing recall is particularly important to ensure that no genuine attack samples are missed. However, a higher recall might come at the cost of an increase in false positives, as the model may classify more non-attacks as attacks to avoid missing any true attacks.
(5) F1 Score. The F1 Score is the harmonic mean of precision and recall, offering a balanced evaluation metric that combines both aspects of model performance. Unlike precision and recall, which may offer conflicting perspectives (e.g., high precision might lower recall and vice versa), the F1 Score provides a single measure that accounts for both false positives and false negatives, making it particularly useful when dealing with imbalanced datasets or cases where one class is underrepresented (such as attack samples in finger vein recognition). The F1 Score is calculated in Equation 32:
\[ \text{F1 Score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{32} \]
For the finger vein attack detection task, the F1 score is an essential metric for assessing the model’s overall effectiveness, particularly when attack samples are sparse or imbalanced. A high F1 score indicates that the model achieves a good balance between precision and recall, ensuring both minimal false positives and false negatives. This metric is especially valuable in security applications, where the cost of missing an attack (false negative) or generating false alarms (false positive) can be significant.
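The three related metrics can be sketched in the same style; the function names below are again illustrative rather than the authors' actual code, and the counts mirror the hypothetical example given for accuracy:

```python
def precision(tp: int, fp: int) -> float:
    """Of the samples flagged as attacks, the fraction that really are attacks."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp: int, fn: int) -> float:
    """Of the actual attacks, the fraction the model flags."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def f1_score(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0

p = precision(tp=198, fp=1)   # ~0.9950
r = recall(tp=198, fn=1)      # ~0.9950
print(round(p, 4), round(r, 4), round(f1_score(p, r), 4))
```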

3.3. Comparison Experiment

To evaluate the performance of our approach, we conducted a comprehensive comparative experiment against 10 benchmark models, evaluated both as plain baselines and with MixUp data augmentation. The primary objective was to compare several mainstream network architectures, including CNN, LSTM, and Transformer, on the specific task of finger vein recognition; to ensure a thorough evaluation, we also examined several typical algorithm combinations. The models were systematically compared on four key experimental metrics: test accuracy, precision, recall, and F1-score.
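Since several of the benchmarks are evaluated with and without MixUp, the sketch below recalls the standard MixUp recipe: a convex combination of two samples and their one-hot labels with a Beta-distributed coefficient. The alpha value, tensor shapes, and function name are illustrative assumptions rather than the paper's exact configuration:

```python
import torch

def mixup_batch(x: torch.Tensor, y: torch.Tensor, alpha: float = 0.2):
    """Standard MixUp: blend a batch with a shuffled copy of itself.

    x: input feature sequences, shape (batch, ...); y: one-hot labels, shape (batch, classes).
    """
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))
    x_mixed = lam * x + (1.0 - lam) * x[perm]
    y_mixed = lam * y + (1.0 - lam) * y[perm]
    return x_mixed, y_mixed, lam

# Toy usage: 8 feature sequences of length 128, one-hot live/attack labels.
x = torch.randn(8, 128)
y = torch.eye(2)[torch.randint(0, 2, (8,))]
x_m, y_m, lam = mixup_batch(x, y)
print(x_m.shape, y_m.shape, round(lam, 3))
```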
In the experiments, the finger vein dataset was split into training and test sets in an 8:2 ratio to assess the performance of the MixCFormer algorithm. After 300 training epochs, the experimental results are presented in Figure 13. Specifically, Figure 13a shows the loss function curves for both the training and validation sets over the course of training, Figure 13b displays the trends in training accuracy and test accuracy, Figure 13c presents the precision curves, and Figure 13d illustrates the recall and F1-score variations. The results in Figure 13 indicate that the model’s loss function for both the training and validation sets gradually decreases as training progresses, suggesting that the model is converging and effectively learning the data features. The accuracy on the test set reached 99.50%, which is very close to the 99.72% achieved on the training set, demonstrating the model’s strong generalization ability without overfitting. Furthermore, the precision was 99.51%, indicating the model’s effectiveness in minimizing false positives—i.e., incorrectly classifying negative samples as positive. As shown in the precision curve in Figure 13c, precision continues to improve as training progresses, further validating the efficiency of the MixCFormer algorithm for finger vein recognition. Additionally, the recall and F1-score curves in Figure 13d exhibit a consistent upward trend, indicating that the model successfully maintains high sensitivity while reducing false positives. These results suggest that the MixCFormer algorithm strikes an effective balance between recall and F1-score, while simultaneously improving recognition accuracy, leading to more comprehensive performance optimization.
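The evaluation protocol described above can be reproduced in outline roughly as follows. The dataset, backbone, and epoch count are placeholders; this is only a sketch of the 8:2 split and per-epoch test-accuracy logging, not the authors' training code:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, random_split

# Placeholder dataset: 2000 feature sequences of length 128 with binary labels
# (0 = genuine, 1 = forged); stands in for the real finger vein feature sequences.
data = TensorDataset(torch.randn(2000, 128), torch.randint(0, 2, (2000,)))
n_train = int(0.8 * len(data))                         # 8:2 train/test split
train_set, test_set = random_split(data, [n_train, len(data) - n_train])
train_loader = DataLoader(train_set, batch_size=32, shuffle=True)
test_loader = DataLoader(test_set, batch_size=32)

model = torch.nn.Sequential(torch.nn.Linear(128, 2))   # stand-in for MixCFormer
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = torch.nn.CrossEntropyLoss()

for epoch in range(5):                                  # the paper trains for 300 epochs
    model.train()
    for x, y in train_loader:
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        optimizer.step()
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for x, y in test_loader:
            correct += (model(x).argmax(dim=1) == y).sum().item()
            total += y.numel()
    print(f"epoch {epoch}: test accuracy {correct / total:.4f}")
```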
The confusion matrix for MixCFormer is shown in Figure 14. From these results, it is evident that the MixCFormer algorithm achieves exceptionally high accuracy in the vein authenticity recognition task, with virtually no false positives. This indicates that the model effectively avoids misclassifying forged images as real. Although there are 2 false negatives, the overall false negative rate remains very low, suggesting that the model performs excellently in terms of both sensitivity and precision.
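For reference, a confusion matrix of the kind shown in Figure 14 can be tabulated with scikit-learn; the label vectors below are synthetic stand-ins chosen to mirror the two false negatives and zero false positives described above, not the actual test-set outputs:

```python
from sklearn.metrics import confusion_matrix

# 0 = genuine (live) sample, 1 = attack; synthetic predictions for illustration only.
y_true = [1] * 200 + [0] * 200
y_pred = [1] * 198 + [0] * 2 + [0] * 200   # two attacks missed, no false alarms
print(confusion_matrix(y_true, y_pred, labels=[0, 1]))
# [[200   0]
#  [  2 198]]
```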
To evaluate the performance of the improved MixCFormer algorithm, we conducted a series of comparative experiments with several representative models. The results of these experiments are summarized in Table 1 and Table 2. Additionally, we compared and analyzed the loss function curves of the different models throughout the training process, as illustrated in Figure 15. These comparisons allowed for an in-depth analysis of each model's performance in the finger vein recognition task, focusing on key metrics such as loss function, precision, recall, and F1 score. Furthermore, the performance trends of the models during both the training and testing phases are visualized in line graphs, as shown in Figure 16. Collectively, these experimental results provide a comprehensive quantitative analysis, which enables a more thorough evaluation of the advantages of the MixCFormer algorithm.
Table 1 presents a comparative analysis of MixCFormer and 10 other recognition algorithms on the finger vein dataset. Among the individual models, CNN achieves the highest precision of 94.50%, followed by LSTM at 94.39% and GRU at 93.78%, while the Transformer performs the worst at 91.50%. Among the cascade models, CFormer (95.50%) surpasses all individual models, whereas CNN + LSTM (93.57%) and CLT (94.26%) do not improve on the standalone CNN. Applying MixUp data augmentation to CNN (MixCNN) boosts precision to 97.53%, demonstrating the effectiveness of MixUp in enhancing model robustness, whereas MixLSTM (93.43%) and MixCLT (93.73%) benefit far less. Notably, our proposed MixCFormer model achieves the highest precision of 99.51%, showcasing its superior performance in vein authenticity recognition and surpassing all other models in both individual and combined configurations.
Table 2 presents a comprehensive comparison of experimental metrics—test loss, test accuracy, precision, recall, and F1 score—across the evaluated algorithms for vein authenticity recognition. GRU, CNN, and LSTM perform similarly, with CNN achieving the highest precision of 94.50% at a test accuracy of 94.25%. The Transformer exhibits the lowest performance, with a test accuracy of 91.50% and all other scores also at 91.50%, indicating its relative inefficacy for this task. Cascade models such as CNN + LSTM and CFormer show changes in performance, with CFormer performing best among them at 95.50% across accuracy, precision, recall, and F1 score. When MixUp data augmentation is applied, MixCNN improves substantially, achieving 97.50% accuracy, 97.53% precision, and 97.50% recall, demonstrating the effectiveness of MixUp in improving model robustness. MixLSTM and MixCLT benefit less clearly and remain below MixCNN. Finally, our model stands out with exceptional results: 99.50% test accuracy and 99.51% precision, recall, and F1 score, outperforming all other models and demonstrating the superior effectiveness of the proposed method in vein authenticity recognition.
Figure 15 illustrates the loss function curves for the different models during the training and testing phases. The analysis reveals that the MixCFormer model (k) demonstrates the fastest convergence and the most stable performance, with the loss function rapidly and consistently decreasing to 0.0414 after 300 training epochs. In contrast, the Transformer (d) and CFormer (f) models exhibit slower convergence and more fluctuating loss curves, indicating lower learning efficiency. Further examination shows that the CNN, Transformer, and CFormer models begin to overfit after approximately 130 epochs, as evidenced by an increase in the loss function on the test set. The MixCNN model (h) displays a smoother loss curve, suggesting greater stability during training. Models that pair MixUp augmentation with a strong backbone, such as MixCFormer (k) and MixCNN (h), also show a faster reduction in the loss function. Overall, the MixCFormer model outperforms all other models in terms of both convergence speed and stability, underscoring its superior performance in the finger vein recognition task.
As shown in Figure 16, MixCFormer demonstrates a stable convergence trend throughout the training process. As the number of iterations increases, the training loss gradually decreases and eventually stabilizes, indicating that the model effectively learns without overfitting. The test accuracy consistently rises, surpassing that of other algorithms, which highlights its superior generalization ability. Additionally, MixCFormer excels in precision, recall, and F1 score, with all metrics continuously improving during training and stabilizing in the later stages, significantly outperforming the comparison models. These trends confirm that MixCFormer excels across all key performance metrics, offering both improved stability and efficient learning capabilities.

4. Conclusions

In this paper, we introduce MixCFormer, a hybrid CNN-Transformer architecture for finger vein liveness detection. The architecture combines preprocessing techniques (baseline drift elimination and morphological filtering) to improve vein feature extraction, Mixup augmentation to address data scarcity, and a CNN-Transformer hybrid backbone for local and global feature fusion. Residual connections further optimize feature transfer, improving model stability. Experimental results show that MixCFormer outperforms state-of-the-art methods in detection accuracy, robustness, and adaptability, especially under noisy and variable lighting conditions.
Our future work will focus on further optimizing data augmentation techniques to enhance the model's robustness and improve its adaptability to a wide range of attack scenarios. Simultaneously, we will prioritize the construction of more diverse finger vein datasets to strengthen the model's generalization capability across different environments and lighting conditions. In terms of feature extraction, we plan to integrate time-series modeling more effectively to capture dynamic vein patterns, enabling real-time detection. Additionally, we will investigate methods for extracting additional vital sign information, such as pulse and heartbeat, from vein data, thereby unlocking the full potential of vein recognition technology in biometric security applications.

Author Contributions

Conceptualization, Z.W. and H.Q.; methodology, Z.W. and S.Y.; software, Y.L.; validation, Z.W.; formal analysis, H.Q.; investigation, Z.W. and J.W.; resources, Z.W.; data curation, Z.W.; writing—original draft preparation, Z.W.; writing—review and editing, Z.W., S.Y., and J.W.; visualization, Z.W. and S.Y.; supervision, Z.W.; project administration, Z.W.; funding acquisition, Z.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded in part by the National Natural Science Foundation of China (grant 62301241) and in part by the Key Research Program of Higher Education Institutions in Henan Province (grant 25A510017).

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Acknowledgments

The authors would like to acknowledge the anonymous reviewers and editors whose thoughtful comments helped to improve this manuscript.

Conflicts of Interest

The authors declare no conflicts of interest.

Figure 1. MixCFormer Structure.
Figure 2. Acquisition equipment: (a) Front vein acquisition device; (b) Finger Clip Pulse Oximeter.
Figure 3. Schematic diagram of the finger vein image acquisition process.
Figure 4. Three forms of attacks: (a) Thin gloves; (b) Thick gloves; (c) coloured clay.
Figure 5. Finger images captured under three attacks: (a) Thin gloves; (b) Thick gloves; (c) coloured clay.
Figure 6. The original waveform of the time series.
Figure 7. Comparison of before and after baseline drift correction.
Figure 8. Comparison of signals before and after Butterworth filtering.
Figure 9. CNN-Transformer Feature Extraction Model Structure.
Figure 10. Structure of the Transformer Encoder.
Figure 11. This is a figure. Schemes follow the same formatting.
Figure 12. Screenshots from the finger vein video dataset: (a) Real person finger vein video; (b) Real person finger vein video with thin gloves; (c) Real person finger vein video with thick gloves; (d) Finger vein video with colored clay prosthesis.
Figure 13. Experimental results of MixCFormer algorithm training and testing: (a) Loss function curves for training and test sets; (b) Accuracy curves for training and test sets; (c) Precision curves; (d) Recall and F1-score curves.
Figure 14. The Confusion Matrix of MixCFormer algorithm.
Figure 15. Loss function curves for different model training and testing experiments: (a) GRU; (b) CNN; (c) LSTM; (d) Transformer; (e) CNN + LSTM; (f) CFormer; (g) CLT; (h) MixCNN; (i) MixLSTM; (j) MixCLT; (k) MixCFormer.
Figure 16. Performance comparison curve of different algorithms: (a) Test Loss curves; (b) Test Accuracy curves; (c) Precision curves; (d) Recall and F1-score curves.
Table 1. Comparative experimental results on finger vein datasets.
Models              Modal Structure   Mixup   Precision (%)
GRU                 /                 /       93.78
CNN                 /                 /       94.50
LSTM                /                 /       94.39
Transformer         /                 /       91.50
CNN + LSTM          Cascade           /       93.57
CFormer             Cascade           /       95.50
CLT                 Cascade           /       94.26
MixCNN              /                 ✓       97.53
MixLSTM             /                 ✓       93.43
MixCLT              Cascade           ✓       93.73
Our (MixCFormer)    Cascade           ✓       99.51
Table 2. Comparison of experimental metrics for different algorithms.
Models              Test Loss   Test Accuracy (%)   Precision (%)   Recall (%)   F1 Score (%)
GRU                 0.1950      93.75               93.78           93.75        93.75
CNN                 0.1613      94.25               94.50           94.25        94.24
LSTM                0.1996      94.25               94.39           94.25        94.25
Transformer         0.2530      91.50               91.50           91.50        91.50
CNN + LSTM          0.2121      93.50               93.57           93.50        93.50
CFormer             0.1706      95.50               95.50           95.50        95.50
CLT                 0.2017      94.25               94.26           94.25        94.25
MixCNN              0.0949      97.50               97.53           97.50        97.50
MixLSTM             0.1919      93.37               93.43           93.37        93.37
MixCLT              0.1827      93.63               93.73           93.63        93.63
Our (MixCFormer)    0.0414      99.50               99.51           99.51        99.51
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.