1. Introduction
Human Activity Recognition (HAR) is now fundamental for applications in medical rehabilitation [
1,
2], intelligent security [
3], and ambient-assisted living [
4,
5]. Traditionally, HAR has relied on data from video surveillance [
6,
7], infrared cameras [
8,
9], and wearable sensors [
10,
11]. However, each of these methods has limitations. Video surveillance loses accuracy in low-light conditions and is further affected by environmental factors and monitoring distance. Infrared cameras are sensitive to temperature changes and struggle to detect radial motion. Wearable sensors have not seen widespread adoption because they are intrusive and have limited battery life.
In contrast, radar sensors [
12,
13,
14,
15,
16], which have been increasingly used in recent decades, offer solutions to these challenges, particularly in short-range indoor applications. Radar sensors effectively localize and track human movements, recognize various behaviors, and even monitor vital signs [
17,
18,
19] without the need for individuals to carry the devices. With their high penetration and privacy protection capabilities, radar sensors offer a versatile and non-intrusive option for HAR in sensitive environments, such as bedrooms and bathrooms, addressing the functional and privacy limitations of other types of sensors.
Radar echoes typically contain information regarding time, range, and Doppler frequency. However, researchers [
15,
20,
21] often perform time-frequency analysis on radar echoes to obtain spectrograms with micro-Doppler (µ-D) features, which are then used in HAR for recognition and classification. Radar spectrograms offer a unique visual representation of human activity, positioning them as an alternative to conventional image data. Although other radar data representations, such as the Time-Range (TR) domain based on the range fast Fourier transform (range-FFT) [22,23], are available, µ-D-based representations remain a popular choice for HAR because of their ability to capture the distinct movement characteristics of individual body parts.
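As a rough illustration of how a TR map can be formed, the following sketch applies an FFT along the fast-time samples of each chirp and stacks the resulting range profiles over slow time. It assumes an FMCW-style beat signal, which this section does not specify; the function name `range_time_map`, the Hann window, and the synthetic signal are illustrative choices rather than the authors' pipeline.

```python
import numpy as np

def range_time_map(beat, n_range_bins=None):
    """Build a time-range map from a real beat signal.

    beat: array of shape (n_chirps, n_samples), one row per chirp.
    Returns magnitudes in dB with shape (n_bins, n_chirps).
    """
    beat = np.asarray(beat, dtype=float)
    window = np.hanning(beat.shape[1])            # reduce spectral leakage
    spectrum = np.fft.fft(beat * window, n=n_range_bins, axis=1)
    half = spectrum.shape[1] // 2                 # real input: keep positive frequencies
    magnitude = np.abs(spectrum[:, :half])
    return (20.0 * np.log10(magnitude + 1e-12)).T

# Synthetic example (hypothetical values): one target whose beat frequency,
# and hence range bin, drifts from 10 Hz to 30 Hz over 64 chirps.
n_chirps, n_samples = 64, 128
fs = 128.0                                        # chosen so that bin k corresponds to k Hz
t = np.arange(n_samples) / fs
beat_freqs = np.linspace(10.0, 30.0, n_chirps)
beat = np.cos(2.0 * np.pi * beat_freqs[:, None] * t[None, :])

tr_map = range_time_map(beat)
peak_bins = tr_map.argmax(axis=0)                 # strongest range bin per chirp
print(tr_map.shape, peak_bins[0], peak_bins[-1])
```

Each column of `tr_map` is one range profile, so a moving target traces a path across range bins over time, which is the pattern a CNN would see in a TR image.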
In this study, we explored multiple radar domain representations to maximize information extraction from radar echoes. The first representation is a two-dimensional (2D) TR domain map generated using the range-FFT. In addition, we used two time-frequency approaches, namely, the short-time Fourier transform (STFT) [
24] and the smoothed pseudo Wigner-Ville distribution (SPWVD) [
25]. STFT provides clear and interpretable features in the time-frequency domain, although its fixed-length window function creates a trade-off between time and frequency resolution. In contrast, SPWVD offers high time-frequency resolution and effectively reduces cross-term interference, allowing for the precise representation of µ-D features. Although the resolution of SPWVD exceeds that of STFT, it requires significantly more processing time, making it less suitable for real-time applications, which are a key consideration in this study.
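The time-frequency trade-off of the STFT can be made concrete with a small NumPy sketch. The helper `stft_spectrogram`, the sampling rate, the window lengths, and the frequency-modulated test tone (a stand-in for a limb-induced Doppler signature) are all assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def stft_spectrogram(x, fs, win_len, hop):
    """Magnitude STFT of a real 1-D signal using a Hann window."""
    window = np.hanning(win_len)
    n_frames = 1 + (len(x) - win_len) // hop
    frames = np.stack(
        [x[i * hop : i * hop + win_len] * window for i in range(n_frames)]
    )
    spec = np.abs(np.fft.rfft(frames, axis=1)).T      # (freq bins, frames)
    freqs = np.fft.rfftfreq(win_len, d=1.0 / fs)
    times = (np.arange(n_frames) * hop + win_len / 2) / fs
    return freqs, times, spec

# Hypothetical test signal: a 100 Hz tone whose instantaneous frequency
# swings by +/- 30 Hz once per second.
fs = 1000.0
t = np.arange(2000) / fs
x = np.cos(2.0 * np.pi * 100.0 * t + 30.0 * np.sin(2.0 * np.pi * 1.0 * t))

f_short, t_short, spec_short = stft_spectrogram(x, fs, win_len=64, hop=16)
f_long, t_long, spec_long = stft_spectrogram(x, fs, win_len=512, hop=16)

for name, win_len, freqs in (("short", 64, f_short), ("long", 512, f_long)):
    df = freqs[1] - freqs[0]                          # frequency bin spacing in Hz
    dt = win_len / fs                                 # window duration in s
    print(f"{name}: df = {df:.3f} Hz, window = {dt * 1e3:.0f} ms, df * dt = {df * dt:.1f}")
```

The product of bin spacing and window duration is the same for both windows, which is the fixed-window limitation described above: the short window follows fast frequency changes but smears them across coarse bins, while the long window does the opposite.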
The effectiveness of HAR is influenced by both the creation of diverse radar representations and the choice of feature extraction techniques. Generally, feature extraction methods fall into two categories: manual and automated through deep learning models [
12]. However, manual extraction is prone to interference from noise, requires specialized knowledge, and fails to isolate high-level discriminative details from radar-based representations, leading to less efficient results. In response, deep-learning-based HAR approaches have emerged, which use convolutional neural network (CNN) models to enhance feature extraction by capturing a broader set of features from radar signals. CNNs have significantly improved the ability to autonomously learn and distinguish complex data patterns. They are particularly effective in processing image data, including radar-generated maps, because they can perform both feature extraction and classification. This dual functionality has revolutionized fields such as image recognition and computer vision [
12].
An early study introduced a CNN model for document recognition [26], pioneering work that brought CNN algorithms into the spotlight. A later CNN architecture [27] reported a top-1 error rate of 37.5% on ImageNet in 2012. Following this breakthrough, several CNN architectures were developed, including VGGNet, MobileNet, and ResNet. Owing to their excellent performance in image classification, these architectures have also been applied to radar domain representations for HAR classification. These 2D radar representations, such as TR maps and TD maps with µ-D features, serve as inputs for HAR in much the same way as images from vision sensors. Although radar and image data differ, radar spectra can be analyzed like visual data, enabling activity recognition based on identifiable patterns. However, training a CNN from scratch requires a large amount of data, which is scarce in specialized applications such as radar-based HAR, so models trained from scratch are prone to overfitting or underfitting.
To overcome this challenge, we employ a Transfer Learning (TL) approach [
28,
29,
30]. TL reuses models pre-trained on large datasets, enabling existing CNNs to be adapted to new tasks. This strategy significantly reduces the need for large amounts of data and accelerates training. Our study therefore leverages well-known CNN architectures, namely VGG-16 [31], VGG-19 [32], ResNet-50 [31], and MobileNetV2 [33], chosen for their demonstrated effectiveness in image-based learning tasks, which makes them well suited to processing radar-generated images. We fine-tuned these four pre-trained architectures on our training samples to optimize the HAR system for higher recognition accuracy and fast real-time prediction, which is essential for critical applications such as fall detection.
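The mechanics of one common TL variant, keeping a backbone frozen and training only a new classification head, can be sketched in plain NumPy. This is a toy illustration under strong assumptions: `W_BACKBONE` is a random projection standing in for pre-trained weights (it is not actually pre-trained), the data are synthetic, and this section does not state which layers the authors froze or how they fine-tuned.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a pre-trained backbone. Its weights stay fixed
# below. They are random here, NOT genuinely pre-trained.
W_BACKBONE = rng.standard_normal((64, 16)) / 8.0

def backbone(x):
    """Frozen feature extractor: linear projection followed by ReLU."""
    return np.maximum(x @ W_BACKBONE, 0.0)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)          # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train_head(x, y, n_classes, lr=0.05, epochs=200):
    """Train only a new softmax head on top of the frozen features."""
    feats = backbone(x)                           # computed once: backbone never changes
    w = np.zeros((feats.shape[1], n_classes))
    b = np.zeros(n_classes)
    onehot = np.eye(n_classes)[y]
    losses = []
    for _ in range(epochs):
        probs = softmax(feats @ w + b)
        loss = -np.mean(np.log(probs[np.arange(len(y)), y] + 1e-12))
        losses.append(float(loss))
        grad = (probs - onehot) / len(y)          # cross-entropy gradient w.r.t. logits
        w -= lr * feats.T @ grad
        b -= lr * grad.sum(axis=0)
    return w, b, losses

# Small synthetic target task: 3 classes, 30 samples each.
n_classes = 3
class_means = rng.standard_normal((n_classes, 64))
y = np.repeat(np.arange(n_classes), 30)
x = class_means[y] + 0.5 * rng.standard_normal((len(y), 64))

backbone_before = W_BACKBONE.copy()
w_head, b_head, losses = train_head(x, y, n_classes)
print(f"head loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

Because only the head's parameters are updated, far fewer weights must be estimated from the small target dataset, which is the reason TL helps when radar data are scarce. In practice the backbone would be one of the pre-trained CNNs named above, loaded through a deep learning framework.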
The VGG models, known for their deep architecture, provide powerful feature extraction capabilities. ResNet-50 uses residual learning to mitigate the vanishing-gradient problem, thereby facilitating the training of deeper networks. MobileNetV2 uses depth-wise separable convolutions, which improve computational efficiency by filtering each input channel separately and then combining the resulting feature maps with 1 × 1 convolutions. Additionally, MobileNetV2 integrates inverted residual connections and linear bottlenecks to learn more complex features while maintaining efficiency.
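The efficiency gain from depth-wise separable convolutions follows from a simple parameter count. The sketch below compares a standard convolution with its depth-wise separable counterpart, ignoring biases; the layer sizes (3 × 3 kernel, 32 input channels, 64 output channels) are arbitrary illustrative values, not taken from MobileNetV2 or from this study.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a k x k convolution mapping c_in to c_out channels (no bias)."""
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    """Weights in a depth-wise k x k convolution plus a 1 x 1 point-wise convolution."""
    depthwise = k * k * c_in      # one k x k filter per input channel
    pointwise = c_in * c_out      # 1 x 1 convolution mixing channels
    return depthwise + pointwise

k, c_in, c_out = 3, 32, 64        # illustrative layer sizes
std = standard_conv_params(k, c_in, c_out)
sep = depthwise_separable_params(k, c_in, c_out)
print(std, sep, round(std / sep, 2))   # 18432 2336 7.89
```

For this layer the separable form needs roughly one eighth of the weights, and the multiply-accumulate count per output position shrinks by the same factor.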
In this study, we analyze the performance of various preprocessing techniques and CNN architectures, focusing on their potential application in edge-computing scenarios with limited computational resources. By optimizing these techniques for real-time processing, this approach aims to enhance the accessibility and effectiveness of HAR systems in real-world environments. Our framework explores three preprocessing techniques and four CNN models, resulting in 12 unique combinations of data preprocessing and model. We aim to evaluate their recognition accuracy and efficiency, ultimately identifying the most promising combination for potential deployment on resource-constrained devices. Our contributions can be summarized as follows:
Evaluation of Radar 2D Domain Techniques: We empirically evaluated range-FFT based time-range (TR) maps and time-Doppler (TD) maps generated using STFT and SPWVD, quantifying their computational efficiency in real-time HAR systems.
Optimizing models with Transfer Learning (TL): We evaluated the performance of state-of-the-art CNN architectures, including VGG-16, VGG-19, ResNet-50, and MobileNetV2, to improve the accuracy of the proposed HAR system using TL methods.
Performance and Computational analysis of Model-Domain pairs: We conducted a comprehensive analysis of 12 model-domain pairs, focusing on real-time performance to optimize the balance between accuracy and computational efficiency (preprocessing, training and inference times). The analysis is also extended to performance metrics beyond accuracy, such as recall, precision, and F1 score, which are critical to evaluating effectiveness in real-world applications.
The remainder of this paper is structured as follows. Section 2 outlines the related work. Section 3 presents an in-depth description of the radar-based HAR approach, covering the radar technology, the dataset, the preprocessing techniques, and the CNN architectures used. Section 4 presents a comparative evaluation of different combinations of radar data preprocessing and CNN models. Finally, Section 5 summarizes the main findings and contributions of this study and suggests possibilities for future research. The methodological flow of this study is illustrated in Figure 1.
2. Related Work
Researchers have increasingly turned to radar for non-intrusive activity monitoring. Over time, many approaches and techniques have been developed to improve the accuracy of such monitoring, particularly given the scarcity of comprehensive radar datasets. Different radar domains derived from radar echoes have become pivotal in training classification models. Although conventional machine learning methods have been explored in previous studies [
34,
35,
36], recent advances have shifted towards deep learning applications on radar datasets [
12,
37]. To address the challenge of limited data, we applied transfer learning methods [
38,
39], leveraging the capabilities of models pre-trained on large datasets.
In contrast to previous research that primarily examined single radar domains or models [
15,
40,
41], our study analyzed multiple radar domain representations based on the fast Fourier transform (FFT), including Time-Range (TR), short-time Fourier transform (STFT), and smoothed pseudo Wigner-Ville distribution (SPWVD) maps, in conjunction with different state-of-the-art neural network configurations, as shown in
Table 1. The networks considered in this study include a convolutional neural network (CNN) and long short-term memory (LSTM) models, which are compared with pre-trained models such as VGG-16, VGG-19, ResNet-50, and MobileNetV2. It is important to note that the processing time of the radar data was not addressed in the studies cited in
Table 1, which is a critical factor for real-time applications. Our analysis explicitly addresses this aspect by emphasizing the efficiency of radar data processing, which is crucial for the deployment of HAR systems in time-sensitive real-world environments.
By systematically analyzing 12 model-domain pairs (MDPs), our objective is to provide deeper insight into their effectiveness and contribute to the advancement of radar-based HAR systems. Our study improves accuracy while evaluating four distinct performance metrics along with computational costs. To the best of our knowledge, this comprehensive analysis of MDPs in a single study, focusing on both computational costs and overall classification accuracy, is unmatched in the field. Consequently, our work is highly relevant to the ongoing development of radar-based HAR technologies and sets a valuable benchmark for future research and development.
4. Results and Discussion
In this section, we present a detailed analysis of the results for each CNN model discussed in
Section 3.3.2, using the radar maps described in
Section 3.2 as inputs. The results focus on both performance evaluation and computational efficiency, particularly examining how multiple radar maps contribute to the models’ ability to extract relevant features essential for HAR systems in real-world deployment. To ensure stable and reliable measurements, the experimental results including recognition accuracy and inference time are averaged over five runs with random initialization.
4.1. Performance Comparison of Proposed HAR Models
In this section, we compare the performance metrics of the 12 MDPs (named M1, M2, M3, etc.), as described in Section 3.3.5; the results are listed in Table 4. Based on these metrics, pairs M1, M7, and M10 were selected from the 12 MDPs as the best-performing pairs in terms of accuracy. Each achieved the highest recognition accuracy within its radar domain (TR, STFT, and SPWVD, respectively). To evaluate the classification performance for each class, the confusion matrices were analyzed. Figure 4a, b, and c show the confusion matrices for pairs M1, M7, and M10. In particular, pair M7 identified A6, representing fall activity, with 100% accuracy, whereas pairs M1 and M10 achieved 97.50%. Pairs M1 and M7 detected the A1 class, representing walking activity, with 100% accuracy, whereas pair M10 achieved 98.41%.
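The per-class figures read off a confusion matrix correspond to per-class recall, and precision and F1 score follow from the same matrix. The sketch below shows the computation on a hypothetical three-class matrix; the counts are invented for illustration and are not the values in Figure 4, and the rows-as-true-class orientation is an assumption.

```python
import numpy as np

def per_class_metrics(cm):
    """Precision, recall, and F1 per class, plus overall accuracy.

    cm[i, j] counts samples of true class i predicted as class j.
    """
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    true_totals = cm.sum(axis=1)                  # samples per true class
    pred_totals = cm.sum(axis=0)                  # samples per predicted class
    recall = np.divide(tp, true_totals, out=np.zeros_like(tp), where=true_totals > 0)
    precision = np.divide(tp, pred_totals, out=np.zeros_like(tp), where=pred_totals > 0)
    denom = precision + recall
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(tp), where=denom > 0)
    accuracy = tp.sum() / cm.sum()
    return precision, recall, f1, accuracy

# Hypothetical confusion matrix with 40 test samples per class.
cm = [[40, 0, 0],
      [1, 38, 1],
      [0, 1, 39]]
precision, recall, f1, accuracy = per_class_metrics(cm)
print(np.round(recall, 4), round(float(accuracy), 4))
```

With 40 samples in a class, a single misclassification gives a recall of 39/40 = 97.5%, which shows how coarse per-class percentages become on small test sets.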
4.2. Comprehensive Performance Analysis on Radar Domains
Evaluating the generalization ability of HAR systems is essential, particularly when radar-based human activity datasets are limited. With little data, CNN classifiers tend to overfit. It is therefore important to evaluate model performance on new, unseen radar data, here the held-out 20% test set. The evaluation results of our HAR models are shown in Figure 5. The results are consistent, with the test accuracy varying only slightly across the 12 MDPs, as shown in Table 4.
Figure 5 indicates that the models generalize well, with test accuracies ranging from 92.88% to 98.01% on unseen data. Pair M10 achieved the highest test accuracy of 98.01%. In contrast, although pairs M2 and M4 achieved perfect or near-perfect average training accuracies of 100% and 99.42%, respectively, their test accuracies were lower, at 94.30% and 92.88%. This gap suggests some overfitting and indicates room for improvement. It also highlights the importance of testing on unseen data to determine how well a model adapts to new input, and of using cross-validation and held-out test data to evaluate generalization in real-world scenarios. Furthermore, choosing the best MDP requires a balanced evaluation based on both generalization ability and computational efficiency.
4.3. Computationally Efficient and Lightweight HAR Model
Computational efficiency is critical for real-time radar-based HAR systems, particularly on resource-constrained edge devices. It is therefore crucial to develop a lightweight model that can quickly and accurately recognize human activities while minimizing inference latency. This section examines computational efficiency using time metrics, namely training time and inference time as defined in Section 3.3.6, to evaluate the suitability of the models for real-world deployment, as detailed in Table 5. The fastest-predicting pairs, M4, M8, and M12, were chosen based on inference time, an important metric for resource-constrained edge devices that require fast activity prediction.
The inference time is also defined as the time between initiating a prediction request and receiving the prediction output from the test model. This metric is very important for evaluating the performance and efficiency of a model, particularly in applications that require real-time processing on edge devices or standalone systems. Inference time directly affects the user experience and applicability of the model in time-sensitive scenarios, such as fall detection.
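A per-sample inference time of this kind can be measured with a wall-clock timer around single-sample prediction calls, averaged over repeated runs. The sketch below is one way to do this; `mean_inference_ms`, the warm-up count, and the `dummy_predict` stand-in model (a single matrix multiplication with an arbitrary six output classes) are hypothetical and do not reproduce the measurement protocol of Section 3.3.6.

```python
import time
import numpy as np

def mean_inference_ms(predict, samples, warmup=3, repeats=5):
    """Average per-sample latency in ms over several timed runs.

    predict: callable taking a batch of shape (1, H, W).
    Returns (mean, standard deviation) across the repeated runs.
    """
    for _ in range(warmup):                       # exclude one-off start-up costs
        predict(samples[:1])
    per_run_ms = []
    for _ in range(repeats):
        start = time.perf_counter()
        for i in range(len(samples)):
            predict(samples[i : i + 1])           # one request per sample
        elapsed = time.perf_counter() - start
        per_run_ms.append(1000.0 * elapsed / len(samples))
    return float(np.mean(per_run_ms)), float(np.std(per_run_ms))

# Hypothetical stand-in for a trained model acting on 224 x 224 inputs.
rng = np.random.default_rng(0)
weights = rng.standard_normal((224 * 224, 6))

def dummy_predict(batch):
    return (batch.reshape(len(batch), -1) @ weights).argmax(axis=1)

samples = rng.standard_normal((20, 224, 224))
mean_ms, std_ms = mean_inference_ms(dummy_predict, samples)
print(f"{mean_ms:.3f} ms/sample (std {std_ms:.3f} ms over 5 runs)")
```

Timing single-sample calls rather than large batches matches the real-time setting, where each incoming radar frame must be classified as it arrives.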
4.4. Computational Cost Across Radar Domains
For the TR domain, the preprocessing time for converting the raw radar data into an image of range over time, as shown in Figure 3, is only 0.035 s, which is very low compared with the other techniques illustrated in Figure 6. However, when TR maps are fed to a CNN model, they yield lower accuracy and higher computational cost. Among the models, MobileNetV2 (M4) exhibits the best training and inference efficiency, with a training time of 1.79 s/epoch, an inference time of 2.78 ms/sample, and a recognition accuracy of 92.88%, as shown in Figure 6. In contrast, the other three models achieve higher accuracy at the cost of longer training and inference times.
The STFT-based TD map shows the change in frequency over time (Doppler shift), as shown in Figure 3, and takes only 0.22 s to preprocess and generate using the STFT method. When this spectrogram is used as the network input, it provides a good balance between performance and efficiency. For example, MobileNetV2 (M8) had a training time of 1.49 s/epoch and an inference time of 2.57 ms/sample, with a test accuracy of 96.30%. In contrast, VGG-16 and ResNet-50 achieved higher recognition accuracies of 96.87% and 97.15%, respectively, but had longer training and inference times, indicating higher resource usage, as listed in Table 4 and Table 5. VGG-19, in turn, has a longer prediction time of 6.90 ms/sample, making it less suitable for real-time systems.
Generating a spectrogram with the SPWVD method requires a lengthy preprocessing time of 52.58 s. Even so, with SPWVD input, MobileNetV2 (M12) had a training time of 1.34 s/epoch, lower than with the TR and STFT domains, together with an inference time of 2.76 ms/sample and a recognition accuracy of 96.01%, as shown in Figure 6. This reflects the advantage of the higher resolution obtained by applying time and frequency windows simultaneously, as detailed in Section 3.2.3. Despite this advantage, the long preprocessing time makes SPWVD unsuitable for real-time systems, which require a rapid response from data acquisition to model prediction, particularly for critical activities such as fall detection in elderly care homes.
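The effect of preprocessing on end-to-end latency can be checked with simple arithmetic on the figures quoted above for the three MobileNetV2 pairs. The sketch assumes that the quoted preprocessing times apply to each sample and that preprocessing and inference run one after the other; neither assumption is stated in this section.

```python
# Figures quoted in the text for the MobileNetV2 pairs.
pairs = {
    "M4 (TR + MobileNetV2)": {"preprocess_s": 0.035, "inference_ms": 2.78, "accuracy": 92.88},
    "M8 (STFT + MobileNetV2)": {"preprocess_s": 0.22, "inference_ms": 2.57, "accuracy": 96.30},
    "M12 (SPWVD + MobileNetV2)": {"preprocess_s": 52.58, "inference_ms": 2.76, "accuracy": 96.01},
}

def end_to_end_s(entry):
    """Preprocessing plus inference time for one sample, in seconds."""
    return entry["preprocess_s"] + entry["inference_ms"] / 1000.0

latency = {name: end_to_end_s(entry) for name, entry in pairs.items()}
for name, seconds in latency.items():
    share = 100.0 * pairs[name]["preprocess_s"] / seconds
    print(f"{name}: {seconds:.5f} s end to end, {share:.2f}% spent preprocessing")
```

Under these assumptions, preprocessing dominates the total in all three cases, so differences of a fraction of a millisecond in inference time matter far less than the choice of radar domain.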
This study concludes with two main possibilities for selecting the best model and radar domain for HAR systems. VGG-19 with SPWVD radar map performs well for applications that emphasize recognition accuracy, achieving a high recognition accuracy of 98.01% despite the longer preprocessing time. In contrast, for situations that require fast prediction, MobileNetV2 with STFT radar map is the most efficient, achieving the shortest inference time of 2.57 ms/sample while maintaining a remarkable accuracy of 96.30%, which is ideal for many real-time applications. Although TR maps provide consistent recognition accuracy across models, they often lag in inference time unless combined with MobileNetV2.
This study has limitations. Most importantly, energy consumption, which must be characterized before these models can be deployed on resource-constrained edge devices, was not measured. Future research should incorporate energy consumption measures to assess the suitability of each model and radar domain combination in low-power environments. In addition, further studies could examine how model reduction and optimization approaches, such as quantization and pruning, can improve the deployment potential of radar-based HAR systems on edge devices.
4.5. Comparison of Pair M8 with State-of-the-Art Models
A detailed comparative analysis is presented in
Table 6. All models used STFT-based spectrogram inputs with a resolution of 224 × 224 pixels. The CNN model [53] trained from scratch achieved 95.44% accuracy, but its inference time was 5.14 ms/sample, twice that of our proposed MobileNetV2 model. The CNN + LSTM model [23] had the shortest training time per epoch of 1.12 s, but its accuracy was only 84.90% and its inference time was as high as 6.04 ms/sample, more than twice that of our proposed model. The Bi-LSTM model [42] achieved a competitive accuracy of 95.16% with an inference time of 2.77 ms/sample, lagging behind the MobileNetV2 model in both training and inference time.
The proposed MobileNetV2 model, leveraging transfer learning, achieved the highest accuracy of 96.30% and the best inference time of 2.57 ms/sample, making it suitable for real-time applications. This shows that MobileNetV2 not only surpasses the accuracy of other state-of-the-art models trained from scratch but also significantly reduces inference time, providing an efficient option for real-time processing. This comparison highlights the adaptability and potential of our proposed work for a wide range of future applications, setting a benchmark for HAR systems in terms of both performance and efficiency.
5. Conclusions
In this study, we applied three preprocessing techniques, namely range-FFT for TR maps, STFT, and SPWVD, to generate inputs to CNN models for HAR and evaluated their computational efficiency for edge deployment. Combined with the VGG-16, VGG-19, ResNet-50, and MobileNetV2 architectures, they yield twelve model-preprocessing pairs. Among them, the combination of MobileNetV2 with STFT (model M8) showed the most balanced performance, setting a new benchmark for state-of-the-art radar-based HAR systems. This result emphasizes the importance of thoroughly evaluating the entire processing chain. The effectiveness of model M8 highlights its ability to support more advanced edge device models, which are typically associated with TinyML. Our work not only contributes to current methodologies but also lays the foundation for integrating more complex models into low-power, real-time edge systems.
Furthermore, in anticipation of advancements, our future research will focus on integrating neuromorphic federated learning and congestion-aware spiking neural networks to design energy-efficient systems, which is an important aspect not discussed in this study. This strategy aims to improve the real-time performance of radar-based HAR systems and address the trade-off between accuracy and energy efficiency.
Author Contributions
Conceptualisation, F.A., B.A. and A.Z.; methodology, F.A., B.A., S.H. and A.Z.; software, F.A.; validation, F.A., B.A., S.H. and A.Z.; formal analysis, F.A., B.A., S.H. and A.Z.; writing original draft, F.H.; writing, review and editing, F.A., B.A., S.H., M.A.I., K.A., K.A. and A.Z.; supervision, A.Z. All authors have read and agreed to the published version of the manuscript.