1. Introduction
Human Activity Recognition (HAR) has become a prominent area of research in recent years due to its wide range of applications, including healthcare monitoring, assisted living, sports analytics, and human-computer interaction. HAR systems aim to automatically classify human activities, such as walking, running, or sitting, using sensor data from wearable devices like smartphones and smartwatches [
1,
2]. These systems provide valuable insights for various domains, from detecting falls in elderly individuals to monitoring physical activity levels for fitness applications [
3]. Examples of application areas can be expanded. A HAR system can be developed for detecting unauthorized access in home security systems, ensuring workplace safety in industrial environments, or for use in military applications [
4,
5]. It can be used to create natural and intuitive user interfaces. In this way, devices can be controlled through gestures or body movements [
6]. In augmented reality applications, it can be beneficial for understanding the user's interaction with the physical environment and accordingly placing virtual objects [
7].
There are also challenges in developing a HAR system. People may exhibit different movements in various environments and on different devices. This increases the diversity of datasets and makes the classification task more challenging [
8]. There may be noise and uncertainties in sensor data, which can hinder the accurate recognition of movements. Especially in real-time applications, high computational power may be required to process large amounts of data quickly [
9].
A key motivation behind HAR is the proliferation of wearable sensor devices, which have enabled the continuous and real-time collection of data. Wearable sensors, such as accelerometers, gyroscopes, and magnetometers, have become integral to modern HAR systems, offering an affordable and scalable way to capture human motion [
10]. With the availability of these devices, researchers have explored various machine learning and deep learning techniques to improve the accuracy and efficiency of activity recognition models.
Several studies have demonstrated the potential of canonical machine learning algorithms in HAR. For instance, Banos et al. [
10] investigated the impact of window size on activity recognition performance by utilizing decision trees (DT), k-nearest neighbours (kNN), Naïve Bayes (NB), and nearest centroid classifier (NCC). The study focused on a wide range of fitness activities, which were categorized under warm-up, cool-down and fitness exercises. Three sets of features were extracted to feed the classifiers and kNN was the highest-performing classifier with an F1-score of 0.95 to 1 for certain activities and feature sets. This study addressed the gap in the literature regarding the lack of formal studies on appropriate window size selection for accurate recognition. It offered design guidelines for determining optimal window size to balance recognition speed and accuracy, depending on the complexity of the activity and the richness of the feature set.
Anguita et al. [
11] presented a public HAR dataset and used support vector machines (SVM) to achieve high recognition rates for six basic activities like walking, standing, sitting, laying down, walking downstairs, and walking upstairs. Time and frequency domain signals were obtained from the smartphone sensors including accelerometer and gyroscope data. The SVM classifier achieved an accuracy with an overall recognition rate of 96%. The article contributed a publicly available HAR dataset using smartphones. Additionally, the study showed that smartphones, which are more unobtrusive and less invasive than special-purpose sensors, can be effectively used for human activity recognition tasks with high accuracy.
Gupta and Dallas [
12] focused on classifying six daily activities: walking, running, jumping, sit-to-stand/stand-to-sit transitions, stand-to-kneel-to-stand transitions, and being stationary (sitting or standing). The study employed two classifiers, which were kNN and NB. Features were computed from accelerometer data, focusing on time-domain analysis, as the data was segmented into 6-second windows. After extracting features, some features were selected by utilizing Relief-F and sequential forward floating search algorithms. Both classifiers achieved an overall accuracy of about 98%, with more than 95% accuracy for individual activities. This study introduced new features like mean trend, windowed mean difference, and detrended fluctuation analysis, which improved classification performance for HAR. It proposed a system that reduces the dependency on sensor orientation and position around the waist, making it more practical for real-life applications with minimal user training. Additionally, it addressed a gap in prior studies that often excluded transitions by demonstrating high classification accuracy for daily activities and transitional events.
Reyes-Ortiz et al. [
13] proposed a system targeting various physical activities and transitions including walking, sitting, standing, lying down, walking upstairs, and walking downstairs. Additionally, postural transitions like sit-to-stand and stand-to-sit were included. SVM with probabilistic outputs were used to classify the activities. Both time-domain and frequency-domain features were extracted from the smartphone's triaxial accelerometer and gyroscope signals. The paper mentioned that the proposed system outperformed state-of-the-art baseline methods. The architecture improved classification by combining SVM outputs with heuristic filters, which was an innovative approach in the field.
Catal et al. [
14] utilized an ensemble of classifiers including DT, multi-layer perceptron (MLP), and logistic regression. The proposed ensemble method combined the classifiers using the average of probabilities combination rule. The activities they aimed to predict were walking, jogging, upstairs, downstairs, sitting, and standing. Time-domain features were extracted from accelerometer data to feed the classifiers. The highest accuracy was achieved by their proposed ensemble model, outperforming standalone classifiers. This study suggested that the ensemble approach could be particularly effective for complex activities and provided a strong baseline for future research in activity recognition using mobile devices.
Demrozi et al. [
15] focused on detecting freezing of gait (FoG) episodes in people with Parkinson's Disease instead of predicting daily activities. The study used kNN machine learning model to detect FoG episodes. Data were obtained from three tri-axial accelerometer sensors. The features were extracted by applying windowing and dimensionality reduction techniques. The research contributed to the field by integrating wearable technology and machine learning to address a critical health issue in Parkinson's disease management.
Climent-Pérez et al. [
16] focused on recognizing daily human activities while hiding gender and age. A many-objective evolutionary algorithm was utilized for feature weighting, and RF was utilized for classification. Both time and frequency domain features were extracted from accelerometer data to feed the classifier. The article made a significant contribution by utilizing an optimization algorithm to tackle the dual challenges of high-accuracy activity recognition and privacy preservation.
Asuroglu [
17] proposed a locally weighted random forest classifier (LWRF), which outperformed Climent-Perez et al.’s [
16] study in terms of overall accuracy that utilized random forest (RF) classifier on the same dataset. Additionally, the study benchmarked other machine learning algorithms including DT, MLP, kNN, NB, Logistic Model Tree (LMT), and SVM. The proposed model achieved the highest accuracy, with 91% for human activity recognition and 91.3% for gender recognition among the mentioned classifiers. This study introduced the first use of a local weighted approach in the domain of accelerometer signal-based human activity recognition. It improved upon the Random Forest classifier by adding local weighting to address dataset variability and achieved higher accuracy in predicting complex human activities compared to previous methods. Additionally, the study provided a robust framework for recognizing activities with a limited number of samples.
Garcia-Gonzalez et al.’s [
18] study focused on four types of human activities: inactive, active, walking, and driving. The authors employed various machine learning models, including SVM, DT, MLP, NB, kNN, RF, and Extreme Gradient Boosting (XGB). Both time-domain features and specific proposed additions such as total distance travelled and energy, based on data from accelerometer, gyroscope, magnetometer, and GPS sensors, were extracted to feed the classifiers. The features were extracted using a sliding window method with windows ranging from 20 to 90 seconds. RF outperformed the other algorithms, achieving the highest accuracy of 92.97%. This paper contributed to the literature by comparing different machine learning algorithms using real-life smartphone sensor data rather than laboratory-collected data. It also introduced the use of XGB in human activity recognition and evaluated the impact of excluding gyroscope data from the sensors.
In recent years, the focus of HAR research has shifted towards deep learning, multimodal and multi-sensor fusion approaches. Ordóñez and Roggen [
19] studied on recognizing human activities from multimodal wearable sensors, including static/periodic activities such as modes of locomotion and postures, as well as sporadic activities like gestures. In the study, a combination of Convolutional Neural Networks (CNN) and Long Short-Term Memory (LSTM) recurrent layers, referred to as the DeepConvLSTM model, was utilized. The features were learned automatically through the convolutional and recurrent layers without manual feature engineering. The raw sensor data were processed from wearable devices, such as accelerometers, gyroscopes, and magnetic sensors. The model learned both spatial and temporal dependencies in the sensor signals. The DeepConvLSTM model achieved an improvement in accuracy, outperforming previous state-of-the-art methods by up to 9%. Specifically, the F1-score of 0.930 for locomotion recognition and 0.915 for gesture recognition showed its superiority. The contribution of this paper lied in the combination of CNNs and LSTM layers for human activity recognition. This allowed for both automatic feature extraction and capturing temporal dependencies in the data. The paper also showed that this architecture is well-suited for fusing multimodal sensor data, making it highly applicable in real-life HAR scenarios. The paper outperformed existing methods on benchmark datasets (OPPORTUNITY and Skoda) and provided a robust analysis of hyperparameter tuning to further improve performance.
Similarly, Yang et al. [
20] utilized CNNs for HAR tasks and showed that these models can effectively capture spatial and temporal dependencies in sensor data. Their paper focused on recognizing human activities, such as walking, running, sitting, and other common daily movements from multichannel time series data collected through wearable devices. The paper proposed the use of deep CNN (DCNNs) as the classifier for time-series data. The authors leveraged CNNs due to their effectiveness in automatically learning spatial hierarchies from raw sensor data. The paper did not rely on manual feature extraction. Instead, features were automatically learned by the convolutional layers of the deep network, which processed the raw multichannel time series data from wearable sensors such as accelerometers and gyroscopes. The DCNN achieved the highest accuracy in the experiments, significantly outperforming traditional machine learning methods like kNN and DT. The exact accuracy metrics depend on the dataset used, but the DCNN demonstrated superior performance across multiple datasets. This paper made a substantial contribution by applying DCNNs to multichannel time series data, which was a novel approach at the time of publication. The proposed method also addressed the challenge of dealing with high-dimensional data streams generated by wearable sensors, making it a highly applicable solution in real-life HAR scenarios.
The use of deep learning models, particularly CNNs and recurrent neural networks (RNNs), has gained attention due to their ability to learn hierarchical features directly from raw sensor data, reducing the need for handcrafted features.
DeepSense, a unified deep learning framework developed by Yao et al. [
21], leveraged CNNs to process time-series sensor data for HAR, achieving competitive results across multiple datasets. The paper focused on a variety of tasks including car tracking, human activity recognition, and user identification based on time-series data from mobile sensors such as accelerometers, gyroscopes, and magnetometers. The framework combined CNNs and recurrent neural networks (RNNs). CNNs were used for extracting spatial features from raw sensor data, while RNNs captured the temporal dependencies across time-series data. This approach eliminated the need for manual feature engineering. DeepSense outperformed other methods and achieved state-of-the-art results across multiple tasks. For car tracking, it had a Mean Absolute Error (MAE) of approximately 1.4 meters. For human activity recognition, it achieved an accuracy of up to 95%. For user identification, the accuracy reached up to 96%. The framework was also optimized for deployment on mobile devices, demonstrating feasibility in terms of computation and energy consumption. This made it a versatile and practical solution for real-world sensing applications.
In another study, Ignatov [
22] introduced a real-time HAR system based on CNNs, which was optimized for smartphone accelerometer data, achieving high classification accuracy. The paper focused on recognizing human activities such as walking, jogging, standing, sitting, walking upstairs, and walking downstairs using accelerometer data from smartphones. The proposed system achieved an accuracy of 96.75% and 94.8%, outperforming other models, on the WISDM dataset and UCI HAR dataset, respectively. This study contributed by demonstrating that CNNs, when combined with statistical features from time-series data, can achieve real-time performance with low computational cost. The paper also highlighted that the system can process 1-second time windows in real time, making it highly suitable for continuous monitoring in activity recognition systems.
The integration of deep learning methods has significantly improved the robustness and scalability of HAR systems, particularly in real-world applications where noise and variability in sensor data can pose challenges.
Guan and Plötz [
23] aimed to predict human activities such as walking, running, sitting, standing, lying down, and other activities typically captured by wearable sensors. The study employed an ensemble of LSTM learners due to their ability to capture long-range dependencies for time-series data. No manual feature extraction was involved, as the LSTM learns temporal and sequential patterns from raw sensor data recorded by accelerometers, gyroscopes, and magnetometers. The ensemble of LSTM learners achieved the best performance, outperforming single LSTM models and other deep learning methods. The exact accuracy varied depending on the dataset, but it consistently improved performance on benchmarks like the Opportunity, PAMAP2, and Skoda datasets. The ensemble method was particularly effective in improving accuracy in the presence of noisy and imbalanced data. The paper introduced the concept of epoch-wise bagging to generate multiple diverse LSTM models, which improves the generalization of the ensemble model.
Murad and Pyun [
24] targeted to predict a wide range of human activities, depending on the dataset used, such as walking, running, ascending/descending stairs, standing, sitting, laying, and other daily activities. The study proposed and implemented several deep recurrent neural network (DRNN) architectures based on LSTM, including unidirectional LSTM-based DRNN, bidirectional LSTM-based DRNN, and cascaded bidirectional and unidirectional LSTM-based DRNN. The models used raw multimodal sensor data collected from accelerometers, gyroscopes, and magnetometers. The data was fed into the networks without any handcrafted features. The bidirectional LSTM-based DRNN yielded the best performance on complex datasets like the Opportunity dataset, while the unidirectional DRNN performed best on the UCI-HAD and USC-HAD datasets. For UCI-HAD, the overall classification accuracy reached 96.7%, and for USC-HAD, it reached 97.8%. The study introduced models capable of handling variable-length input sequences. Additionally, it demonstrated superior performance of DRNNs compared to both traditional machine learning classifiers like SVM and kNN and other deep learning approaches like CNN and deep belief networks.
Hu et al. [
25] used LSTM and gated recurrent unit (GRU) to model time-series sensor data and capture temporal dependencies in human activity. No manual feature extraction was performed, as the raw sensor data from accelerometers and gyroscopes is fed directly into the LSTM layers, allowing the network to automatically learn spatial and temporal features. The harmonic loss function applied to LSTM and GRU outperformed other traditional loss functions like cross-entropy loss and focal loss, especially on imbalanced datasets. The contribution of the study to the literature is to introduce a novel harmonic loss function and improve classification accuracy and macro-F1 scores for HAR, particularly on imbalanced datasets, which is a common issue in real-world sensor-based HAR applications.
To the best of our knowledge, this study is the first one that utilizes Kolmogorov-Arnold Network (KAN) on HAR and gender recognition. A minor contribution is that one-versus-all classifications for 24 classes were performed on the PAAL ADL Accelerometery dataset v2.0 [
26]. Additionally, statistical methods were applied to detect the statical significance between the previous methods.
The motivation for employing the KAN method can be elucidated within the context of the data. The dataset, utilized in this study, contains significant class imbalances, with certain activities (e.g., sneezing/coughing, brushing hair) having far fewer samples compared to others (e.g., making a phone call, writing). This imbalance poses challenges for traditional classifiers, which often struggle to maintain performance for minority classes. KAN’s adaptive spline-based activation functions may enable it to model non-linear relationships effectively, even for classes with fewer samples. This adaptability may help it outperform traditional machine learning models in recognizing minority class activities. Variations in how participants perform the same activity (e.g., putting on glasses or standing up) introduce noise and increase the complexity of classification tasks. Additionally, the dataset spans a broad demographic, with participants aged 18–77 years, including both genders. This diversity enriches the dataset but also adds to the complexity of classification tasks. KAN’s architecture may help at capturing complex dependencies in multi-dimensional data, such as the combined time-domain and frequency-domain features extracted from accelerometer signals. Unlike traditional neural networks with fixed activation functions, KAN’s learnable activation functions may allow for dynamic adjustments based on the data, making it particularly suited for datasets with high variability and noise. Finally, by leveraging the Kolmogorov-Arnold representation theorem, KAN may provide a robust framework for function approximation, helping to reduce overfitting, which is a risk in datasets with class imbalance and noise.
The rest of the paper is organized as follows;
Section 2 gives information about the utilized dataset, preprocessing steps, and general definition of KAN. In section 3, evaluation metrics, experimental setup, and empirical results are demonstrated. In section 4, discussion for the empirical results is presented and the article is concluded.
2. Materials and Methods
2.1. Dataset
The dataset utilized in this study is publicly available at
https://zenodo.org/record/5785955 [
26]. There are 52 participants in total, comprising 26 men and 26 women. The participants' ages are distributed across different age groups. The age range is between 18 and 77 years. The average age of the participants in the study is 44.08, with a standard deviation of 17.06, years. The data was collected in Spain, particularly in Alicante province, at various locations such as participants' homes or workplaces. The dataset includes 24 different activities of daily living. These are drinking water, eating meal, opening a bottle, opening a box, brushing teeth, brushing hair, taking off jacket, putting on jacket, putting on a shoe, taking off a shoe, putting on glasses, taking off glasses, sitting down, standing up, writing, making a phone call, typing on a keyboard, saluting, sneezing/coughing, blowing nose, washing hands, dusting, ironing, and washing dishes.
A total of 6,072 files were collected, with up to 5 repetitions of each activity per participant by using the Empatica E4 wrist-worn device. The Empatica E4 is equipped with an accelerometer (used for motion tracking), as well as various sensors measuring blood volume pulse, skin temperature and electrodermal activity. The E4 device records accelerometer data at 32 Hz, ensuring a good temporal resolution for capturing subtle movements. The Empatica E4 was primarily designed for continuous, non-invasive monitoring, making it ideal for health and activity research. The accelerometer in the E4 device was used to collect 3D (x, y, and z axis) acceleration signals, which are crucial for recognizing physical activities based on wrist motion.
Table 1 shows the class labels, daily living activities, and the number of samples that belong to each activity. The class labels, used in this study, were prepared considering Asuroglu’s study [
17] to ensure consistency for prediction tasks. As can be seen from
Table 1, the dataset is imbalanced.
To make data available for feature extraction, a two-step process is employed. First, a fourth-order Butterworth low-pass filter with a 15Hz cutoff frequency is applied to retain motion characteristics while filtering out high-frequency noise. Then, a third-order median filter removes outliers. Data are segmented into windows with a 5-second sliding window, overlapping by 20%, yielding 28,642 samples for feature extraction [
17].
2.2. General Framework
The aim of this study is to predict human activities and classify gender from accelerometer data utilizing Kolmogorov-Arnold Network (KAN) classifier. The first step involves extracting time-domain and frequency-domain features from 3D accelerometer data. Next, these time and frequency features are combined to obtain a feature vector, which feeds the KAN classifier as input. A standardization process, also known as z-score normalization, is applied to the features so that each feature has a mean of 0 and a standard deviation of 1. To train a KAN model, the 10-fold cross validation technique is utilized. Additionally, a grid search approach is employed to optimize the hyper-parameters of the KAN model, maximizing the overall accuracy evaluation metric. Once the models for both multi-class HAR and gender recognition tasks are obtained, a query sample is provided to the classifiers to predict the activity and the gender in which the individual is engaged and belongs, respectively. The general framework of the proposed method is shown in
Figure 1.
2.3. Feature Extraction
Feeding raw accelerometer signal values directly into the proposed classification framework is impractical for several reasons. First, most machine learning classifiers require equal-length input, which accelerometer signals do not provide, as they vary in length [
27]. Additionally, even when signals have the same length, they may not align in the same time dimension, complicating the classification process [
28]. This variability makes preprocessing necessary before applying machine learning algorithms to accelerometer data.
In two previous studies [
16,
17], various time and frequency domain features were extracted from raw accelerometer signals to capture essential motion characteristics. This study adopts the same approach as previous works, extracting features from the x, y, and z channels of the accelerometer, along with the signal magnitude vector (SMV). In total, 62 features are extracted from these channels and vectors. The extracted features for time-domain are: mean, median, standard deviation (
, maximum, minimum, and range (for x, y, z axes and SMV); correlation between axes (x-y, y-z, and z-x), signal magnitude area (for SMV); coefficient of variation and median absolute deviation (for x, y, z axes and SMV); skewness, kurtosis, autocorrelation, interquartile range, 20
th, 50
th, 80
th, 90
th percentiles, number of peaks, peak to peak amplitude, energy, and root mean square (for SMV). The extracted features for frequency-domain are: spectral entropy, energy, and centroid (for SMV); mean and
(for x, y, z axes and SMV); 25
th, 50
th, and 75
th percentiles (for SMV). As a result, 62 features were utilized as samples to feed the classifier, where 48 of them belong to time-domain and 14 of them belong to frequency domain.
The extracted features that belong to time and frequency domains are then combined to create a comprehensive feature vector, representing each observation in the dataset.
2.4. Normalization
In this study, z-score normalization (standardization) was utilized to normalize data so that the features had a mean of 0 and a standard deviation of 1. This transformation is particularly useful for comparing features that have different units or scales. The formula is given in Eq. 1.
where
,
,
, and
represent the standardized value, the original data point, the mean and the standard deviation of the values for the related feature, respectively.
2.5. Kolmogorov-Arnold Network (KAN)
KAN is a type of neural network inspired by the Kolmogorov-Arnold representation theorem, which provides a method for representing multivariate functions using univariate functions and addition. KANs differ from traditional Multi-Layer Perceptrons (MLPs) by using learnable activation functions on edges (weights) rather than fixed activation functions on nodes (neurons) [
29]. Specifically, KANs replace each weight parameter with a univariate function, often parametrized as a spline, making KANs highly adaptable and allowing them to model complex functions more efficiently. KANs are primarily used in function-fitting tasks within scientific and mathematical contexts, where high accuracy and interpretability are crucial. They excel in applications that involve discovering mathematical and physical laws, as well as small-scale artificial intelligence and science tasks such as solving partial differential equations. They show potential in fields that require a balance between performance and interpretability, where they can serve as collaborative tools for scientific discoveries, assisting researchers in rediscovering or approximating complex functional relationships [
30,
31].
In a KAN, a function
can be approximated as follows (2):
where,
represents the
-dimensional input vector. Each
represents one dimension of the data to be processed. The univariate activation functions are symbolized as
and applied to each input dimension. Each
function is a learnable univariate function, parametrized as a spline. This differentiates KANs from traditional neural networks by allowing the activation function on each edge to be learned, enabling more precise approximations of complex functions. In other words, each
is a function mapping one component of the input
to an intermediate representation. Higher-level, univariate functions are symbolized as
and they combine outputs from the
functions. Each
takes as input the sum of the outputs from the corresponding
functions and applies a second level of non-linear transformation. Like
, each
is also a learnable function, allowing flexibility in representation and enabling the model to capture complex dependencies between input dimensions. The outer summation (
) combines the transformed inputs across all dimensions, effectively summing up the contributions from each univariate function combination. This summing operation aligns with the Kolmogorov-Arnold theorem, which states that any continuous multivariate function can be represented by sums and compositions of univariate functions.
Unlike in MLP, KAN does not contain linear weight matrices. Instead, each weight parameter is replaced by a learnable function, which is parametrized as a B-spline. The adaptability of splines enables them to model complex relationships within the data by adjusting their form through coefficients. A general formula for a spline function of order
can be represented utilizing a linear combination of B-spline basis functions as follows:
where,
represents a spline function;
defines the input or feature vector;
is a coefficient corresponding to the
th B-spline basis function, which is a learnable parameter;
corresponds to the
th B-spline basis function of order
.
Each basis function is a piecewise polynomial function defined over specific intervals or knots and determines the local behaviour of the spline. The order of the B-spline basis specifies its degree of continuity and smoothness; for instance, represents a piecewise linear function, represents a piecewise quadratic function.
Technically, the structure of a KAN is defined by an array of integers as follows:
where,
represents the number of nodes in the
layer of the network. The
neuron in the
layer is represented by (
), whereas
stands for the activation value of the
neuron in the
layer.
activation functions exist between the layers
and
. The activation function connecting (
) and (
) is represented as follows:
where,
,
, and
range from 0 to
, 1 to
, and 1 to
, respectively. The input for the activation function
is
, whereas the output of
is represented as follows:
The output of the neuron (
) is calculated by summing all output values of the incoming activation functions. The formula is given below (7):
where,
ranges from 1 to
.
can be represented in matrix form as follows:
where, the matrix in Eq. 8 represents a function matrix,
, demonstrating the
layer of the KAN architecture. Therefore, an L-layered KAN architecture can be generalized as follows:
where,
corresponds to the input vector, which feeds the network. For comparison with KAN, an MLP can be written as interleaving of affine transformations
and non-linearities
σ. Generalized formula for an MLP is given in Eq. 10:
As can be seen from the equations (9) and (10), MLP architecture distinctly separate linear transformations, represented by , from nonlinearities, represented by , whereas KAN integrates both into a unified representation denoted as .
2.6. Experimental Setup
10-fold cross-validation (CV) was employed in the experiments to enable comparison with previous studies [
17]. In the10-fold CV technique, the dataset is split into 10 parts. To keep the imbalanced ratio, each part has sufficient number of samples from each class. The first part is removed and used as the test set while the remaining parts are used as the training set. This procedure continues until all the parts have been used as test tests. To prevent the appearance of nearly identical samples in both training and test sets, participant-wise cross-validation was employed instead of random shuffling. In order to attain this objective, all data from a single participant were included entirely in either the training set or the test set within each fold of the stratified 10-fold cross-validation. This approach prevents any overlap between training and test samples, as data from different participants are inherently independent.
In addition to multi-class classification for HAR, binary classification experiments were applied to detect a specific activity from a large number of samples by utilizing one-versus-all technique. In a multi-class classification task with
classes, the goal is to classify instances into one of the
K possible categories. One-versus-all classification is a technique that transforms a multi-class classification problem into
binary classification problems. Each binary classifier is responsible for distinguishing between instances of one class and instances of all other classes. Formally, for a given set of classes
a binary classifier
is trained to output a binary decision, given in equation below (11):
For the proposed KAN classifier, a grid search was employed to optimize the hyper-parameters. The ranges of the hyper-parameters are given for every scenario including HAR and gender recognition in
Table 2. For one-vs-all classification, the hyper-parameters that were optimized during the HAR were utilized.
It would be advantageous to provide a concise overview of the hyperparameters. Hidden layers represent a list that defines the number of neurons in each hidden layer of a neural network. This parameter defines the network's structure by setting both the depth (number of hidden layers) and the width (number of neurons in each layer). In our work, since the number of input neurons should match the number of features, and the number of output neurons should correspond to the number of classes, first layer consists of 62 neurons and the last layer consists of 24 and 1 neurons for multi-class HAR and gender recognition, respectively. The configuration of layers influences the network's ability to learn complex patterns. More layers or a higher number of neurons allow the model to capture intricate relationships but may also increase the risk of overfitting.
Grid size represents the number of intervals (or knots) used in the B-spline grid. It determines the granularity of partitioning in the input space for the B-spline basis functions, affecting the resolution of the model. The total count of B-spline basis functions used in the model is equal to summation of the values of grid size and spline order hyper-parameters. In the KAN, grid size determines the number of grid points created for each input feature. These grid points facilitate the flexibility of the B-splines.
Learning rate controls the magnitude of weight updates during each iteration of the training process. Specifically, it dictates the step size at which the network adjusts its weights to minimize the error between the predicted outputs and the actual labels. An appropriately chosen learning rate helps to avoid overfitting and underfitting, thereby enhancing model accuracy and stability.
Spline order defines the degree of the splines used for computing B-spline-based weights. It controls the smoothness and local interactions of the B-splines, with higher orders providing more flexible and smoother function approximations, while lower orders result in less flexibility. The choice of spline order should be balanced against the risk of introducing excessive complexity, which could degrade generalization performance.
Scale base acts as a scaling factor for initializing the base weights in the model. It helps stabilize the learning process and ensures the weights start with an appropriate distribution, which is particularly important for managing gradient propagation and learning speed in deep architectures. Smaller initial weights may lead to slower learning but can improve training stability by reducing the risk of large activation values, whereas larger initial weights can accelerate learning by providing a larger starting variance for activations.
Scale spline serves as a scaling factor utilized during the initialization of the spline weights. The scaling of spline weights influences the effect of B-spline transformations on the input data, playing a key role in optimizing the model's flexibility and overall performance. Correct scaling ensures that the model starts with an appropriate level of non-linear influence, which can enhance learning efficiency and stability. A higher value means that the spline component has a more significant initial effect, which can help the model capture non-linear patterns in the data early in the training process.
Batch size specifies the number of training examples used in one iteration of the training process, often referred to as a mini-batch. Smaller batch sizes may lead to noisier gradient estimates, potentially causing the learning process to be less stable but improving generalization. Larger batch sizes can offer smoother gradients, making the learning process more stable, but may require more computational resources.
Optimizer refers to the algorithm utilized to update the model's weights during training based on the computed gradients. It determines how the model learns from the data by adjusting the parameters to minimize the loss function. The most utilized optimizers are Adaptive Moment Estimation (Adam), Root Mean Square Propagation (RMSprop), and Stochastic Gradient Descent (SGD).
Table 3 demonstrates the values of the optimized hyper-parameters that made the overall accuracy highest.
4. Discussion and Conclusion
Except for the activities of making a phone call and writing, the performance results for the other 22 activities in the dataset are higher than those reported in Asuroglu's [
17] study for multi-class HAR. This may be due to the lack of variation in the z-axis over extended periods during these two activities. It can be said that LWRF avoids overfitting in the models created to detect these activities by using bagging and boosting methods to build the trees. However, KAN has not achieved the same level of generalization as LWRF for these two activities, which, in fact, are two of the three activities with more than 3000 samples. On another activity, which is taking of a shoe, has also more than 3000 samples and KAN outperformed LWRF. Therefore, it cannot be concluded that KAN's performance decreases relative to LWRF as the sample size increases. In this case, it would be reasonable to assume that the action of taking of a shoe contains raw signals that represent the activity and provide variability, as it causes changes in the accelerometer's z-axis.
In one-vs-all classification, due to the limited number of samples for the activity to be detected - in other words, due to the imbalance of the dataset - the problem is closer to a real-world scenario. Since the target activity is a minority in the dataset, the standardization process compresses the feature values of the activity into a specific range, leading to the loss of unique values associated with the activity. Therefore, it can be argued that standardization negatively impacts performance in cases where detecting a specific activity is necessary. Consequently, the KAN method has improved the detection rate of the relevant activity in an imbalanced dataset without requiring an additional preprocessing step (standardization) in binary classification. Additionally, as can be seen from
Figure 5, making a general conclusion based on sensitivity and sample size would not be reasonable.
A benchmark has been provided for other studies by presenting the results obtained through one-versus-all classification both with and without standardization. It is hoped that this will serve as a baseline for researchers using the same dataset.
In this study, only time and frequency domain features were used in order to compare the results with studies utilizing the same dataset. The exclusion of using time domain and frequency domain features individually could be considered a limitation of this study. Another limitation of our study is the lack of examination of the KAN algorithm's performance on different datasets.
In one-vs-all classifications, hyperparameter optimization was not performed for each activity individually. Instead, training was conducted using the hyperparameters optimized for the multi-class classification setting. Investigating whether results obtained through hyperparameter optimization for each specific activity could surpass our presented benchmark constitutes a new area of study. This approach holds promise for future research.
In future studies KAN’s performance on different datasets will be investigated. Another future work can be done as replacing soft-max layers, that conclude the final class, in deep learning algorithms with KAN.
To obtain a balanced dataset, the number of training samples can be increased by utilizing synthetic time series data generation methods, such as Generative Adversarial Networks. [
38,
39]. The inclusion of synthetic samples can address the data imbalance issue in the PAAL Dataset and thus can improve prediction performance.
In conclusion, the KAN model proposed in this study outperformed the previous studies in both HAR and gender recognition tasks. In the context of multi-class HAR, on the same dataset, KAN outperformed the models by Perez et al. [
16] and Asuroglu [
17], surpassing the overall accuracy by 7.3% and 3.5%, respectively. In the context of binary-class gender recognition, KAN outperformed the previous two models surpassing the overall accuracy by 6.7% and 4.3%, respectively. The utilization of KAN in the field of HAR has demonstrated its effectiveness, particularly in detecting activities with low sample sizes.
HAR has evolved significantly with the advent of machine learning and deep learning techniques, and the combination of multimodal sensor data and advanced models continues to push the boundaries of activity recognition accuracy. The ongoing development of robust, scalable, and efficient HAR systems holds great promise for applications in healthcare, sports, and daily life monitoring. Advancements in the field of daily activity recognition, along with the display of personalized advertisements and recommendations based on location and time through smartwatches and smartphones, will further enhance the importance of this field in e-commerce and daily shopping activities.