1. Introduction
Effective feature representation of videos is key to action recognition. Spatiotemporal features [1,2], subspace features [3,4], and label information [5] have been investigated for action recognition. Correlations between multiple features may provide distinctive information; hence, feature correlation mining has been explored to improve recognition results when labeled data are scarce [4,6]. However, these approaches have limitations in learning discriminative features. First, although existing algorithms evaluate the common structures shared among different actions, they do not take inter-class separability into account. Second, current semi-supervised approaches solve the nonconvex optimisation problem through intricate derivations, but the alternating least squares (ALS) iterative method cannot mathematically guarantee a global optimum.
To overcome the limitations of using multiple features for training, we propose modelling intra-class compactness and inter-manifold separability simultaneously, and then capturing high-level semantic patterns via multiple-feature analysis. For the optimisation process, we introduce the PBB algorithm because of its effectiveness in obtaining an optimal solution [7]. The PBB method is a non-monotone line-search technique for the minimisation of differentiable functions on closed convex sets [8].
Inspired by research using multiple features [5,6], our framework was extended in a multiple-feature-based manner to improve recognition. We propose characterising high-level semantic patterns through low-level action features using multiple-feature analysis. Multiple features are extracted from different views of labeled and unlabeled action videos. Based on the constructed graph model, pseudo-label information for unlabeled videos can be generated by label propagation and feature correlations. For each type of feature, nearby samples preserve consistency separately, while label prediction on the unlabeled training data jointly enforces the global consistency of multiple features. Thus, an adaptive semi-supervised action classifier is trained. The main contributions can be summarized as follows:
(1) This work is the first to simultaneously consider manifold learning and Grassmannian kernels in semi-supervised action recognition, as we assume that action video samples may lie on a Grassmannian manifold. By modelling an embedding manifold subspace, both inter-class separability and intra-class compactness are considered.
(2) To solve the unconstrained minimisation problem, we incorporate the PBB method to avoid matrix inversion and apply a globalisation strategy via adaptive step sizes, which allows the objective function values to behave non-monotonically, leading to improved convergence and accuracy.
(3) Extensive experiments verified that our method outperforms other approaches on three benchmarks in a semi-supervised setting. We believe that this study presents valuable insights into adaptive feature analysis for semi-supervised action recognition.
4. Experiments
The proposed method, called Kernel Grassmann Manifold Analysis (KGMA), is summarised in Algorithm 1. Conventional variants that use the SPG [10] and ALS methods instead of PBB, called kernel spectral projected gradient analysis (KSPG) and kernel alternating least squares analysis (KALS), respectively, were also adopted to solve the objective function (8) for comparison in our experiments.
Features. For handcrafted features, we follow [10] to extract improved dense trajectories (IDT) and Fisher vectors (FV), as shown in Figure 2. For deep-learned features, we retrained the temporal segment network (TSN) [2] models of 15×c, and then extracted the global pooling features of 15×c using the pretrained TSN model, concatenating the RGB and flow streams into 2048 dimensions with power L2-normalisation, as listed in Table 1.
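The power L2-normalisation step can be sketched as follows; the function name, the power exponent, and the 1024-dimensional per-stream size are our illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def power_l2_normalise(rgb_feat, flow_feat, alpha=0.5):
    """Concatenate RGB and flow pooled features, then apply signed power
    (square-root) normalisation followed by L2 normalisation."""
    x = np.concatenate([rgb_feat, flow_feat])   # e.g. 1024 + 1024 = 2048 dims
    x = np.sign(x) * np.abs(x) ** alpha         # power normalisation
    norm = np.linalg.norm(x)
    return x / norm if norm > 0 else x

# toy usage: two 1024-d pooled feature vectors (random stand-ins)
rgb = np.random.randn(1024).astype(np.float32)
flow = np.random.randn(1024).astype(np.float32)
feat = power_l2_normalise(rgb, flow)            # 2048-d, unit L2 norm
```

Power normalisation dampens bursty feature dimensions before the final L2 scaling, which is why it is commonly paired with Fisher-vector and pooled CNN features.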
We verified the proposed algorithm using three kernels: the projection kernel , the canonical correlation kernel , and the combined kernel . In some cases one kernel is better than the other, and vice versa in others, suggesting that a combination of kernels better accommodates different data distributions. For the combined kernel, the mixing coefficients and were fixed at one. We obtained better results by combining the two kernels.
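As a rough sketch of how such Grassmannian kernels are commonly computed (the paper does not reproduce its formulas here, so the definitions below — a squared-Frobenius projection kernel and a largest-canonical-correlation kernel, mixed with coefficients fixed at one — are standard choices we assume for illustration):

```python
import numpy as np

def orthonormal_basis(a):
    """Orthonormal basis of span(a) via QR: a point on the Grassmannian."""
    q, _ = np.linalg.qr(a)
    return q

def projection_kernel(x, y):
    """Projection kernel: squared Frobenius norm of X^T Y."""
    return np.linalg.norm(x.T @ y, 'fro') ** 2

def cc_kernel(x, y):
    """Canonical correlation kernel: largest singular value of X^T Y."""
    return np.linalg.svd(x.T @ y, compute_uv=False)[0]

def combined_kernel(x, y, b1=1.0, b2=1.0):
    """Combined kernel with both mixing coefficients fixed at one."""
    return b1 * projection_kernel(x, y) + b2 * cc_kernel(x, y)

# toy usage: a rank-3 subspace of R^10 compared with itself
x = orthonormal_basis(np.random.randn(10, 3))
k = combined_kernel(x, x)   # ||X^T X||_F^2 = 3 plus sigma_max = 1, so k = 4
```

For orthonormal bases the self-similarity is the subspace rank plus one, which gives a quick sanity check on an implementation.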
Datasets. Three datasets were used in the experiments: JHMDB, HMDB51, and UCF101 [1]. The JHMDB dataset has 21 action categories; the average recognition accuracies over three training–test splits are reported. The HMDB51 dataset contains 51 action categories; we report the mean average precision (mAP) over three training–test splits. The UCF101 dataset includes 101 action categories and 13,320 video clips; the average accuracy on the first split is reported.
For the JHMDB dataset, we followed the standard data partitioning (three splits) provided by the authors. For the other datasets, we used the first split provided by the authors and applied the original testing sets for a fair comparison. Because the semi-supervised training set contains unlabeled data, we performed the following procedure to reform the training set for each dataset. The class number is denoted by c for each dataset (c = 21, 51, and 101 for JHMDB, HMDB51, and UCF101, respectively).
Using JHMDB as an example, we first randomly selected 30 training samples per category to form a training set ( samples). From this training set, we randomly sampled m videos (m = 3, 5, 10, and 15) per category as labeled samples. Therefore, if , labeled samples are available, leaving () videos as unlabeled samples for the semi-supervised training setting. We used the standard test set for testing. Owing to the randomly selected training samples, the experiments were repeated 10 times to avoid bias.
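The split construction above can be sketched as follows; all function and variable names are our own illustrative choices:

```python
import random

def make_semi_supervised_split(samples, m, per_class=30, seed=0):
    """Randomly pick `per_class` training videos per category, then mark
    `m` of them per category as labeled; the rest become unlabeled.
    `samples` maps a class label to its list of video ids."""
    rng = random.Random(seed)
    labeled, unlabeled = [], []
    for cls, vids in samples.items():
        train = rng.sample(vids, per_class)          # 30 training videos
        lab = set(rng.sample(train, m))              # m labeled per class
        labeled += [(v, cls) for v in train if v in lab]
        unlabeled += [v for v in train if v not in lab]
    return labeled, unlabeled

# toy usage: 21 classes (JHMDB-like), 40 candidate videos per class, m = 5
data = {c: [f"c{c}_v{i}" for i in range(40)] for c in range(21)}
labeled, unlabeled = make_semi_supervised_split(data, m=5)
```

Repeating this with 10 different seeds reproduces the repeated-trials protocol used to avoid sampling bias.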
To demonstrate the superiority of our approach (KGMA), we adopted the following methods for comparison: SVM, SFUS [35], SFCM [3], MFCU [4], KSPG, and KALS. Notably, SFUS, SFCM, MFCU, KSPG, and KALS are semi-supervised action recognition approaches. Using the publicly available codes facilitates a fair comparison.
Table 1. Comparison with deep-learned features (average accuracy ± std) when training videos are labeled
| Method | JHMDB | HMDB51 | UCF101 |
| --- | --- | --- | --- |
| SFUS | 0.6942 ± 0.0121 | 0.5217 ± 0.0114 | 0.7910 ± 0.0087 |
| SFCM | 0.7125 ± 0.0099 | 0.5394 ± 0.0108 | 0.8070 ± 0.0101 |
| MFCU | 0.7154 ± 0.0088 | 0.5556 ± 0.0098 | 0.8429 ± 0.0085 |
| SVM- | 0.6931 ± 0.0106 | 0.5190 ± 0.0095 | 0.8138 ± 0.0108 |
| SVM-linear | 0.7140 ± 0.0086 | 0.5385 ± 0.0077 | 0.8450 ± 0.0087 |
| KSPG | 0.7287 ± 0.0114 | 0.5697 ± 0.0833 | 0.8552 ± 0.0111 |
| KALS | 0.7218 ± 0.0087 | 0.5607 ± 0.0098 | 0.8411 ± 0.0095 |
| KGMA | 0.7361 ± 0.0096 | 0.5762 ± 0.1040 | 0.8673 ± 0.0087 |
For the semi-supervised parameters of SFUS, SFCM, MFCU, KSPG, KALS, and KGMA, we follow the same settings used in [3,4], ranging over { }. Because our algorithm is not sensitive to the PBB parameters, we initialised them as in [7], as indicated in Algorithm 1. Notably, since KGMA applies PBB to find the optimal value of objective function (8), the convergence is non-monotonic with oscillating objective function values, as shown in Figure 3. Thus, the absolute error alone made it difficult to determine when to stop iterating; the relative error of the objective function values served as a better stopping indicator. We chose the constant as the iteration-stopping criterion in (9).
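A minimal sketch of a relative-error stopping rule of this kind follows; the tolerance value is a placeholder, not the constant chosen in the paper:

```python
def should_stop(f_prev, f_curr, tol=1e-4, eps=1e-12):
    """Stop iterating when the relative change of the objective value falls
    below `tol`. The relative criterion tolerates the value oscillation of
    non-monotone methods better than an absolute threshold."""
    return abs(f_curr - f_prev) / max(abs(f_prev), eps) < tol

# with large objective values, the absolute change can stay sizable while
# the relative change becomes negligible
assert should_stop(1000.0, 1000.05)   # relative change 5e-5 < 1e-4
assert not should_stop(1.0, 1.5)      # relative change 0.5
```

In practice such a rule is often combined with a cap on the iteration count, since oscillation can delay the first sub-tolerance step.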
Mathematical Comparisons. The recognition results with handcrafted features on the three datasets are shown in Figure 2. The comparison with deep-learned features is given in Table 1.
Regarding the presented objective function (8), Figure 3 summarizes the computational results of the three optimization methods. Using the 2048-dimensional deep-learned TSN features on the JHMDB dataset, we trained the model with only 15 labeled and 15 unlabeled samples per class and set the same semi-supervised parameters , so that the performance differences in solving the same objective function could be compared in terms of running time, number of iterations, absolute error, relative error, and objective function value.
Figure 3 shows the convergence curves of the three optimization methods. Since both SPG and PBB are non-monotonic optimization methods with relatively large fluctuations in objective function values, we omitted their first 29 iterations in Figure 3 and only display the data from the 30th iteration onward, so as to better illustrate the monotonic convergence of ALS.
As shown in Table 2, for a randomly selected video sample, ALS exhibited the fewest iterations and the shortest running time of 0.1220 seconds after extracting the deep features with TSN. In contrast, PBB exhibited the most iterations and the longest running time of 0.4212 seconds, while SPG's performance was intermediate between ALS and PBB. Considering Figure 3 and Table 2, it is evident that, despite using the slower PBB optimization method, our KGMA algorithm still achieves the highest accuracy on the kernelized Grassmann manifold space. Nevertheless, solving (9) with SPG yields only a marginal improvement over ALS, which is likely attributable to our novel kernelized Grassmann manifold space.
Performance on Action Recognition. A linear SVM was utilised as the baseline. Based on the comparisons, we observe the following: 1) KGMA achieved the best performance; our semi-supervised algorithm outperformed the linear SVM, a widely used supervised classifier; 2) all methods achieved better performance when using more labeled training data, as shown in Figure 2, or when enlarging the range of the semi-supervised parameter (i.e., ), as in Figure 4; 3) averaging the accuracies over the , , , and cases, the recognition of KGMA on JHMDB, HMDB51, and UCF101 improved by 2.97%, 2.59%, and 2.40%, respectively. When using TSN features, the recognition of KGMA on the above-mentioned datasets improved by 2.21%, 3.77%, and 2.23%, respectively. Evidently, our semi-supervised method can improve recognition by leveraging unlabeled data, compared to a linear SVM using labeled data only. Figure 2 illustrates that our algorithm benefits from the multiple-feature analysis, the kernelized Grassmann space, and the iterative scheme of the PBB method.
These results can be attributed to several factors. First, our method not only leverages semi-supervised learning but also models intra-class action variation and inter-class action ambiguity simultaneously. Therefore, it gains a more significant performance margin than other approaches when labeled samples are few. Second, we uncover the action feature subspace on the Grassmannian manifold by incorporating Grassmannian kernels, and solve the objective function optimisation mathematically through an adaptive line-search strategy and the PBB method. Hence, the proposed algorithm works well in the few-label case.
Convergence Study. Following the objective function (4), we conducted experiments with the TSN feature, fixed the semi-supervised parameters , and then executed both the ALS and PBB methods 10 times. The results are listed in Table 2. Although the convergence of ALS shows no oscillation and requires fewer iterations, the PBB method can outperform ALS for three reasons. First, the PBB method uses a non-monotone line-search strategy to globalise the process [8], which can reach the global optimal objective function value rather than being trapped in local optima like the monotone ALS method. Second, adaptive step sizes are an essential characteristic that determines efficiency in the projected gradient methodology [8], whereas such iteration-step techniques are not considered in ALS. Finally, the efficient convergence properties of the projected gradient method have been established because the PBB is well defined [8].
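The projected Barzilai-Borwein iteration with a non-monotone (GLL-style) line search can be sketched on a generic smooth problem as follows; this is our illustrative reconstruction under standard definitions, not the authors' exact Algorithm 1:

```python
import numpy as np

def pbb_minimise(f, grad, project, x0, max_iter=500, memory=10, tol=1e-8):
    """Projected BB sketch: BB step sizes plus a non-monotone line search
    that only requires descent w.r.t. the max of the last `memory` values."""
    x = project(x0)
    history = [f(x)]
    alpha = 1.0
    g = grad(x)
    for _ in range(max_iter):
        d = project(x - alpha * g) - x           # projected descent direction
        if np.linalg.norm(d) < tol:
            break
        f_ref = max(history[-memory:])           # non-monotone reference value
        lam, gd = 1.0, g @ d
        while f(x + lam * d) > f_ref + 1e-4 * lam * gd and lam > 1e-10:
            lam *= 0.5                           # backtracking
        s = lam * d
        x_new = x + s
        g_new = grad(x_new)
        y = g_new - g
        sy = s @ y
        alpha = (s @ s) / sy if sy > 1e-12 else 1.0   # BB1 step size
        x, g = x_new, g_new
        history.append(f(x))
    return x, history[-1]

# toy usage: minimise ||x - c||^2 over the nonnegative orthant
c = np.array([1.0, -2.0, 3.0])
proj = lambda z: np.maximum(z, 0.0)
x_opt, f_opt = pbb_minimise(lambda z: np.sum((z - c) ** 2),
                            lambda z: 2 * (z - c), proj,
                            x0=np.zeros(3))
# the optimum is the projection of c onto the orthant: [1, 0, 3]
```

The `f_ref` line shows why the objective values may oscillate: a step is accepted as long as it improves on the worst of the recent values, not on the immediately preceding one.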
Computational Complexity. In the training stage, we computed the Laplacian matrix L, whose complexity is . To optimise the objective function, we computed the projected gradient and trace operators of several matrices; the complexity of these operations is .
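Since the section does not detail the graph construction behind L, the following is a generic k-NN Gaussian-kernel Laplacian sketch consistent with the description; the neighbourhood size and bandwidth are illustrative:

```python
import numpy as np

def knn_graph_laplacian(features, k=5, sigma=1.0):
    """Build a k-NN affinity graph with a Gaussian kernel and return the
    unnormalised graph Laplacian L = D - W; O(n^2 d) for n samples of
    dimension d, dominated by the pairwise-distance computation."""
    n = features.shape[0]
    d2 = np.sum((features[:, None, :] - features[None, :, :]) ** 2, axis=-1)
    w = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(w, 0.0)                     # no self-loops
    # keep only the k nearest neighbours of each node, then symmetrise
    mask = np.zeros_like(w, dtype=bool)
    idx = np.argsort(-w, axis=1)[:, :k]
    mask[np.arange(n)[:, None], idx] = True
    w = np.where(mask | mask.T, w, 0.0)
    return np.diag(w.sum(axis=1)) - w

# toy usage: 20 samples with 8-dimensional features
x = np.random.randn(20, 8)
lap = knn_graph_laplacian(x, k=3)
```

The resulting matrix is symmetric with zero row sums, the two properties label-propagation schemes rely on.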
Parameter Sensitivity Study. We verified that KGMA benefits from modelling both intra-class and inter-class structure via manifold discriminant analysis, as shown in Figure 4. We analysed the impact of manifold learning on JHMDB and HMDB51, setting and to their optimal values over split2, for -labeled training data. As varied from to , the accuracy oscillated significantly and reached a peak value when . Since controls the proportion between the intra-class local geometric structure and the inter-class global manifold structure, as shown in Figure 4, when the intra-class local geometric structure is treated as a constant 1, indicates that the inter-class global manifold structure has a larger proportion in the objective function, and vice versa. When , no inter-manifold structure is utilised; conversely, if , no intra-class structure is present. When the Grassmann manifold space leverages an adequate balance of intra-class action variation and inter-class action ambiguity, the proposed algorithm can further enhance the discriminatory power of the transformation matrix.