In this paper, we take into account the appearance and dynamics of video surveillance scenes, as well as their spatial and temporal characteristics. Furthermore, deep features have stronger descriptive abilities compared to handcrafted features. Thus, in our proposed hybrid method that combines traditional and deep learning methods, we use deep features to replace handcrafted features and employ traditional machine learning methods to detect anomalies.
3.1. Object Detection
Traditional handcrafted features based on manually defined low-level visual features cannot represent complex behaviors and the extracted features are relatively simple, resulting in weak generalization ability. On the other hand, deep learning-based feature extraction can automatically learn the distribution rules of massive datasets and extract more robust high-level semantic features to better represent complex behaviors. However, low-level features often have some advantages that high-level features do not have, such as invariance under lighting changes, strong interpretability, and the ability to provide basic spatial, temporal, and frequency information. Therefore, this paper proposes to combine traditional background subtraction methods and the YOLACT network to detect and segment foreground images in video frames, which can not only use traditional methods to extract low-level features but also use deep learning to extract high-level semantic features, thus improving the accuracy and robustness of video anomaly detection.
Combining AGMM (Adaptive Gaussian Mixture Model) and YOLACT for video frame object extraction can enhance both the accuracy and the efficiency of the process. AGMM is a background subtraction method that models the background of each pixel by a mixture of Gaussian distributions and adapts to scene changes by adjusting the parameters of those distributions. AGMM can handle different types of scenes and achieves good results in complex environments with moving backgrounds. YOLACT, on the other hand, is a deep learning-based real-time instance segmentation technique that performs object detection and mask prediction jointly in a single fully convolutional network. It can accurately detect and segment objects in real time. By combining the two, we can use AGMM to separate the background from the foreground and then use YOLACT to detect and segment the objects in the foreground. This approach handles complex scenes effectively and achieves better results than either method alone. Overall, the combination of AGMM and YOLACT provides a more robust and accurate solution for video frame object extraction.
The integration of YOLACT and background subtraction technology to achieve video frame foreground image extraction can combine the advantages of the two methods to achieve more accurate and robust object detection and segmentation. The specific implementation is as follows:
1. The YOLACT technique is used for object detection and instance segmentation to obtain the foreground map $M_1$.
YOLACT is a real-time object detection and instance segmentation technique based on deep learning. Its main idea is to perform detection and segmentation jointly in a single fully convolutional network: the network generates a set of image-sized prototype masks and, for each detected instance, a vector of mask coefficients, and the instance masks are obtained by linearly combining the prototypes with these coefficients. In addition, a mask IoU loss can be used, which measures segmentation quality with the Intersection over Union (IoU) metric and couples the detection and segmentation outputs, minimizing the discrepancy between them. YOLACT also uses a Feature Pyramid Network to process feature maps at different scales, improving the detection and segmentation of objects of different sizes. In summary, YOLACT achieves efficient object detection and instance segmentation by combining prototype masks, per-instance mask coefficients, a mask IoU loss, and a Feature Pyramid Network. The result of YOLACT is shown in Figure 3.
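To make the prototype-and-coefficient mechanism described above concrete, the following minimal sketch assembles instance masks from prototype masks and per-instance coefficients; the array names and shapes are illustrative and are not taken from the YOLACT code base.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def assemble_instance_masks(prototypes, coeffs):
    """Combine YOLACT-style prototypes and mask coefficients.

    prototypes : (H, W, k) array of k prototype masks produced by the mask branch.
    coeffs     : (n, k) array of mask coefficients, one row per detected instance.
    Returns an (n, H, W) array of soft instance masks in [0, 1].
    """
    # Linear combination of prototypes followed by a sigmoid, as in YOLACT's
    # mask assembly step.
    masks = np.tensordot(coeffs, prototypes, axes=([1], [2]))  # (n, H, W)
    return sigmoid(masks)

# Toy usage with random tensors (illustrative shapes only).
protos = np.random.randn(138, 138, 32)   # k = 32 prototypes
c = np.random.randn(5, 32)               # 5 detected instances
instance_masks = assemble_instance_masks(protos, c)
print(instance_masks.shape)              # (5, 138, 138)
```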
From Figure 3, it can be seen that YOLACT generates foreground images, whereas what we need is a foreground mask similar to the one produced by AGMM. If the foreground masks generated by YOLACT and AGMM share a similar data type and value distribution, it is easier to fuse them. Therefore, this paper proposes an improvement to the YOLACT network by adding a mask generation module. The YOLACT foreground map is a gray image in which each pixel has a single value indicating whether the pixel belongs to a foreground object or not. In the YOLACT technique, a set of feature maps is obtained by performing convolution and feature extraction on the input image, and a Mask Head network then processes each feature map to obtain the corresponding foreground mask, i.e., the YOLACT foreground map. During the generation of the foreground mask, each pixel is thresholded to classify it as foreground or background. Let the input image be $I$, the feature map extracted by the YOLACT model be $F$, and the foreground mask be $M$. For each position $(i, j)$ in the feature map, the generation of the foreground mask can be expressed by the following formula:

$$M(i, j) = \begin{cases} 1, & p(i, j) \geq T \\ 0, & p(i, j) < T \end{cases}$$

Here, $p(i, j)$ represents the foreground probability of the pixel corresponding to position $(i, j)$ in the feature map, and $T$ is a preset threshold. Specifically, for each position $(i, j)$ in the feature map, we can calculate its foreground probability $p(i, j)$ using the Mask Head network:

$$p(i, j) = \sigma\!\left(\mathbf{w}^{\top} F(i, j)\right)$$

Here, $\mathbf{w}$ denotes the parameters of the Mask Head network, $F(i, j)$ represents the feature vector corresponding to position $(i, j)$ in the feature map, and $\sigma(\cdot)$ represents the sigmoid function. Finally, by setting the threshold $T$, we can convert the foreground probability $p(i, j)$ into a gray foreground mask value $M(i, j)$, thereby obtaining the YOLACT foreground map.
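As a minimal illustration of this mask generation step, the following sketch converts a map of Mask Head logits into a gray foreground mask by applying the sigmoid and the threshold $T$; the function and variable names are ours, not from the YOLACT implementation.

```python
import numpy as np

def logits_to_foreground_mask(mask_logits, T=0.5):
    """Turn Mask Head logits into a gray foreground mask.

    mask_logits : (H, W) array of raw scores for one object / feature map.
    T           : foreground probability threshold.
    Returns a uint8 mask with 255 for foreground pixels and 0 for background,
    matching the gray-image format produced by AGMM background subtraction.
    """
    prob = 1.0 / (1.0 + np.exp(-mask_logits))      # p(i, j) = sigmoid(w^T F(i, j))
    mask = (prob >= T).astype(np.uint8) * 255      # M(i, j) = 1 if p >= T else 0
    return mask

# Example: random logits stand in for the Mask Head output.
m1 = logits_to_foreground_mask(np.random.randn(240, 320))
```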
In this paper, the YOLACT network architecture is used for foreground extraction from video frames. The model is trained on the COCO dataset for 800,000 iterations with a 54-layer convolutional backbone.
2. AGMM is used to process the video frames and obtain the foreground map $M_2$; the foreground mask extraction process using AGMM is illustrated in Figure 4.
Background subtraction is one of the main components of surveillance video behavior detection, and it is used in this paper as the preprocessing step for object classification. We adopt the background subtraction method based on AGMM [55], which has good anti-jamming capability, in particular robustness to illumination changes, and therefore select it to extract foreground images. A suitable time period $T$ and a time instant $t$ are assumed.
$\mathbf{x}^{(t)}$ represents a sample at time $t$, and the training set over the period $T$ is $\mathcal{X}_T = \{\mathbf{x}^{(t)}, \mathbf{x}^{(t-1)}, \ldots, \mathbf{x}^{(t-T)}\}$. For each new data sample we both update the model and re-estimate the density $\hat{p}(\mathbf{x}^{(t)} \mid \mathcal{X}_T)$. These samples may contain values for the background (BG) and foreground (FG) objects, thus the density estimate is $\hat{p}(\mathbf{x}^{(t)} \mid \mathcal{X}_T, \mathrm{BG}+\mathrm{FG})$. The Gaussian mixture model with $K$ components is expressed as follows:

$$\hat{p}(\mathbf{x} \mid \mathcal{X}_T, \mathrm{BG}+\mathrm{FG}) = \sum_{k=1}^{K} \hat{\pi}_k\, \mathcal{N}\!\left(\mathbf{x};\, \hat{\boldsymbol{\mu}}_k,\, \hat{\sigma}_k^{2} M\right)$$

Here $\hat{\pi}_k$ represents the non-negative estimated mixing weights, and the GMM is normalized at time $t$. $\hat{\boldsymbol{\mu}}_k$ and $\hat{\sigma}_k^{2}$ are the estimated mean values and variances of the Gaussian components, respectively. $M$ is an identity matrix. Given a new data sample $\mathbf{x}^{(t)}$ at time $t$, the recursive update equations are as follows:

$$\hat{\pi}_k \leftarrow \hat{\pi}_k + \alpha\left(o_k^{(t)} - \hat{\pi}_k\right)$$
$$\hat{\boldsymbol{\mu}}_k \leftarrow \hat{\boldsymbol{\mu}}_k + o_k^{(t)}\left(\alpha / \hat{\pi}_k\right) \boldsymbol{\delta}_k$$
$$\hat{\sigma}_k^{2} \leftarrow \hat{\sigma}_k^{2} + o_k^{(t)}\left(\alpha / \hat{\pi}_k\right)\left(\boldsymbol{\delta}_k^{\top} \boldsymbol{\delta}_k - \hat{\sigma}_k^{2}\right)$$

Here $\boldsymbol{\delta}_k = \mathbf{x}^{(t)} - \hat{\boldsymbol{\mu}}_k$, and $\alpha = 1/T$ is the learning rate, whose value is set in advance. For a new sample, the ownership $o_k^{(t)}$ is set to 1 for the "close" component with the largest $\hat{\pi}_k$, and the others are set to zero. We define that a sample belongs to a component if its Mahalanobis distance from the component is less than a certain threshold, where the squared distance from the $k$-th component is calculated as follows:

$$D_k^{2}\!\left(\mathbf{x}^{(t)}\right) = \boldsymbol{\delta}_k^{\top} \boldsymbol{\delta}_k / \hat{\sigma}_k^{2}$$

If no component is close enough, the algorithm generates a new component with $\hat{\pi}_{K+1} = \alpha$, $\hat{\boldsymbol{\mu}}_{K+1} = \mathbf{x}^{(t)}$ and $\hat{\sigma}_{K+1} = \sigma_0$, where $\sigma_0$ is an initial value. This method shortens the processing time and improves the segmentation while providing highly specific image features for the next step of object detection.
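The following sketch implements these recursive updates for a single pixel under simplifying assumptions: a fixed number of components $K$ (the weakest component is recycled instead of growing the mixture), isotropic variances, and illustrative default values for $\alpha$, the distance threshold, and $\sigma_0$; none of the names come from a specific AGMM implementation.

```python
import numpy as np

def agmm_update_pixel(x, pi, mu, var, alpha=0.005, d2_thresh=3.0**2, var0=15.0):
    """One recursive AGMM update for a single pixel sample.

    x   : (3,) new color sample x^(t)
    pi  : (K,) mixing weights, mu : (K, 3) means, var : (K,) isotropic variances
    Returns updated (pi, mu, var) and whether x matched an existing component.
    """
    delta = x - mu                                   # delta_k = x^(t) - mu_k
    d2 = np.sum(delta ** 2, axis=1) / var            # squared Mahalanobis distances D_k^2
    close = d2 < d2_thresh                           # components the sample may belong to

    o = np.zeros_like(pi)                            # ownerships o_k^(t)
    if close.any():
        k = np.argmax(np.where(close, pi, -np.inf))  # close component with largest weight
        o[k] = 1.0
        pi += alpha * (o - pi)                       # weight update
        r = alpha / max(pi[k], 1e-6)
        mu[k] += r * delta[k]                        # mean update
        var[k] += r * (delta[k] @ delta[k] - var[k]) # variance update
    else:
        pi += alpha * (o - pi)                       # all weights decay
        k = np.argmin(pi)                            # recycle the weakest component
        pi[k], mu[k], var[k] = alpha, x.copy(), var0 # new component (pi=alpha, mu=x, sigma0)
        pi /= pi.sum()                               # keep the mixture normalized
    return pi, mu, var, bool(close.any())
```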
More concretely, AGMM models each pixel with its own Gaussian mixture. For each pixel $i$, assuming its three-channel value is $\mathbf{x}_i = (x_i^{R}, x_i^{G}, x_i^{B})$, the probability density function of the AGMM model can be expressed as:

$$P(\mathbf{x}_i) = \sum_{k=1}^{K} \omega_{k,t}\, \mathcal{N}\!\left(\mathbf{x}_i;\, \boldsymbol{\mu}_{k,t},\, \boldsymbol{\Sigma}_{k,t}\right)$$

where $\mathcal{N}(\cdot)$ is the Gaussian distribution density function, $\omega_{k,t}$ is the weight of the $k$-th Gaussian distribution at time $t$, $\boldsymbol{\mu}_{k,t}$ and $\boldsymbol{\Sigma}_{k,t}$ are the mean and covariance matrix of the $k$-th Gaussian distribution at time $t$, and $K$ is the number of Gaussian distributions.
For each pixel location, we calculate the Mahalanobis distance $D_k$ between the pixel's color $\mathbf{x}_i$ and the mean color of each Gaussian component in the mixture model:

$$D_k = \frac{\left\lVert \mathbf{x}_i - \boldsymbol{\mu}_k \right\rVert}{\sigma_k}$$

where $\mathbf{x}_i$ is the pixel value, $\boldsymbol{\mu}_k$ is the mean color of the $k$-th Gaussian component, and $\sigma_k$ is the standard deviation of the $k$-th Gaussian component. The pixel is classified as foreground if its minimum Mahalanobis distance exceeds a threshold, i.e., if it does not match any background component:

$$\min_{k} D_k > T_d$$

where $T_d$ is the threshold. Otherwise, the pixel is classified as background. After initializing the model parameters, for each frame the foreground probability of each pixel is first calculated based on the current model parameters, and the foreground and background pixels are then segmented according to this threshold to obtain the foreground map. With AGMM, clear foreground masks are obtained on the UCSD Ped1 [56] benchmark dataset.
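In practice, an adaptive-GMM background subtractor of this kind is available in OpenCV as `cv2.createBackgroundSubtractorMOG2`, which implements Zivkovic's algorithm. The following minimal sketch shows how such a foreground mask could be produced per frame; the video path and parameter values are placeholders rather than the settings used in this paper.

```python
import cv2

# Zivkovic-style adaptive GMM background subtractor (parameters are illustrative).
backsub = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16,
                                             detectShadows=False)

cap = cv2.VideoCapture("ucsd_ped1_clip.avi")  # placeholder path
while True:
    ok, frame = cap.read()
    if not ok:
        break
    fg_mask = backsub.apply(frame)            # gray mask: 255 = foreground, 0 = background
    # Optional clean-up of the raw mask before fusing it with the YOLACT mask.
    fg_mask = cv2.medianBlur(fg_mask, 5)
cap.release()
```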
3. The two foreground maps are fused with a weighted average, as shown in the following formula. The weight of the YOLACT foreground map is set to a larger value and the weight of the background-subtraction foreground map to a smaller value, so as to preserve the details of the YOLACT foreground:

$$F = w_1 M_1 + w_2 M_2$$

where $M_1$ is the YOLACT foreground map, $M_2$ is the AGMM foreground map, $w_1$ and $w_2$ are the corresponding weights with $w_1 + w_2 = 1$ and $w_1 > w_2$, and $F$ is the fused foreground map.
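A minimal sketch of this weighted fusion, assuming both masks are gray images of the same size and using illustrative weights (e.g., $w_1 = 0.7$, $w_2 = 0.3$, which are not values prescribed by this paper):

```python
import cv2
import numpy as np

def fuse_foreground_maps(m1_yolact, m2_agmm, w1=0.7, w2=0.3, thresh=127):
    """Weighted fusion F = w1 * M1 + w2 * M2, followed by re-binarization.

    m1_yolact, m2_agmm : uint8 gray masks of identical shape (255 = foreground).
    Returns the fused binary foreground map as a uint8 image.
    """
    fused = cv2.addWeighted(m1_yolact, w1, m2_agmm, w2, 0.0)   # weighted average
    _, fused_bin = cv2.threshold(fused, thresh, 255, cv2.THRESH_BINARY)
    return fused_bin

# Example with dummy masks of the same resolution.
m1 = np.zeros((240, 320), np.uint8)
m2 = np.zeros((240, 320), np.uint8)
f = fuse_foreground_maps(m1, m2)
```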
4. Finally, the YOLACT technique is applied once more to the fused foreground map for object detection and instance segmentation, further improving the accuracy and robustness of the detection and segmentation results.
3.2. Feature Learning
After extracting the foreground masks from the video frames, the next step is to perform feature extraction on the foreground images. These features are used to classify frames as either normal or abnormal, providing valuable information for anomaly detection. In video anomaly detection, efficiently extracting appropriate features plays an important role in the rapid and accurate identification of normal and abnormal behaviors. Feature extraction can be performed in two ways: one is to extract features manually, and the other is to learn deep features from the original video frames. Manual feature extraction methods, despite having sound theoretical justifications, are often influenced by human factors and may not objectively represent behavior. Moreover, features extracted in this way often depend on the dataset, meaning that manually designed features may perform well only on certain datasets and may not yield the same results on others. As a result, traditional video behavior detection techniques achieve low accuracy, and shallow models do not scale well to large volumes of data.
In contrast, deep learning can overcome these problems well. Deep feature extraction learns directly from data: one only needs to design the network structure and the learning rules to obtain the deep model parameters and extract deep features, thus improving recognition accuracy and robustness in video behavior detection. PWC-Net, a deep learning-based optical flow estimation technique, is used to compute motion information between adjacent image frames; combined with a classification method, it can be applied to image anomaly detection tasks. PWC-Net (Pyramid, Warping, and Cost volume network, with a multi-scale and multi-stage architecture) is a state-of-the-art method for optical flow estimation proposed by Sun et al. in 2018. It builds upon the FlowNet line of architectures and introduces several improvements: (1) multi-scale processing, in which the input images are processed at multiple scales to better capture both small-scale and large-scale motion; (2) multi-stage processing, in which the network has multiple stages, each taking the output of the previous stage as input, allowing a more fine-grained estimation of flow; and (3) a cost volume, in which, instead of concatenating the feature maps of the two input images, PWC-Net computes the dot product between pairs of feature vectors, allowing the network to evaluate the cost of different flow hypotheses, which is then used to estimate the final flow field.
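To illustrate the cost-volume idea, the following sketch computes a local correlation cost volume between two feature maps over a small displacement search range; the tensor shapes and the `max_disp` value are illustrative and do not reproduce PWC-Net's exact implementation.

```python
import numpy as np

def local_cost_volume(f1, f2, max_disp=4):
    """Correlation cost volume between two feature maps.

    f1, f2 : (H, W, C) feature maps of the first and second frame.
    Returns an (H, W, (2*max_disp+1)**2) volume where each channel holds the
    dot product between f1(x, y) and f2 shifted by one candidate displacement.
    """
    H, W, C = f1.shape
    pad = max_disp
    f2p = np.pad(f2, ((pad, pad), (pad, pad), (0, 0)))
    costs = []
    for dy in range(-max_disp, max_disp + 1):
        for dx in range(-max_disp, max_disp + 1):
            shifted = f2p[pad + dy:pad + dy + H, pad + dx:pad + dx + W, :]
            costs.append(np.sum(f1 * shifted, axis=2))   # per-pixel dot product
    return np.stack(costs, axis=2)

# Toy usage with random features.
cv_vol = local_cost_volume(np.random.randn(32, 48, 16), np.random.randn(32, 48, 16))
print(cv_vol.shape)   # (32, 48, 81)
```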
The optical flow vectors can be used as features to construct a classifier that assigns input image frames to normal and abnormal categories. PWC-Net has been shown to outperform previous state-of-the-art methods on several optical flow benchmarks; the results of PWC-Net are shown in Figure 5. When training the classifier, image frames with typical motion patterns are used as normal samples, while image frames with atypical motion patterns are used as abnormal samples. The trained classifier can then be used to classify new input image frames and detect whether anomalies are present.
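A minimal sketch of this stage, assuming a callable `estimate_flow(frame_a, frame_b)` that wraps a PWC-Net implementation and returns an (H, W, 2) flow field; the wrapper name, the histogram-based feature, and the SVM classifier are illustrative choices rather than the paper's exact configuration.

```python
import numpy as np
from sklearn.svm import SVC

def flow_to_feature(flow, bins=16):
    """Summarize an (H, W, 2) optical flow field as a fixed-length feature vector.

    The feature is a magnitude-weighted histogram of flow orientations
    plus simple magnitude statistics.
    """
    mag = np.linalg.norm(flow, axis=2)
    ang = np.arctan2(flow[..., 1], flow[..., 0])            # orientation in [-pi, pi]
    hist, _ = np.histogram(ang, bins=bins, range=(-np.pi, np.pi), weights=mag)
    hist = hist / (hist.sum() + 1e-8)                       # normalize the histogram
    return np.concatenate([hist, [mag.mean(), mag.max()]])

def train_motion_classifier(flows, labels):
    """flows: list of (H, W, 2) flow fields; labels: 1 = abnormal, 0 = normal."""
    X = np.stack([flow_to_feature(f) for f in flows])
    clf = SVC(kernel="rbf", probability=True)
    clf.fit(X, labels)
    return clf

# At test time (estimate_flow is the assumed PWC-Net wrapper):
#   flow = estimate_flow(prev_frame, cur_frame)
#   score = clf.predict_proba([flow_to_feature(flow)])[0, 1]   # abnormality score
```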
3.3. Abnormal Behavior Detection
Given a new visual sequence, we select abnormal video frames by applying the trained model to the images in the video. We estimate the likelihood of an observed object $t$ by verifying whether valid samples exist in the corresponding spatio-temporal video volume:

$$P(t \mid x, y) = P(z_t \mid x)\, P(l_t \mid y)$$

where $x$ is a known (normal) feature vector and $y$ is its location; $z_t$ denotes the feature vector of the observed object $t$, and $l_t$ denotes its location. To model the relationship between $x$ and $z_t$ in terms of spatio-temporal appearance, we estimate the conditional probability $P(z_t \mid x)$ by the cosine similarity between $z_t$ and $x$:

$$P(z_t \mid x) = \frac{\sum_{i=1}^{M} z_t^{i}\, x^{i}}{\sqrt{\sum_{i=1}^{M} \left(z_t^{i}\right)^2}\,\sqrt{\sum_{i=1}^{M} \left(x^{i}\right)^2}}$$

The location similarity between $y$ and $l_t$ is modeled using a Gaussian function:

$$P(l_t \mid y) = \exp\!\left(-\frac{\left\lVert l_t - y \right\rVert^{2}}{\sigma^{2}}\right)$$

Here $M$ is the dimension of the feature vector, $z_t^{i}$ and $x^{i}$ are the $i$-th elements of the feature vectors of $t$ and $x$, respectively, $\exp(\cdot)$ denotes the natural exponential function, and $\sigma$ is a constant. Assuming that the variables $x$ and $y$ are conditionally independent and have uniform prior distributions, i.e., there is no prior preference among valid samples in the set, the joint likelihood of the observed object $t$ and the hidden variables $x$ and $y$ can be factorized in the following way:

$$P(t, x, y) = C\, P(z_t \mid x)\, P(l_t \mid y) \quad (16)$$

The constant $C$ is used to ensure that the maximum value of $P(t, x, y)$ is smaller than 1. We aim to find the samples $x$ and $y$ that give the maximum a posteriori (MAP) assignment. This can be achieved by using Equation (16):

$$\left(x^{*}, y^{*}\right) = \arg\max_{x, y} P(t, x, y) = \arg\max_{x, y} P(z_t \mid x)\, P(l_t \mid y) \quad (17)$$

The first term in Equation (17) represents the MAP inference of spatio-temporal appearance, while the second term represents the MAP inference of spatial location. Under this MAP inference, a sample that appears only once in the normal sample set is as likely as a sample that appears multiple times. A large likelihood indicates that it is more likely to find $x$ and $y$ in the set that explain the observed object $t$ in terms of spatio-temporal appearance and spatial location.