1. Introduction
In recent years, advances in video editing, 5G communication, and other related technologies, together with the popularity of video-related applications and services, have driven the continuous expansion of video data, which now exhibits an exponential growth trend [1]. Taking short videos as an example, data from iiMedia Research show that the number of short video users in China has grown markedly, exceeding 700 million in 2020. Through video creation, sharing, and recommendation on short video platforms, these users deliver vivid and diverse information resources that enrich daily life.
In fact, as the scale of video data increases, operations such as video editing and the re-release of modified versions of original videos continually produce large numbers of similar videos, which are referred to as Near-Duplicate Videos (NDVs) [2]. In [3], near-duplicate videos are defined as identical or approximately identical videos that are close to each other and hard to distinguish, yet differ in some details. In general, near-duplicate videos are derived from an original video; they not only illegally infringe the copyright of the video producer [4], but also degrade the data quality of video datasets. For example, Wu et al. [5] used 24 keywords to search for videos on YouTube, Yahoo! Video, and Google Video, and the results show that these websites contain a large number of near-duplicate videos; in individual cases, the redundancy reaches 93%. Such near-duplicate videos not only cause copyright issues and disturb normal applications such as video surveillance and video recommendation, but also significantly reduce the data quality of video datasets, making the maintenance and management of video data increasingly challenging. At present, the salient problems caused by near-duplicate videos have attracted increasing attention in academia and industry.
From the perspective of data quality [6], video data quality concerns the overall quality of a video dataset and emphasizes the degree to which data consistency, correctness, completeness, and minimization are satisfied in information systems. The emergence of near-duplicate videos reduces the consistency and minimization of video datasets. These near-duplicate videos can be regarded as a kind of dirty data with wide coverage and rich, diverse forms. Concretely, near-duplicate videos may be generated at any stage, whether video collection, video integration, or video processing. For instance, in the video collection stage, videos may be captured from different angles within the same scene; in the video integration stage, near-duplicate videos with different formats and lengths may arrive from different data sources; and in the video processing stage, video copying, video editing, and other operations produce masses of near-duplicate videos. Since near-duplicate videos are concealed, widely disseminated, and harmful within video datasets, identifying and reducing them is a difficult challenge.
Research on near-duplicate video detection can help discover hidden near-duplicate videos in video datasets. Various methodologies have been proposed in the literature, whose implementation mainly comprises feature extraction, feature signatures, and signature indexing. In all of these methodologies, feature extraction can be regarded as a key component of near-duplicate video detection. From the perspective of video feature representation, near-duplicate video detection methodologies can be categorized into hand-crafted features-based and high-level features-based methodologies.
In the hand-crafted features-based methodology, local feature descriptors (such as SIFT, SURF, and MSER) and global feature descriptors (such as bag-of-words and HOG) can be utilized to represent the spatial features of video data at a small computational cost. However, this methodology relies on prior knowledge, and it is difficult to accurately represent the features of videos characterized by uneven illumination, local occlusion, and significant noise.
In recent years, with in-depth research on deep learning theories and methodologies, a variety of deep neural network models have been applied to near-duplicate video detection. Video features can be portrayed and represented more accurately to identify semantically similar video data by adopting convolutional neural network models [7], recurrent neural network models [8], and other deep neural network models [9]. Nevertheless, near-duplicate video detection methodologies can only identify the near-duplicate videos in a video dataset; they lack a process of feature sorting and automatic merging for video data represented by high-dimensional features. Therefore, it is very challenging for them to automatically clean up redundant near-duplicate videos so as to reduce the copyright infringement and related issues caused by video copying, video editing, and other manual operations.
Currently, major studies focus on near-duplicate video detection [10] and retrieval [11], and little consideration is given to automatically cleaning near-duplicate videos from the perspective of data quality. Data cleaning techniques are an important technical means of effectively reducing near-duplicate data and improving data quality. With such techniques, near-duplicate data in a dataset can be cleaned automatically so that the dataset satisfies data consistency, completeness, correctness, and minimization, thereby achieving high data quality. At present, data cleaning techniques have been studied in depth for big data cleaning [12], stream data cleaning [13], contextual data cleaning [14], etc., and can effectively address data quality issues at the instance and schema layers. However, there is still a significant gap in research on near-duplicate video cleaning: under the condition that the prior distribution is unknown, existing studies suffer from sensitivity to video data orderliness and to the initial clustering centers, which seriously affects the accuracy of near-duplicate video cleaning.
In this paper, an automatic near-duplicate video cleaning method based on a consistent feature hash ring (denoted as RCLA-HAOPFDMC) is proposed to address the above-mentioned issues. It consists of three parts: high-dimensional feature extraction of video data, consistent feature hash ring construction, and cluster-based cleaning modeling on the consistent feature hash ring. First, a residual network with convolutional block attention modules, a long short-term memory (LSTM) deep network model, and an attention model are integrated to extract temporal and spatial features from videos by constructing a multi-head attention mechanism. Then, a consistent feature hash ring is constructed, which can effectively alleviate the sensitivity to video data orderliness while providing a condition for near-duplicate video merging. Finally, to reduce the sensitivity of the near-duplicate video cleaning results to the initial cluster centers, an optimized feature distance-means clustering algorithm is constructed by utilizing a mountain peak function on the consistent feature hash ring to implement automatic cleaning of near-duplicate videos. A commonly used dataset named CC_WEB_VIDEO [5] and a coal mining video dataset [15] are used to confirm the practical effect of the proposed method. The contributions are summarized as follows: (1) a novel consistent feature hash ring is constructed, which alleviates the sensitivity to video data orderliness while providing a condition for merging near-duplicate videos; (2) an optimized feature distance-means clustering algorithm is constructed by utilizing a mountain peak function on the consistent feature hash ring to merge and clean up near-duplicate videos; (3) the presented method is effective on the highly challenging CC_WEB_VIDEO dataset and on a coal mining video dataset with complex contextual scenes.
The remainder of this article is organized as follows. Section 2 briefly reviews related work on near-duplicate video detection and data cleaning. Section 3 presents the proposed automatic near-duplicate video cleaning method based on a consistent feature hash ring. Section 4 reports experimental results that validate the performance of the presented method. Finally, the paper is summarized.
2. Related Work
In this section, previous near-duplicate video detection methodologies and image/video cleaning methodologies are briefly reviewed. First, hand-crafted features-based and high-dimensional features-based methodologies for near-duplicate video detection are recalled; then, data cleaning methodologies for image and video cleaning are reviewed. Finally, the shortcomings of the above-mentioned methodologies are analyzed.
2.1. Near-Duplicate Video Detection Methodologies
In the past decade, hand-crafted features such as SIFT, HOG, and MSER have been widely used for near-duplicate video detection. For example, the study in [16] adopts SIFT local features to encode temporal information, generates temporal-set SIFT features by tracking SIFT, and combines locality-sensitive hashing to detect near-duplicate videos. Although the SIFT features of a video frame are invariant to rotation, scaling, and illumination changes, and tolerate a certain degree of affine transformation and noise, some of this invariance is damaged under strong camcording. Henderson et al. [17] adopt keypoint features from Harris corner, SURF, BRISK, FAST, and MSER descriptors to detect video frame duplication. This method incorporates different local features to represent video frames, but it ignores the global and spatiotemporal features of videos. Zhang et al. [18] integrate Harris 3D spatiotemporal features and HOG/HOF global feature descriptors to detect near-duplicate news web videos, applying the Jaccard coefficient as the similarity metric; however, this method suffers from inefficient detection. In [19], a near-duplicate video detection system named CompoundEyes is proposed, which combines seven hand-crafted features (such as color consistency, color distribution, edge direction, and motion direction) to improve detection efficiency. However, this method is susceptible to feature changes. The work in [3] adopts an unsupervised clustering algorithm based on temporal and spatial key points to automatically identify and classify near-duplicate videos, but its detection results are sensitive to the initialization of the cluster centers.
In general, the combination of spatial and temporal features can represent the spatiotemporal information contained in video data more comprehensively and accurately than a single low-level feature; hence, methodologies based on spatiotemporal features can identify near-duplicate videos more accurately. However, methodologies based on low-level features require prior knowledge, and their detection results are easily affected by disturbances from illumination, occlusion, distortion, etc.
Recently, various deep network models with greater representational capacity than hand-crafted features-based methodologies have been utilized to detect near-duplicate videos. For instance, the study in [7] presents a survey on the utilization of deep learning techniques and frameworks. Nie et al. [20] use a pre-trained convolutional neural network model to extract high-dimensional video features and propose a simple but efficient multi-bit hash function to detect near-duplicate videos. This supervised joint-view hashing method improves both accuracy and efficiency; however, the distribution of near-duplicate video data in a dataset is usually unknown in practical applications, which limits its applicability. In [21], near-duplicate videos are detected by combining a two-stream network model of RGB and optical flow, multi-head attention, and a Siamese network model. The limitation of this method is that it adopts a cosine distance function to measure the similarity between every pair of videos, which results in relatively low efficiency. Moreover, a neighborhood attention mechanism has been integrated into an RNN-based reconstruction scheme to capture the spatiotemporal features of videos for near-duplicate video detection [22]. The work in [23] uses a temporal segment network model to detect near-duplicate video data. These temporal network-based models detect near-duplicate videos by capturing the temporal features of videos. In [24,25], a parallel 3D ConvNet model and a spatiotemporal relationship neural network model are adopted to extract spatiotemporal features for near-duplicate video detection.
In summary, high-dimensional features-based methodologies achieve better performance than hand-crafted features-based ones, as they reduce the impact of illumination, occlusion, and distortion on the results. Although near-duplicate videos can be identified directly by hash mapping or a similarity metric, the above-mentioned methodologies lack a process to automatically merge data with high-dimensional features, which makes it difficult for them to clean up near-duplicate videos automatically.
2.2. Data Cleaning Methodologies
Data duplication may arise from data maintenance, manual input, device errors, and so on [26]; data cleaning techniques are effective ways to automatically clean and reduce near-duplicate data. Recently, the amount of literature on data cleaning [27] has grown rapidly, and most of the existing works concentrate on stream data cleaning and spatiotemporal data cleaning.
In the line of work on stream data cleaning, for instance, the literature [28] proposes a stream data cleaning method named SCREEN, which cleans stream data by finding the repair sequence with the smallest difference from the input, constructing an online cleaning model, and computing the local optimum of each data point. However, this method does not guarantee that near-duplicate data are exactly adjacent to each other within the same sliding window. The work in [29] proposes a streaming big data system based on an efficient, compact, and distributed data structure that maintains the state necessary for repairing data; it further improves cleaning accuracy by supporting rule dynamics and utilizing sliding window operations. The limitation of this method is that the fixed size of the sliding window has a significant impact on the cleaning results. In [30], a sliding window and the K-Means clustering algorithm are adopted to clean stream data, but the results are sensitive to the initialization of the cluster centers.
To sum up, although stream data cleaning methodologies can clean stream data effectively, their results are sensitive to the fixed-size sliding window and the pre-defined initial clustering centers.
In the line of work on time series data cleaning, for example, Ranjan et al. [31] utilize a k-nearest neighbor algorithm and a sliding window prediction approach to clean time series data on a set of nonvolatile and volatile time series. This method can optimize the width of the sliding window to enhance performance, but a general scheme for optimizing the performance-affecting parameters still needs to be developed. In [32], a top-N keyword query processing method based on real-time entity resolution is proposed to clean datasets with duplicate tuples. Its limitation is that the selection of keywords has a salient impact on the results. The study in [33] proposes an approach for real-time data cleaning and standardization that clarifies the workflows of data cleaning and data reliability, and it can be adapted to clean near-duplicate time series data. However, this approach does not describe the details of real-time data cleaning sufficiently.
Studies of time series data cleaning show that erroneous data are prevalent in the industrial field; hence, existing studies mainly focus on cleaning erroneous data and pay less attention to cleaning near-duplicate data. In the existing studies, k-nearest neighbor and top-K algorithms are widely adopted to clean near-duplicate time series data since they do not rely on prior knowledge of the data distribution. Nevertheless, the pre-setting of the k parameter has a significant impact on the cleaning results.
In the line of work on spatiotemporal data cleaning, taking the literature [34] as an example, a probabilistic system named CurrentClean is presented, which uses a spatiotemporal probabilistic model and a set of inference rules to identify and clean stale data in a database. This system applies to cleaning data with spatiotemporal features in relational databases, but it is difficult to apply to unstructured data with high-dimensional spatiotemporal features. To address this issue, the study in [15] proposes a method to clean near-duplicate videos using locality-sensitive hashing and a sorted neighborhood algorithm. However, the hand-crafted SIFT and SURF features it uses struggle to portray video features accurately, and the sorted neighborhood algorithm introduces a prominent orderliness-sensitivity problem for video data. The work in [35] improves the quality of an image dataset by automatically cleaning minority-class images with a convolutional neural network model; however, the completeness of the dataset may be destroyed when images of minority classes are removed. Fu et al. [36] propose a near-duplicate video cleaning method based on the VGG-16 deep network model and feature distance-means clustering fusion to improve the data quality of video datasets; it takes little account of the temporal feature representation of videos and suffers from sensitivity to the initial clustering centers when the prior distribution is unknown. Moreover, a novel content-based video segment identification scheme has been proposed to reduce the mass of near-duplicate video clips [37]. H. Chen et al. [38] utilize similarity measurement to clean the duplicate annotations of video data in the MSR-VTT dataset.
In summary, there are few studies on data cleaning for unstructured data with spatiotemporal features, such as video and audio data, because the perspective of data quality is rarely considered. Recent studies on near-duplicate video cleaning mainly face the following issues: (1) it is difficult for sorting algorithms to arrange all near-duplicate videos adjacently, so the cleaning accuracy is sensitive to the data orderliness; (2) following the idea of clustering, the video with the most significant features can be retained among all near-duplicate videos and the rest deleted, but the setting of the initial clustering centers strongly affects the cleaning results when the prior distribution is unknown.
To address these two issues, we construct a novel consistent feature hash ring, built on optimized video feature extraction, to map video data into a low-dimensional space; this reduces the orderliness-sensitivity issues caused by data sorting and provides a condition for merging near-duplicate videos. On this basis, an optimized feature distance-means clustering algorithm is constructed, which incorporates a mountain peak function on the consistent feature hash ring to overcome the sensitivity to initial clustering centers when the prior distribution is unknown.
3. The Proposed Method
In this section, we describe the proposed automatic near-duplicate video cleaning method based on a consistent feature hash ring, which automatically cleans up near-duplicate videos to improve the data quality of video datasets. It is important to note that the data quality of a video dataset and video quality are different concepts. Normally, video quality concerns the clarity of videos and involves performance metrics such as resolution and frames per second. The data quality of a video dataset concerns the degree to which the dataset satisfies data consistency, completeness, correctness, and minimization. The goal of the method proposed in this paper is to remove redundant near-duplicate videos from video datasets so that the datasets consistently maintain high data quality.
To achieve this goal, three main stages must be accomplished: feature representation of video data, identification of near-duplicate videos, and deletion of near-duplicate videos. It should be noted that if all identified near-duplicate videos are removed, the completeness and correctness of the video dataset will be affected; if only some of them are removed, its consistency and minimization will be affected. Therefore, it is a difficult challenge to retain the video data with the most salient features, ensuring the completeness and correctness of the dataset, while removing the remaining near-duplicate videos, ensuring its consistency and minimization.
At present, considering the time cost of near-duplicate video cleaning, the key insight of existing works is to overcome the above challenges by exploiting data ordering and clustering to retain the video data with the most salient features and remove the remaining near-duplicate videos. However, the sensitivity of the cleaning effect to data orderliness and to the initial clustering centers remains a major issue. Inspired by distributed big data storage and processing, a consistent feature hash ring is constructed in this paper. Its advantage is that mapping video data with high-dimensional features onto a feature hash ring reduces the impact of data sorting on the cleaning results while providing a condition for removing near-duplicate videos.
Figure 1 outlines our approach, which consists of three parts: high-dimensional feature extraction of videos, construction of a consistent feature hash ring, and clustering-based cleaning of near-duplicate videos on the consistent feature hash ring. Each part is explained next.
3.1. High-Dimensional Feature Extraction of Videos
The feature representation of video data is an important stage in the video data cleaning process. Several convolutional neural network models have been adopted for feature representation in studies of image and video data cleaning, such as AlexNet [35], GoogleNet [35], VGG-16 [36], and ResNet50 [39]. Owing to the spatiotemporal nature of video data, these convolutional neural network models alone are insufficient for the spatial feature representation of videos, and the extracted features rarely highlight the spatiotemporal characteristics of salient regions in video data, which affects the accurate representation of video semantics. To overcome this limitation, a residual network with convolutional block attention modules is first adopted to extract the spatial features of video data; the channel attention and spatial attention modules in these convolutional block attention modules effectively improve the representation of spatial features in the salient regions of video data. Then, this network is integrated with a long short-term memory model to extract the spatiotemporal features of video data. Finally, to highlight the role of key information in the semantic representation of videos, an attention model is introduced on top of the above networks, thereby constructing a video spatiotemporal feature extraction model with a multi-head attention mechanism along three independent dimensions (channel, space, and time series), which is named the RCLA model in this paper. The concrete architecture of the RCLA model is shown in Figure 2.
Considering the scale of the data used for training and testing in this paper, several videos are first selected as training samples and input into a residual network with convolutional block attention modules. Let the size of a video be $W \times H \times C \times T$, where $W \times H$ represents the size of a video frame, $C$ represents the number of channels per frame, and $T$ represents the number of frames of the video [40]. Before training the 34-layer residual network, the values of $W$ and $H$ are set to 224, and the value of $C$ is set to 3. In addition, a $7 \times 7$ convolutional kernel with stride 2 in the convolution layer and a $3 \times 3$ pooling window with stride 2 in the pooling layer are fixed to implement the convolution and max-pooling operations. During this process, a BatchNorm2d method is used to normalize the input data, and a Rectified Linear Unit (ReLU) activation function is used to alleviate the problem of gradient dispersion. Thus, a feature map $F$ of size $56 \times 56 \times 64$ for a video frame can be obtained. Then, $F$ is input into the middle part of the 34-layer residual network. This part is composed of 4 stages, which include 3, 4, 6, and 3 residual blocks, respectively, and each residual block contains a convolutional block attention module (CBAM) [41]. In this module, the details of the spatial features in a video frame are profiled by constructing a channel attention map $M_c$ and a spatial attention map $M_s$. Specifically, the spatial information of the feature map $F$ is aggregated by first performing max-pooling and average-pooling operations on $F$, respectively; the pooled descriptors are then input into a multilayer perceptron (MLP) network with 2 layers, whose layers share the weights $W_0 \in \mathbb{R}^{C/r \times C}$ and $W_1 \in \mathbb{R}^{C \times C/r}$ ($r$ is the reduction ratio, set to 16 in this paper). The channel attention map $M_c$ is obtained by Eq. (1), and an intermediate feature map $F'$ is computed by Eq. (2):

$M_c(F) = \sigma\big(W_1(W_0(F_{avg}^{c})) + W_1(W_0(F_{max}^{c}))\big)$ (1)

$F' = M_c(F) \otimes F$ (2)

where $\sigma$ denotes the sigmoid function; $F_{avg}^{c}$ and $F_{max}^{c}$ are the average-pooled and max-pooled features, respectively; and "$\otimes$" denotes element-by-element multiplication.
Then, average-pooling and max-pooling operations are performed on $F'$ to generate the features $F_{avg}^{s}$ and $F_{max}^{s}$, respectively. On this basis, the spatial attention map $M_s$ is calculated by Eq. (3), and the final refined feature map $F''$ is obtained by Eq. (4) as follows:

$M_s(F') = \sigma\big(f^{7 \times 7}([F_{avg}^{s}; F_{max}^{s}])\big)$ (3)

$F'' = M_s(F') \otimes F'$ (4)

where $f^{7 \times 7}$ denotes a convolution operation with a $7 \times 7$ kernel.
Through the above residual blocks with convolutional block attention modules, feature vectors of size [512, 1] are obtained to represent the spatial features of video frames.
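To make the computation in Eqs. (1)-(4) concrete, the following PyTorch sketch implements a minimal CBAM block under the settings described above (reduction ratio $r = 16$, a $7 \times 7$ spatial convolution); the class and variable names are illustrative and not taken from the authors' implementation.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Minimal sketch of a convolutional block attention module (Eqs. (1)-(4))."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Two-layer MLP with shared weights W0 and W1 for channel attention.
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),  # W0
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),  # W1
        )
        # 7x7 convolution over the stacked pooled maps for spatial attention.
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        b, c, h, w = f.shape
        # Eq. (1): channel attention from average- and max-pooled descriptors.
        avg = self.mlp(f.mean(dim=(2, 3)))
        mx = self.mlp(f.amax(dim=(2, 3)))
        mc = torch.sigmoid(avg + mx).view(b, c, 1, 1)
        f1 = mc * f                                   # Eq. (2): F' = Mc (x) F
        # Eq. (3): spatial attention from channel-wise average and max maps.
        s = torch.cat([f1.mean(dim=1, keepdim=True),
                       f1.amax(dim=1, keepdim=True)], dim=1)
        ms = torch.sigmoid(self.conv(s))
        return ms * f1                                # Eq. (4): F'' = Ms (x) F'

# Usage: refine a 56x56x64 feature map produced by the stem of the residual network.
feat = torch.randn(1, 64, 56, 56)
refined = CBAM(64)(feat)  # same shape as the input
```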
Then, the spatial features $X = \{x_1, x_2, \ldots, x_T\}$ are input into a long short-term memory (LSTM) network to further extract the temporal features of the video data. Considering the size of the datasets used in this paper, a one-layer LSTM with $N$ hidden nodes ($N = 16$ in this paper) is employed, which consists of an input gate, a forget gate, and an output gate. The hidden state $h_t$ at moment $t$ is calculated as shown in Eq. (5):

$h_t = \mathrm{LSTM}(W x_t, h_{t-1}; \theta)$ (5)

where $\mathrm{LSTM}(\cdot)$ denotes the formalized function of an LSTM network model; $W$ denotes a parameter matrix learned during the training of the LSTM network model; $h_{t-1}$ denotes the hidden state at moment $t-1$; and $\theta$ denotes the hyperparameters of the LSTM network model. Through the output layer of the LSTM network model, the spatiotemporal features $H = \{h_1, h_2, \ldots, h_T\}$ of size $[T, N]$ can be obtained to represent the video data.
To focus on the visual features of different video frames and highlight the semantic content of the video data, an attention module based on the above models is introduced, as shown in Eq. (6):

$Z = \mathrm{Attention}(W_a, H)$ (6)

where $\mathrm{Attention}(\cdot)$ denotes a standard additive attention function; $W_a$ denotes the weight vectors in the attention module; and $Z$ denotes the semantic features of the video data obtained with the multi-head attention mechanism, whose size is $[T, N]$.
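The temporal stage of Eqs. (5) and (6) can be sketched as follows, assuming 512-dimensional per-frame spatial features from the residual network and the one-layer, 16-unit LSTM described above; the additive attention projections `w_h` and `w_a` are assumed names for the attention weights, not the authors' notation.

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Sketch of the LSTM + additive attention stage (Eqs. (5) and (6))."""
    def __init__(self, feat_dim: int = 512, hidden: int = 16):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=1, batch_first=True)
        self.w_h = nn.Linear(hidden, hidden)   # assumed attention projection
        self.w_a = nn.Linear(hidden, 1)        # assumed attention score vector

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, T, 512] per-frame spatial features from the CBAM-ResNet.
        h, _ = self.lstm(x)                    # Eq. (5): hidden states [batch, T, 16]
        # Eq. (6): additive attention scores over the T time steps.
        scores = self.w_a(torch.tanh(self.w_h(h)))   # [batch, T, 1]
        alpha = torch.softmax(scores, dim=1)
        return (alpha * h).sum(dim=1)          # attended semantic feature [batch, 16]

# Usage: 32 frames of 512-d spatial features -> one 16-d spatiotemporal feature.
z = TemporalAttention()(torch.randn(1, 32, 512))
```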
When training the RCLA model, the goal is to minimize the learning loss of each deep neural network model. Considering the cascade relationship between the above neural network models, the output of the attention module is used as the input of a loss function to optimize the RCLA model. The loss function is constructed as shown in Eq. (7) and Eq. (8):

$p_{ij} = \dfrac{\exp\big(\mathrm{sim}(z_i, s_j)\big)}{\sum_{k=1}^{M} \exp\big(\mathrm{sim}(z_i, s_k)\big)}$ (7)

$L = -\dfrac{1}{N_v} \sum_{i=1}^{N_v} \sum_{j=1}^{M} y_{ij} \log p_{ij}$ (8)

where $z_i$ denotes the feature vector of the $i$th video in a video dataset with $N_v$ videos, $s_j$ denotes the feature vector of the $j$th seed video in the above dataset, $M$ denotes the total number of preset seed videos, $y_{ij}$ denotes the label corresponding to $z_i$, and $\mathrm{sim}(\cdot, \cdot)$ denotes a similarity measurement function.
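Under the reconstruction of Eqs. (7) and (8) above, a softmax over similarities to the seed videos followed by a cross-entropy loss, the training-loss computation could look like the following sketch; the choice of cosine similarity and the tensor layout are assumptions, not necessarily the authors' exact design.

```python
import torch
import torch.nn.functional as F

def rcla_loss(z: torch.Tensor, seeds: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Sketch of Eqs. (7)-(8): similarity to seed videos + cross-entropy.

    z:      [N, D]  feature vectors of the N training videos (RCLA output)
    seeds:  [M, D]  feature vectors of the M preset seed videos
    labels: [N]     index of the seed video each training video corresponds to
    """
    # Eq. (7): similarity of every video to every seed video.
    sim = F.cosine_similarity(z.unsqueeze(1), seeds.unsqueeze(0), dim=-1)  # [N, M]
    # Eq. (8): cross-entropy over the softmax-normalized similarities.
    return F.cross_entropy(sim, labels)

# Usage with random stand-in features: 8 videos, 3 seed videos, 16-d features.
loss = rcla_loss(torch.randn(8, 16), torch.randn(3, 16), torch.randint(0, 3, (8,)))
```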
3.2. The Construction of a Consistent Feature Hash Ring
To efficiently identify near-duplicate video data with high-dimensional features and reduce the impact of high-dimensional feature sorting on near-duplicate video cleaning, a consistent feature hash ring is constructed, inspired by the way distributed hash tables address load balancing and high scalability in big data storage, as shown in Figure 3.
Assuming that a hash function maps the high-dimensional features of all video data into a set of hash values, and the largest hash value is less than $2^{32}$, then a consistent feature hash ring is a virtual ring constructed from this set of hash values, and the hash value space of the ring is $[0, 2^{32}-1]$ (a hash value is a 32-bit unsigned integer, and the ring size is set to $2^{32}$ in this paper). The entire space of the consistent feature hash ring is organized in a clockwise direction. The hash value at the top of the ring, taken as the starting point, is 0; the hash value at the first point to the left of 0 is $2^{32}-1$; and the hash value at the first point to the right of 0 is 1. By analogy, all hash values are distributed clockwise from small to large until they return to the starting point.
When constructing a consistent feature hash ring, a hash function is designed as shown in Eq. (9), which is used to calculate the hash value corresponding to each video and to map all video data onto the consistent feature hash ring:

$\mathrm{hash}(x_i) = B(f_i) \bmod 2^{32}$ (9)

where $B(\cdot)$ denotes a binary encoding function, and $f_i$ denotes the high-dimensional feature vector of the $i$th video.

By utilizing the binary encoding function shown in Eq. (10), the high-dimensional features of all video data can be mapped into compact binary codes, which not only decreases the dimensionality of the video features but also allows metric operations to be performed in the low-dimensional space:

$B(f_i) = \mathrm{MD5}\big(\lVert \tanh(f_i) \rVert_2\big)$ (10)

where $\mathrm{MD5}(\cdot)$ denotes an encryption function and $\tanh(\cdot)$ is an activation function.

Specifically, a $\tanh$ activation function is utilized to perform a nonlinear transformation of the high-dimensional video features. On this basis, to improve the sparsity of the hash distribution on the consistent feature hash ring, an $L_2$ norm is used to calculate a value representing the video data. Subsequently, this value is encrypted with the MD5 algorithm and converted into a fixed-length binary code as the hash value of the video (the length of the binary code is set to 16 in this paper). Finally, a linear probing method is used when storing multiple hash values to prevent the same hash address from being preempted by different hash values; that is, addresses are probed by successively adding 1, taking the modulus of the maximum value on the consistent feature hash ring as the upper bound of the address range, until a free address is found. Here, the modulo operation ensures that the located position lies in the effective space of $[0, 2^{32}-1]$. Thus, the $i$th video can be mapped to a video hash feature point $x_i$ on the consistent feature hash ring in the form of two-dimensional coordinates $(X_i, Y_i)$.
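The mapping just described, a tanh nonlinearity, an $L_2$ norm, MD5 encryption, truncation to a 32-bit ring position, and linear probing, can be sketched as follows; the dictionary-based ring and the 16-character code truncation are illustrative simplifications rather than the authors' exact implementation.

```python
import hashlib
import numpy as np

RING_SIZE = 2 ** 32  # hash values are 32-bit unsigned integers

def ring_position(feature: np.ndarray) -> int:
    """Sketch of Eqs. (9)-(10): feature vector -> position on the hash ring."""
    value = np.linalg.norm(np.tanh(feature))           # nonlinear map + L2 norm
    digest = hashlib.md5(f"{value:.8f}".encode()).hexdigest()
    binary_code = digest[:16]                          # fixed-length code (16 hex chars here)
    return int(binary_code, 16) % RING_SIZE            # clockwise position in [0, 2^32 - 1]

def insert(ring: dict, feature: np.ndarray, video_id: str) -> int:
    """Linear probing: if an address is taken, step clockwise until one is free."""
    pos = ring_position(feature)
    while pos in ring:
        pos = (pos + 1) % RING_SIZE                    # modulo keeps us on the ring
    ring[pos] = video_id
    return pos

# Usage: map two (possibly colliding) feature vectors onto the ring.
ring: dict[int, str] = {}
insert(ring, np.random.randn(512), "video_0")
insert(ring, np.random.randn(512), "video_1")
```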
3.3. FD-Means Clustering Cleaning Optimization Algorithm Fused with a Mountain Peak Function
The efficiency and accuracy of a partitioning clustering algorithm are closely related to the selection of the initial clustering centers. The FD-Means clustering algorithm [36] randomly selects several initial cluster centers, performs clustering multiple times, and finally selects the best-performing centers as the initial clustering centers. However, this entails a large amount of computation, and its poor stability leads to volatility in the clustering results. To address this issue, a mountain peak function is fused with the FD-Means clustering algorithm to optimize the selection of the initial clustering centers so that near-duplicate video data can be cleaned automatically and accurately.
Specifically, assume a set of video hash feature points $X = \{x_1, x_2, \ldots, x_{N_v}\}$ on the consistent feature hash ring, where $x_i = (X_i, Y_i)$, $Y_i$ denotes the vertical coordinate of $x_i$, and $N_v$ denotes the total number of videos. To select several initial clustering centers of the video data, all data samples on the consistent feature hash ring are first divided by a finite grid, and all cross points of the grid can be used as candidate centroids of the clustering centers, as shown in Eq. (11) and Eq. (12):

$v_i = k_i \cdot \lambda$ (11)

$k_i \in \{0, 1, \ldots, \lfloor 2^{32} / \lambda \rfloor\}$ (12)

where $\lambda$ denotes the partition interval of the grid; $k_i$ denotes the index of the $i$th candidate cluster center $v_i$; and $v_i$ denotes the value of the $i$th candidate cluster center in the grid. Moreover, $k_i$ can be determined by the inequality $k_i \lambda \le x < (k_i + 1)\lambda$.

Subsequently, the points with higher density in the grid space are found by constructing a mountain peak function for each cross point to calculate its peak value $P(v_i)$. For example, the peak value of the $i$th candidate cluster center $v_i$ is calculated as shown in Eq. (13):

$P(v_i) = \sum_{j=1}^{N_v} \exp\big(-\alpha \lVert v_i - x_j \rVert^2\big)$ (13)

where $\alpha$ is a constant.
On this basis, the video hash feature points corresponding to the cross points with the $K$ largest peak values are selected sequentially as the initial clustering centers on the consistent feature hash ring, denoted as $C = \{c_1, c_2, \ldots, c_K\}$.
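A sketch of this initialization (Eqs. (11)-(13)) follows: grid cross points over the ring coordinates serve as candidate centers, each receives a mountain peak value, and the $K$ highest-peak points seed the clustering. The grid interval, $\alpha$, and $K$ are assumed parameters chosen for illustration.

```python
import numpy as np

def init_centers(points: np.ndarray, interval: float, alpha: float, k: int) -> np.ndarray:
    """Sketch of Eqs. (11)-(13): mountain-peak initialization of K cluster centers.

    points: [N, 2] video hash feature points (two-dimensional ring coordinates)
    """
    # Eqs. (11)-(12): grid cross points spanning the data range at the given interval.
    xs = np.arange(points[:, 0].min(), points[:, 0].max() + interval, interval)
    ys = np.arange(points[:, 1].min(), points[:, 1].max() + interval, interval)
    grid = np.array([(x, y) for x in xs for y in ys])   # candidate centroids
    # Eq. (13): peak value of each cross point (sum of Gaussian kernels).
    sq_dist = ((grid[:, None, :] - points[None, :, :]) ** 2).sum(axis=2)
    peaks = np.exp(-alpha * sq_dist).sum(axis=1)
    # Cross points with the K largest peak values become the initial centers.
    return grid[np.argsort(peaks)[-k:]]

# Usage: 200 random 2-D points, grid interval 0.5, alpha 1.0, K = 4 centers.
centers = init_centers(np.random.rand(200, 2), 0.5, 1.0, 4)
```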
Since the FD-Means clustering algorithm seldom considers the curse of dimensionality when using the Euclidean distance to measure the similarity between video features, a new similarity measurement function is constructed, as shown in Eq. (14):

$S(x_i, x_j) = \omega \cdot D(x_i, x_j)$ (14)

where $\omega$ denotes a weight factor and $D(x_i, x_j)$ denotes the Euclidean distance between the $i$th and $j$th video hash feature points, as shown in Eq. (15):

$D(x_i, x_j) = \sqrt{\sum_{l=1}^{d} (x_{il} - x_{jl})^2}$ (15)

where $x_{il}$ and $x_{jl}$ denote the $l$th dimensions of the two feature vectors and $d$ denotes their dimensionality.
When updating an FD-Means cluster center, the sum of the distances between each video hash feature point and the other video hash feature points in its cluster is first calculated, and the video hash feature point with the smallest sum of distances is selected as the cluster center through iteration, as shown in Eq. (16):

$c_d' = \arg\min_{x_i \in d} \sum_{x_j \in d} S(x_i, x_j)$ (16)

where $d$ denotes a cluster and $c_d'$ denotes the new cluster center after the initial cluster center of $d$ is updated.
When updating the clusters, the distances between the video hash feature points of the non-cluster centers and all cluster centers are first calculated. In the K-Means clustering algorithm, all non-center points are divided into the nearest clusters according to the nearest neighbor principle. Unlike the K-Means clustering algorithm, the optimized FD-Means clustering algorithm compares the distances between each non-center video hash feature point and all cluster centers. If the minimum distance is not less than a given threshold $\varepsilon$, the video hash feature point is taken as the cluster center of a new cluster; otherwise, it is grouped into the nearest cluster according to the nearest neighbor principle. The automatic clustering of the video hash feature points on the consistent hash ring is thus finally achieved, as shown in Eq. (17) and Eq. (18):

$d^* = \arg\min_{d} S(x_i, c_d)$ (17)

$d' = \begin{cases} \{x_i\}, & \text{if } \min_{d} S(x_i, c_d) \ge \varepsilon \\ d^* \cup \{x_i\}, & \text{otherwise} \end{cases}$ (18)

where $d'$ denotes an updated cluster.
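The optimized FD-Means iteration of Eqs. (16)-(18) can be sketched as below: each point is assigned to its nearest center unless the minimum distance reaches the threshold $\varepsilon$, in which case it founds a new cluster, and each center is then replaced by the cluster member with the smallest within-cluster distance sum. The fixed iteration cap and convergence test are simplifications.

```python
import numpy as np

def fd_means(points: np.ndarray, centers: np.ndarray, eps: float, iters: int = 20):
    """Sketch of Eqs. (16)-(18): threshold-based assignment + medoid-style update."""
    centers = centers.copy()
    for _ in range(iters):
        # Eqs. (17)-(18): assign each point, spawning a new cluster if too far away.
        labels = np.empty(len(points), dtype=int)
        for i, p in enumerate(points):
            dists = np.linalg.norm(centers - p, axis=1)
            if dists.min() >= eps:
                centers = np.vstack([centers, p])      # p becomes a new cluster center
                labels[i] = len(centers) - 1
            else:
                labels[i] = dists.argmin()
        # Eq. (16): new center = member minimizing the sum of in-cluster distances.
        new_centers = centers.copy()
        for c in np.unique(labels):
            members = points[labels == c]
            sums = np.linalg.norm(members[:, None] - members[None, :], axis=2).sum(axis=1)
            new_centers[c] = members[sums.argmin()]
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

# Usage: cluster ring points starting from the mountain-peak centers (see above).
# centers, labels = fd_means(points, init_centers(points, 0.5, 1.0, 4), eps=0.8)
```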
In the automatic cleaning of near-duplicate video data, not all near-duplicate videos obtained by clustering can be eliminated, so as to preserve the data consistency, completeness, correctness, and minimization of the video set. Therefore, a representative video should be retained from each group of near-duplicate videos, and the other near-duplicate videos can be deleted automatically to improve the video data quality.
At present, traditional image or video de-duplication methods tend to retain the first detected datum in an ordered sequence and remove the other near-duplicate data. However, such methods behave randomly under different settings of the sequence length, which leads to fluctuations in the cleaning results. Hence, a video in a cluster cannot be arbitrarily selected as the seed or representative video to be retained. In this paper, the video data corresponding to the cluster centers are retained as the representative video data, and the others in the clusters are deleted, as shown in Eq. (19):

$V' = V - \bigcup_{d} \big(V_d - g(c_d)\big)$ (19)

where $g$ denotes a mapping function between the set $C$ of cluster centers and the set $V_C$ of corresponding video data, with the mapping relationship expressed as $g: C \to V_C$; $V_d$ denotes the set of videos in cluster $d$; and $V'$ denotes the original video dataset updated after applying the mapping function $g$.
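Finally, the cleaning step of Eq. (19) keeps only the video mapped to each cluster center and deletes the other cluster members; a minimal sketch, assuming the point array, labels, and centers produced by the previous sketch and a parallel list of video identifiers:

```python
import numpy as np

def clean(points: np.ndarray, labels: np.ndarray, centers: np.ndarray,
          video_ids: list[str]) -> list[str]:
    """Sketch of Eq. (19): retain one representative video per cluster."""
    kept = []
    for c, center in enumerate(centers):
        members = np.where(labels == c)[0]
        if len(members) == 0:
            continue
        # g: cluster center -> the member video closest to it (its representative).
        dists = np.linalg.norm(points[members] - center, axis=1)
        kept.append(video_ids[members[dists.argmin()]])
    return kept  # all other near-duplicate videos in each cluster are deleted

# Usage: keep one video per cluster found by fd_means (see the previous sketch).
# cleaned_dataset = clean(points, labels, centers, video_ids)
```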
Through the above processes, the automatic cleaning of near-duplicate video data is finally achieved.