1. Introduction
With the growth of data in video services on telecommunications networks, ensuring the quality of experience (QoE) for viewers is one of the urgent requirements. QoE is generally defined as the level of satisfaction or dissatisfaction of users when using a certain service or application [
1]. To achieve user satisfaction, image quality stability is one of the most important criteria. In [
2], an experiment indicated that the Mean Opinion Score (MOS) is decreased significantly when the quality changing frequency is increased. The reason is that the quality changing between frames in a video sequence typically causes the annoying experience to human visual perception (see
Figure 1). Therefore, keeping the video quality consistent is a necessary process to improve QoE in video coding.
Along with the quality of human perceptual vision, consistent quality of video signal also needs for increasing efficiency of the visual sensor networks. Because the visual sensor network not only allows user to observe image, but also provides input images to learning machine systems for analysis purpose. In these systems, the quality of video directly affects the analysis performance. As shown in [
3,
4], the accuracy of object detection algorithms are directly proportional to the quality of video. Thus, the keeping stability of quality video at high level is beneficial for learning machine systems. Several methods have been proposed to improve the stability of video quality in a rate control algorithm. In [
5], a method is proposed to provide the smooth quantization under CBR (constant bitrate) encoding by introducing a low filtering mechanism to smooth the quantization parameters produced from the traditional rate control algorithm. In [
6], a sequential rate control algorithm was proposed for real-time video coding. However, in [
5,
6] methods used mean squared error (MSE) to measure visual quality although this metric does not reflect exactly the human perceptual quality. Thus, in [
7], a new visual quality metric (VQM) is proposed. However, the proposed metric also uses MSE beside the motion information content in computing VQM. To address the causes of quality fluctuation in RDO, in [
8], the Lagrange multiplier
is adjusted according to the video content in the RDO process in order to ensure that reconstructed video always achieves the stable quality level. In particularly, the Lagrangian multiplier and quantization parameter (QP) for each frame are computed so that the difference of quality between the predicted frame and the correspondingly decoded frame is minimized. With the goal of achieving a constant quality level among frames, method in [
9] uses the probability density function of the transform coefficients to estimate the depth of the coding tree. From there, the quality of the coding blocks is adjusted to ensure a stable quality level. In [
10], the content-adapted quality-distortion model for the H.264 encoding standard is used to estimate the distortion between the original frame and the decoded frame. Based on the estimated distortion value, the QP parameter for each frame is found to achieve the desired quality level.
Another issue related to the visual quality of video signal is the quality assessment metric. In general, conventional objective quality assessment metrics are preferable in practical applications since they offer a specific computational formula and may be easily implemented in the encoder. Peak Signal to Noise Ratio (PSNR) and Mean Square Error (MSE) are the two most commonly used objective measures, respectively. In [
11], a PSNR based method is proposed to control the constant quality of reconstructed video sequence. In this method, to keep the video quality is constant in terms of PSNR, the QP of each frame is adjusted according to the average PSNR of the previous frames. If the average PSNR is less than the PSNR target, the QP of the current frame is reduced and vice-versa. However, it has been demonstrated that PSNR or MSE only have a weak relationship with the human visual system (HVS). Therefore, Netflix created a metric called VMAF, which uses a machine-learning model that is trained on user feedback, to reflect the viewer’s viewpoint [
12]. By using Support Vector Machine (SVM) regression, this metric is created by combining several fundamental metrics such as VIF [
13], Detail Loss Metric - DLM [
14], and Motion. In practical, the industry usually uses the VMAF metric extensively because its superior accuracy compared to traditional metrics [
15,
16,
17].
Because of its benefits, VMAF is proposed to replace conventional metrics in some literatures such as [
18,
19]. In these methods, the relationship between sum of squared difference (SSD) and VMAF is built. Consequently, Lagrange multiplier in RDO function is computed based on VMAF instead of MSE. Also using perceptual visual to improve RDO, methods proposed in [
20,
21] used neural network to predict QP value. In particularly, authors in [
20] proposed a perceptual adaptive quantization based on a VGG - 16 model on high efficiency video coding (HEVC) for bitrate reduction while maintaining subjective visual quality. In [
21], the proposed method used CNN model to predict the visibility threshold for each image patch and then estimate QP value based on this visibility threshold. However, in these mentioned methods, the proposed methods only focus on improving Rate-Distortion (RD) performance of encoder while the stability of video quality at frame-level is not considered. To overcome the drawbacks of the previous methods, we proposed a VMAF-based method to predict QP by using CNN model in article [
22]. However, this method is applied for intra-mode encoding and for low resolution video sequences only. To develop a method which can be applied for variety of video resolutions in both intra-mode and inter-mode, in this paper, we proposed: (1) An estimation for the rate-quantization parameter and distortion-quantization parameter functions based on VMAF metric instead of PSNR. (2) An CNN-based algorithm to estimate QP value at block-level in order to achieve a target quality for overall frame.
The rest of the paper is organized as follows. In section 2, the background works on RDO modeling and perceptual RDO for quality consistency are introduced. Then, the framework of the proposed system is illustrated in section 3. Experimental parameters and simulation results are presented in section 4. Finally, section 5 concludes the contributions of this work.