1. Introduction
Rate control is a critical part of video compression, particularly in bandwidth-limited applications such as live streaming and broadcasting. In recent years, end-to-end image compression [1] has shown that learned coding can outperform traditional image coding. Lu et al. [2] proposed the first end-to-end framework for video compression, in which the key components of traditional video compression are replaced by neural networks. To further improve end-to-end video compression, Li et al. [3] proposed the deep contextual video compression (DCVC) model, which leverages a high-dimensional context to carry rich information for high-frequency content and achieves higher coding quality. Since bit allocation directly affects the rate-distortion (RD) performance, Çetin et al. [4] exploited frame-level bit allocation for intra-coded and bi-directionally predicted frames. However, frame-level bit allocation alone cannot determine a suitable \(\lambda\) to decrease the RD cost, so rate control for learned video compression remains an open problem. Li et al. [5] presented an R-D-\(\lambda\) rate control model for learned video compression; however, its rate control parameters are still obtained via traditional methods.
In this paper, we focus on achieving end-to-end rate control by using convolutional neural networks (CNNs) to obtain the optimal bit allocation and \(\lambda\). The major contributions of this paper are as follows:
(1) A two-branch residual-based network is designed to predict the bit rate ratio, in which the coding parameters of the previously encoded frame are treated as a temporal coding feature vector. The bit rate can then be reasonably allocated to every frame according to the low- and high-level coding features extracted by the designed network.
(2) A two-branch regression-based network is designed to obtain the optimal \(\lambda\). To effectively decrease the RD cost, the temporal coding information and the residual frame are used as the network inputs. In addition, a regression block is added to enhance the learning and expressive ability of the network.
2. End-to-End Rate Control
2.1 Framework
For end-to-end rate control, the original frame is input into the two-branch residual-based network to predict the bit rate ratio. Then, the bit rate can be reasonably allocated to every frame by considering the bit buffer. With the allocated bits of the frame, the optimal \(\lambda\) can be predicted by the two-branch regression-based network for the DCVC encoder. Figure 1 shows the framework of the end-to-end rate control.
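To make the control flow concrete, the following is a minimal Python sketch of this per-frame pipeline, assuming hypothetical ratio_net, lambda_net, and dcvc_encode_frame interfaces (the two networks are detailed in Sects. 2.2 and 2.3); it illustrates the data flow rather than the actual implementation.

```python
# Illustrative per-frame control loop for the framework in Fig. 1.
# ratio_net, lambda_net and dcvc_encode_frame are hypothetical interfaces;
# for simplicity, all bit rate ratios of the GoP are predicted up front,
# and the residual frame used by lambda_net is assumed to be formed internally.
def encode_gop(frames, r_gop, ratio_net, lambda_net, dcvc_encode_frame, prev_stats):
    """prev_stats = (R_F, D_F, lambda_F) of the last encoded frame."""
    ratios = [ratio_net(f, prev_stats, r_gop) for f in frames]  # predicted W_i
    used = 0.0                                                  # bits spent in this GoP
    for n, frame in enumerate(frames):
        # Allocate the remaining GoP budget in proportion to W_n (cf. Eq. (2)).
        r_frame = (r_gop - used) / sum(ratios[n:]) * ratios[n]
        lam = lambda_net(frame, r_frame, prev_stats)            # optimal lambda for this frame
        bits, dist = dcvc_encode_frame(frame, lam)              # encode with DCVC
        used += bits
        prev_stats = (bits, dist, lam)
    return used, prev_stats
```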
2.2 Frame Bit Allocation
To fully utilize the temporal correlation, a two-branch network structure is used, as shown in Fig. 2. In Fig. 2, \(R_F(n-1)\), \(D_F(n-1)\) and \(\lambda_F(n-1)\) are the bit rate, distortion and Lagrangian multiplier of the previously encoded frame, respectively. \(R_G\) is the target bit rate of the current group of pictures (GoP), and \(W\) is the bit rate ratio predicted by the network. In the upper branch, the low-level features of the frame are extracted by two convolutional layers with \(3\times3\) kernels, and then a residual block extracts the high-level features. In the lower branch, the coding information of the previous frame is input into the network. Since the features of the two branches have strong temporal correlation, a multiplication operation is used to fuse them. Finally, the fused features are converted to predict the bit rate ratio \(W\).
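A minimal PyTorch sketch of such a two-branch residual-based network is given below; the channel widths, the global average pooling, and the fully connected head are assumptions, since only the two 3×3 convolutional layers, the residual block, the coding-information branch, and the multiplicative fusion are specified above.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """High-level feature extraction with a skip connection."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1))

    def forward(self, x):
        return torch.relu(x + self.body(x))

class RatioNet(nn.Module):
    """Two-branch residual-based network predicting the bit rate ratio W."""
    def __init__(self, ch=32):
        super().__init__()
        # Upper branch: low-level features via two 3x3 convolutions.
        self.low = nn.Sequential(
            nn.Conv2d(3, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True))
        self.high = ResidualBlock(ch)        # high-level features
        self.pool = nn.AdaptiveAvgPool2d(1)  # assumed spatial aggregation
        # Lower branch: [R_F(n-1), D_F(n-1), lambda_F(n-1), R_G].
        self.coding = nn.Sequential(
            nn.Linear(4, ch), nn.ReLU(inplace=True), nn.Linear(ch, ch))
        # Fused features -> positive bit rate ratio W.
        self.head = nn.Sequential(
            nn.Linear(ch, ch), nn.ReLU(inplace=True),
            nn.Linear(ch, 1), nn.Softplus())

    def forward(self, frame, coding_vec):
        f = self.pool(self.high(self.low(frame))).flatten(1)  # (B, ch)
        c = self.coding(coding_vec)                            # (B, ch)
        return self.head(f * c)                                # multiplicative fusion
```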
The GoP bit allocation \(R_G\) can be expressed as
\[\begin{equation*} R_G = \frac{R_{\mathrm{target}}\cdot(n_{\mathrm{encoded}}+N_{SW})-R_{\mathrm{encoded}}}{N_{SW}}\cdot N_G \tag{1} \end{equation*}\]
where \(R_{\mathrm{target}}\) and \(R_{\mathrm{encoded}}\) are the target bit rate and the total bit rate already used, respectively; \(N_G\) is the number of frames in the GoP; \(n_{\mathrm{encoded}}\) is the number of encoded frames; and \(N_{SW}\) is the size of the smoothing window. Then, the bit allocation of frame \(n\) can be expressed as
\[\begin{equation*} R_F(n) = \frac{R_G-R_{\mathrm{encoded-G}}}{\displaystyle\sum_{i=n}^{N_G}W_i}\cdot W_n \tag{2} \end{equation*}\]
where \(R_{\mathrm{encoded-G}}\) is the bit rate already used by the encoded frames in the current GoP, and \(W_n\) is the bit rate ratio of frame \(n\), which is predicted by the two-branch residual-based network. The loss function of the network is defined as
\[\begin{equation*} Loss_{ratio} = \frac{1}{N}\cdot \sum_{i=1}^{N}(W_i-\hat{W}_i)^2 \tag{3} \end{equation*}\]
where \(W_i\) is the predicted bit rate ratio, \(\hat{W}_i\) is the actual bit rate ratio, and \(N\) is the number of frames for training.
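Under these definitions, Eqs. (1)-(3) translate directly into the following sketch (plain Python, with 0-based frame indexing):

```python
def gop_bit_budget(r_target, r_encoded, n_encoded, n_sw, n_gop):
    """Eq. (1): remaining bits smoothed over a window of N_SW frames,
    scaled to a GoP of N_G frames."""
    return (r_target * (n_encoded + n_sw) - r_encoded) / n_sw * n_gop

def frame_bit_allocation(r_gop, r_encoded_gop, ratios, n):
    """Eq. (2): allocate the remaining GoP bits to frame n (0-based index)
    in proportion to its predicted ratio W_n."""
    return (r_gop - r_encoded_gop) / sum(ratios[n:]) * ratios[n]

def ratio_loss(pred_ratios, true_ratios):
    """Eq. (3): mean squared error between predicted and actual ratios."""
    n = len(pred_ratios)
    return sum((w - w_hat) ** 2 for w, w_hat in zip(pred_ratios, true_ratios)) / n
```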
2.3 Optimal \(\lambda\) Decision
Figure 3 shows the structure of the two-branch regression-based network used to predict \(\lambda\). Since the residual frame, i.e., the difference between the predicted frame and the original frame, indicates the correlation between adjacent frames, it is used as the upper-branch input. The bit allocation of the current frame \(R_F(n)\), calculated using Eq. (2), and the bit cost \(R_F(n-1)\), distortion \(D_F(n-1)\), and \(\lambda_F(n-1)\) of the previously encoded frame are used as the lower-branch input. Then, the fused feature of the two branches is input into the regression block. Finally, the network predicts the optimal \(\lambda\).
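A minimal PyTorch sketch of such a two-branch regression-based network is given below; the layer widths, the pooling of the residual-frame branch, the multiplicative fusion, and the form of the regression block are assumptions beyond what is specified above.

```python
import torch
import torch.nn as nn

class LambdaNet(nn.Module):
    """Two-branch regression-based network predicting lambda."""
    def __init__(self, ch=32):
        super().__init__()
        # Upper branch: features of the residual frame (predicted - original).
        self.feat = nn.Sequential(
            nn.Conv2d(3, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        # Lower branch: [R_F(n), R_F(n-1), D_F(n-1), lambda_F(n-1)].
        self.coding = nn.Sequential(
            nn.Linear(4, ch), nn.ReLU(inplace=True), nn.Linear(ch, ch))
        # Regression block mapping the fused feature to a positive lambda.
        self.regress = nn.Sequential(
            nn.Linear(ch, ch), nn.ReLU(inplace=True),
            nn.Linear(ch, 1), nn.Softplus())

    def forward(self, residual_frame, coding_vec):
        return self.regress(self.feat(residual_frame) * self.coding(coding_vec))
```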
Unlike the loss function of the two-branch residual-based network, the two-branch regression-based network for \(\lambda\) is trained with a multi-task loss function, which is defined as
\[\begin{equation*} Loss_\lambda = \gamma \left(\frac{| R_F-\hat{R}_F[\lambda]|}{R_F}\right)^2+(1-\gamma)\hat{D}_F[\lambda] \tag{4} \end{equation*}\]
where \(\gamma\) is set to 0.4 empirically; \(R_F\) is the bit allocation calculated by Eq. (2); and \(\hat{R}_F[\lambda]\) and \(\hat{D}_F[\lambda]\) are the actual bits and distortion, respectively, where the notation \([\lambda]\) indicates that they are obtained by encoding with parameter \(\lambda\).
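A direct transcription of Eq. (4) is sketched below; \(\hat{R}_F[\lambda]\) and \(\hat{D}_F[\lambda]\) must be obtained by actually encoding the frame with the predicted \(\lambda\), which is assumed to happen outside this function.

```python
def lambda_loss(r_target, r_actual, d_actual, gamma=0.4):
    """Eq. (4): weighted sum of the squared relative rate error and the
    distortion obtained by encoding with the predicted lambda."""
    rate_term = (abs(r_target - r_actual) / r_target) ** 2
    return gamma * rate_term + (1.0 - gamma) * d_actual
```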
3. Experimental Results
The proposed algorithm is implemented in DCVC, and the methods of Li et al. [5] and Li et al. [6] are used for comparison. The Vimeo-90k [7] and BVI-DVC [8] datasets are used to train the two designed networks. One hundred frames of every test sequence are encoded. DCVC is used as the anchor, and four RD points are selected: \(\lambda=256\), 512, 1024 and 2048. The bit rate accuracy is defined as
\[\begin{equation*} M = \frac{|R-\hat{R}|}{R} \tag{5} \end{equation*}\]
where \(R\) is the target bit rate, and \(\hat{R}\) is the actual bit rate; a smaller \(M\) indicates higher rate control accuracy. Table 1 shows the bit rate accuracy results.
Table 1 Bit rate accuracy comparisons of DCVC, Li et al. [5], Li et al. [6] and the proposed algorithm.
Table 1 shows that the average bit rate errors of DCVC, Li et al. [5], Li et al. [6] and the proposed algorithm are 2.62%, 3.89%, 5.93% and 2.25%, respectively, so the proposed algorithm has better control accuracy than the other algorithms. Even though controlling the bit rate is a highly challenging task for end-to-end coding, all four algorithms maintain high accuracy. Table 2 shows a comparison of the coding quality of the algorithms.
In Table 2, the average BD-rate (PSNR) indices of Li et al. [5], Li et al. [6] and the proposed algorithm are \(-0.69\), \(-0.35\) and \(-0.84\), respectively, which indicates that the proposed algorithm achieves the largest bit rate saving at the same objective quality. For the BD-rate (SSIM) indices, the proposed algorithm achieves \(-0.35\), whereas Li et al. [5] and Li et al. [6] achieve \(-0.24\) and \(-0.17\), respectively. Thus, the proposed algorithm also provides the largest improvement in subjective coding quality. Since the proposed algorithm uses the temporal coding information to train the networks, the bit rate can be allocated more reasonably to match the changing frame characteristics, and \(\lambda\) can be selected more effectively to decrease the RD cost.
Figure 4 shows the RD comparisons of DCVC, Li et al. [5], Li et al. [6] and the proposed algorithm. The proposed algorithm has better RD performance than the other algorithms, which indicates the effectiveness of the proposed algorithm. In summary, the proposed end-to-end rate control can improve both objective and subjective coding performance with good control accuracy.
4. Conclusions
In this work, a two-branch residual-based network and a two-branch regression-based network are designed to obtain the bit rate ratio and \(\lambda\) for end-to-end rate control. By fully utilizing the temporal coding correlation, the rate control parameters are selected appropriately to match the coding characteristics of each frame. Experimental results show that the proposed algorithm can significantly improve the coding performance with high rate control accuracy.
References
[1] F. Jiang, W. Tao, S. Liu, J. Ren, X. Guo, and D. Zhao, “An end-to-end compression framework based on convolutional neural networks,” IEEE Trans. Circuits Syst. Video Technol., vol.28, no.10, pp.3007-3018, 2017.
[2] G. Lu, W. Ouyang, D. Xu, X. Zhang, C. Cai, and Z. Gao, “DVC: An end-to-end deep video compression framework,” Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.10998-11007, 2019.
[3] J. Li, B. Li, and Y. Lu, “Deep contextual video compression,” Adv. Neural Inf. Process. Syst., vol.34, pp.18114-18125, 2021.
[4] E. Çetin, M.A. Yılmaz, and A.M. Tekalp, “Flexible-rate learned hierarchical bi-directional video compression with motion refinement and frame-level bit allocation,” Proc. 2022 IEEE International Conference on Image Processing (ICIP), pp.1206-1210, 2022.
[5] Y. Li, X. Chen, J. Li, J. Wen, Y. Han, S. Liu, and X. Xu, “Rate control for learned video compression,” Proc. ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.2829-2833, 2022.
[6] B. Li, H. Li, L. Li, and J. Zhang, “λ domain rate control algorithm for high efficiency video coding,” IEEE Trans. Image Process., vol.23, no.9, pp.3841-3854, 2014.
[7] T. Xue, B. Chen, J. Wu, D. Wei, and W.T. Freeman, “Video enhancement with task-oriented flow,” Int. J. Comput. Vis., vol.127, pp.1106-1125, 2019.
[8] D. Ma, F. Zhang, and D.R. Bull, “BVI-DVC: A training database for deep video compression,” IEEE Trans. Multimed., vol.24, pp.3847-3858, 2021.