For neural video codecs, it is critical, yet challenging, to design an efficient entropy model that can accurately predict the probability distribution of the quantized latent representation. However, most existing video codecs directly reuse a ready-made entropy model from an image codec to encode the residual or motion, and do not fully exploit the spatial-temporal characteristics of video. To address this, this paper proposes a powerful entropy model that efficiently captures both spatial and temporal dependencies. In particular, we introduce a latent prior, which exploits the correlation among latent representations to reduce temporal redundancy. Meanwhile, a dual spatial prior is proposed to reduce spatial redundancy in a parallel-friendly manner. Our entropy model is also versatile: besides estimating the probability distribution, it generates the quantization step in a spatial-channel-wise manner. This content-adaptive quantization mechanism not only helps our codec achieve smooth rate adjustment within a single model but also improves the final rate-distortion performance through dynamic bit allocation. Experimental results show that, powered by the proposed entropy model, our neural codec achieves an 18.2% bitrate saving on the UVG dataset compared with H.266 (VTM) using its highest-compression-ratio configuration. This marks a new milestone in the development of neural video codecs. The code is available at https://github.com/microsoft/DCVC.

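This paper proposes a powerful entropy model that efficiently captures both spatial and temporal dependencies in video, using a latent prior to reduce temporal redundancy and a dual spatial prior to reduce spatial redundancy in a parallel manner. In addition, the entropy model includes a content-adaptive quantization mechanism, which helps the codec achieve smooth rate adjustment and improves the final rate-distortion performance through dynamic bit allocation. Experimental results show that, powered by this entropy model, our neural codec achieves an 18.2% bitrate saving on the UVG dataset compared with H.266 (VTM) under its highest-compression-ratio configuration, a new milestone in the development of neural video codecs.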
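To make the idea of content-adaptive quantization concrete, the following is a minimal sketch and not the authors' implementation. It assumes a discretized Gaussian as the probability model (the abstract does not name the distribution), and it takes the per-element quantization steps as hand-picked inputs, whereas in the paper they are generated by the entropy model. All function names (`quantize`, `estimated_bits`, `encode_latent`) are hypothetical. The sketch shows that a larger step at a given position costs fewer estimated bits but gives a coarser reconstruction, which is the lever behind dynamic bit allocation.

```python
import math


def gaussian_cdf(x, mean, scale):
    """CDF of a Gaussian with the given mean and standard deviation."""
    return 0.5 * (1.0 + math.erf((x - mean) / (scale * math.sqrt(2.0))))


def quantize(value, step):
    """Scale by the per-element step, then round to an integer symbol."""
    return round(value / step)


def dequantize(symbol, step):
    """Map an integer symbol back to the latent domain."""
    return symbol * step


def estimated_bits(symbol, mean, scale, step):
    """Bits for one symbol under a discretized Gaussian (assumed model).

    The mean and scale are expressed in the step-normalized domain, and the
    probability mass of the unit-width bin around the symbol is used.
    """
    m, s = mean / step, scale / step
    p = gaussian_cdf(symbol + 0.5, m, s) - gaussian_cdf(symbol - 0.5, m, s)
    return -math.log2(max(p, 1e-9))  # clamp to avoid log2(0)


def encode_latent(latent, means, scales, steps):
    """Quantize each element with its own step.

    Returns the integer symbols, the reconstruction, and the total
    estimated bit cost.
    """
    symbols, recon, bits = [], [], 0.0
    for y, mu, sigma, q in zip(latent, means, scales, steps):
        k = quantize(y, q)
        symbols.append(k)
        recon.append(dequantize(k, q))
        bits += estimated_bits(k, mu, sigma, q)
    return symbols, recon, bits


def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Toy latent with a zero-mean prior; the steps here are illustrative values.
latent = [0.9, -2.3, 4.1, 0.2]
means = [0.0] * 4
scales = [2.0] * 4

fine = encode_latent(latent, means, scales, [0.5] * 4)
coarse = encode_latent(latent, means, scales, [2.0] * 4)
# Per-element steps: spend more precision on the first two positions only.
mixed = encode_latent(latent, means, scales, [0.5, 0.5, 2.0, 2.0])
```

Comparing `fine`, `mixed`, and `coarse` shows the trade-off: the estimated bit cost falls as steps grow, while `mse(latent, recon)` rises. Varying the steps per position, as in `mixed`, is what lets a single model shift bits toward the content that benefits most.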

Hybrid Spatial-Temporal Entropy Modelling for Neural Video Compression