This work addresses the challenge of video depth estimation, which expects
not only per-frame accuracy but, more importantly, cross-frame consistency.
Instead of directly developing a depth estimator from scratch, we reformulate
the prediction task into a conditional generation problem. This allows us to
leverage the prior knowledge embedded in existing video generation models,
thereby reducing learning difficulty and enhancing generalizability.
Concretely, we study how to tame the public Stable Video Diffusion (SVD) to
predict reliable depth from input videos using a mixture of image depth and
video depth datasets. We empirically confirm that a procedural training
strategy - first optimizing the spatial layers of SVD and then optimizing the
temporal layers while keeping the spatial layers frozen - yields the best
results in terms of both spatial accuracy and temporal consistency. We further
examine the sliding window strategy for inference on arbitrarily long videos.
Our observations indicate a trade-off between efficiency and performance, with
a one-frame overlap already producing favorable results. Extensive experimental
results demonstrate the superiority of our approach, termed ChronoDepth, over
existing alternatives, particularly in terms of the temporal consistency of the
estimated depth. Additionally, we highlight the benefits of more consistent
video depth in two practical applications: depth-conditioned video generation
and novel view synthesis. Our project page is available at this https URL.
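
The sliding-window inference mentioned in the abstract can be illustrated with a small index-planning sketch. This is not the authors' implementation: the function name `sliding_windows` and the parameters `window` and `overlap` are hypothetical, and the sketch only shows how consecutive windows might share frames and how a larger overlap increases the number of model calls. How the shared frames are actually used by the model is not specified here.

```python
def sliding_windows(num_frames, window, overlap=1):
    """Plan (start, end) frame spans that cover a video of num_frames frames.

    Hypothetical helper: each span holds at most `window` frames, and
    consecutive spans share `overlap` frames. `end` is exclusive.
    """
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    if num_frames <= 0:
        return []
    stride = window - overlap  # frames newly covered by each extra window
    spans = []
    start = 0
    while True:
        end = min(start + window, num_frames)
        spans.append((start, end))
        if end == num_frames:
            break
        start += stride
    return spans


# One-frame overlap on a 10-frame clip with 5-frame windows.
print(sliding_windows(10, 5, overlap=1))  # [(0, 5), (4, 9), (8, 10)]

# Larger overlaps need more windows (more model calls) for the same video.
for k in (0, 1, 5):
    print(k, len(sliding_windows(100, 10, overlap=k)))
```

With 100 frames and 10-frame windows, this plan needs 10 windows without overlap, 11 with a one-frame overlap, and 19 with a five-frame overlap, which illustrates the efficiency side of the trade-off the abstract describes.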

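This work reformulates video depth estimation as a conditional generation problem, leveraging the prior knowledge in existing video generation models to reduce learning difficulty and improve generalization. The authors empirically validate a training strategy that first optimizes the spatial layers and then the temporal layers, and they apply a sliding window strategy for inference on arbitrarily long videos, obtaining more temporally consistent depth estimates. Experiments show that the proposed ChronoDepth outperforms existing methods in the temporal consistency of the estimated depth, and that more consistent video depth benefits practical applications such as depth-conditioned video generation and novel view synthesis.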

Learning Temporally Consistent Video Depth from Video Diffusion Priors

Human image animation involves generating a video from a static image by
following a specified pose sequence. Current approaches typically adopt a
multi-stage pipeline that separately learns appearance and motion, which often
leads to appearance degradation and temporal inconsistencies. To address these
issues, we propose VividPose, an innovative end-to-end pipeline based on Stable
Video Diffusion (SVD) that ensures superior temporal stability. To enhance the
retention of human identity, we propose an identity-aware appearance controller
that integrates additional facial information without compromising other
appearance details such as clothing texture and background. This approach
ensures that the generated videos maintain high fidelity to the identity of
the human subject, preserving key facial features across various poses. To
accommodate diverse human body shapes and hand movements, we introduce a
geometry-aware pose controller that utilizes both dense rendering maps from
SMPL-X and sparse skeleton maps. This enables accurate alignment of pose and
shape in the generated videos, providing a robust framework capable of handling
a wide range of body shapes and dynamic hand movements. Extensive qualitative
and quantitative experiments on the UBCFashion and TikTok benchmarks
demonstrate that our method achieves state-of-the-art performance. Furthermore,
VividPose exhibits superior generalization capabilities on our proposed
in-the-wild dataset. Code and models will be made available.

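By building on Stable Video Diffusion (SVD), integrating facial information, and using controllers that accurately align human pose and shape, VividPose preserves human identity and provides a robust framework that handles diverse body shapes and dynamic hand movements. It achieves state-of-the-art performance and shows strong generalization on the proposed in-the-wild dataset.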