This paper presents a synthetic multimodal dataset of daily activities that
fuses video data from a 3D virtual space simulator with knowledge graphs
depicting the spatiotemporal context of the activities. The dataset is
developed for the Knowledge Graph Reasoning Challenge for Social Issues
(KGRC4SI), which focuses on identifying and addressing hazardous situations in
the home environment. The dataset is publicly available as a valuable
resource for researchers and practitioners developing innovative solutions
that recognize human behaviors to enhance safety and well-being in home
environments.

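This work provides a synthetic multimodal dataset that fuses video data from a 3D virtual space simulator with knowledge graphs depicting the spatiotemporal context of activities. The dataset is intended for the Knowledge Graph Reasoning Challenge for Social Issues (KGRC4SI), which focuses on identifying and addressing hazardous situations in the home environment. It is open to the public as a valuable resource for researchers and practitioners developing innovative solutions that recognize human behaviors to enhance safety and well-being.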

Synthetic Multimodal Dataset for Empowering Safety and Well-being in Home Environments

Multi-frame human pose estimation in complicated situations is challenging.
Although state-of-the-art human joint detectors have demonstrated remarkable
results on static images, their performance falls short when these models are
applied to video sequences. Prevalent shortcomings include the failure to handle
motion blur, video defocus, or pose occlusions, arising from the inability to
capture the temporal dependency among video frames. On the other hand,
directly employing conventional recurrent neural networks incurs empirical
difficulties in modeling spatial contexts, especially for dealing with pose
occlusions. In this paper, we propose a novel multi-frame human pose estimation
framework, leveraging abundant temporal cues between video frames to facilitate
keypoint detection. Three modular components are designed in our framework. A
Pose Temporal Merger encodes keypoint spatiotemporal context to generate
effective searching scopes while a Pose Residual Fusion module computes
weighted pose residuals in dual directions. These are then processed via our
Pose Correction Network for efficient refining of pose estimations. Our method
ranks No.1 in the Multi-frame Person Pose Estimation Challenge on the
large-scale benchmark datasets PoseTrack2017 and PoseTrack2018. We have
released our code, hoping to inspire future research.
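
As a rough illustration of what "weighted pose residuals in dual directions" could mean, the toy sketch below computes the difference between the current frame's keypoints and those of the previous and next frames, then blends the two differences. It is not the released implementation: the function name `weighted_dual_residuals`, the use of (x, y) coordinates instead of heatmaps, and the fixed weights are all assumptions made for this example.

```python
def weighted_dual_residuals(prev_pose, cur_pose, next_pose, w_prev=0.5, w_next=0.5):
    """Blend per-joint residuals from both temporal directions.

    Each pose is a list of (x, y) keypoints in the same joint order.
    The fixed weights are hypothetical; the abstract does not say how
    the weights are obtained.
    """
    fused = []
    for (px, py), (cx, cy), (nx, ny) in zip(prev_pose, cur_pose, next_pose):
        backward = (cx - px, cy - py)  # change relative to the previous frame
        forward = (cx - nx, cy - ny)   # change relative to the next frame
        fused.append((
            w_prev * backward[0] + w_next * forward[0],
            w_prev * backward[1] + w_next * forward[1],
        ))
    return fused


# Two joints tracked over three consecutive frames (made-up coordinates).
prev_pose = [(0.0, 0.0), (2.0, 2.0)]
cur_pose = [(1.0, 1.0), (2.0, 4.0)]
next_pose = [(4.0, 1.0), (2.0, 6.0)]

fused = weighted_dual_residuals(prev_pose, cur_pose, next_pose)
print(fused)
```

In this toy setting, a joint moving at constant velocity (the second joint) yields a zero fused residual, while a joint whose motion changes between frames does not.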

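This paper proposes a human pose estimation method based on multiple frames and temporal information. The method consists of three modules: a Pose Temporal Merger, a Pose Residual Fusion module, and a Pose Correction Network. Experiments on the PoseTrack2017 and PoseTrack2018 datasets show that the method achieves the best results, and the code has been released to encourage future research.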

Deep Dual Consecutive Network for Human Pose Estimation

We present a new large-scale multilingual video description dataset, VATEX,
which contains over 41,250 videos and 825,000 captions in both English and
Chinese. Among the captions, there are over 206,000 English-Chinese parallel
translation pairs. Compared to the widely-used MSR-VTT dataset, VATEX is
multilingual, larger, linguistically complex, and more diverse in terms of both
video and natural language descriptions. We also introduce two tasks for
video-and-language research based on VATEX: (1) Multilingual Video Captioning,
aimed at describing a video in various languages with a compact unified
captioning model, and (2) Video-guided Machine Translation, to translate a
source language description into the target language using the video
information as additional spatiotemporal context. Extensive experiments on the
VATEX dataset show that, first, the unified multilingual model can not only
produce both English and Chinese descriptions for a video more efficiently, but
also offer improved performance over the monolingual models. Furthermore, we
demonstrate that the spatiotemporal video context can be effectively utilized
to align source and target languages and thus assist machine translation.
Finally, we discuss the potential of using VATEX for other video-and-language
research.
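
To give a sense of scale, the short calculation below derives per-video averages from the counts quoted in the abstract. Because the abstract gives those counts as lower bounds ("over"), the ratios are only approximate; the variable names are chosen for this example.

```python
# Counts as quoted in the abstract (stated there as lower bounds).
videos = 41_250
captions = 825_000
parallel_pairs = 206_000

# Rough per-video averages implied by those counts.
captions_per_video = captions / videos
pairs_per_video = parallel_pairs / videos

print(f"captions per video: {captions_per_video:.2f}")
print(f"parallel pairs per video: {pairs_per_video:.2f}")
```

Under these figures, each video carries about 20 captions on average across the two languages, and roughly 5 of those are part of an English-Chinese parallel pair.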

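We present VATEX, a new large-scale multilingual video description dataset containing over 41,250 videos and 825,000 captions in English and Chinese, including over 206,000 English-Chinese parallel translation pairs. We also introduce two video-and-language research tasks based on VATEX: (1) Multilingual Video Captioning, which aims to describe a video in various languages with a compact unified captioning model, and (2) Video-guided Machine Translation, which translates a source-language description into the target language using video information as additional spatiotemporal context. Extensive experiments on the VATEX dataset show that the unified multilingual model not only produces English and Chinese descriptions of a video more efficiently, but also outperforms the monolingual models. We further show that spatiotemporal video context can be used effectively to align the source and target languages and thereby assist machine translation. Finally, we discuss the potential of using VATEX for other video-and-language research.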