Creating artificial social intelligence - algorithms that can understand the
nuances of multi-person interactions - is an exciting and emerging challenge in
processing facial expressions and gestures from multimodal videos. Recent
multimodal methods have set the state of the art on many tasks, but have
difficulty modeling the complex face-to-face conversational dynamics across
speaking turns in social interaction, particularly in a self-supervised setup.
In this paper, we propose Face-to-Face Contrastive Learning (F2F-CL), a graph
neural network designed to model social interactions using factorization nodes
to contextualize the multimodal face-to-face interaction along the boundaries
of the speaking turn. With the F2F-CL model, we propose to perform contrastive
learning between the factorization nodes of different speaking turns within the
same video. We evaluate F2F-CL experimentally on the challenging Social-IQ
dataset and show state-of-the-art results.
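
To make the contrastive step concrete, here is a minimal sketch of an InfoNCE-style loss over turn-level embeddings. The abstract does not state the loss function or how positive and negative pairs are chosen, so everything here is an assumption for illustration: the InfoNCE form, the cosine similarity, the temperature value, and the pairing (a second view of the same turn's factorization node as the positive, nodes from other turns in the same video as negatives). The function names are hypothetical and are not taken from the F2F-CL code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss for one anchor embedding.

    anchor    -- embedding of one speaking turn's factorization node
    positive  -- embedding assumed to match the anchor (hypothetical pairing)
    negatives -- embeddings of nodes from other turns in the same video
    """
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    # Numerically stable log-sum-exp over all candidates.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(z - peak) for z in logits))
    # Negative log-probability assigned to the positive candidate.
    return log_denominator - logits[0]


# Toy usage with 2-d embeddings (values are made up).
turn_a = [1.0, 0.0]
turn_a_view = [0.9, 0.1]
other_turns = [[0.0, 1.0], [-1.0, 0.2]]
loss = info_nce(turn_a, turn_a_view, other_turns)
```

The loss is small when the anchor is closer to its positive than to any negative, and grows when a node from another turn is the nearest neighbor.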

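This paper proposes Face-to-Face Contrastive Learning (F2F-CL), a graph neural network that models the face-to-face conversational dynamics of human social interaction, and reports state-of-the-art results on the Social-IQ dataset.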

Face-to-Face Contrastive Learning for Social Intelligence Question-Answering

Human emotions unfold over time, and more affective computing research has to
prioritize capturing this crucial component of real-world affect. Modeling
dynamic emotional stimuli requires solving the twin challenges of time-series
modeling and of collecting high-quality time-series datasets. We begin by
assessing the state-of-the-art in time-series emotion recognition, and we
review contemporary time-series approaches in affective computing, including
discriminative and generative models. We then introduce the first version of
the Stanford Emotional Narratives Dataset (SENDv1): a set of rich, multimodal
videos of self-paced, unscripted emotional narratives, annotated for emotional
valence over time. The complex narratives and naturalistic expressions in this
dataset provide a challenging test for contemporary time-series emotion
recognition models. We demonstrate several baseline and state-of-the-art
modeling approaches on the SEND, including a Long Short-Term Memory model and a
multimodal Variational Recurrent Neural Network, which perform comparably to
the human benchmark. We end by discussing the implications for future research
in time-series affective computing.
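
As an illustration of how a predicted valence trace might be scored against annotated ratings, the sketch below computes the concordance correlation coefficient (CCC). The abstract does not name the evaluation metric, so the choice of CCC is an assumption; it is a common choice for continuous emotion ratings because it penalizes both poor correlation and a shift in mean or scale. The function name and the sample values are hypothetical.

```python
def concordance_cc(predicted, target):
    """Concordance correlation coefficient between two equal-length series.

    Returns a value in [-1, 1]; 1 means the series agree exactly.
    Undefined (division by zero) if both series are constant and equal.
    """
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_t = sum(target) / n
    var_p = sum((p - mean_p) ** 2 for p in predicted) / n
    var_t = sum((t - mean_t) ** 2 for t in target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(predicted, target)) / n
    return 2 * cov / (var_p + var_t + (mean_p - mean_t) ** 2)


# Toy valence traces sampled at the same time steps (values are made up).
human_ratings = [0.1, 0.4, 0.3, 0.8]
model_output = [0.2, 0.5, 0.2, 0.7]
score = concordance_cc(model_output, human_ratings)
```

Unlike Pearson correlation, this score drops when the model tracks the shape of the ratings but is offset from them by a constant.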

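This paper addresses the modeling of dynamic emotional stimuli, which requires both time-series modeling and high-quality time-series datasets, and introduces the first version of the Stanford Emotional Narratives Dataset (SENDv1). The dataset consists of rich, multimodal videos of self-paced, unscripted emotional narratives annotated for emotional valence over time. It poses a challenging test for contemporary time-series emotion recognition models, and several baseline and state-of-the-art models perform comparably to the human benchmark on it.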

Modeling emotion in complex stories: the Stanford Emotional Narratives Dataset

We propose a technique that tackles action detection in multimodal videos
under a realistic and challenging condition in which only limited training data
and partially observed modalities are available. Common methods in transfer
learning do not take advantage of the extra modalities potentially available in
the source domain. On the other hand, previous work on multimodal learning only
focuses on a single domain or task and does not handle the modality discrepancy
between training and testing. In this work, we propose a method termed graph
distillation that incorporates rich privileged information from a large-scale
multimodal dataset in the source domain, and improves the learning in the
target domain where training data and modalities are scarce. We evaluate our
approach on action classification and detection tasks in multimodal videos, and
show that our model outperforms the state-of-the-art by a large margin on the
NTU RGB+D and PKU-MMD benchmarks. The code is released at
this http URL
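
The sketch below shows one way distillation across a graph of modalities could be written down. It is not the authors' formulation: the abstract does not define the graph, the loss, or the modalities used, so the edge-weighted sum of KL divergences, the temperature, the function names and the modality names ("rgb", "depth", "skeleton") are all assumptions made for illustration. Each directed edge (teacher, student) carries a weight that says how strongly the student modality should imitate the teacher's softened predictions.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with q > 0 everywhere."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def graph_distillation_loss(logits_by_modality, edge_weights, student,
                            temperature=2.0):
    """Weighted imitation loss for one student modality.

    logits_by_modality -- dict: modality name -> class logits for one sample
    edge_weights       -- dict: (teacher, student) -> non-negative weight
    student            -- the modality whose network is being trained
    """
    student_probs = softmax(logits_by_modality[student], temperature)
    loss = 0.0
    for (teacher, target), weight in edge_weights.items():
        if target != student:
            continue  # only edges pointing at the student contribute
        teacher_probs = softmax(logits_by_modality[teacher], temperature)
        loss += weight * kl_divergence(teacher_probs, student_probs)
    return loss


# Toy usage: three hypothetical modalities, three action classes.
logits = {
    "rgb": [2.0, 0.5, 0.1],
    "depth": [2.0, 0.5, 0.1],
    "skeleton": [0.1, 0.5, 2.0],
}
edges = {("depth", "rgb"): 0.7, ("skeleton", "rgb"): 0.3, ("rgb", "depth"): 1.0}
rgb_loss = graph_distillation_loss(logits, edges, student="rgb")
```

In a transfer setting like the one described, a term of this kind would be added to the ordinary classification loss, so that a network for the modality available at test time can benefit from modalities seen only during training.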

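This work proposes graph distillation, a method for action detection in multimodal videos when only limited training data and partially observed modalities are available. It uses rich privileged information from a large-scale multimodal dataset in the source domain to improve learning in the target domain, addressing the modality discrepancy between training and testing, and it outperforms the state of the art by a large margin on the NTU RGB+D and PKU-MMD benchmarks.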