We introduce a video framework for modeling the association between verbal and non-verbal communication during dyadic conversation. Given the input speech of a speaker, our approach retrieves a video of a listener whose facial expressions are socially appropriate given the context. The listener can further be conditioned on their own goals, personalities, or backgrounds. We model conversations through a composition of large language models and vision-language models, creating internal representations that are interpretable and controllable. To study multimodal communication, we propose a new video dataset of unscripted conversations covering diverse topics and demographics. Experiments and visualizations show that our approach outputs listeners that are significantly more socially appropriate than baselines. However, many challenges remain, and we release our dataset publicly to spur further progress. See our website for video results, data, and code: https://realtalk.cs.columbia.edu.

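This work introduces a video framework for modeling the association between verbal and non-verbal communication in dyadic conversation. It proposes an approach that models conversations through a composition of large language models and vision-language models, and presents a new video dataset of unscripted conversations. Experiments and visualizations show that the approach outputs listeners that are significantly more socially appropriate.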

Affective Faces for Goal-Driven Dyadic Communication