Listener head generation focuses on generating non-verbal behaviors (e.g., a smile) of a listener in response to the information delivered by a speaker. A significant challenge in generating such responses is the non-deterministic nature of fine-grained facial expressions during a conversation, which vary with the emotions and attitudes of both the speaker and the listener. To tackle this problem, we propose the Emotional Listener Portrait (ELP), which treats each fine-grained facial motion as a composition of several discrete motion-codewords and explicitly models the probability distribution of the motions under different emotions in conversation. Benefiting from this ``explicit'' and ``discrete'' design, our ELP model can not only automatically generate natural and diverse responses to a given speaker by sampling from the learned distribution but also generate controllable responses with a predetermined attitude. Under several quantitative metrics, our ELP exhibits significant improvements over previous methods.

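Listener head generation focuses on generating a listener's non-verbal behaviors (e.g., a smile) based on the information delivered by a speaker. We propose the Emotional Listener Portrait (ELP), which treats each fine-grained facial motion as a composition of several discrete motion-codewords and explicitly models the probability distribution of the motions under different emotions in conversation. By sampling from the learned distribution, our ELP model can not only automatically generate natural and diverse responses but also generate controllable responses with a predetermined attitude. Compared with previous methods, our ELP shows significant improvements on several quantitative metrics.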
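To make the ``explicit'' and ``discrete'' idea concrete, the following is a minimal toy sketch, not the ELP implementation. It assumes that a motion is composed by summing one codeword drawn from each of a few small codebooks, and that each emotion has its own categorical distribution over codeword indices. The names `make_codebooks`, `sample_motion`, and `EMOTION_PROBS`, the sizes, the probabilities, and the summation rule are all hypothetical; the speaker conditioning described above is left out for brevity.

```python
import random

# Hypothetical toy sizes; none of these values come from the paper.
NUM_CODEWORDS = 4   # entries per codebook
NUM_SLOTS = 3       # codewords composed into one motion
MOTION_DIM = 2      # dimensionality of a motion vector


def make_codebooks(rng):
    """Build one codebook per slot (random stand-ins for learned codewords)."""
    return [
        [[rng.uniform(-1.0, 1.0) for _ in range(MOTION_DIM)]
         for _ in range(NUM_CODEWORDS)]
        for _ in range(NUM_SLOTS)
    ]


# Explicit categorical distributions over codeword indices, one per emotion.
# The numbers are made up for illustration only.
EMOTION_PROBS = {
    "positive": [0.7, 0.1, 0.1, 0.1],
    "neutral": [0.25, 0.25, 0.25, 0.25],
    "negative": [0.1, 0.1, 0.1, 0.7],
}


def sample_motion(codebooks, emotion, rng):
    """Sample one codeword index per slot and sum the chosen codewords.

    Summation is an assumed composition rule for this sketch.
    """
    probs = EMOTION_PROBS[emotion]
    indices = [
        rng.choices(range(NUM_CODEWORDS), weights=probs, k=1)[0]
        for _ in codebooks
    ]
    motion = [0.0] * MOTION_DIM
    for book, idx in zip(codebooks, indices):
        for d in range(MOTION_DIM):
            motion[d] += book[idx][d]
    return indices, motion
```

Calling `sample_motion(books, "positive", rng)` repeatedly yields different index combinations, which mirrors the diversity obtained by sampling, while switching the `emotion` key changes the distribution being sampled from, which mirrors controllable generation under a chosen attitude.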

Emotional Listener Portrait: Realistic Listener Motion Simulation in Conversation