In dyadic speaker-listener interactions, the listener's head reactions and the speaker's head movements together constitute an important form of non-verbal semantic expression. The listening head generation task aims to synthesize responsive listener head videos from the speaker's audio and reference images of the listener. Compared with talking-head generation, it is more challenging because the model must capture correlation cues from both the speaker's audio and visual information. Following the ViCo baseline scheme, we propose a high-performance solution that enhances the hierarchical semantic extraction capability of the audio encoder module and improves the decoder, renderer, and post-processing modules. Our solution ranked first on the official leaderboard for the listening head generation track. This paper is a technical report for the ViCo@2023 Conversational Head Generation Challenge at the ACM Multimedia 2023 conference.

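In two-person conversations, the listener's head reactions and the speaker's head movements together constitute an important form of non-verbal semantic expression. The listening head generation task aims to synthesize responsive listener head videos from the speaker's audio and reference images of the listener. This paper proposes a high-performance solution that enhances the hierarchical semantic extraction capability of the audio encoder module and improves the decoder, renderer, and post-processing modules. Our solution took first place in the ViCo@2023 Conversational Head Generation Challenge at the ACM Multimedia 2023 conference.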

Hierarchical Semantic Perceptual Listener Head Video Generation: A High-Performance Pipeline