In this paper, we consider a novel and practical case for talking face video
generation. Specifically, we focus on scenarios involving multi-person
interactions, where the talking context, such as the audience or surroundings, is
present. In these situations, the video generation should take the context into
consideration in order to generate video content n