Current dialogue systems focus more on textual and speech context knowledge
and are usually based on two speakers. Some recent work has investigated static
image-based dialogue. However, several real-world human interactions also
involve dynamic visual context (similar to videos) as well as dialogue
exchanges among multiple speakers. To move closer towards s