Many attempts have been made to build multimodal dialog systems that can answer questions about given audio-visual information; the representative task for such systems is audio-visual scene-aware dialog (AVSD). Most conventional AVSD models adopt the →