The large vision-language model (LVLM) has enhanced the performance of
various downstream tasks in visual-language understanding. Most existing
approaches encode images and videos into separate feature spaces, which are
then fed as inputs to large language models. However, due to the l