We introduce a multimodal large language model that can process multiple
images, multiple audio clips, and combinations of both within a single
multi-turn session. The model uses the SigLIP encoder for visual inputs and
the Whisper encoder for audio inputs. It is bilingual and understands both
English and Malay. We release two versions: one based on TinyLlama with 1.1B
parameters and one based on Mistral with 7B parameters. By handling multiple
modalities and both languages, the model is a step forward for multimodal
language modeling in the Malaysian context and beyond.
All models are released at this https URL.
