The rapid progress in Multimodal Large Language Models (MLLMs) has
significantly advanced their ability to process and understand complex visual
and textual information. However, integrating multiple images and extensive
textual context remains challenging because these models have limited capacity
to handle long input sequences efficiently. In this paper, we introduce SEEKER,
a multimodal large language model designed to tackle this issue. SEEKER aims to
encode long text compactly by compressing the text sequence into the visual
pixel space via images, enabling the model to handle long text efficiently
within a fixed token-length budget.
Our experiments on six long-context multimodal tasks demonstrate that SEEKER
conveys the same amount of textual information with fewer image tokens than
the OCR-based approach, and that it is more efficient at understanding
long-form multimodal input and generating long-form textual output,
outperforming all existing proprietary and open-source MLLMs by large margins.

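By compressing text sequences into images in the visual pixel space, SEEKER aims to optimize the compact encoding of long text so that long text can be handled efficiently within a fixed token-length budget, and it outperforms all existing proprietary and open-source MLLMs in understanding long-form multimodal input and generating long-form textual output.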
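
To make the fixed token-length budget concrete, the following is a minimal back-of-the-envelope sketch of how many vision tokens a rendered text would consume. It is not SEEKER's implementation: the function names and every layout value (image side, patch size, glyph width, line height, and the number of patches merged into one token) are hypothetical assumptions chosen only for illustration, and whether rendering actually saves tokens depends on those values.

```python
import math


def rendered_text_tokens(num_chars, image_side_px=448, patch_px=14,
                         char_w_px=7, line_h_px=14, patches_per_token=1):
    """Estimate vision tokens needed to render `num_chars` as square images.

    All defaults are illustrative assumptions, not values from the paper.
    """
    # Characters that fit on one image with a fixed-width font.
    chars_per_image = (image_side_px // char_w_px) * (image_side_px // line_h_px)
    # Number of images needed to hold the whole text.
    num_images = math.ceil(num_chars / chars_per_image)
    # A ViT-style encoder splits each image into a grid of square patches.
    patches_per_image = (image_side_px // patch_px) ** 2
    # Optionally merge several patches into one token (hypothetical pooling).
    tokens_per_image = math.ceil(patches_per_image / patches_per_token)
    return num_images * tokens_per_image


def fits_budget(num_chars, token_budget, **layout):
    """Check whether the rendered text stays within a fixed token budget."""
    return rendered_text_tokens(num_chars, **layout) <= token_budget
```

Under these assumed defaults one image holds 2048 characters and costs 1024 tokens, so `fits_budget(2048, 1024)` holds while `fits_budget(2049, 1024)` does not; smaller glyphs or more aggressive patch merging shift that trade-off.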