Large language models (LLMs) need to serve everyone, including a global
majority of non-English speakers. However, most LLMs today, and open LLMs in
particular, are often intended for use in just English (e.g. Llama2, Mistral)
or a small handful of high-resource languages (e.g. Mixtral, Qwen). Recent
research shows that, despite limits in their intended use, people prompt LLMs
in many different languages. Therefore, in this paper, we investigate the basic
multilingual capabilities of state-of-the-art open LLMs beyond their intended
use. For this purpose, we introduce MultiQ, a new silver standard benchmark for
basic open-ended question answering with 27.4k test questions across a
typologically diverse set of 137 languages. With MultiQ, we evaluate language
fidelity, i.e. whether models respond in the prompted language, and question
answering accuracy. All LLMs we test respond faithfully and/or accurately for
at least some languages beyond their intended use. Most models are more
accurate when they respond faithfully. However, differences across models are
large, and there is a long tail of languages where models are neither accurate
nor faithful. We explore differences in tokenization as a potential explanation
for our findings, identifying possible correlations that warrant further
investigation.
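
As a rough illustration of the two quantities evaluated with MultiQ, the sketch below computes per-language fidelity and accuracy, plus accuracy split by whether a response was faithful. The `Record` fields, the language codes, and both helper functions are hypothetical and are not taken from the MultiQ release; detecting the response language and judging correctness are assumed to happen upstream.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Record:
    prompt_lang: str    # language of the question (hypothetical language code)
    response_lang: str  # language detected in the model's answer (assumed given)
    correct: bool       # whether the answer was judged correct (assumed given)


def per_language_scores(records):
    """Return {prompt_lang: (fidelity, accuracy)}."""
    totals = defaultdict(lambda: [0, 0, 0])  # [n, faithful, correct]
    for r in records:
        t = totals[r.prompt_lang]
        t[0] += 1
        t[1] += r.response_lang == r.prompt_lang  # faithful response
        t[2] += r.correct
    return {lang: (f / n, c / n) for lang, (n, f, c) in totals.items()}


def accuracy_by_fidelity(records):
    """Return accuracy for faithful (True) and unfaithful (False) responses."""
    split = {True: [0, 0], False: [0, 0]}  # faithful? -> [n, correct]
    for r in records:
        s = split[r.response_lang == r.prompt_lang]
        s[0] += 1
        s[1] += r.correct
    return {k: (c / n if n else None) for k, (n, c) in split.items()}
```

Comparing the two values returned by `accuracy_by_fidelity` is one simple way to check, on one's own model outputs, whether a model is more accurate when it responds in the prompted language.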

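Research shows that, although most open LLMs today are intended mainly for English or a small handful of high-resource languages, people prompt these models in many different languages. This paper introduces the MultiQ benchmark, with 27.4k basic open-ended question answering test questions across many languages, to investigate the multilingual capabilities of existing open LLMs beyond their intended use. For some languages, the models answer both faithfully and accurately, and most models are more accurate when they respond in the prompted language, but for other languages models are neither accurate nor faithful. The paper also explores tokenization as a potential explanation for these findings, identifying possible correlations that warrant further investigation.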

Evaluating the Elementary Multilingual Capabilities of Large Language Models with MultiQ

Multimodal Large Language Models (MLLMs) have recently shown remarkable
perceptual capability in answering visual questions; however, little is known
about the limits of their perception. In particular, while prior works have
provided anecdotal evidence of MLLMs' sensitivity to object size, this
phenomenon and its underlying causes have not been explored comprehensively. In
this work, we quantitatively study the perception of small visual objects in
several state-of-the-art MLLMs and reveal a pervasive limitation in answering
questions about small objects in images. Next, we identify four independent
factors that can contribute to this limitation -- object quality, size,
distractors, and location -- and conduct controlled intervention studies to
measure the effect of each factor on MLLMs' perception. In particular, we find
that lower object quality and smaller object size can both independently reduce
MLLMs' ability to answer visual questions. More surprisingly, we find that the
location of the object in the image and the presence of visual distractors can
also significantly reduce MLLMs' question answering accuracy. Our study
provides a better understanding of the perceptual limitations of MLLMs and
contributes new evaluation protocols for analyzing the perception of future
MLLMs. To facilitate further investigations, we release our code and data.

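This work studies the limits of multimodal large language models in perceiving small visual objects and finds that object quality, size, distractors, and location can each significantly reduce the models' accuracy in answering visual questions. The study explores the perceptual limitations of multimodal large language models and contributes new evaluation protocols for analyzing the perception of future models.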