We introduce MIA-Bench, a new benchmark designed to evaluate multimodal large
language models (MLLMs) on their ability to strictly adhere to complex
instructions. Our benchmark comprises a diverse set of 400 image-prompt pairs,
each crafted to challenge the models' compliance with layered instructions in
generating accurate responses that satisfy specific requested patterns.
Evaluation results from a wide array of state-of-the-art MLLMs reveal
significant variations in performance, highlighting areas for improvement in
instruction fidelity. Additionally, we create extra training data and explore
supervised fine-tuning to enhance the models' ability to strictly follow
instructions without compromising performance on other tasks. We hope this
benchmark not only serves as a tool for measuring MLLM adherence to
instructions, but also guides future developments in MLLM training methods.
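To make "layered instructions" concrete, here is a minimal sketch of how adherence to several stacked sub-instructions might be scored. It is an illustration only and not MIA-Bench's actual scoring procedure, which the abstract does not describe. The names `SubInstruction`, `count_sentences`, and `adherence_score`, the integer weights, and the rule-based checks are all hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SubInstruction:
    """One layer of a compound prompt (hypothetical structure)."""
    description: str
    weight: int
    check: Callable[[str], bool]


def count_sentences(text: str) -> int:
    # Naive splitter: counts non-empty chunks separated by ., ! or ?
    return len([s for s in re.split(r"[.!?]+", text) if s.strip()])


def adherence_score(response: str, subs: List[SubInstruction]) -> float:
    """Weighted fraction of sub-instructions that the response satisfies."""
    total = sum(s.weight for s in subs)
    earned = sum(s.weight for s in subs if s.check(response))
    return earned / total if total else 0.0


# Assumed example prompt: "Describe the image in exactly two sentences,
# mention the word 'red', and do not use commas."
layered_prompt = [
    SubInstruction("exactly two sentences", 4,
                   lambda r: count_sentences(r) == 2),
    SubInstruction("mentions the word 'red'", 3,
                   lambda r: re.search(r"\bred\b", r, re.IGNORECASE) is not None),
    SubInstruction("contains no commas", 3,
                   lambda r: "," not in r),
]

full = adherence_score(
    "A red kite flies over the beach. Children watch it from the sand.",
    layered_prompt,
)
partial = adherence_score("A red kite, high up.", layered_prompt)
```

Rule-based checks like these only cover constraints that can be verified mechanically; open-ended requirements (tone, grounding in the image) would need a different kind of judge.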

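We introduce MIA-Bench, a new benchmark designed to evaluate multimodal large language models (MLLMs) on their ability to strictly follow complex instructions. Evaluating a range of state-of-the-art MLLMs, we find significant differences in performance, which highlights room for improvement in instruction fidelity. In addition, we create extra training data and explore supervised fine-tuning to improve the models' ability to strictly follow instructions without sacrificing performance on other tasks. We hope this benchmark can be used not only to measure how closely MLLMs adhere to instructions, but also to guide the future development of MLLM training methods.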

MIA-Bench: Towards Better Instruction Following Evaluation of Multimodal LLMs

Large Language Models (LLMs) have demonstrated a powerful ability for text
generation. However, achieving optimal results with a given prompt or
instruction can be challenging, especially for models at the billion-parameter scale.
Additionally, undesired behaviors such as toxicity or hallucinations can
manifest. While much larger models (e.g., ChatGPT) may demonstrate strength in
mitigating these issues, there is still no guarantee of complete prevention. In
this work, we propose formalizing text generation as a future-constrained
generation problem to minimize undesirable behaviors and enforce faithfulness
to instructions. The estimation of future constraint satisfaction, accomplished
using LLMs, guides the text generation process. Our extensive experiments
demonstrate the effectiveness of the proposed approach across three distinct
text generation tasks: keyword-constrained generation (Lin et al., 2020),
toxicity reduction (Gehman et al., 2020), and factual correctness in
question-answering (Gao et al., 2023).
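The idea of letting an estimate of future constraint satisfaction guide decoding can be sketched as a rescoring step over candidate next tokens. This is a toy illustration under stated assumptions, not the paper's implementation: the combination rule (log-probability plus a weighted log of the satisfaction estimate), the weight `lam`, and the helper names `rescore_step` and `keyword_estimator` are hypothetical, and the estimator here is a hard-coded stand-in for what the abstract says is computed with an LLM.

```python
import math
from typing import Callable, Dict, List


def rescore_step(
    prefix: List[str],
    candidates: Dict[str, float],
    future_satisfaction: Callable[[List[str]], float],
    lam: float = 1.0,
) -> str:
    """Pick the next token by combining the LM log-probability with an
    estimate of how likely the constraint is to be satisfied eventually.

    candidates maps each token to its log-probability under the LM.
    future_satisfaction returns a probability in (0, 1] for a prefix.
    """
    best_token, best_score = "", -math.inf
    for token, logp in candidates.items():
        p_sat = future_satisfaction(prefix + [token])
        # Clamp to avoid log(0) if the estimator returns zero.
        score = logp + lam * math.log(max(p_sat, 1e-9))
        if score > best_score:
            best_token, best_score = token, score
    return best_token


def keyword_estimator(keyword: str) -> Callable[[List[str]], float]:
    """Toy stand-in estimator: high once the keyword has appeared,
    low otherwise. The 0.9 / 0.1 values are arbitrary."""
    def estimate(tokens: List[str]) -> float:
        return 0.9 if keyword in tokens else 0.1
    return estimate


candidates = {"cat": math.log(0.6), "dog": math.log(0.4)}
guided = rescore_step(["the"], candidates, keyword_estimator("dog"), lam=1.0)
unguided = rescore_step(["the"], candidates, keyword_estimator("dog"), lam=0.0)
```

With `lam=0.0` the step reduces to ordinary greedy selection; raising `lam` trades fluency under the LM for a higher estimated chance of meeting the constraint.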

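By formalizing text generation as a future-constrained generation problem that minimizes undesirable behaviors and enforces faithfulness to instructions, this paper presents a method that uses LLM-based estimates of future constraint satisfaction to guide the text generation process. Extensive experiments on three distinct text generation tasks (keyword-constrained generation, toxicity reduction, and factual correctness in question answering) demonstrate the method's effectiveness.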

Unlocking Anticipatory Text Generation: A Constrained Approach for Faithful Decoding with Large Language Models

Advances in learning and representations have reinvigorated work that
connects language to other modalities. A particularly exciting direction is
Vision-and-Language Navigation (VLN), in which agents interpret natural language
instructions and visual scenes to move through environments and reach goals.
Despite recent progress, current research leaves unclear how much of a role
language understanding plays in this task, especially because dominant
evaluation metrics have focused on goal completion rather than the sequence of
actions corresponding to the instructions. Here, we highlight shortcomings of
current metrics for the Room-to-Room dataset (Anderson et al., 2018b) and
propose a new metric, Coverage weighted by Length Score (CLS). We also show
that the existing paths in the dataset are not ideal for evaluating instruction
following because they are direct-to-goal shortest paths. We join existing
short paths into longer, more challenging extended paths, creating a new dataset,
Room-for-Room (R4R). Using R4R and CLS, we show that agents that receive
rewards for instruction fidelity outperform agents that focus on goal
completion.
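The abstract names CLS but does not spell it out. A minimal sketch follows, based on the commonly cited definition: path coverage PC averages exp(-d(r, P) / d_th) over reference nodes r, the length score LS compares the predicted path length with the expected length PC * PL(R), and CLS = PC * LS. The sketch assumes 2D points with Euclidean distance and a threshold `d_th` of 3.0, both of which are simplifications chosen for illustration; the function names are hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def path_length(path: List[Point]) -> float:
    """Sum of Euclidean distances between consecutive points."""
    return sum(math.dist(a, b) for a, b in zip(path, path[1:]))


def path_coverage(pred: List[Point], ref: List[Point], d_th: float) -> float:
    """Average closeness of each reference node to the predicted path."""
    total = 0.0
    for r in ref:
        nearest = min(math.dist(r, p) for p in pred)
        total += math.exp(-nearest / d_th)
    return total / len(ref)


def cls_score(pred: List[Point], ref: List[Point], d_th: float = 3.0) -> float:
    """Coverage weighted by Length Score for non-empty paths (sketch)."""
    pc = path_coverage(pred, ref, d_th)
    expected_len = pc * path_length(ref)
    pred_len = path_length(pred)
    denom = expected_len + abs(expected_len - pred_len)
    ls = expected_len / denom if denom > 0 else 0.0
    return pc * ls


reference = [(0.0, 0.0), (0.0, 4.0), (4.0, 4.0)]
shortcut = [(0.0, 0.0), (4.0, 4.0)]  # reaches the goal but skips a waypoint

faithful_score = cls_score(reference, reference)
shortcut_score = cls_score(shortcut, reference)
```

In this toy case the shortcut path reaches the same goal as the reference but scores lower, which is the behavior a goal-completion metric would miss.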

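Through an evaluation on the Room-to-Room dataset, the paper proposes a new evaluation metric, CLS, and creates a new dataset, Room-for-Room (R4R), for instruction-following tasks. Comparisons with baseline systems show that agents that prioritize instruction fidelity outperform agents that prioritize goal completion.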