Recently, instruction-following audio-language models have received broad
attention for human-audio interaction. However, the absence of benchmarks
capable of evaluating audio-centric interaction capabilities has impeded
advancements in this field. Previous benchmarks primarily focus on assessing
different fundamental tasks, such as Automatic Speech Recognition (ASR), and
lack an assessment of the open-ended generative capabilities centered around
audio. Thus, it is challenging to track the progression in the Large
Audio-Language Models (LALMs) domain and to provide guidance for future
improvement. In this paper, we introduce AIR-Bench (\textbf{A}udio
\textbf{I}nst\textbf{R}uction \textbf{Bench}mark), the first benchmark designed
to evaluate the ability of LALMs to understand various types of audio signals
(including human speech, natural sounds, and music), and furthermore, to
interact with humans in the textual format. AIR-Bench encompasses two
dimensions: \textit{foundation} and \textit{chat} benchmarks. The former
consists of 19 tasks with approximately 19k single-choice questions, intended
to inspect the basic single-task ability of LALMs. The latter contains 2k
instances of open-ended question-and-answer data, directly assessing the
comprehension of the model on complex audio and its capacity to follow
instructions. Both benchmarks require the model to generate hypotheses
directly. We design a unified framework that leverages advanced language
models, such as GPT-4, to evaluate the scores of generated hypotheses given the
meta-information of the audio. Experimental results demonstrate a high level of
consistency between GPT-4-based evaluation and human evaluation. By revealing
the limitations of existing LALMs through evaluation results, AIR-Bench can
provide insights into the direction of future research.
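
The GPT-4-based scoring described above can be pictured roughly as follows. This is only a minimal sketch, not the authors' framework: the prompt wording, the 1-to-10 scale, the `Score: <n>` reply format, and the names `build_judge_prompt`, `parse_score`, and `evaluate` are all assumptions made for illustration. The judge is passed in as a plain callable so that no particular API is implied.

```python
import re
from typing import Callable, Optional


def build_judge_prompt(meta_info: str, question: str, hypothesis: str) -> str:
    """Assemble a text-only prompt: the judge sees audio metadata, not the audio."""
    return (
        "You are grading a model's answer to a question about an audio clip.\n"
        f"Audio meta-information: {meta_info}\n"
        f"Question: {question}\n"
        f"Model answer: {hypothesis}\n"
        # Assumed scale and reply format, chosen only for this sketch.
        "Rate the answer from 1 to 10 and reply as 'Score: <n>'."
    )


def parse_score(reply: str) -> Optional[int]:
    """Extract an integer score in [1, 10]; return None if the reply is malformed."""
    match = re.search(r"Score:\s*(\d+)", reply)
    if match is None:
        return None
    score = int(match.group(1))
    return score if 1 <= score <= 10 else None


def evaluate(
    judge: Callable[[str], str],
    meta_info: str,
    question: str,
    hypothesis: str,
) -> Optional[int]:
    """Score one generated hypothesis with a text-only judge model."""
    prompt = build_judge_prompt(meta_info, question, hypothesis)
    return parse_score(judge(prompt))
```

In practice `judge` would wrap a call to a language model such as GPT-4. A stub such as `lambda prompt: "Score: 7"` is enough to exercise the parsing logic.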

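Recently, instruction-following audio-language models proposed for human-audio interaction have attracted broad attention. However, progress in this field has been held back by the lack of benchmarks that evaluate audio-centric interaction capabilities. This paper introduces AIR-Bench (Audio InstRuction Benchmark), the first benchmark designed to evaluate the ability of audio-language models to understand various types of audio signals (including human speech, natural sounds, and music) and to interact with humans in textual form. AIR-Bench comprises two dimensions: foundation and chat benchmarks. Experiments show a high level of consistency between the scores that GPT-4 assigns to generated hypotheses and human evaluation. By revealing the limitations of existing LALMs through evaluation results, AIR-Bench can offer insights for future research directions.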

AIR-Bench: A Generative Comprehension Benchmark for Evaluating Large-Scale Audio-Language Models

AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension

Recently, instruction-following audio-language models have received broad
attention for audio interaction with humans. However, the absence of
pre-trained audio models capable of handling diverse audio types and tasks has
hindered progress in this field. Consequently, most existing works have only
been able to support a limited range of interaction capabilities. In this
paper, we develop the Qwen-Audio model and address this limitation by scaling
up audio-language pre-training to cover over 30 tasks and various audio types,
such as human speech, natural sounds, music, and songs, to facilitate universal
audio understanding abilities. However, directly co-training all tasks and
datasets can lead to interference issues, as the textual labels associated with
different datasets exhibit considerable variations due to differences in task
focus, language, granularity of annotation, and text structure. To overcome the
one-to-many interference, we carefully design a multi-task training framework
that conditions the decoder on a sequence of hierarchical tags, encouraging
knowledge sharing through shared tags and avoiding interference through
specified tags. Remarkably, Qwen-Audio achieves impressive
performance across diverse benchmark tasks without requiring any task-specific
fine-tuning, surpassing its counterparts. Building upon the capabilities of
Qwen-Audio, we further develop Qwen-Audio-Chat, which accepts various audio
and text inputs, enabling multi-turn dialogues and supporting various
audio-centric scenarios.
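
One way to picture conditioning the decoder on hierarchical tags is as a prefix of special tokens that moves from general to specific. The sketch below is purely illustrative: the tag strings, their ordering, and the helpers `build_tag_prefix` and `shared_prefix_len` are hypothetical and are not taken from the Qwen-Audio implementation. It shows only how two tasks can share leading tags (encouraging knowledge sharing) while differing in later, task-specific tags (avoiding interference).

```python
from typing import List, Optional


def build_tag_prefix(
    task: str,
    audio_lang: Optional[str] = None,
    text_lang: Optional[str] = None,
    timestamps: bool = False,
) -> List[str]:
    """Build a decoder prefix of hypothetical tags, ordered general to specific."""
    tags = ["<|start|>"]  # shared by every task
    if audio_lang is not None:
        tags.append(f"<|audio_lang:{audio_lang}|>")  # shared across tasks on the same language
    tags.append(f"<|task:{task}|>")  # task-specific tag
    if text_lang is not None:
        tags.append(f"<|text_lang:{text_lang}|>")  # output-language tag
    tags.append("<|timestamps|>" if timestamps else "<|notimestamps|>")
    return tags


def shared_prefix_len(a: List[str], b: List[str]) -> int:
    """Count how many leading tags two prefixes have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


if __name__ == "__main__":
    asr = build_tag_prefix("transcribe", audio_lang="en", text_lang="en")
    s2t = build_tag_prefix("translate", audio_lang="en", text_lang="zh")
    print(asr)
    print(s2t)
    print(shared_prefix_len(asr, s2t))
```

Under these assumed tags, the transcription and translation prefixes share their first two tags and diverge at the task tag, which is the intended separation between shared and specified conditioning.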

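Recently, instruction-following audio-language models have received broad attention for their ability to support audio interaction with humans. However, the lack of pre-trained audio models capable of handling diverse audio types and tasks has hindered progress in this field. This paper develops the Qwen-Audio model by scaling up audio-language pre-training to cover over 30 tasks and various audio types (such as human speech, natural sounds, music, and songs), in order to facilitate universal audio understanding. However, directly co-training all tasks and datasets can cause interference, because the textual labels of different datasets vary considerably owing to differences in task focus, language, annotation granularity, and text structure. To overcome this one-to-many interference, we design a multi-task training framework that conditions the decoder on a sequence of hierarchical tags, encouraging knowledge sharing through shared tags and avoiding interference through specified tags. Notably, Qwen-Audio achieves impressive performance across diverse benchmark tasks without any task-specific fine-tuning, surpassing its counterparts. Building on the capabilities of Qwen-Audio, we further develop Qwen-Audio-Chat, which accepts various audio and text inputs, enables multi-turn dialogue, and supports various audio-centric scenarios.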

Qwen-Audio: Advancing Universal Audio Understanding through a Unified Large-Scale Audio-Language Model

Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models

A fundamental characteristic of audio is its compositional nature.
Audio-language models (ALMs) trained using a contrastive approach (e.g., CLAP)
that learns a shared representation between audio and language modalities have
improved performance in many downstream applications, including zero-shot audio
classification, audio retrieval, etc. However, the ability of these models to
effectively perform compositional reasoning remains largely unexplored and
necessitates additional research. In this paper, we propose CompA, a collection
of two expert-annotated benchmarks with a majority of real-world audio samples,
to evaluate compositional reasoning in ALMs. Our proposed CompA-order evaluates
how well an ALM understands the order or occurrence of acoustic events in
audio, and CompA-attribute evaluates attribute binding of acoustic events. An
instance from either benchmark consists of two audio-caption pairs, where both
audios have the same acoustic events but with different compositions. An ALM is
evaluated on how well it matches the right audio to the right caption. Using
this benchmark, we first show that current ALMs perform only marginally better
than random chance, thereby struggling with compositional reasoning. Next, we
propose CompA-CLAP, where we fine-tune CLAP using a novel learning method to
improve its compositional reasoning abilities. To train CompA-CLAP, we first
propose improvements to contrastive training with composition-aware hard
negatives, allowing for more focused training. Next, we propose a novel modular
contrastive loss that helps the model learn fine-grained compositional
understanding and overcomes the acute scarcity of openly available
compositional audios. CompA-CLAP significantly improves over all our baseline
models on the CompA benchmark, indicating its superior compositional reasoning
capabilities.
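
The paired evaluation described above, where each instance has two audio-caption pairs and the model must match the right audio to the right caption, can be sketched as a check on a 2x2 similarity matrix. The abstract does not spell out the exact scoring rule, so the criteria below (each audio must prefer its own caption, and each caption its own audio) are an assumption, and the name `pair_scores` is hypothetical. The similarities would come from an ALM such as CLAP; here they are passed in as plain numbers.

```python
from typing import Dict, Sequence


def pair_scores(sim: Sequence[Sequence[float]]) -> Dict[str, bool]:
    """Score one instance; sim[i][j] is the similarity of audio i to caption j.

    Pairs (0, 0) and (1, 1) are the correct matches.
    """
    # Each audio must be closer to its own caption than to the other caption.
    audio_to_text = sim[0][0] > sim[0][1] and sim[1][1] > sim[1][0]
    # Each caption must be closer to its own audio than to the other audio.
    text_to_audio = sim[0][0] > sim[1][0] and sim[1][1] > sim[0][1]
    return {
        "audio_to_text": audio_to_text,
        "text_to_audio": text_to_audio,
        "both": audio_to_text and text_to_audio,
    }


def accuracy(instances: Sequence[Sequence[Sequence[float]]], key: str = "both") -> float:
    """Fraction of instances that satisfy the chosen criterion."""
    if not instances:
        return 0.0
    return sum(pair_scores(s)[key] for s in instances) / len(instances)
```

Because both audios in an instance contain the same acoustic events, a model that ignores composition tends to produce near-identical similarities across each row and column, which is why near-chance performance is interpreted as weak compositional reasoning.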

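Building on ALMs trained with CLAP-style methods, this work proposes CompA to evaluate the compositional reasoning abilities of ALMs. It finds that existing ALMs perform only marginally better than random chance at compositional reasoning, whereas CompA-CLAP, which improves the training method and introduces a modular contrastive loss, significantly improves compositional reasoning.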