Abstractive Speech Summarization (SSum) aims to generate human-like text
summaries from spoken content. It encounters difficulties in handling long
speech input and capturing the intricate cross-modal mapping between long
speech inputs and short text summaries. Research on large language models
(LLMs) and multimodal information fusion has provided new insights for
addressing these challenges. In this paper, we propose an end-to-end SSum model
that utilizes Q-Former as a connector for the audio-text modality and employs
LLMs to generate text summaries directly from speech features. We adopt a
multi-stage training approach that includes LLM-based ASR and Text
Summarization (TSum) tasks as auxiliary tasks. ASR tasks are used to align
feature spaces and enhance the LLM's ability to handle longer speech. Then, we
utilize a curriculum learning strategy to facilitate the model's transition
from TSum to SSum. Finally, our model achieves competitive performance on the
How-2 dataset.
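
The connector idea can be illustrated with a minimal sketch. It assumes a heavily simplified Q-Former: a fixed set of query vectors cross-attends over the speech frames, so the LLM always receives the same number of tokens regardless of the input length. The sizes `DIM` and `NUM_QUERIES`, the random stand-ins for learned queries, and the function names are hypothetical, and the self-attention, feed-forward layers, and learned projections of a real Q-Former are omitted.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_attend(queries, features):
    """Single-head cross-attention without projections (a simplification).

    Each query attends over all speech frames and returns one output
    vector, so the output length equals the number of queries.
    """
    dim = len(queries[0])
    outputs = []
    for q in queries:
        weights = softmax([dot(q, f) / math.sqrt(dim) for f in features])
        # Weighted average of the frames for this query.
        outputs.append(
            [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
        )
    return outputs


random.seed(0)
DIM, NUM_QUERIES = 8, 4  # hypothetical sizes, not taken from the paper

# Random vectors stand in for the learned query embeddings.
queries = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_QUERIES)]

# Two inputs of very different length (50 vs. 500 frames).
short_clip = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(50)]
long_clip = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(500)]

short_tokens = cross_attend(queries, short_clip)
long_tokens = cross_attend(queries, long_clip)
print(len(short_tokens), len(long_tokens))  # 4 4
```

Both inputs are compressed to four vectors, which is the property that makes this kind of connector attractive for long speech: the LLM's input length is set by the number of queries, not by the audio duration.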

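This paper proposes an end-to-end SSum model that uses Q-Former as the connector between the audio and text modalities and an LLM to generate text summaries directly from speech features. A multi-stage training approach improves the model's ability to handle long speech, and the model achieves competitive performance on the How-2 dataset.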

An End-to-End Speech Summarization Using Large Language Model

The impressive capability and versatility of large language models (LLMs)
have aroused increasing attention in automatic speech recognition (ASR), with
several pioneering studies attempting to build integrated ASR models by
connecting a speech encoder with an LLM. This paper presents a comparative
study of three commonly used structures as connectors, including fully
connected layers, multi-head cross-attention, and Q-Former. Speech encoders
from the Whisper model series as well as LLMs from the Vicuna model series with
different model sizes were studied. Experiments were performed on the commonly
used LibriSpeech, Common Voice, and GigaSpeech datasets, where the LLMs with
Q-Formers demonstrated consistent and considerable word error rate (WER)
reductions over LLMs with other connector structures. Q-Former-based LLMs can
generalise well to out-of-domain datasets, where 12% relative WER reductions
over the Whisper baseline ASR model were achieved on the Eval2000 test set
without using any in-domain training data from Switchboard. Moreover, a novel
segment-level Q-Former is proposed to enable LLMs to recognise speech segments
with a duration exceeding the limitation of the encoders, which results in 17%
relative WER reductions over other connector structures on 90-second-long
speech data.
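
The figures quoted above are relative WER reductions. As a minimal sketch of that arithmetic, assuming the standard definition of WER as word-level edit distance divided by the reference length, the following computes a WER and the relative reduction between two systems. The function names, the example sentences, and the two WER values are hypothetical and are not taken from the paper.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Word-level Levenshtein distance, keeping one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer, new_wer):
    """Fraction of the baseline WER that the new system removes."""
    return (baseline_wer - new_wer) / baseline_wer


# One substitution (sat -> sit) and one deletion (the) over 6 reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"{wer:.3f}")  # 0.333

# Hypothetical WERs: a drop from 10.0% to 8.8% is a 12% relative reduction.
print(f"{relative_reduction(0.10, 0.088):.0%}")  # 12%
```

A relative reduction is smaller in absolute terms than it may sound: in the hypothetical case above, a 12% relative reduction corresponds to only 1.2 absolute WER points.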

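This paper presents a comparative study of three commonly used connector structures (fully connected layers, multi-head cross-attention, and Q-Former), with experiments on speech encoders from the Whisper series and LLMs from the Vicuna series. The results show that LLMs with Q-Formers achieve consistent and considerable WER reductions over the other connector structures on the LibriSpeech, Common Voice, and GigaSpeech datasets. In addition, a novel segment-level Q-Former is proposed that enables LLMs to recognise speech segments longer than the encoder's duration limit, yielding a 17% relative WER reduction over other connector structures on 90-second-long speech data.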

Connecting Speech Encoder and Large Language Model for ASR

Our winning entry for the CVPR 2023 Generic Event Boundary Captioning (GEBC)
competition is detailed in this paper. Unlike conventional video captioning
tasks, GEBC demands that the captioning model possess an understanding of
immediate changes in status around the designated video boundary, making it a
difficult task. This paper proposes an effective model, LLMVA-GEBC (Large
Language Model with Video Adapter for Generic Event Boundary Captioning): (1)
We utilize a pretrained LLM for generating human-like captions with high
quality. (2) To adapt the model to the GEBC task, we take the video Q-former as
an adapter and train it with the frozen visual feature extractors and LLM. Our
proposed method achieved a score of 76.14 on the test set and won first place
in the challenge. Our code is available at
this https URL.

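This paper details our winning entry in the CVPR 2023 Generic Event Boundary Captioning (GEBC) competition and proposes the LLMVA-GEBC model. The model uses a pretrained LLM to generate high-quality, human-like captions and adapts to the GEBC task through a video Q-former adapter, which is trained while the visual feature extractors and the LLM are kept frozen. It scored 76.14 on the test set and won first place.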