There has been significant research on developing pretrained transformer
architectures for multimodal-to-text generation tasks. Despite performance
improvements, such models are frequently overparameterized and hence suffer from
hallucination and a large memory footprint, making them challenging to deploy on
edge devices. In this paper, we address both of these issues for the application
of automated audio captioning. First, we propose a data augmentation technique
for generating hallucinated audio captions and show that similarity based on an
audio-text shared latent space is suitable for detecting hallucination. Then,
we propose a parameter-efficient, inference-time faithful decoding algorithm
that enables smaller audio captioning models to reach performance equivalent to
that of larger models trained with more data. During the beam decoding step, the
smaller model utilizes an audio-text shared latent representation to
semantically align the generated text with the corresponding input audio. Faithful
guidance is introduced into the beam probability by incorporating the cosine
similarity between the latent-space projection of each greedily rolled-out
intermediate beam and that of the audio clip. We show the efficacy of our algorithm on
benchmark datasets and evaluate the proposed scheme against baselines using
conventional audio captioning and semantic similarity metrics while
illustrating tradeoffs between performance and complexity.
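
The faithful-guidance step described above can be illustrated with a minimal sketch. The abstract does not say how the cosine similarity is combined with the beam probability, so the additive form `log_prob + weight * similarity` is an assumption. The names `rescore_beams`, `embed_text`, `rollout`, and `weight` are hypothetical, and the two-dimensional embeddings are toy values standing in for a real audio-text shared latent space.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rescore_beams(beams, audio_embedding, embed_text, rollout, weight=1.0):
    """Re-rank beams given as (partial_caption, log_prob) pairs.

    Each partial caption is greedily rolled out to a full caption, embedded
    in the shared latent space, and compared with the audio embedding.
    ASSUMPTION: the similarity is added to the log-probability with a
    scalar weight; the source does not specify the exact combination.
    """
    rescored = []
    for partial, log_prob in beams:
        completed = rollout(partial)  # greedy completion of the partial beam
        similarity = cosine_similarity(embed_text(completed), audio_embedding)
        rescored.append((partial, log_prob + weight * similarity))
    rescored.sort(key=lambda item: item[1], reverse=True)
    return rescored


# Toy stand-ins (hypothetical): a lookup-table text encoder and rollout.
toy_text_embeddings = {"a dog barks": [1.0, 0.0], "a car passes": [0.0, 1.0]}
toy_rollouts = {"a dog": "a dog barks", "a car": "a car passes"}

beams = [("a car", -1.0), ("a dog", -1.2)]  # "a car" leads on log-prob alone
audio_embedding = [0.9, 0.1]                # audio clip close to "a dog barks"

ranked = rescore_beams(
    beams,
    audio_embedding,
    embed_text=toy_text_embeddings.__getitem__,
    rollout=toy_rollouts.__getitem__,
    weight=1.0,
)
print(ranked[0][0])  # the audio-consistent beam "a dog" now ranks first
```

The same similarity score could also serve as a hallucination detector by flagging captions whose similarity to the audio falls below a threshold. The threshold would have to be chosen empirically.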

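Building on pretrained transformer architectures, this work proposes a data augmentation technique and a parameter-efficient inference-time decoding algorithm to address overparameterization, hallucination, and large memory footprint in automated audio captioning, using semantic alignment and similarity computation to improve performance while reducing model complexity.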

Efficient Audio Captioning Using an Audio-Text Shared Latent Representation

Parameter Efficient Audio Captioning With Faithful Guidance Using Audio-text Shared Latent Representation

Prompt engineering is a technique that involves augmenting a large
pre-trained model with task-specific hints, known as prompts, to adapt the
model to new tasks. Prompts can be created manually as natural language
instructions or generated automatically as either natural language instructions
or vector representations. Prompt engineering makes it possible to perform
predictions based solely on prompts, without updating model parameters, and
makes it easier to apply large pre-trained models to real-world tasks. In recent
years, prompt engineering has been well studied in natural language processing.
More recently, it has also been intensively studied in vision-language modeling.
However, a systematic overview of prompt engineering on pre-trained
vision-language models is still lacking. This paper aims to provide a
comprehensive survey of cutting-edge research in prompt engineering on three
types of vision-language models: multimodal-to-text generation models (e.g.
Flamingo), image-text matching models (e.g. CLIP), and text-to-image generation
models (e.g. Stable Diffusion). For each type of model, a brief model summary,
prompting methods, prompting-based applications, and the corresponding
responsibility and integrity issues are summarized and discussed. Furthermore,
the commonalities and differences between prompting on vision-language models,
language models, and vision models are also discussed. The challenges, future
directions, and research opportunities are summarized to foster future research
on this topic.
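
As a minimal illustration of prediction "based solely on prompts", the sketch below classifies an image with hand-written natural-language prompts and an image-text matching score, without updating any parameters. It does not use the API of CLIP or any other real model. The function names, the template string, and the two-dimensional embeddings are hypothetical toy values, and `encode_text` is a lookup table standing in for a frozen text encoder.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def build_prompts(labels, template="a photo of a {}"):
    """Turn class labels into manually designed natural-language prompts."""
    return [template.format(label) for label in labels]


def classify_with_prompts(image_embedding, labels, encode_text,
                          template="a photo of a {}"):
    """Pick the label whose prompt embedding best matches the image.

    No model parameters are updated: only the prompt text changes.
    """
    prompts = build_prompts(labels, template)
    scores = [cosine_similarity(image_embedding, encode_text(p)) for p in prompts]
    best_index = max(range(len(labels)), key=scores.__getitem__)
    return labels[best_index], scores


# Toy stand-in (hypothetical) for a frozen text encoder.
toy_text_embeddings = {
    "a photo of a dog": [1.0, 0.1],
    "a photo of a cat": [0.1, 1.0],
}
image_embedding = [0.8, 0.3]  # toy image embedding, closer to the dog prompt

label, scores = classify_with_prompts(
    image_embedding, ["dog", "cat"], toy_text_embeddings.__getitem__
)
print(label)
```

Prompts given as vector representations would replace the template string with learned embedding vectors fed to the frozen model, which this sketch does not cover.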

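This paper systematically surveys cutting-edge research on prompt engineering for three types of vision-language models: multimodal-to-text generation models, image-text matching models, and text-to-image generation models. For each type, it summarizes and discusses the model, prompting methods, prompting-based applications, and the associated responsibility and integrity issues. It also discusses the commonalities and differences between prompting on vision-language models, language models, and vision models, and summarizes the challenges, future directions, and research opportunities to foster future research on this topic.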