In this paper, we propose three methods for generating synthetic samples to
train and evaluate multimodal large language models capable of processing both
text and speech inputs. Because samples containing both modalities are scarce,
synthetic data generation is a crucial strategy for improving the performance
of such systems and for modeling cross-modal relationships between the speech
and text domains. Our process employs large
language models to generate textual components and text-to-speech systems to
generate speech components. The proposed methods offer a practical and
effective means to expand the training dataset for these models. Experimental
results show progress in achieving an integrated understanding of text and
speech. We also highlight the potential of using unlabeled speech data to
generate synthetic samples comparable in quality to those with available
transcriptions, enabling the expansion of these models to more languages.
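
The generation process described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the names `SyntheticSample`, `make_samples`, `generate_text`, and `synthesize_speech` are hypothetical, and the assumption that the speech component is synthesized directly from the generated text is ours. The two stand-in callables only mimic a language model and a text-to-speech system so that the sketch runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SyntheticSample:
    text: str      # textual component, produced by the language model
    speech: bytes  # speech component, produced by the text-to-speech system


def make_samples(
    seeds: List[str],
    generate_text: Callable[[str], str],
    synthesize_speech: Callable[[str], bytes],
) -> List[SyntheticSample]:
    """Pair each generated text with synthesized speech (hypothetical pipeline)."""
    samples = []
    for seed in seeds:
        text = generate_text(seed)         # assumed: one LLM call per seed
        speech = synthesize_speech(text)   # assumed: TTS reads the generated text
        samples.append(SyntheticSample(text=text, speech=speech))
    return samples


# Stand-ins for a real language model and TTS system (assumptions for illustration).
def fake_llm(seed: str) -> str:
    return f"Tell me about {seed}."


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")  # placeholder for an audio waveform


samples = make_samples(["weather", "music"], fake_llm, fake_tts)
```

Passing the generator and synthesizer as callables keeps the sketch independent of any particular model or TTS library.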

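We propose three methods for generating synthetic samples to train and evaluate multimodal large language models capable of processing both text and speech inputs. Because samples containing both modalities are scarce, synthetic data generation is a key strategy for improving the performance of these systems and for modeling cross-modal relationships between the speech and text domains. Our process uses large language models to generate the textual components and text-to-speech systems to generate the speech components. The proposed methods offer a practical and effective way to expand the training dataset for these models. Experimental results show progress toward an integrated understanding of text and speech. We also highlight the potential of using unlabeled speech data to generate synthetic samples comparable in quality to those with available transcriptions, enabling these models to be extended to more languages.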

Instruction Data Generation and Unsupervised Adaptation for Speech Language Models

We present CrissCross, a self-supervised framework for learning audio-visual
representations. In addition to learning intra-modal and standard
'synchronous' cross-modal relations, CrissCross also learns 'asynchronous'
cross-modal relationships. We
perform in-depth studies showing that by relaxing the temporal synchronicity
between the audio and visual modalities, the network learns strong generalized
representations useful for a variety of downstream tasks. To pretrain our
proposed solution, we use three datasets of varying sizes: Kinetics-Sound,
Kinetics400, and AudioSet. The learned representations are evaluated on
several downstream tasks, namely action recognition, sound classification,
and action retrieval. Our experiments show that CrissCross
either outperforms or performs on par with the current
state-of-the-art self-supervised methods on action recognition and action
retrieval with UCF101 and HMDB51, as well as sound classification with ESC50
and DCASE. Moreover, CrissCross outperforms fully-supervised pretraining when
pretrained on Kinetics-Sound. The code and pretrained models are available on
the project website.
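
As an illustration of what relaxing temporal synchronicity can mean, the sketch below samples start times for a visual clip and an audio clip from the same video. It is not the CrissCross sampling procedure, which the abstract does not specify: the function `sample_clip_starts` and the parameter `max_offset` are hypothetical. Setting `max_offset` to zero gives a synchronous pair, and a positive value lets the audio clip drift from the visual clip.

```python
import random
from typing import Tuple


def sample_clip_starts(
    duration: float,
    clip_len: float,
    max_offset: float,
    rng: random.Random,
) -> Tuple[float, float]:
    """Return (video_start, audio_start) in seconds for one training pair.

    max_offset == 0 yields a synchronous pair; max_offset > 0 allows the audio
    clip to start up to max_offset seconds before or after the visual clip.
    """
    if clip_len > duration:
        raise ValueError("clip_len must not exceed the video duration")
    latest = duration - clip_len               # last valid start time
    video_start = rng.uniform(0.0, latest)
    offset = rng.uniform(-max_offset, max_offset)
    # Clamp so the audio clip stays inside the video.
    audio_start = min(max(video_start + offset, 0.0), latest)
    return video_start, audio_start


rng = random.Random(0)
sync_pair = sample_clip_starts(10.0, 2.0, 0.0, rng)    # synchronous pair
async_pair = sample_clip_starts(10.0, 2.0, 3.0, rng)   # relaxed synchronicity
```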

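CrissCross is a self-supervised framework for learning audio-visual representations that also learns asynchronous cross-modal relationships. Its effectiveness is shown on several downstream tasks, where it outperforms or performs on par with current self-supervised methods, and it outperforms fully-supervised pretraining when pretrained on Kinetics-Sound. Pretrained models are also provided.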