This project involved participation in the DCASE 2022 Competition (Task 6),
which had two subtasks: (1) Automated Audio Captioning and (2) Language-Based
Audio Retrieval. The first subtask involved generating a textual description
for an audio sample, while the goal of the second was to find audio
samples within a fixed dataset that match a given description. For both
subtasks, the Clotho dataset was used. The models were evaluated on BLEU-1,
BLEU-2, BLEU-3, ROUGE-L, METEOR, CIDEr, SPICE, and SPIDEr scores for audio
captioning and on R1, R5, R10, and mAP10 scores for audio retrieval. We
conducted several experiments that modify the baseline models for these
tasks. Our final architecture for Automated Audio Captioning performs close to
the baseline, while our model for Language-Based Audio Retrieval surpasses
its baseline.
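
As a rough illustration of the retrieval scores named above, the sketch below computes recall at k and mean average precision at 10 from ranked result lists. It assumes each caption query has exactly one relevant audio clip, in which case average precision reduces to the reciprocal rank of that clip within the top 10. The function name `retrieval_scores` and the dictionary layout are hypothetical and are not taken from the challenge's evaluation code.

```python
from statistics import mean


def retrieval_scores(rankings, targets, ks=(1, 5, 10)):
    """Compute R@k and mAP@10 for caption-to-audio retrieval.

    rankings: dict mapping a query id to a list of audio ids, best match first.
    targets:  dict mapping a query id to its single relevant audio id
              (assumption: one relevant clip per query).
    """
    scores = {}
    for k in ks:
        # Recall at k: fraction of queries whose target appears in the top k.
        scores[f"R{k}"] = mean(
            1.0 if targets[q] in ranked[:k] else 0.0
            for q, ranked in rankings.items()
        )

    # With one relevant item per query, AP@10 is 1/rank when the target is
    # ranked in the top 10, and 0 otherwise.
    average_precisions = []
    for q, ranked in rankings.items():
        top10 = ranked[:10]
        if targets[q] in top10:
            average_precisions.append(1.0 / (top10.index(targets[q]) + 1))
        else:
            average_precisions.append(0.0)
    scores["mAP10"] = mean(average_precisions)
    return scores


if __name__ == "__main__":
    rankings = {"q1": ["a", "b", "c"], "q2": ["b", "a", "c"]}
    targets = {"q1": "a", "q2": "a"}
    print(retrieval_scores(rankings, targets))
```

In the toy example, the target is ranked first for one query and second for the other, so half of the queries count toward R1 and all of them toward R5 and R10.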

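We participated in both subtasks of the DCASE 2022 competition: automated audio captioning and language-based audio retrieval. Baseline models and several experimental variants were evaluated on the Clotho dataset using multiple metrics. The final captioning model performs close to the baseline, while the final retrieval model surpasses it.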

Automated Audio Captioning and Language-Based Audio Retrieval

In this paper, we tackle the new Language-Based Audio Retrieval task proposed
in DCASE 2022. First, we introduce a simple, scalable architecture that ties
the audio and text encoders together. Second, we show that using this
architecture with a contrastive loss allows the model to significantly beat
the performance of the baseline model. Finally, in addition to having an
extremely low training memory requirement, our approach can use pretrained
models as they are, without fine-tuning them. Our experiments show that
combining these methods beats the baseline scores significantly.
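
The abstract does not say which contrastive objective is used. The sketch below assumes a symmetric, temperature-scaled cross-entropy loss (InfoNCE-style) over a batch of paired audio and text embeddings, which is one common choice for this kind of setup. The function names, the temperature of 0.07, and the plain-list embeddings are illustrative assumptions, not the authors' implementation.

```python
import math


def l2_normalize(vector):
    """Scale a non-zero vector to unit length so dot products are cosines."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_loss(audio_embeddings, text_embeddings, temperature=0.07):
    """Symmetric contrastive loss; pair i of each list is a matching pair.

    Assumption: the embeddings come from encoders that may be frozen, so only
    whatever maps them into the shared space would receive gradients.
    """
    audio = [l2_normalize(v) for v in audio_embeddings]
    text = [l2_normalize(v) for v in text_embeddings]

    # logits[i][j] is the scaled cosine similarity of audio i and text j.
    logits = [
        [sum(a * t for a, t in zip(audio_vec, text_vec)) / temperature
         for text_vec in text]
        for audio_vec in audio
    ]
    logits_transposed = [list(column) for column in zip(*logits)]

    audio_to_text = cross_entropy_diagonal(logits)
    text_to_audio = cross_entropy_diagonal(logits_transposed)
    return 0.5 * (audio_to_text + text_to_audio)


if __name__ == "__main__":
    audio = [[1.0, 0.0], [0.0, 1.0]]
    matched_text = [[1.0, 0.0], [0.0, 1.0]]
    swapped_text = [[0.0, 1.0], [1.0, 0.0]]
    print(contrastive_loss(audio, matched_text))
    print(contrastive_loss(audio, swapped_text))
```

Correctly paired embeddings yield a loss near zero, while swapping the pairs yields a much larger loss, which is the signal that pulls matching audio and text together during training.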

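This paper introduces a simple, scalable architecture that ties the audio and text encoders together and uses a contrastive loss to significantly improve on the baseline model's performance. By using pretrained models without fine-tuning, it achieves strong audio retrieval performance with an extremely low training memory requirement. Experimental results show that combining the proposed methods significantly improves on the baseline scores.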