Music representation learning is notoriously difficult because complex,
human-related concepts are encoded in sequences of numerical signals. To
learn better MUsic SEquence Representations from labeled audio, we propose a
novel text-supervised pre-training method, MUSER. MUSER adopts an
audio-spectrum-text tri-modal contrastive learning framework, in which the
text input can be any form of metadata, converted to text through templates,
and the spectrum is derived from the audio sequence. Our experiments show that
MUSER adapts more flexibly to downstream tasks than the current data-hungry
pre-training method, and that it requires only 0.056% of the pre-training
data to achieve state-of-the-art performance.
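
As a rough illustration of the two ideas named above, the sketch below fills a
text template from a metadata record and computes a symmetric contrastive
(InfoNCE-style) loss over audio, spectrum, and text embeddings. It is a minimal
sketch under stated assumptions, not MUSER's actual objective: the template
wording, the function names, the temperature value, and the choice to average
the three pairwise losses are all hypothetical, and random vectors stand in for
encoder outputs.

```python
import numpy as np

def fill_template(metadata, template="A {genre} track by {artist}."):
    """Turn a metadata record into a text prompt (template wording is hypothetical)."""
    return template.format(**metadata)

def l2_normalize(x):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def log_softmax(logits):
    """Row-wise log-softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE loss; row i of `a` is the positive for row i of `b`."""
    a, b = l2_normalize(a), l2_normalize(b)
    logits = a @ b.T / temperature
    idx = np.arange(len(a))
    loss_ab = -log_softmax(logits)[idx, idx].mean()
    loss_ba = -log_softmax(logits.T)[idx, idx].mean()
    return 0.5 * (loss_ab + loss_ba)

def tri_modal_loss(audio, spectrum, text, temperature=0.07):
    """Average of the three pairwise losses (an assumed combination)."""
    return (info_nce(audio, text, temperature)
            + info_nce(spectrum, text, temperature)
            + info_nce(audio, spectrum, temperature)) / 3.0

if __name__ == "__main__":
    print(fill_template({"genre": "jazz", "artist": "Example Artist"}))
    rng = np.random.default_rng(0)
    # Random stand-ins for encoder outputs: batch of 4, embedding size 8.
    audio, spectrum, text = (rng.normal(size=(4, 8)) for _ in range(3))
    print(tri_modal_loss(audio, spectrum, text))
```

In this sketch, matching rows across the three modalities are pulled together
and all other rows in the batch act as negatives, so the loss is lower when the
embeddings of the same track agree than when the text rows are shuffled.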

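This paper proposes MUSER, a new text-supervised pre-training method. It adopts an audio-spectrum-text tri-modal contrastive learning framework in which text templates turn any form of metadata into text input, and it learns better music sequence representations from labeled audio. Compared with the current data-hungry pre-training method, it adapts more flexibly to downstream tasks and needs only 0.056% of the pre-training data to reach state-of-the-art performance.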