Conventional audio classification relied on predefined classes, lacking the
ability to learn from free-form text. Recent methods unlock learning joint
audio-text embeddings from raw audio-text pairs describing audio in natural
language. Despite recent advancements, there is little expl