Using generative models to synthesize visual features from semantic distributions has been one of the most popular approaches to zero-shot learning (ZSL) image classification in recent years. The triplet loss (TL) is widely used to generate realistic visual distributions from semantics by automatically searching for discriminative representations. However, the traditional TL cannot find reliable disentangled representations for unseen classes, because unseen classes are unavailable during training in ZSL. To alleviate this drawback, we propose a multi-modal triplet loss (MMTL) that uses multimodal information to search for a disentangled representation space. In this space, all classes can interact, which benefits the learning of disentangled class representations. Furthermore, we develop a novel model called Disentangling Class Representation Generative Adversarial Network (DCR-GAN), which exploits the disentangled representations in the training, feature synthesis, and final recognition stages. Benefiting from the disentangled representations, DCR-GAN can fit a more realistic distribution over both seen and unseen features. Extensive experiments show that our proposed model outperforms state-of-the-art methods on four benchmark datasets. Our code is available at https://github.com/FouriYe/DCRGAN-TMM.

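This paper proposes a multi-modal triplet loss (MMTL) that uses multimodal information to search for a disentangled representation space, and develops a new model named Disentangling Class Representation Generative Adversarial Network (DCR-GAN). The model exploits the disentangled representations in the training, feature synthesis, and final recognition stages, which allows DCR-GAN to fit a more realistic distribution. Experiments show that the model outperforms existing methods on four benchmark datasets.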

Resolving the confusion between semantics and visual features in zero-shot learning
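
For readers unfamiliar with the triplet loss that the abstract builds on, the following is a minimal sketch of the standard margin-based form, not the proposed MMTL, whose exact formulation is not given here. The function names, the squared Euclidean distance, and the default margin of 1.0 are illustrative assumptions.

```python
from typing import Sequence

Vector = Sequence[float]


def squared_distance(u: Vector, v: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(
    anchors: Sequence[Vector],
    positives: Sequence[Vector],
    negatives: Sequence[Vector],
    margin: float = 1.0,  # assumed value, chosen only for illustration
) -> float:
    """Standard triplet loss averaged over a batch of triplets.

    Each triplet contributes max(d(a, p) - d(a, n) + margin, 0), which
    pulls the anchor toward the same-class sample (positive) and pushes
    it away from the different-class sample (negative).
    """
    losses = [
        max(squared_distance(a, p) - squared_distance(a, n) + margin, 0.0)
        for a, p, n in zip(anchors, positives, negatives)
    ]
    return sum(losses) / len(losses)
```

A triplet whose negative already lies farther from the anchor than the positive by at least the margin contributes zero loss. Note that both the positive and the negative must be sampled from available classes, which is why this standard form cannot directly involve unseen classes in ZSL.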