Learning the association between voices and faces has seen considerable
progress. However, most previous models rely on cosine similarity or L2
distance to measure the similarity between voice and face embeddings obtained
through contrastive learning, and then apply these scores to retrieval and
matching tasks. This method
only considers the embeddings as high-dimensiona