Self-supervised speech representations can substantially benefit downstream speech technologies, yet the properties that make them useful are still poorly understood. Two candidate properties related to the geometry of the representation space have been hypothesized to correlate well with downstream task performance: (1) the degree of orthogonality between the subspaces spanned by the speaker centroids and phone centroids, and (2) the isotropy of the space, i.e., the degree to which all dimensions are effectively utilized. To study them, we introduce a new measure, Cumulative Residual Variance (CRV), which can be used to assess both properties. Using linear classifiers for speaker and phone ID to probe the representations of six different self-supervised models and two untrained baselines, we ask whether either orthogonality or isotropy correlates with linear probing accuracy. We find that both measures correlate with phonetic probing accuracy, though our results on isotropy are more nuanced.
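To make the two geometric properties concrete, the following is a minimal sketch on synthetic data. It does not implement CRV, which the abstract names but does not define. As stand-ins, it assumes a principal-angle overlap between centroid subspaces for property (1) and a participation ratio of the covariance eigenvalues for property (2). All function names and the toy data are hypothetical and chosen for illustration only.

```python
import numpy as np

def class_centroids(X, labels):
    """Mean vector of the rows of X for each distinct label."""
    classes = np.unique(labels)
    return np.stack([X[labels == c].mean(axis=0) for c in classes])

def centroid_basis(centroids, k):
    """Orthonormal basis (d x k) for the top-k directions of variation among centroids."""
    centered = centroids - centroids.mean(axis=0)
    # Rows of vt are orthonormal directions, sorted by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k].T

def subspace_overlap(U, V):
    """Mean squared cosine of the principal angles between two subspaces.
    0 means orthogonal; 1 means identical (for subspaces of equal dimension)."""
    cosines = np.linalg.svd(U.T @ V, compute_uv=False)
    return float(np.mean(cosines ** 2))

def participation_ratio(X):
    """Isotropy proxy in [1/d, 1]: (sum of eigenvalues)^2 / (d * sum of squared eigenvalues)."""
    eig = np.linalg.eigvalsh(np.cov(X, rowvar=False))
    eig = np.clip(eig, 0.0, None)  # guard against tiny negative values from round-off
    return float(eig.sum() ** 2 / (len(eig) * (eig ** 2).sum()))

# Toy data (assumption): speaker information lives in dims 0-1, phone information in dims 2-3.
rng = np.random.default_rng(0)
n_spk, n_ph, per_cell, d = 5, 6, 50, 8
speakers = np.repeat(np.arange(n_spk), n_ph * per_cell)
phones = np.tile(np.repeat(np.arange(n_ph), per_cell), n_spk)

spk_angles = 2 * np.pi * np.arange(n_spk) / n_spk
ph_angles = 2 * np.pi * np.arange(n_ph) / n_ph
spk_means = np.stack([np.cos(spk_angles), np.sin(spk_angles)], axis=1)
ph_means = np.stack([np.cos(ph_angles), np.sin(ph_angles)], axis=1)

X = 0.1 * rng.normal(size=(len(speakers), d))
X[:, 0:2] += spk_means[speakers]
X[:, 2:4] += ph_means[phones]

U_spk = centroid_basis(class_centroids(X, speakers), k=2)
U_ph = centroid_basis(class_centroids(X, phones), k=2)

overlap = subspace_overlap(U_spk, U_ph)   # near 0 for this construction
self_overlap = subspace_overlap(U_spk, U_spk)
isotropy = participation_ratio(X)
```

In this construction the speaker and phone centroid subspaces are orthogonal by design, so `overlap` comes out close to zero, while `isotropy` is well below 1 because only four of the eight dimensions carry structured variance. On real model representations, `X` would be frame-level features and the labels would come from speaker and phone annotations.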

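Self-supervised speech representations greatly benefit downstream speech technologies, yet the properties that make them useful remain poorly understood. This paper introduces a new measure, Cumulative Residual Variance (CRV), to assess two candidate properties of the representation space: the degree of orthogonality between the subspaces spanned by the speaker centroids and the phone centroids, and the degree to which all dimensions of the space are effectively used. We use linear classifiers to evaluate the speech representations of six different self-supervised models and two untrained baselines, asking whether orthogonality and isotropy correlate with linear probing accuracy. We find that both measures correlate positively with phonetic probing accuracy, although the results for isotropy are more nuanced.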

Orthogonality and isotropy of speaker and phonetic information in self-supervised speech representations