In recent times, BERT-based transformer models have become an inseparable part of the 'tech stack' of text processing models. Similar progress is being observed in the speech domain, with a multitude of models achieving state-of-the-art results by using audio transformer models to encode speech. This raises the question of what these audio transformer models are learning. Moreover, although the standard methodology is to choose the last-layer embedding for any downstream task, is it the optimal choice? We try to answer these questions for two recent audio transformer models, Mockingjay and wav2vec 2.0. We compare them on a comprehensive set of language delivery and structure features, including audio, fluency, and pronunciation features. Additionally, we probe the audio models' understanding of textual surface, syntax, and semantic features and compare them to BERT. We do this over an exhaustive range of settings covering native, non-native, synthetic, read, and spontaneous speech datasets.

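By comparing two audio transformer models, Mockingjay and wav2vec 2.0, on their understanding of language delivery and structure features, of audio, fluency, and pronunciation features, and of textual surface, syntax, and semantic features, the study finds that audio transformer models used to encode speech perform well at speech understanding, similar to BERT-based transformer models.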

What Do Audio Transformer Models Hear? Probing Acoustic Representations of Language Delivery and Its Structure