To measure how well pretrained representations encode some linguistic
property, it is common to use the accuracy of a probe, i.e., a classifier trained to
predict the property from the representations. Despite widespread adoption of
probes, differences in their accuracy fail to adequately r