Natural language understanding (NLU) tasks face a non-trivial number of
ambiguous samples whose labels are debatable among annotators.
NLU models should thus account for such ambiguity, but they approximate the
human opinion distributions quite poorly and tend to produce