Recent work has shown that models trained to the same objective, and achieving
similar accuracy on consistent test data, may nonetheless behave very
differently on individual predictions. This inconsistency is
undesirable in high-stakes contexts, such as medical diagnos