In legal NLP, Case Outcome Classification (COC) must not only be accurate but also trustworthy and explainable. Existing work in explainable COC has been limited to annotations by a single expert. However, it is well known that lawyers may disagree in their assessment of case facts. We hence collect a novel dataset, RAVE: Rationale Variation in ECHR, which is obtained from two experts in the domain of international human rights law, for whom we observe weak agreement. We study their disagreements and build a two-level task-independent taxonomy, supplemented with COC-specific subcategories. To our knowledge, this is the first work in legal NLP that focuses on human label variation. We quantitatively assess different taxonomy categories and find that disagreements mainly stem from underspecification of the legal context, which poses challenges given the typically limited granularity and noise in COC metadata. We further assess the explainability of SOTA COC models on RAVE and observe limited agreement between models and experts. Overall, our case study reveals hitherto underappreciated complexities in creating benchmark datasets in legal NLP that revolve around identifying aspects of a case's facts supposedly relevant to its outcome.

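Case outcome classification in legal NLP requires not only accuracy but also trustworthiness and explainability. We present a new dataset, RAVE (Rationale Variation in ECHR), which collects assessments from two experts in international human rights law, and find that the experts differ in how they assess case facts. We build a two-level task-independent taxonomy, supplemented with subcategories specific to case outcome classification. We quantitatively assess the different taxonomy categories and find that disagreements mainly stem from underspecification of the legal context, which is hard to resolve given the limited granularity of case outcome classification metadata. We further evaluate the explainability of state-of-the-art case outcome classification models on RAVE and observe limited agreement between models and experts. Overall, our case study reveals the complexities involved in creating benchmark datasets in legal NLP, which center on identifying the aspects of a case that are relevant to its outcome.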

From Dissonance to Insights: Dissecting Disagreements in Rationale Construction for Case Outcome Classification