Large Language Models (LLMs) are increasingly employed as automated evaluators to assess the safety of generated content, yet their reliability in this role remains uncertain. This study evaluates a diverse set of 11 LLM judge models across critical safety domains, examining three key aspects: self-consistency in repeated judging tasks, alignment with human judgments, and susceptibility to input artifacts such as apologetic or verbose phrasing. Our findings reveal that biases in LLM judges can significantly distort the final verdict on which content source is safer, undermining the validity of comparative evaluations. Notably, apologetic language artifacts alone can skew evaluator preferences by up to 98\%. Contrary to expectations, larger models do not consistently exhibit greater robustness, and smaller models sometimes show higher resistance to specific artifacts. To mitigate these robustness issues, we investigate jury-based evaluations that aggregate decisions from multiple models. Although this approach improves both robustness and alignment with human judgments, artifact sensitivity persists even with the best jury configurations. These results highlight the urgent need for diversified, artifact-resistant methodologies to ensure reliable safety assessments.

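This study examines the reliability of large language models (LLMs) in safety evaluation and finds that their sensitivity to input artifacts introduces bias that significantly affects judgments of content safety. It investigates a jury-based evaluation method that aggregates multiple models to improve consistency and alignment with human judgments, but artifact sensitivity persists, underscoring the urgent need for more diversified, artifact-resistant methods to ensure reliable safety assessments.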

Safer or Luckier? LLMs as Safety Evaluators Are Not Robust to Artifacts