The continuous advancement of large language models (LLMs) has brought
increasing attention to the critical issue of developing fair and reliable
methods for evaluating their performance. In particular, the emergence of
intentional or unintentional cheating phenomena, such as test set l