Human evaluation serves as the gold standard for assessing the quality of
Natural Language Generation (NLG) systems. Nevertheless, the evaluation
guideline, as a pivotal element ensuring reliable and reproducible human
assessment, has received limited attention. Our investigation revealed that only
29.84% of recent papers involving human evaluation at top conferences release
their evaluation guidelines, with vulnerabilities identified in 77.09% of these
guidelines. Unreliable evaluation guidelines can yield inaccurate assessment
outcomes, potentially impeding the advancement of NLG in the right direction.
To address these challenges, we take an initial step towards reliable
evaluation guidelines and propose the first human evaluation guideline dataset
by collecting annotations of guidelines extracted from existing papers as well
as generated via Large Language Models (LLMs). We then introduce a taxonomy of
eight vulnerabilities and formulate a principle for composing evaluation
guidelines. Furthermore, a method for detecting guideline vulnerabilities has
been explored using LLMs, and we offer a set of recommendations to enhance
reliability in human evaluation. The annotated human evaluation guideline
dataset and code for the vulnerability detection method are publicly available
online.

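By collecting annotations of guidelines extracted from existing papers and of guidelines generated by Large Language Models (LLMs), we propose the first human evaluation guideline dataset, and we introduce a taxonomy of eight vulnerabilities together with a principle for composing evaluation guidelines. In addition, we explore a method for detecting guideline vulnerabilities using LLMs and offer a set of recommendations for improving the reliability of human evaluation.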

Defining and Detecting Vulnerability in Human Evaluation Guidelines: A Preliminary Study Towards Reliable NLG Evaluation

Over the past 40 years, the discovery and development of therapeutic
antibodies to treat disease has become common practice. However, as therapeutic
antibody constructs are becoming more sophisticated (e.g., multi-specifics),
conventional approaches to optimisation are increasingly inefficient. Machine
learning (ML) promises to open up an in silico route to antibody discovery and
help accelerate the development of drug products using a reduced number of
experiments and hence cost. Over the past few years, we have observed rapid
developments in the field of ML-guided antibody discovery and development
(D&D). However, many of the results are difficult to compare or hard to assess
for utility by other experts in the field due to the high diversity in the
datasets and evaluation techniques and metrics that are used across industry and
academia. This limitation of the literature curtails the broad adoption of ML
across the industry and slows down overall progress in the field, highlighting
the need to develop standards and guidelines that may help improve the
reproducibility of ML models across different research groups. To address these
challenges, we set out in this perspective to critically review current
practices, explain common pitfalls, and clearly define a set of method
development and evaluation guidelines that can be applied to different types of
ML-based techniques for therapeutic antibody D&D. Specifically, in an
end-to-end analysis, we address the challenges associated with all aspects of
the ML process and recommend a set of best practices for each stage.

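Over the past 40 years, the discovery and development of therapeutic antibodies has become commonplace. However, as therapeutic antibody constructs grow more sophisticated (e.g., multi-specifics), conventional optimisation approaches are increasingly inefficient. Machine learning promises an in silico route that accelerates the discovery and development of drug products with fewer experiments and lower cost. This perspective critically reviews current practices, explains common pitfalls, and defines a set of method development and evaluation guidelines applicable to different types of ML techniques for therapeutic antibody discovery and development.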

Best practices for machine learning in antibody discovery and development

Transferable adversarial examples raise critical security concerns in
real-world, black-box attack scenarios. However, in this work, we identify two
main problems in common evaluation practices: (1) for attack transferability,
a lack of systematic, one-to-one attack comparisons and fair hyperparameter
settings; and (2) for attack stealthiness, no comparisons at all. To address these
problems, we establish new evaluation guidelines by (1) proposing a novel
attack categorization strategy and conducting systematic and fair
intra-category analyses on transferability, and (2) considering diverse
imperceptibility metrics and finer-grained stealthiness characteristics from
the perspective of attack traceback. To this end, we provide the first
large-scale evaluation of transferable adversarial examples on ImageNet,
involving 23 representative attacks against 9 representative defenses. Our
evaluation leads to a number of new insights, including consensus-challenging
ones: (1) Under a fair attack hyperparameter setting, one early attack method,
DI, actually outperforms all the follow-up methods. (2) A state-of-the-art
defense, DiffPure, actually gives a false sense of (white-box) security since
it is indeed largely bypassed by our (black-box) transferable attacks. (3) Even
when all attacks are bounded by the same $L_p$ norm, they lead to dramatically
different stealthiness performance, which negatively correlates with their
transferability performance. Overall, our work demonstrates that existing
problematic evaluations have indeed caused misleading conclusions and missing
points, and as a result, hindered the assessment of the actual progress in this
field.
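
Insight (3) above can be illustrated with a small sketch: two perturbations that share the same $L_\infty$ bound can still differ widely under another distance measure. The example below is only an illustration under stated assumptions, not the paper's evaluation code. The helper names, the image size, the budget `eps = 8 / 255`, and the use of the $L_2$ norm as a stand-in for an imperceptibility metric are all hypothetical choices; the abstract does not say which metrics the authors use.

```python
import math
import random


def lp_norm(delta, p):
    """Return the L_p norm of a flat perturbation; p may be math.inf."""
    if p == math.inf:
        return max(abs(d) for d in delta)
    return sum(abs(d) ** p for d in delta) ** (1.0 / p)


def dense_sign_perturbation(n, eps, rng):
    """Hypothetical attack A: every coordinate is pushed to +eps or -eps."""
    return [eps * rng.choice((-1.0, 1.0)) for _ in range(n)]


def sparse_perturbation(n, eps, k, rng):
    """Hypothetical attack B: only k coordinates change, each by +/- eps."""
    delta = [0.0] * n
    for i in rng.sample(range(n), k):
        delta[i] = eps * rng.choice((-1.0, 1.0))
    return delta


rng = random.Random(0)
n = 3 * 32 * 32   # assumed flattened image size (not from the source)
eps = 8 / 255     # assumed L_inf budget (not from the source)
k = n // 100      # assumed sparsity of attack B

dense = dense_sign_perturbation(n, eps, rng)
sparse = sparse_perturbation(n, eps, k, rng)

# Both perturbations saturate the same L_inf bound, yet their L_2 norms differ.
for name, delta in (("dense", dense), ("sparse", sparse)):
    print(name, lp_norm(delta, math.inf), lp_norm(delta, 2))
```

Both perturbations report the same $L_\infty$ norm, while the dense one has a much larger $L_2$ norm. This is one simple reason why a shared norm bound alone does not make stealthiness comparable across attacks.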

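By establishing new evaluation guidelines, we conduct the first large-scale evaluation of transferable adversarial examples on ImageNet, covering 23 representative attacks against 9 representative defenses. We find that existing evaluations have produced misleading conclusions and missing points, which has hindered the assessment of actual progress in the field.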