Large Language Models (LLMs) increasingly support applications in a wide
range of domains, some with potentially high societal impact such as biomedicine,
yet their reliability in realistic use cases is under-researched. In this work
we introduce the Reliability AssesMent for Biomedical LLM Assistants (RAmBLA)
framework and evaluate whether four state-of-the-art foundation LLMs can serve
as reliable assistants in the biomedical domain. We identify prompt robustness,
high recall, and a lack of hallucinations as necessary criteria for this use
case. We design shortform tasks and tasks requiring LLM freeform responses
mimicking real-world user interactions. We evaluate LLM performance using
semantic similarity with a ground truth response, through an evaluator LLM.

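We introduce the RAmBLA framework, evaluate whether four state-of-the-art foundation LLMs can serve as reliable assistants in the biomedical domain, and identify prompt robustness, high recall, and a lack of hallucinations as necessary criteria for this use case.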

RAmBLA: A Framework for Evaluating the Reliability of LLMs as Assistants in the Biomedical Domain
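
As a rough illustration of the prompt-robustness criterion named in the RAmBLA abstract, the sketch below scores how often a model gives the same answer to reworded versions of one question. The agreement metric, the `prompt_consistency` helper, and the `toy_model` stand-in are assumptions made for this example; the abstract does not specify how RAmBLA measures robustness.

```python
from collections import Counter
from typing import Callable, Sequence


def prompt_consistency(model: Callable[[str], str], variants: Sequence[str]) -> float:
    """Fraction of prompt variants whose answer matches the most common answer."""
    answers = [model(v).strip().lower() for v in variants]
    most_common_count = Counter(answers).most_common(1)[0][1]
    return most_common_count / len(answers)


def toy_model(prompt: str) -> str:
    # Hypothetical stand-in for an LLM call: it keys on a surface word,
    # so it is deliberately brittle to rewording.
    return "yes" if "aspirin" in prompt.lower() else "no"


# Three rewordings of the same biomedical question (illustrative only).
variants = [
    "Does aspirin inhibit COX-1? Answer yes or no.",
    "Answer yes or no: is COX-1 inhibited by aspirin?",
    "Is COX-1 inhibited by acetylsalicylic acid? Answer yes or no.",
]

score = prompt_consistency(toy_model, variants)
print(f"consistency across rewordings: {score:.2f}")
```

A score of 1.0 would mean the answer never changes under rewording; the brittle toy model falls short because one variant avoids the word it relies on.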

The increasing reliance on Large Language Models (LLMs) across academia and
industry necessitates a comprehensive understanding of their robustness to
prompts. In response to this vital need, we introduce PromptBench, a robustness
benchmark designed to measure LLMs' resilience to adversarial prompts. This
study uses a plethora of adversarial textual attacks targeting prompts across
multiple levels: character, word, sentence, and semantic. These prompts are
then employed in diverse tasks, such as sentiment analysis, natural language
inference, reading comprehension, machine translation, and math
problem-solving. Our study generates 4,032 adversarial prompts, meticulously
evaluated over 8 tasks and 13 datasets, with 567,084 test samples in total. Our
findings demonstrate that contemporary LLMs are vulnerable to adversarial
prompts. Furthermore, we present comprehensive analysis to understand the
mystery behind prompt robustness and its transferability. We then offer
insightful robustness analysis and pragmatic recommendations for prompt
composition, beneficial to both researchers and everyday users. We make our
code, prompts, and methodologies to generate adversarial prompts publicly
accessible, thereby enabling and encouraging collaborative exploration in this
pivotal field.

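This study uses adversarial prompts to measure the robustness of Large Language Models, analyzes prompt robustness and its transferability, and offers practical recommendations for prompt composition.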

PromptBench: Towards Evaluating the Robustness of Large Language Models on Adversarial Prompts
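
To make the character-level attack level from the PromptBench abstract concrete, here is a minimal sketch that perturbs a prompt by swapping adjacent letters and then reports the relative accuracy drop. The `swap_adjacent_chars` and `relative_drop` helpers and the accuracy figures are hypothetical; they are not taken from the PromptBench code, and the benchmark's actual attacks and metrics may differ.

```python
import random


def swap_adjacent_chars(prompt: str, n_swaps: int = 2, seed: int = 0) -> str:
    """Return a copy of the prompt with up to n_swaps adjacent letter pairs swapped."""
    rng = random.Random(seed)  # seeded so the perturbation is reproducible
    chars = list(prompt)
    # Only swap inside words: both neighbours must be letters.
    positions = [
        i for i in range(len(chars) - 1)
        if chars[i].isalpha() and chars[i + 1].isalpha()
    ]
    for i in rng.sample(positions, min(n_swaps, len(positions))):
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def relative_drop(clean_acc: float, attacked_acc: float) -> float:
    """Accuracy lost under attack, as a fraction of the clean accuracy."""
    return (clean_acc - attacked_acc) / clean_acc if clean_acc > 0 else 0.0


clean_prompt = "Classify the sentiment of the following review as positive or negative."
attacked_prompt = swap_adjacent_chars(clean_prompt, n_swaps=3, seed=42)
print(attacked_prompt)

# Illustrative accuracies only; real values would come from running a model
# on a dataset with the clean and the attacked prompt.
print(f"relative drop: {relative_drop(0.90, 0.60):.2f}")
```

Word-, sentence-, and semantic-level attacks would replace `swap_adjacent_chars` with a different perturbation while keeping the same before/after comparison.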

A particularly successful class of approaches for few-shot learning combines
language models with prompts -- hand-crafted task descriptions that complement
data samples. However, designing prompts by hand for each task commonly
requires domain knowledge and substantial guesswork. We observe, in the context
of classification tasks, that instruction-finetuned language models exhibit
remarkable prompt robustness, and we subsequently propose a simple method to
eliminate the need for handcrafted prompts, named AuT-Few. This approach
consists of (i) a prompt retrieval module that selects suitable task
instructions from the instruction-tuning knowledge base, and (ii) the
generation of two distinct, semantically meaningful, class descriptions and a
selection mechanism via cross-validation. Over 12 datasets, spanning 8
classification tasks, we show that AuT-Few outperforms current state-of-the-art
few-shot learning methods. Moreover, AuT-Few is the best ranking method across
datasets on the RAFT few-shot benchmark. Notably, these results are achieved
without task-specific handcrafted prompts on unseen tasks.

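Using instruction-finetuned language models, the authors build a few-shot learning method named AuT-Few that automatically selects suitable task instructions and achieves strong prompt robustness and good classification performance.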
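
The cross-validation step in part (ii) of the AuT-Few abstract can be sketched roughly as follows: given two candidate sets of class descriptions, keep the one with the higher mean validation score over folds of the few-shot data. Everything named here (`k_fold`, `select_descriptions`, `overlap_score`, the toy data) is a hypothetical illustration; the abstract does not say how AuT-Few trains or scores each candidate, and the word-overlap scorer merely stands in for a finetuned language model.

```python
from typing import Callable, Dict, Iterator, List, Sequence, Tuple

Example = Tuple[str, str]        # (text, label)
Descriptions = Dict[str, str]    # label -> class description
Scorer = Callable[[List[Example], List[Example], Descriptions], float]


def k_fold(examples: Sequence[Example], k: int) -> Iterator[Tuple[List[Example], List[Example]]]:
    """Yield (train, validation) splits using k interleaved folds."""
    folds = [list(examples[i::k]) for i in range(k)]
    for i in range(k):
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        yield train, folds[i]


def select_descriptions(examples: Sequence[Example],
                        candidates: Sequence[Descriptions],
                        score: Scorer, k: int = 2) -> Descriptions:
    """Return the candidate with the highest mean validation score."""
    def mean_score(desc: Descriptions) -> float:
        scores = [score(train, val, desc) for train, val in k_fold(examples, k)]
        return sum(scores) / len(scores)
    return max(candidates, key=mean_score)


def overlap_score(train: List[Example], val: List[Example], desc: Descriptions) -> float:
    # Toy scorer with no trainable parameters, so `train` is unused here.
    def predict(text: str) -> str:
        words = set(text.lower().split())
        return max(desc, key=lambda label: len(words & set(desc[label].lower().split())))
    return sum(predict(text) == label for text, label in val) / len(val)


few_shot = [("great movie", "pos"), ("terrible movie", "neg"),
            ("great acting", "pos"), ("terrible plot", "neg")]
candidates = [
    {"pos": "great good", "neg": "terrible bad"},   # semantically meaningful
    {"pos": "class one", "neg": "class two"},       # uninformative
]

best = select_descriptions(few_shot, candidates, overlap_score, k=2)
print(best)
```

In a fuller version, the scorer would finetune on `train` with the given descriptions and report accuracy on `val`; the selection logic would stay the same.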