The rapid development of large language model (LLM) evaluation methodologies
and datasets has led to a profound challenge: integrating state-of-the-art
evaluation techniques cost-effectively while ensuring reliability,
reproducibility, and efficiency. Currently, there is a notable absence of a
unified and adaptable framework that seamlessly integrates various evaluation
approaches. Moreover, the reliability of evaluation findings is often
questionable due to potential data contamination, and evaluation efficiency
is commonly overlooked despite the substantial costs of LLM inference.
In response to these challenges, we introduce FreeEval, a
modular and scalable framework crafted to enable trustworthy and efficient
automatic evaluations of LLMs. Firstly, FreeEval's unified abstractions
simplify the integration and improve the transparency of diverse evaluation
methodologies, encompassing dynamic evaluations that demand sophisticated LLM
interactions. Secondly, the framework integrates meta-evaluation techniques
like human evaluation and data contamination detection, which, along with
dynamic evaluation modules in the platform, enhance the fairness of the
evaluation outcomes. Lastly, FreeEval is designed with a high-performance
infrastructure, including distributed computation and caching strategies,
enabling extensive evaluations across multi-node, multi-GPU clusters for
open-source and proprietary LLMs.
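
To make the caching idea concrete, here is a minimal Python sketch of memoizing inference calls so that a repeated (model, prompt, parameters) request is served from memory instead of re-running the model. It is an illustration only and not FreeEval's actual implementation: the class `CachedInference`, the `generate_fn` callback, and the in-memory dictionary are all hypothetical choices made for this example.

```python
import hashlib
import json


class CachedInference:
    """Hypothetical wrapper that memoizes calls to an inference function."""

    def __init__(self, generate_fn):
        # generate_fn(model, prompt, **params) -> str is an assumed signature.
        self._generate = generate_fn
        self._cache = {}
        self.hits = 0

    def __call__(self, model, prompt, **params):
        # Build a stable key from everything that can change the output.
        key_src = json.dumps(
            {"model": model, "prompt": prompt, "params": params},
            sort_keys=True,
        )
        key = hashlib.sha256(key_src.encode("utf-8")).hexdigest()
        if key in self._cache:
            self.hits += 1
            return self._cache[key]
        result = self._generate(model, prompt, **params)
        self._cache[key] = result
        return result
```

A cache like this is only safe to reuse when decoding is deterministic (for example, greedy decoding); with sampling, identical requests can legitimately produce different outputs.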

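Introduces FreeEval, a modular and scalable framework for trustworthy and efficient automatic evaluation of large language models. It integrates diverse evaluation methods through unified abstractions and incorporates meta-evaluation techniques such as human evaluation and data contamination detection to improve the fairness of evaluation results.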

FreeEval: A Modular Framework for Trustworthy and Efficient Evaluation of Large Language Models

Large Language Models (LLMs) have achieved impressive performance across
various reasoning tasks. However, even state-of-the-art LLMs such as ChatGPT
are prone to logical errors during their reasoning processes. Existing
solutions, which include deploying task-specific verifiers or voting over
multiple reasoning paths, either require extensive human annotations or fail in
scenarios with inconsistent responses. To address these challenges, we
introduce RankPrompt, a new prompting method that enables LLMs to self-rank
their responses without additional resources. RankPrompt breaks down the
ranking problem into a series of comparisons among diverse responses,
leveraging the inherent capabilities of LLMs to generate chains of comparison
as contextual exemplars. Our experiments across 11 arithmetic and commonsense
reasoning tasks show that RankPrompt significantly enhances the reasoning
performance of ChatGPT and GPT-4, with improvements of up to 13%. RankPrompt
also excels in LLM-based automatic evaluations for open-ended generation,
aligning with human preferences 74% of the time in the AlpacaEval set.
Moreover, RankPrompt demonstrates robustness against variations in the
orderings and consistencies of responses.
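
As an illustration of the voting baseline mentioned above, the following Python sketch takes a majority vote over the final answers of several sampled reasoning paths. It shows why voting gives no clear winner when the responses are inconsistent. This is not the RankPrompt method itself; the function name `majority_vote` and the sample answers are hypothetical and not taken from the paper.

```python
from collections import Counter


def majority_vote(answers):
    """Return (top_answer, is_decisive) for a list of final answers.

    is_decisive is False when two or more answers share the highest count,
    which is the case where plain voting cannot pick a winner.
    """
    if not answers:
        raise ValueError("answers must not be empty")
    ranked = Counter(answers).most_common()
    top_answer, top_count = ranked[0]
    tied = [answer for answer, count in ranked if count == top_count]
    return top_answer, len(tied) == 1


# Two of three paths agree, so the vote is decisive.
consistent = majority_vote(["18", "18", "20"])

# Every path disagrees, so the vote is a three-way tie.
inconsistent = majority_vote(["18", "20", "22"])
```

In the second call every answer appears once, so the vote carries no information; this is the kind of scenario the abstract describes as a failure case for voting.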

With the RankPrompt method, LLMs can self-rank their own responses, which significantly improves the reasoning performance of ChatGPT and GPT-4.

RankPrompt: Step-by-Step Comparisons Make Language Models Better Reasoners

Modern NLP defines the task of style transfer as modifying the style of a
given sentence without appreciably changing its semantics, which implies that
the outputs of style transfer systems should be paraphrases of their inputs.
However, many existing systems purportedly designed for style transfer
inherently warp the input's meaning through attribute transfer, which changes
semantic properties such as sentiment. In this paper, we reformulate
unsupervised style transfer as a paraphrase generation problem, and present a
simple methodology based on fine-tuning pretrained language models on
automatically generated paraphrase data. Despite its simplicity, our method
significantly outperforms state-of-the-art style transfer systems on both human
and automatic evaluations. We also survey 23 style transfer papers, discover
that existing automatic metrics can be easily gamed, and propose fixed variants.
Finally, we pivot to a more real-world style transfer setting by collecting a
large dataset of 15M sentences in 11 diverse styles, which we use for an
in-depth analysis of our system.

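This paper proposes a simple method based on pretrained language models that reformulates unsupervised style transfer as a paraphrase generation problem. The method significantly outperforms state-of-the-art style transfer systems on both human and automatic evaluations. The paper also finds that existing automatic metrics can be easily gamed, and it collects a large dataset covering 11 diverse styles for an in-depth analysis of the system.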