Although numerous solvers have been proposed for the MaxSAT problem, and
benchmark environments such as the MaxSAT Evaluations provide a platform for
comparing state-of-the-art solvers, existing assessments have usually been
based on the quality, e.g., fitness, of the best-found solutions obtained
within a given running time budget. However, considering only the final
solutions obtained under specific time budgets may prevent us from
understanding the behavior of the solvers along the convergence process. This
paper demonstrates that Empirical Cumulative Distribution Functions can be used
to compare MaxSAT local search solvers' anytime performance across multiple
problem instances and various time budgets. The assessment reveals differences
in solvers' performance and shows that the (dis)advantages of solvers change
across running times. This work also shows that the quantitative
and high variance assessment of anytime performance can guide machines, i.e.,
automatic configurators, to search for better parameter settings. Our
experimental results show that the hyperparameter optimization tool SMAC
generally finds better parameter settings for local search when using
the anytime performance as the cost function than when using the fitness of
the best-found solutions.

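This paper presents a method that uses empirical cumulative distribution functions to compare the anytime performance of MaxSAT local search solvers across multiple problem instances and time budgets. The empirical evaluation shows that solver performance differs and that the advantages and disadvantages of solvers shift with running time. The work also shows that hyperparameter optimization using anytime performance as the cost function yields better parameter settings for local search.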
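
To make the idea of an ECDF-based anytime measure concrete, here is a minimal sketch. It is not the paper's implementation: the trajectory format, the target values, the budget grid, and the function names `ecdf_anytime` and `anytime_cost` are all assumptions made for illustration. Each run is recorded as (time, cost) pairs, with cost to be minimized. For each time budget, the ECDF value is the fraction of (run, target) pairs whose target cost has been reached.

```python
def ecdf_anytime(trajectories, targets, budgets):
    """Fraction of (run, target) pairs reached within each time budget.

    trajectories: list of runs, each a list of (time, cost) pairs
                  (hypothetical format; cost is minimized).
    targets:      cost values that count as "reached" when cost <= target.
    budgets:      time budgets at which the ECDF is evaluated.
    """
    total_pairs = len(trajectories) * len(targets)
    curve = []
    for budget in budgets:
        hits = 0
        for run in trajectories:
            # Costs of all solutions this run found within the budget.
            costs = [cost for time, cost in run if time <= budget]
            if not costs:
                continue  # nothing found yet, so no target is reached
            best = min(costs)
            hits += sum(1 for target in targets if best <= target)
        curve.append(hits / total_pairs)
    return curve


def anytime_cost(curve):
    """Assumed scalar cost for a configurator: 1 minus the mean ECDF value.

    Lower is better; a solver that reaches targets earlier scores lower.
    """
    return 1.0 - sum(curve) / len(curve)


if __name__ == "__main__":
    runs = [
        [(1, 10), (5, 4), (20, 1)],  # run 1: improves at t = 1, 5, 20
        [(2, 8), (10, 3)],           # run 2: improves at t = 2, 10
    ]
    curve = ecdf_anytime(runs, targets=[8, 4, 2], budgets=[1, 5, 10, 20])
    print(curve, anytime_cost(curve))
```

A scalar such as `anytime_cost` is the kind of value that could be handed to an automatic configurator in place of the final best-found fitness. How the paper aggregates the ECDF into a cost is not stated in the abstract, so the averaging used here is only one plausible choice.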

Better Understandings and Configurations in MaxSAT Local Search Solvers via Anytime Performance Analysis

Recent progress in generative language models has enabled machines to
generate astonishingly realistic texts. While there are many legitimate
applications of such models, there is also a rising need to distinguish
machine-generated texts from human-written ones (e.g., fake news detection).
However, to the best of our knowledge, there is currently no benchmark environment
with datasets and tasks to systematically study the so-called "Turing Test"
problem for neural text generation methods. In this work, we present the
TuringBench benchmark environment, which comprises (1) a dataset with
200K human- or machine-generated samples across 20 labels {Human, GPT-1,
GPT-2_small, GPT-2_medium, GPT-2_large, GPT-2_xl, GPT-2_PyTorch, GPT-3,
GROVER_base, GROVER_large, GROVER_mega, CTRL, XLM, XLNET_base, XLNET_large,
FAIR_wmt19, FAIR_wmt20, TRANSFORMER_XL, PPLM_distil, PPLM_gpt2}, (2) two
benchmark tasks -- i.e., Turing Test (TT) and Authorship Attribution (AA), and
(3) a website with leaderboards. Our preliminary experimental results using
TuringBench show that FAIR_wmt20 and GPT-3 are the current winners among all
language models tested: their texts are the hardest to distinguish from human
writing, yielding the lowest F1 scores across five state-of-the-art TT detection models.
The TuringBench is available at: this https URL

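This work presents the TuringBench benchmark environment, which targets the "Turing Test" problem for neural text generation methods. It comprises a dataset of 200K human- or machine-generated samples across 20 labels, two benchmark tasks, and a website with leaderboards. Preliminary experiments indicate that FAIR_wmt20 and GPT-3 generate the texts that are hardest to distinguish from human writing.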
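
The abstract above ranks generators by the F1 score that detection models achieve on the Turing Test task, where a lower F1 means the generator's text is harder to tell apart from human writing. The sketch below shows how such an F1 score can be computed for a binary human-versus-machine detector. The label strings, the choice of "machine" as the positive class, and the function name `f1_score` are assumptions for illustration; the abstract does not specify how the benchmark averages or configures its F1 metric.

```python
def f1_score(y_true, y_pred, positive="machine"):
    """Binary F1 for one detector, treating `positive` as the target class.

    y_true, y_pred: equal-length lists of labels, e.g. "human" / "machine"
                    (hypothetical label names).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no correct detections: precision and recall are both 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy labels only; these are not results from the benchmark.
    truth = ["machine", "machine", "machine", "human", "human"]
    guess = ["machine", "human", "machine", "machine", "human"]
    print(f1_score(truth, guess))
```

Computing this score separately for each generator's samples, and comparing across generators, gives the kind of ranking the abstract describes.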