Large language models (LLMs) have reached human-like proficiency in
generating diverse textual content, underscoring the necessity for effective
fake text detection to avoid potential risks such as fake news in social media.
Previous research has mostly tested single models on in-distribution datasets,
limiting our understanding of how these models perform on different types of
data in the LLM-generated text detection task. We investigated this by testing
five specialized transformer-based models on both in-distribution and
out-of-distribution datasets to better assess their performance and
generalizability. Our results revealed that single transformer-based
classifiers achieved decent performance on the in-distribution dataset but
limited generalization ability on the out-of-distribution dataset. To improve
generalization, we combined the individual classifier models using adaptive
ensemble algorithms,
which improved the average accuracy significantly from 91.8% to 99.2% on an
in-distribution test set and from 62.9% to 72.5% on an out-of-distribution test
set. The results indicate the effectiveness, good generalization ability, and
great potential of adaptive ensemble algorithms in LLM-generated text
detection.
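
The abstract does not spell out the adaptive ensemble algorithm, so the following is only a minimal sketch of one common way to combine several binary classifiers: a weighted soft vote whose weights are derived from each classifier's validation accuracy. The function names (`fit_weights`, `ensemble_predict`), the `temperature` parameter, and the toy probabilities are all hypothetical and are not taken from the paper.

```python
import math

def fit_weights(val_probs, val_labels, temperature=0.05):
    """Assumed weighting scheme: softmax over each classifier's validation accuracy.

    val_probs[k][i] is classifier k's predicted P(LLM-generated) for example i.
    """
    accuracies = []
    for probs in val_probs:
        preds = [1 if p >= 0.5 else 0 for p in probs]
        correct = sum(int(p == y) for p, y in zip(preds, val_labels))
        accuracies.append(correct / len(val_labels))
    # A lower temperature concentrates weight on the most accurate classifiers.
    exps = [math.exp(a / temperature) for a in accuracies]
    total = sum(exps)
    return [e / total for e in exps]

def ensemble_predict(test_probs, weights):
    """Weighted soft vote: combine per-classifier probabilities, then threshold at 0.5."""
    n_examples = len(test_probs[0])
    combined = [
        sum(w * probs[i] for w, probs in zip(weights, test_probs))
        for i in range(n_examples)
    ]
    return [1 if p >= 0.5 else 0 for p in combined]

# Toy, made-up probabilities for three classifiers (1 = LLM-generated, 0 = human).
val_labels = [1, 0, 1, 0]
val_probs = [
    [0.9, 0.2, 0.8, 0.1],  # all four validation examples correct
    [0.6, 0.7, 0.4, 0.3],  # two of four correct
    [0.7, 0.4, 0.6, 0.6],  # three of four correct
]
weights = fit_weights(val_probs, val_labels)

test_probs = [[0.8, 0.3], [0.2, 0.9], [0.6, 0.4]]
preds = ensemble_predict(test_probs, weights)
print(weights, preds)
```

In this sketch the weights are fitted once on held-out data. A scheme that re-estimates them per input or per domain would be a different, and arguably more "adaptive", design.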

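Large language models generate textual content with near-human diversity, so effective fake text detection is needed to avoid potential risks such as fake news on social media. This study tests five specialized Transformer-based models on in-distribution and out-of-distribution datasets to examine their performance and generalization ability on the LLM-generated text detection task. The results show that single Transformer-based classifiers achieve decent performance on the in-distribution dataset but have limited generalization ability on the out-of-distribution dataset. To improve this, we combine the individual classifier models using adaptive ensemble algorithms, raising the average accuracy from 91.8% to 99.2% on the in-distribution test set and from 62.9% to 72.5% on the out-of-distribution test set. The results indicate the effectiveness, good generalization ability, and great potential of adaptive ensemble algorithms in LLM-generated text detection.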

Adaptive Ensembles of Fine-Tuned Transformers for LLM-Generated Text Detection

Harnessing logical reasoning ability is a comprehensive natural language
understanding endeavor. With the release of Generative Pretrained Transformer 4
(GPT-4), highlighted as "advanced" at reasoning tasks, we are eager to learn
how GPT-4 performs on various logical reasoning tasks. This report analyses
multiple logical reasoning datasets, including popular benchmarks such as
LogiQA and ReClor and newly released datasets such as AR-LSAT. We test the multi-choice
reading comprehension and natural language inference tasks with benchmarks
requiring logical reasoning. We further construct a logical reasoning
out-of-distribution dataset to investigate the robustness of ChatGPT and GPT-4.
We also compare the performance of ChatGPT and GPT-4. Experimental
results show that ChatGPT performs significantly better than the RoBERTa
fine-tuning method on most logical reasoning benchmarks. GPT-4 shows even
higher performance on our manual tests. Among benchmarks, ChatGPT and GPT-4 do
relatively well on well-known datasets like LogiQA and ReClor. However, the
performance drops significantly when handling newly released and
out-of-distribution datasets. Logical reasoning remains challenging for ChatGPT
and GPT-4, especially on out-of-distribution and natural language inference
datasets.

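This study evaluates GPT-4's performance on logical reasoning tasks, testing it on multiple logical reasoning datasets and running experiments on a newly constructed out-of-distribution logical reasoning dataset. The conclusions show that although GPT-4 performs well, logical reasoning remains a challenge for ChatGPT and GPT-4.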