Long-sequence transformers are designed to improve the representation of
longer texts by language models and their performance on downstream
document-level tasks. However, not much is understood about the quality of
token-level predictions in long-form models. We investigate the performance of
such architectures in the context of document classification with unsupervised
rationale extraction. We find standard soft attention methods to perform
significantly worse when combined with the Longformer language model. We
propose a compositional soft attention architecture that applies RoBERTa
sentence-wise to extract plausible rationales at the token-level. We find this
method to significantly outperform Longformer-driven baselines on sentiment
classification datasets, while also exhibiting significantly lower runtimes.

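This paper studies how accurately long-text language models make token-level predictions. It proposes a compositional soft attention architecture that applies RoBERTa sentence by sentence to extract plausible rationales, and finds that the method outperforms Longformer-driven baselines on sentiment classification datasets while also running faster.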

Finding the Needle in a Haystack: Unsupervised Rationale Extraction from Long Text Classifiers
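
The sentence-wise composition described in the abstract above can be illustrated with a small sketch. This is not the authors' architecture: it assumes plain softmax attention at both levels, and the raw scores are hypothetical placeholders standing in for whatever a sentence encoder such as RoBERTa would produce. The function name `compositional_attention` is likewise made up for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def compositional_attention(token_scores_per_sentence, sentence_scores):
    """Combine token-level and sentence-level attention (illustrative only).

    Each token's document-level weight is its attention weight inside its
    own sentence multiplied by the attention weight of that sentence.
    """
    sentence_weights = softmax(sentence_scores)
    doc_weights = []
    for sent_w, tok_scores in zip(sentence_weights, token_scores_per_sentence):
        tok_weights = softmax(tok_scores)
        doc_weights.append([sent_w * w for w in tok_weights])
    return doc_weights


# Hypothetical raw scores for a two-sentence document.
token_scores = [[0.1, 2.0, 0.3], [0.0, 0.5]]
sentence_scores = [1.5, 0.2]

weights = compositional_attention(token_scores, sentence_scores)

# Rank tokens by document-level weight; the top ones act as the rationale.
ranked = sorted(
    ((w, s, t) for s, sent in enumerate(weights) for t, w in enumerate(sent)),
    reverse=True,
)
top_weight, top_sentence, top_token = ranked[0]
```

Because each encoder call only sees one sentence, no single forward pass has to cover the whole document, which is consistent with the lower runtimes the abstract reports. The weights over all tokens still sum to one, so they can be read as a distribution over the full document.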

Natural language is generated by people, yet traditional language modeling
views words or documents as if generated independently. Here, we propose human
language modeling (HuLM), a hierarchical extension to the language modeling
problem whereby a human-level exists to connect sequences of documents (e.g.
social media messages) and capture the notion that human language is moderated
by changing human states. We introduce HaRT, a large-scale transformer model
for the HuLM task, pre-trained on approximately 100,000 social media users, and
demonstrate its effectiveness in terms of both language modeling (perplexity)
for social media and fine-tuning for 4 downstream tasks spanning document- and
user-levels: stance detection, sentiment classification, age estimation, and
personality assessment. Results on all tasks meet or surpass the current
state-of-the-art.

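This paper introduces human language modeling (HuLM), an approach that models language as produced by people, together with HaRT, a large-scale transformer model for the task. The results show that HaRT is effective at language modeling on social media text and adapts well to document- and user-level tasks, meeting or surpassing the state of the art.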

Human Language Modeling

Transformers are not suited for processing long documents, due to their
quadratically increasing memory and time consumption. Simply truncating a long
document or applying the sparse attention mechanism will incur the context
fragmentation problem or lead to an inferior modeling capability against
comparable model sizes. In this paper, we propose ERNIE-Doc, a document-level
language pretraining model based on Recurrence Transformers. Two well-designed
techniques, namely the retrospective feed mechanism and the enhanced recurrence
mechanism, enable ERNIE-Doc, which has a much longer effective context length,
to capture the contextual information of a complete document. We pretrain
ERNIE-Doc to explicitly learn the relationships among segments with an
additional document-aware segment-reordering objective. Various experiments
were conducted on both English and Chinese document-level tasks. ERNIE-Doc
improved the state-of-the-art language modeling result of perplexity to 16.8 on
WikiText-103. Moreover, it outperformed competitive pretraining models by a
large margin on most language understanding tasks, such as text classification
and question answering.

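This paper proposes ERNIE-Doc, a document-level language pretraining model based on Recurrence Transformers. A retrospective feed mechanism and an enhanced recurrence mechanism improve its ability to handle long documents. Experiments on English and Chinese document-level tasks show that ERNIE-Doc outperforms other models on language understanding tasks such as text classification and question answering.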

ERNIE-Doc: A Retrospective Long-Document Modeling Transformer
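
The quadratic cost that motivates the abstract above can be made concrete with a little arithmetic. The sketch below only counts attention-score entries per layer and head; it is not ERNIE-Doc's method, and the helper names are hypothetical. It assumes each segment attends only to itself, which is the plain chunking setup that causes the context fragmentation the abstract mentions. Recurrence-based models also attend to cached states from earlier segments, so their real cost sits somewhat above the segmented figure.

```python
def full_attention_entries(n_tokens):
    """Number of attention scores when every token attends to every token."""
    return n_tokens * n_tokens


def segmented_attention_entries(n_tokens, segment_len):
    """Number of attention scores when tokens attend only within their segment."""
    full_segments, remainder = divmod(n_tokens, segment_len)
    return full_segments * segment_len**2 + remainder**2


n_tokens = 4096
segment_len = 512

full = full_attention_entries(n_tokens)
segmented = segmented_attention_entries(n_tokens, segment_len)

print(f"full attention:      {full:,} entries")
print(f"segmented attention: {segmented:,} entries")
print(f"reduction factor:    {full / segmented:.1f}x")
```

With these example sizes the segmented variant needs one eighth of the entries, and the gap widens linearly as the document grows while the segment length stays fixed.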

Hierarchical neural architectures are often used to capture long-distance
dependencies and have been applied to many document-level tasks such as
summarization, document segmentation, and sentiment analysis. However,
effective usage of such a large context can be difficult to learn, especially
in the case where there is limited labeled data available. Building on the
recent success of language model pretraining methods for learning flat
representations of text, we propose algorithms for pre-training hierarchical
document representations from unlabeled data. Unlike prior work, which has
focused on pre-training contextual token representations or context-independent
sentence/paragraph representations, our hierarchical document representations
include fixed-length sentence/paragraph representations which integrate
contextual information from the entire document. Experiments on document
segmentation, document-level question answering, and extractive document
summarization demonstrate the effectiveness of the proposed pre-training
algorithms.

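This paper pre-trains hierarchical document representations from unlabeled data. The representations include fixed-length sentence/paragraph representations that integrate contextual information from the entire document, and they prove effective on document segmentation, document-level question answering, and extractive document summarization.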