Research on data generation and augmentation has focused mainly on
enhancing generation models, leaving a notable gap in the exploration and
refinement of methods for evaluating synthetic data. Several text similarity
metrics can be used to filter generated data, and the choice among them can
affect performance on specific Natural Language Understanding (NLU) tasks; we
focus on intent and sentiment classification. In this study, we
propose RankAug, a text-ranking approach that detects and selects the top
augmented texts, namely those most similar in meaning to the original text
while remaining lexically and syntactically diverse. Through experiments conducted on multiple datasets, we
demonstrate that the judicious selection of filtering techniques can yield a
substantial improvement of up to 35% in classification accuracy for
under-represented classes.
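
The ranking idea can be illustrated with a minimal sketch. It is not the RankAug implementation: the abstract does not specify the scoring function, so the weighted sum, the `alpha` parameter, and all function names below are assumptions made for illustration. Bag-of-words cosine similarity stands in for a semantic similarity metric, and Jaccard distance stands in for lexical diversity; in practice a sentence-embedding similarity would likely be passed in through the `similarity` argument.

```python
import math
from collections import Counter


def tokens(text):
    """Lowercase whitespace tokenization (a deliberately simple stand-in)."""
    return text.lower().split()


def cosine_similarity(a, b):
    """Bag-of-words cosine similarity, used here as a proxy for semantic similarity."""
    ca, cb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(ca[w] * cb[w] for w in ca)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def lexical_diversity(a, b):
    """Jaccard distance between token sets: 0 for identical vocabularies, 1 for disjoint."""
    sa, sb = set(tokens(a)), set(tokens(b))
    union = sa | sb
    return 1.0 - len(sa & sb) / len(union) if union else 0.0


def rank_augmentations(source, candidates, top_k=2, alpha=0.5,
                       similarity=cosine_similarity):
    """Rank candidates by alpha * similarity + (1 - alpha) * diversity (hypothetical score)."""
    scored = [
        (alpha * similarity(source, c) + (1 - alpha) * lexical_diversity(source, c), c)
        for c in candidates
    ]
    # Highest score first; Python's sort is stable, so ties keep input order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:top_k]]


if __name__ == "__main__":
    source = "book a flight to paris"
    candidates = [
        "book a flight to paris",     # exact copy: similar but not diverse
        "reserve a flight to paris",  # close in wording, one word changed
        "the weather is nice",        # diverse but unrelated
    ]
    print(rank_augmentations(source, candidates, top_k=1))
```

With these toy inputs, the exact copy and the unrelated sentence both score lower than the paraphrase, which is the behaviour a similarity-plus-diversity ranking is meant to produce.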

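This study proposes RankAug, a text-ranking method that detects and selects the top augmented texts, those closest in meaning yet lexically and syntactically diverse, in order to improve generated-data filtering for Natural Language Understanding tasks, specifically intent and sentiment classification. Experiments on multiple datasets show that a careful choice of filtering technique can raise classification accuracy for under-represented classes by up to 35%.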

RankAug: Augmented data ranking for text classification

Algorithmic sequence alignment identifies similar segments shared between
pairs of documents, and is fundamental to many NLP tasks. But it is difficult
to recognize similarities between distant versions of narratives such as
translations and retellings, particularly for summaries and abridgements which
are much shorter than the original novels.
We develop a general approach to narrative alignment coupling the
Smith-Waterman algorithm from bioinformatics with modern text similarity
metrics. We show that the background of alignment scores fits a Gumbel
distribution, enabling us to define rigorous p-values on the significance of
any alignment. We apply and evaluate our general narrative alignment tool
(GNAT) on four distinct problem domains differing greatly in both the relative
and absolute length of documents, namely summary-to-book alignment, translated
book alignment, short story alignment, and plagiarism detection,
demonstrating the power and performance of our methods.
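
The coupling of Smith-Waterman with a text similarity metric can be sketched as follows. This is an illustrative sketch under stated assumptions, not the GNAT implementation: token-set Jaccard similarity stands in for a modern similarity metric, the `threshold` and `gap` values are arbitrary, and the Gumbel location `mu` and scale `beta` are assumed to have been fitted to background alignment scores elsewhere. All function and parameter names are hypothetical.

```python
import math


def token_jaccard(a, b):
    """Token-set Jaccard similarity in [0, 1]; a simple stand-in metric."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def smith_waterman(seq_a, seq_b, similarity=token_jaccard, threshold=0.5, gap=0.5):
    """Local alignment of two sequences of text units (e.g. sentences).

    A pair of units scores similarity - threshold, so similar units add to
    the alignment score and dissimilar ones subtract from it.
    Returns (best_score, pairs), where pairs lists aligned (index_a, index_b).
    """
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    score = [[0.0] * cols for _ in range(rows)]
    move = [[None] * cols for _ in range(rows)]
    best, best_cell = 0.0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + similarity(seq_a[i - 1], seq_b[j - 1]) - threshold
            up = score[i - 1][j] - gap
            left = score[i][j - 1] - gap
            cell = max(0.0, diag, up, left)
            score[i][j] = cell
            if cell > 0.0:
                move[i][j] = "diag" if cell == diag else ("up" if cell == up else "left")
            if cell > best:
                best, best_cell = cell, (i, j)
    # Trace back from the best cell until a zero-score cell is reached.
    pairs = []
    i, j = best_cell
    while move[i][j] is not None:
        if move[i][j] == "diag":
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif move[i][j] == "up":
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return best, pairs


def gumbel_p_value(score, mu, beta):
    """P(S >= score) for a Gumbel(mu, beta) background: 1 - exp(-exp(-(score - mu) / beta))."""
    z = (score - mu) / beta
    if z < -700.0:  # exp(-z) would overflow; the p-value is 1 to machine precision
        return 1.0
    return -math.expm1(-math.exp(-z))


if __name__ == "__main__":
    book = ["the cat sat", "dogs bark loudly", "she opened the door", "rain fell all night"]
    summary = ["intro text here", "she opened the door", "rain fell all night", "the end"]
    best, pairs = smith_waterman(book, summary)
    print(best, pairs)
    # mu and beta here are made-up values, not fitted parameters.
    print(gumbel_p_value(best, mu=0.2, beta=0.3))
```

Subtracting a threshold from the similarity keeps the expected score of unrelated pairs negative, which local alignment needs so that alignments do not extend indefinitely through unrelated text.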

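By coupling the Smith-Waterman algorithm from bioinformatics with modern text similarity metrics, we develop a general approach to narrative alignment that addresses the difficulty of recognizing similarities between distant versions of a narrative, particularly summaries and abridgements that are much shorter than the original novels. We apply and evaluate our general narrative alignment tool (GNAT) on four problem domains that differ greatly in the relative and absolute length of documents (summary-to-book alignment, translated book alignment, short story alignment, and plagiarism detection), demonstrating the power and performance of our methods.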

GNAT: A General Narrative Alignment Tool

Similar Narrative Retrieval is a crucial task since narratives are essential
for explaining and understanding events, and multiple related narratives often
help to create a holistic view of the event of interest. To accurately identify
semantically similar narratives, this paper proposes a novel narrative
similarity metric called Facet-based Narrative Similarity (FaNS), based on the
classic 5W1H facets (Who, What, When, Where, Why, and How), which are extracted
by leveraging the state-of-the-art Large Language Models (LLMs). Unlike
existing similarity metrics that only focus on overall lexical/semantic match,
FaNS provides a more granular matching along six different facets independently
and then combines them. To evaluate FaNS, we created a comprehensive dataset by
collecting narratives from AllSides, a third-party news portal. Experimental
results demonstrate that the FaNS metric exhibits a higher correlation (37%
higher) than traditional text similarity metrics that directly measure the
lexical/semantic match between narratives, demonstrating its effectiveness in
comparing the finer details between a pair of narratives.
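
The facet-wise matching can be sketched in a few lines. This is only an illustration under stated assumptions, not the FaNS implementation: the facets are assumed to have been extracted already (the paper uses LLMs for that step, which is not reproduced here), token-set Jaccard stands in for the per-facet similarity, and the weighted average used to combine the six scores is a guess, since the abstract does not say how they are combined. All names below are hypothetical.

```python
FACETS = ("who", "what", "when", "where", "why", "how")


def token_jaccard(a, b):
    """Token-set Jaccard similarity in [0, 1]; returns 0.0 when both texts are empty."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def facet_similarity(narrative_a, narrative_b, similarity=token_jaccard, weights=None):
    """Compare two narratives facet by facet, then combine the six scores.

    Each narrative is a dict mapping facet name -> extracted text.
    Returns (combined_score, per_facet_scores).
    """
    if weights is None:
        weights = {facet: 1.0 for facet in FACETS}  # equal weights (assumption)
    per_facet = {
        facet: similarity(narrative_a.get(facet, ""), narrative_b.get(facet, ""))
        for facet in FACETS
    }
    total_weight = sum(weights[facet] for facet in FACETS)
    combined = sum(weights[facet] * per_facet[facet] for facet in FACETS) / total_weight
    return combined, per_facet


if __name__ == "__main__":
    a = {"who": "city council", "what": "approved budget", "when": "monday",
         "where": "austin", "why": "", "how": "by vote"}
    b = {"who": "city council", "what": "rejected budget", "when": "tuesday",
         "where": "austin", "why": "", "how": "by vote"}
    combined, per_facet = facet_similarity(a, b)
    print(round(combined, 3), per_facet)
```

Returning the per-facet scores alongside the combined score shows where two narratives agree (here, who, where, and how) and where they diverge (what and when), which a single overall similarity score would hide.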

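To accurately identify semantically similar narratives, this paper proposes a new narrative similarity metric based on the classic 5W1H facets (Who, What, When, Where, Why, and How). The facets are extracted with state-of-the-art Large Language Models (LLMs), and the matching results along the six facets are combined to improve matching. Experimental results confirm its effectiveness in comparing finer details: compared with traditional text similarity metrics that directly measure the lexical/semantic match between narratives, correlation is 37% higher.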