Recent evaluations of cross-domain text classification models aim to measure the ability of a model to obtain domain-invariant performance in a target domain given labeled samples in a source domain. The primary strategy for this evaluation relies on assumed differences between source domain samples and target domain samples in benchmark datasets. This evaluation strategy fails to account for the similarity between source and target domains, and may mask cases in which models fail to transfer what they have learned to specific target samples that are highly dissimilar from the source domain. We introduce Depth $F_1$, a novel cross-domain text classification performance metric. Designed to be complementary to existing classification metrics such as $F_1$, Depth $F_1$ measures how well a model performs on target samples that are dissimilar from the source domain. We motivate this metric using standard cross-domain text classification datasets and benchmark several recent cross-domain text classification models, with the goal of enabling in-depth evaluation of the semantic generalizability of cross-domain text classification models.
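The abstract does not give the formal definition of Depth $F_1$, so the following is only a rough illustration of the general idea of scoring a model on the target samples least similar to the source domain. It is not the authors' metric. As an assumption made purely for illustration, dissimilarity is measured here by cosine similarity to the source-domain embedding centroid, and the names `dissimilar_subset_f1` and `fraction` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def f1_score(y_true, y_pred, positive=1):
    """Binary F1 for the given positive label (0.0 if there are no positives at all)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def dissimilar_subset_f1(source_emb, target_emb, y_true, y_pred, fraction=0.5):
    """F1 restricted to the target samples least similar to the source domain.

    ASSUMPTION: similarity to the source domain is approximated by cosine
    similarity to the source centroid. This is a stand-in chosen for the
    sketch, not the dissimilarity measure defined in the paper.
    """
    dim = len(source_emb[0])
    centroid = [sum(e[i] for e in source_emb) / len(source_emb) for i in range(dim)]
    sims = [cosine(e, centroid) for e in target_emb]
    # Keep the `fraction` of target samples with the lowest similarity.
    k = max(1, int(len(target_emb) * fraction))
    keep = sorted(range(len(sims)), key=lambda i: sims[i])[:k]
    return f1_score([y_true[i] for i in keep], [y_pred[i] for i in keep])


# Toy data: two target samples close to the source, two far from it.
source_emb = [[1.0, 0.0], [1.0, 0.0]]
target_emb = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
y_true = [1, 0, 1, 1]
y_pred = [1, 0, 0, 1]

overall = f1_score(y_true, y_pred)
dissimilar = dissimilar_subset_f1(source_emb, target_emb, y_true, y_pred)
print(overall, dissimilar)
```

On this toy data the score on the dissimilar half is lower than the overall $F_1$, which is the kind of gap an aggregate metric alone would hide.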

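This study introduces Depth $F_1$, a novel performance metric for cross-domain text classification. It evaluates a model's semantic generalizability between source and target domains by measuring how well the model performs on target samples that are highly dissimilar from the source domain. By benchmarking several recent cross-domain text classification models, the study aims to enable in-depth evaluation of their semantic generalizability.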

Depth $F_1$: Improving Evaluation of Cross-Domain Text Classification by Measuring Semantic Generalizability