Domain adaptation aims to enable Large Language Models (LLMs) to generalize effectively to domain datasets unseen during the training phase. However, factors such as the number of model parameters and the scale of training data are general influences and do not capture the nuances of domain adaptation performance. This paper investigates the fine-grained factors affecting domain adaptation performance by analyzing the specific impact of `words' in training data on summarization tasks. We propose quantifying dataset learning difficulty as the learning difficulty of generative summarization, which is determined by two indicators: word-based compression rate and abstraction level. Our experiments show that, when dataset learning difficulty is taken into account, cross-domain overlap and performance gain in summarization tasks exhibit an approximately linear relationship that is not directly related to the number of words. Based on this finding, it is possible to predict a model's performance on unknown domain datasets without any training.
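To make the two indicators concrete, the following is a minimal sketch rather than the paper's exact formulation. It assumes that word-based compression rate is the ratio of source word count to summary word count, and that abstraction level is the fraction of summary words that do not appear in the source. The function names `compression_rate` and `abstraction_level`, as well as the whitespace tokenizer, are hypothetical choices made for illustration.

```python
def tokenize(text):
    """Lowercase and split on whitespace (an assumed, deliberately simple tokenizer)."""
    return text.lower().split()


def compression_rate(source, summary):
    """Assumed definition: number of source words divided by number of summary words."""
    source_words = tokenize(source)
    summary_words = tokenize(summary)
    if not summary_words:
        raise ValueError("summary must contain at least one word")
    return len(source_words) / len(summary_words)


def abstraction_level(source, summary):
    """Assumed definition: fraction of summary words that never occur in the source."""
    source_vocab = set(tokenize(source))
    summary_words = tokenize(summary)
    if not summary_words:
        raise ValueError("summary must contain at least one word")
    novel = [word for word in summary_words if word not in source_vocab]
    return len(novel) / len(summary_words)


if __name__ == "__main__":
    source = "the cat sat on the mat near the door"
    summary = "a cat rested"
    print(compression_rate(source, summary))
    print(abstraction_level(source, summary))
```

Under these assumed definitions, a higher compression rate or a higher abstraction level would indicate a dataset that is harder to learn for generative summarization.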

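By analyzing the specific impact of `words' in training data on summarization tasks, this paper studies how fine-grained factors affect domain adaptation performance, proposes quantifying dataset learning difficulty as the learning difficulty of generative summarization, and concludes experimentally that cross-domain overlap and performance gain in summarization tasks exhibit an approximately linear relationship, making it possible to predict model performance on unknown domain datasets without training.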

Word Matters: What Influences Domain Adaptation in Summarization?