Lemmatization is a Natural Language Processing (NLP) task that consists of producing, from a given inflected word, its canonical form or lemma. Lemmatization is one of the basic tasks that facilitate downstream NLP applications, and is of particular importance for highly inflected languages. Given that the process of obtaining a lemma from an inflected word can be explained by looking at its morphosyntactic category, including fine-grained morphosyntactic information to train contextual lemmatizers has become common practice, without analyzing whether this is optimal in terms of downstream performance. Thus, in this paper we empirically investigate the role of morphological information in developing contextual lemmatizers in six languages within a varied spectrum of morphological complexity: Basque, Turkish, Russian, Czech, Spanish and English. Furthermore, and unlike the vast majority of previous work, we also evaluate lemmatizers in out-of-domain settings, which constitute, after all, their most common use case. The results of our study are rather surprising: (i) providing lemmatizers with fine-grained morphological features during training is not that beneficial, not even for agglutinative languages; (ii) in fact, modern contextual word representations seem to implicitly encode enough morphological information to obtain good contextual lemmatizers without seeing any explicit morphological signal; (iii) the best lemmatizers out-of-domain are those using simple UPOS tags or those trained without morphology; (iv) current evaluation practices for lemmatization are not adequate to clearly discriminate between models.

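This paper empirically examines how using different morphological features to develop contextual lemmatizers affects downstream performance across six languages, and finds that: (i) providing lemmatizers with fine-grained morphological features during training is not that beneficial, not even for agglutinative languages; (ii) in fact, modern contextual word representations seem to implicitly encode enough morphological information to obtain good contextual lemmatizers without seeing any explicit morphological signal; (iii) the best out-of-domain lemmatizers are those using simple UPOS tags or those trained without morphology; (iv) current evaluation practices for lemmatization are not adequate to clearly discriminate between models.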

On the Role of Morphological Information for Contextual Lemmatization