Human linguistic capacity is often characterized by compositionality and the
generalization it enables: human learners can produce and comprehend novel
complex expressions by composing known parts. Several benchmarks gauge
compositional generalization by controlling the distribution of training and
test data, so that certain lexical items occur only in limited contexts
during training. While recent work using these benchmarks suggests that
pretrained models achieve impressive generalization performance, we argue that
exposure to pretraining data may break the aforementioned distributional
control. Using the COGS benchmark of Kim and Linzen (2020), we test two
modified evaluation setups that control for this issue: (1) substituting
context-controlled lexical items with novel character sequences, and (2)
substituting them with special tokens represented by novel embeddings. We find
that both of these setups lead to lower generalization performance in T5
(Raffel et al., 2020), suggesting that previously reported results have been
overestimated due to uncontrolled lexical exposure during pretraining. The
performance degradation is more extreme with novel embeddings, and the
degradation increases with the amount of pretraining data, highlighting an
interesting case of inverse scaling.
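
As a rough illustration of setup (1), the sketch below replaces
context-controlled lexical items with random character sequences, applying the
same mapping to the input sentence and to its output form. It is a minimal
sketch under stated assumptions, not the procedure used in the experiments:
the helper names (make_novel_string, build_substitutions, substitute), the
string length, the whitespace tokenization, and the toy sentence are all
hypothetical.

```python
import random
import string


def make_novel_string(rng, length=5):
    """Draw a random lowercase string (hypothetical stand-in for a novel word)."""
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))


def build_substitutions(controlled_items, known_vocab, seed=0):
    """Map each controlled lexical item to a string absent from known_vocab."""
    rng = random.Random(seed)
    used = set(known_vocab)
    mapping = {}
    for item in controlled_items:
        candidate = make_novel_string(rng)
        # Resample until the string collides with no known or assigned word.
        while candidate in used:
            candidate = make_novel_string(rng)
        mapping[item] = candidate
        used.add(candidate)
    return mapping


def substitute(text, mapping):
    """Replace whitespace-separated tokens that appear in the mapping."""
    return " ".join(mapping.get(tok, tok) for tok in text.split())


# Toy example (assumed format): the same mapping is applied to both sides.
vocab = {"a", "hedgehog", "ate", "the", "cake"}
mapping = build_substitutions(["hedgehog"], vocab, seed=0)
source = substitute("A hedgehog ate the cake .", mapping)
target = substitute("hedgehog ( x _ 1 )", mapping)
```

Setup (2) differs in that each controlled item would instead be mapped to a
new special token whose embedding is freshly initialized rather than composed
from pretrained subword embeddings.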
