We systematically study the capacity of two large language models for code,
CodeT5 and Codex, to generalize to out-of-domain data. We consider two
fundamental applications: code summarization and code generation. We split the
data into domains along its natural boundaries: by organization, by project,
and by module within a software project. This makes distinguishing in-domain
from out-of-domain data at deployment time trivial. We establish that samples
from each new domain confront both models with a significant distribution
shift. We study how well different
established methods can adapt models to better generalize to new domains. Our
experiments show that while multitask learning alone is a reasonable baseline,
combining it with few-shot finetuning on examples retrieved from training data
can achieve very strong performance. In fact, according to our experiments,
this solution can outperform direct finetuning in very low-data scenarios.
Finally, we consider variations of this approach to create a more broadly
applicable method that adapts to multiple domains at once. We find that, for
code generation, a model adapted to multiple domains simultaneously performs
on par with models adapted to each domain individually.
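
As a rough illustration of the two mechanics described above, the sketch below groups samples into domains at the organization, project, or module level, and then retrieves the training examples most similar to a handful of examples from a new domain. Everything specific here is an assumption made for illustration, not the authors' implementation: the 'org/project/module/...' path layout, the "path" and "code" field names, the helper names, and the token-level Jaccard similarity used for retrieval.

```python
from collections import defaultdict

# Hypothetical mapping from granularity to the number of leading path parts.
DEPTH = {"org": 1, "project": 2, "module": 3}


def domain_key(path, level):
    """Return the domain of a sample, assuming paths look like
    'org/project/module/.../file' (an illustrative layout)."""
    parts = path.split("/")
    return "/".join(parts[:DEPTH[level]])


def split_by_domain(samples, level):
    """Group samples by organization, project, or module."""
    groups = defaultdict(list)
    for sample in samples:
        groups[domain_key(sample["path"], level)].append(sample)
    return dict(groups)


def jaccard(a, b):
    """Token-set Jaccard similarity between two whitespace-tokenized strings."""
    set_a, set_b = set(a.split()), set(b.split())
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def retrieve(train, queries, k):
    """Return the k training samples most similar to any query sample.

    The similarity measure is an assumption; a learned embedding could
    be substituted without changing the overall flow.
    """
    scored = []
    for sample in train:
        score = max(jaccard(sample["code"], q["code"]) for q in queries)
        scored.append((score, sample))
    # Sort on the score only, so ties keep their original order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [sample for _, sample in scored[:k]]


samples = [
    {"path": "acme/web/auth/login.py", "code": "def login ( user ) :"},
    {"path": "acme/web/db/conn.py", "code": "def connect ( url ) :"},
    {"path": "other/tool/cli/main.py", "code": "def main ( ) :"},
]
by_org = split_by_domain(samples, "org")
by_module = split_by_domain(samples, "module")
few_shot = [{"code": "def login ( name ) :"}]
nearest = retrieve(samples, few_shot, k=1)
```

Because the domain is read directly off the path, deciding whether a sample is in-domain at deployment time reduces to a dictionary lookup. The retrieved samples would then serve as the finetuning set for the new domain.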

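By studying how well two large language models for code, CodeT5 and Codex, generalize to out-of-domain data, we find that combining multitask learning with finetuning on a small amount of training data adapts well to code summarization and generation across different domains.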

Exploring Distributional Shifts in Large Language Models for Code Analysis

Recent prompt-based approaches allow pretrained language models to achieve
strong performance in few-shot finetuning by reformulating downstream tasks as
language modeling problems. In this work, we demonstrate that, despite their
advantages in low-data regimes, finetuned prompt-based models for sentence pair
classification still suffer from a common pitfall: they adopt inference
heuristics based on lexical overlap, e.g., incorrectly assuming that two
sentences have the same meaning because they consist of the same set of
words. Interestingly, we find that this heuristic is significantly less
prevalent in the zero-shot evaluation of the prompt-based model, indicating
that finetuning can destroy useful knowledge learned during pretraining. We
then show that adding a regularization term that preserves the pretrained
weights effectively mitigates this destructive tendency of few-shot
finetuning. Our evaluation on three datasets demonstrates promising
improvements on the three corresponding challenge datasets used to diagnose
the inference heuristics.
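
One simple way to picture a regularizer that preserves pretraining weights is a squared-distance penalty that pulls each finetuned parameter back toward its pretrained value. The sketch below is a minimal illustration under that assumption; the abstract does not say which regularizer is used, and the function names, the plain-list parameter representation, and the `strength` coefficient are all hypothetical.

```python
def l2_to_pretrained(params, pretrained, strength):
    """Penalty strength * sum((p - p0)^2) over all parameters.

    This is an assumed form of the regularizer, chosen for simplicity.
    """
    return strength * sum((p - p0) ** 2 for p, p0 in zip(params, pretrained))


def regularized_step(params, pretrained, task_grads, lr, strength):
    """One gradient-descent step on task loss plus the penalty above.

    The penalty gradient for each parameter is 2 * strength * (p - p0),
    which points away from the pretrained value, so subtracting it
    pulls the parameter back toward where pretraining left it.
    """
    updated = []
    for p, p0, g in zip(params, pretrained, task_grads):
        total_grad = g + 2.0 * strength * (p - p0)
        updated.append(p - lr * total_grad)
    return updated


pretrained = [0.0, 2.0]
params = [1.0, 2.0]          # first weight has drifted during finetuning
penalty = l2_to_pretrained(params, pretrained, strength=0.5)
stepped = regularized_step(params, pretrained, [0.0, 0.0], lr=0.1, strength=0.5)
```

With a zero task gradient, the drifted weight moves back toward its pretrained value while the unchanged weight stays put. This is the behavior one would want from a term meant to protect knowledge acquired during pretraining.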

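This paper shows that, in sentence pair classification tasks, although pretrained language models offer advantages in low-data settings, finetuned prompt-based models still share a common flaw of relying on inference heuristics based on lexical overlap. Adding a regularization that preserves the pretrained weights mitigates this destructive tendency of finetuning and yields promising improvements on three challenge datasets.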