NLP has recently made exciting progress toward training language models (LMs) with strong scientific problem-solving skills. However, model development has not focused on real-life use-cases of LMs for science, including applications in education that require processing long scientific documents. To address this, we introduce TutorEval and TutorChat. TutorEval is a diverse question-answering benchmark consisting of questions about long chapters from STEM textbooks, written by experts. TutorEval helps measure real-life usability of LMs as scientific assistants, and it is the first benchmark combining long contexts, free-form generation, and multi-disciplinary scientific knowledge. Moreover, we show that fine-tuning base models with existing dialogue datasets leads to poor performance on TutorEval. Therefore, we create TutorChat, a dataset of 80,000 long synthetic dialogues about textbooks. We use TutorChat to fine-tune Llemma models with 7B and 34B parameters. These LM tutors specialized in math have a 32K-token context window, and they excel at TutorEval while performing strongly on GSM8K and MATH. Our datasets build on open-source materials, and we release our models, data, and evaluations.

近期，NLP 在培训具备强大科研问题解决技能的语言模型方面取得了令人振奋的进展。本文通过引入 TutorEval 和 TutorChat，提出了一种针对科学教育中需要处理长篇科学文档的语言模型应用的多样化问答基准评估方法，以及使用现有对话数据集对基础模型进行微调并展示其对 TutorEval 性能的影响，进一步创建了一套长篇合成教材对话数据集 TutorChat，最终在 TutorEval、GSM8K 和 MATH 上表现出色。

语言模型作为科学导师