Information extraction and textual comprehension from materials literature
are vital for developing an exhaustive knowledge base that enables accelerated
materials discovery. Language models have demonstrated their capability to
answer domain-specific questions and retrieve information from knowledge bases.
However, there are no benchmark datasets in the materials domain that can
evaluate how well these language models understand the key concepts. In
this work, we curate a dataset of 650 challenging questions from the materials
domain that require the knowledge and skills of a materials science student who
has completed an undergraduate degree. We classify these questions based on their
structure and the materials science domain-based subcategories. Further, we
evaluate the performance of GPT-3.5 and GPT-4 on these questions using
zero-shot and chain-of-thought prompting. GPT-4 performs best (~62% accuracy),
outperforming GPT-3.5. Interestingly, in contrast to what is generally
observed, chain-of-thought prompting yields no significant improvement in
accuracy. To understand the models' limitations, we perform an error analysis,
which reveals that conceptual errors (~64%) contribute more than computational
errors (~36%) to the reduced performance of the LLMs. We hope that the dataset
and analysis presented in this
work will promote further research in developing better materials science
domain-specific LLMs and strategies for information extraction.

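Using 650 challenging questions from the materials domain, we evaluate the
question-answering performance of GPT-3.5 and GPT-4 under zero-shot and
chain-of-thought prompting. GPT-4 achieves the highest accuracy (about 62%),
while chain-of-thought prompting yields no significant improvement in accuracy.
Our error analysis shows that conceptual errors (64%) are the main cause of the
reduced performance of the LLMs, with computational errors (36%) playing a
secondary role. We hope that the dataset and analysis in this work will promote
research on materials science domain-specific LLMs and on strategies for
information extraction.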