This paper investigates the impact of domain-specific model fine-tuning and of reasoning mechanisms on the performance of question-answering (Q&A) systems powered by large language models (LLMs) and Retrieval-Augmented Generation (RAG). Using the FinanceBench SEC financial filings dataset, we observe that, for RAG, combining a fine-tuned embedding model with a fine-tuned LLM achieves better accuracy than generic models, with relatively greater gains attributable to fine-tuned embedding models. Additionally, employing reasoning iterations on top of RAG delivers an even bigger jump in performance, enabling the Q&A systems to get closer to human-expert quality. We discuss the implications of such findings, propose a structured technical design space capturing major technical components of Q&A AI, and provide recommendations for making high-impact technical choices for such components. We plan to follow up on this work with actionable guides for AI teams and further investigations into the impact of domain-specific augmentation in RAG and into agentic AI capabilities such as advanced planning and reasoning.


Enhancing Q&A with Domain-Specific Fine-Tuning and Iterative Reasoning: A Comparative Study