Medical systematic reviews are costly and resource-intensive. We explore how Large Language Models (LLMs) can support, and be trained to perform, literature screening when provided with a detailed set of selection criteria. Specifically, we instruction-tune LLaMA and Guanaco models to perform abstract screening for medical systematic reviews. Our best model, Bio-SIEVE, outperforms both ChatGPT and trained traditional approaches, and generalises better across medical domains. However, adapting the model to safety-first scenarios remains a challenge. We also explore the impact of multi-task training with Bio-SIEVE-Multi, including tasks such as PICO extraction and exclusion reasoning, but find that it is unable to match the performance of single-task Bio-SIEVE. We see Bio-SIEVE as an important step towards specialising LLMs for the biomedical systematic review process and explore opportunities for its future development. We release our models, code and a list of DOIs to reconstruct our dataset for reproducibility.

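By training on detailed selection criteria, we use large language models (LLMs) to support and perform literature screening for medical systematic reviews. Our best model, Bio-SIEVE, outperforms ChatGPT and traditional approaches and generalises better across medical domains. However, adapting the model to safety-first scenarios remains a challenge. We also examine the impact of multi-task training with Bio-SIEVE-Multi, including tasks such as PICO extraction and exclusion reasoning, but find that it cannot match the performance of single-task Bio-SIEVE. We regard Bio-SIEVE as an important step towards specialising LLMs for the biomedical systematic review process and discuss opportunities for its future development. We publicly release our models, code and a list of DOIs so that our dataset can be reconstructed.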

Bio-SIEVE: Exploring Instruction Tuning Large Language Models for Systematic Review Automation