In this study, we address the issue of API hallucinations in various software engineering contexts. We introduce CloudAPIBench, a new benchmark designed to measure how often API hallucinations occur. CloudAPIBench also annotates each API with how frequently it occurs in the public domain, allowing us to study API hallucinations at different frequency levels. Our findings reveal that Code LLMs struggle with low frequency APIs: for example, GPT-4o achieves only 38.58% valid low frequency API invocations. We demonstrate that Documentation Augmented Generation (DAG) significantly improves performance for low frequency APIs (an increase to 47.94% with DAG) but negatively impacts high frequency APIs when sub-optimal retrievers are used (a 39.02% absolute drop). To mitigate this, we propose triggering DAG intelligently: we check against an API index or leverage Code LLMs' confidence scores, and retrieve only when needed. We demonstrate that our proposed methods improve the balance between low and high frequency API performance, resulting in more reliable API invocations (an 8.20% absolute improvement on CloudAPIBench for GPT-4o).
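The idea of retrieving documentation only when needed can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the choice of the minimum token probability as the confidence score, and the 0.8 threshold are all assumptions made for illustration. The sketch triggers retrieval when a generated API name is missing from a known API index, or when the model's confidence in the generated call is low.

```python
import math
from typing import Optional, Sequence, Set


def min_token_probability(token_logprobs: Sequence[float]) -> float:
    """Lowest per-token probability of a generated API call.

    Using the minimum as the confidence score is an assumption;
    a mean or a product would be equally plausible choices.
    """
    if not token_logprobs:
        return 0.0
    return math.exp(min(token_logprobs))


def should_retrieve(
    api_name: str,
    known_apis: Set[str],
    token_logprobs: Optional[Sequence[float]] = None,
    threshold: float = 0.8,  # hypothetical cut-off, not taken from the paper
) -> bool:
    """Decide whether to trigger documentation retrieval for one API call."""
    # Index check: a name absent from the index is a likely hallucination.
    if api_name not in known_apis:
        return True
    # Confidence check: retrieve when the model was unsure about the call.
    if token_logprobs is not None:
        return min_token_probability(token_logprobs) < threshold
    # Known API and no evidence of low confidence: skip retrieval.
    return False


# Illustrative index; "s3.list_bucket_items" below is a made-up name
# standing in for a hallucinated API.
known = {"s3.list_objects_v2", "s3.put_object"}

needs_docs_unknown = should_retrieve("s3.list_bucket_items", known)
needs_docs_confident = should_retrieve("s3.list_objects_v2", known, [-0.01, -0.02])
needs_docs_unsure = should_retrieve("s3.list_objects_v2", known, [-0.01, -1.2])
```

In this sketch, a confidently generated, well-known API skips retrieval entirely, so a sub-optimal retriever cannot degrade it, while unknown or low-confidence calls still receive documentation.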

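This paper introduces CloudAPIBench, a benchmark for measuring API hallucinations that is annotated with how frequently each API occurs in the public domain. We find that Code LLMs struggle with low frequency APIs. Documentation Augmented Generation (DAG) improves performance on low frequency APIs but hurts high frequency APIs when a sub-optimal retriever is used. To mitigate this, we propose triggering DAG intelligently, deciding whether to retrieve by checking against an API index or by using Code LLMs' confidence scores. We show that our methods improve the balance between low and high frequency API performance, making API invocations on CloudAPIBench more reliable (an 8.20% absolute improvement for GPT-4o).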

Mitigating Code LLM Hallucinations with API Documentation