Large language models (LLMs) offer unprecedented text completion capabilities. As general models, they can fulfill a wide range of roles, including those of more specialized models. We assess the performance of GPT-4 and GPT-3.5 in zero-shot, few-shot, and fine-tuned settings on the aspect-based sentiment analysis (ABSA) task. Fine-tuned GPT-3.5 achieves a state-of-the-art F1 score of 83.8 on the joint aspect term extraction and polarity classification task of SemEval-2014 Task 4, improving upon InstructABSA [@scaria_instructabsa_2023] by 5.7%. However, this comes at the price of 1000 times more model parameters and thus increased inference cost. We discuss the cost-performance trade-offs of different models and analyze the typical errors that they make. Our results also indicate that detailed prompts improve performance in zero-shot and few-shot settings but are not necessary for fine-tuned models. This evidence is relevant for practitioners who face the choice between prompt engineering and fine-tuning when using LLMs for ABSA.
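To make the reported metric concrete, the following is a minimal sketch of how an F1 score on the joint task could be computed. It assumes, as an illustration only, that a prediction counts as correct when both the aspect term (compared case-insensitively) and its polarity exactly match a gold annotation in the same sentence, and that counts are micro-averaged over sentences. The function name `joint_f1` and the example data are hypothetical, and the exact matching criteria used in the evaluation may differ.

```python
from collections import Counter


def joint_f1(gold, pred):
    """Micro-averaged precision, recall and F1 over (aspect term, polarity) pairs.

    gold, pred: lists of sentences, each a list of (term, polarity) tuples.
    Assumption: exact, case-insensitive term match plus identical polarity.
    """
    tp = fp = fn = 0
    for gold_pairs, pred_pairs in zip(gold, pred):
        # Counters (multisets) keep repeated mentions of the same pair distinct.
        g = Counter((term.lower(), pol) for term, pol in gold_pairs)
        p = Counter((term.lower(), pol) for term, pol in pred_pairs)
        overlap = sum((g & p).values())
        tp += overlap
        fp += sum(p.values()) - overlap
        fn += sum(g.values()) - overlap
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical example: two sentences with gold and predicted annotations.
gold = [
    [("battery life", "positive"), ("screen", "negative")],
    [("service", "neutral")],
]
pred = [
    [("battery life", "positive"), ("screen", "positive")],  # wrong polarity for "screen"
    [("service", "neutral"), ("food", "positive")],          # spurious term "food"
]
precision, recall, f1 = joint_f1(gold, pred)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

Under this strict criterion, a correctly extracted term with the wrong polarity is penalized twice, once as a false positive and once as a false negative, which is one reason joint scores tend to be lower than scores for either subtask alone.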


Large Language Models for Aspect-Based Sentiment Analysis