Large language models have attracted considerable interest for their strong performance on a variety of tasks. Among them, ChatGPT, developed by OpenAI, has become extremely popular with early adopters, some of whom regard it as a disruptive technology in fields such as customer service, education, healthcare, and finance. Understanding the opinions of these early users is essential, as they can provide valuable insight into the technology's potential strengths, weaknesses, and prospects of success or failure in different areas. This research examines responses generated by ChatGPT to questions drawn from different conversational QA corpora. The study used BERT similarity scores to compare these responses with the correct answers and to obtain Natural Language Inference (NLI) labels. Evaluation scores were also computed and compared to determine the overall performance of GPT-3 and GPT-4. Additionally, the study identified instances where ChatGPT gave incorrect answers, offering insight into areas where the model may be prone to error.

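This study analyzes the responses ChatGPT generates on different conversational question-answering corpora and compares them using BERT similarity scores to obtain Natural Language Inference (NLI) labels. It also identifies cases in which ChatGPT gives incorrect answers, offering insight into areas where the model may be prone to error. Evaluation scores are used to compare the overall performance of GPT-3 and GPT-4.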
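The comparison step, in which each generated response is scored against a reference answer, can be illustrated with a minimal sketch. This is not the study's pipeline: a toy bag-of-words vector stands in for BERT embeddings so that the example runs without a model download, and the helper names (`embed`, `cosine_similarity`, `flag_suspect_answers`) and the 0.5 threshold are hypothetical. The sketch only scores pairs and flags low-similarity ones for review; it does not attempt NLI labelling.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for a BERT sentence embedding: lowercase bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)


def flag_suspect_answers(pairs, threshold=0.5):
    """Return (response, reference, score) triples scoring below the threshold.

    The threshold is an assumed value chosen for illustration only.
    """
    flagged = []
    for response, reference in pairs:
        score = cosine_similarity(embed(response), embed(reference))
        if score < threshold:
            flagged.append((response, reference, score))
    return flagged


if __name__ == "__main__":
    qa_pairs = [
        ("The capital of France is Paris", "The capital of France is Paris"),
        ("Berlin", "Paris"),
    ]
    for response, reference, score in flag_suspect_answers(qa_pairs):
        print(f"low similarity ({score:.2f}): {response!r} vs {reference!r}")
```

With real BERT embeddings, `embed` would return dense vectors from a sentence encoder, and the same cosine comparison would apply. A low score only marks a response as a candidate for manual inspection, since a correct answer can be phrased very differently from the reference.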

ChatGPT-Crawler: Checking Whether ChatGPT's Statements Are Reliable