Metaphor and sarcasm are common figurative expressions in everyday communication, especially on the Internet and in memes popular among teenagers. We create a new benchmark named NYK-MS (NewYorKer for Metaphor and Sarcasm), which contains 1,583 samples for metaphor understanding tasks and 1,578 samples for sarcasm understanding tasks. The tasks include deciding whether a sample contains metaphor or sarcasm, identifying which word or object carries the metaphor or sarcasm, identifying what the sample satirizes, and explaining why it contains metaphor or sarcasm. All 7 tasks are annotated by at least 3 annotators. We annotate the dataset over several rounds to improve consistency and quality, and use a GUI and GPT-4V to improve annotation efficiency. We conduct extensive experiments on the benchmark. In the zero-shot experiments, we show that Large Language Models (LLMs) and Large Multi-modal Models (LMMs) perform poorly on the classification tasks, while performance on the other 5 tasks improves as model scale increases. In the experiments on traditional pre-trained models, we show that augmentation and alignment methods improve performance, which indicates that our benchmark is consistent with previous datasets and requires models to understand both modalities.

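This work addresses the lack of datasets for multimodal metaphor and sarcasm understanding by proposing the NYK-MS benchmark, which contains 1,583 metaphor samples and 1,578 sarcasm samples annotated over several rounds for quality. The study finds that although large language models perform poorly on the classification tasks, their performance on the other metaphor and sarcasm understanding tasks improves as model scale increases, and it validates the benchmark's consistency with existing datasets.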

NYK-MS: A Well-Annotated Multimodal Metaphor and Sarcasm Understanding Benchmark on a Cartoon-Caption Dataset