I-Tuning: Lightweight Image Captioning by Tuning a Frozen Language Model with Images

Feb 2022

I-Tuning: Tuning Language Models with Image for Caption Generation

Ziyang Luo, Yadong Xi, Rongsheng Zhang, Jing Ma

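TL;DR: This paper presents I-Tuning, a lightweight image-captioning framework with few trainable parameters. It introduces a novel I-Tuning cross-attention module that connects the pre-trained language decoder GPT2 to the vision encoder CLIP-ViT. Experiments show that the framework matches or outperforms large-scale baseline systems while using up to 10 times fewer trainable parameters and less training data.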
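To make the idea of a cross-attention module bridging two frozen backbones concrete, here is a minimal single-head sketch in pure Python. It assumes queries come from the frozen language model's hidden states, keys and values come from the frozen vision encoder's patch features, and the attention output is added back to the hidden states under a scaling factor. The class name `CrossAttentionBridge`, the bottleneck width `d_attn`, the `scale` factor, and the toy dimensions are hypothetical illustrations, not the paper's exact I-Tuning design; GPT2 and CLIP-ViT activations are replaced by small random matrices.

```python
import math
import random
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Multiply an (n x k) matrix by a (k x m) matrix."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of attention scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    """Small Gaussian initialization (a hypothetical choice for this sketch)."""
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]


class CrossAttentionBridge:
    """Hypothetical lightweight cross-attention block.

    Queries: frozen language-model hidden states (one row per text token).
    Keys/values: frozen vision-encoder features (one row per image patch).
    Only the four projection matrices would be trained; the backbones stay frozen.
    """

    def __init__(self, d_text: int, d_image: int, d_attn: int, scale: float, seed: int = 0):
        rng = random.Random(seed)
        self.w_q = random_matrix(d_text, d_attn, rng)   # text hidden -> query
        self.w_k = random_matrix(d_image, d_attn, rng)  # image feature -> key
        self.w_v = random_matrix(d_image, d_attn, rng)  # image feature -> value
        self.w_o = random_matrix(d_attn, d_text, rng)   # attended context -> text width
        self.scale = scale  # assumed residual scaling factor
        self.d_attn = d_attn

    def num_trainable(self) -> int:
        """Count parameters in the projection matrices only."""
        return sum(len(w) * len(w[0]) for w in (self.w_q, self.w_k, self.w_v, self.w_o))

    def __call__(self, text_h: Matrix, image_f: Matrix) -> Matrix:
        q = matmul(text_h, self.w_q)
        k = matmul(image_f, self.w_k)
        v = matmul(image_f, self.w_v)
        # Scaled dot-product scores: one row per text token, one column per image patch.
        scores = [[sum(qi[d] * kj[d] for d in range(self.d_attn)) / math.sqrt(self.d_attn)
                   for kj in k] for qi in q]
        attn = [softmax(row) for row in scores]
        context = matmul(attn, v)
        delta = matmul(context, self.w_o)
        # Residual update: frozen hidden states are shifted by visual context, not replaced.
        return [[h + self.scale * d for h, d in zip(h_row, d_row)]
                for h_row, d_row in zip(text_h, delta)]


if __name__ == "__main__":
    rng = random.Random(1)
    text_h = random_matrix(3, 8, rng)   # toy: 3 text tokens, hidden width 8
    image_f = random_matrix(5, 6, rng)  # toy: 5 image patches, feature width 6
    bridge = CrossAttentionBridge(d_text=8, d_image=6, d_attn=4, scale=0.1)
    out = bridge(text_h, image_f)
    print(len(out), len(out[0]))   # 3 8
    print(bridge.num_trainable())  # 112
```

In this sketch only `w_q`, `w_k`, `w_v`, and `w_o` count toward the trainable total, while the backbone activations enter as fixed inputs; that separation is where parameter savings of this style of tuning would come from. Setting `scale` to zero leaves the language model's hidden states untouched.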

Abstract

Recently, tuning pre-trained language models (PLMs) in a parameter-efficient manner has become a popular topic in natural language processing. However, most of this work focuses on tuning PLMs with text-only information. In this work, we propose a new perspective: tuning a frozen PLM with images for caption generation. We denote our method as I-Tun