To adapt pre-trained vision-language models to generate insightful explanations for visual reasoning tasks when annotations are limited, we present ReVisE: a $\textbf{Re}$cursive $\textbf{Vis}$ual $\textbf{E}$xplanation algorithm. Our method iteratively computes visual features (conditioned on the text input), an answer, and an explanation, improving the explanation quality step by step until the answer converges. We find that this multi-step approach guides the model to correct its own answers and outperforms single-step explanation generation. Furthermore, explanations generated by ReVisE also serve as valuable annotations for few-shot self-training. Our approach outperforms previous methods across 10 metrics while using only 5% of the human-annotated explanations, with gains of up to 4.2 and 1.3 in BLEU-1 score on the VCR and VQA-X datasets, respectively, underscoring the efficacy and data-efficiency of our method.
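
The recursive procedure described above can be illustrated with a minimal sketch. This is not the authors' implementation: the `step` callable, its signature, the `max_steps` cap, and the exact-match convergence test are all assumptions made for illustration. `step` stands in for a vision-language model that is assumed to recompute text-conditioned visual features internally and return an answer and an explanation.

```python
from typing import Callable, Optional, Tuple

# Hypothetical signature: (image, question, previous_explanation) -> (answer, explanation).
# The callable is assumed to recompute text-conditioned visual features on every call.
StepFn = Callable[[object, str, str], Tuple[str, str]]


def recursive_explain(
    step: StepFn, image: object, question: str, max_steps: int = 5
) -> Tuple[Optional[str], str, int]:
    """Call `step` repeatedly, feeding back the previous explanation,
    until the answer matches the one from the previous step."""
    prev_answer: Optional[str] = None
    explanation = ""
    for n in range(1, max_steps + 1):
        answer, explanation = step(image, question, explanation)
        if answer == prev_answer:  # assumed convergence test: answer unchanged
            return answer, explanation, n
        prev_answer = answer
    # No convergence within max_steps: return the last answer and explanation.
    return prev_answer, explanation, max_steps


def toy_step(image: object, question: str, prev_explanation: str) -> Tuple[str, str]:
    # Stand-in for a real model: wrong on the first pass, then corrects itself
    # once it is given its earlier explanation.
    if not prev_explanation:
        return "cat", "it has pointed ears"
    return "dog", "it has pointed ears and is on a leash"


answer, explanation, steps = recursive_explain(
    toy_step, image=None, question="What animal is this?"
)
print(answer, steps)
```

With the toy model, the first step answers "cat", the second revises to "dog", and the third repeats "dog", so the loop stops there. A real system would replace `toy_step` with calls to an actual vision-language model.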

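For visual reasoning tasks with limited annotations, we propose a recursive visual explanation algorithm (ReVisE) that improves explanation quality by computing visual features, an answer, and an explanation step by step; the resulting explanations also serve as valuable annotations for few-shot self-training. The method outperforms previous methods on several metrics while using only 5% of the human-annotated data, raising BLEU-1 scores on the VCR and VQA-X datasets by 4.2 and 1.3, respectively, which underscores its effectiveness and data efficiency.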

From Wrong to Right: A Recursive Approach to Vision-Language Explanation