Vision-Language Models (VLMs) have achieved remarkable success in various
multi-modal tasks, but they are often bottlenecked by the limited context
window and high computational cost of processing high-resolution image inputs
and videos. Vision compression can alleviate this problem by reducing the
vision token count. Previous approaches compress vision tokens with external
modules and force LLMs to understand the compressed ones, leading to visual
information loss. However, the LLMs' understanding paradigm of vision tokens is
not fully utilised in the compression learning process. We propose VoCo-LLaMA,
the first approach to compress vision tokens using LLMs. By introducing Vision
Compression tokens during the vision instruction tuning phase and leveraging
attention distillation, our method distills how LLMs comprehend vision tokens
into their processing of VoCo tokens. VoCo-LLaMA facilitates effective vision
compression and improves the computational efficiency during the inference
stage. Specifically, our method achieves minimal performance loss with a
compression ratio of 576$\times$, resulting in up to 94.8$\%$ fewer FLOPs and
69.6$\%$ acceleration in inference time. Furthermore, through continuous
training using time-series compressed token sequences of video frames,
VoCo-LLaMA demonstrates the ability to understand temporal correlations,
outperforming previous methods on popular video question-answering benchmarks.
Our approach presents a promising way to unlock the full potential of VLMs'
contextual window, enabling more scalable multi-modal applications. The project
page, along with the associated code, can be accessed via
this https URL.
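
As a rough illustration of how compression tokens could be wired into an LLM's attention, the sketch below builds a mask over a sequence laid out as vision tokens, then VoCo tokens, then text tokens. The abstract does not spell out the masking rule, so the layout, the rule that text tokens may not attend to raw vision tokens, and the helper names `build_voco_mask` and `compression_ratio` are all assumptions made for illustration. Under that assumption, compressing 576 vision tokens into a single VoCo token would give the 576x ratio quoted above.

```python
def build_voco_mask(n_vis, n_voco, n_txt):
    """Return mask[q][k], True when query position q may attend to key position k.

    Assumed layout: [vision tokens][VoCo tokens][text tokens].
    Assumed rule: ordinary causal attention, except that text tokens
    cannot see the raw vision tokens and must read them through the
    VoCo tokens instead.
    """
    n = n_vis + n_voco + n_txt
    text_start = n_vis + n_voco
    mask = []
    for q in range(n):
        row = []
        for k in range(n):
            allowed = k <= q  # causal: no attention to future positions
            if q >= text_start and k < n_vis:
                allowed = False  # text is blocked from raw vision tokens
            row.append(allowed)
        mask.append(row)
    return mask


def compression_ratio(n_vis, n_voco):
    """Number of vision tokens represented by each VoCo token."""
    return n_vis / n_voco


if __name__ == "__main__":
    # Tiny example: 4 vision tokens, 1 VoCo token, 2 text tokens.
    for row in build_voco_mask(4, 1, 2):
        print("".join("x" if allowed else "." for allowed in row))
    print(compression_ratio(576, 1))
```

With such a mask, only the VoCo tokens' cached states would need to be kept at inference time, which is one plausible reading of where the reported compute savings come from.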

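Summary: Built on Vision-Language Models, VoCo-LLaMA introduces Vision Compression tokens and uses attention distillation to compress vision tokens and improve inference efficiency. It can also capture temporal correlations and shows broad potential for multi-modal applications.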

VoCo-LLaMA: Towards Vision Compression with Large Language Models

Graph Convolutional Neural Networks (GCNs) possess strong capabilities for
processing graph data in non-grid domains. They can capture the topological
structure and node features in graphs and integrate them into nodes'
final representations. GCNs have been extensively studied in various fields,
such as recommendation systems, social networks, and protein molecular
structures. With the increasing application of graph neural networks, research
has focused on improving their performance while compressing their size. In
this work, a plug-in module named Graph Knowledge Enhancement and Distillation
Module (GKEDM) is proposed. GKEDM can enhance node representations and improve
the performance of GCNs by extracting and aggregating graph information via
a multi-head attention mechanism. Furthermore, GKEDM can serve as an auxiliary
transferor for knowledge distillation. With a specially designed attention
distillation method, GKEDM can distill the knowledge of large teacher models
into high-performance and compact student models. Experiments on multiple
datasets demonstrate that GKEDM can significantly improve the performance of
various GCNs with minimal overhead. Furthermore, it can efficiently transfer
distilled knowledge from large teacher networks to small student networks via
attention distillation.
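
To make the idea of attention distillation between a teacher and a student concrete, here is a minimal sketch that compares, node by node, the attention distribution each model places over a node's neighbors. The abstract does not give the loss used by GKEDM, so the KL-divergence form and the function names below are assumptions chosen for illustration, not the paper's formulation.

```python
import math


def softmax(scores):
    """Turn raw attention scores into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def attention_distillation_loss(teacher_scores, student_scores):
    """Mean KL divergence between teacher and student attention.

    Each argument is a list with one entry per node; every entry holds
    that node's raw attention scores over its neighbors (assumed layout).
    """
    losses = [
        kl_divergence(softmax(t), softmax(s))
        for t, s in zip(teacher_scores, student_scores)
    ]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    teacher = [[2.0, 0.5, 0.1], [0.3, 1.2]]
    student = [[1.0, 1.0, 1.0], [0.0, 0.0]]
    print(attention_distillation_loss(teacher, student))
```

The loss is zero when the student reproduces the teacher's attention pattern exactly and grows as the two patterns diverge, so minimizing it pushes the student to attend to the same neighbors as the teacher.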

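Summary: Proposes a plug-in module, the Graph Knowledge Enhancement and Distillation Module (GKEDM), which extracts and aggregates graph information via a multi-head attention mechanism to enhance node representations. Through a specially designed attention distillation method, it also transfers the knowledge of large teacher models into compact, high-performance student models.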

Attention is all you need for boosting graph convolutional neural network

The retrieval-augmented generation framework can address the limitations of
large language models by enabling real-time knowledge updates for more accurate
answers. An efficient approach to training retrieval-augmented models is
attention distillation, which uses attention scores as a supervision signal
instead of manually annotated query-document pairs. Despite its growing
popularity, the detailed mechanisms behind the success of attention
distillation remain unexplored, particularly the specific patterns it leverages
to benefit training. In this paper, we address this gap by conducting a
comprehensive review of the attention distillation workflow and identifying key
factors influencing the learning quality of retrieval-augmented language
models. We further propose indicators for optimizing models' training methods
and avoiding ineffective training.
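
The following sketch shows one way attention scores can stand in for annotated query-document pairs. The reader's attention over each retrieved document's tokens is summed into a per-document target distribution, and the retriever's relevance scores are trained to match it. The aggregation by summation, the KL objective, and all function names are assumptions for illustration; the abstract does not fix these details.

```python
import math


def softmax(scores):
    """Turn retriever relevance scores into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def document_targets(token_attention):
    """Aggregate token-level reader attention into a per-document target.

    token_attention[d] holds the attention the reader placed on the
    tokens of retrieved document d (assumed to be non-negative).
    """
    totals = [sum(doc) for doc in token_attention]
    z = sum(totals)
    return [t / z for t in totals]


def distillation_loss(token_attention, retriever_scores):
    """KL(target || retriever) with the reader's attention as supervision."""
    target = document_targets(token_attention)
    predicted = softmax(retriever_scores)
    return sum(
        t * math.log(t / p) for t, p in zip(target, predicted) if t > 0
    )


if __name__ == "__main__":
    # The reader attends mostly to the second document.
    attention = [[0.1, 0.2], [0.3, 0.4]]
    print(distillation_loss(attention, [0.0, 1.0]))  # retriever agrees
    print(distillation_loss(attention, [1.0, 0.0]))  # retriever disagrees
```

A retriever that ranks the heavily attended document higher receives a smaller loss, which is the sense in which attention acts as a supervision signal here.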

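Summary: Provides a comprehensive review of the attention distillation workflow for retrieval-augmented models, identifies the key factors that affect the learning quality of retrieval-augmented language models, and proposes indicators for optimizing training methods and avoiding ineffective training.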

Unveiling the Magic: Investigating Attention Distillation in Retrieval-augmented Generation

If an image tells a story, the image caption is the briefest narrator.
Generally, a scene graph prefers to be an omniscient generalist, while the
image caption is more willing to be a specialist that outlines the gist. Many
previous studies have found that a scene graph is not as practical as
expected unless it can filter out trivial content and noise. In this respect,
the image caption is a good tutor. To this end, we let the scene graph borrow
this ability from the image caption so that it can become a specialist while
remaining all-around, resulting in the so-called Topic Scene Graph. What an
image caption pays attention to is distilled and passed to the scene graph for
estimating the importance of partial objects, relationships, and events.
Specifically, during caption generation, the attention paid to individual
objects at each time step is collected, pooled, and assembled to obtain the
attention on relationships, which serves as weak supervision for
regularizing the estimated importance scores of relationships. In addition, as
this attention distillation process provides an opportunity to combine the
generation of the image caption and the scene graph, we further transform the
scene graph into linguistic form with rich and free-form expressions by sharing
a single generation model with image captioning. Experiments show that attention
distillation brings significant improvements in mining important relationships
without strong supervision, and the topic scene graph shows great potential in
subsequent applications.
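
The collect, pool, and assemble steps described above can be sketched as follows. Attention over objects is recorded at each caption step, pooled over time into one score per object, and combined into a score per subject-object pair. The abstract does not name the pooling or assembling operators, so max-pooling over time, the product of subject and object scores, and the function names are assumptions made purely for illustration.

```python
def pool_object_attention(step_attention):
    """Max-pool attention over caption time steps.

    step_attention[t][i] is the attention on object i at step t.
    Returns one pooled score per object.
    """
    n_objects = len(step_attention[0])
    return [max(step[i] for step in step_attention) for i in range(n_objects)]


def relationship_attention(pooled, pairs):
    """Assemble pooled object attention into relationship attention.

    pairs is a list of (subject_index, object_index) tuples. Each pair
    is scored by the product of its two object scores (an assumed
    choice), then the scores are normalised to sum to one.
    """
    raw = {(s, o): pooled[s] * pooled[o] for s, o in pairs}
    z = sum(raw.values())
    return {pair: value / z for pair, value in raw.items()}


if __name__ == "__main__":
    # Two caption steps over three detected objects.
    steps = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
    pooled = pool_object_attention(steps)
    print(relationship_attention(pooled, [(0, 1), (0, 2)]))
```

The resulting distribution over relationships could then serve as a weak target for the scene graph's estimated importance scores, for example through a divergence-based regularizer.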

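Summary: The paper describes how the attention mechanism of image captioning can be used to strengthen importance estimation in scene graphs. It proposes the Topic Scene Graph, which learns a mapping from images to natural language and uses it to estimate the importance of relationships.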