Image Restoration (IR), a classic low-level vision task, has witnessed
significant advancements through deep models that effectively model global
information. Notably, the emergence of Vision Transformers (ViTs) has further
propelled these advancements. However, the self-attention mechanism, a
cornerstone of ViTs, tends to encompass all global cues, even those from
semantically unrelated objects or regions. Processing this irrelevant
information is computationally inefficient, particularly at high input
resolution. Additionally, for IR, it is commonly noted that small segments of a
degraded image, especially those closely aligned semantically, provide
highly relevant information for the restoration process, as they
contribute contextual cues crucial for accurate reconstruction. To
address these challenges, we propose SemanIR, a Transformer for IR that boosts
restoration performance by sharing key semantics.
Specifically, SemanIR initially constructs a sparse yet comprehensive
key-semantic dictionary within each transformer stage by establishing essential
semantic connections for every degraded patch. Subsequently, this dictionary is
shared across all subsequent transformer blocks within the same stage. This
strategy optimizes attention calculation within each block by focusing
exclusively on semantically related components stored in the key-semantic
dictionary. As a result, attention calculation achieves linear computational
complexity within each window. Extensive experiments across six IR tasks confirm
that SemanIR achieves state-of-the-art performance, both quantitatively and
qualitatively.
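
The dictionary-sharing idea can be illustrated with a minimal sketch. The abstract does not say how the semantic connections are chosen, so the top-k selection by dot-product similarity below is an assumption, and the names `build_key_semantic_dict`, `sparse_attention`, and `stage` are hypothetical. The sketch only shows the structure: the dictionary is built once per stage, and every block in that stage restricts attention to the listed patches.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def build_key_semantic_dict(feats, k):
    """For each patch, keep the indices of its k most similar patches.
    Assumption: similarity is a plain dot product and selection is top-k."""
    table = []
    for q in feats:
        scores = [dot(q, f) for f in feats]
        top = sorted(range(len(feats)), key=lambda j: scores[j], reverse=True)[:k]
        table.append(top)
    return table

def sparse_attention(feats, table):
    """Softmax attention where patch i attends only to the patches in table[i]."""
    d = len(feats[0])
    out = []
    for i, q in enumerate(feats):
        logits = [dot(q, feats[j]) / math.sqrt(d) for j in table[i]]
        m = max(logits)  # subtract the max for numerical stability
        w = [math.exp(l - m) for l in logits]
        z = sum(w)
        out.append([sum(wj * feats[j][c] for wj, j in zip(w, table[i])) / z
                    for c in range(d)])
    return out

def stage(feats, k, num_blocks):
    """One stage: build the dictionary once, then reuse it in every block."""
    table = build_key_semantic_dict(feats, k)
    for _ in range(num_blocks):
        feats = sparse_attention(feats, table)
    return feats, table
```

With a fixed k, each patch attends to k entries per block regardless of how many patches the window holds, which is the source of the per-block linear cost in this sketch. Building the dictionary here is still quadratic, but that cost is paid once per stage rather than once per block.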

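By optimizing self-attention with a purpose-built, sparse yet comprehensive key-semantic dictionary, the paper proposes SemanIR, a new method for improving image restoration performance. Sharing the key-semantic dictionary within each stage yields linear computational complexity within each window, and experiments demonstrate strong performance on six image restoration tasks.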

Efficient Image Restoration by Sharing Key Semantics in Transformer

Sharing Key Semantics in Transformer Makes Efficient Image Restoration

We present Lightning Attention, the first linear attention implementation
that maintains a constant training speed for various sequence lengths under
fixed memory consumption. Due to the issue with cumulative summation operations
(cumsum), previous linear attention implementations cannot achieve their
theoretical advantage in a causal setting. However, this issue can be
effectively solved by utilizing different attention calculation strategies to
compute the different parts of attention. Specifically, we split the attention
calculation into intra-blocks and inter-blocks and use conventional attention
computation for intra-blocks and linear attention kernel tricks for
inter-blocks. This eliminates the need for cumsum in the linear attention
calculation. Furthermore, a tiling technique is adopted through both forward
and backward procedures to take full advantage of the GPU hardware. To enhance
accuracy while preserving efficacy, we introduce TransNormerLLM (TNL), a new
architecture that is tailored to our lightning attention. We conduct rigorous
testing on standard and self-collected datasets with varying model sizes and
sequence lengths. TNL is notably more efficient than other language models. In
addition, benchmark results indicate that TNL performs on par with
state-of-the-art LLMs utilizing conventional transformer structures. The source
code is released at github.com/OpenNLPLab/TransnormerLLM.
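
The intra-block and inter-block split can be sketched as follows. This is a simplified illustration rather than the released implementation: it uses unnormalized causal linear attention with no decay term, it ignores GPU tiling, and the function names are hypothetical. Within a block, attention is computed the conventional way with a causal mask. Across blocks, the kernel trick reuses a running sum of K^T V, so no per-token cumulative sum is needed.

```python
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(r) for r in zip(*a)]

def add(a, b):
    return [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(a, b)]

def naive_causal_linear_attention(q, k, v):
    """Reference: O = (Q K^T, lower-triangular masked) V, without softmax."""
    scores = matmul(q, transpose(k))
    masked = [[s if j <= i else 0.0 for j, s in enumerate(row)]
              for i, row in enumerate(scores)]
    return matmul(masked, v)

def blockwise_causal_linear_attention(q, k, v, block):
    """Same result, computed block by block."""
    n, d_k, d_v = len(q), len(k[0]), len(v[0])
    kv = [[0.0] * d_v for _ in range(d_k)]  # running K^T V over finished blocks
    out = []
    for s in range(0, n, block):
        qb, kb, vb = q[s:s + block], k[s:s + block], v[s:s + block]
        intra = naive_causal_linear_attention(qb, kb, vb)  # conventional, masked
        inter = matmul(qb, kv)                             # linear kernel trick
        out.extend(add(intra, inter))
        kv = add(kv, matmul(transpose(kb), vb))            # one update per block
    return out
```

The state `kv` has a fixed size of d_k by d_v, so the memory held between blocks does not grow with sequence length.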

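We present Lightning Attention, the first linear attention implementation that maintains a constant training speed across different sequence lengths under fixed memory consumption.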

Different Lengths, Uniform Speed: Efficient Language Modeling with Lightning Attention

Various Lengths, Constant Speed: Efficient Language Modeling with Lightning Attention

The Transformer architecture is crucial for numerous AI models, but it still
faces challenges in long-range language modeling. Though several specific
transformer architectures have been designed to tackle issues of long-range
dependencies, existing methods like Transformer-XL are plagued by a high
percentage of ineffective memories. In this study, we present a plug-and-play
strategy, known as TRAining-free Memory Selection (TRAMS), that selects tokens
participating in attention calculation based on one simple metric. This
strategy allows us to keep tokens that are likely to have a high attention
score with the current queries and ignore the other ones. We have tested our
approach on a word-level benchmark (WikiText-103) and a character-level
benchmark (enwik8), and the results indicate an improvement without any
additional training or parameters.
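
A training-free selection step of this kind might look like the sketch below. The abstract does not state which metric TRAMS uses, so the key L2 norm here is only a stand-in assumption, and `key_norm`, `select_memory`, and `attend` are hypothetical names. The point is the shape of the procedure: score each cached memory token without looking at the query, keep the top m, and run ordinary attention over the survivors.

```python
import math

def key_norm(key):
    """Stand-in metric (an assumption, not the paper's): L2 norm of the key."""
    return math.sqrt(sum(x * x for x in key))

def select_memory(keys, values, m, metric=key_norm):
    """Keep the m memory slots with the largest metric, in original order."""
    ranked = sorted(range(len(keys)), key=lambda i: metric(keys[i]), reverse=True)[:m]
    kept = sorted(ranked)
    return [keys[i] for i in kept], [values[i] for i in kept], kept

def attend(query, keys, values):
    """Standard scaled dot-product attention for a single query."""
    d = len(query)
    logits = [sum(a * b for a, b in zip(query, k)) / math.sqrt(d) for k in keys]
    mx = max(logits)  # subtract the max for numerical stability
    w = [math.exp(l - mx) for l in logits]
    z = sum(w)
    return [sum(wi * v[c] for wi, v in zip(w, values)) / z
            for c in range(len(values[0]))]
```

Because the selection only filters the cached keys and values, it adds no parameters and needs no retraining, which matches the plug-and-play framing above.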

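The paper proposes a plug-and-play strategy called TRAining-free Memory Selection (TRAMS), which selects the tokens that participate in attention calculation using one simple metric, addressing the challenges of long-range language modeling. Without additional training or parameters, it achieves improved results on the word-level benchmark (WikiText-103) and the character-level benchmark (enwik8).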

TRAMS: Training-free Memory Selection for Long-range Language Models

TRAMS: Training-free Memory Selection for Long-range Language Modeling

Current end-to-end semantic role labeling is mostly accomplished via
graph-based neural models. However, these are all first-order models, where
each decision for detecting any predicate-argument pair is made in isolation
with local features. In this paper, we present a high-order refining mechanism
to perform interaction between all predicate-argument pairs. Based on the
baseline graph model, our high-order refining module learns higher-order
features between all candidate pairs via attention calculation, which are later
used to update the original token representations. After several iterations of
refinement, the underlying token representations can be enriched with globally
interacted features. Our high-order model achieves state-of-the-art results on
Chinese SRL data, including CoNLL09 and Universal Proposition Bank, while also
relieving long-range dependency issues.

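This paper presents a high-order refining mechanism that updates token representations through attention-based interaction among all predicate-argument pairs, relieving long-range dependency issues and achieving state-of-the-art results on Chinese SRL data.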