This short note is intended to communicate quickly an approach to long-context
training and to share an idea for training with low memory usage. In the note,
we generalize the attention algorithm and neural network of Generative
Pre-Trained Transformers and reinterpret them in the Path integral formalism.
First, the role of the transformer is understood as the time evolution of the
token state; second, it is suggested that all key-token states at the same time
as the query token can take part in attention with the query-token states. As a
result of repeated time evolution, the token states of a past sequence meet the
token states of the present sequence, so that attention between separated
sequences becomes possible. This maintains infinite contextual information
while using only the low memory needed for a limited sequence size. For the
experiment, an input window of $12$ tokens was used, and pre-training ran on
one GPU with $24$GB of memory. It was confirmed that a context of length
greater than $150$ is preserved. The sampling results of the training, the code
and other details will be included in a later revised version of this note.

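The attention algorithm and neural network of Generative Pre-Trained Transformers are generalized in the Path integral formalism. The role of the transformer is interpreted as the time evolution of the token state, and it is suggested that all key-token states at the same time can take part in attention with the query-token states, so that attention between separated sequences keeps infinite contextual information while using low memory for a limited sequence size.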

Folded context condensation in Path Integral formalism for infinite context transformers

We provide an optimized implementation of the forward pass of
FlashAttention-2, a popular memory-aware scaled dot-product attention
algorithm, as a custom fused CUDA kernel targeting NVIDIA Hopper architecture
and written using the open-source CUTLASS library. In doing so, we explain the
challenges and techniques involved in fusing online-softmax with back-to-back
GEMM kernels, utilizing the Hopper-specific Tensor Memory Accelerator (TMA) and
Warpgroup Matrix-Multiply-Accumulate (WGMMA) instructions, defining and
transforming CUTLASS Layouts and Tensors, overlapping copy and GEMM operations,
and choosing optimal tile sizes for the Q, K and V attention matrices while
balancing the register pressure and shared memory utilization. In head-to-head
benchmarks on a single H100 PCIe GPU for some common choices of
hyperparameters, we observe 20-50% higher FLOPs/s over a version of
FlashAttention-2 optimized for last-generation NVIDIA Ampere architecture.
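
The online-softmax idea mentioned above can be illustrated with a minimal scalar sketch. This is not the CUTLASS kernel described in the abstract and uses none of the Hopper features (TMA, WGMMA); it only shows, in plain Python, how the keys and values can be consumed tile by tile while a running maximum and running denominator keep the result equal to ordinary softmax attention. The function names `attention_reference` and `attention_online`, and the `tile` parameter, are hypothetical names chosen for this illustration.

```python
import math


def attention_reference(Q, K, V):
    """Plain softmax(Q K^T / sqrt(d)) V, materializing a full score row per query."""
    d = len(Q[0])
    dv = len(V[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        out.append([sum(wi * v[j] for wi, v in zip(w, V)) / z for j in range(dv)])
    return out


def attention_online(Q, K, V, tile=2):
    """Same result, but K and V are read in tiles and never scored all at once."""
    d = len(Q[0])
    dv = len(V[0])
    out = []
    for q in Q:
        m = -math.inf     # running maximum of all scores seen so far
        l = 0.0           # running softmax denominator, expressed relative to m
        acc = [0.0] * dv  # running unnormalized output, expressed relative to m
        for start in range(0, len(K), tile):
            k_tile = K[start:start + tile]
            v_tile = V[start:start + tile]
            # First matrix product: scores of this query against the K tile.
            s = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in k_tile]
            m_new = max(m, max(s))
            # Rescale earlier partial results to the new maximum.
            # On the first tile m is -inf, so the factor is exp(-inf) == 0.0.
            scale = math.exp(m - m_new)
            p = [math.exp(x - m_new) for x in s]
            l = l * scale + sum(p)
            # Second matrix product: probabilities of this tile times the V tile.
            acc = [
                a * scale + sum(pi * v[j] for pi, v in zip(p, v_tile))
                for j, a in enumerate(acc)
            ]
            m = m_new
        out.append([a / l for a in acc])
    return out
```

Because every earlier partial sum is rescaled whenever the running maximum changes, the tiled version agrees with the reference up to floating-point rounding for any tile size. A fused kernel relies on this same property to keep the score tile on chip between the two matrix products.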

An optimized forward pass of FlashAttention-2 is implemented as a custom fused CUDA kernel for the NVIDIA Hopper architecture using the open-source CUTLASS library, with an account of fusing online-softmax with back-to-back GEMM kernels, using TMA and WGMMA instructions, and choosing tile sizes that balance register pressure and shared memory utilization. On a single H100 PCIe GPU, for some common hyperparameter choices, it reaches 20-50% higher FLOPs/s than a FlashAttention-2 version optimized for the previous-generation NVIDIA Ampere architecture.