May, 2023

Blockwise Parallel Transformer for Large Context Models

TL;DR: Blockwise Parallel Transformer (BPT) is a distinct approach to the memory demands posed by the self-attention mechanism and the large feedforward network in Transformers. It enables training on sequences up to 32 times longer than vanilla Transformers and 2 to 4 times longer than previous memory-efficient methods, while also improving performance in language modeling and reinforcement learning tasks.
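
The sketch below illustrates the blockwise idea in JAX; it is not the paper's implementation. The function name, block sizes, and the simplified single-head, single-layer structure (no layer norm, no masking) are illustrative assumptions. The point it shows is that queries are processed one block at a time, key/value blocks are streamed with a running softmax, and the feedforward network is applied per block, so neither the full attention matrix nor the full-sequence feedforward activation is ever materialized.

```python
import jax
import jax.numpy as jnp

def blockwise_attention_ffn(q, k, v, w1, w2, block_q=512, block_k=512):
    """Hypothetical sketch: q, k, v are (seq_len, d); w1 is (d, d_ff); w2 is (d_ff, d)."""
    seq_len, d = q.shape
    scale = 1.0 / jnp.sqrt(d)
    outputs = []
    for qs in range(0, seq_len, block_q):
        q_blk = q[qs:qs + block_q]                      # (bq, d)
        # Running accumulators for a numerically stable streaming softmax.
        num = jnp.zeros_like(q_blk)                     # weighted sum of values
        den = jnp.zeros((q_blk.shape[0], 1))            # softmax normalizer
        m = jnp.full((q_blk.shape[0], 1), -jnp.inf)     # running max logit
        for ks in range(0, seq_len, block_k):
            k_blk = k[ks:ks + block_k]
            v_blk = v[ks:ks + block_k]
            logits = (q_blk @ k_blk.T) * scale          # (bq, bk), never full (seq, seq)
            m_new = jnp.maximum(m, logits.max(axis=-1, keepdims=True))
            correction = jnp.exp(m - m_new)             # rescale old accumulators
            p = jnp.exp(logits - m_new)
            num = num * correction + p @ v_blk
            den = den * correction + p.sum(axis=-1, keepdims=True)
            m = m_new
        attn_out = num / den
        # Apply the feedforward network inside the same block loop, so its
        # activations are also materialized only one block at a time.
        ffn_out = jax.nn.gelu(attn_out @ w1) @ w2
        outputs.append(attn_out + ffn_out)
    return jnp.concatenate(outputs, axis=0)
```

Because each query block only ever holds a (block_q, block_k) slice of attention scores and a (block_q, d_ff) slice of feedforward activations, peak activation memory scales with the block sizes rather than with the full sequence length, which is what allows the much longer training sequences the TL;DR describes.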