We introduce a decoder-decoder architecture, YOCO, for large language models,
which only caches key-value pairs once. It consists of two components, i.e., a
cross-decoder stacked upon a self-decoder. The self-decoder efficiently encodes
global key-value (KV) caches that are reused by the cross-decoder via
cross-attention. The overall model behaves like a decoder-only Transformer,
although YOCO only caches once. The design substantially reduces GPU memory
demands, yet retains global attention capability. Additionally, the computation
flow allows the prefill stage to exit early without changing the final output,
thereby significantly speeding up prefilling. Experimental results
demonstrate that YOCO achieves favorable performance compared to the Transformer
in various settings of scaling up model size and number of training tokens. We
also extend YOCO to 1M context length with near-perfect needle retrieval
accuracy. The profiling results show that YOCO improves inference memory,
prefill latency, and throughput by orders of magnitude across context lengths
and model sizes. Code is available at this https URL
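
The memory argument above can be illustrated with a rough count of cached key-value entries. This is only a simplified sketch, not the paper's profiling method: it assumes every layer of a standard decoder-only Transformer stores full-width keys and values, it ignores whatever the self-decoder itself keeps in memory, and the function names and the layer, token, and dimension values are hypothetical.

```python
def transformer_kv_entries(num_layers, seq_len, hidden_dim):
    """Cached entries when every layer stores its own keys and values."""
    return 2 * num_layers * seq_len * hidden_dim


def cache_once_kv_entries(seq_len, hidden_dim):
    """Cached entries when one global key-value cache is shared by all
    cross-decoder layers (self-decoder state is ignored in this sketch)."""
    return 2 * seq_len * hidden_dim


# Hypothetical configuration, chosen only for illustration.
layers, tokens, dim = 32, 4096, 1024

baseline = transformer_kv_entries(layers, tokens, dim)
shared = cache_once_kv_entries(tokens, dim)

# Under these assumptions the saving grows linearly with the layer count.
ratio = baseline // shared
print(f"per-layer caches: {baseline} entries")
print(f"single shared cache: {shared} entries")
print(f"reduction factor: {ratio}")
```

Under these assumptions the reduction factor equals the number of layers, which is why caching once matters most for deep models and long contexts.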

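A decoder-decoder architecture that caches key-value pairs only once (YOCO) is used to build large language models. It reduces GPU memory demands, achieves favorable performance as model size and the number of training tokens scale up, and improves needle retrieval accuracy.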

Cache Only Once: Decoder-Decoder Architectures for Language Models

You Only Cache Once: Decoder-Decoder Architectures for Language Models

Recently, the success of large models has demonstrated the importance of
scaling up model size. This has spurred interest in exploring collaborative
training of large-scale models from a federated learning perspective. Due to
computational constraints, many institutions struggle to train a large-scale
model locally. Thus, training a larger global model using only smaller local
models has become an important scenario (i.e., the \textbf{small-to-large
scenario}). Although recent device-heterogeneity federated learning approaches
have started to explore this area, they face limitations in fully covering the
parameter space of the global model. In this paper, we propose a method called
\textbf{FedBRB} (\underline{B}lock-wise \underline{R}olling and weighted
\underline{B}roadcast) based on the block concept. FedBRB can use small local
models to train all blocks of the large global model, and broadcasts the
trained parameters to the entire space for faster information interaction.
Experiments demonstrate FedBRB yields substantial performance gains, achieving
state-of-the-art results in this scenario. Moreover, FedBRB using only minimal
local models can even surpass baselines using larger local models.
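
The idea that small local models can jointly cover every block of a larger global model can be sketched with a simple round-robin schedule. The abstract does not specify the actual rolling rule or the weighted broadcast step, so everything below is an assumption made for illustration: the function name, the round-robin assignment, and the block and client counts are hypothetical, and the broadcast step is not modeled at all.

```python
def rolling_block_schedule(num_blocks, num_clients, num_rounds):
    """Return schedule[r][c]: the block index client c trains in round r.

    Assumption: a plain round-robin shift per round. This is one plausible
    reading of "rolling", not the schedule defined by FedBRB.
    """
    return [
        [(r + c) % num_blocks for c in range(num_clients)]
        for r in range(num_rounds)
    ]


# Hypothetical setup: a global model split into 4 blocks, 2 small clients.
schedule = rolling_block_schedule(num_blocks=4, num_clients=2, num_rounds=4)

# Collect every block that was trained by some client in some round.
covered = {block for round_blocks in schedule for block in round_blocks}

for r, round_blocks in enumerate(schedule):
    print(f"round {r}: clients train blocks {round_blocks}")
print(f"all blocks covered: {covered == set(range(4))}")
```

Even though each client only ever holds one block at a time, shifting the assignment every round lets the set of trained blocks reach the whole global model.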

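Recently, the success of large models has demonstrated the importance of scaling up model size, which has spurred interest in collaborative training of large-scale models from a federated learning perspective. Due to computational constraints, many institutions struggle to train large-scale models locally. Training a larger global model using only smaller local models has therefore become an important scenario. Although recent device-heterogeneity federated learning approaches have begun to explore this area, they are limited in how fully they cover the parameter space of the global model. This paper proposes FedBRB (block-wise rolling and weighted broadcast), a method based on the block concept. FedBRB can use small local models to train all blocks of the large global model, and it broadcasts the trained parameters to the entire space for faster information interaction. Experiments show that FedBRB yields substantial performance gains in this scenario and achieves state-of-the-art results. Moreover, FedBRB using only smaller local models can even surpass baselines that use larger local models.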

FedBRB: An Effective Solution for the Small-to-Large Scenario in Device-Heterogeneous Federated Learning

FedBRB: An Effective Solution to the Small-to-Large Scenario in Device-Heterogeneity Federated Learning

It has been shown that dual encoders trained on one domain often fail to
generalize to other domains for retrieval tasks. One widespread belief is that
the bottleneck layer of a dual encoder, where the final score is simply a
dot-product between a query vector and a passage vector, is too limited to make
dual encoders an effective retrieval model for out-of-domain generalization. In
this paper, we challenge this belief by scaling up the size of the dual encoder
model {\em while keeping the bottleneck embedding size fixed.} With multi-stage
training, surprisingly, scaling up the model size brings significant
improvement on a variety of retrieval tasks, especially for out-of-domain
generalization. Experimental results show that our dual encoders,
\textbf{G}eneralizable \textbf{T}5-based dense \textbf{R}etrievers (GTR),
significantly outperform existing sparse and dense retrievers on the BEIR
dataset~\cite{thakur2021beir}. Most
surprisingly, our ablation study finds that GTR is very data efficient, as it
only needs 10\% of MS Marco supervised data to achieve the best out-of-domain
performance. All the GTR models are released at
this https URL
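
The bottleneck described above, where the final score is a dot product between a query vector and a passage vector, can be sketched as follows. This is a minimal illustration, not GTR's implementation: the vectors are hypothetical stand-ins for encoder outputs, and `dot_score` and `rank_passages` are names introduced here.

```python
def dot_score(query_vec, passage_vec):
    """Relevance score: dot product of two fixed-size embeddings."""
    return sum(q * p for q, p in zip(query_vec, passage_vec))


def rank_passages(query_vec, passage_vecs):
    """Return passage indices sorted from highest to lowest score."""
    scores = [dot_score(query_vec, p) for p in passage_vecs]
    return sorted(range(len(passage_vecs)), key=lambda i: scores[i], reverse=True)


# Hypothetical 3-dimensional embeddings; a real encoder would produce these.
query = [1.0, 0.0, 2.0]
passages = [
    [0.5, 1.0, 0.0],
    [1.0, 0.0, 1.0],
    [0.0, 2.0, 0.5],
]

ranking = rank_passages(query, passages)
print(ranking)
```

Because the score depends only on the two embeddings, the encoder can grow in size while the embedding dimension, and therefore the retrieval index, stays fixed.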

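By applying multi-stage training and scaling up the dual encoder model while keeping the bottleneck embedding size fixed, this paper challenges the widespread belief that dual encoders trained on one domain often fail to generalize to retrieval tasks in other domains. The results show that our dual encoder model, GTR, achieves significant improvements in retrieval performance, especially for out-of-domain generalization, and significantly outperforms existing sparse and dense retrievers on the BEIR dataset. Most surprisingly, our ablation study finds that GTR is very data efficient: it needs only 10% of the MS Marco supervised data to achieve the best out-of-domain retrieval performance.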