Transformer-based NLP models are powerful but have high computational costs that limit deployment scenarios. Fine-tuned encoder-decoder models are popular in specialized domains and can outperform larger, more general decoder-only models such as GPT-4. We introduce a new configuration for encoder-decoder models that improves efficiency on structured-output and question-answering tasks where multiple outputs are required for a single input. Our method, prompt-in-decoder (PiD), encodes the input once and decodes the outputs in parallel, boosting both training and inference efficiency by avoiding duplicate input encoding, thereby reducing the decoder's memory footprint. We achieve a reduction in computation that scales roughly with the number of subtasks, gaining up to a 4.6x speed-up over state-of-the-art models for dialogue state tracking, summarization, and question-answering tasks, with comparable or better performance. We release our training/inference code and checkpoints.
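
The encode-once idea can be illustrated with a toy cost model. The sketch below is not the released PiD code: `encode`, `decode`, `prompt_in_encoder`, and `prompt_in_decoder` are hypothetical stand-ins, and counting encoded tokens is only a rough proxy for encoder compute. It assumes a baseline that prepends each subtask prompt to the input and re-encodes the pair, and contrasts it with a variant that encodes the input once and hands each prompt to the decoder.

```python
from typing import Dict, List


def new_stats() -> Dict[str, int]:
    """Fresh counters used as a rough proxy for encoder cost."""
    return {"encoder_calls": 0, "encoded_tokens": 0}


def encode(tokens: List[str], stats: Dict[str, int]) -> List[str]:
    """Hypothetical stand-in for an encoder: records how much it processes."""
    stats["encoder_calls"] += 1
    stats["encoded_tokens"] += len(tokens)
    return list(tokens)  # placeholder for encoder hidden states


def decode(memory: List[str], prompt: str) -> str:
    """Hypothetical stand-in for a decoder conditioned on a subtask prompt."""
    return f"<output for {prompt!r} over {len(memory)} encoded tokens>"


def prompt_in_encoder(document: List[str], prompts: List[str],
                      stats: Dict[str, int]) -> List[str]:
    """Assumed baseline: every subtask re-encodes prompt + document."""
    outputs = []
    for prompt in prompts:
        memory = encode([prompt] + document, stats)
        outputs.append(decode(memory, prompt))
    return outputs


def prompt_in_decoder(document: List[str], prompts: List[str],
                      stats: Dict[str, int]) -> List[str]:
    """Encode the document once; each prompt goes to the decoder instead."""
    memory = encode(document, stats)  # shared across all subtasks
    # The decode calls are independent of each other, so a real model
    # could run them as a single batch rather than a Python loop.
    return [decode(memory, prompt) for prompt in prompts]


if __name__ == "__main__":
    document = ["tok"] * 100
    prompts = ["slot_a", "slot_b", "slot_c", "slot_d", "slot_e"]

    baseline, shared = new_stats(), new_stats()
    prompt_in_encoder(document, prompts, baseline)
    prompt_in_decoder(document, prompts, shared)
    print("prompt-in-encoder:", baseline)
    print("prompt-in-decoder:", shared)
```

In this toy setting the baseline's encoder workload grows with the number of prompts while the shared-encoding variant's stays fixed, which mirrors the claim that savings scale roughly with the number of subtasks. Actual speed-ups depend on model size, sequence lengths, and hardware, none of which this sketch models.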

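The computational cost of Transformer-based NLP models limits where they can be deployed. We introduce a new encoder-decoder model configuration (PiD) that improves efficiency on structured-output and question-answering tasks by encoding the input once and decoding the outputs in parallel. This avoids duplicate input encoding and reduces the decoder's memory footprint, cutting computation for a speed-up of up to 4.6x with comparable or better performance.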

Encode Once and Decode in Parallel: Efficient Transformer Decoding