The scalability limitations of Transformers with respect to sequence length have renewed interest in recurrent sequence models that are parallelizable during training. As a result, many novel recurrent architectures, such as S4, Mamba, and Aaren, have been proposed that achieve comparable performance. In this work, we revisit traditional recurrent neural networks (RNNs) from over a decade ago: LSTMs (1997) and GRUs (2014). While these models were slow to train because they require backpropagation through time (BPTT), we show that by removing the hidden state dependencies from their input, forget, and update gates, LSTMs and GRUs no longer require BPTT and can be trained efficiently in parallel. Building on this, we introduce minimal versions (minLSTMs and minGRUs) that (1) use significantly fewer parameters than their traditional counterparts and (2) are fully parallelizable during training (175x faster for a sequence of length 512). Lastly, we show that these stripped-down versions of decade-old RNNs match the empirical performance of recent sequence models.
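To illustrate why removing the hidden state dependency from the gates matters, the following is a minimal sketch, not the authors' implementation. It assumes a scalar hidden state and a GRU-style update h_t = (1 - z_t) * h_{t-1} + z_t * c_t, where the gate z_t and the candidate c_t are computed from the input x_t alone. Under that assumption every step is an affine map of h_{t-1}, affine maps compose associatively, and so all hidden states can be obtained with a prefix scan instead of a step-by-step loop. The function names and the scalar weights (`w_z`, `b_z`, `w_h`, `b_h`) are hypothetical and chosen only for this example.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def gates_from_inputs(xs, w_z, b_z, w_h, b_h):
    """Compute gate z_t and candidate c_t from x_t only (no h_{t-1} term).

    The scalar weights are illustrative assumptions, not values from the paper.
    """
    zs = [sigmoid(w_z * x + b_z) for x in xs]
    cands = [w_h * x + b_h for x in xs]
    return zs, cands


def sequential(zs, cands, h0):
    """Reference: step-by-step recurrence h_t = (1 - z_t) * h_{t-1} + z_t * c_t."""
    h, out = h0, []
    for z, c in zip(zs, cands):
        h = (1.0 - z) * h + z * c
        out.append(h)
    return out


def combine(left, right):
    """Compose two affine maps h -> a * h + b, applying `left` first."""
    a1, b1 = left
    a2, b2 = right
    return (a1 * a2, a2 * b1 + b2)


def scan(zs, cands, h0):
    """Inclusive prefix scan (Hillis-Steele style) over the affine maps.

    Within each pass, every position t is updated independently of the others,
    so a pass could run in parallel; only about log2(n) passes are needed.
    """
    elems = [(1.0 - z, z * c) for z, c in zip(zs, cands)]
    n, step = len(elems), 1
    while step < n:
        elems = [
            elems[t] if t < step else combine(elems[t - step], elems[t])
            for t in range(n)
        ]
        step *= 2
    # elems[t] now maps h0 directly to h_t.
    return [a * h0 + b for a, b in elems]
```

Here `sequential` and `scan` should agree up to floating-point rounding for any inputs. If the gate also depended on h_{t-1}, as in a traditional GRU or LSTM, each step would no longer be a fixed affine map known in advance, and this scan formulation would not apply.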

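This work addresses the scalability limitations of Transformers with respect to sequence length by revisiting traditional recurrent neural networks (RNNs), in particular LSTMs and GRUs. By removing the hidden state dependencies, it proposes simplified versions (minLSTMs and minGRUs) that use significantly fewer parameters and can be trained efficiently in parallel. Their performance is comparable to that of recent models, suggesting that traditional RNNs still have potential value.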

Do we still need RNNs?