In this paper, we study the inductive biases of two major approaches to augmenting Transformers with a recurrent mechanism: (1) incorporating a depth-wise recurrence similar to that of Universal Transformers; and (2) incorporating a chunk-wise temporal recurrence similar to that of the Temporal Latent Bottleneck. Furthermore, we propose and investigate novel ways to extend and combine these methods. For example, we propose a global mean-based dynamic halting mechanism for the Universal Transformer and an augmentation of the Temporal Latent Bottleneck with elements from the Universal Transformer. We compare the models and probe their inductive biases on several diagnostic tasks, including Long Range Arena (LRA), flip-flop language modeling, ListOps, and Logical Inference.

Recurrent Transformers with Dynamic Halt