Can transformers generalize efficiently on problems whose examples differ in difficulty? We introduce a new task tailored to assess generalization over different complexities and present results indicating that standard transformers struggle to solve it. The task is a variation of pointer value retrieval, previously introduced by Zhang et al. (2021). We investigate how a mechanism for adaptive and modular computation in transformers facilitates learning tasks that demand generalization over the number of sequential computation steps (i.e., the depth of the computation graph). Based on our observations, we propose a transformer-based architecture called Hyper-UT, which combines dynamic function generation from hypernetworks with adaptive depth from Universal Transformers. This model demonstrates higher accuracy and a fairer allocation of computational resources when generalizing to higher numbers of computation steps. We conclude that mechanisms for adaptive depth and modularity complement each other in improving efficient generalization with respect to example complexity. Additionally, to emphasize the broad applicability of our findings, we show that on a standard image recognition task Hyper-UT matches the performance of a ViT model with considerably reduced computational demands (over 70% average savings, achieved by effectively using fewer layers).
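To make the two ingredients concrete, the following toy sketch combines a hypernetwork-style weight generator with a simple halting rule that decides how many times a shared layer is applied. It is an illustration under stated assumptions, not the Hyper-UT implementation: the names `make_params` and `adaptive_forward`, the softmax mixing over a small bank of weight matrices, the residual tanh update, and the cumulative halting threshold are all hypothetical simplifications chosen for readability.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def make_params(dim, n_modules, halt_bias, seed=0):
    """Random toy parameters (hypothetical layout, for illustration only)."""
    rng = random.Random(seed)

    def rand(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

    return {
        "modules": [rand(dim, dim) for _ in range(n_modules)],  # bank of weight matrices
        "router": rand(n_modules, dim),                         # toy "hypernetwork"
        "halt_w": [rng.uniform(-0.5, 0.5) for _ in range(dim)],
        "halt_b": halt_bias,
    }


def adaptive_forward(x, params, max_steps=8, threshold=0.99):
    """Apply a shared layer repeatedly; return the final state and steps used."""
    h = list(x)
    dim = len(h)
    cumulative_halt = 0.0
    steps = 0
    while steps < max_steps and cumulative_halt < threshold:
        # Modularity: the current state picks mixing weights over the module bank,
        # so the layer's weight matrix is generated anew at every step.
        alpha = softmax(matvec(params["router"], h))
        W = [
            [sum(a * M[i][j] for a, M in zip(alpha, params["modules"])) for j in range(dim)]
            for i in range(dim)
        ]
        # Shared-layer update with a residual connection.
        h = [hi + math.tanh(u) for hi, u in zip(h, matvec(W, h))]
        # Adaptive depth: accumulate a halting probability and stop once it
        # crosses the threshold (a simplified, assumed halting rule).
        logit = sum(w * hi for w, hi in zip(params["halt_w"], h)) + params["halt_b"]
        cumulative_halt += sigmoid(logit)
        steps += 1
    return h, steps


if __name__ == "__main__":
    x = [0.1, -0.2, 0.3, 0.4]
    for bias in (20.0, -20.0):
        _, used = adaptive_forward(x, make_params(4, 3, halt_bias=bias))
        print(f"halt_bias={bias}: steps used = {used}")
```

With a strongly positive halting bias the loop stops after a single step, while a strongly negative bias makes it run for all `max_steps`. This mirrors, in miniature, the idea that easier examples can use fewer layers than harder ones.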

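By introducing a new task, we investigate how well transformers generalize across problems of differing difficulty, and our results show that standard transformers struggle to solve these tasks. We propose Hyper-UT, an architecture built on mechanisms for adaptive and modular computation, which combines dynamic function generation from hypernetworks with adaptive depth from Universal Transformers to learn tasks that require generalizing over the number of computation steps (i.e., the depth of the computation graph). The model achieves higher accuracy and a fairer allocation of computational resources when generalizing to larger numbers of computation steps. We conclude that adaptive depth and modularity complement each other, improving efficient generalization with respect to example complexity. Finally, to illustrate the broad applicability of our findings, we show that on a standard image recognition task Hyper-UT matches the performance of a ViT model with substantially lower computational demands (over 70% average savings, achieved by effectively using fewer layers).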

Adaptivity and Modularity: Efficient Generalization Across Diverse Tasks