Deep learning is closing the gap with humans on several object recognition benchmarks. Here we investigate this gap in the context of challenging images where objects are seen from unusual viewpoints. We find that humans excel at recognizing objects in unusual poses, in contrast with state-of-the-art pretrained networks (EfficientNet, SWAG, ViT, SWIN, BEiT, ConvNext), which are systematically brittle in this condition. Remarkably, as we limit image exposure time, human performance degrades to the level of deep networks, suggesting that additional mental processes (requiring additional time) take place when humans identify objects in unusual poses. Finally, our analysis of error patterns of humans vs. networks reveals that even time-limited humans are dissimilar to feed-forward deep networks. We conclude that more work is needed to bring computer vision systems to the level of robustness of the human visual system. Understanding the nature of the mental processes taking place during extra viewing time may be key to attaining such robustness.

Given enough time, humans outperform deep neural networks at recognizing objects in unusual poses