Reinforcement learning (RL) for robot control typically requires a detailed
representation of the environment state, including information about
task-relevant objects not directly measurable. Keypoint detectors, such as
spatial autoencoders (SAEs), are a common approach to extracting a
low-dimensional representation from high-dimensional image data. SAEs aim at
spatial features such as object positions, which are often useful
representations in robotic RL. However, whether an SAE is actually able to
track objects in the scene and thus yields a spatial state representation well
suited for RL tasks has rarely been examined due to a lack of established
metrics. In this paper, we propose to assess the performance of an SAE instance
by measuring how well keypoints track ground truth objects in images. We
present a computationally lightweight metric and use it to evaluate common
baseline SAE architectures on image data from a simulated robot task. We find
that common SAEs differ substantially in their spatial extraction capability.
Furthermore, we validate that SAEs that perform well in our metric achieve
superior performance when used in downstream RL. Thus, our metric is an
effective and lightweight indicator of RL performance before executing
expensive RL training. Building on these insights, we identify three key
modifications of SAE architectures to improve tracking performance. We make our
code available at anonymous.4open.science/r/sae-rl.

提出了一种用于评估 SAE 实例性能的轻量级度量标准，并验证 SAE 实例的跟踪性能与其在下游强化学习中的表现的关系。从而实现在昂贵的强化学习训练之前对 RL 性能的评估，同时提出了改进 SAE 架构以提高跟踪性能的三个关键修改。