We study interactive learning in a setting where the agent has to generate a response (e.g., an action or trajectory) given a context and an instruction. In contrast to typical approaches that train the system using reward or expert supervision on the response, we study learning with hindsight instruction, where a teacher provides the instruction that is most suitable for the agent's generated response. This hindsight labeling of instructions is often easier to provide than expert supervision of the optimal response, which may require expert knowledge or be impractical to elicit. We initiate the theoretical analysis of interactive learning with hindsight labeling. We first provide a lower bound showing that, in general, the regret of any algorithm must scale with the size of the agent's response space. We then study a specialized setting where the underlying instruction-response distribution can be decomposed as a low-rank matrix. We introduce an algorithm called LORIL for this setting and show that its regret scales as $\sqrt{T}$, where $T$ is the number of rounds, and depends on the intrinsic rank but not on the size of the agent's response space. We provide experiments in two domains showing that LORIL outperforms baselines even when the low-rank assumption is violated.
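
To make the interaction protocol concrete, the following is a minimal sketch, not the LORIL algorithm itself (the abstract does not specify it). It assumes finite instruction and response sets, a hypothetical teacher whose response-to-instruction distribution is a mixture of a few base distributions (so the resulting matrix has rank at most `rank`), and a placeholder agent that picks responses uniformly at random. All function and variable names are illustrative assumptions.

```python
import random


def random_simplex_point(n, rng):
    """Return a random probability vector of length n (normalized exponentials)."""
    draws = [rng.expovariate(1.0) for _ in range(n)]
    total = sum(draws)
    return [d / total for d in draws]


def make_low_rank_teacher(num_instructions, num_responses, rank, rng):
    """Hypothetical teacher: row y is a distribution over hindsight instructions.

    Each row is a convex mixture of `rank` base distributions, so the
    (num_responses x num_instructions) matrix has rank at most `rank`.
    """
    base = [random_simplex_point(num_instructions, rng) for _ in range(rank)]
    weights = [random_simplex_point(rank, rng) for _ in range(num_responses)]
    return [
        [sum(w[k] * base[k][x] for k in range(rank)) for x in range(num_instructions)]
        for w in weights
    ]


def run_protocol(teacher, num_instructions, num_rounds, rng):
    """Simulate rounds of: given instruction -> agent response -> hindsight label."""
    log = []
    for _ in range(num_rounds):
        instruction = rng.randrange(num_instructions)  # instruction given this round
        response = rng.randrange(len(teacher))  # placeholder policy, not LORIL
        # The teacher labels the response with the instruction that suits it,
        # sampled here from the assumed low-rank distribution.
        hindsight = rng.choices(range(num_instructions), weights=teacher[response])[0]
        log.append((instruction, response, hindsight))
    return log


if __name__ == "__main__":
    rng = random.Random(0)
    teacher = make_low_rank_teacher(num_instructions=4, num_responses=6, rank=2, rng=rng)
    for record in run_protocol(teacher, num_instructions=4, num_rounds=5, rng=rng):
        print(record)
```

A learning agent would replace the uniform placeholder policy with one that uses the logged (response, hindsight instruction) pairs to choose responses that match the given instruction.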

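Summary: This work studies interactive learning guided by hindsight labels. Its theoretical analysis shows that the regret of any algorithm must scale with the size of the agent's response space. For a specialized low-rank matrix setting, it introduces an algorithm called LORIL and proves that its regret scales with the square root of the number of rounds and does not depend on the size of the agent's response space. Experiments in two domains show that LORIL outperforms baseline algorithms.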

Provable Interactive Learning with Hindsight Instruction Feedback