The partial monitoring (PM) framework provides a theoretical formulation of
sequential learning problems with incomplete feedback. On each round, a
learning agent plays an action while the environment simultaneously chooses an
outcome. The agent then observes a feedback signal that is only partially
informative about the (unobserved) outcome. The agent leverages the received
feedback signals to select actions that minimize the (unobserved) cumulative
loss. In contextual PM, the outcomes depend on some side information that is
observable by the agent before selecting the action on each round. In this
paper, we consider the contextual and non-contextual PM settings with
stochastic outcomes. We introduce a new class of strategies, based on the
randomization of deterministic confidence bounds, that extends regret guarantees
to settings where existing stochastic strategies are not applicable. Our
experiments show that the proposed RandCBP and RandCBPside* strategies improve
on state-of-the-art baselines in PM games. To encourage the adoption of the PM
framework, we design a use case on the real-world problem of monitoring the
error rate of any deployed classification system.
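
The round-by-round protocol described above can be sketched as a small simulation. This is only an illustration of the interaction loop, not the RandCBP strategy: the 2-action, 2-outcome loss and feedback tables, the names `play` and `uniform_policy`, and the uniformly random placeholder policy are all assumptions made for this example.

```python
import random

# Hypothetical 2-action, 2-outcome game, used only for illustration.
# LOSS[action][outcome] is the loss incurred; the agent never observes it.
LOSS = [[0.0, 1.0],
        [1.0, 0.0]]
# FEEDBACK[action][outcome] is the signal shown to the agent.
# Action 0 is uninformative; action 1 reveals the outcome.
FEEDBACK = [["none", "none"],
            ["saw_0", "saw_1"]]


def uniform_policy(history, rng):
    """Placeholder agent: ignores the history and picks an action at random."""
    return rng.randrange(len(LOSS))


def play(agent_policy, outcome_probs, horizon, seed=0):
    """Run the partial monitoring protocol with stochastic outcomes."""
    rng = random.Random(seed)
    history = []            # (action, signal) pairs: all the agent ever sees
    cumulative_loss = 0.0   # tracked by the simulator, hidden from the agent
    for _ in range(horizon):
        action = agent_policy(history, rng)
        # The environment draws the outcome independently of the action.
        outcome = rng.choices(range(len(outcome_probs)), weights=outcome_probs)[0]
        signal = FEEDBACK[action][outcome]
        history.append((action, signal))
        cumulative_loss += LOSS[action][outcome]
    return cumulative_loss, history


total_loss, history = play(uniform_policy, [0.7, 0.3], horizon=100)
```

A strategy that exploits the feedback would replace `uniform_policy`; the simulator itself stays unchanged because the agent only ever receives `history`.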

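The partial monitoring (PM) framework provides a theoretical formulation of sequential learning problems with incomplete feedback. This paper considers the contextual and non-contextual PM settings with stochastic outcomes and introduces strategies based on randomizing deterministic confidence bounds, which extend the scope of regret guarantees and improve on existing baselines in PM games. To encourage adoption of the PM framework, we design a use case on a real-world problem: monitoring the error rate of any deployed classification system.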

Randomized Confidence Bounds for Stochastic Partial Monitoring

We study the problem of continually training an instruction-following agent
through feedback provided by users during collaborative interactions. During
interaction, human users instruct an agent using natural language, and provide
realtime binary feedback as they observe the agent's instruction execution. We
cast learning as a contextual bandit problem, converting the user feedback to
immediate reward. We evaluate through multiple rounds of human-agent
interactions, demonstrating a 15.4% absolute improvement in instruction execution
over time. We also show our approach is robust to several design variations,
and that the feedback signal is roughly equivalent to the learning signal of
supervised demonstration data.
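
One way to picture the conversion of binary feedback into a contextual bandit learning signal is a policy-gradient step on a softmax policy. This is a minimal sketch under stated assumptions, not the paper's model: the +1/-1 reward mapping, the linear per-action weights, and the names `feedback_to_reward` and `bandit_update` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def feedback_to_reward(positive):
    """Assumed mapping: positive feedback -> +1, negative feedback -> -1."""
    return 1.0 if positive else -1.0


def bandit_update(weights, context, action, reward, lr=0.1):
    """One REINFORCE-style step for a linear softmax policy.

    weights[a] holds the per-feature weights of action a and is updated
    in place. Returns the action probabilities computed before the update.
    """
    scores = [sum(w * x for w, x in zip(w_a, context)) for w_a in weights]
    probs = softmax(scores)
    for a, w_a in enumerate(weights):
        indicator = 1.0 if a == action else 0.0
        # Gradient of log pi(action | context) with respect to weights[a].
        coeff = lr * reward * (indicator - probs[a])
        for i, x in enumerate(context):
            w_a[i] += coeff * x
    return probs


weights = [[0.0, 0.0], [0.0, 0.0]]
bandit_update(weights, [1.0, 0.0], action=0, reward=feedback_to_reward(True))
```

Only the executed action receives a reward, which is what makes this a bandit update rather than a supervised one: positive feedback raises the probability of the executed action in that context, and negative feedback lowers it.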

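This paper studies continually training a natural-language instruction-following agent from realtime binary feedback that users provide during collaborative human-agent interactions. Learning is cast as a contextual bandit problem, with user feedback converted to immediate reward. The approach improves instruction execution over time, and the feedback signal is roughly equivalent to the learning signal of supervised demonstration data.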

Continual Learning for Instruction Following from Realtime Feedback

We present an efficient, effective, and generic approach towards solving
inverse problems. The key idea is to leverage the feedback signal provided by
the forward process and learn an iterative update model. Specifically, at each
iteration, the neural network takes the feedback as input and outputs an update
on the current estimation. Our approach does not have any restrictions on the
forward process; it does not require any prior knowledge either. Through the
feedback information, our model can not only produce accurate estimates that
are consistent with the input observation but also recover from early
incorrect predictions. We verify the performance of our approach over a
wide range of inverse problems, including 6-DOF pose estimation, illumination
estimation, as well as inverse kinematics. Compared to traditional
optimization-based methods, we can achieve comparable or better performance
while being two to three orders of magnitude faster. Compared to deep
learning-based approaches, our model consistently improves the performance on
all metrics. Please refer to the project page for videos, animations,
supplementary materials, etc.
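
The iterative scheme described above can be sketched as a loop that repeatedly queries the forward process and applies an update. The sketch makes several assumptions that are not from the source: the feedback is taken to be the residual between the forward output and the observation, the learned neural network is replaced by a hand-written stand-in (`toy_update`), and the forward process is a toy scalar function (`toy_forward`).

```python
def iterative_inverse(observation, forward, update_model, initial_estimate,
                      num_iterations=30):
    """Refine an estimate using feedback from the forward process."""
    estimate = initial_estimate
    for _ in range(num_iterations):
        # Assumed form of the feedback: mismatch between what the current
        # estimate would produce and what was actually observed.
        feedback = forward(estimate) - observation
        # The update model maps feedback to a correction of the estimate.
        estimate = estimate + update_model(feedback)
    return estimate


def toy_forward(x):
    """Hypothetical forward process: a simple affine map."""
    return 2.0 * x + 1.0


def toy_update(feedback):
    """Stand-in for a learned update network: a damped correction."""
    return -0.25 * feedback


# Recover x such that toy_forward(x) == 7.0, starting from a poor guess.
recovered = iterative_inverse(7.0, toy_forward, toy_update, initial_estimate=0.0)
```

With these toy choices the error halves at every iteration, so a poor initial estimate is corrected over successive steps. The loop treats `forward` as a black box, which mirrors the statement that the approach places no restrictions on the forward process.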

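This paper presents an approach to inverse problems that uses the feedback signal from the forward process to learn an iterative update model. It achieves comparable or better performance than traditional optimization-based methods while being faster, consistently improves on deep learning-based approaches across all metrics, and is verified on 6-DOF pose estimation, illumination estimation, and inverse kinematics.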