In this research note, we revisit the bandits with expert advice problem.
Under a restricted feedback model, we prove a lower bound of order $\sqrt{K T
\ln(N/K)}$ for the worst-case regret, where $K$ is the number of actions, $N>K$
the number of experts, and $T$ the time horizon. This matches a previously
known upper bound of the same order and improves upon the best available lower
bound of $\sqrt{K T (\ln N) / (\ln K)}$. For the standard feedback model, we
prove a new instance-based upper bound that depends on the agreement between
the experts and provides a logarithmic improvement compared to prior results.
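
To make the gap between the two lower bounds concrete, their ratio can be worked out directly. The choice $N = K^2$ below is only an illustrative assumption, not a regime singled out by the note.

```latex
\[
\frac{\sqrt{K T \ln(N/K)}}{\sqrt{K T (\ln N)/(\ln K)}}
  = \sqrt{\frac{\ln(N/K)\,\ln K}{\ln N}} .
\]
% Illustrative case N = K^2: \ln(N/K) = \ln K and \ln N = 2 \ln K, so
\[
\sqrt{\frac{\ln K \cdot \ln K}{2 \ln K}} = \sqrt{\frac{\ln K}{2}} ,
\]
% which exceeds 1 as soon as \ln K > 2.
```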

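Under a restricted feedback model, this note proves a worst-case regret lower bound of order $\sqrt{K T \ln(N/K)}$ for bandits with expert advice. The bound matches a previously known upper bound and improves on the best available lower bound of $\sqrt{K T (\ln N) / (\ln K)}$. For the standard feedback model, the note gives a new instance-based upper bound that depends on the agreement between the experts and improves on prior results by a logarithmic factor.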

Improved Regret Bounds for Bandits with Expert Advice

Many real-life contractual relations differ completely from the clean, static
model at the heart of principal-agent theory. Typically, they involve repeated
strategic interactions of the principal and agent, taking place under
uncertainty and over time. Although complex dynamic strategies are appealing in
theory, players seldom use them in practice, often preferring to circumvent
complexity and approach uncertainty through learning. We initiate the study of
repeated contracts with a learning agent, focusing on agents who achieve
no-regret outcomes.
Optimizing against a no-regret agent is a known open problem in general
games; we achieve an optimal solution to this problem for a canonical contract
setting, in which the agent's choice among multiple actions leads to
success/failure. The solution has a surprisingly simple structure: for some
$\alpha > 0$, initially offer the agent a linear contract with scalar $\alpha$,
then switch to offering a linear contract with scalar $0$. This switch causes
the agent to ``free-fall'' through their action space and during this time
provides the principal with non-zero reward at zero cost. Despite apparent
exploitation of the agent, this dynamic contract can leave \emph{both} players
better off compared to the best static contract. Our results generalize beyond
success/failure, to arbitrary non-linear contracts which the principal rescales
dynamically.
Finally, we quantify the dependence of our results on knowledge of the time
horizon, and are the first to address this consideration in the study of
strategizing against learning agents.
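
The two-phase contract described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's construction: the reward on success is normalized to 1, the agent is modeled as a deterministic follow-the-leader learner on cumulative expected utility, and the names `free_fall_schedule` and `simulate` as well as the numbers in the example are hypothetical.

```python
from typing import List, Sequence, Tuple


def free_fall_schedule(alpha: float, switch_round: int, horizon: int) -> List[float]:
    """Linear-contract scalars: alpha for the first switch_round rounds, then 0."""
    if not 0 <= switch_round <= horizon:
        raise ValueError("switch_round must lie in [0, horizon]")
    return [alpha] * switch_round + [0.0] * (horizon - switch_round)


def simulate(schedule: Sequence[float],
             success_prob: Sequence[float],
             cost: Sequence[float]) -> Tuple[List[int], float, float]:
    """Play a schedule against a follow-the-leader agent (a modeling assumption).

    Action i succeeds with probability success_prob[i] and costs the agent
    cost[i]. Under a linear contract with scalar alpha, the agent's expected
    utility is alpha * success_prob[i] - cost[i] and the principal's is
    (1 - alpha) * success_prob[i]. Each round the agent plays the action with
    the highest cumulative expected utility, current round included.
    Returns the action sequence and both players' expected total utility.
    """
    n = len(success_prob)
    cumulative = [0.0] * n
    actions: List[int] = []
    principal_total = 0.0
    agent_total = 0.0
    for alpha in schedule:
        for i in range(n):
            cumulative[i] += alpha * success_prob[i] - cost[i]
        chosen = max(range(n), key=lambda i: cumulative[i])
        actions.append(chosen)
        principal_total += (1.0 - alpha) * success_prob[chosen]
        agent_total += alpha * success_prob[chosen] - cost[chosen]
    return actions, principal_total, agent_total


if __name__ == "__main__":
    # Hypothetical three-action instance: higher effort succeeds more often
    # but costs the agent more.
    probs = [0.0, 0.5, 1.0]
    costs = [0.0, 0.1, 0.4]
    dynamic = simulate(free_fall_schedule(0.7, 50, 100), probs, costs)
    static = simulate([0.61] * 100, probs, costs)
    print("actions after the switch:", dynamic[0][50:])
    print("principal, dynamic vs static:", dynamic[1], static[1])
```

After the switch, the cumulative utility of costly actions erodes round by round, so the simulated agent steps down through its actions one at a time while the principal pays nothing; this is the free-fall behavior in miniature.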

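We study repeated contracts with a no-regret learning agent and give an optimal dynamic contract with a simple structure. This contract can leave both the principal and the agent better off than the best static contract, and we quantify how the results depend on knowledge of the time horizon.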

Contracting with a Learning Agent

Inverse reinforcement learning (IRL) algorithms often rely on (forward)
reinforcement learning or planning over a given time horizon to compute an
approximately optimal policy for a hypothesized reward function and then match
this policy with expert demonstrations. The time horizon plays a critical role
in determining both the accuracy of the reward estimate and the computational
efficiency of IRL algorithms. Interestingly, an effective time horizon shorter
than the ground-truth value often produces better results faster. This work
formally analyzes this phenomenon and provides an explanation: the time horizon
controls the complexity of an induced policy class and mitigates overfitting
with limited data. This analysis leads to a principled choice of the effective
horizon for IRL. It also prompts us to reexamine the classic IRL formulation:
it is more natural to learn the reward and the effective horizon jointly
rather than the reward alone with a given horizon. Our experimental
results confirm the theoretical analysis.
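
The forward planning step is where the horizon enters an IRL loop, and a small sketch shows how the horizon changes the policy it induces. The code below is a generic finite-horizon value iteration on a tabular MDP, offered as an assumption-laden illustration rather than the paper's algorithm; the function name `finite_horizon_policy` and the four-state example are hypothetical, and the sketch does not reproduce the overfitting analysis.

```python
from typing import List, Sequence, Tuple


def finite_horizon_policy(transitions: Sequence[Sequence[Sequence[float]]],
                          reward: Sequence[float],
                          horizon: int) -> Tuple[List[int], List[float]]:
    """Plan for `horizon` steps under a hypothesized state reward.

    transitions[s][a][s2] is the probability of moving from s to s2 under
    action a. Returns the greedy first-step action for every state and the
    corresponding horizon-step values.
    """
    n_states = len(reward)
    value = [0.0] * n_states
    policy = [0] * n_states
    for _ in range(horizon):
        new_value = []
        for s in range(n_states):
            # One-step lookahead using the values from the previous backup.
            q = [reward[s] + sum(p * value[s2] for s2, p in enumerate(row))
                 for row in transitions[s]]
            best = max(range(len(q)), key=q.__getitem__)
            policy[s] = best
            new_value.append(q[best])
        value = new_value
    return policy, value


if __name__ == "__main__":
    # Hypothetical MDP: from state 0, action 0 reaches a nearby small reward
    # (state 1) and action 1 enters a corridor (state 2) leading to a larger
    # reward (state 3). States 1 and 3 are absorbing.
    transitions = [
        [[0, 1, 0, 0], [0, 0, 1, 0]],
        [[0, 1, 0, 0], [0, 1, 0, 0]],
        [[0, 0, 0, 1], [0, 0, 0, 1]],
        [[0, 0, 0, 1], [0, 0, 0, 1]],
    ]
    reward = [0.0, 0.3, 0.0, 1.0]
    for h in (2, 3):
        policy, _ = finite_horizon_policy(transitions, reward, h)
        print("horizon", h, "first action in state 0:", policy[0])
```

In this toy instance the short horizon commits to the nearby reward and the longer one to the distant reward, so the horizon acts as a knob on which policies the planner can produce for a given reward.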

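This work analyzes how the time horizon in inverse reinforcement learning affects the accuracy of the reward estimate and the computational efficiency, and explains why a shorter effective horizon can produce better results faster. It also argues that learning the reward and the effective horizon jointly is more natural than learning the reward alone with a given horizon. Experimental results confirm the theoretical analysis.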

On the Effective Horizon of Inverse Reinforcement Learning

Linear contextual bandits are an important class of sequential decision-making
problems with a wide range of applications to recommender systems, online
advertising, healthcare, and many other machine learning related tasks. Despite
extensive prior research, tight regret bounds for linear contextual
bandits with infinite action sets remain open. In this paper, we address this
open problem by considering the linear contextual bandit with (changing)
infinite action sets. We prove a regret upper bound on the order of
$O(\sqrt{d^2T\log T})\times \text{poly}(\log\log T)$ where $d$ is the domain
dimension and $T$ is the time horizon. Our upper bound matches the previous
lower bound of $\Omega(\sqrt{d^2 T\log T})$ in [Li et al., 2019] up to iterated
logarithmic terms.
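
The two bounds can be placed side by side. The symbol $R_T^\star$ for the minimax regret is notation assumed here for illustration; it does not appear in the abstract.

```latex
\[
\Omega\!\left(\sqrt{d^2 T \log T}\right)
  \;\le\; R_T^\star \;\le\;
O\!\left(\sqrt{d^2 T \log T}\right) \times \mathrm{poly}(\log\log T),
\qquad
\sqrt{d^2 T \log T} = d \sqrt{T \log T} .
\]
% The upper and lower bounds therefore differ only by a factor of
% poly(log log T).
```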

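This paper studies linear contextual bandits, in particular the setting with changing infinite action sets. We prove a regret upper bound that matches the previous lower bound up to iterated logarithmic factors.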