Contextual bandit with linear reward functions is one of the most
extensively studied models in bandit and online learning research. Recently,
there has been increasing interest in designing \emph{locally private} linear
contextual bandit algorithms, where sensitive information contained in contexts
and rewards is protected against leakage to the general public. While the
classical linear contextual bandit algorithm admits cumulative regret upper
bounds of $\tilde O(\sqrt{T})$ via multiple alternative methods, it has
remained open whether such regret bounds are attainable in the presence of
local privacy constraints, with the state-of-the-art result being $\tilde
O(T^{3/4})$. In this paper, we show that it is indeed possible to achieve an
$\tilde O(\sqrt{T})$ regret upper bound for locally private linear contextual
bandit. Our solution relies on several new algorithmic and analytical ideas,
including the analysis of mean absolute deviation errors and a layered principal
component regression designed to keep those errors small.
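
To make the local privacy constraint concrete, the following is a minimal sketch of a generic local randomizer, not the algorithm proposed in the paper. It assumes a Gaussian perturbation of the statistic `reward * context` computed on the user's side; the function names, the clipping bound `max_norm`, and the noise scale `sigma` are all hypothetical, and calibrating `sigma` to a formal privacy level is left out.

```python
import math
import random


def clip_l2(v, max_norm):
    """Scale v down so that its Euclidean norm is at most max_norm."""
    norm = math.sqrt(sum(a * a for a in v))
    if norm <= max_norm or norm == 0.0:
        return list(v)
    return [a * max_norm / norm for a in v]


def privatize(context, reward, sigma, rng, max_norm=1.0):
    """Return a noisy copy of the statistic reward * context.

    This runs on the user's side, so only the noisy vector is released.
    sigma is a hypothetical noise scale (assumption of this sketch).
    """
    x = clip_l2(context, max_norm)
    r = max(-1.0, min(1.0, reward))  # keep the reward bounded in [-1, 1]
    return [r * a + rng.gauss(0.0, sigma) for a in x]


def average_reports(reports):
    """Server side: average the noisy reports coordinate-wise."""
    n = len(reports)
    return [sum(col) / n for col in zip(*reports)]


if __name__ == "__main__":
    rng = random.Random(0)
    reports = [privatize([0.6, 0.8], 1.0, 1.0, rng) for _ in range(20000)]
    print(average_reports(reports))  # noisy estimate of [0.6, 0.8]
```

Each individual report is dominated by noise, while the server-side average concentrates around the true statistic as the number of reports grows. This tension between per-user noise and aggregate accuracy is what makes the regret question under local privacy nontrivial.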

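By analyzing mean absolute deviation errors and using layered principal component regression, we present a solution that achieves an $\tilde O(\sqrt{T})$ cumulative regret upper bound for locally private linear contextual bandit.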

On the Optimal Regret of Locally Private Linear Contextual Bandit

In this paper, we take a step towards a deeper understanding of learning from
human preferences by systematically comparing the paradigm of reinforcement
learning from human feedback (RLHF) with the recently proposed paradigm of
direct preference optimization (DPO). We focus our attention on the class of
loglinear policy parametrization and linear reward functions. In order to
compare the two paradigms, we first derive minimax statistical bounds on the
suboptimality gap induced by both RLHF and DPO, assuming access to an oracle
that exactly solves the optimization problems. We provide a detailed discussion
on the relative comparison between the two paradigms, simultaneously taking
into account the sample size, policy and reward class dimensions, and the
regularization temperature. Moreover, we extend our analysis to the approximate
optimization setting and derive exponentially decaying convergence rates for
both RLHF and DPO. Next, we analyze the setting where the ground-truth reward
is not realizable and find that, while RLHF incurs a constant additional error,
DPO retains its asymptotically decaying gap by just tuning the temperature
accordingly. Finally, we extend our comparison to the Markov decision process
setting, where we generalize our results with exact optimization. To the best
of our knowledge, we are the first to provide such a comparative analysis for
RLHF and DPO.
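
As an illustration of the two objectives being compared, here is a minimal sketch of the per-sample losses for the classes named above, assuming the standard Bradley-Terry preference model. The feature maps (`phi` for the reward, `psi` for the policy), the reference parameter `theta_ref`, and the name `beta` for the regularization temperature are assumptions of this sketch rather than the paper's notation. For a log-linear policy with $\pi_\theta(y \mid x) \propto \exp(\theta^\top \psi(x, y))$, the log-partition terms cancel in the difference between the preferred and rejected responses, so the DPO loss depends only on $(\theta - \theta_{\mathrm{ref}})^\top (\psi_w - \psi_l)$.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(1 / (1 + exp(-z)))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def reward_model_loss(omega, phi_win, phi_lose):
    """Bradley-Terry negative log-likelihood for a linear reward omega . phi.

    This is the reward-learning step of RLHF for one preference pair.
    """
    diff = [a - b for a, b in zip(phi_win, phi_lose)]
    return -log_sigmoid(dot(omega, diff))


def dpo_loss(theta, theta_ref, psi_win, psi_lose, beta):
    """DPO loss for a log-linear policy, for one preference pair.

    The partition functions of the policy and of the reference policy
    cancel, leaving beta * (theta - theta_ref) . (psi_win - psi_lose).
    """
    diff = [a - b for a, b in zip(psi_win, psi_lose)]
    shift = [a - b for a, b in zip(theta, theta_ref)]
    return -log_sigmoid(beta * dot(shift, diff))


if __name__ == "__main__":
    print(reward_model_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0]))
    print(dpo_loss([1.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0], beta=0.5))
```

In this sketch RLHF would first minimize `reward_model_loss` over `omega` and then optimize a policy against the learned reward, whereas DPO minimizes `dpo_loss` over `theta` directly.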

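By systematically comparing the paradigm of reinforcement learning from human feedback (RLHF) with the recently proposed paradigm of direct preference optimization (DPO), we take a step towards a deeper understanding of learning from human preferences. We focus on the class of loglinear policy parametrization and linear reward functions.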

Reward Model Learning vs. Direct Policy Optimization: A Comparative Analysis of Learning from Human Preferences

Designers of AI agents often iterate on the reward function in a
trial-and-error process until they get the desired behavior, but this only
guarantees good behavior in the training environment. We propose structuring
this process as a series of queries asking the user to compare
different reward functions. Thus we can actively select queries for maximum
informativeness about the true reward. In contrast to approaches asking the
designer for optimal behavior, this allows us to gather additional information
by eliciting preferences between suboptimal behaviors. After each query, we
need to update the posterior over the true reward function from observing the
proxy reward function chosen by the designer. The recently proposed Inverse
Reward Design (IRD) enables this. Our approach substantially outperforms IRD in
test environments. In particular, it can query the designer about
interpretable, linear reward functions and still infer non-linear ones.
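
The posterior update after a query can be sketched as follows. This is an illustrative simplification, not the paper's implementation: it assumes a finite set of candidate true rewards with linear weights, summarizes each proxy reward in the query by the feature counts of the behavior it induces in the training environment, and models the designer as choosing option `i` with probability proportional to `exp(beta * true_w . features_i)`. All names and the parameter `beta` are hypothetical.

```python
import math


def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def choice_likelihood(true_w, query_features, chosen, beta=1.0):
    """Probability that the designer picks option `chosen` from the query.

    Assumed choice model: softmax over the true return of the behavior
    induced by each proxy reward in the query.
    """
    scores = [beta * dot(true_w, f) for f in query_features]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    return weights[chosen] / sum(weights)


def update_posterior(prior, candidates, query_features, chosen, beta=1.0):
    """Bayes update over candidate true rewards after one designer answer."""
    unnormalized = [
        p * choice_likelihood(w, query_features, chosen, beta)
        for p, w in zip(prior, candidates)
    ]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


if __name__ == "__main__":
    candidates = [[1.0, 0.0], [0.0, 1.0]]      # hypothetical true rewards
    query_features = [[1.0, 0.0], [0.0, 1.0]]  # behavior of each proxy option
    print(update_posterior([0.5, 0.5], candidates, query_features, chosen=0))
```

Under these assumptions, an active query selector would score each candidate query by how much the resulting posterior is expected to change and ask the most informative one.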

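By interacting with the user and actively selecting the queries that are most informative about the true reward, we structure the iterative design of an AI agent's reward function. Our method outperforms Inverse Reward Design and can infer non-linear reward functions while querying the designer only about interpretable, linear ones.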