We study reinforcement learning with multinomial logistic (MNL) function
approximation where the underlying transition probability kernel of the Markov
decision processes (MDPs) is parametrized by an unknown transition core with
features of state and action. For the finite horizon episodic setting with
inhomogeneous state transitions, we propose provably efficient algorithms with
randomized exploration having frequentist regret guarantees. For our first
algorithm, $\texttt{RRL-MNL}$, we adapt optimistic sampling to ensure the
optimism of the estimated value function with sufficient frequency and
establish that $\texttt{RRL-MNL}$ is both statistically and computationally
efficient, achieving a $\tilde{O}(\kappa^{-1} d^{\frac{3}{2}} H^{\frac{3}{2}}
\sqrt{T})$ frequentist regret bound with constant-time computational cost per
episode. Here, $d$ is the dimension of the transition core, $H$ is the horizon
length, $T$ is the total number of steps, and $\kappa$ is a problem-dependent
constant. Despite the simplicity and practicality of $\texttt{RRL-MNL}$, its
regret bound scales with $\kappa^{-1}$, which is potentially large in the worst
case. To improve the dependence on $\kappa^{-1}$, we propose
$\texttt{ORRL-MNL}$, which estimates the value function using local gradient
information of the MNL transition model. We show that its frequentist regret
bound is $\tilde{O}(d^{\frac{3}{2}} H^{\frac{3}{2}} \sqrt{T} + \kappa^{-1} d^2
H^2)$. To the best of our knowledge, these are the first randomized RL
algorithms for the MNL transition model that achieve both computational and
statistical efficiency. Numerical experiments demonstrate the superior
performance of the proposed algorithms.
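
To make the MNL transition model concrete, the following is a minimal sketch assuming the standard softmax parametrization, in which the probability of each candidate next state is proportional to $\exp(\varphi(s, a, s')^\top \theta)$. The function name `mnl_transition_probs`, the feature vectors, and the value of `theta` are hypothetical and chosen only for illustration; this is not the estimation procedure of $\texttt{RRL-MNL}$ or $\texttt{ORRL-MNL}$.

```python
import math


def mnl_transition_probs(features, theta):
    """Return softmax transition probabilities over candidate next states.

    features: one feature vector phi(s, a, s') per candidate next state
              (hypothetical values; the paper treats these as given).
    theta:    transition core parameter (unknown in the paper, fixed here).
    """
    # Linear score phi(s, a, s')^T theta for every candidate next state.
    logits = [sum(f_i * t_i for f_i, t_i in zip(f, theta)) for f in features]
    # Subtract the maximum logit for numerical stability before exponentiating.
    m = max(logits)
    weights = [math.exp(z - m) for z in logits]
    total = sum(weights)
    return [w / total for w in weights]


# Three candidate next states with two-dimensional features (illustrative only).
probs = mnl_transition_probs([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]], [0.5, -0.5])
```

The returned list sums to one, and a next state with a larger linear score receives a larger probability.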

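We study reinforcement learning with multinomial logistic (MNL) function approximation, where the underlying transition probability kernel of the Markov decision process (MDP) is parametrized by an unknown transition core with features of state and action. For the finite-horizon episodic setting with inhomogeneous state transitions, we propose provably efficient randomized-exploration algorithms with frequentist regret guarantees.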

Randomized Exploration in Reinforcement Learning with Multinomial Logistic Function Approximation

Randomized Exploration for Reinforcement Learning with Multinomial Logistic Function Approximation

This work advances randomized exploration in reinforcement learning (RL) with
function approximation modeled by linear mixture MDPs. We establish the first
prior-dependent Bayesian regret bound for RL with function approximation and
refine the Bayesian regret analysis for posterior sampling reinforcement
learning (PSRL), presenting an upper bound of ${\mathcal{O}}(d\sqrt{H^3 T \log
T})$, where $d$ represents the dimensionality of the transition kernel, $H$ the
planning horizon, and $T$ the total number of interactions. This improves on
the previous benchmark (Osband and Van Roy, 2014), specialized to linear
mixture MDPs, by an $\mathcal{O}(\sqrt{\log T})$ factor. Our approach,
leveraging a value-targeted model learning
perspective, introduces a decoupling argument and a variance reduction
technique, moving beyond traditional analyses reliant on confidence sets and
concentration inequalities to formalize Bayesian regret bounds more
effectively.
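
As an illustration of the model class, the sketch below assumes one common form of a linear mixture MDP, in which the transition kernel is a weighted combination of known basis kernels, $P_\theta(s' \mid s, a) = \sum_i \theta_i P_i(s' \mid s, a)$. The abstract does not spell out this form, so the function name `mixture_transition_probs`, the basis kernels, and the weights are assumptions made only for this example.

```python
def mixture_transition_probs(basis_probs, theta):
    """Combine known basis kernels into one next-state distribution.

    basis_probs[i][j] is P_i(s'_j | s, a) for a fixed state-action pair
    (hypothetical numbers); theta[i] is the mixture weight of kernel i.
    """
    n_next = len(basis_probs[0])
    return [
        sum(t * p[j] for t, p in zip(theta, basis_probs))
        for j in range(n_next)
    ]


# Two basis kernels over two next states, mixed with weights 0.25 and 0.75.
probs = mixture_transition_probs([[0.9, 0.1], [0.2, 0.8]], [0.25, 0.75])
```

In this form, learning the model reduces to learning the $d$-dimensional weight vector, which is why $d$ appears in the regret bound.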

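This work advances randomized exploration in reinforcement learning with function approximation modeled by linear mixture MDPs. We establish a prior-dependent Bayesian regret bound for RL with function approximation and refine the Bayesian regret analysis of posterior sampling reinforcement learning (PSRL), giving an upper bound of $\mathcal{O}(d\sqrt{H^3 T \log T})$, where $d$ is the dimension of the transition kernel, $H$ the planning horizon, and $T$ the total number of interactions. This improves on the previous benchmark for linear mixture MDPs (Osband and Van Roy, 2014) by an $\mathcal{O}(\sqrt{\log T})$ factor. Our approach adopts a value-targeted model learning perspective and introduces a decoupling argument and a variance reduction technique, moving beyond traditional analyses that rely on confidence sets and concentration inequalities to formalize Bayesian regret bounds more effectively.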

Prior-Dependent Analysis of Posterior Sampling Reinforcement Learning with Function Approximation

Prior-dependent analysis of posterior sampling reinforcement learning with function approximation

Dynamic learning systems subject to selective labeling exhibit censoring,
i.e., persistent negative predictions assigned to one or more subgroups of
points. In applications like consumer finance, this results in groups of
applicants that are persistently denied and thus never enter into the training
data. In this work, we formalize censoring, demonstrate how it can arise, and
highlight difficulties in detection. We consider two safeguards against
censoring, recourse and randomized exploration, both of which ensure we collect labels
for points that would otherwise go unobserved. The resulting techniques allow
examples from censored groups to enter into the training data and correct the
model. Our results highlight the otherwise unmeasured harms of censoring and
demonstrate the effectiveness of mitigation strategies across a range of data
generating processes.
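
The mechanism can be sketched with a toy simulation. This is a minimal illustration under stated assumptions, not the authors' experimental setup: the group names, repayment rates, initial estimates, approval threshold, and the function `simulate` are all hypothetical. A group is approved when its estimated success rate is at least 0.5, labels are observed only for approved applicants, and with probability `epsilon` an applicant is approved regardless of the estimate.

```python
import random


def simulate(epsilon, rounds=2000, seed=0):
    """Selective labeling with optional epsilon-randomized approval."""
    rng = random.Random(seed)
    true_rate = {"A": 0.8, "B": 0.7}  # hypothetical true repayment rates
    # Running estimates; group B starts with a pessimistic estimate of 0.1.
    successes = {"A": 8.0, "B": 1.0}
    trials = {"A": 10.0, "B": 10.0}
    labels_seen = {"A": 0, "B": 0}
    for _ in range(rounds):
        group = rng.choice(["A", "B"])
        estimate = successes[group] / trials[group]
        # Approve on a favorable estimate, or at random with probability epsilon.
        approve = estimate >= 0.5 or rng.random() < epsilon
        if approve:
            # The outcome label is observed only for approved applicants.
            outcome = rng.random() < true_rate[group]
            successes[group] += outcome
            trials[group] += 1
            labels_seen[group] += 1
    estimates = {g: successes[g] / trials[g] for g in trials}
    return labels_seen, estimates


censored_labels, _ = simulate(epsilon=0.0)
explored_labels, explored_estimates = simulate(epsilon=0.1)
```

With `epsilon=0.0`, group B is never approved, so its pessimistic estimate is never corrected. With a positive `epsilon`, some labels from group B enter the training data and the estimate can move toward the true rate.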

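In dynamic learning systems, selective labeling leads to censoring, i.e., persistent negative predictions assigned to one or more subgroups. We formalize censoring, show how it can arise, and highlight the difficulty of detecting it. We consider two safeguards against censoring, recourse and randomized exploration, both of which ensure that we collect labels for points that would otherwise go unobserved. The resulting techniques allow examples from censored groups to enter the training data and correct the model. Our results highlight the otherwise unmeasured harms of censoring and demonstrate the effectiveness of mitigation strategies across a range of data-generating processes.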