We study model-based reinforcement learning with non-linear function approximation, where the transition function of the underlying Markov decision process (MDP) is given by a multinomial logistic (MNL) model. We develop two algorithms for the infinite-horizon average-reward setting. Our first algorithm, \texttt{UCRL2-MNL}, applies to the class of communicating MDPs and achieves $\tilde{\mathcal{O}}(dD\sqrt{T})$ regret, where $d$ is the dimension of the feature mapping, $D$ is the diameter of the underlying MDP, and $T$ is the horizon. The second algorithm, \texttt{OVIFH-MNL}, is computationally more efficient and applies to the more general class of weakly communicating MDPs, for which we show a regret guarantee of $\tilde{\mathcal{O}}(d^{2/5} \mathrm{sp}(v^*)T^{4/5})$, where $\mathrm{sp}(v^*)$ is the span of the associated optimal bias function. We also prove a lower bound of $\Omega(d\sqrt{DT})$ for learning communicating MDPs with MNL transitions and diameter at most $D$. Furthermore, we show a regret lower bound of $\Omega(dH^{3/2}\sqrt{K})$ for learning $H$-horizon episodic MDPs with MNL function approximation, where $K$ is the number of episodes; this improves upon the best-known lower bound for the finite-horizon setting.
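
To make the MNL transition model concrete, the following is a minimal sketch, assuming the common softmax parameterization $P(s' \mid s, a) \propto \exp(\varphi(s, a, s')^\top \theta)$ over the candidate next states. The function name `mnl_transition_probs`, the toy feature vectors, and the parameter `theta` are hypothetical illustrations and are not taken from the paper.

```python
import math
from typing import List, Sequence


def mnl_transition_probs(
    features: Sequence[Sequence[float]], theta: Sequence[float]
) -> List[float]:
    """Return softmax transition probabilities for one (state, action) pair.

    features[i] is an assumed d-dimensional feature vector phi(s, a, s'_i)
    for the i-th candidate next state; theta is the assumed d-dimensional
    transition parameter.
    """
    # Linear score phi(s, a, s')^T theta for each candidate next state.
    logits = [sum(f * t for f, t in zip(phi, theta)) for phi in features]
    # Subtract the maximum logit before exponentiating for numerical stability.
    max_logit = max(logits)
    weights = [math.exp(z - max_logit) for z in logits]
    total = sum(weights)
    return [w / total for w in weights]


# Toy example (assumed values): d = 2 features, three candidate next states.
features = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
theta = [0.5, -0.5]
probs = mnl_transition_probs(features, theta)
```

In this toy instance the linear scores are 0.5, -0.5, and 0, so the first next state receives the largest probability and the second the smallest; the probabilities sum to one by construction.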

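Reinforcement Learning for Infinite-Horizon Average-Reward Markov Decision Processes with Multinomial Logistic Function Approximation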