We consider the task of estimating a structural model of dynamic decisions by
a human agent based upon the observable history of implemented actions and
visited states. This problem has an inherent nested structure: in the inner
problem, an optimal policy for a given reward function is identified while in
the outer problem, a measure of fit is maximized. Several approaches have been
proposed to alleviate the computational burden of this nested-loop structure,
but these methods still suffer from high complexity when the state space is
either discrete with large cardinality or continuous in high dimensions. Other
approaches in the inverse reinforcement learning (IRL) literature emphasize
policy estimation at the expense of reduced reward estimation accuracy. In this
paper, we propose a single-loop estimation algorithm with finite-time guarantees
that is equipped to deal with high-dimensional state spaces without
compromising reward estimation accuracy. In the proposed algorithm, each policy
improvement step is followed by a stochastic gradient step for likelihood
maximization. We show that the proposed algorithm converges to a stationary
solution with a finite-time guarantee. Further, if the reward is parameterized
linearly, we show that the algorithm converges to the maximum likelihood
estimator at a sublinear rate. Finally, by using robotics control problems in MuJoCo
and their transfer settings, we show that the proposed algorithm achieves
superior performance compared with other IRL and imitation learning benchmarks.
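
The single-loop idea described above, one policy improvement step followed by one stochastic gradient step on the likelihood, can be sketched on a toy tabular problem. This is a minimal illustration under stated assumptions, not the authors' implementation: the three-state dynamics in `next_state`, the tabular (one-hot, hence linear) reward `theta`, the use of a single soft Bellman backup as the policy improvement step, and all names and hyperparameters are hypothetical. The gradient estimate is the standard difference between the discounted feature counts of one expert trajectory and one trajectory sampled from the current policy.

```python
import math
import random

GAMMA = 0.9
N_STATES, N_ACTIONS, HORIZON = 3, 2, 20  # hypothetical toy problem sizes


def next_state(s, a):
    # Assumed deterministic dynamics: action 0 stays put, action 1 moves on.
    return s if a == 0 else (s + 1) % N_STATES


def soft_value(row):
    # Numerically stable log-sum-exp over the actions of one state.
    m = max(row)
    return m + math.log(sum(math.exp(x - m) for x in row))


def soft_backup(q, theta):
    # One soft Bellman backup: Q(s,a) = r(s,a) + gamma * logsumexp_a' Q(s',a').
    return [[theta[s][a] + GAMMA * soft_value(q[next_state(s, a)])
             for a in range(N_ACTIONS)] for s in range(N_STATES)]


def softmax_policy(q):
    # pi(a|s) proportional to exp(Q(s,a)).
    return [[math.exp(x - soft_value(row)) for x in row] for row in q]


def rollout(policy, rng, s=0):
    # Sample one trajectory of (state, action) pairs from the given policy.
    traj = []
    for _ in range(HORIZON):
        a = rng.choices(range(N_ACTIONS), weights=policy[s])[0]
        traj.append((s, a))
        s = next_state(s, a)
    return traj


def single_loop_irl(expert_trajs, iters=200, lr=0.05, seed=0):
    rng = random.Random(seed)
    theta = [[0.0] * N_ACTIONS for _ in range(N_STATES)]  # reward parameters
    q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    for _ in range(iters):
        # Policy improvement: a single backup, not a full inner-loop solve.
        q = soft_backup(q, theta)
        policy = softmax_policy(q)
        # Stochastic gradient step: expert minus agent discounted counts.
        expert = rng.choice(expert_trajs)
        agent = rollout(policy, rng)
        for t in range(HORIZON):
            (es, ea), (gs, ga) = expert[t], agent[t]
            theta[es][ea] += lr * GAMMA ** t
            theta[gs][ga] -= lr * GAMMA ** t
    # The returned policy comes from the last backup, before the final update.
    return theta, softmax_policy(q)
```

As a usage example, `single_loop_irl([[(t % N_STATES, 1) for t in range(HORIZON)]])` runs the loop on one demonstration in which the expert always takes action 1. Each expert trajectory must contain `HORIZON` state-action pairs. The point of the design is that the reward update never waits for the inner problem to be solved to optimality.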

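This paper proposes a single-loop estimation algorithm that handles high-dimensional state spaces without compromising reward estimation accuracy. In the algorithm, each policy improvement step is followed by a stochastic gradient step for likelihood maximization. The algorithm is shown to converge to a stationary solution with a finite-time guarantee, and on MuJoCo robotics control problems and their transfer settings it outperforms other inverse reinforcement learning and imitation learning benchmarks.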

Structural Estimation of Markov Decision Processes in High-Dimensional State Space with Finite-Time Guarantees

In simultaneous machine translation, the objective is to determine when to
produce a partial translation given a continuous stream of source words, with a
trade-off between latency and quality. We propose a neural machine translation
(NMT) model that dynamically decides when to continue reading input or to
generate output words. The model is composed of two main components: one to
dynamically decide on ending a source chunk, and another that translates the
consumed chunk. We train the components jointly and in a manner consistent with
the inference conditions. To generate chunked training data, we propose a
method that utilizes word alignment while also preserving enough context. We
compare models with bidirectional and unidirectional encoders of different
depths, both on real speech and text input. Our results on the IWSLT 2020
English-to-German task outperform a wait-k baseline by 2.6 to 3.7% BLEU
absolute.
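
For context on the baseline, a wait-k policy follows a fixed schedule: it first reads k source words, then alternates between writing one target word and reading one source word, and writes the remaining target words once the source is exhausted. The sketch below generates that READ/WRITE schedule. It is a minimal illustration, not the paper's code: the function name `wait_k_schedule` is hypothetical, and the target length is assumed to be known in advance, whereas a real decoder stops at an end-of-sentence token.

```python
def wait_k_schedule(num_source, num_target, k):
    """Return the READ/WRITE action sequence of a wait-k policy."""
    actions = []
    read = written = 0
    while written < num_target:
        # Target word t (0-based) needs min(k + t, num_source) source words.
        if read < min(k + written, num_source):
            actions.append("READ")
            read += 1
        else:
            actions.append("WRITE")
            written += 1
    return actions
```

With five source words, five target words, and k = 2, `wait_k_schedule(5, 5, 2)` yields READ, READ, WRITE, READ, WRITE, READ, WRITE, READ, WRITE, WRITE. The schedule is independent of the input content, which is the contrast with a model that decides dynamically when to end a source chunk.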

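This work proposes a neural machine translation model that dynamically decides when to continue reading source input or to generate output words. Chunked training data is generated with a word-alignment-based method, models with unidirectional and bidirectional encoders are compared on real speech and text input, and the results on the IWSLT 2020 English-to-German task outperform a wait-k baseline by 2.6 to 3.7% BLEU absolute.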