Dialogue response generation requires an agent to generate a response
conditioned on the current dialogue history. Two-party dialogues have been
well studied in this setting, but a large gap remains for multi-party
dialogues. Unlike two-party dialogues, where each response is a direct reply
to the previous utterance, a response in the multi-party scenario must have
its addressee specified before it is generated. Thanks to the huge amount of
two-party conversational
data, various pre-trained language models for two-party dialogue response
generation have been proposed. However, due to the lack of annotated addressee
labels in multi-party dialogue datasets, it is hard to use them to pre-train a
response generation model for multi-party dialogues. To tackle this obstacle,
we propose an Expectation-Maximization (EM) approach that alternates between
expectation steps, which generate addressee labels, and maximization steps,
which optimize a response generation model. Theoretical analyses and extensive
experiments demonstrate the feasibility and effectiveness of the proposed
method.
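
The alternation described above can be illustrated with a minimal sketch. It is not the paper's model: it assumes a toy tabular response model `theta[u][r]`, a uniform prior over addressees, and soft (posterior-weighted) labels. The function name `em_addressee` and the toy utterance labels are hypothetical.

```python
def em_addressee(dialogues, responses, n_iters=20, smoothing=1e-3):
    """Soft EM over latent addressees with a toy tabular response model.

    dialogues: list of (history, response) pairs; history is a list of
    utterance labels and the latent addressee is an index into it.
    """
    utterances = {u for history, _ in dialogues for u in history}
    # theta[u][r] approximates p(response r | addressed utterance u).
    theta = {u: {r: 1.0 / len(responses) for r in responses} for u in utterances}
    posteriors = []
    for _ in range(n_iters):
        counts = {u: {r: 0.0 for r in responses} for u in utterances}
        posteriors = []
        # E-step: posterior over the addressee, assuming a uniform prior.
        for history, response in dialogues:
            scores = [theta[u][response] for u in history]
            total = sum(scores)
            post = [s / total for s in scores]
            posteriors.append(post)
            for u, p in zip(history, post):
                counts[u][response] += p
        # M-step: re-estimate the response model from the expected counts.
        for u in utterances:
            norm = sum(counts[u].values()) + smoothing * len(responses)
            theta[u] = {r: (counts[u][r] + smoothing) / norm for r in responses}
    return theta, posteriors


# Toy data (hypothetical): no addressee labels are given anywhere.
toy_dialogues = [
    (["greeting", "time_question"], "greeting_reply"),
    (["greeting"], "greeting_reply"),
    (["time_question"], "time_answer"),
    (["time_question", "greeting"], "time_answer"),
]
theta, posteriors = em_addressee(toy_dialogues, ["greeting_reply", "time_answer"])
```

In this toy setting the unambiguous single-utterance dialogues break the symmetry, so the posterior for the ambiguous dialogues concentrates on the matching utterance. A neural response generator would replace the count-based M-step with gradient updates.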

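This paper proposes a method based on the EM (Expectation-Maximization) algorithm for optimizing multi-party dialogue response generation models, addressing the lack of annotated addressee labels in multi-party dialogue data.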

EM Pre-training for Multi-party Dialogue Response Generation

Safe reinforcement learning (RL) aims to learn policies that satisfy certain
constraints before deploying them to safety-critical applications. Previous
primal-dual style approaches suffer from instability issues and lack optimality
guarantees. This paper overcomes these issues from the perspective of
probabilistic inference. We introduce a novel Expectation-Maximization approach
to naturally incorporate constraints during policy learning: 1) a provably
optimal non-parametric variational distribution can be computed in closed
form after a convex optimization (E-step); 2) the policy parameters are improved
within the trust region based on the optimal variational distribution (M-step).
The proposed algorithm decomposes the safe RL problem into a convex
optimization phase and a supervised learning phase, which yields a more stable
training performance. A wide range of experiments on continuous robotic tasks
shows that the proposed method achieves significantly better constraint
satisfaction performance and better sample efficiency than baselines. The code
is available at this https URL.
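
The two-phase decomposition can be sketched for a single state with discrete actions. Everything specific here is an assumption rather than something stated in the abstract: the closed form q(a) ∝ π_old(a) exp((Q_r(a) − λ Q_c(a)) / η) and its dual follow from the standard Lagrangian of a KL-constrained, cost-constrained objective; a grid search stands in for a proper convex solver; and a few small gradient steps stand in for the trust-region constraint. All function names and numbers are hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def e_step(pi_old, q_reward, q_cost, cost_limit, kl_limit):
    """Minimise the convex dual on a grid, then return the closed-form q."""
    def dual(eta, lam):
        z = [(r - lam * c) / eta for r, c in zip(q_reward, q_cost)]
        m = max(z)  # shift for a numerically stable log-sum-exp
        lse = m + math.log(sum(p * math.exp(v - m) for p, v in zip(pi_old, z)))
        return lam * cost_limit + eta * kl_limit + eta * lse

    etas = [0.05 * i for i in range(1, 101)]  # temperature, eta > 0
    lams = [0.05 * j for j in range(0, 101)]  # cost multiplier, lambda >= 0
    eta, lam = min(((e, l) for e in etas for l in lams),
                   key=lambda pair: dual(*pair))
    z = [(r - lam * c) / eta for r, c in zip(q_reward, q_cost)]
    m = max(z)
    weights = [p * math.exp(v - m) for p, v in zip(pi_old, z)]
    total = sum(weights)
    return [w / total for w in weights]


def m_step(logits, q, lr=0.5, steps=20):
    """Supervised fit: gradient descent on cross-entropy between q and the policy."""
    for _ in range(steps):
        pi = softmax(logits)
        # Gradient of the cross-entropy with respect to the logits is pi - q.
        logits = [z - lr * (p - t) for z, p, t in zip(logits, pi, q)]
    return logits


# Hypothetical numbers: the most rewarding action is also the most costly.
pi_old = [1 / 3, 1 / 3, 1 / 3]
q_reward = [1.0, 2.0, 3.0]
q_cost = [0.0, 0.5, 2.0]
q = e_step(pi_old, q_reward, q_cost, cost_limit=0.5, kl_limit=0.1)
logits = m_step([0.0, 0.0, 0.0], q)
```

The E-step involves no policy gradients at all, which is the source of the stability the abstract attributes to the decomposition; the M-step is ordinary supervised learning toward the target distribution `q`.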

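This work introduces a new Expectation-Maximization method and approaches the problem from the perspective of probabilistic inference, decomposing the safe reinforcement learning problem into two phases: convex optimization and supervised learning. This yields more stable and more efficient learning, with significant improvements in constraint satisfaction and sample efficiency across a wide range of experiments on continuous robotic tasks.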