Paraphrasing is expressing the meaning of an input sentence in different
wording while maintaining fluency (i.e., grammatical and syntactical
correctness). Most existing work on paraphrasing uses supervised models that are
limited to specific domains (e.g., image captions). Such models can neither be
straightforwardly transferred to other domains nor generalize well, and
creating labeled training data for new domains is expensive and laborious. The
need for paraphrasing across different domains and the scarcity of labeled
training data in many such domains call for exploring unsupervised paraphrase
generation methods. We propose Progressive Unsupervised Paraphrasing (PUP): a
novel unsupervised paraphrase generation method based on deep reinforcement
learning (DRL). PUP uses a variational autoencoder (trained using a
non-parallel corpus) to generate a seed paraphrase that warm-starts the DRL
model. Then, PUP progressively tunes the seed paraphrase guided by our novel
reward function, which combines semantic adequacy, language fluency, and
expression diversity measures to quantify the quality of the generated
paraphrases in each iteration without needing parallel sentences. Our extensive
experimental evaluation shows that PUP outperforms unsupervised
state-of-the-art paraphrasing techniques in terms of both automatic metrics and
user studies on four real datasets. We also show that PUP outperforms
domain-adapted supervised algorithms on several datasets. Our evaluation further
shows that PUP achieves a favorable trade-off between semantic similarity and
diversity of expression.
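
To make the idea of a combined reward concrete, here is a minimal sketch in Python. It is not PUP's actual reward: the abstract does not say how the three measures are computed or combined, so the weighted sum, the equal default weights, and every function name below are assumptions chosen for illustration. Each component is a toy stand-in (bag-of-words cosine for semantic adequacy, a repeated-word check for fluency, and one minus Jaccard overlap for diversity) where a real system would presumably use learned models.

```python
import math
from collections import Counter


def tokens(sentence):
    """Lowercase whitespace tokenization (a deliberate simplification)."""
    return sentence.lower().split()


def semantic_adequacy(source, candidate):
    """Bag-of-words cosine similarity; a toy stand-in for a learned similarity model."""
    a, b = Counter(tokens(source)), Counter(tokens(candidate))
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def fluency(candidate):
    """Share of adjacent token pairs that are not immediate repeats.

    A crude proxy only; a language-model score would be the natural choice.
    """
    toks = tokens(candidate)
    if len(toks) < 2:
        return 1.0 if toks else 0.0
    non_repeats = sum(1 for x, y in zip(toks, toks[1:]) if x != y)
    return non_repeats / (len(toks) - 1)


def diversity(source, candidate):
    """One minus the Jaccard overlap of the two token sets."""
    a, b = set(tokens(source)), set(tokens(candidate))
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def combined_reward(source, candidate, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted sum of the three measures (the weighting scheme is an assumption)."""
    w_sem, w_flu, w_div = weights
    return (
        w_sem * semantic_adequacy(source, candidate)
        + w_flu * fluency(candidate)
        + w_div * diversity(source, candidate)
    )


def best_candidate(source, candidates):
    """Pick the highest-reward candidate; no parallel sentences are needed."""
    return max(candidates, key=lambda c: combined_reward(source, c))
```

Note that an exact copy of the input scores zero on diversity, while an unrelated sentence scores zero on semantic adequacy, so neither extreme maximizes the sum. This is the tension between similarity and diversity that the abstract describes.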

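This work proposes a progressive unsupervised paraphrasing method based on deep
reinforcement learning. It uses a variational autoencoder to generate a seed
paraphrase and then progressively tunes that seed under the guidance of a novel
reward function, yielding high-quality paraphrases across different domains.
Results on four datasets show that the method outperforms current
state-of-the-art supervised and unsupervised techniques in both automatic
metrics and user studies.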

Paraphrase Reconstruction via Unsupervised Deep Reinforcement Learning

Unsupervised Paraphrasing via Deep Reinforcement Learning

As humans, we often rely on language to learn language. For example, when
corrected in a conversation, we may learn from that correction, over time
improving our language fluency. Inspired by this observation, we propose a
learning algorithm for training semantic parsers from supervision (feedback)
expressed in natural language. Our algorithm learns a semantic parser from
users' corrections such as "no, what I really meant was before his job, not
after", while simultaneously learning to parse this natural language feedback
in order to leverage it as a form of supervision. Unlike supervision with
gold-standard logical forms, our method does not require the user to be
familiar with the underlying logical formalism, and unlike supervision from
denotation, it does not require the user to know the correct answer to their
query. This makes our learning algorithm naturally scalable in settings where
existing conversational logs are available and can be leveraged as training
data. We construct a novel dataset of natural language feedback in a
conversational setting, and show that our method is effective at learning a
semantic parser from such natural language supervision.

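We develop a learning algorithm that trains a semantic parser from natural
language feedback. To make it naturally scalable, the algorithm uses existing
natural language data, such as user corrections and conversational logs, as its
supervision signal. Compared with supervision from rigorous logical forms or
specific answers, it accommodates users who are not familiar with the
underlying formalism. The work also constructs a conversational dataset of
natural language feedback and shows that the method is effective at learning a
semantic parser from such natural language supervision.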