Learning from human feedback plays an important role in aligning generative
models, such as large language models (LLMs). However, the effectiveness of this
approach can be influenced by adversaries, who may intentionally provide
misleading preferences to manipulate the output in an undesirable or harmful
direction. To tackle this challenge, we study a specific model within this
problem domain: contextual dueling bandits with adversarial feedback, where the
true preference label can be flipped by an adversary. We propose an algorithm,
robust contextual dueling bandit (RCDB), which is based on
uncertainty-weighted maximum likelihood estimation. Our algorithm achieves an
$\tilde O(d\sqrt{T}+dC)$ regret bound, where $T$ is the number of rounds, $d$
is the dimension of the context, and $0 \le C \le T$ is the total number of
adversarially flipped preference labels. We also prove a lower bound to show
that our regret bound
is nearly optimal, both in scenarios with and without ($C=0$) adversarial
feedback. Additionally, we conduct experiments to evaluate our proposed
algorithm against various types of adversarial feedback. Experimental results
demonstrate its superiority over the state-of-the-art dueling bandit algorithms
in the presence of adversarial feedback.
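
As an illustration of uncertainty-weighted maximum likelihood estimation, the
following Python sketch down-weights comparisons whose feature difference is
still highly uncertain, then fits a regularised logistic preference model by
gradient descent. The abstract does not state the weighting rule, so the form
$w_i = \min\{1, \alpha / \|x_i\|_{\Sigma_i^{-1}}\}$, the logistic link, and
every name and default here (`uncertainty_weights`, `weighted_mle`, `alpha`,
`lam`, `lr`, `steps`) are assumptions made for this example, not the paper's
implementation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def mat_vec(matrix, v):
    return [dot(row, v) for row in matrix]


def uncertainty_weights(diffs, alpha=0.5, lam=1.0):
    """Assumed rule: w_i = min(1, alpha / ||x_i||_{Sigma_i^{-1}}).

    diffs[i] is the feature difference phi(a_i) - phi(b_i) of round i.
    Sigma starts at lam * I and grows by w_i * x_i x_i^T after each round.
    """
    d = len(diffs[0])
    sigma_inv = [[1.0 / lam if i == j else 0.0 for j in range(d)] for i in range(d)]
    weights = []
    for x in diffs:
        u = mat_vec(sigma_inv, x)                 # u = Sigma^{-1} x
        norm = math.sqrt(max(dot(x, u), 0.0))     # ||x||_{Sigma^{-1}}
        w = 1.0 if norm <= alpha else alpha / norm
        weights.append(w)
        # Sherman-Morrison update of Sigma^{-1} for Sigma <- Sigma + w x x^T.
        denom = 1.0 + w * dot(x, u)
        for i in range(d):
            for j in range(d):
                sigma_inv[i][j] -= w * u[i] * u[j] / denom
    return weights


def weighted_mle(diffs, labels, weights, lam=1.0, lr=0.1, steps=2000):
    """Gradient descent on the weighted, L2-regularised logistic loss.

    labels[i] is 1 if the first action of round i was preferred, else 0.
    """
    d = len(diffs[0])
    theta = [0.0] * d
    for _ in range(steps):
        grad = [lam * t for t in theta]
        for x, o, w in zip(diffs, labels, weights):
            err = sigmoid(dot(x, theta)) - o
            for k in range(d):
                grad[k] += w * err * x[k]
        theta = [t - lr * g for t, g in zip(theta, grad)]
    return theta


if __name__ == "__main__":
    diffs = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
    labels = [1, 1, 0, 1]  # hypothetical preference labels
    w = uncertainty_weights(diffs)
    print("weights:", w)
    print("theta:", weighted_mle(diffs, labels, w))
```

Under this assumed rule, a comparison in a poorly explored direction receives a
weight below 1, which limits how far a single flipped label can move the
estimate.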

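Using a robust contextual dueling bandit algorithm designed for adversarial feedback, this study addresses the alignment of large language models through learning from human feedback, and proves that the algorithm attains a nearly optimal regret bound both with and without adversarial feedback. Experiments across various types of adversarial feedback also show that the algorithm outperforms existing dueling bandit algorithms.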

Nearly Optimal Algorithms for Contextual Dueling Bandits from Adversarial Feedback

Modern text-to-speech synthesis pipelines typically involve multiple
processing stages, each of which is designed or learnt independently from the
rest. In this work, we take on the challenging task of learning to synthesise
speech from normalised text or phonemes in an end-to-end manner, resulting in
models which operate directly on character or phoneme input sequences and
produce raw speech audio outputs. Our proposed generator is feed-forward and
thus efficient for both training and inference, using a differentiable
alignment scheme based on token length prediction. It learns to produce
high-fidelity audio through a combination of adversarial feedback and prediction
losses constraining the generated audio to roughly match the ground truth in
terms of its total duration and mel-spectrogram. To allow the model to capture
temporal variation in the generated audio, we employ soft dynamic time warping
in the spectrogram-based prediction loss. The resulting model achieves a mean
opinion score exceeding 4 on a 5-point scale, which is comparable to that of
state-of-the-art models relying on multi-stage training and additional
supervision.
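
To illustrate how soft dynamic time warping tolerates timing differences
between a generated and a reference spectrogram, here is a minimal pure-Python
sketch of the standard soft-DTW recursion, in which the hard minimum over
alignment paths is replaced by a smooth soft-minimum. The L1 frame distance,
the smoothing value `gamma`, and the function names are assumptions for this
example; the abstract does not specify them, and any constraints or penalties
the authors place on the warping path are not reproduced here.

```python
import math


def soft_min(values, gamma):
    """Smooth minimum: -gamma * log(sum(exp(-v / gamma))), computed stably."""
    m = min(values)
    if m == float("inf"):
        return m
    return m - gamma * math.log(sum(math.exp(-(v - m) / gamma) for v in values))


def soft_dtw(a, b, gamma=0.01):
    """Soft-DTW discrepancy between two sequences of feature frames.

    a and b are lists of equal-width frames (for example spectrogram columns).
    The per-frame cost is the L1 distance, an assumption for this sketch.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # r[i][j] is the soft cost of aligning the first i frames of a
    # with the first j frames of b.
    r = [[inf] * (m + 1) for _ in range(n + 1)]
    r[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = sum(abs(p - q) for p, q in zip(a[i - 1], b[j - 1]))
            r[i][j] = cost + soft_min(
                (r[i - 1][j], r[i][j - 1], r[i - 1][j - 1]), gamma
            )
    return r[n][m]


if __name__ == "__main__":
    reference = [[0.0], [1.0], [2.0]]
    delayed = [[0.0], [0.0], [1.0], [2.0]]  # same content, one repeated frame
    print("delayed vs reference:", soft_dtw(delayed, reference))
    print("mismatch:", soft_dtw([[0.0]] * 3, [[5.0]] * 3))
```

Because the recursion uses only sums, exponentials, and logarithms, the result
is differentiable with respect to the frame values, which is what allows a loss
of this kind to be used for gradient-based training.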

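This study proposes an end-to-end method for synthesising speech from text or phonemes. The model operates directly on character or phoneme input sequences and produces raw audio, using a differentiable alignment scheme to generate high-fidelity speech. Without multi-stage training or additional supervision, it achieves synthesis quality comparable to that of prior techniques.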