The field of antibody-based therapeutics has grown significantly in recent
years, with targeted antibodies emerging as a potentially effective approach to
personalized therapies. Such therapies could be particularly beneficial for
complex, highly individual diseases such as cancer. However, progress in this
field is often constrained by the extensive search space of amino acid
sequences that form the foundation of antibody design. In this study, we
introduce a novel reinforcement learning method specifically tailored to
address the unique challenges of this domain. We demonstrate that our method
can learn the design of high-affinity antibodies against multiple targets in
silico, utilizing either online interaction or offline datasets. To the best of
our knowledge, our approach is the first of its kind and outperforms existing
methods on all tested antigens in the Absolut! database.

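This study introduces a novel reinforcement learning method tailored to the unique challenges of antibody design and shows that it can learn to design high-affinity antibodies in silico, using either online interaction or offline datasets, opening a new route toward targeted antibody therapies for complex, highly individual diseases such as cancer.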

Stable Online and Offline Reinforcement Learning for Antibody CDRH3 Design

We study the budget allocation problem in online marketing campaigns that
utilize previously collected offline data. We first discuss the long-term
effect of optimizing marketing budget allocation decisions in the offline
setting. To overcome the challenge, we propose a novel game-theoretic offline
value-based reinforcement learning method using mixed policies. The proposed
method reduces the number of policies that must be stored from infinitely many,
as in previous methods, to a constant number, which achieves nearly optimal
policy efficiency and makes the method practical for industrial use. We further
show that this method is guaranteed to converge to the optimal policy, which
previous value-based reinforcement learning methods for marketing budget
allocation cannot achieve. Our experiments on a large-scale marketing campaign
with tens of millions of users and a budget of more than one billion verify
the theoretical results and show that the proposed method outperforms various
baseline methods. The proposed method has been successfully deployed to serve
all the traffic of this marketing campaign.
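
The idea of a mixed policy over a constant number of stored policies can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the `MixedPolicy` class, the two toy spending rules, and the weights are all hypothetical, and the sketch only shows that a mixed policy is a probability distribution over a small, fixed set of deterministic policies from which one is sampled before acting.

```python
import random
from typing import Callable, List, Optional, Sequence

# Hypothetical policy type: maps the remaining budget to the amount to spend.
Policy = Callable[[float], float]


class MixedPolicy:
    """A probability distribution over a fixed, finite set of policies."""

    def __init__(self, policies: Sequence[Policy], weights: Sequence[float],
                 seed: Optional[int] = None) -> None:
        if len(policies) != len(weights):
            raise ValueError("need exactly one weight per policy")
        if any(w < 0 for w in weights) or sum(weights) <= 0:
            raise ValueError("weights must be non-negative and not all zero")
        total = sum(weights)
        self.policies: List[Policy] = list(policies)
        # Normalize so the weights form a probability distribution.
        self.weights: List[float] = [w / total for w in weights]
        self._rng = random.Random(seed)

    def sample(self) -> Policy:
        """Draw one stored policy according to the mixture weights."""
        return self._rng.choices(self.policies, weights=self.weights, k=1)[0]


# Two toy spending rules (assumptions made for illustration only).
def conservative(remaining_budget: float) -> float:
    return 0.1 * remaining_budget


def aggressive(remaining_budget: float) -> float:
    return 0.5 * remaining_budget


mixed = MixedPolicy([conservative, aggressive], weights=[3, 1], seed=0)
policy = mixed.sample()   # pick one policy for this episode
spend = policy(100.0)     # act with the sampled policy
```

Storage here grows with the number of component policies only, which is the property the abstract highlights; how the components and their weights are learned is left to the paper.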

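The paper proposes a value-based reinforcement learning method for allocating budget in online marketing campaigns using offline data. By using mixed policies, the method reduces the number of policies that must be stored and achieves nearly optimal policy efficiency; experiments on a large-scale marketing campaign show that it outperforms other baseline methods.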

Marketing Budget Allocation with Offline Constrained Deep Reinforcement Learning

In high-dimensional time-series analysis, it is essential to have a set of
key factors (namely, the style factors) that explain the change of the observed
variable. For example, volatility modeling in finance relies on a set of risk
factors, and climate change studies in climatology rely on a set of causal
factors. The ideal low-dimensional style factors should balance significance
(with high explanatory power) and stability (consistent, no significant
fluctuations). However, previous supervised and unsupervised feature extraction
methods can hardly address the tradeoff. In this paper, we propose Style Miner,
a reinforcement learning method to generate style factors. We first formulate
the problem as a Constrained Markov Decision Process with explanatory power as
the return and stability as the constraint. Then, we design fine-grained
immediate rewards and costs and use a Lagrangian heuristic to balance them
adaptively. Experiments on real-world financial data sets show that Style Miner
outperforms existing learning-based methods by a large margin and achieves a
relative gain of 10% in R-squared explanatory power over the industry-renowned
factors proposed by human experts.
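
The Lagrangian balancing of rewards and costs mentioned above can be sketched in a few lines of Python. This is a generic dual-ascent heuristic for a constrained MDP, not Style Miner's actual update rule: the function name `lagrangian_step`, the learning rate, the cost limit, and the toy reward and cost values are all assumptions made for illustration.

```python
from typing import Tuple


def lagrangian_step(reward: float, cost: float, cost_limit: float,
                    lam: float, lr: float = 0.05) -> Tuple[float, float]:
    """One step of a generic Lagrangian heuristic (hypothetical helper).

    reward     -- immediate reward (standing in for explanatory power)
    cost       -- immediate cost (standing in for instability)
    cost_limit -- the level of cost the constraint tolerates
    lam        -- current Lagrange multiplier, always kept >= 0
    """
    # The policy is trained on the reward penalized by the weighted cost.
    shaped_reward = reward - lam * cost
    # Dual ascent: raise lam when the cost exceeds the limit, lower it otherwise.
    new_lam = max(0.0, lam + lr * (cost - cost_limit))
    return shaped_reward, new_lam


# Toy trajectory of (reward, cost) pairs; the numbers are made up.
lam = 0.0
for reward, cost in [(1.0, 0.8), (0.9, 0.6), (0.7, 0.1)]:
    shaped, lam = lagrangian_step(reward, cost, cost_limit=0.3, lam=lam)
```

The multiplier grows while the constraint is violated and shrinks once the cost falls below the limit, which is what makes the balance adaptive rather than a fixed penalty weight.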

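This paper introduces Style Miner, a reinforcement learning method for generating low-dimensional style factors; on real-world financial data it achieves a relative gain of about 10% in R-squared explanatory power over factors proposed by human experts.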

Style Miner: Find Significant and Stable Explanatory Factors in Time Series with Constrained Reinforcement Learning

Building an open-domain multi-turn conversation system is one of the most
interesting and challenging tasks in Artificial Intelligence. Many research
efforts have been dedicated to building such dialogue systems, yet few shed
light on modeling the conversation flow in an ongoing dialogue. People commonly
talk about highly relevant aspects during a conversation, and the topics are
coherent and drift naturally, which demonstrates the necessity of dialogue flow
modeling. To this end, we present a multi-turn cue-word-driven conversation
system with a reinforcement learning method (RLCw), which strives to select an
adaptive cue word with the greatest future credit and thereby improve the
quality of generated responses. We introduce a new reward to measure the
quality of cue words in terms of effectiveness and relevance. To further
optimize the model for long-term conversations, a reinforcement approach is
adopted in this paper. Experiments on a real-life dataset demonstrate that our
model consistently outperforms a set of competitive baselines in terms of
simulated turns, diversity and human evaluation.
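
Selecting the cue word "with the greatest future credit" can be read as choosing the candidate with the highest expected discounted return. The Python sketch below illustrates that reading only: the abstract does not define future credit precisely, so the discounting, the helper names `discounted_return` and `select_cue_word`, and the simulated per-turn rewards are all assumptions.

```python
from typing import Dict, List


def discounted_return(rewards: List[float], gamma: float = 0.9) -> float:
    """Sum of per-turn rewards, each discounted by gamma per turn of delay."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total


def select_cue_word(rollouts: Dict[str, List[float]], gamma: float = 0.9) -> str:
    """Pick the candidate cue word whose simulated rollout scores highest.

    rollouts maps each candidate cue word to the per-turn rewards observed in
    a simulated continuation of the dialogue (hypothetical data layout).
    """
    return max(rollouts, key=lambda word: discounted_return(rollouts[word], gamma))


# Made-up rewards for three candidate cue words.
rollouts = {
    "weather": [0.2, 0.1],
    "travel": [0.1, 0.5, 0.4],
    "food": [0.3],
}
best = select_cue_word(rollouts)
```

In this toy data "food" has the highest immediate reward, yet "travel" is selected because its later turns contribute more, which is the kind of long-term preference the abstract attributes to the reinforcement approach.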

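By introducing multi-turn cue words and a reinforcement learning method, the paper builds an open-domain multi-turn dialogue system that models conversation flow and improves the quality of generated responses; experiments show that it outperforms competitive baselines.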