Despite intense efforts in basic and clinical research, an individualized
ventilation strategy for critically ill patients remains a major challenge.
Recently, learning dynamic treatment regimes (DTRs) with reinforcement learning (RL) on
electronic health records (EHRs) has attracted interest from both the healthcare
industry and the machine learning research community. However, most learned DTR
policies may be biased by confounders. Some treatment actions that non-survivors
received may have been helpful; yet if confounders caused the death, an RL model
guided by long-term outcomes (e.g., 90-day mortality) would penalize those
actions, leaving the learned DTR policies suboptimal. In this study, we develop a new deconfounding
actor-critic network (DAC) to learn optimal DTR policies for patients. To
alleviate confounding issues, we incorporate a patient resampling module and a
confounding balance module into our actor-critic framework. To avoid punishing
the effective treatment actions non-survivors received, we design a short-term
reward to capture patients' immediate health state changes. Combining
short-term with long-term rewards could further improve the model performance.
Moreover, we introduce a policy adaptation method to successfully transfer the
learned model to new-source small-scale datasets. The experimental results on
one semi-synthetic and two different real-world datasets show the proposed
model outperforms the state-of-the-art models. The proposed model provides
individualized treatment decisions for mechanical ventilation that could
improve patient outcomes.
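
The idea of combining a short-term with a long-term reward can be illustrated with a minimal sketch. The abstract does not give the reward definitions, so everything here is an assumption made for illustration: the short-term reward is taken to be the drop in a hypothetical severity score between consecutive time steps, the long-term reward is a fixed bonus or penalty added at the final step, and `long_term_weight` and `terminal_reward` are made-up parameters.

```python
from typing import List


def combined_rewards(
    severity: List[float],
    survived: bool,
    long_term_weight: float = 1.0,
    terminal_reward: float = 1.0,
) -> List[float]:
    """Per-step rewards for one patient trajectory (illustrative only).

    severity: hypothetical severity score at each time step (higher = sicker).
    survived: hypothetical long-term outcome label for the patient.
    """
    if len(severity) < 2:
        raise ValueError("need at least two severity measurements")

    # Assumed short-term reward: improvement in the severity score
    # from one step to the next (positive when the patient gets better).
    rewards = [severity[t] - severity[t + 1] for t in range(len(severity) - 1)]

    # Assumed long-term reward: a bonus or penalty at the last step only.
    outcome = terminal_reward if survived else -terminal_reward
    rewards[-1] += long_term_weight * outcome
    return rewards


if __name__ == "__main__":
    # A non-survivor whose condition first improved, then worsened.
    print(combined_rewards([10.0, 8.0, 9.0], survived=False))
```

In this toy trajectory the first action still receives a positive reward even though the patient did not survive, which is the behaviour the short-term reward is meant to preserve.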

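By using a new deconfounding actor-critic network, reinforcement-learning-based dynamic treatment regimes learned from electronic health records can yield better individualized ventilation decisions for patients, thereby improving patient outcomes.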

A Deconfounding Actor-Critic Network Using Policy Adaptation to Support Dynamic Treatment Regimes

Deconfounding Actor-Critic Network with Policy Adaptation for Dynamic Treatment Regimes

Recent advances in reinforcement learning have inspired increasing interest
in adaptively learning user models through dynamic interactions, e.g., in
reinforcement-learning-based recommender systems. The reward function is crucial
for most reinforcement learning applications because it guides the
optimization. However, current reinforcement-learning-based methods
rely on manually defined reward functions, which cannot adapt to dynamic and
noisy environments. Moreover, they generally use task-specific reward functions
that sacrifice generalization ability. To address these issues, we propose a
generative inverse reinforcement learning approach for user behavioral
preference modelling. Instead of using predefined reward functions, our model
automatically learns the rewards from users' actions based on a discriminative
actor-critic network and a Wasserstein GAN. Our model provides a general way of
characterizing and explaining underlying behavioral tendencies, and our
experiments show our method outperforms state-of-the-art methods in a variety
of scenarios, namely traffic signal control, online recommender systems, and
scanpath prediction.
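
As a rough illustration of how a Wasserstein-style critic can serve as a learned reward, the sketch below computes the standard Wasserstein GAN critic loss over (state, action) pairs. It is not the authors' model: the function names, the scalar state and action types, and the toy linear critic are all assumptions, and the Lipschitz constraint that a real Wasserstein GAN needs (weight clipping or a gradient penalty) is omitted.

```python
from typing import Callable, Sequence, Tuple

Pair = Tuple[float, float]  # hypothetical (state, action) pair


def mean_score(critic: Callable[[float, float], float], pairs: Sequence[Pair]) -> float:
    """Average critic score over a batch of (state, action) pairs."""
    if not pairs:
        raise ValueError("batch must not be empty")
    return sum(critic(s, a) for s, a in pairs) / len(pairs)


def wasserstein_critic_loss(
    critic: Callable[[float, float], float],
    expert_pairs: Sequence[Pair],
    policy_pairs: Sequence[Pair],
) -> float:
    """Loss the critic minimizes: it pushes expert scores up and policy scores down."""
    return mean_score(critic, policy_pairs) - mean_score(critic, expert_pairs)


def learned_reward(critic: Callable[[float, float], float], state: float, action: float) -> float:
    """Assumed use of the critic score as the reward handed to the actor."""
    return critic(state, action)


if __name__ == "__main__":
    # Toy linear critic, purely for demonstration.
    toy_critic = lambda s, a: s + 2.0 * a
    expert = [(1.0, 1.0), (0.0, 1.0)]
    policy = [(1.0, 0.0), (0.0, 0.0)]
    print(wasserstein_critic_loss(toy_critic, expert, policy))
```

A lower loss means the critic separates observed user behavior from the policy's behavior more strongly; the policy is then trained to raise its own scores.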

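We propose a user behavioral preference modelling method based on generative inverse reinforcement learning. It automatically learns the reward function behind users' behavior, using a discriminative actor-critic network and a Wasserstein generative adversarial network for modelling and explanation. Experiments show that it outperforms existing methods in traffic signal control, online recommender systems, and scanpath prediction.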

Generative Adversarial Reward Learning for Inferring Generalized Behavior Tendencies

Generative Adversarial Reward Learning for Generalized Behavior Tendency Inference

Code summarization and code search have been widely adopted in
software development and maintenance. However, few studies have explored the
efficacy of unifying them. In this paper, we propose TranS^3, a transformer-based
framework to integrate code summarization with code search. Specifically, for
code summarization, TranS^3 employs an actor-critic network, in which the actor
network encodes the collected code snippets via transformer- and
tree-transformer-based encoders and decodes the given code snippet to generate
its comment. Meanwhile, we iteratively tune the actor network via the feedback
from the critic network to enhance the quality of the generated comments.
Furthermore, we import the generated comments into code search to enhance its
accuracy. To evaluate the effectiveness of TranS^3, we conduct a set of
experimental studies and case studies. The experimental results suggest
that TranS^3 can significantly outperform multiple state-of-the-art approaches
in both code summarization and code search, and the study results further
support the efficacy of TranS^3 from the developers' points of view.

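This paper proposes TranS^3, a framework based on Transformer and an actor-critic network that integrates code summarization and code search, and shows that it outperforms existing methods in both tasks.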

TranS^3: A Transformer-Based Framework for Unifying Code Summarization and Code Search

TranS^3: A Transformer-based Framework for Unifying Code Summarization and Code Search

We study the reinforcement learning problem of complex action control in the
Multi-player Online Battle Arena (MOBA) 1v1 games. This problem involves far
more complicated state and action spaces than those of traditional 1v1 games,
such as Go and the Atari series, which makes it very difficult to find
policies with human-level performance. In this paper, we present a deep
reinforcement learning framework to tackle this problem from the perspectives
of both system and algorithm. Our system features low coupling and high
scalability, which enable efficient exploration at large scale. Our algorithm
includes several novel strategies, namely control dependency decoupling,
action mask, target attention, and dual-clip PPO, with which our proposed
actor-critic network can be effectively trained in our system. Tested on the
MOBA game Honor of Kings, our AI agent, called Tencent Solo, can defeat top
professional human players in full 1v1 games.
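
Of the strategies listed above, dual-clip PPO lends itself to a short worked example. The abstract only names the technique, so the sketch below follows the formulation as it is commonly described: the usual PPO clipped surrogate, with an additional lower bound of `c * advantage` applied when the advantage is negative. The function name and the default values `eps=0.2` and `c=3.0` are assumptions for illustration, not settings taken from the source.

```python
def dual_clip_ppo_objective(
    ratio: float,
    advantage: float,
    eps: float = 0.2,
    c: float = 3.0,
) -> float:
    """Per-sample dual-clip PPO surrogate objective (to be maximized).

    ratio: probability ratio pi_new(a|s) / pi_old(a|s).
    advantage: estimated advantage of the sampled action.
    eps: standard PPO clip range (assumed value).
    c: second clip bound, must be > 1 (assumed value).
    """
    if c <= 1.0:
        raise ValueError("c must be greater than 1")

    # Standard PPO clipped surrogate.
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    standard = min(ratio * advantage, clipped_ratio * advantage)

    # Second clip: with a negative advantage and a very large ratio,
    # the standard term is unbounded below, so bound it by c * advantage.
    if advantage < 0.0:
        return max(standard, c * advantage)
    return standard


if __name__ == "__main__":
    print(dual_clip_ppo_objective(ratio=10.0, advantage=-1.0))
    print(dual_clip_ppo_objective(ratio=2.0, advantage=1.0))
```

With a ratio of 10 and an advantage of -1, plain PPO clipping would leave the objective at -10, while the second clip bounds it at -3, which limits the size of a single off-policy update.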

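This paper presents a deep reinforcement learning framework that tackles complex action control in Multi-player Online Battle Arena (MOBA) 1v1 games from both the system and the algorithm perspectives. Using several novel strategies, including control dependency decoupling, action mask, target attention, and dual-clip PPO, it trains an AI agent, Tencent Solo, that can defeat top human players in the MOBA game Honor of Kings.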