Fast and efficient transport protocols are the foundation of an increasingly
distributed world. The burden of continuously delivering improved communication
performance to support next-generation applications and services, combined with
the increasing heterogeneity of systems and network technologies, has promoted
the design of Congestion Control (CC) algorithms that perform well under
specific environments. The challenge of designing a generic CC algorithm that
can adapt to a broad range of scenarios is still an open research question. To
tackle this challenge, we propose to apply a novel Reinforcement Learning (RL)
approach. Our solution, MARLIN, uses the Soft Actor-Critic algorithm to
maximize both entropy and return and models the learning process as an
infinite-horizon task. We trained MARLIN on a real network with varying
background traffic patterns to overcome the sim-to-real mismatch that
researchers have encountered when applying RL to CC. We evaluated our solution
on the task of file transfer and compared it to TCP Cubic. While further
research is required, our experiments show that MARLIN can achieve performance
comparable to TCP Cubic with little hyperparameter tuning, on a task
significantly different from its training setting. Therefore, we believe that our work
represents a promising first step toward building CC algorithms based on the
maximum entropy RL framework.
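
To make "maximize both entropy and return" over an infinite horizon concrete, the sketch below computes a discounted entropy-augmented return for a finite trajectory prefix. It is an illustration only, not MARLIN's implementation: the function name, the temperature `alpha`, and the discount `gamma` are assumptions, and `-log_prob` is used as a single-sample estimate of the policy entropy.

```python
def soft_discounted_return(rewards, log_probs, alpha=0.2, gamma=0.99):
    """Discounted sum of (r_t - alpha * log pi(a_t | s_t)).

    rewards   : list of per-step rewards r_t
    log_probs : list of log pi(a_t | s_t) for the actions actually taken
    alpha     : entropy temperature (hypothetical default)
    gamma     : discount factor, which keeps an infinite-horizon sum finite
    """
    total = 0.0
    # Accumulate backwards so each step adds its own term plus the
    # discounted value of everything that follows it.
    for r, lp in zip(reversed(rewards), reversed(log_probs)):
        total = (r - alpha * lp) + gamma * total
    return total


if __name__ == "__main__":
    # A less predictable action (more negative log-probability) earns a
    # larger entropy bonus for the same reward.
    print(soft_discounted_return([1.0, 1.0], [-1.0, -1.0], alpha=0.5, gamma=0.5))
```

With `alpha = 0` the function reduces to the ordinary discounted return, which is one way to see the entropy term as a bonus added on top of the task reward.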

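This study proposes MARLIN, a congestion control solution based on a maximum entropy reinforcement learning algorithm. The method uses the Soft Actor-Critic algorithm and models the learning process as an infinite-horizon task. In experiments, MARLIN achieved results comparable to TCP Cubic on a file transfer task.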

Applying Soft Actor-Critic Based Reinforcement Learning to Congestion Control in Real Networks

MARLIN: Soft Actor-Critic based Reinforcement Learning for Congestion Control in Real Networks

Model-free deep reinforcement learning (RL) algorithms have been successfully
applied to a range of challenging sequential decision making and control tasks.
However, these methods typically suffer from two major challenges: high sample
complexity and brittleness to hyperparameters. Both of these challenges limit
the applicability of such methods to real-world domains. In this paper, we
describe Soft Actor-Critic (SAC), our recently introduced off-policy
actor-critic algorithm based on the maximum entropy RL framework. In this
framework, the actor aims to simultaneously maximize expected return and
entropy; that is, to succeed at the task while acting as randomly as possible.
We extend SAC to incorporate a number of modifications that accelerate training
and improve stability with respect to the hyperparameters, including a
constrained formulation that automatically tunes the temperature
hyperparameter. We systematically evaluate SAC on a range of benchmark tasks,
as well as challenging real-world tasks such as locomotion for a quadrupedal
robot and robotic manipulation with a dexterous hand. With these improvements,
SAC achieves state-of-the-art performance, outperforming prior on-policy and
off-policy methods in sample-efficiency and asymptotic performance.
Furthermore, we demonstrate that, in contrast to other off-policy algorithms,
our approach is very stable, achieving similar performance across different
random seeds. These results suggest that SAC is a promising candidate for
learning in real-world robotics tasks.
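
For reference, the maximum entropy objective and the automatically tuned temperature described above are commonly written as follows. The notation is not part of the abstract and is assumed here: $\rho_\pi$ is the state-action distribution induced by the policy $\pi$, $\alpha$ is the temperature, $\mathcal{H}$ is the entropy, and $\bar{\mathcal{H}}$ is a target entropy chosen by the user.

```latex
% Maximum entropy objective: expected return plus a temperature-weighted entropy bonus
J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}
         \left[ r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \right]

% Constrained formulation: maximize return subject to a minimum expected entropy
\max_{\pi} \; \mathbb{E}_{\rho_\pi}\!\left[ \sum_{t} r(s_t, a_t) \right]
\quad \text{s.t.} \quad
\mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\!\left[ -\log \pi(a_t \mid s_t) \right] \ge \bar{\mathcal{H}}
\quad \forall t

% Temperature loss obtained from the dual of the constrained problem
J(\alpha) = \mathbb{E}_{a_t \sim \pi}
            \left[ -\alpha \log \pi(a_t \mid s_t) - \alpha \, \bar{\mathcal{H}} \right]
```

Minimizing $J(\alpha)$ raises the temperature when the policy's entropy falls below the target and lowers it otherwise, which is what removes the need to tune $\alpha$ by hand.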

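This paper presents Soft Actor-Critic, an off-policy actor-critic algorithm based on the maximum entropy reinforcement learning framework, in which the actor aims to simultaneously maximize expected return and entropy, that is, to succeed at the task while acting as randomly as possible. The authors introduce a series of modifications, including a constrained formulation, that improve stability and training speed. The algorithm achieves state-of-the-art performance on benchmark tasks and on challenging real-world tasks such as quadrupedal robot locomotion and dexterous-hand manipulation, outperforming prior on-policy and off-policy methods in sample efficiency and asymptotic performance.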