A biologically plausible method for training an Artificial Neural Network
(ANN) involves treating each unit as a stochastic Reinforcement Learning (RL)
agent, thereby considering the network as a team of agents. Consequently, all
units can learn via REINFORCE, a local learning rule modulated by a global
reward signal, which aligns more closely with biologically observed forms of
synaptic plasticity. However, this learning method tends to be slow and does
not scale well with the size of the network. This inefficiency arises from two
factors impeding effective structural credit assignment: (i) all units
independently explore the network, and (ii) a single reward is used to evaluate
the actions of all units. Accordingly, methods aimed at improving structural
credit assignment can generally be classified into two categories. The first
category includes algorithms that enable coordinated exploration among units,
such as MAP propagation. The second category encompasses algorithms that
compute a more specific reward signal for each unit within the network, like
Weight Maximization and its variants. In this research report, our focus is on
the first category. We propose the use of Boltzmann machines or a recurrent
network for coordinated exploration. We show that the negative phase, which is
typically necessary to train Boltzmann machines, can be removed. The resulting
learning rules are similar to the reward-modulated Hebbian learning rule.
Experimental results demonstrate that coordinated exploration trains
significantly faster than independent exploration for multiple stochastic and
discrete units based on REINFORCE, even surpassing straight-through estimator
(STE) backpropagation.

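Using Boltzmann machines or a recurrent network for coordinated exploration speeds up the training of multiple REINFORCE-based stochastic and discrete units, even surpassing straight-through estimator (STE) backpropagation.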
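
The baseline that the abstract contrasts against, a team of units that each explore independently and all receive the same reward, can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes Bernoulli-logistic units, and the names `sample_action`, `reinforce_update`, `team_step`, and `reward_fn` are hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sample_action(weights, inputs, rng):
    """Sample a binary action from a Bernoulli-logistic unit; return (action, p)."""
    p = sigmoid(sum(w * x for w, x in zip(weights, inputs)))
    return (1 if rng.random() < p else 0), p


def reinforce_update(weights, inputs, action, p, reward, lr=0.1):
    """Local REINFORCE step: w += lr * reward * grad_w log Pr(action | inputs)."""
    # For a Bernoulli-logistic unit (an assumption of this sketch), the
    # gradient of the log-probability with respect to w is (action - p) * x.
    return [w + lr * reward * (action - p) * x for w, x in zip(weights, inputs)]


def team_step(team_weights, inputs, reward_fn, rng, lr=0.1):
    """One step for a team of units: independent sampling, one shared reward."""
    samples = [sample_action(w, inputs, rng) for w in team_weights]
    actions = [a for a, _ in samples]
    reward = reward_fn(actions)  # single global reward broadcast to every unit
    new_weights = [
        reinforce_update(w, inputs, a, p, reward, lr)
        for w, (a, p) in zip(team_weights, samples)
    ]
    return new_weights, actions, reward
```

Each unit samples without regard to the others, and `reward` scores the joint action as a whole; these are the two factors, (i) and (ii), that the abstract identifies as impeding structural credit assignment.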

Structural Credit Assignment with Coordinated Exploration

An artificial neural network can be trained by uniformly broadcasting a
reward signal to units that implement a REINFORCE learning rule. Though this
presents a biologically plausible alternative to backpropagation in training a
network, the high variance associated with it makes it impractical for training
deep networks. The high variance arises from inefficient structural credit
assignment, since a single reward signal is used to evaluate the collective
action of all units. To facilitate structural credit assignment, we propose
replacing the reward signal to hidden units with the change in the $L^2$ norm
of the unit's outgoing weight. As such, each hidden unit in the network is
trying to maximize the norm of its outgoing weight instead of the global
reward, and thus we call this learning method Weight Maximization. We prove
that Weight Maximization approximately follows the gradient of rewards in
expectation. In contrast to backpropagation, Weight Maximization can be used to
train both continuous-valued and discrete-valued units. Moreover, Weight
Maximization solves several major issues of backpropagation relating to
biological plausibility. Our experiments show that a network trained with
Weight Maximization can learn significantly faster than REINFORCE and slightly
slower than backpropagation. Weight Maximization illustrates an example of
cooperative behavior automatically arising from a population of self-interested
agents in a competitive game without any central coordination.

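By replacing the reward signal to hidden units with the change in the $L^2$ norm of their outgoing weights, Weight Maximization addresses the high variance of the REINFORCE learning rule and makes the training of deep neural networks more efficient. The method also addresses the biological plausibility issues of backpropagation and can be used to train networks of both continuous-valued and discrete-valued units.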
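
The per-unit reward described above can be illustrated with a short sketch. It assumes the reward is computed from a hidden unit's outgoing weight vector before and after one update of the downstream units. The function names `l2_norm` and `weight_max_reward` are hypothetical, and the abstract does not specify details such as update timing or scaling.

```python
import math


def l2_norm(vector):
    """Euclidean (L2) norm of a weight vector."""
    return math.sqrt(sum(w * w for w in vector))


def weight_max_reward(outgoing_before, outgoing_after):
    """Hidden-unit reward: change in the L2 norm of its outgoing weights."""
    return l2_norm(outgoing_after) - l2_norm(outgoing_before)


# Outgoing weights grew in magnitude, so the unit receives a positive reward.
reward = weight_max_reward([3.0, 4.0], [6.0, 8.0])
```

In this sketch, a hidden unit would use `reward` in place of the global reward in its own REINFORCE update, so that it is rewarded when downstream units come to rely on its output more strongly.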