Current methods for end-to-end constructive neural combinatorial optimization
usually train a policy using behavior cloning from expert solutions or policy
gradient methods from reinforcement learning. While behavior cloning is
straightforward, it requires expensive expert solutions, and policy gradient
methods are often computationally demanding and complex to fine-tune. In this
work, we bridge the two and simplify the training process by sampling multiple
solutions for random instances using the current model in each epoch and then
selecting the best solution as an expert trajectory for supervised imitation
learning. To achieve progressively improving solutions with minimal sampling,
we introduce a method that combines round-wise Stochastic Beam Search with an
update strategy derived from a provable policy improvement. This strategy
refines the policy between rounds by utilizing the advantage of the sampled
sequences with almost no computational overhead. We evaluate our approach on
the Traveling Salesman Problem and the Capacitated Vehicle Routing Problem. The
models trained with our method achieve comparable performance and
generalization to those trained with expert data. Additionally, we apply our
method to the Job Shop Scheduling Problem using a transformer-based
architecture and outperform existing state-of-the-art methods by a wide margin.
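
The core data-generation step described above (sample several solutions with the current model, keep the best one as a pseudo-expert trajectory) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the hypothetical `sample_tour` is a toy distance-based stand-in for the learned policy, the samples are drawn independently rather than with the round-wise Stochastic Beam Search and policy update from the abstract, and the supervised imitation step itself is omitted.

```python
import math
import random


def tour_length(coords, tour):
    """Length of the closed tour that visits `coords` in the order given by `tour`."""
    n = len(tour)
    return sum(math.dist(coords[tour[i]], coords[tour[(i + 1) % n]]) for i in range(n))


def sample_tour(coords, rng, temperature=0.1):
    """Toy stand-in policy (assumption): next city chosen with prob ∝ exp(-distance / temperature)."""
    tour = [0]
    unvisited = set(range(1, len(coords)))
    while unvisited:
        current = coords[tour[-1]]
        candidates = sorted(unvisited)
        weights = [math.exp(-math.dist(current, coords[c]) / temperature) for c in candidates]
        nxt = rng.choices(candidates, weights=weights, k=1)[0]
        tour.append(nxt)
        unvisited.remove(nxt)
    return tour


def best_of_samples(coords, rng, num_samples=16):
    """Sample several tours from the current policy and keep the shortest as the imitation target."""
    tours = [sample_tour(coords, rng) for _ in range(num_samples)]
    return min(tours, key=lambda t: tour_length(coords, t))


if __name__ == "__main__":
    rng = random.Random(0)
    # One "epoch" of data generation: random instances paired with their best sampled tour.
    dataset = []
    for _ in range(4):
        coords = [(rng.random(), rng.random()) for _ in range(10)]
        dataset.append((coords, best_of_samples(coords, rng)))
    for coords, tour in dataset:
        print(tour, round(tour_length(coords, tour), 3))
```

In a full training loop, the `(instance, best tour)` pairs would serve as targets for a supervised next-node loss, and the process would repeat with the updated model.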

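By combining behavior cloning and reinforcement learning, this work simplifies end-to-end training for neural combinatorial optimization: it samples solutions from the current model and applies a provable policy improvement to raise model performance. It achieves good results on the Traveling Salesman Problem and the Capacitated Vehicle Routing Problem, and when applied to the Job Shop Scheduling Problem it outperforms existing methods.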

Self-Improvement for Neural Combinatorial Optimization: Sample without Replacement, but Improvement

We propose an effective prompting approach that integrates self-evaluation
guidance through stochastic beam search. Our approach explores the reasoning
search space using a well-calibrated automatic criterion. This enables an
efficient search to produce higher-quality final predictions. With the
self-evaluation guided stochastic beam search, we also balance the
quality--diversity trade-off in the generation of reasoning chains. This allows
our approach to combine well with majority voting and surpass the corresponding
Codex-backboned baselines by $6.34\%$, $9.56\%$, and $5.46\%$ on the GSM8K,
AQUA, and StrategyQA benchmarks, respectively, in few-shot accuracy. Analysis
of our decompositional reasoning finds it pinpoints logic failures and leads to
higher consistency and robustness.
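
As a rough illustration of how a self-evaluation signal might steer a stochastic beam search over reasoning steps, the sketch below scores each candidate chain by a weighted mix of its model log-probability and the log of a self-evaluation confidence, then samples the surviving beams with a temperature. The scoring rule, the `weight` and `temperature` parameters, and the function name `stochastic_beam_step` are all assumptions for illustration; the abstract does not specify them, and no language model is called here.

```python
import math
import random


def stochastic_beam_step(candidates, beam_size, rng, weight=0.5, temperature=0.5):
    """Pick up to `beam_size` distinct chains from `candidates`.

    candidates: list of (chain, lm_logprob, self_eval_confidence), confidence in (0, 1].
    Score (assumed form): weight * lm_logprob + (1 - weight) * log(confidence).
    Lower temperature favors quality; higher temperature favors diversity.
    """
    pool = [(weight * lp + (1.0 - weight) * math.log(conf), chain)
            for chain, lp, conf in candidates]
    kept = []
    while pool and len(kept) < beam_size:
        top = max(score for score, _ in pool)
        # Subtract the max score before exponentiating for numerical stability.
        weights = [math.exp((score - top) / temperature) for score, _ in pool]
        idx = rng.choices(range(len(pool)), weights=weights, k=1)[0]
        kept.append(pool.pop(idx)[1])  # remove the pick so beams stay distinct
    return kept


if __name__ == "__main__":
    rng = random.Random(0)
    candidates = [
        ("step A", -1.0, 0.90),
        ("step B", -1.2, 0.20),
        ("step C", -3.0, 0.95),
        ("step D", -0.9, 0.50),
    ]
    print(stochastic_beam_step(candidates, beam_size=2, rng=rng))
```

As the temperature approaches zero, this step reduces to keeping the top-scoring candidates, which is ordinary deterministic beam search under the same score.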

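This work proposes an effective prompting approach that integrates self-evaluation guidance through stochastic beam search, balancing the quality-diversity trade-off in the generated reasoning chains. In few-shot accuracy it surpasses the corresponding Codex-backboned baselines by 6.34%, 9.56%, and 5.46% on the GSM8K, AQUA, and StrategyQA benchmarks, respectively. Its decompositional reasoning also pinpoints logic failures and improves consistency and robustness.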

Decomposition Enhances Reasoning via Self-Evaluation Guided Decoding

The well-known Gumbel-Max trick for sampling from a categorical distribution
can be extended to sample $k$ elements without replacement. We show how to
implicitly apply this 'Gumbel-Top-$k$' trick on a factorized distribution over
sequences, allowing us to draw exact samples without replacement using a
Stochastic Beam Search. Even for exponentially large domains, the number of
model evaluations grows only linearly in $k$ and the maximum sampled sequence
length. The algorithm creates a theoretical connection between sampling and
(deterministic) beam search and can be used as a principled intermediate
alternative. In a translation task, the proposed method compares favourably
against alternatives to obtain diverse yet good quality translations. We show
that sequences sampled without replacement can be used to construct
low-variance estimators for expected sentence-level BLEU score and model
entropy.
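
For a plain categorical distribution, the Gumbel-Top-$k$ trick mentioned above amounts to adding independent Gumbel(0, 1) noise to each log-probability and keeping the $k$ largest perturbed values. The sketch below shows only this flat case; the function name `gumbel_top_k` is illustrative, and the implicit application to a factorized sequence model (Stochastic Beam Search) is not reproduced here.

```python
import math
import random


def gumbel_top_k(log_probs, k, rng):
    """Sample k distinct indices without replacement via the Gumbel-Top-k trick.

    log_probs: (possibly unnormalized) log-probabilities; use float("-inf") for zero mass.
    Indices are returned in sampled order (largest perturbed value first).
    """
    if not 0 <= k <= len(log_probs):
        raise ValueError("k must be between 0 and the number of categories")
    perturbed = []
    for index, log_p in enumerate(log_probs):
        u = rng.random()
        while u == 0.0:  # avoid log(0); random() returns values in [0, 1)
            u = rng.random()
        gumbel = -math.log(-math.log(u))  # standard Gumbel(0, 1) noise
        perturbed.append((log_p + gumbel, index))
    perturbed.sort(reverse=True)
    return [index for _, index in perturbed[:k]]


if __name__ == "__main__":
    rng = random.Random(0)
    probs = [0.5, 0.3, 0.15, 0.05]
    log_probs = [math.log(p) for p in probs]
    for _ in range(5):
        print(gumbel_top_k(log_probs, k=2, rng=rng))
```

With $k = 1$ this reduces to the ordinary Gumbel-Max trick, so the first returned index is distributed according to the original categorical distribution.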

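This work applies the Gumbel-Top-k trick to a factorized distribution over sequences, using Stochastic Beam Search to sample sequences without replacement. It establishes a theoretical connection between sampling and deterministic beam search. The method performs well on a translation task, and sequences sampled without replacement can be used to construct low-variance estimators of the expected BLEU score and model entropy.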