Speculative decoding is an inference-acceleration method for large language
models (LLMs) in which a small language model generates a draft-token sequence
that the target LLM then verifies in parallel. Recent works have advanced this
method by building a draft-token tree, achieving better performance than
single-sequence speculative decoding. However, those works generate tokens at
each level of the tree independently and therefore do not fully exploit the
diversity the tree could provide. Moreover, their empirical superiority has
been shown for a fixed sequence length, which implicitly grants more
computational resources to the target LLM in the tree-based methods. None of the
existing works has conducted empirical studies with a fixed computational budget
at the target LLM, despite its importance for resource-bounded devices. We
present Recursive Speculative Decoding (RSD), a novel tree-based method that
samples draft tokens without replacement and maximizes the diversity of the
tree. During RSD's drafting, the tree is built either by the Gumbel-Top-$k$
trick, which draws tokens without replacement in parallel, or by Stochastic Beam
Search, which samples sequences without replacement while early-truncating
unlikely draft sequences, thereby reducing the computational cost at the target
LLM. We empirically evaluate RSD with Llama 2 and OPT models, showing that RSD
outperforms the baseline methods consistently for a fixed draft sequence length
and in most cases for a fixed computational budget at the target LLM.
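The Gumbel-Top-$k$ trick can be illustrated with a short, generic sketch. It perturbs each log-probability with independent standard Gumbel noise and keeps the $k$ largest perturbed scores, which yields $k$ distinct tokens sampled without replacement from the categorical distribution. This is not RSD's implementation: the function name `gumbel_top_k`, the toy four-token distribution, and the pure-Python setting are assumptions for illustration, and the sketch covers only one level of drafting, not tree construction or verification by the target LLM.

```python
import math
import random


def gumbel_top_k(log_probs, k, rng=random):
    """Sample k distinct indices without replacement from the categorical
    distribution defined by `log_probs` (hypothetical helper, not RSD code)."""
    if not 0 < k <= len(log_probs):
        raise ValueError("k must be between 1 and len(log_probs)")
    scored = []
    for index, log_p in enumerate(log_probs):
        u = rng.random()          # uniform sample in [0.0, 1.0)
        while u == 0.0:           # resample to avoid log(0)
            u = rng.random()
        gumbel = -math.log(-math.log(u))  # standard Gumbel(0, 1) noise
        scored.append((log_p + gumbel, index))
    # The k largest perturbed scores give k distinct samples, in the order
    # they would have been drawn sequentially without replacement.
    scored.sort(reverse=True)
    return [index for _, index in scored[:k]]


if __name__ == "__main__":
    # Toy next-token distribution over a 4-token vocabulary (assumed values).
    probs = [0.5, 0.3, 0.15, 0.05]
    log_probs = [math.log(p) for p in probs]
    rng = random.Random(0)
    print(gumbel_top_k(log_probs, k=3, rng=rng))  # three distinct token ids
```

Because all $k$ draws come from a single perturb-and-sort pass, the samples can be produced in parallel, which matches the role the abstract assigns to this trick; with $k = 1$ it reduces to the Gumbel-max trick for ordinary sampling.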

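Recursive Speculative Decoding is a tree-based method that accelerates large language models by sampling diverse draft-token sequences without replacement, outperforming the baselines consistently for a fixed draft sequence length and in most cases for a fixed computational budget.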

Recursive Speculative Decoding: Accelerating LLM Inference via Sampling Without Replacement

The last six years have witnessed significant progress in adversarially
robust deep learning. As evidenced by the CIFAR-10 dataset category in
RobustBench benchmark, the accuracy under $\ell_\infty$ adversarial
perturbations improved from 44\% in \citet{Madry2018Towards} to 71\% in
\citet{peng2023robust}. Although impressive, the existing state of the art is
still far from satisfactory. It is further observed that the best-performing
models are often very large models adversarially trained by industrial labs with
significant computational budgets. In this paper, we aim to understand: ``how
much longer can computing power drive adversarial robustness advances?'' To
answer this question, we derive \emph{scaling laws for adversarial robustness}
that can be extrapolated to estimate how much compute we would need to reach a
desired level of robustness. We show that increasing the FLOPs spent on
adversarial training does not bring as much performance improvement as it does
for standard training. Moreover, we find that some of the top-performing
techniques are difficult to reproduce exactly, suggesting that they are not
robust to minor changes in the training setup. Our analysis also uncovers potentially
worthwhile directions to pursue in future research. Finally, we make our
benchmarking framework (built on top of \texttt{timm}~\citep{rw2019timm})
publicly available to facilitate future analysis in efficient robust deep
learning.
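To make the idea of extrapolating a scaling law concrete, the sketch below fits a generic power law, robust error $\approx a \cdot C^{-b}$ in training compute $C$, by least squares in log-log space and then inverts it to estimate the compute needed for a target error. The functional form (a pure power law with no irreducible-error term), the helper names `fit_power_law` and `compute_for_target`, and the synthetic data points are assumptions for illustration only; they are not the paper's fitted law or measurements.

```python
import math


def fit_power_law(compute, error):
    """Fit error ~= a * compute**(-b) by ordinary least squares on
    log(error) versus log(compute). Returns (a, b)."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(e) for e in error]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx                 # slope in log-log space equals -b
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


def compute_for_target(a, b, target_error):
    """Invert error = a * C**(-b) to estimate the compute C that would be
    needed to reach `target_error` (pure extrapolation of the fitted law)."""
    return (a / target_error) ** (1.0 / b)


if __name__ == "__main__":
    # Synthetic points generated from a known law (a=5.0, b=0.05); not real data.
    flops = [1e18, 1e19, 1e20, 1e21]
    errors = [5.0 * c ** -0.05 for c in flops]
    a, b = fit_power_law(flops, errors)
    print(f"a={a:.3f}, b={b:.3f}")
    print(f"compute for 40% error: {compute_for_target(a, b, 0.40):.2e} FLOPs")
```

A small exponent $b$ means the required compute grows very quickly as the target error shrinks (here $C \propto \text{error}^{-1/b}$), which is why long extrapolations from such a fit should be treated with caution.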

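By deriving scaling laws for adversarial robustness, this paper asks how far computing power can drive progress in adversarial robustness, identifies directions worth exploring in future research, and releases a benchmarking framework built on timm for further analysis in efficient robust deep learning.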

Scaling Compute Is Not All You Need for Adversarial Robustness

The advent of transformers, higher computational budgets, and big data has
engendered remarkable progress in Natural Language Processing (NLP). Impressive
performance of industry pre-trained models has garnered public attention in
recent years and made news headlines. That these are industry models is
noteworthy. Rarely, if ever, do academic institutions produce exciting new NLP
models. Using these models is critical for competing on NLP benchmarks and,
correspondingly, for staying relevant in NLP research. We surveyed 100 papers
published at EMNLP 2022 to determine whether this phenomenon constitutes a
reliance on industry for NLP publications.
We find that there is indeed a substantial reliance. Citations of industry
artifacts and contributions across categories are at least three times greater
than industry publication rates per year. Quantifying this reliance does not
settle how we ought to interpret the results. We discuss two possible
perspectives: 1) Is collaboration with industry still
collaboration in the absence of an alternative? Or 2) has free NLP inquiry been
captured by the motivations and research direction of private corporations?

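This paper surveys how progress in NLP depends on industry models and how this affects academic publications, finds a substantial reliance on industry in NLP publications, and discusses two possible interpretations.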

Collaboration or Corporate Capture? Quantifying NLP's Reliance on Industry Artifacts and Contributions

We present Fast-Downsampling MobileNet (FD-MobileNet), an efficient and
accurate network for very limited computational budgets (e.g., 10-140 MFLOPs).
Our key idea is applying an aggressive downsampling strategy to MobileNet
framework. In FD-MobileNet, we perform 32$\times$ downsampling within 12
layers, only half the layers in the original MobileNet. This design brings
three advantages: (i) It remarkably reduces the computational cost. (ii) It
increases the information capacity and achieves significant performance
improvements. (iii) It is engineering-friendly and provides fast actual
inference speed. Experiments on ILSVRC 2012 and PASCAL VOC 2007 datasets
demonstrate that FD-MobileNet consistently outperforms MobileNet and achieves
results comparable to ShuffleNet under different computational budgets, for
instance, surpassing MobileNet by 5.5% on the ILSVRC 2012 top-1 accuracy and
3.6% on the VOC 2007 mAP under a complexity of 12 MFLOPs. On an ARM-based
device, FD-MobileNet achieves 1.11$\times$ inference speedup over MobileNet and
1.82$\times$ over ShuffleNet under the same complexity.
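The cost argument behind aggressive early downsampling can be checked with simple arithmetic. The sketch below counts multiply-adds of a MobileNet-style depthwise-separable layer (a $k \times k$ depthwise convolution followed by a $1 \times 1$ pointwise convolution) and shows that halving the spatial resolution cuts that layer's cost by 4$\times$; it also confirms that five stride-2 layers give the 32$\times$ downsampling that maps a 224$\times$224 input to 7$\times$7. The helper names, the channel counts, and the 224$\times$224 input size are assumptions for illustration; the exact layer configuration of FD-MobileNet is not reproduced here.

```python
def depthwise_separable_madds(h, w, c_in, c_out, k=3):
    """Multiply-adds of one depthwise-separable layer producing an h x w map:
    a k x k depthwise conv (one filter per input channel) followed by a
    1 x 1 pointwise conv mapping c_in to c_out channels."""
    depthwise = h * w * c_in * k * k
    pointwise = h * w * c_in * c_out
    return depthwise + pointwise


def spatial_size_after(input_size, num_stride2_layers):
    """Feature-map side length after a number of stride-2 layers."""
    return input_size // 2 ** num_stride2_layers


if __name__ == "__main__":
    # Same layer (64 -> 64 channels, assumed) at two resolutions.
    high = depthwise_separable_madds(112, 112, 64, 64)
    low = depthwise_separable_madds(56, 56, 64, 64)
    print(f"{high / 1e6:.1f} M vs {low / 1e6:.1f} M multiply-adds")
    print("downsampling factor:", 2 ** 5)
    print("final map side:", spatial_size_after(224, 5))
```

Because both terms scale with $h \cdot w$, performing the downsampling earlier makes every subsequent layer cheaper, which is consistent with advantage (i) above.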

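This paper presents FD-MobileNet, a network for settings with very limited computational budgets (e.g., 10 to 140 MFLOPs). Its core design applies an aggressive downsampling strategy to the MobileNet framework, which reduces computational cost and improves information capacity and inference speed; experiments on the ILSVRC 2012 and PASCAL VOC 2007 datasets show that FD-MobileNet outperforms MobileNet at the same complexity.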