We propose a new transformer-based image and video tokenizer with Binary
Spherical Quantization (BSQ). BSQ projects the high-dimensional visual
embedding to a lower-dimensional hypersphere and then applies binary
quantization. BSQ is (1) parameter-efficient without an explicit codebook, (2)
scalable to arbitrary token dimensions, and (3) compact: compressing visual
data by up to 100$\times$ with minimal distortion. Our tokenizer uses a
transformer encoder and decoder with simple block-wise causal masking to
support variable-length videos as input. The resulting BSQ-ViT achieves
state-of-the-art visual reconstruction quality on image and video
reconstruction benchmarks at 2.4$\times$ the throughput of the best
prior methods. Furthermore, by learning an autoregressive prior for adaptive
arithmetic coding, BSQ-ViT achieves video compression results comparable to
state-of-the-art video compression standards. BSQ-ViT also enables masked
language models to achieve image synthesis quality competitive with GAN- and
diffusion-based methods.
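
The two BSQ steps named in the abstract (projection onto a lower-dimensional hypersphere, then binary quantization) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `bsq_quantize`, the projection matrix `W`, the scaling of each quantized coordinate to $\pm 1/\sqrt{L}$ (chosen here so the code stays on the unit sphere), and the bit-packing order are all assumptions made for the example.

```python
import math

def bsq_quantize(z, W):
    """Sketch of binary spherical quantization for one embedding.

    z: input embedding, a list of d floats.
    W: hypothetical projection matrix, L rows of d floats each.
    Returns (u, q, index): the point on the unit sphere, its binary
    quantization, and the integer token implied by the sign bits.
    """
    # Step 1: linear projection from d down to L dimensions.
    v = [sum(w * x for w, x in zip(row, z)) for row in W]
    # Step 2: normalize onto the unit hypersphere (guard against a zero vector).
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    u = [x / norm for x in v]
    # Step 3: binary quantization, one bit per coordinate.
    # Assumption: coordinates that are exactly zero map to bit 0.
    bits = [1 if x > 0 else 0 for x in u]
    scale = 1.0 / math.sqrt(len(u))
    q = [scale if b else -scale for b in bits]
    # The token index is read directly off the bits, so no codebook is stored.
    index = sum(b << k for k, b in enumerate(bits))
    return u, q, index
```

Because the token is just the sign pattern of `u`, the number of possible tokens grows as $2^L$ without any stored codebook, which is one way to read the abstract's claim of parameter efficiency.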

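We propose a new transformer-based image and video tokenizer built on Binary Spherical Quantization (BSQ). BSQ projects a high-dimensional visual embedding onto a lower-dimensional hypersphere and then applies binary quantization. Our tokenizer uses a transformer encoder and decoder with simple block-wise causal masking to support variable-length video input. The resulting BSQ-ViT achieves state-of-the-art visual reconstruction quality on image and video reconstruction benchmarks at 2.4 times the throughput of the best prior methods. Furthermore, by learning an autoregressive prior for adaptive arithmetic coding, BSQ-ViT achieves video compression results comparable to state-of-the-art video compression standards. BSQ-ViT also enables masked language models to reach image synthesis quality on par with GAN- and diffusion-based methods.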

Tokenizing Images and Videos with Binary Spherical Quantization

Image and Video Tokenization with Binary Spherical Quantization

The co-localization problem is the task of simultaneously localizing objects
of the same class within a series of images or videos. In
\cite{joulin2014efficient}, the authors present new variants of the Frank-Wolfe
algorithm (also known as conditional gradient) that increase the efficiency of
solving the image and video co-localization problems. They demonstrate the
efficiency of their methods through the rate at which a quantity called the
Wolfe gap decreases in each iteration of the algorithm. In this project,
inspired by the conditional gradient sliding algorithm (CGS) \cite{CGS:Lan}, we
propose algorithms for solving such problems and demonstrate their efficiency
through numerical experiments. We compare the efficiency of these methods,
measured by the Wolfe gap, by implementing them on the YouTube-Objects video
dataset.
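
To make the Wolfe gap concrete, here is a minimal sketch of the plain Frank-Wolfe iteration over the probability simplex. It is neither one of the variants from \cite{joulin2014efficient} nor CGS. The function name `frank_wolfe_simplex`, the simplex constraint set, and the step size $2/(t+2)$ are assumptions chosen for illustration. At an iterate $x$ with gradient $g$, the Wolfe gap is $\langle g, x - s \rangle$, where $s$ minimizes $\langle g, s \rangle$ over the feasible set.

```python
def frank_wolfe_simplex(grad, x0, iters=100, tol=1e-8):
    """Plain Frank-Wolfe on the probability simplex.

    grad: function returning the gradient of a convex objective at x.
    x0:   starting point on the simplex (non-negative, sums to 1).
    Returns (x, gap), where gap is the last Wolfe gap that was computed.
    """
    x = list(x0)
    gap = float("inf")
    for t in range(iters):
        g = grad(x)
        # Linear minimization oracle: on the simplex the minimizer of <g, s>
        # is the vertex e_i with the smallest gradient coordinate.
        i = min(range(len(x)), key=lambda k: g[k])
        # Wolfe gap <g, x - e_i>; it upper-bounds f(x) - f* for convex f.
        gap = sum(gk * xk for gk, xk in zip(g, x)) - g[i]
        if gap <= tol:
            break
        # Standard open-loop step size; move toward the chosen vertex.
        step = 2.0 / (t + 2.0)
        x = [(1.0 - step) * xk for xk in x]
        x[i] += step
    return x, gap
```

Since the gap bounds the suboptimality from above, it serves both as a stopping criterion and as the per-iteration efficiency measure described in the abstract.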

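This paper proposes new Frank-Wolfe algorithm variants that improve the efficiency of solving image and video co-localization problems, and verifies their efficiency through numerical experiments that compare the proposed methods in terms of the Wolfe gap on the YouTube-Objects dataset.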

New Frank-Wolfe Algorithm Variants for the Video Co-localization Problem

New Variants of Frank-Wolfe Algorithm for Video Co-localization Problem

Humans view the world through many sensory channels, e.g., the
long-wavelength light channel, viewed by the left eye, or the high-frequency
vibrations channel, heard by the right ear. Each view is noisy and incomplete,
but important factors, such as physics, geometry, and semantics, tend to be
shared between all views (e.g., a "dog" can be seen, heard, and felt). We
investigate the classic hypothesis that a powerful representation is one that
models view-invariant factors. We study this hypothesis under the framework of
multiview contrastive learning, where we learn a representation that aims to
maximize mutual information between different views of the same scene but is
otherwise compact. Our approach scales to any number of views, and is
view-agnostic. We analyze key properties of the approach that make it work,
finding that the contrastive loss outperforms a popular alternative based on
cross-view prediction, and that the more views we learn from, the better the
resulting representation captures underlying scene semantics. Our approach
achieves state-of-the-art results on image and video unsupervised learning
benchmarks. Code is released at: this http URL
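
As an illustration of the contrastive objective described above, the following sketch computes an InfoNCE-style loss between two views. The abstract does not specify the exact loss, so the InfoNCE form, the function name `info_nce`, the cosine similarity, and the `temperature` parameter are all assumptions. Row `i` of each view is taken to come from the same scene, and every other row acts as a negative.

```python
import math

def info_nce(view_a, view_b, temperature=0.1):
    """Average InfoNCE-style loss over paired embeddings from two views.

    view_a[i] and view_b[i] are embeddings of the same scene (positives);
    view_b[j] for j != i are treated as negatives for view_a[i].
    """
    def normalize(v):
        n = math.sqrt(sum(x * x for x in v)) or 1.0
        return [x / n for x in v]

    a = [normalize(v) for v in view_a]
    b = [normalize(v) for v in view_b]
    total = 0.0
    for i, ai in enumerate(a):
        # Cosine similarity of anchor i to every candidate in the other view.
        logits = [sum(p * q for p, q in zip(ai, bj)) / temperature for bj in b]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the matching index as the correct class.
        total += log_denominator - logits[i]
    return total / len(a)
```

The loss is small when each embedding is closest to its own counterpart in the other view and large when the pairing is scrambled. For more than two views, one option is to sum this loss over pairs of views.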

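This paper studies a powerful representation that models the information shared across multiple noisy, incomplete views of a scene. It uses multiview contrastive learning to extract the information common to the views; the contrastive approach outperforms the alternative based on cross-view prediction and achieves state-of-the-art results on image and video unsupervised learning benchmarks.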