Recent text-to-image diffusion-based generative models have the stunning
ability to generate highly detailed and photo-realistic images and achieve
state-of-the-art low FID scores on challenging image generation benchmarks.
However, one of the primary failure modes of these text-to-image generative
models is in composing attributes, objects, and their associated relationships
accurately into an image. In our paper, we investigate this
compositionality-based failure mode and highlight that imperfect text
conditioning with the CLIP text-encoder is one of the primary reasons behind
the inability of these models to generate high-fidelity compositional scenes.
In particular, we show that (i) there exists an optimal text-embedding space
that can generate highly coherent compositional scenes, which shows that the
output space of the CLIP text-encoder is sub-optimal, and (ii) the final token
embeddings in CLIP are erroneous, as they often include attention contributions
from unrelated tokens in compositional prompts. Our main finding is that the
best compositional improvements can be achieved (without harming the model's
FID scores) by fine-tuning only a simple linear projection on CLIP's
representation space in Stable-Diffusion variants using a small set of
compositional image-text pairs. This result demonstrates that the
sub-optimality of CLIP's output space is a major error source. We also show
that re-weighting the erroneous attention contributions in CLIP can lead to
improved compositional performance; however, these improvements are often less
significant than those achieved by solely learning a linear projection head,
highlighting erroneous attention to be only a minor error source.
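
As a rough illustration of the "simple linear projection" idea, the sketch below applies a trainable map to frozen text-token embeddings before they would be passed on as conditioning. Everything here is an assumption made for illustration: the class name `LinearProjection`, the identity initialization, the random stand-in for the text-encoder output, and the sizes (77 tokens of width 768, as in the CLIP text encoder used by Stable Diffusion v1). The abstract does not give the training procedure, so no training loop is shown.

```python
import numpy as np


class LinearProjection:
    """Hypothetical trainable map z -> W z + b applied to each token embedding."""

    def __init__(self, dim):
        # Assumption: identity initialization, so the untrained projection
        # leaves the frozen encoder's embeddings unchanged.
        self.W = np.eye(dim)
        self.b = np.zeros(dim)

    def __call__(self, token_embeddings):
        # token_embeddings has shape (num_tokens, dim).
        return token_embeddings @ self.W.T + self.b

    def num_parameters(self):
        return self.W.size + self.b.size


rng = np.random.default_rng(0)
# Random stand-in for the frozen text-encoder output (not real CLIP features).
frozen_text_embeddings = rng.standard_normal((77, 768))

proj = LinearProjection(768)
conditioning = proj(frozen_text_embeddings)  # same shape as the input
```

Only `W` and `b` would be updated in such a setup, which at this width amounts to 768 * 768 + 768 parameters while the text encoder and the image generator stay frozen.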

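By studying compositionality-based failure modes, we find that imperfect text conditioning from the CLIP text-encoder is the main reason text-to-image generative models fail to generate high-fidelity compositional scenes, and we propose that learning only a simple linear projection on CLIP's representation space achieves the best compositional improvements without degrading the model's FID scores.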

Understanding and Mitigating Compositional Issues in Text-to-Image Generative Models

This technical note aims to offer deeper intuition for the LayerNorm function
common in deep neural networks. LayerNorm is defined relative to a
distinguished 'neural' basis, but it does more than just normalize the
corresponding vector elements. Rather, it implements a composition -- of linear
projection, nonlinear scaling, and then affine transformation -- on input
activation vectors. We develop both a new mathematical expression and geometric
intuition, to make the net effect more transparent. We emphasize that, when
LayerNorm acts on an N-dimensional vector space, all outcomes of LayerNorm lie
within the intersection of an (N-1)-dimensional hyperplane and the interior of
an N-dimensional hyperellipsoid. This intersection is the interior of an
(N-1)-dimensional hyperellipsoid, and typical inputs are mapped near its
surface. We find the directions and lengths of the principal axes of this
(N-1)-dimensional hyperellipsoid via the eigen-decomposition of a simply
constructed matrix.
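
The composition described above can be checked numerically. The sketch below is a minimal illustration, not the note's own derivation: it writes LayerNorm as a projection onto the zero-mean hyperplane, a nonlinear rescaling, and an affine map with gain `gamma` and bias `beta`. The matrix `N * G P G` (with `G = diag(gamma)` and `P` the mean-removing projection) is one natural choice whose eigen-decomposition gives the principal axes; it is an assumption that this matches the matrix the note constructs. The function names and example values are likewise illustrative.

```python
import numpy as np


def layernorm(x, gamma, beta, eps=1e-5):
    # Standard per-vector LayerNorm with elementwise gain and bias.
    return gamma * (x - x.mean()) / np.sqrt(x.var() + eps) + beta


def layernorm_composed(x, gamma, beta, eps=1e-5):
    n = x.shape[0]
    P = np.eye(n) - np.ones((n, n)) / n          # linear projection: remove the mean
    p = P @ x
    scale = np.sqrt(n) / np.sqrt(p @ p + n * eps)  # nonlinear scaling (var = |p|^2 / n)
    return gamma * (scale * p) + beta             # affine transformation


def principal_axes(gamma):
    # Semi-axis lengths and directions of the (N-1)-dimensional hyperellipsoid,
    # centered at beta, from the symmetric matrix N * G P G (assumed choice).
    n = gamma.shape[0]
    P = np.eye(n) - np.ones((n, n)) / n
    G = np.diag(gamma)
    evals, evecs = np.linalg.eigh(n * G @ P @ G)  # eigenvalues in ascending order
    # The smallest eigenvalue is (numerically) zero; drop it.
    return np.sqrt(np.clip(evals[1:], 0.0, None)), evecs[:, 1:]


x = np.array([1.0, -2.0, 0.5, 3.0])
gamma = np.array([0.5, 1.0, 1.5, 2.0])
beta = np.array([0.1, -0.2, 0.3, 0.0])

y = layernorm(x, gamma, beta)
lengths, axes = principal_axes(gamma)
# Quadratic form of the ellipsoid: below 1 inside, equal to 1 on the surface.
q = np.sum((axes.T @ (y - beta)) ** 2 / lengths ** 2)
```

For this input `q` is slightly below 1, consistent with typical inputs landing just inside the surface; the gap comes only from `eps`.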

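This technical note aims to give a deeper intuitive understanding of the LayerNorm function common in deep neural networks. By developing a new mathematical expression and geometric intuition, it makes the net effect more transparent, emphasizing that when LayerNorm acts on an N-dimensional vector space, all LayerNorm outputs lie in the intersection of an (N-1)-dimensional hyperplane and the interior of an N-dimensional hyperellipsoid. This intersection is the interior of an (N-1)-dimensional hyperellipsoid, and typical inputs are mapped near its surface. We find the principal-axis directions and lengths of this (N-1)-dimensional hyperellipsoid via the eigen-decomposition of a simply constructed matrix.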

Geometry and Dynamics of LayerNorm

This is a further development of Vision Transformer Pruning via matrix
decomposition. The purpose of Vision Transformer Pruning is to prune the
dimensions of the linear projection of the dataset by learning their associated
importance scores in order to reduce the storage, run-time memory, and
computational demands. In this paper, we further reduce the dimension and
complexity of the linear projection by implementing and comparing several
matrix decomposition methods while preserving the generated important features.
We ultimately selected Singular Value Decomposition as the method to achieve
our goal by comparing the accuracy scores reported in the original GitHub
repository with the accuracy scores obtained using those matrix decomposition
methods, including Singular Value Decomposition, four versions of QR
Decomposition, and LU factorization.
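
To make the SVD option concrete, the sketch below factors the weight matrix of a linear projection into two thinner matrices using a truncated Singular Value Decomposition. It is a generic low-rank factorization written for illustration, not the paper's pipeline: the function name `low_rank_factors`, the matrix sizes, and the chosen ranks are all assumptions, and the weights are random rather than taken from a trained Vision Transformer.

```python
import numpy as np


def low_rank_factors(W, rank):
    """Return A (out x rank) and B (rank x in) with A @ B close to W."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * s[:rank]  # fold the kept singular values into the left factor
    B = Vt[:rank, :]
    return A, B


rng = np.random.default_rng(0)
# Random 64 x 48 weight matrix built to have rank 16 (illustrative only).
W = rng.standard_normal((64, 16)) @ rng.standard_normal((16, 48))
x = rng.standard_normal(48)

# Keeping all 16 non-zero singular values reproduces the projection.
A, B = low_rank_factors(W, 16)
y_full = W @ x
y_low = A @ (B @ x)

# Truncating further trades accuracy for size; the Frobenius error equals
# the root-sum-square of the discarded singular values.
A8, B8 = low_rank_factors(W, 8)
s = np.linalg.svd(W, compute_uv=False)
error_rank8 = np.linalg.norm(W - A8 @ B8)
expected_error = np.sqrt(np.sum(s[8:] ** 2))
```

In this toy case the factored form stores 16 * (64 + 48) = 1792 values instead of 64 * 48 = 3072, and applying `B` then `A` replaces one large matrix-vector product with two smaller ones.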

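Building on Vision Transformer Pruning via matrix decomposition, this paper compares several matrix decomposition methods while preserving the important features, and ultimately selects Singular Value Decomposition as the method for reducing dimension and computational complexity, based on a comparison with the original accuracy scores.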