Natural language interaction is a promising direction for democratizing 3D
shape design. However, existing methods for text-driven 3D shape editing face
challenges in producing decoupled, local edits to 3D shapes. We address this
problem by learning disentangled latent representations that ground language in
3D geometry. To this end, we propose a complementary tool set including a novel
network architecture, a disentanglement loss, and a new editing procedure.
Additionally, to measure edit locality, we define a new metric that we call
part-wise edit precision. We show that our method outperforms existing
state-of-the-art methods by 20% in edit locality and by up to 6.6% in language
reference resolution accuracy. Our work suggests that by solely disentangling
language representations, downstream 3D shape editing can become more local to
relevant parts, even if the model was never given explicit part-based
supervision.

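By learning disentangled latent representations, using a novel network architecture, a disentanglement loss, and a new editing procedure, we address the challenges of text-driven 3D shape editing. We also propose a new metric, part-wise edit precision, to measure edit locality, and show that our method outperforms existing methods by about 20% in edit locality and by up to 6.6% in language reference resolution accuracy.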

LADIS: Language Disentanglement for 3D Shape Editing

State-of-the-art Variational Auto-Encoders (VAEs) for learning disentangled
latent representations give impressive results in discovering features like
pitch, pause duration, and accent in speech data, leading to highly
controllable text-to-speech (TTS) synthesis. However, these LSTM-based VAEs
fail to learn latent clusters of speaker attributes when trained on either
limited or noisy datasets. Further, different latent variables start encoding
the same features, limiting the control and expressiveness during speech
synthesis. To resolve these issues, we propose RTI-VAE (Reordered Transformer
with Information reduction VAE) where we minimize the mutual information
between different latent variables and devise a modified Transformer
architecture with layer reordering to learn controllable latent representations
in speech data. We show that RTI-VAE reduces the cluster overlap of speaker
attributes by at least 30% over LSTM-VAE and by at least 7% over vanilla
Transformer-VAE.
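
The idea of penalizing mutual information between latent variables can be illustrated with a small sketch. The abstract does not say how RTI-VAE estimates mutual information, so the estimator below is an assumption: it treats two scalar latent variables as jointly Gaussian, in which case the mutual information is -0.5 * log(1 - rho^2), where rho is their correlation. The function name `gaussian_mutual_information` is hypothetical.

```python
import numpy as np

def gaussian_mutual_information(a, b):
    """Estimate I(a; b) in nats, ASSUMING a and b are jointly Gaussian.

    This is an illustrative stand-in, not the estimator used by RTI-VAE.
    """
    rho = np.corrcoef(a, b)[0, 1]
    # Clip so that perfectly correlated inputs do not produce log(0).
    rho_sq = np.clip(rho ** 2, 0.0, 1.0 - 1e-12)
    return -0.5 * np.log(1.0 - rho_sq)

rng = np.random.default_rng(0)
z_a = rng.standard_normal(10_000)
z_independent = rng.standard_normal(10_000)            # unrelated latent
z_entangled = z_a + 0.1 * rng.standard_normal(10_000)  # nearly a copy of z_a

mi_independent = gaussian_mutual_information(z_a, z_independent)
mi_entangled = gaussian_mutual_information(z_a, z_entangled)
print(f"independent: {mi_independent:.4f} nats, entangled: {mi_entangled:.4f} nats")
```

Two latent variables that encode the same feature give a large estimate, while independent ones give a value near zero, so a term of this kind could serve as a penalty that discourages different latent variables from encoding the same feature.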

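We propose RTI-VAE, which uses a modified Transformer architecture and information reduction to learn controllable latent variables for speech data, reducing the cluster overlap of speaker attributes by at least 30% compared with LSTM-VAE and by at least 7% compared with a vanilla Transformer-VAE.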

Learning Robust Latent Representations for Controllable Speech Synthesis

Latent traversal is a popular approach for visualizing disentangled latent
representations. When a single unit of the latent representation is varied,
a single factor of variation in the data is expected to change while the
others stay fixed. However, this impressive
experimental observation is rarely explicitly encoded in the objective function
of learning disentangled representations. This paper defines the variation
predictability of latent disentangled representations. Given image pairs
generated by latent codes varying in a single dimension, this varied dimension
could be closely correlated with these image pairs if the representation is
well disentangled. Within an adversarial generation process, we encourage
variation predictability by maximizing the mutual information between latent
variations and corresponding image pairs. We further develop an evaluation
metric that does not rely on the ground-truth generative factors to measure the
disentanglement of latent representations. The proposed variation
predictability is a general constraint that is applicable to the VAE and GAN
frameworks for boosting disentanglement of latent representations. Experiments
show that the proposed variation predictability correlates well with existing
ground-truth-required metrics and the proposed algorithm is effective for
disentanglement learning.
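
The pairing scheme described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the helper names `sample_latent_pairs` and `cross_entropy` are hypothetical, the generator is replaced by the identity map, and the predictor is a hand-written stand-in rather than a trained network. The link to mutual information is the standard variational bound: minimizing the cross-entropy of a predictor that guesses the varied dimension from the pair maximizes a lower bound on the mutual information between the varied dimension and the pair.

```python
import numpy as np

def sample_latent_pairs(batch, dim, rng):
    """Return (z1, z2, d) where z2 equals z1 except in one dimension d per row."""
    z1 = rng.standard_normal((batch, dim))
    d = rng.integers(0, dim, size=batch)  # index of the varied dimension
    z2 = z1.copy()
    z2[np.arange(batch), d] = rng.standard_normal(batch)  # resample only d
    return z1, z2, d

def cross_entropy(logits, labels):
    """Mean cross-entropy of integer labels under softmax(logits)."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

rng = np.random.default_rng(0)
z1, z2, varied = sample_latent_pairs(batch=8, dim=5, rng=rng)

# Toy stand-ins (assumptions): an identity "generator", and a "predictor"
# whose logits are the scaled absolute difference between the two outputs.
x1, x2 = z1, z2
logits = 50.0 * np.abs(x2 - x1)

predicted = logits.argmax(axis=1)
loss = cross_entropy(logits, varied)
chance_loss = cross_entropy(np.zeros((8, 5)), varied)  # uniform guess: log(5)
print(predicted, varied, loss, chance_loss)
```

With a perfectly disentangled toy generator the varied dimension is trivially recoverable from the pair. In an entangled generator a change in one latent dimension would spread across many output factors, and the predictor's loss would stay closer to the chance level.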

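This paper proposes a method based on variation predictability for improving the disentanglement of latent representations in the VAE and GAN frameworks. The method encourages variation predictability by maximizing the mutual information between latent variations and the corresponding image pairs, and it introduces a new evaluation metric for measuring the disentanglement of latent representations. Experiments show that the proposed variation predictability correlates well with existing metrics that require ground truth and that the method is effective for disentanglement learning.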