Understanding images without explicit supervision has become an important
problem in computer vision. In this paper, we address image captioning by
generating language descriptions of scenes without learning from annotated
pairs of images and their captions. The core component of our approach is a
shared latent space that is structured by visual concepts. In this space, the
two modalities should be indistinguishable. A language model is first trained
to encode sentences into semantically structured embeddings. Image features
that are translated into this embedding space can be decoded into descriptions
through the same language model, similarly to sentence embeddings. This
translation is learned from weakly paired images and text using a loss robust
to noisy assignments and a conditional adversarial component. Our approach
allows to exploit large text corpora outside the annotated distributions of
image/caption data. Our experiments show that the proposed domain alignment
learns a semantically meaningful representation which outperforms previous
work.

通过共享的、结构化的视觉概念潜在空间，将图像特征转化到语义向量嵌入空间中，并使用同一语言模型将其解码为场景描述，无需明确监督来了解图像；这种转化借助于暴露于图像 / 标题数据分布之外的大型文本语料库，并且具有鲁棒性。

共享多模态嵌入的无监督图像字幕生成

Towards Unsupervised Image Captioning with Shared Multimodal Embeddings

Real-world robotics problems often occur in domains that differ significantly
from the robot's prior training environment. For many robotic control tasks,
real world experience is expensive to obtain, but data is easy to collect in
either an instrumented environment or in simulation. We propose a novel domain
adaptation approach for robot perception that adapts visual representations
learned on a large easy-to-obtain source dataset (e.g. synthetic images) to a
target real-world domain, without requiring expensive manual data annotation of
real world data before policy search. Supervised domain adaptation methods
minimize cross-domain differences using pairs of aligned images that contain
the same object or scene in both the source and target domains, thus learning a
domain-invariant representation. However, they require manual alignment of such
image pairs. Fully unsupervised adaptation methods rely on minimizing the
discrepancy between the feature distributions across domains. We propose a
novel, more powerful combination of both distribution and pairwise image
alignment, and remove the requirement for expensive annotation by using weakly
aligned pairs of images in the source and target domains. Focusing on adapting
from simulation to real world data using a PR2 robot, we evaluate our approach
on a manipulation task and show that by using weakly paired images, our method
compensates for domain shift more effectively than previous techniques,
enabling better robot performance in the real world.

提出了一种新颖的领域适应方法，将在大型易于获得的源数据集 (例如，合成图像) 上学习的视觉表示适应到目标实际世界领域，不需要昂贵的手工数据注释。作者使用弱对齐图像，结合分布对齐的方式来解决实际和模拟环境差异的问题，并在机器人操作任务上对其进行了评估。