Recently, Multi-modal Named Entity Recognition (MNER) has attracted considerable
attention. Most existing work utilizes image information through region-level
visual representations obtained from a pretrained object detector and relies on
an attention mechanism to model the interactions between image and text
representations. However, it is difficult to model such interactions as image
and text representations are trained separately on the data of their respective
modality and are not aligned in the same space. As text representations play
the most important role in MNER, in this paper, we propose {\bf I}mage-{\bf
t}ext {\bf A}lignments (ITA) to align image features into the textual space, so
that the attention mechanism in transformer-based pretrained textual embeddings
can be better utilized. ITA first aligns the image into regional object tags,
image-level captions and optical characters as visual contexts, concatenates
them with the input texts as a new cross-modal input, and then feeds it into a
pretrained textual embedding model. This makes it easier for the attention
module of a pretrained textual embedding model to model the interaction between
the two modalities since they are both represented in the textual space. ITA
further aligns the output distributions predicted from the cross-modal input
and textual input views so that the MNER model can be more practical in dealing
with text-only inputs and more robust to noise from images. In our experiments, we
show that ITA models can achieve state-of-the-art accuracy on multi-modal Named
Entity Recognition datasets, even without image information.
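
The two alignment steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `[SEP]` separator string, and the use of a per-token KL divergence between softmax distributions are all assumptions made for illustration, since the abstract does not specify how the visual contexts are delimited or which divergence is used. The object tags, captions and OCR text are assumed to have been produced already by external models.

```python
import math

# Assumed separator between the sentence and each visual context;
# the abstract does not say which token is actually used.
SEP = "[SEP]"


def build_cross_modal_input(text, object_tags, captions, ocr_text):
    """Concatenate the input text with textual visual contexts.

    Empty contexts are skipped, so a text-only input is returned unchanged.
    """
    parts = [text]
    if object_tags:
        parts.append(" ".join(object_tags))
    if captions:
        parts.append(" ".join(captions))
    if ocr_text:
        parts.append(ocr_text)
    return f" {SEP} ".join(parts)


def softmax(logits):
    """Numerically stable softmax over one token's label scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def alignment_loss(cross_modal_logits, text_logits):
    """Mean per-token KL between the cross-modal and text-only views.

    Both arguments are lists of per-token label scores. Using a per-token
    KL is an illustrative assumption, not a detail stated in the abstract.
    """
    if len(cross_modal_logits) != len(text_logits):
        raise ValueError("both views must score the same tokens")
    per_token = [
        kl_divergence(softmax(c), softmax(t))
        for c, t in zip(cross_modal_logits, text_logits)
    ]
    return sum(per_token) / len(per_token)


if __name__ == "__main__":
    sentence = "Kevin Durant enters Oracle Arena"
    print(build_cross_modal_input(
        sentence, ["person", "ball"], ["a man holding a basketball"], ""))
    print(alignment_loss([[2.0, 0.5, 0.1]], [[1.5, 0.7, 0.2]]))
```

In this sketch the concatenated string would be passed to a pretrained textual embedding model in place of the raw sentence, and the alignment term would be added to the usual tagging loss so that predictions from the text-only view stay close to those from the cross-modal view.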

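This paper proposes a multi-modal named entity recognition technique based on image-text alignment: by aligning image features with textual information in the textual space, it makes the interactions between the two modalities easier to model and thereby improves the accuracy of named entity recognition.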

Image-Text Alignment for Multi-Modal Named Entity Recognition

ITA: Image-Text Alignments for Multi-Modal Named Entity Recognition

As robots begin to cohabit with humans in semi-structured environments, the
need arises to understand instructions involving rich variability---for
instance, learning to ground symbols in the physical world. Realistically, this
task must cope with small datasets consisting of a particular user's contextual
assignment of meaning to terms. We present a method for processing a raw stream
of cross-modal input---i.e., linguistic instructions, visual perception of a
scene and a concurrent trace of 3D eye tracking fixations---to produce the
segmentation of objects with a corresponding association to high-level
concepts. To test our framework, we present experiments in a table-top object
manipulation scenario. Our results show our model learns the user's notion of
colour and shape from a small number of physical demonstrations, generalising
to identifying physical referents for novel combinations of the words.

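This paper proposes a method for processing a raw stream of cross-modal input to produce a segmentation of objects associated with high-level concepts. The model learns the user's notions of colour and shape from a small number of physical demonstrations and generalises to identifying physical referents for novel combinations of the words.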