Automated image captioning has the potential to be a useful tool for people
with vision impairments. Images taken by this user group are often noisy, which
leads to incorrect and even unsafe model predictions. In this paper, we propose
a quality-agnostic framework to improve the performance and robustness of image
captioning models for visually impaired people. We address this problem from
three angles: data, model, and evaluation. First, we show how data augmentation
techniques for generating synthetic noise can address data sparsity in this
domain. Second, we enhance the robustness of the model by expanding a
state-of-the-art model to a dual network architecture, using the augmented data
and leveraging different consistency losses. Our results demonstrate increased
performance, e.g., an absolute improvement of 2.15 on CIDEr over
state-of-the-art image captioning networks, as well as increased robustness to
noise, with an improvement of up to 3 CIDEr points in noisier settings. Finally,
we evaluate the prediction reliability using confidence calibration on images
with different difficulty/noise levels, showing that our models perform more
reliably in safety-critical situations. The improved model is part of an
assisted living application, which we develop in partnership with the Royal
National Institute of Blind People.
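
The synthetic-noise augmentation mentioned above could be sketched roughly as follows. The abstract does not say which noise types are used, so the Gaussian noise and box blur here, along with the names `add_gaussian_noise`, `box_blur`, `augment`, and the `max_sigma`/`max_radius` parameters, are illustrative assumptions rather than the authors' pipeline.

```python
import numpy as np

def add_gaussian_noise(image, sigma, rng):
    """Add zero-mean Gaussian noise to a float image with values in [0, 1]."""
    noisy = image + rng.normal(0.0, sigma, size=image.shape)
    return np.clip(noisy, 0.0, 1.0)

def box_blur(image, radius):
    """Blur an H x W x C float image with a (2*radius+1)^2 mean filter."""
    if radius == 0:
        return image.copy()
    k = 2 * radius + 1
    # Replicate edge pixels so the output keeps the input size.
    padded = np.pad(image, ((radius, radius), (radius, radius), (0, 0)), mode="edge")
    h, w = image.shape[:2]
    out = np.zeros_like(image)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + h, dx:dx + w]
    return out / (k * k)

def augment(image, rng, max_sigma=0.1, max_radius=2):
    """Apply a randomly chosen amount of blur and noise (hypothetical ranges)."""
    radius = int(rng.integers(0, max_radius + 1))
    sigma = float(rng.uniform(0.0, max_sigma))
    return add_gaussian_noise(box_blur(image, radius), sigma, rng)

rng = np.random.default_rng(0)
clean = rng.random((32, 32, 3))   # stand-in for a real photo
noisy = augment(clean, rng)       # degraded copy paired with the same caption
```

In a dual-network setup, a clean image and its degraded copy would each be fed to one branch, with a consistency loss encouraging the two branches to agree; that pairing is how the sketch is meant to connect to the description above, not a confirmed detail of the model.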

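This paper proposes a quality-agnostic framework that uses data augmentation, a dual network architecture, and confidence calibration to improve the performance and robustness of image captioning models for visually impaired people; the improved model is part of an assisted living application developed in partnership with the Royal National Institute of Blind People.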
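
To make the notion of confidence calibration concrete, here is a minimal sketch of expected calibration error (ECE), a widely used calibration measure. The abstract does not state which calibration metric is used or how a caption prediction is judged "correct", so both the choice of ECE and the binary `correct` labels are assumptions made for illustration.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between confidence and accuracy over equal-width bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    # Map each confidence in [0, 1] to a bin index; 1.0 falls into the last bin.
    bin_ids = np.minimum((confidences * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        in_bin = bin_ids == b
        if not in_bin.any():
            continue
        gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
        ece += in_bin.mean() * gap  # weight by the fraction of samples in the bin
    return float(ece)

# An overconfident model: always 100% confident, right only once in four tries.
overconfident = expected_calibration_error([1.0, 1.0, 1.0, 1.0], [1, 0, 0, 0])
```

A lower value means the reported confidence tracks the actual accuracy more closely. Computing it separately for each difficulty or noise level would show whether reliability degrades on harder images.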

Quality-Insensitive Image Captioning for People with Vision Impairment

Quality-agnostic Image Captioning to Safely Assist People with Vision Impairment

The use of attention models for automated image captioning has enabled many
systems to produce accurate and meaningful descriptions for images. Over the
years, many novel approaches have been proposed to enhance the attention
process using different feature representations. In this paper, we extend this
approach by creating a guided attention network mechanism that exploits the
relationship between the visual scene and text descriptions using spatial
features from the image, high-level information from the topics, and temporal
context from caption generation, which are embedded together in an ordered
embedding space. A pairwise ranking objective is used to train this
embedding space, which allows similar images, topics, and captions in the shared
semantic space to maintain a partial order in the visual-semantic hierarchy and
hence helps the model produce more visually accurate captions.
Experimental results on the MSCOCO dataset show that our approach is
competitive with many state-of-the-art models on various evaluation metrics.
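
The pairwise ranking objective over an ordered embedding space can be illustrated with a small sketch. The abstract gives no formula, so this uses one common formulation as an assumption: an order-violation penalty that is zero when the more general item (the caption) is coordinate-wise no larger than the more specific item (the image), combined with a margin-based ranking loss. The function names and the `margin` value are hypothetical, and topics are left out for brevity; they would presumably be handled in the same pairwise way.

```python
import numpy as np

def order_violation(specific, general):
    """Squared amount by which `general` exceeds `specific` in any coordinate."""
    return np.sum(np.maximum(0.0, general - specific) ** 2, axis=-1)

def pairwise_ranking_loss(images, captions, margin=0.05):
    """images, captions: (N, D) non-negative embeddings; row k of each is a true pair."""
    # scores[i, j] is high (close to 0) when caption j is consistent with image i.
    scores = -order_violation(images[:, None, :], captions[None, :, :])
    positives = np.diag(scores)
    # For each image, every mismatched caption should score lower by at least `margin`.
    cost_captions = np.maximum(0.0, margin - positives[:, None] + scores)
    # For each caption, every mismatched image should score lower by at least `margin`.
    cost_images = np.maximum(0.0, margin - positives[None, :] + scores)
    off_diag = ~np.eye(len(images), dtype=bool)
    return float(cost_captions[off_diag].sum() + cost_images[off_diag].sum())

images = np.array([[1.0, 0.0], [0.0, 1.0]])
captions = np.array([[0.5, 0.0], [0.0, 0.5]])
loss_matched = pairwise_ranking_loss(images, captions)        # pairs respect the order
loss_swapped = pairwise_ranking_loss(images, captions[::-1])  # pairs deliberately mismatched
```

With correctly paired embeddings the loss is zero, while swapping the captions produces a positive loss, which is the gradient signal that would push true pairs to respect the partial order.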

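This paper proposes a guided attention network mechanism that embeds spatial features of the image, high-level topic information, and the temporal context of the generated caption into an ordered embedding space trained with a pairwise ranking objective; the model is competitive with many state-of-the-art models on the MSCOCO dataset.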