Recently, learning open-vocabulary semantic segmentation from text
supervision has achieved promising downstream performance. Nevertheless,
current approaches encounter an alignment granularity gap owing to the absence
of dense annotations, wherein they learn coarse image/region-text alignment
during training yet perform group/pixel-level predictions at inference. This
discrepancy leads to suboptimal learning efficiency and inferior zero-shot
segmentation results. In this paper, we introduce a Multi-Grained Cross-modal
Alignment (MGCA) framework, which explicitly learns pixel-level alignment along
with object- and region-level alignment to bridge the granularity gap without
any dense annotations. Specifically, MGCA constructs pseudo multi-granular
semantic correspondences from image-text pairs and combines them with hard
sampling strategies to facilitate fine-grained cross-modal contrastive
learning. Further, we point out the defects of existing group and
pixel prediction units in downstream segmentation and develop an adaptive
semantic unit that effectively mitigates their problems, including under- and
over-segmentation. Trained solely on CC3M, our method achieves significant
advancements over state-of-the-art methods, demonstrating its effectiveness and
efficiency.
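
The abstract does not state the exact form of the contrastive objective, so the following is only a minimal sketch of a generic InfoNCE-style loss of the kind commonly used for cross-modal alignment. The function names, the temperature value, and the toy embeddings are assumptions made for illustration, not the authors' implementation. The `visual` vectors stand in for features at any one granularity (pixel, region, or object), and hard sampling is not modeled here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(visual, text, temperature=0.07):
    """Mean visual-to-text InfoNCE loss; visual[i] is paired with text[i].

    Hypothetical helper: all other text embeddings in the batch act as
    negatives for visual[i].
    """
    losses = []
    for i, v in enumerate(visual):
        logits = [cosine(v, t) / temperature for t in text]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)

# Toy embeddings (made up): correctly paired vs. deliberately mismatched.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy case the correctly paired batch yields a loss near zero, while the mismatched batch yields a large loss, which is the behavior a contrastive alignment objective relies on.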

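We propose a Multi-Grained Cross-modal Alignment (MGCA) framework that learns alignment at the pixel, object, and region levels to address the granularity gap between training and pixel-level prediction in existing methods. It uses hard sampling strategies to facilitate fine-grained cross-modal contrastive learning, and develops an adaptive semantic unit to remedy the defects of pixel prediction units in downstream segmentation. Trained on the CC3M dataset, the method significantly outperforms existing methods, validating its effectiveness and efficiency.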

Multi-Grained Cross-modal Alignment for Learning Open-vocabulary Semantic Segmentation from Text Supervision

Prior work in visual dialog has focused on training deep neural models on
VisDial in isolation. Instead, we present an approach to leverage pretraining
on related vision-language datasets before transferring to visual dialog. We
adapt the recently proposed ViLBERT (Lu et al., 2019) model for multi-turn
visually-grounded conversations. Our model is pretrained on the Conceptual
Captions and Visual Question Answering datasets, and finetuned on VisDial. Our
best single model outperforms prior published work (including model ensembles)
by more than 1% absolute on NDCG and MRR. Next, we find that additional
finetuning using "dense" annotations in VisDial leads to even higher NDCG
(more than 10% over our base model) but hurts MRR (more than 17% below our
base model). This highlights a trade-off between the two primary metrics, NDCG
and MRR, which we find is due to dense annotations not correlating well with
the original ground-truth answers to questions.
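
To make the NDCG/MRR trade-off concrete, here is a minimal sketch of both metrics for a single question. The helper names, the four-candidate toy data, and the choice of cutoff (the number of candidates with nonzero relevance) are assumptions for illustration and may differ from the official VisDial evaluation code.

```python
import math

def reciprocal_rank(ranking, gt_index):
    """1 / (1-based position of the ground-truth answer in the ranking)."""
    return 1.0 / (ranking.index(gt_index) + 1)

def ndcg(ranking, relevance, k=None):
    """NDCG@k for one ranking of candidate indices.

    Assumption: k defaults to the number of candidates with nonzero
    dense relevance.
    """
    if k is None:
        k = sum(1 for r in relevance if r > 0)

    def dcg(order):
        return sum(relevance[c] / math.log2(pos + 2)
                   for pos, c in enumerate(order[:k]))

    ideal = sorted(range(len(relevance)),
                   key=lambda c: relevance[c], reverse=True)
    ideal_dcg = dcg(ideal)
    return dcg(ranking) / ideal_dcg if ideal_dcg > 0 else 0.0

# Made-up example: candidate 0 is the original ground-truth answer, but
# the dense annotations rate candidates 1 and 2 as more relevant.
relevance = [0.2, 1.0, 0.8, 0.0]
gt_index = 0

ranking_a = [0, 1, 2, 3]  # puts the ground-truth answer first
ranking_b = [1, 2, 0, 3]  # follows the dense relevance scores

rr_a, ndcg_a = reciprocal_rank(ranking_a, gt_index), ndcg(ranking_a, relevance)
rr_b, ndcg_b = reciprocal_rank(ranking_b, gt_index), ndcg(ranking_b, relevance)
```

In this toy case, `ranking_a` has the better reciprocal rank while `ranking_b` has the better NDCG. The two metrics can pull in opposite directions whenever the dense relevance scores disagree with the original ground-truth answer.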

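This paper proposes a ViLBERT-based approach that pretrains on vision-language datasets related to Visual Dialog and then transfers to training on Visual Dialog. It also finds that finetuning with dense annotations in Visual Dialog improves NDCG but lowers MRR.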

Large-scale Pretraining for Visual Dialog: A Simple State-of-the-Art Baseline

Modern approaches for multi-person pose estimation in video require large
amounts of dense annotations. However, labeling every frame in a video is
costly and labor intensive. To reduce the need for dense annotations, we
propose a PoseWarper network that leverages training videos with sparse
annotations (every k frames) to learn to perform dense temporal pose
propagation and estimation. Given a pair of video frames (a labeled Frame A
and an unlabeled Frame B), we train our model to predict human pose in Frame A
using the features from Frame B by means of deformable convolutions to
implicitly learn the pose warping between A and B. We demonstrate that we can
leverage our trained PoseWarper for several applications. First, at inference
time we can reverse the application direction of our network in order to
propagate pose information from manually annotated frames to unlabeled frames.
This makes it possible to generate pose annotations for the entire video given
only a few manually-labeled frames. Compared to modern label propagation
methods based on optical flow, our warping mechanism is much more compact (6M
vs 39M parameters), and also more accurate (88.7% mAP vs 83.8% mAP). We also
show that we can improve the accuracy of a pose estimator by training it on an
augmented dataset obtained by adding our propagated poses to the original
manual labels. Lastly, we can use our PoseWarper to aggregate temporal pose
information from neighboring frames during inference. This allows our system to
achieve state-of-the-art pose detection results on the PoseTrack2017 and
PoseTrack2018 datasets. Code has been made available.
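
As a rough illustration of the sparse-annotation setting (labels every k frames), the sketch below lists which frames are labeled and pairs each unlabeled frame with a labeled source frame for propagation. The abstract does not say how the source frame is chosen; picking the nearest labeled frame, with ties going to the earlier one, is an assumption, and all helper names are hypothetical.

```python
def labeled_frames(num_frames, k):
    """Indices of manually labeled frames when every k-th frame is annotated."""
    return list(range(0, num_frames, k))

def nearest_labeled(frame, labeled):
    """Closest labeled frame index; ties go to the earlier frame (assumption)."""
    return min(labeled, key=lambda a: (abs(a - frame), a))

def propagation_pairs(num_frames, k):
    """(labeled Frame A, unlabeled Frame B) pairs covering every unlabeled frame."""
    labeled = labeled_frames(num_frames, k)
    labeled_set = set(labeled)
    return [(nearest_labeled(b, labeled), b)
            for b in range(num_frames) if b not in labeled_set]

# A 10-frame clip annotated every 4 frames: frames 0, 4, and 8 are labeled.
pairs = propagation_pairs(num_frames=10, k=4)
```

Each pair names the labeled frame whose pose would be propagated and the unlabeled frame that receives it. The warping model itself (deformable convolutions over frame features) is not modeled in this sketch.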

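By training the PoseWarper network on sparsely annotated training videos, the paper proposes a way to reduce the need for dense annotations. It uses deformable convolutions to implicitly learn pose warping, which makes it possible to generate pose annotations for an entire video and may improve the accuracy of pose estimation.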