Large language and vision models (LLVMs) have transformed how social movement
scholars identify protests and extract key protest attributes from multi-modal
data such as text, images, and videos. This article documents how we
fine-tuned two large pretrained transformer models, longformer and
swin-transformer v2, to infer potential protests in news articles using textual
and imagery data. First, the longformer model was fine-tuned using the Dynamic
of Collective Action (DoCA) Corpus. We matched New York Times articles with
the DoCA database to obtain a training dataset for downstream tasks. Second,
the swin-transformer v2 model was trained on UCLA-protest imagery data. The
UCLA-protest project contains labeled imagery data with attributes such as
protest, violence, and sign. Both fine-tuned models will be available via
https://github.com/Joshzyj/llvms4protest. We release this short technical
report for social movement scholars who are interested in using LLVMs to infer
protests in textual and imagery data.
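
The matching step that produces the training labels can be illustrated with a minimal sketch. The abstract does not state which fields were used for matching, so the key used here (normalized title plus publication date), the `Article` class, and the toy records are all assumptions made for illustration, not the authors' procedure.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Article:
    """Hypothetical container for one news article."""
    title: str
    published: date
    text: str


def normalize(title: str) -> str:
    """Lowercase and collapse whitespace so formatting differences do not block a match."""
    return " ".join(title.lower().split())


def label_articles(articles, protest_keys):
    """Return (text, label) pairs; label is 1 when the article's key is in the event records."""
    labeled = []
    for article in articles:
        key = (normalize(article.title), article.published)
        labeled.append((article.text, 1 if key in protest_keys else 0))
    return labeled


# Toy records standing in for an event database (hypothetical data).
protest_keys = {(normalize("Students March  on City Hall"), date(1968, 5, 2))}
articles = [
    Article("Students march on City Hall", date(1968, 5, 2), "Hundreds of students ..."),
    Article("Stocks close higher", date(1968, 5, 2), "The market rose ..."),
]
dataset = label_articles(articles, protest_keys)
```

The resulting `(text, label)` pairs have the shape a binary text classifier expects as fine-tuning input; matched articles become positives and unmatched ones negatives.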

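Large language and vision models have changed how social movement scholars identify protests and extract key protest attributes from multi-modal data. This article describes how we fine-tuned large pretrained transformer models (including longformer and swin-transformer v2) to infer potential protests in news articles using text and image data. We trained the longformer model on the Dynamic of Collective Action (DoCA) Corpus for downstream tasks, matching New York Times articles with the DoCA database to obtain the training dataset. We also trained the swin-transformer v2 model on UCLA-protest imagery data. We release this short technical report via https://github.com/Joshzyj/llvms4protest for social movement scholars interested in using LLVMs to infer protests in textual and imagery data.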

LLVMs4Protest: Using Large Language and Vision Models to Interpret Protest Events in the News

LLVMs4Protest: Harnessing the Power of Large Language and Vision Models for Deciphering Protests in the News

Open-set Unsupervised Video Domain Adaptation (OUVDA) deals with the task of
adapting an action recognition model from a labelled source domain to an
unlabelled target domain that contains "target-private" categories, which are
present in the target but absent in the source. In this work, we deviate from
prior work that trains a specialized open-set classifier or relies on weighted
adversarial learning, and instead propose to use a pre-trained language and
vision model (CLIP). CLIP is well suited for OUVDA due to its rich
representation and zero-shot recognition capabilities. However, rejecting
target-private instances with CLIP's zero-shot protocol requires oracle
knowledge of the target-private label names. Since these label names cannot be
known in advance, we propose AutoLabel, which automatically discovers and
generates object-centric compositional candidate target-private class names.
Despite its simplicity, we show that CLIP, when equipped with AutoLabel, can
satisfactorily reject the target-private instances, thereby facilitating better
alignment between the shared classes of the two domains. The code is available.
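
The rejection idea can be sketched as zero-shot classification over an extended label set. The following is a minimal illustration with hand-written toy vectors in place of real CLIP embeddings; the function names, label names, and vectors are assumptions made for illustration and are not taken from the released code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def classify_open_set(video_emb, shared, candidates):
    """Zero-shot prediction over shared plus candidate target-private names.

    If the most similar name is a candidate target-private name, the instance
    is rejected as "unknown"; otherwise the shared class name is returned.
    Assumes the two label sets are disjoint.
    """
    all_names = {**shared, **candidates}
    scores = {name: cosine(video_emb, emb) for name, emb in all_names.items()}
    best = max(scores, key=scores.get)
    return best if best in shared else "unknown"


# Toy text embeddings (hypothetical values standing in for encoder outputs).
shared = {"climb": [1.0, 0.0, 0.0], "run": [0.0, 1.0, 0.0]}
candidates = {"person riding horse": [0.0, 0.0, 1.0]}

known_pred = classify_open_set([0.9, 0.1, 0.0], shared, candidates)
private_pred = classify_open_set([0.1, 0.2, 0.9], shared, candidates)
```

Without the candidate names, the second instance would be forced into one of the shared classes. Adding discovered candidate names gives the zero-shot classifier something to reject to.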

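This work proposes an open-set unsupervised video domain adaptation method based on pre-trained language and vision models, and introduces AutoLabel to discover and generate class names for target-private categories. With the improved CLIP model, target-private categories can be effectively identified, which improves the alignment of the shared classes between the two domains.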

AutoLabel: A CLIP-based Open-set Video Domain Adaptation Framework

AutoLabel: CLIP-based framework for Open-set Video Domain Adaptation

In this paper, we aim to understand whether current language and vision
(LaVi) models truly grasp the interaction between the two modalities. To this
end, we propose an extension of the MSCOCO dataset, FOIL-COCO, which associates
images with both correct and "foil" captions, that is, descriptions of the
image that are highly similar to the original ones, but contain one single
mistake ("foil word"). We show that current LaVi models fall into the traps of
this data and perform poorly on three tasks: a) caption classification (correct
vs. foil); b) foil word detection; c) foil word correction. Humans, in
contrast, achieve near-perfect performance on these tasks. We demonstrate that
merely utilising language cues is not enough to model FOIL-COCO and that it
challenges the state-of-the-art by requiring a fine-grained understanding of
the relation between text and image.
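
To make the data construction concrete, here is a minimal sketch of how a foil caption relates to its original and how the annotations for the three tasks follow from that pair. It shows the structure of the data only, not how a model solves the tasks (a model sees the image and one caption, not both captions). The function names and the example caption are assumptions for illustration, and requiring the target word to occur exactly once is a simplification.

```python
def make_foil(caption: str, target: str, foil: str) -> str:
    """Replace the single occurrence of `target` with `foil`."""
    words = caption.split()
    if words.count(target) != 1:
        raise ValueError("target word must occur exactly once in the caption")
    return " ".join(foil if word == target else word for word in words)


def find_foil_word(original: str, candidate: str):
    """Compare a candidate caption with the original.

    Returns None if the captions are identical (correct caption), otherwise
    a tuple (position, foil_word, original_word) for the single mismatch.
    """
    a, b = original.split(), candidate.split()
    if len(a) != len(b):
        raise ValueError("captions must have the same number of words")
    diffs = [(i, y, x) for i, (x, y) in enumerate(zip(a, b)) if x != y]
    if not diffs:
        return None
    if len(diffs) > 1:
        raise ValueError("a foil caption differs in exactly one word")
    return diffs[0]


original = "a man riding a bike down a street"
foil_caption = make_foil(original, target="bike", foil="motorcycle")

annotation = find_foil_word(original, foil_caption)   # tasks b) and c)
is_foil = annotation is not None                      # task a)
```

Here `is_foil` is the label for caption classification, the position and foil word in `annotation` are the target for foil word detection, and the original word is the target for foil word correction.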

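By proposing the FOIL-COCO dataset and running experiments on it, this paper shows that existing language and vision models fall short in understanding the interaction between the two modalities, and that improvement requires more fine-grained modeling of the relation between text and image.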