We propose a multilingual adversarial training model for determining whether
a sentence contains an idiomatic expression. Given that a key challenge with
this task is the limited size of annotated data, our model relies on
pre-trained contextual representations from state-of-the-art multilingual
transformer-based language models (i.e., multilingual BERT and XLM-RoBERTa),
and on adversarial training, a training method that further improves model
generalization and robustness. Without relying on any human-crafted features,
knowledge bases, or datasets other than the target datasets, our model achieved
competitive results, ranking 6th in the SubTask A (zero-shot) setting and 15th
in the SubTask A (one-shot) setting.
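
The abstract does not say which adversarial training variant is used, so the following is only a minimal sketch of the general idea, under stated assumptions: a toy logistic-regression classifier stands in for the transformer encoder, and the perturbation is a single L2-normalized gradient step on the input features. The names `adversarial_example`, `train`, `epsilon`, and `adv_weight` are hypothetical and do not come from the paper.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict(w, b, x):
    # Probability that input x belongs to the positive class.
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def adversarial_example(w, b, x, y, epsilon):
    # For logistic regression, the gradient of the cross-entropy loss
    # with respect to the input x is (p - y) * w.
    p = predict(w, b, x)
    grad = [(p - y) * wi for wi in w]
    norm = math.sqrt(sum(g * g for g in grad)) or 1.0
    # Shift x by epsilon along the normalized gradient, i.e. in the
    # direction that increases the loss (assumed L2-bounded perturbation).
    return [xi + epsilon * g / norm for xi, g in zip(x, grad)]


def train(data, dim, epochs=200, lr=0.1, epsilon=0.1, adv_weight=1.0):
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in data:
            x_adv = adversarial_example(w, b, x, y, epsilon)
            # One gradient step on the clean input, then one on the
            # perturbed input (scaled by adv_weight).
            for inp, weight in ((x, 1.0), (x_adv, adv_weight)):
                err = predict(w, b, inp) - y
                w = [wi - lr * weight * err * xi for wi, xi in zip(w, inp)]
                b -= lr * weight * err
    return w, b
```

In a transformer-based setup, the perturbation would more plausibly be applied to the token embeddings than to raw inputs, but the training loop has the same shape: compute a loss-increasing perturbation, then train on both the clean and the perturbed example.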

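This paper proposes a multilingual adversarial training model for determining whether a sentence contains an idiomatic expression. The model combines pre-trained contextual representations from state-of-the-art multilingual transformer-based language models (i.e., multilingual BERT and XLM-RoBERTa) with adversarial training to improve generalization and robustness. Without relying on human-crafted features, knowledge bases, or any datasets other than the target datasets, the model achieved competitive results, ranking 6th in the SubTask A (zero-shot) setting and 15th in the SubTask A (one-shot) setting.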

OCHADAI at SemEval-2022 Task 2: Adversarial Training for Multilingual Idiomaticity Detection

Most modern NLP systems make use of pre-trained contextual representations
that attain astonishingly high performance on a variety of tasks. Such high
performance should not be possible unless some form of linguistic structure
inheres in these representations, and a wealth of research has sprung up on
probing for it. In this paper, we draw a distinction between intrinsic probing,
which examines how linguistic information is structured within a
representation, and the extrinsic probing popular in prior work, which only
argues for the presence of such information by showing that it can be
successfully extracted. To enable intrinsic probing, we propose a novel
framework based on a decomposable multivariate Gaussian probe that allows us to
determine whether the linguistic information in word embeddings is dispersed or
focal. We then probe fastText and BERT for various morphosyntactic attributes
across 36 languages. We find that most attributes are reliably encoded by only
a few neurons, with fastText concentrating its linguistic structure more than
BERT.
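
To illustrate why a Gaussian probe is "decomposable", here is a minimal sketch, assuming NumPy and a simple maximum-likelihood fit with a small ridge term (the abstract does not specify the estimation procedure or the selection criterion, so both are assumptions). The key property is that the marginal of a multivariate Gaussian over a subset of dimensions is obtained by selecting the matching entries of the mean and covariance, so a probe over any subset of neurons needs no refitting. The function names and the greedy, accuracy-based selection are hypothetical.

```python
import numpy as np


def fit_class_gaussians(X, y, ridge=1e-3):
    # Fit one full-dimensional Gaussian per attribute value
    # (maximum likelihood plus a small ridge for stability; an assumption).
    params = {}
    for label in np.unique(y):
        Xc = X[y == label]
        mean = Xc.mean(axis=0)
        cov = np.cov(Xc, rowvar=False) + ridge * np.eye(X.shape[1])
        params[label] = (mean, cov, len(Xc) / len(X))
    return params


def log_joint(params, X, dims):
    # Marginalizing a Gaussian onto `dims` only requires the sub-mean
    # and sub-covariance, so the full fit is reused for every subset.
    dims = list(dims)
    labels = sorted(params)
    scores = np.empty((len(X), len(labels)))
    for j, label in enumerate(labels):
        mean, cov, prior = params[label]
        diff = X[:, dims] - mean[dims]
        sub = cov[np.ix_(dims, dims)]
        _, logdet = np.linalg.slogdet(sub)
        maha = np.einsum("ni,ij,nj->n", diff, np.linalg.inv(sub), diff)
        log_norm = logdet + len(dims) * np.log(2 * np.pi)
        scores[:, j] = np.log(prior) - 0.5 * (maha + log_norm)
    return labels, scores


def accuracy(params, X, y, dims):
    # Classify each row by its highest-scoring attribute value.
    labels, scores = log_joint(params, X, dims)
    pred = np.array(labels)[scores.argmax(axis=1)]
    return float((pred == y).mean())


def greedy_select(params, X, y, k):
    # Greedily add the dimension that most improves probe accuracy.
    chosen = []
    for _ in range(k):
        rest = [d for d in range(X.shape[1]) if d not in chosen]
        best = max(rest, key=lambda d: accuracy(params, X, y, chosen + [d]))
        chosen.append(best)
    return chosen
```

If a handful of selected dimensions already recovers most of the accuracy of the full representation, the information is focal in the sense used above; if accuracy rises slowly as dimensions are added, it is dispersed.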

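This paper discusses the shortcomings of earlier methods for probing linguistic structure in natural language processing systems and proposes an intrinsic probing framework, based on a multivariate Gaussian probe, for examining the linguistic information in word embeddings. Experiments on 36 languages show that most morphosyntactic attributes are reliably encoded by only a few neurons, and that fastText concentrates its linguistic structure more than BERT.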