This work presents a new resource for borrowing identification and analyzes
the performance and errors of several models on this task. We introduce a new
annotated corpus of Spanish newswire rich in unassimilated lexical borrowings
-- words from one language that are introduced into another without
orthographic adaptation -- and use it to evaluate how several sequence labeling
models (CRF, BiLSTM-CRF, and Transformer-based models) perform. The corpus
contains 370,000 tokens and is larger, more borrowing-dense, OOV-rich, and
topic-varied than previous corpora available for this task. Our results show
that a BiLSTM-CRF model fed with subword embeddings along with either
Transformer-based embeddings pretrained on codeswitched data or a combination
of contextualized word embeddings outperforms results obtained by a
multilingual BERT-based model.
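
Treating borrowing detection as sequence labeling means assigning one tag to each token. The sketch below assumes a BIO encoding and a hypothetical label name `ENG`; the abstract does not specify the tag set, and the sentence, the spans, and the helper `spans_to_bio` are illustrative, not taken from the corpus.

```python
from typing import List, Tuple


def spans_to_bio(n_tokens: int, spans: List[Tuple[int, int]], label: str = "ENG") -> List[str]:
    """Turn half-open token spans [start, end) into one BIO tag per token."""
    tags = ["O"] * n_tokens
    for start, end in spans:
        if not 0 <= start < end <= n_tokens:
            raise ValueError(f"span ({start}, {end}) is outside the sentence")
        tags[start] = f"B-{label}"      # first token of a borrowing
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"      # continuation tokens of the same borrowing
    return tags


# Hypothetical Spanish sentence with two unadapted English borrowings.
tokens = ["Vimos", "un", "reality", "show", "sobre", "running"]
spans = [(2, 4), (5, 6)]  # "reality show" and "running"
tags = spans_to_bio(len(tokens), spans)

for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

A CRF, BiLSTM-CRF, or Transformer-based tagger would then be trained to predict `tags` from `tokens`; multi-word borrowings stay together through the `B-`/`I-` distinction.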

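This work presents a new resource for borrowing identification and analyzes the performance and errors of several models on this task. We introduce a new corpus of Spanish newswire containing 370,000 tokens and use it to evaluate several sequence labeling models (CRF, BiLSTM-CRF, and Transformer-based models). Our results show that a BiLSTM-CRF model fed with subword embeddings, along with either Transformer-based embeddings pretrained on code-switched data or a combination of contextualized word embeddings, outperforms a multilingual BERT-based model.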

Detecting Unassimilated Borrowings in Spanish: An Annotated Corpus and Approaches to Modeling

Successful methods for unsupervised neural machine translation (UNMT) employ
cross-lingual pretraining via self-supervision, often in the form of a masked
language modeling or a sequence generation task, which requires the model to
align the lexical- and high-level representations of the two languages. While
cross-lingual pretraining works for similar languages with abundant corpora, it
performs poorly in low-resource and distant languages. Previous research has
shown that this is because the representations are not sufficiently aligned. In
this paper, we enhance the bilingual masked language model pretraining with
lexical-level information by using type-level cross-lingual subword embeddings.
Empirical results demonstrate improved performance both on UNMT (up to 4.5
BLEU) and bilingual lexicon induction using our method compared to a UNMT
baseline.
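
The abstract does not spell out how the type-level subword embeddings enter pretraining. One plausible reading, assumed here and not confirmed by the text, is that they initialize the embedding layer of the masked language model before training starts. The helper `init_embedding_table`, the toy vocabulary, and the vectors below are all hypothetical.

```python
import random
from typing import Dict, List, Tuple


def init_embedding_table(
    vocab: List[str],
    pretrained: Dict[str, List[float]],
    dim: int,
    seed: int = 0,
) -> Tuple[List[List[float]], int]:
    """Build one embedding row per subword in `vocab`.

    Rows are copied from `pretrained` when available and drawn from a small
    uniform range otherwise. Returns the table and the number of rows that
    were covered by pretrained vectors.
    """
    rng = random.Random(seed)
    table: List[List[float]] = []
    covered = 0
    for subword in vocab:
        vector = pretrained.get(subword)
        if vector is not None:
            if len(vector) != dim:
                raise ValueError(f"vector for {subword!r} has wrong dimension")
            table.append(list(vector))  # copy so the source dict is not aliased
            covered += 1
        else:
            table.append([rng.uniform(-0.1, 0.1) for _ in range(dim)])
    return table, covered


# Toy shared vocabulary for two languages plus a special mask symbol.
vocab = ["the", "el", "ing", "<mask>"]
# Hypothetical cross-lingual subword vectors living in one shared space.
pretrained = {"the": [0.10, 0.20], "el": [0.10, 0.25], "ing": [-0.30, 0.05]}

table, covered = init_embedding_table(vocab, pretrained, dim=2)
print(f"{covered}/{len(vocab)} rows initialized from pretrained vectors")
```

In a real system the resulting table would be loaded into the model's embedding layer; special symbols such as `<mask>` have no pretrained vector and keep their random initialization.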

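This paper presents a bilingual masked language model pretraining method based on subword embeddings, which achieves good performance on both unsupervised neural machine translation and bilingual lexicon induction.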

Improving the Lexical Ability of Pretrained Language Models for Unsupervised Neural Machine Translation

We propose several ways of reusing subword embeddings and other weights in
subword-aware neural language models. The proposed techniques do not benefit a
competitive character-aware model, but some of them improve the performance of
syllable- and morpheme-aware models while showing significant reductions in
model sizes. We discover a simple hands-on principle: in a multi-layer input
embedding model, layers should be tied consecutively bottom-up if reused at
output. Our best morpheme-aware model with properly reused weights beats the
competitive word-level model by a large margin across multiple languages and
has 20%-87% fewer parameters.
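
The bottom-up tying principle can be illustrated with a toy two-layer input embedding model. The sketch assumes a morpheme embedding table as the bottom layer and a linear projection as the second layer; this layout, the sizes, and the names `word_vector`, `tied_logits`, and `param_count` are assumptions made for illustration, not the architecture described in the paper.

```python
from typing import List

Vector = List[float]
Matrix = List[Vector]


def word_vector(morph_ids: List[int], morph_emb: Matrix, proj: Matrix) -> Vector:
    """Layer 1: sum morpheme embeddings. Layer 2: apply a linear projection."""
    summed = [0.0] * len(morph_emb[0])
    for m in morph_ids:
        for j, value in enumerate(morph_emb[m]):
            summed[j] += value
    return [sum(w * x for w, x in zip(row, summed)) for row in proj]


def tied_logits(hidden: Vector, vocab_morphs: List[List[int]],
                morph_emb: Matrix, proj: Matrix) -> List[float]:
    """Score every word by reusing BOTH input layers as the output embedding."""
    return [
        sum(h * o for h, o in zip(hidden, word_vector(ids, morph_emb, proj)))
        for ids in vocab_morphs
    ]


def param_count(layer_sizes: List[int], tied_from_bottom: int) -> int:
    """Total parameters when the lowest `tied_from_bottom` layers are shared.

    Sharing is expressed as a depth counted from the bottom, so a higher
    layer can only be tied if every layer below it is tied as well.
    """
    if not 0 <= tied_from_bottom <= len(layer_sizes):
        raise ValueError("tied_from_bottom is out of range")
    input_side = sum(layer_sizes)
    untied_output_side = sum(layer_sizes[tied_from_bottom:])
    return input_side + untied_output_side


# Toy model: 3 morphemes, dimension 2, identity projection.
morph_emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
proj = [[1.0, 0.0], [0.0, 1.0]]
vocab_morphs = [[0], [1], [0, 1], [2]]  # each word is a list of morpheme ids
logits = tied_logits([2.0, 3.0], vocab_morphs, morph_emb, proj)

# Hypothetical sizes: 5000 morphemes x 200 dims, plus a 200 x 200 projection.
sizes = [5000 * 200, 200 * 200]
for depth in range(len(sizes) + 1):
    print(depth, param_count(sizes, depth))
```

Expressing the tying as a depth from the bottom makes the non-consecutive configurations (for example, sharing the projection but not the embedding table) impossible to specify, which mirrors the stated principle.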

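This work proposes ways of reusing subword embeddings and other weights in subword-aware neural language models. In a multi-layer input embedding model, layers should be tied consecutively bottom-up if they are reused at the output. The best resulting morpheme-aware model outperforms the competitive word-level model across multiple languages while using 20%-87% fewer parameters.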