Food image segmentation is an important task that has ubiquitous
applications, such as estimating the nutritional value of a plate of food.
Although machine learning models have been used for segmentation in this
domain, food images pose several challenges. One challenge is that food items
can overlap and mix, making them difficult to distinguish. Another challenge is
the degree of inter-class similarity and intra-class variability, which is
caused by the varying preparation methods and dishes a food item may be served
in. Additionally, class imbalance is an inevitable issue in food datasets. To
address these issues, two models are trained and compared, one based on
convolutional neural networks and the other on Bidirectional Encoder
representation from Image Transformers (BEiT). The models are trained and
evaluated using the FoodSeg103 dataset, which is identified as a robust
benchmark for food image segmentation. The BEiT model outperforms the previous
state-of-the-art model by achieving a mean intersection over union of 49.4 on
FoodSeg103. This study provides insights into transferring knowledge using
convolution and Transformer-based approaches in the food image domain.
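
The mean intersection over union (mIoU) quoted above can be illustrated with a minimal sketch. It assumes per-pixel integer class labels and averages only over classes that appear in the prediction or the ground truth; the function name `mean_iou` is hypothetical, and the exact evaluation protocol used on FoodSeg103 (for example, accumulating counts over the whole dataset or handling the background class) is not specified here.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes present in pred or target.

    pred and target are equally sized 2-D lists of integer class labels.
    This is an illustrative per-image version, not a benchmark protocol.
    """
    intersections = [0] * num_classes
    unions = [0] * num_classes
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p == t:
                # Pixel belongs to both the predicted and true region of class p.
                intersections[p] += 1
                unions[p] += 1
            else:
                # Pixel extends the union of the predicted class and the true class.
                unions[p] += 1
                unions[t] += 1
    # Skip classes that occur in neither mask to avoid dividing by zero.
    ious = [i / u for i, u in zip(intersections, unions) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0
```

For a toy 2x2 prediction `[[0, 0], [1, 1]]` against ground truth `[[0, 1], [1, 1]]`, class 0 has an IoU of 1/2 and class 1 an IoU of 2/3, so the mean is 7/12.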

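This paper examines the difficulties of food image segmentation, identifies FoodSeg103 as a robust benchmark dataset, compares a convolutional neural network with Bidirectional Encoder representation from Image Transformers (BEiT), and shows that BEiT outperforms the other models on food image segmentation, indicating that transfer learning can improve segmentation performance.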

Knowledge Transfer for Food Image Segmentation Using Transformers and Convolutions

Transferring Knowledge for Food Image Segmentation using Transformers and Convolutions

This research is the second phase in a series of investigations on developing
Optical Character Recognition (OCR) for Arabic historical documents and
examining how different modeling procedures interact with the problem. The
first study examined the effect of Transformers on our custom-built Arabic
dataset. One downside of the first study was the size of the training data,
a mere 15,000 images out of our 30 million, due to a lack of resources. We
also add an image enhancement layer, time and space optimization, and a
Post-Correction layer to aid the model in predicting the correct word for the
correct context. Notably, we propose an end-to-end text recognition approach
using a Vision Transformer, namely BEiT, as the encoder and a vanilla
Transformer as the decoder, eliminating CNNs for feature extraction and
reducing the model's complexity. The experiments show that our end-to-end model
outperforms convolutional backbones. The model attained a character error rate
(CER) of 4.46%.
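
The CER figure above is conventionally the character-level edit distance between the recognized text and the reference, divided by the reference length. The sketch below assumes that definition; the helper names are hypothetical, and any normalization of Arabic text before scoring (for instance, how diacritics are treated) is not specified by the abstract.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two strings, computed row by row."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # delete a reference character
                current[j - 1] + 1,      # insert a hypothesis character
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance divided by the number of reference characters."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Python strings are sequences of Unicode code points, so in this sketch an Arabic diacritic stored as a separate combining mark counts as its own character.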

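This paper presents research on optical character recognition for Arabic historical documents, proposes an end-to-end text recognition method that uses BEiT as the encoder, and shows experimentally that it outperforms approaches relying on convolutional feature extraction, reaching a character error rate of 4.46%.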

An End-to-End Transformer-Based OCR Framework for Arabic Handwriting Recognition, with a Large-Scale Multi-Font Corpus of Classical Arabic with Diacritics

An End-to-End OCR Framework for Robust Arabic-Handwriting Recognition using a Novel Transformers-based Model and an Innovative 270 Million-Words Multi-Font Corpus of Classical Arabic with Diacritics

We introduce Corrupted Image Modeling (CIM) for self-supervised visual
pre-training. CIM uses an auxiliary generator with a small trainable BEiT to
corrupt the input image instead of using artificial [MASK] tokens, where some
patches are randomly selected and replaced with plausible alternatives sampled
from the BEiT output distribution. Given this corrupted image, an enhancer
network learns to either recover all the original image pixels, or predict
whether each visual token is replaced by a generator sample or not. The
generator and the enhancer are simultaneously trained and synergistically
updated. After pre-training, the enhancer can be used as a high-capacity visual
encoder for downstream tasks. CIM is a general and flexible visual pre-training
framework that is suitable for various network architectures. For the first
time, CIM demonstrates that both ViT and CNN can learn rich visual
representations using a unified, non-Siamese framework. Experimental results
show that our approach achieves compelling results in vision benchmarks, such
as ImageNet classification and ADE20K semantic segmentation.
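
The corruption step and the replaced-or-not target described above can be sketched at the level of visual tokens. This is a simplified illustration under stated assumptions, not the authors' implementation: `corrupt_tokens` and its parameters are hypothetical, `generator_probs` stands in for the small BEiT generator's output distribution, the step that turns sampled tokens back into image pixels is omitted, and a sample that happens to equal the original token is labeled as not replaced.

```python
import random


def corrupt_tokens(tokens, generator_probs, replace_ratio, rng):
    """Replace a random subset of tokens with samples from generator_probs.

    tokens: list of integer visual-token ids, one per patch.
    generator_probs: for each position, a list of weights over the vocabulary
        (a stand-in for a learned generator's output distribution).
    Returns (corrupted_tokens, replaced_labels).
    """
    num_replace = round(len(tokens) * replace_ratio)
    positions = rng.sample(range(len(tokens)), num_replace)
    corrupted = list(tokens)
    for pos in positions:
        vocab = range(len(generator_probs[pos]))
        corrupted[pos] = rng.choices(vocab, weights=generator_probs[pos], k=1)[0]
    # Assumption: a sample identical to the original counts as "not replaced".
    labels = [int(c != t) for c, t in zip(corrupted, tokens)]
    return corrupted, labels
```

An enhancer trained with the discriminative objective would take the corrupted input and predict `labels` position by position; with the generative objective it would instead regress the original pixels.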

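This paper introduces Corrupted Image Modeling (CIM) for self-supervised visual pre-training. Instead of using artificial [MASK] tokens, CIM corrupts the input image with an auxiliary generator built on a small trainable BEiT; after pre-training, the enhancer can serve as a high-capacity visual encoder for downstream tasks. CIM is a general and flexible visual pre-training framework suited to various network architectures. It shows for the first time that both ViT and CNN can learn rich visual representations in a unified, non-Siamese framework, and it achieves compelling results on image classification and semantic segmentation.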

Corrupted Image Modeling for Self-Supervised Visual Pre-Training

Corrupted Image Modeling for Self-Supervised Visual Pre-Training

We introduce a self-supervised vision representation model BEiT, which stands
for Bidirectional Encoder representation from Image Transformers. Following
BERT developed in the natural language processing area, we propose a masked
image modeling task to pretrain vision Transformers. Specifically, each image
has two views in our pre-training, i.e., image patches (such as 16x16 pixels),
and visual tokens (i.e., discrete tokens). We first "tokenize" the original
image into visual tokens. Then we randomly mask some image patches and feed them
into the backbone Transformer. The pre-training objective is to recover the
original visual tokens based on the corrupted image patches. After pre-training
BEiT, we directly fine-tune the model parameters on downstream tasks by
appending task layers upon the pretrained encoder. Experimental results on
image classification and semantic segmentation show that our model achieves
competitive results with previous pre-training methods. For example, base-size
BEiT achieves 83.2% top-1 accuracy on ImageNet-1K, significantly outperforming
from-scratch DeiT training (81.8%) with the same setup. Moreover, large-size
BEiT obtains 86.3% only using ImageNet-1K, even outperforming ViT-L with
supervised pre-training on ImageNet-22K (85.2%). The code and pretrained models
are available at this https URL
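
The two views and the masking step described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, masking is uniform over patches, the mask ratio is an illustrative parameter rather than a value taken from the abstract, and the visual tokens are assumed to come from a separately trained image tokenizer that is not shown.

```python
import random


def patch_grid(height, width, patch_size=16):
    """Number of patch rows and columns for an image split into square patches."""
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be a multiple of patch_size")
    return height // patch_size, width // patch_size


def random_patch_mask(num_patches, mask_ratio, rng):
    """Boolean mask with int(num_patches * mask_ratio) randomly chosen True entries."""
    num_masked = int(num_patches * mask_ratio)
    masked = set(rng.sample(range(num_patches), num_masked))
    return [i in masked for i in range(num_patches)]


def masked_targets(visual_tokens, mask):
    """(position, token) pairs the model is asked to recover at masked patches."""
    return [(i, tok) for i, (tok, m) in enumerate(zip(visual_tokens, mask)) if m]
```

For a 224x224 image with 16x16 patches, `patch_grid` gives a 14x14 grid, so the mask and the visual-token sequence each have 196 entries.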

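This study introduces BEiT (Bidirectional Encoder representation from Image Transformers), a self-supervised vision representation model pre-trained with masked image modeling, which achieves strong results.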