This article offers an empirical study of the different ways of encoding Chinese, Japanese, Korean (CJK) and English text for text classification. Different encoding levels are studied, including UTF-8 bytes, characters, words, romanized characters and romanized words. For all encoding levels, whenever applicable, we provide comparisons with linear models, fastText and convolutional networks. For convolutional networks, we compare encoding mechanisms using character glyph images, one-hot (or one-of-n) encoding, and embedding. In total, we evaluate 473 models on 14 large-scale text classification datasets in 4 languages: Chinese, English, Japanese and Korean. Conclusions from these results include that byte-level one-hot encoding based on UTF-8 consistently produces competitive results for convolutional networks, that word-level n-gram linear models are competitive even without perfect word segmentation, and that fastText provides the best result using character-level n-gram encoding but can overfit when the features are overly rich.
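To make the encoding levels named above concrete, the following is a minimal illustrative sketch of UTF-8 byte extraction, byte-level one-hot encoding, and character-level n-grams. It is not the authors' implementation: the function names, the fixed length `max_len`, and the truncate-and-zero-pad policy are assumptions chosen only for illustration.

```python
def utf8_bytes(text):
    """Return the UTF-8 byte values (integers 0-255) of a string."""
    return list(text.encode("utf-8"))


def one_hot_bytes(text, max_len=8):
    """One-hot encode UTF-8 bytes as a max_len x 256 matrix.

    Assumption for this sketch: longer inputs are truncated and
    shorter inputs are padded with all-zero rows.
    """
    rows = []
    for b in utf8_bytes(text)[:max_len]:
        row = [0] * 256
        row[b] = 1  # exactly one active position per byte
        rows.append(row)
    while len(rows) < max_len:
        rows.append([0] * 256)  # zero padding
    return rows


def char_ngrams(text, n):
    """Return all contiguous character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


if __name__ == "__main__":
    print(utf8_bytes("中"))          # one CJK character, three bytes
    print(char_ngrams("文本分类", 2))  # character-level bigrams
```

Because a typical CJK character occupies three bytes in UTF-8, a byte-level model sees a longer sequence than a character-level model for the same text, while its input vocabulary stays fixed at 256 values regardless of language.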

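This paper studies different encodings for text classification in Chinese, Japanese, Korean and English, including UTF-8 bytes, characters, words, romanized characters and romanized words. It compares linear models, fastText and convolutional networks, and for convolutional networks it examines encoding mechanisms including character glyph images, one-hot encoding and embedding. In total, 473 models are used, together with 14 large-scale text classification datasets covering the four languages Chinese, English, Japanese and Korean. The results show that byte-level one-hot encoding based on UTF-8 performs consistently well, that word-level n-gram linear models perform well even without perfect word segmentation, and that fastText gives the best results but tends to overfit when the features are overly rich.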

Which encoding is the best for text classification in Chinese, English, Japanese and Korean?