Tokenizing raw texts into word units is an essential pre-processing step for
critical tasks in the NLP pipeline such as tagging, parsing, named entity
recognition, and more. For most languages, this tokenization step
straightforward. However, for languages with high token-internal comp