We highlight several issues in the evaluation of historical text
normalization systems that make it hard to tell how well these systems would
actually work in practice---i.e., for new datasets or languages; in comparison
to more na\"ive systems; or as a preprocessing step for downstream NLP t