Paper title
Tokenization Repair in the Presence of Spelling Errors
Paper authors
Paper abstract
We consider the following tokenization repair problem: given a natural language text with any combination of missing or spurious spaces, correct these. Spelling errors can be present, but correcting them is not part of the problem. For example, given "Tispa per isabout token izaionrep air", compute "Tis paper is about tokenizaion repair". We identify three key ingredients of high-quality tokenization repair, all missing from previous work: deep language models with a bidirectional component, training the models on text with spelling errors, and making use of the space information already present. Our methods also improve existing spell checkers by fixing not only more tokenization errors but also more spelling errors: once it is clear which characters form a word, it is much easier for them to figure out the correct word. We provide six benchmarks that cover three use cases (OCR errors, text extraction from PDF, human errors) and the cases of partially correct space information and all spaces missing. We evaluate our methods against the best existing methods and a non-trivial baseline. We provide full reproducibility under https://ad.cs.uni-freiburg.de/publications.
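To make the task concrete, the following is a minimal sketch of a naive dictionary-based baseline for the "all spaces missing" variant of the problem: it discards the existing spaces and re-segments the character stream with dynamic programming, maximizing the number of characters covered by vocabulary words. This is an illustrative assumption, not the paper's method (which uses deep language models and exploits existing spaces); the function name, the toy vocabulary, and the coverage score are all hypothetical. Note that, unlike the paper's approach, this baseline is not robust to spelling errors: misspelled words simply survive as uncovered character runs.

```python
def repair_tokenization(text: str, vocab: set, max_len: int = 20) -> str:
    """Toy tokenization repair: discard all existing spaces, then
    re-segment the character stream with dynamic programming.

    score[i] = most characters of chars[:i] covered by vocabulary
    words; any character may also stand alone (uncovered), so
    out-of-vocabulary material survives. Adjacent uncovered
    characters are merged back into a single token at the end.
    Hypothetical baseline, not the method from the paper.
    """
    chars = text.replace(" ", "")
    n = len(chars)
    score = [float("-inf")] * (n + 1)
    back = [None] * (n + 1)          # (start_index, came_from_vocab_word)
    score[0] = 0.0
    for i in range(1, n + 1):
        # candidate 1: a vocabulary word ends at position i
        for j in range(max(0, i - max_len), i):
            w = chars[j:i]
            if w in vocab and score[j] + len(w) > score[i]:
                score[i], back[i] = score[j] + len(w), (j, True)
        # candidate 2: chars[i-1] stays an uncovered single character
        if score[i - 1] > score[i]:
            score[i], back[i] = score[i - 1], (i - 1, False)
    # walk the backpointers, merging runs of uncovered characters
    tokens, i = [], n
    while i > 0:
        j, is_word = back[i]
        piece = chars[j:i]
        if not is_word and tokens and not tokens[0][1]:
            tokens[0] = (piece + tokens[0][0], False)  # extend uncovered run
        else:
            tokens.insert(0, (piece, is_word))
        i = j
    return " ".join(t for t, _ in tokens)
```

For example, with the toy vocabulary `{"this", "paper", "is", "about", "repair"}`, the input `"thisp aper isab out repair"` comes back as `"this paper is about repair"`, while an out-of-vocabulary word such as `"qq"` is passed through unchanged. The weakness motivating the paper is visible immediately: a misspelling like `"tis"` is not in the vocabulary, so a pure dictionary baseline cannot segment it reliably.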