Error-Robust Retrieval for Chinese Spelling Check

Paper Title

Error-Robust Retrieval for Chinese Spelling Check

Authors

Xunjian Yin, Xinyu Hu, Jin Jiang, Xiaojun Wan

Abstract

Chinese Spelling Check (CSC) aims to detect and correct error tokens in Chinese contexts, which has a wide range of applications. However, it is confronted with the challenges of insufficient annotated data and the issue that previous methods may actually not fully leverage the existing datasets. In this paper, we introduce our plug-and-play retrieval method with error-robust information for Chinese Spelling Check (RERIC), which can be directly applied to existing CSC models. The datastore for retrieval is built completely based on the training data, with elaborate designs according to the characteristics of CSC. Specifically, we employ multimodal representations that fuse phonetic, morphologic, and contextual information in the calculation of query and key during retrieval to enhance robustness against potential errors. Furthermore, in order to better judge the retrieved candidates, the n-gram surrounding the token to be checked is regarded as the value and utilized for specific reranking. The experiment results on the SIGHAN benchmarks demonstrate that our proposed method achieves substantial improvements over existing work.
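The retrieval idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the names `Entry`, `fuse`, `ngram_overlap`, and `retrieve`, the use of concatenation for fusion, cosine similarity for lookup, and the weight `alpha` are all assumptions made for illustration. It only shows the general shape: keys built from phonetic, glyph, and contextual features, values that carry the surrounding n-gram, and a reranking step that compares that n-gram with the input.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Entry:
    key: List[float]  # fused representation of one training token (assumed form)
    target: str       # correct character at that position in the training data
    ngram: str        # n-gram surrounding the token, stored as the value


def fuse(phonetic: List[float], glyph: List[float], context: List[float]) -> List[float]:
    """Toy fusion: concatenate the three feature vectors (an assumption)."""
    return list(phonetic) + list(glyph) + list(context)


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def ngram_overlap(a: str, b: str) -> float:
    """Fraction of aligned positions at which two n-grams share a character."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(1 for x, y in zip(a, b) if x == y) / n


def retrieve(query: List[float], query_ngram: str, datastore: List[Entry],
             k: int = 2, alpha: float = 0.5) -> List[Tuple[str, float]]:
    """Take the k nearest keys, then rerank them using n-gram overlap."""
    nearest = sorted(datastore, key=lambda e: cosine(query, e.key), reverse=True)[:k]
    scored = [
        (e.target,
         alpha * cosine(query, e.key) + (1 - alpha) * ngram_overlap(query_ngram, e.ngram))
        for e in nearest
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical datastore with hand-made 2-dimensional features per modality.
datastore = [
    Entry(fuse([1, 0], [1, 0], [1, 0]), "天", "今天好"),
    Entry(fuse([1, 0], [0, 1], [0, 1]), "添", "要添加"),
    Entry(fuse([0, 1], [0, 1], [0, 1]), "地", "土地上"),
]

# Query for the middle character of the misspelled input "今田好".
candidates = retrieve(fuse([1, 0], [1, 0], [0.9, 0.1]), "今田好", datastore)
print(candidates)
```

In this toy setup the candidate whose key is closest to the query and whose stored n-gram best matches the input context ranks first. A real system would take the features from trained encoders and would combine the retrieved candidates with the predictions of an existing CSC model.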
