Paper Title
RethinkCWS: Is Chinese Word Segmentation a Solved Task?
Paper Authors
Paper Abstract
The performance of Chinese Word Segmentation (CWS) systems has gradually reached a plateau with the rapid development of deep neural networks, especially the successful use of large pre-trained models. In this paper, we take stock of what we have achieved and rethink what is left in the CWS task. Methodologically, we propose a fine-grained evaluation of existing CWS systems, which not only allows us to diagnose the strengths and weaknesses of existing models (under the in-dataset setting), but also enables us to quantify the discrepancy between different criteria and alleviate the negative transfer problem in multi-criteria learning. Strategically, although we do not propose a novel model in this paper, our comprehensive experiments on eight models and seven datasets, together with a thorough analysis, point to some promising directions for future research. We make all code publicly available and release an interface that can quickly evaluate and diagnose users' models: https://github.com/neulab/InterpretEval.
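The fine-grained evaluation mentioned in the abstract groups test cases into buckets by interpretable attributes and reports per-bucket scores, so a model's weaknesses (e.g., on long words) become visible instead of being averaged into a single F1. The following is a minimal sketch of that idea, not the paper's actual implementation: the function name, the span-tuple word representation, and the choice of word length as the bucketing attribute are assumptions for illustration.

```python
from collections import defaultdict

def bucketed_recall(gold_words, pred_words):
    """Per-bucket recall of gold words, bucketed by an attribute
    (here: word length).

    gold_words / pred_words: one list of (start, end) character-span
    tuples per sentence. A gold word counts as recovered only if the
    identical span appears in the prediction.
    Returns {bucket: recall}.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for gold, pred in zip(gold_words, pred_words):
        pred_set = set(pred)
        for start, end in gold:
            bucket = end - start  # attribute: gold word length
            totals[bucket] += 1
            if (start, end) in pred_set:
                hits[bucket] += 1
    return {b: hits[b] / totals[b] for b in totals}

# Toy example: one sentence whose gold segmentation has a length-1
# word and a length-2 word; the prediction splits the length-2 word.
gold = [[(0, 1), (1, 3)]]
pred = [[(0, 1), (1, 2), (2, 3)]]
print(bucketed_recall(gold, pred))  # {1: 1.0, 2: 0.0}
```

A real diagnostic evaluation would use more attributes (word frequency in the training set, OOV status, sentence length) and precision as well as recall, but the bucketing mechanism is the same.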