Paper Title
Improving Sequence Tagging for Vietnamese Text Using Transformer-based Neural Models
Paper Authors
Paper Abstract
This paper describes our study on using multilingual BERT embeddings and several new neural models to improve sequence tagging tasks for the Vietnamese language. We propose new model architectures and evaluate them extensively on two named entity recognition datasets, VLSP 2016 and VLSP 2018, and on two part-of-speech tagging datasets, VLSP 2010 and VLSP 2013. Our proposed models outperform existing methods and achieve new state-of-the-art results. In particular, we push the accuracy of part-of-speech tagging to 95.40% on the VLSP 2010 corpus and to 96.77% on the VLSP 2013 corpus, and the F1 score of named entity recognition to 94.07% on the VLSP 2016 corpus and to 90.31% on the VLSP 2018 corpus. Our code and pre-trained models viBERT and vELECTRA are released as open source to facilitate adoption and further research.
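As an illustration of the setting the abstract describes, here is a minimal sketch of how a released pre-trained Vietnamese model could be applied to a token-classification (sequence tagging) task with the HuggingFace transformers library. This is not the authors' code: the model identifier, label set, and example sentence are assumptions, and the classification head below is randomly initialized until fine-tuned on a tagging corpus.

```python
# Minimal sequence-tagging sketch (assumptions: model id, BIO label set,
# example sentence). The token-classification head is untrained here;
# in practice it would be fine-tuned on a corpus such as VLSP 2016.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

MODEL_NAME = "FPTAI/vibert-base-cased"  # assumed identifier for viBERT

# Illustrative BIO tag set over common VLSP entity types.
labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG"]

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForTokenClassification.from_pretrained(
    MODEL_NAME, num_labels=len(labels)
)

sentence = "Ông Nguyễn Văn A sống ở Hà Nội ."
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, seq_len, num_labels)

# Assign each subword token its highest-scoring tag.
predictions = logits.argmax(dim=-1).squeeze(0)
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"].squeeze(0))
for token, pred in zip(tokens, predictions):
    print(token, labels[pred])
```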