Paper Title
An Empirical Study of Tokenization Strategies for Various Korean NLP Tasks
Paper Authors
Paper Abstract
Typically, tokenization is the very first step in most text processing works. As a token serves as an atomic unit that embeds the contextual information of text, how to define a token plays a decisive role in the performance of a model. Even though Byte Pair Encoding (BPE) has been considered the de facto standard tokenization method due to its simplicity and universality, it still remains unclear whether BPE works best across all languages and tasks. In this paper, we test several tokenization strategies in order to answer our primary research question, that is, "What is the best tokenization strategy for Korean NLP tasks?" Experimental results demonstrate that a hybrid approach of morphological segmentation followed by BPE works best in Korean to/from English machine translation and natural language understanding tasks such as KorNLI, KorSTS, NSMC, and PAWS-X. As an exception, for KorQuAD, the Korean extension of SQuAD, BPE segmentation turns out to be the most effective.
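The hybrid strategy highlighted in the abstract (morphological segmentation followed by BPE) can be illustrated with a short sketch. The snippet below is a minimal, hypothetical pipeline, not code from the paper: it assumes a Korean morphological analyzer is available through the mecab-python3 bindings with a Korean dictionary (e.g., mecab-ko-dic) and that a sentencepiece BPE model has already been trained on morpheme-segmented text; the model file name is a placeholder.

```python
# Sketch of "morphological segmentation followed by BPE" tokenization.
# Assumes: mecab-python3 installed with a Korean dictionary, and a
# sentencepiece BPE model trained on morpheme-segmented Korean text.
import MeCab
import sentencepiece as spm

# Step 1: morphological segmentation. "-Owakati" makes MeCab emit the
# morpheme sequence as space-separated tokens.
tagger = MeCab.Tagger("-Owakati")

def morph_segment(sentence: str) -> str:
    return tagger.parse(sentence).strip()

# Step 2: subword (BPE) segmentation on top of the morpheme sequence.
# "bpe_on_morphemes.model" is a placeholder file name.
sp = spm.SentencePieceProcessor(model_file="bpe_on_morphemes.model")

def hybrid_tokenize(sentence: str) -> list[str]:
    return sp.encode(morph_segment(sentence), out_type=str)

if __name__ == "__main__":
    print(hybrid_tokenize("토큰화는 텍스트 처리의 첫 단계이다."))
```

In this arrangement BPE never merges across morpheme boundaries, which is one plausible reading of why the hybrid scheme helps on the agglutinative Korean inputs; pure BPE, by contrast, operates directly on the raw sentence.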