Paper title
A new data augmentation method for intent classification enhancement and its application on spoken conversation datasets
Paper authors
Paper abstract
Intent classifiers are vital to the successful operation of virtual agent systems. This is especially true in voice-activated systems, where the data can be noisy and user intents are often ambiguous. Before operation begins, these classifiers generally lack real-world training data. Active learning is a common approach to labeling large amounts of collected user input, but it requires many hours of manual labeling work. We present the Nearest Neighbors Scores Improvement (NNSI) algorithm for automatic data selection and labeling. NNSI reduces the need for manual labeling by automatically selecting highly ambiguous samples and labeling them with high accuracy. It does this by integrating the classifier's output over a group of semantically similar text samples. The labeled samples can then be added to the training set to improve the accuracy of the classifier. We demonstrated the use of NNSI on two large-scale, real-life voice conversation systems. Evaluation of our results showed that our method was able to select and label useful samples with high accuracy. Adding these new samples to the training data significantly improved the classifiers and reduced error rates by up to 10%.
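The abstract describes the selection-and-labeling idea only at a high level, so the following is a minimal sketch of one way it might look, not the authors' implementation. It assumes that each utterance already has a sentence embedding and a vector of classifier scores, that "semantically similar" means the highest cosine similarity, and that neighbor scores are combined by a plain average. The function name `nnsi_sketch`, the parameters `k`, `ambiguous_below`, and `accept_above`, and the toy data are all hypothetical.

```python
import numpy as np


def nnsi_sketch(embeddings, probs, k=3, ambiguous_below=0.6, accept_above=0.7):
    """Label ambiguous samples from the averaged scores of their neighbors.

    embeddings: (n, d) sentence embeddings (assumed to be precomputed).
    probs:      (n, c) classifier scores per sample.
    Returns {sample_index: label} for samples that were ambiguous on their
    own but confident after averaging with their k nearest neighbors.
    """
    embeddings = np.asarray(embeddings, dtype=float)
    probs = np.asarray(probs, dtype=float)

    # Cosine similarity between all pairs of samples.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)  # a sample is not its own neighbor

    labels = {}
    # Assumption: "ambiguous" means the top classifier score is low.
    for i in np.flatnonzero(probs.max(axis=1) < ambiguous_below):
        neighbors = np.argsort(sim[i])[-k:]  # k most similar samples
        group = np.vstack([probs[i:i + 1], probs[neighbors]])
        mean_scores = group.mean(axis=0)  # assumption: simple average
        if mean_scores.max() >= accept_above:
            labels[int(i)] = int(mean_scores.argmax())
    return labels


# Toy data: two clusters of utterances, two intents (all values invented).
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.95, 0.05],
                  [0.0, 1.0], [0.1, 0.9], [0.05, 0.95]]
toy_probs = [[0.55, 0.45], [0.9, 0.1], [0.95, 0.05],
             [0.1, 0.9], [0.05, 0.95], [0.5, 0.5]]
print(nnsi_sketch(toy_embeddings, toy_probs, k=2))
```

In this toy run, samples 0 and 5 are ambiguous on their own, and each picks up the intent that its two nearest neighbors agree on. In a real pipeline the newly labeled samples would then be added to the training set and the classifier retrained, as the abstract describes.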