Paper Title

Adaptive Ranking-based Sample Selection for Weakly Supervised Class-imbalanced Text Classification

Authors

Linxin Song, Jieyu Zhang, Tianxiang Yang, Masayuki Goto

Abstract

To obtain a large amount of training labels inexpensively, researchers have recently adopted the weak supervision (WS) paradigm, which leverages labeling rules to synthesize training labels rather than using individual annotations to achieve competitive results for natural language processing (NLP) tasks. However, data imbalance is often overlooked in applying the WS paradigm, despite being a common issue in a variety of NLP tasks. To address this challenge, we propose Adaptive Ranking-based Sample Selection (ARS2), a model-agnostic framework to alleviate the data imbalance issue in the WS paradigm. Specifically, it calculates a probabilistic margin score based on the output of the current model to measure and rank the cleanliness of each data point. Then, the ranked data are sampled based on both class-wise and rule-aware ranking. In particular, the two sampling strategies correspond to our motivations: (1) to train the model with balanced data batches to reduce the data imbalance issue and (2) to exploit the expertise of each labeling rule for collecting clean samples. Experiments on four text classification datasets with four different imbalance ratios show that ARS2 outperforms state-of-the-art imbalanced learning and WS methods, leading to a 2%-57.8% improvement in F1-score.
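The core scoring step described above — ranking data points by a probabilistic margin computed from the current model's output — can be illustrated with a minimal sketch. This is not the authors' implementation; the function names and the exact margin definition (gap between the two largest class probabilities) are assumptions for illustration:

```python
import numpy as np

def probabilistic_margin_scores(probs):
    """Margin between the two largest class probabilities per sample.

    probs: array of shape (n_samples, n_classes) with model output
    probabilities. A larger margin is taken as a proxy for a
    "cleaner" (more confidently labeled) data point.
    """
    probs = np.asarray(probs, dtype=float)
    sorted_p = np.sort(probs, axis=1)          # ascending per row
    return sorted_p[:, -1] - sorted_p[:, -2]   # top-1 minus top-2

def rank_by_cleanliness(probs):
    """Sample indices sorted from highest margin (cleanest) to lowest."""
    return np.argsort(-probabilistic_margin_scores(probs))
```

Class-wise sampling would then draw equal-sized batches from the top of this ranking within each (weakly assigned) class, and rule-aware sampling would apply the same ranking restricted to the data points each labeling rule fires on.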
