Adaptive Rule Discovery for Labeling Text Data

Paper Title

Adaptive Rule Discovery for Labeling Text Data

Authors

Sainyam Galhotra, Behzad Golshan, Wang-Chiew Tan

Abstract

Creating and collecting labeled data is one of the major bottlenecks in machine learning pipelines, and the emergence of automated feature generation techniques such as deep learning, which typically require a lot of training data, has further exacerbated the problem. While weak-supervision techniques have circumvented this bottleneck, existing frameworks either require users to write a set of diverse, high-quality rules to label data (e.g., Snorkel), or require a labeled subset of the data to automatically mine rules (e.g., Snuba). The process of manually writing rules can be tedious and time-consuming. At the same time, creating a labeled subset of the data can be costly and even infeasible in imbalanced settings, because a random sample in such settings often contains only a few positive instances. To address these shortcomings, we present Darwin, an interactive system designed to alleviate the task of writing rules for labeling text data in weakly-supervised settings. Given an initial labeling rule, Darwin automatically generates a set of candidate rules for the labeling task at hand, and utilizes the annotator's feedback to adapt the candidate rules. We describe how Darwin is scalable and versatile. It can operate over large text corpora (i.e., more than 1 million sentences) and supports a wide range of labeling functions (i.e., any function that can be specified using a context-free grammar). Finally, we demonstrate with a suite of experiments over five real-world datasets that Darwin enables annotators to generate weakly-supervised labels efficiently and at a low cost. In fact, our experiments show that rules discovered by Darwin identify on average 40% more positive instances than Snuba, even when Snuba is provided with 1000 labeled instances.
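To make the interaction described in the abstract concrete, the following is a toy sketch of a seed-rule-plus-feedback loop. It is not Darwin's algorithm: the abstract does not specify how candidates are generated or ranked, so everything here is an assumption for illustration. In particular, a rule is reduced to a single keyword (Darwin supports any rule expressible in a context-free grammar), candidates are ranked by how often they co-occur with sentences already labeled positive, and the names `discover_rules`, `matches`, `annotator`, and `STOPWORDS` are hypothetical.

```python
from collections import Counter
from typing import Callable, List, Set, Tuple

# Hypothetical stopword list; a real system would need a better candidate filter.
STOPWORDS = {"to", "the", "a", "is"}


def matches(rule: str, sentence: str) -> bool:
    """Toy rule semantics: the keyword must appear as a token in the sentence."""
    return rule in sentence.lower().split()


def discover_rules(
    corpus: List[str],
    seed_rule: str,
    annotator: Callable[[str, List[str]], bool],
    budget: int = 5,
) -> Tuple[Set[str], Set[int]]:
    """Grow a rule set from one seed rule using yes/no annotator feedback."""
    accepted = {seed_rule}
    rejected: Set[str] = set()
    positives = {i for i, s in enumerate(corpus) if matches(seed_rule, s)}

    for _ in range(budget):
        # Candidate rules: unseen tokens that co-occur with current positives.
        counts: Counter = Counter()
        for i in positives:
            for tok in set(corpus[i].lower().split()):
                if tok not in accepted | rejected | STOPWORDS:
                    counts[tok] += 1
        if not counts:
            break
        # Ask about the most frequent candidate; ties are broken alphabetically.
        candidate = min(counts, key=lambda t: (-counts[t], t))
        examples = [s for s in corpus if matches(candidate, s)][:3]
        if annotator(candidate, examples):
            accepted.add(candidate)
            positives |= {i for i, s in enumerate(corpus) if matches(candidate, s)}
        else:
            rejected.add(candidate)
    return accepted, positives


# Usage: a simulated annotator who only approves transport-related keywords.
corpus = [
    "airport shuttle schedule",
    "shuttle to the airport",
    "shuttle pickup time",
    "breakfast hours",
    "bus pickup location",
]
accepted, positives = discover_rules(
    corpus, "airport", lambda rule, examples: rule in {"shuttle", "bus"}
)
```

In this toy run the seed rule `airport` matches two sentences, the accepted rule `shuttle` adds a third, and the remaining candidates are rejected. The sentence about the bus is never reached because no accepted rule co-occurs with it, which hints at why candidate generation and adaptation to feedback matter in imbalanced settings.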
