Paper title
Interactive Weak Supervision: Learning Useful Heuristics for Data Labeling
Paper authors
Paper abstract
Obtaining large annotated datasets is critical for training successful machine learning models, and it is often a bottleneck in practice. Weak supervision offers a promising alternative for producing labeled datasets without ground truth annotations by generating probabilistic labels using multiple noisy heuristics. This process can scale to large datasets and has demonstrated state-of-the-art performance in diverse domains such as healthcare and e-commerce. One practical issue with learning from user-generated heuristics is that their creation requires creativity, foresight, and domain expertise from those who hand-craft them, a process that can be tedious and subjective. We develop the first framework for interactive weak supervision, in which a method proposes heuristics and learns from user feedback given on each proposed heuristic. Our experiments demonstrate that only a small number of feedback iterations are needed to train models that achieve highly competitive test set performance without access to ground truth training labels. We conduct user studies, which show that users are able to effectively provide feedback on heuristics and that test set results track the performance of simulated oracles.
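To make the idea of "generating probabilistic labels using multiple noisy heuristics" concrete, the following is a minimal Python sketch. It is not the framework described in the abstract: it assumes binary labels, uses hypothetical keyword-based heuristics (`keyword_heuristic`), and combines votes with a plain vote share (`probabilistic_labels`) in place of a learned label model. All names and example texts are illustrative assumptions.

```python
from typing import Callable, List

# A heuristic (labeling function) votes +1, -1, or 0 to abstain.
Heuristic = Callable[[str], int]


def keyword_heuristic(keyword: str, label: int) -> Heuristic:
    """Hypothetical heuristic: vote `label` if `keyword` occurs in the text."""
    def vote(text: str) -> int:
        return label if keyword in text.lower() else 0
    return vote


def probabilistic_labels(texts: List[str], heuristics: List[Heuristic]) -> List[float]:
    """Estimate P(y = +1) per text as the share of positive votes.

    Assumption: a simple vote share stands in for a learned label model.
    Texts on which every heuristic abstains receive 0.5.
    """
    probs = []
    for text in texts:
        # Collect non-abstaining votes only.
        active = [v for v in (h(text) for h in heuristics) if v != 0]
        if not active:
            probs.append(0.5)
        else:
            probs.append(sum(1 for v in active if v > 0) / len(active))
    return probs


# Illustrative heuristics and unlabeled texts (made up for this sketch).
heuristics = [
    keyword_heuristic("great", +1),
    keyword_heuristic("love", +1),
    keyword_heuristic("terrible", -1),
]
texts = [
    "Great product, love it",
    "Terrible battery",
    "Great screen but terrible sound",
    "Arrived on Tuesday",
]
labels = probabilistic_labels(texts, heuristics)
```

In an interactive setting such as the one the abstract describes, the list of heuristics would not be fixed in advance; candidate heuristics would be proposed one at a time and kept or discarded based on user feedback. That selection step is omitted here.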