Paper Title

QActor: On-line Active Learning for Noisy Labeled Stream Data

Authors

Taraneh Younesian, Zilong Zhao, Amirmasoud Ghiassi, Robert Birke, Lydia Y. Chen

Abstract

Noisy labeled data is more the norm than a rarity for self-generated content that is continuously published on the web and social media. Due to privacy concerns and governmental regulations, such data streams can only be stored and used for learning purposes for a limited duration. To overcome the noise in this on-line scenario, we propose QActor, which novelly combines the selection of supposedly clean samples via quality models with actively querying an oracle for the most informative true labels. While the former can suffer from the low data volumes of on-line scenarios, the latter is constrained by the availability and cost of human experts. QActor swiftly combines the merits of quality models for data filtering and oracle queries for cleaning the most informative data. The objective of QActor is to leverage the stringent oracle budget to robustly maximize the learning accuracy. QActor explores various strategies combining different query allocations and uncertainty measures. A central feature of QActor is that it dynamically adjusts the query limit according to the learning loss for each data batch. We extensively evaluate QActor on different image datasets fed into a classifier, which can be a standard machine learning (ML) model or a deep neural network (DNN), with label noise ratios ranging between 30% and 80%. Our results show that QActor can nearly match the optimal accuracy achieved using only clean data at the cost of at most an additional 6% of ground truth data from the oracle.
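The per-batch idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the agreement rule used to mark samples as "supposedly clean", the choice of entropy as the uncertainty measure, and the proportional budget rule in `adjust_budget` are all assumptions made for illustration, and every function name is hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy of a probability vector (higher = more uncertain)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def split_batch(probs, given_labels):
    """Split sample indices into supposedly clean and suspicious.

    Assumption: a sample counts as clean when the quality model's
    predicted class agrees with the given (possibly noisy) label.
    """
    clean, suspicious = [], []
    for i, (p, y) in enumerate(zip(probs, given_labels)):
        predicted = max(range(len(p)), key=p.__getitem__)
        (clean if predicted == y else suspicious).append(i)
    return clean, suspicious


def select_queries(probs, suspicious, budget):
    """Pick up to `budget` suspicious samples with the highest entropy."""
    ranked = sorted(suspicious, key=lambda i: entropy(probs[i]), reverse=True)
    return ranked[:budget]


def adjust_budget(base_budget, batch_loss, reference_loss, max_budget):
    """Hypothetical rule: scale the query limit with the relative batch loss."""
    if reference_loss <= 0:
        return base_budget
    scaled = round(base_budget * batch_loss / reference_loss)
    return max(0, min(max_budget, scaled))


# Toy batch: class probabilities from a quality model and the given labels.
probs = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8], [0.6, 0.4]]
given = [0, 1, 0, 1]

clean, suspicious = split_batch(probs, given)
budget = adjust_budget(base_budget=2, batch_loss=1.0, reference_loss=1.0, max_budget=5)
to_query = select_queries(probs, suspicious, budget)
```

In this toy batch only sample 0 agrees with the quality model, so it is kept as clean. The remaining three are ranked by entropy, and the two most uncertain ones would be sent to the oracle. A higher batch loss relative to the reference raises the query limit, up to `max_budget`.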
