Paper title
Instance exploitation for learning temporary concepts from sparsely labeled drifting data streams
Paper authors
Abstract
Continual learning from streaming data sources is becoming increasingly popular due to the growing number of online tools and systems. Dealing with dynamic and everlasting problems poses new challenges for which traditional batch-based offline algorithms prove insufficient in terms of computational time and predictive performance. One of the most crucial limitations is that we cannot assume access to a finite and complete data set; we must always be ready for new data that may complement our model. This raises the critical problem of providing labels for potentially unbounded streams. In the real world, we are forced to deal with very strict budget limitations, so we will most likely face a scarcity of annotated instances, which are essential in supervised learning. In our work, we emphasize this problem and propose a novel instance exploitation technique. We show that when (i) data is characterized by temporary non-stationary concepts and (ii) very few labels are spread across a long time horizon, it is actually better to risk overfitting and adapt models more aggressively by exploiting the only labeled instances we have than to stick to a standard learning mode and suffer from severe underfitting. We present different strategies and configurations for our methods, as well as an ensemble algorithm that attempts to maintain a sweet spot between risky and normal adaptation. Finally, we conduct a comprehensive in-depth comparative analysis of our methods against state-of-the-art streaming algorithms relevant to the given problem.
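
The abstract does not say how labeled instances are exploited or how the ensemble is built, so the following is only an illustrative sketch under stated assumptions, not the authors' algorithm. It assumes that "exploitation" means replaying a small buffer of recent labeled instances several times per incoming label, and that the ensemble simply switches between a standard learner and an exploiting learner according to a fading accuracy estimate. All class names and parameters (`OnlineLogReg`, `ExploitingLearner`, `SwitchingPair`, `window`, `replays`, `fade`) are hypothetical.

```python
import math
import random
from collections import deque


class OnlineLogReg:
    """Tiny online logistic regression trained by SGD, one instance at a time."""

    def __init__(self, n_features, lr=0.1):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr

    def predict_proba(self, x):
        z = self.b + sum(wi * xi for wi, xi in zip(self.w, x))
        z = max(-30.0, min(30.0, z))  # clamp to keep exp() numerically safe
        return 1.0 / (1.0 + math.exp(-z))

    def predict(self, x):
        return 1 if self.predict_proba(x) >= 0.5 else 0

    def learn_one(self, x, y):
        err = self.predict_proba(x) - y  # gradient of the log loss w.r.t. z
        self.w = [wi - self.lr * err * xi for wi, xi in zip(self.w, x)]
        self.b -= self.lr * err


class ExploitingLearner:
    """Assumed 'risky' mode: every new label triggers extra replayed updates."""

    def __init__(self, base, window=20, replays=5, seed=0):
        self.base = base
        self.buffer = deque(maxlen=window)  # only the most recent labels are kept
        self.replays = replays
        self.rng = random.Random(seed)

    def predict(self, x):
        return self.base.predict(x)

    def learn_one(self, x, y):
        self.buffer.append((x, y))
        self.base.learn_one(x, y)
        # Reuse the scarce labels: sample from the buffer and update again.
        for _ in range(self.replays):
            bx, by = self.rng.choice(self.buffer)
            self.base.learn_one(bx, by)


class SwitchingPair:
    """Assumed ensemble: predict with whichever member has scored better lately."""

    def __init__(self, n_features, fade=0.9):
        self.members = [
            OnlineLogReg(n_features),                     # standard adaptation
            ExploitingLearner(OnlineLogReg(n_features)),  # risky adaptation
        ]
        self.scores = [0.5, 0.5]  # fading accuracy estimate per member
        self.fade = fade

    def predict(self, x):
        best = max(range(len(self.members)), key=lambda i: self.scores[i])
        return self.members[best].predict(x)

    def learn_one(self, x, y):
        for i, member in enumerate(self.members):
            # Test-then-train: score the member before it sees the label.
            hit = 1.0 if member.predict(x) == y else 0.0
            self.scores[i] = self.fade * self.scores[i] + (1.0 - self.fade) * hit
            member.learn_one(x, y)
```

In this sketch, a larger `replays` value pushes the exploiting learner further toward the few labels it has seen, which is the overfitting risk the abstract describes, while the fading score lets `SwitchingPair` fall back to the standard learner whenever aggressive adaptation stops paying off.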