论文标题
非常快速的流式传输函数最大化
Very Fast Streaming Submodular Function Maximization
论文作者
论文摘要
数据摘要已成为理解数据的宝贝的宝贵工具。由于其引人注目的理论特性,子管函数已成为摘要算法的重点。这些算法提供了最坏的近似值,保证了更高的计算和内存要求的费用。但是,许多实际应用并不属于这种最坏的案例,但通常更加行善。在本文中,我们提出了一种称为“三分之一”的新的函数最大化算法,该算法忽略了最坏的案例,但在很高的可能性方面提供了一个很好的解决方案。它可以从数据流中选择最有用的项目,并在固定内存预算中保持可证明的性能。在广泛的评估中,我们将我们的方法与$ 8 $ $ 8 $不同数据集的$ 6 $的方法进行比较,而没有概念漂移。我们表明,我们的算法优于当前的最新算法,同时使用的资源更少。最后,我们重点介绍了我们算法的现实用例,用于伽马射线天文学中的数据摘要。我们在https://github.com/sbuschjaeger/submodularstreamingmaximization上公开提供代码。
Data summarization has become a valuable tool in understanding even terabytes of data. Due to their compelling theoretical properties, submodular functions have been in the focus of summarization algorithms. These algorithms offer worst-case approximations guarantees to the expense of higher computation and memory requirements. However, many practical applications do not fall under this worst-case, but are usually much more well-behaved. In this paper, we propose a new submodular function maximization algorithm called ThreeSieves, which ignores the worst-case, but delivers a good solution in high probability. It selects the most informative items from a data-stream on the fly and maintains a provable performance on a fixed memory budget. In an extensive evaluation, we compare our method against $6$ other methods on $8$ different datasets with and without concept drift. We show that our algorithm outperforms current state-of-the-art algorithms and, at the same time, uses fewer resources. Last, we highlight a real-world use-case of our algorithm for data summarization in gamma-ray astronomy. We make our code publicly available at https://github.com/sbuschjaeger/SubmodularStreamingMaximization.