Paper title
MESSI: In-Memory Data Series Indexing
Paper authors
Paper abstract
Data series similarity search is a core operation for several data series analysis applications across many different domains. However, the state-of-the-art techniques fail to deliver the time performance required for interactive exploration, or analysis of large data series collections. In this work, we propose MESSI, the first data series index designed for in-memory operation on modern hardware. Our index takes advantage of the modern hardware parallelization opportunities (i.e., SIMD instructions, multi-core and multi-socket architectures), in order to accelerate both index construction and similarity search processing times. Moreover, it benefits from a careful design in the setup and coordination of the parallel workers and data structures, so that it maximizes its performance for in-memory operations. Our experiments with synthetic and real datasets demonstrate that overall MESSI is up to 4x faster at index construction, and up to 11x faster at query answering than the state-of-the-art parallel approach. MESSI is the first to answer exact similarity search queries on 100GB datasets in ~50msec (30-75msec across diverse datasets), which enables real-time, interactive data exploration on very large data series collections.