论文标题
K-均在分布的近似相似性搜索中非常规应用
Unconventional application of k-means for distributed approximate similarity search
论文作者
论文摘要
基于度量空间中距离函数的相似性搜索是许多应用程序的基本问题。对类似对象的查询导致了著名的机器学习任务最近的邻居识别。在这种情况下,已经提出了许多数据索引策略,共同称为公制访问方法(MAM),以加快针对类似元素的查询。此外,由于确切的解决相似性查询的方法可能很复杂且耗时,因此似乎已选择替代选项可以减少查询执行时间,例如返回近似结果或求助于分布式计算平台。在本文中,我们介绍了蒙版(与$ k $ -Meanss的多级近似相似性搜索),这是$ k $ -Means算法的非常规应用,作为用于近似相似性搜索的多级索引结构的基础,适用于公制空间。我们表明,可以为此目的利用$ k $ - 均值的固有属性,例如代表具有更少原型的高密度数据区域。使用合成数据集和高维空间中的合成数据集和现实世界数据集评估了这种新索引方法的实现。结果是有希望的,并支持了这种新型索引方法在多个域中的适用性。
Similarity search based on a distance function in metric spaces is a fundamental problem for many applications. Queries for similar objects lead to the well-known machine learning task of nearest-neighbours identification. Many data indexing strategies, collectively known as Metric Access Methods (MAM), have been proposed to speed up queries for similar elements in this context. Moreover, since exact approaches to solve similarity queries can be complex and time-consuming, alternative options have appeared to reduce query execution time, such as returning approximate results or resorting to distributed computing platforms. In this paper, we introduce MASK (Multilevel Approximate Similarity search with $k$-means), an unconventional application of the $k$-means algorithm as the foundation of a multilevel index structure for approximate similarity search, suitable for metric spaces. We show that inherent properties of $k$-means, like representing high-density data areas with fewer prototypes, can be leveraged for this purpose. An implementation of this new indexing method is evaluated, using a synthetic dataset and a real-world dataset in a high-dimensional and high-sparsity space. Results are promising and underpin the applicability of this novel indexing method in multiple domains.