论文标题

LSF-JOIN:分布式全对的局部敏感过滤偏斜下方相似性

LSF-Join: Locality Sensitive Filtering for Distributed All-Pairs Set Similarity Under Skew

论文作者

Rashtchian, Cyrus, Sharma, Aneesh, Woodruff, David P.

论文摘要

全对设置相似性是一项广泛使用的数据挖掘任务,即使对于大和高维数据集也是如此。传统上,相似性搜索集中在发现非常相似的对上,为此,已知各种有效的算法。但是,最近的工作突出了找到相对较小的相交大小的成对组的重要性。例如,在推荐系统中,即使他们的兴趣仅在一小部分项目上重叠,两个用户也可能是相同的。在这样的系统中,某些尺寸通常是高度偏斜的,因为它们非常受欢迎。这两种属性共同使以前的方法对于大输入尺寸而言是不可行的。为了解决此问题,我们提出了一种新的分布式算法LSF-JOIN,以近似全对设置相似性。我们算法的核心是基于位置敏感过滤的随机选择过程。我们的方法偏离了先前的近似算法,这些算法基于局部性敏感的哈希。从理论上讲,我们表明LSF-Join有效地找到了最亲密的对,即使对于小相似性阈值和偏斜的输入集也是如此。我们证明,LSF-JOIN的通信,工作和最大负载可以保证,并且我们还通过实验证明了其在多个图表上的准确性。

All-pairs set similarity is a widely used data mining task, even for large and high-dimensional datasets. Traditionally, similarity search has focused on discovering very similar pairs, for which a variety of efficient algorithms are known. However, recent work highlights the importance of finding pairs of sets with relatively small intersection sizes. For example, in a recommender system, two users may be alike even though their interests only overlap on a small percentage of items. In such systems, some dimensions are often highly skewed because they are very popular. Together these two properties render previous approaches infeasible for large input sizes. To address this problem, we present a new distributed algorithm, LSF-Join, for approximate all-pairs set similarity. The core of our algorithm is a randomized selection procedure based on Locality Sensitive Filtering. Our method deviates from prior approximate algorithms, which are based on Locality Sensitive Hashing. Theoretically, we show that LSF-Join efficiently finds most close pairs, even for small similarity thresholds and for skewed input sets. We prove guarantees on the communication, work, and maximum load of LSF-Join, and we also experimentally demonstrate its accuracy on multiple graphs.

扫码加入交流群

加入微信交流群

微信交流群二维码

扫码加入学术交流群,获取更多资源