论文标题

关于$ p_1^{ - 1} $在局部敏感哈希的问题上

On the Problem of $p_1^{-1}$ in Locality-Sensitive Hashing

论文作者

Ahle, Thomas Dybdahl

论文摘要

局部敏感的哈希(LSH)函数称为$(R,Cr,P_1,P_2)$ - 敏感,如果两个距离小于$ R $ collide的数据点具有至少$ p_1 $,而数据点的距离至少$ CR $ collide collide collide collide collide collide collide collide collide collide collide最多$ p_2 $。这些功能构成了成功的Indyk-Motwani算法(STOC 1998)的基础。特别地,可以构建具有查询时间$ \ tilde o(n^ρ/p_1)$的$ c $ - 适当的邻居数据结构,其中$ρ= \ frac {\ log1/p_1} {\ log1/p_1} {\ log1/p_2} \ in(0,1)$。也就是说,次线性时间,只要$ p_1 $不太小。这很重要,因为大多数高维最接近的邻居问题遭受了维度的诅咒,并且不能准确地解决,而不是比数据库的蛮力线性扫描更快。 不幸的是,最佳的LSH功能往往具有非常低的碰撞概率,$ P_1 $和$ P_2 $。包括余弦和Jaccard相似性的最佳功能。这意味着,即使对于最近的邻居,LSH的$​​ n^ρ/p_1 $查询时间毕竟也不是次线性的! 在本文中,我们改进了一般的Indyk-Motwani算法,以将LSH的查询时间减少到$ \ tilde o(N^ρ/P_1^{1-ρ})$(以及相应的空间用法。均方根查询时间,对于任何碰撞概率至少$ 1/n $。对于$ p_1 $和$ p_2 $ himper,我们对所有先前方法的改进都可以是\ emph {在查询时间和空间中都可以\ emph {最高$ n $}。 改进来自对Indyk-Motwani算法的简单更改,该算法很容易在现有软件包中实现。

A Locality-Sensitive Hash (LSH) function is called $(r,cr,p_1,p_2)$-sensitive, if two data-points with a distance less than $r$ collide with probability at least $p_1$ while data points with a distance greater than $cr$ collide with probability at most $p_2$. These functions form the basis of the successful Indyk-Motwani algorithm (STOC 1998) for nearest neighbour problems. In particular one may build a $c$-approximate nearest neighbour data structure with query time $\tilde O(n^ρ/p_1)$ where $ρ=\frac{\log1/p_1}{\log1/p_2}\in(0,1)$. That is, sub-linear time, as long as $p_1$ is not too small. This is significant since most high dimensional nearest neighbour problems suffer from the curse of dimensionality, and can't be solved exact, faster than a brute force linear-time scan of the database. Unfortunately, the best LSH functions tend to have very low collision probabilities, $p_1$ and $p_2$. Including the best functions for Cosine and Jaccard Similarity. This means that the $n^ρ/p_1$ query time of LSH is often not sub-linear after all, even for approximate nearest neighbours! In this paper, we improve the general Indyk-Motwani algorithm to reduce the query time of LSH to $\tilde O(n^ρ/p_1^{1-ρ})$ (and the space usage correspondingly.) Since $n^ρp_1^{ρ-1} < n \Leftrightarrow p_1 > n^{-1}$, our algorithm always obtains sublinear query time, for any collision probabilities at least $1/n$. For $p_1$ and $p_2$ small enough, our improvement over all previous methods can be \emph{up to a factor $n$} in both query time and space. The improvement comes from a simple change to the Indyk-Motwani algorithm, which can easily be implemented in existing software packages.

扫码加入交流群

加入微信交流群

微信交流群二维码

扫码加入学术交流群,获取更多资源