Paper Title

An Efficient Shared-memory Parallel Sinkhorn-Knopp Algorithm to Compute the Word Mover's Distance

Authors

Jesmin Jahan Tithi, Fabrizio Petrini

Abstract

The Word Mover's Distance (WMD) is a metric that measures the semantic dissimilarity between two text documents by computing the cost of optimally moving all words of a source/query document to the most similar words of a target document. Computing the WMD between two documents is costly because it requires solving an optimization problem that costs \(O(V^3 \log V)\), where \(V\) is the number of unique words in the documents. Fortunately, the WMD can be framed as the Earth Mover's Distance (EMD), also known as the Optimal Transportation Distance, for which it has been shown that adding an entropy penalty to the optimization problem reduces the algorithmic complexity to \(O(V^2)\); a similar idea can be adapted to compute the WMD efficiently. Additionally, the computation can be made highly parallel by computing the WMD of a single query document against multiple target documents at once (e.g., finding whether a given tweet is similar to any other tweet posted in a day). In this paper, we present a shared-memory parallel Sinkhorn-Knopp algorithm to compute the WMD of one document against many other documents by adopting the \(O(V^2)\) EMD algorithm. We used algorithmic transformations to change the original dense compute-heavy kernel into a sparse compute kernel and obtained a \(67\times\) speedup using \(96\) cores on a state-of-the-art Intel\textregistered{} 4-socket Cascade Lake machine relative to the sequential run. Our parallel algorithm is over \(700\times\) faster than naive parallel Python code that internally uses optimized matrix-library calls.
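For context on the entropy-penalized formulation the abstract refers to: in Cuturi-style entropy-regularized EMD, one solves \(d_\lambda(r, c) = \min_{T \in U(r, c)} \langle T, M \rangle - \frac{1}{\lambda} h(T)\), where \(h(T) = -\sum_{ij} T_{ij} \log T_{ij}\); the optimal transport plan has the form \(T = \mathrm{diag}(u)\, K\, \mathrm{diag}(v)\) with Gibbs kernel \(K = e^{-\lambda M}\), and the Sinkhorn-Knopp iteration alternately rescales \(u\) and \(v\) to match the two marginals. The sketch below is a minimal dense NumPy illustration of that one-query-to-many-targets iteration; the names (`r`, `c`, `M`, `lam`, `n_iter`) are our own assumptions, and the paper's actual sparse, shared-memory-parallel kernel is not reproduced here.

```python
import numpy as np

def sinkhorn_wmd(r, c, M, lam=10.0, n_iter=50):
    """Approximate WMD between a query word histogram r (V,) and N target
    histograms c (V, N), given a word-to-word distance matrix M (V, V).

    A minimal dense sketch of the entropy-regularized (Sinkhorn-Knopp)
    EMD iteration; variable names and defaults are illustrative only.
    """
    # Keep only words present in the query; this is the row sparsity
    # a sparse formulation can exploit.
    sel = r > 0
    r, M = r[sel], M[sel, :]

    K = np.exp(-lam * M)                              # Gibbs kernel, shape (v, V)
    u = np.full((r.size, c.shape[1]), 1.0 / r.size)   # one scaling column per target

    for _ in range(n_iter):
        v = c / (K.T @ u)            # rescale columns to match target histograms
        u = r[:, None] / (K @ v)     # rescale rows to match the query histogram

    # Transport cost <T, M> with T = diag(u) K diag(v), per target column
    return np.sum(u * ((K * M) @ v), axis=0)
```

Each iteration is dominated by the two matrix products with \(K\), which is where the \(O(V^2)\) cost per document pair comes from; batching many target columns into `c` and `v` is what makes the computation amenable to the thread-level parallelism the paper describes.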
