Paper title
Wasserstein Distance Regularized Sequence Representation for Text Matching in Asymmetrical Domains
Paper authors
Abstract
One approach to matching texts from asymmetrical domains is to project the input sequences into a common semantic space as feature vectors, upon which the matching function can be readily defined and learned. In real-world matching practice, it is often observed that, as training goes on, the feature vectors projected from different domains tend to become indistinguishable. This phenomenon, however, is often overlooked in existing matching models. As a result, the feature vectors are constructed without any regularization, which inevitably increases the difficulty of learning the downstream matching functions. In this paper, we propose a novel matching method tailored for text matching in asymmetrical domains, called WD-Match. In WD-Match, a Wasserstein distance-based regularizer is defined to regularize the feature vectors projected from different domains. As a result, the method forces the feature projection function to generate vectors such that those corresponding to different domains cannot be easily discriminated. The training process of WD-Match amounts to a game that minimizes the matching loss regularized by the Wasserstein distance. WD-Match can be used to improve different text matching methods by taking each such method as its underlying matching model. Four popular text matching methods have been exploited in the paper. Experimental results on four publicly available benchmarks showed that WD-Match consistently outperformed the underlying methods and the baselines.
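The idea of a matching loss regularized by a Wasserstein distance between the two domains' feature vectors can be illustrated with a minimal sketch. The abstract does not specify the critic, the projection function, or the weighting, so everything named here is an assumption made for illustration: the function names, the weight `lam`, and the restriction to linear critics f(h) = w·h with ||w|| ≤ 1. Under that restriction the inner maximization of the game has a closed form (the norm of the difference of the feature means), which is only a lower bound on the true Wasserstein-1 distance and is not claimed to be the authors' estimator.

```python
import numpy as np


def linear_critic_wasserstein(feats_a, feats_b):
    """Lower bound on the Wasserstein-1 distance between two feature sets.

    Assumption (not from the paper): critics are linear, f(h) = w . h with
    ||w||_2 <= 1, so they are 1-Lipschitz. Maximizing
    mean(f(feats_a)) - mean(f(feats_b)) over such w gives the Euclidean
    norm of the difference between the two feature means.
    """
    gap = feats_a.mean(axis=0) - feats_b.mean(axis=0)
    return float(np.linalg.norm(gap))


def regularized_loss(match_loss, feats_a, feats_b, lam=0.1):
    """Matching loss plus a weighted Wasserstein-style penalty.

    `lam` is a hypothetical trade-off weight; the projection function
    would be trained to minimize this quantity.
    """
    return match_loss + lam * linear_critic_wasserstein(feats_a, feats_b)


# Toy features: domain A at the origin, domain B shifted by (3, 4, 0).
feats_a = np.zeros((4, 3))
feats_b = np.tile(np.array([3.0, 4.0, 0.0]), (4, 1))

distance = linear_critic_wasserstein(feats_a, feats_b)
total = regularized_loss(1.0, feats_a, feats_b, lam=0.1)
```

In this toy setup the penalty shrinks as the two domains' feature means move together, which is the direction the regularizer pushes the projection function. A richer critic, such as a small neural network, would capture differences beyond the means; that choice is left open here.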