Paper Title

Using Interventions to Improve Out-of-Distribution Generalization of Text-Matching Recommendation Systems

Authors

Parikshit Bansal, Yashoteja Prabhu, Emre Kiciman, Amit Sharma

Abstract

Given a user's input text, text-matching recommender systems output relevant items by comparing the input text to available items' descriptions, such as product-to-product recommendation on e-commerce platforms. As users' interests and item inventory are expected to change, it is important for a text-matching system to generalize to data shifts, a task known as out-of-distribution (OOD) generalization. However, we find that the popular approach of fine-tuning a large, base language model on paired item relevance data (e.g., user clicks) can be counter-productive for OOD generalization. For a product recommendation task, fine-tuning obtains worse accuracy than the base model when recommending items in a new category or for a future time period. To explain this generalization failure, we consider an intervention-based importance metric, which shows that a fine-tuned model captures spurious correlations and fails to learn the causal features that determine the relevance between any two text inputs. Moreover, standard methods for causal regularization do not apply in this setting, because unlike in images, there exist no universally spurious features in a text-matching task (the same token may be spurious or causal depending on the text it is being matched to). For OOD generalization on text inputs, therefore, we highlight a different goal: avoiding high importance scores for certain features. We do so using an intervention-based regularizer that constrains the causal effect of any token on the model's relevance score to be similar to that of the base model. Results on Amazon product and 3 question recommendation datasets show that our proposed regularizer improves generalization for both in-distribution and OOD evaluation, especially in difficult scenarios where the base model is not accurate.
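The abstract's core mechanism can be illustrated with a small sketch: estimate a token's causal effect on the relevance score by intervening (removing the token) and comparing scores, then penalize the fine-tuned model when its per-token effects drift from the base model's. This is a minimal toy illustration, not the paper's implementation; the bag-of-words embeddings, cosine relevance score, and all function names here are illustrative assumptions.

```python
import math

def embed(tokens, weights):
    """Toy bag-of-words embedding: sum of per-token weight vectors (assumed model)."""
    dim = len(next(iter(weights.values())))
    v = [0.0] * dim
    for t in tokens:
        for j, x in enumerate(weights.get(t, [0.0] * dim)):
            v[j] += x
    return v

def relevance(query, item, weights):
    """Relevance score: cosine similarity between query and item embeddings."""
    q, d = embed(query, weights), embed(item, weights)
    dot = sum(a * b for a, b in zip(q, d))
    nq = math.sqrt(sum(a * a for a in q)) or 1.0
    nd = math.sqrt(sum(a * a for a in d)) or 1.0
    return dot / (nq * nd)

def token_effect(query, item, i, weights):
    """Causal effect of item token i: score drop when that token is removed."""
    ablated = item[:i] + item[i + 1:]
    return relevance(query, item, weights) - relevance(query, ablated, weights)

def intervention_regularizer(query, item, w_finetuned, w_base):
    """Penalty: squared gap between fine-tuned and base per-token causal effects."""
    return sum(
        (token_effect(query, item, i, w_finetuned)
         - token_effect(query, item, i, w_base)) ** 2
        for i in range(len(item))
    )
```

In a training loop this penalty would be added to the relevance loss, discouraging the fine-tuned model from assigning any single token a much larger (potentially spurious) causal effect than the base model does.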
