论文标题
从稀疏数据中进行半监督学习的分解图表
Factorized Graph Representations for Semi-Supervised Learning from Sparse Data
论文作者
论文摘要
节点分类是图数据管理中的重要问题。它通常是通过从几个标记的种子节点开始迭代的各种标签传播方法来求解的。对于在类之间具有任意兼容性的图表,这些方法至关重要地取决于知道域专家或启发式方法必须提供的兼容性矩阵。我们可以以原则可扩展的方式直接从稀疏标记的图中直接估算正确的兼容性?我们肯定地回答了这个问题,并提出了一种称为“遥远兼容性估计”的方法,即使在非常稀疏的标记图上有效(例如,标记了10,000个节点中的1个),以稍后的时间来标记其余节点。我们的方法首先创建多分解的图表表示(大小与图形无关),然后对这些较小的图形草图执行估计。我们将代数放大定义为利用算法更新方程的代数属性以扩大稀疏信号的更一般思想。我们表明,我们的估计器按比替代方法快的数量级来表明,端到端分类精度与使用金标准兼容性相当。这使其成为任何现有标签传播方法的廉价预处理步骤,并消除了当前对启发式方法的依赖。
Node classification is an important problem in graph data management. It is commonly solved by various label propagation methods that work iteratively starting from a few labeled seed nodes. For graphs with arbitrary compatibilities between classes, these methods crucially depend on knowing the compatibility matrix that must be provided by either domain experts or heuristics. Can we instead directly estimate the correct compatibilities from a sparsely labeled graph in a principled and scalable way? We answer this question affirmatively and suggest a method called distant compatibility estimation that works even on extremely sparsely labeled graphs (e.g., 1 in 10,000 nodes is labeled) in a fraction of the time it later takes to label the remaining nodes. Our approach first creates multiple factorized graph representations (with size independent of the graph) and then performs estimation on these smaller graph sketches. We define algebraic amplification as the more general idea of leveraging algebraic properties of an algorithm's update equations to amplify sparse signals. We show that our estimator is by orders of magnitude faster than an alternative approach and that the end-to-end classification accuracy is comparable to using gold standard compatibilities. This makes it a cheap preprocessing step for any existing label propagation method and removes the current dependence on heuristics.