论文标题
用加权术语进行树木摘要的高效且最佳的算法
Efficient and Optimal Algorithms for Tree Summarization with Weighted Terminologies
论文作者
论文摘要
将数据集的一小部分向用户呈现的数据摘要已广泛应用于众多应用程序和系统中。许多数据集用层次术语编码,例如,疾病9,医学主题和基因本体学的国际分类,仅举几例。在本文中,我们研究了选择各种K元素以汇总具有层次术语的输入数据集的问题,并在本体论结构中可视化摘要。我们提出了一种有效的贪婪算法,以解决(1-1/e)= 62%的approximation保证解决问题。尽管这种贪婪的解决方案大约可以达到质量保证的答案,但仍然不是最佳的。为了最佳解决该问题,我们进一步开发了一种动态编程算法,以使用称为OVDO的本体论术语来获得最佳的log-data可视化答案。理论上分析了OVDO的复杂性和正确性。此外,我们提出了一种有用的减少树木的优化技术,以去除零重量的无用节点,然后将树缩小成较小的节点,从而确保在许多现实世界应用中OVDO的效率加速。对现实世界数据集的广泛实验结果显示了我们提出的近似和精确算法在树数据摘要中的有效性和效率。
Data summarization that presents a small subset of a dataset to users has been widely applied in numerous applications and systems. Many datasets are coded with hierarchical terminologies, e.g., the international classification of Diseases-9, Medical Subject Heading, and Gene Ontology, to name a few. In this paper, we study the problem of selecting a diverse set of k elements to summarize an input dataset with hierarchical terminologies, and visualize the summary in an ontology structure. We propose an efficient greedy algorithm to solve the problem with (1-1/e) = 62% -approximation guarantee. Although this greedy solution achieves quality-guaranteed answers approximately but it is still not optimal. To tackle the problem optimally, we further develop a dynamic programming algorithm to obtain optimal answers for graph visualization of log-data using ontology terminologies called OVDO . The complexity and correctness of OVDO are theoretically analyzed. In addition, we propose a useful optimization technique of tree reduction to remove useless nodes with zero weights and shrink the tree into a smaller one, which ensures the efficiency acceleration of OVDO in many real-world applications. Extensive experimental results on real-world datasets show the effectiveness and efficiency of our proposed approximate and exact algorithms for tree data summarization.