Paper Title

TADOC: Text Analytics Directly on Compression

Authors

Feng Zhang, Jidong Zhai, Xipeng Shen, Dalin Wang, Zheng Chen, Onur Mutlu, Wenguang Chen, Xiaoyong Du

Abstract

This article provides a comprehensive description of Text Analytics Directly on Compression (TADOC), which enables document analytics to run directly on compressed textual data. The article explains the concept of TADOC and the challenges to its effective realization. It then presents a series of guidelines and technical solutions that address those challenges, including the adoption of a hierarchical compression method and a set of novel algorithms and data structure designs. Experiments on six data analytics tasks of various complexities show that TADOC can save 90.8% of storage space and 87.9% of memory usage, while halving data processing time.
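The idea of running analytics on compressed data without decompressing it can be illustrated with a small sketch. It assumes a grammar-style hierarchical compression in which each repeated phrase is stored once as a rule and referenced wherever it recurs. The toy grammar `RULES`, the rule names `R0` to `R3`, and the functions `word_count` and `expand` are hypothetical and chosen only for illustration; this is not the paper's implementation.

```python
from collections import Counter

# Hypothetical toy grammar (an assumption for illustration):
# a symbol that is a key of RULES is a reference to another rule,
# any other symbol is a plain word. R0 is the root rule.
RULES = {
    "R0": ["R1", "R2", "and", "R1", "R2", "again"],
    "R1": ["text", "analytics"],
    "R2": ["directly", "on", "R3"],
    "R3": ["compressed", "data"],
}


def word_count(rules, root="R0"):
    """Count words by traversing the rules, never expanding the full text.

    Each rule's counts are computed once and reused for every reference,
    so repeated phrases are not processed again. Assumes the grammar is
    acyclic and that no plain word shares a name with a rule.
    """
    cache = {}

    def count(rule):
        if rule not in cache:
            total = Counter()
            for sym in rules[rule]:
                if sym in rules:
                    total.update(count(sym))  # reuse the sub-rule's counts
                else:
                    total[sym] += 1
            cache[rule] = total
        return cache[rule]

    return count(root)


def expand(rules, root="R0"):
    """Fully decompress the text; used here only to check the result."""
    words = []
    for sym in rules[root]:
        if sym in rules:
            words.extend(expand(rules, sym))
        else:
            words.append(sym)
    return words


if __name__ == "__main__":
    counts = word_count(RULES)
    # The counts from the compressed form match those of the expanded text.
    assert counts == Counter(expand(RULES))
    print(counts.most_common())
```

In this sketch the counts for each rule are cached, so a phrase that appears many times in the original text is counted once, which is one plausible source of the time savings the abstract reports.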
