Providing Insights for Open-Response Surveys via End-to-End Context-Aware Clustering

Paper Title

Providing Insights for Open-Response Surveys via End-to-End Context-Aware Clustering

Authors

Soheil Esmaeilzadeh, Brian Williams, Davood Shamsi, Onar Vikingstad

Abstract

Teachers often conduct surveys to collect data from a predefined group of students and gain insights into topics of interest. When a survey contains open-ended textual responses, manually processing all the responses into an insightful and comprehensive report is extremely time-consuming, labor-intensive, and difficult. In the analysis step, the teacher traditionally has to read each response and decide how to group them in order to extract insightful information. Although responses can be grouped using only certain keywords, such an approach is limited: it fails to account for embedded context, and it cannot detect polysemous words or phrases, or semantics that cannot be expressed in single words. In this work, we present a novel end-to-end context-aware framework that extracts, aggregates, and abbreviates embedded semantic patterns in open-response survey data. Our framework relies on a pre-trained natural language model to encode the textual data into semantic vectors. The encoded vectors are then clustered either into an optimally tuned number of groups or into a set of groups with pre-specified titles. In the former case, the clusters are further analyzed to extract a representative set of keywords or summary sentences that serve as the labels of the clusters. Finally, for the designated clusters, the framework provides context-aware word clouds that show the semantically prominent keywords within each group. To honor user privacy, we have built an on-device implementation of the framework suitable for real-time analysis on mobile devices and have tested it on a synthetic dataset. Our framework reduces costs at scale by automating the process of extracting the most insightful pieces of information from survey data.
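The pre-specified-titles branch described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `encode` function below is a toy bag-of-words stand-in for the pre-trained language model (so it does not capture context the way the paper's encoder would), and the function names, stopword list, and sample responses are all hypothetical. It only shows the shape of the pipeline: encode, assign each response to the most similar title, then pick prominent keywords per group.

```python
import math
import re
from collections import Counter

# Hypothetical stopword list, used only for this illustration.
STOPWORDS = frozenset({"the", "a", "is", "was", "too", "and", "i", "of", "to"})


def encode(text):
    """Toy stand-in encoder: a sparse word-count vector.

    Assumption: the paper uses a pre-trained language model here;
    this bag-of-words version only keeps the example self-contained.
    """
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def assign_to_titles(responses, titles):
    """Place each response in the group whose title vector is most similar."""
    title_vecs = {t: encode(t) for t in titles}
    groups = {t: [] for t in titles}
    for response in responses:
        vec = encode(response)
        best = max(titles, key=lambda t: cosine(vec, title_vecs[t]))
        groups[best].append(response)
    return groups


def top_keywords(group, k=3):
    """Return the k non-stopwords that appear in the most responses of a group."""
    counts = Counter()
    for response in group:
        # encode() yields unique words, so each response counts a word once.
        counts.update(w for w in encode(response) if w not in STOPWORDS)
    return [w for w, _ in counts.most_common(k)]


if __name__ == "__main__":
    sample = [
        "The homework workload was too heavy",
        "The lecture pace is too fast",
        "Too much homework every week",
        "I liked the pace of the lecture",
    ]
    grouped = assign_to_titles(sample, ["homework workload", "lecture pace"])
    for title, members in grouped.items():
        print(title, len(members), top_keywords(members, k=1))
```

Swapping `encode` for a real sentence encoder that returns dense vectors would leave `assign_to_titles` structurally unchanged, provided `cosine` is adapted to dense inputs. The optimally tuned branch would replace the title vectors with cluster centroids found by a clustering algorithm.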
