Identifying Semantically Duplicate Questions Using Data Science Approach: A Quora Case Study

Paper Title

Identifying Semantically Duplicate Questions Using Data Science Approach: A Quora Case Study

Authors

Ansari, Navedanjum; Sharma, Rajesh

Abstract

Identifying semantically identical questions on question-and-answer social media platforms such as Quora is important for ensuring that content of adequate quality and quantity is presented to users according to the intent of their question, which in turn enriches the overall user experience. Detecting duplicate questions is challenging because natural language is highly expressive: a single intent can be conveyed using different words, phrases, and sentence structures. Machine learning and deep learning methods are known to achieve better results than traditional natural language processing techniques in identifying similar texts. In this paper, taking Quora as our case study, we explore and apply different machine learning and deep learning techniques to the task of identifying duplicate questions in Quora's dataset. Using feature engineering, feature importance techniques, and experiments with seven selected machine learning classifiers, we demonstrate that our models outperform previous studies on this task. An XGBoost model with character-level term frequency and inverse term frequency features is our best machine learning model, and it also outperforms a few of the deep learning baseline models. We apply deep learning techniques to build four different multi-layer deep neural networks composed of GloVe embeddings, Long Short-Term Memory, convolution, max pooling, dense, batch normalization, and activation layers, together with model merging. Our deep learning models achieve better accuracy than our machine learning models. Three of the four proposed architectures exceed the accuracy reported in previous machine learning and deep learning research, two of the four exceed the accuracy of a previous deep learning study on Quora's question pair dataset, and our best model achieves an accuracy of 85.82%, which is close to Quora's state-of-the-art accuracy.
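
The character-level features mentioned in the abstract can be illustrated with a small sketch. It assumes that "term frequency and inverse term frequency" refers to a TF-IDF weighting over character n-grams; the abstract does not give the n-gram size, the weighting formula, or how the features are fed to XGBoost, so the trigram size, the smoothed IDF, and every function name below are hypothetical choices for illustration only. The sketch uses the Python standard library, scores a question pair by cosine similarity, and trains no classifier.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of a lowercased string."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def fit_idf(corpus, n=3):
    """Compute a smoothed inverse document frequency per character n-gram."""
    doc_count = len(corpus)
    doc_freq = Counter()
    for doc in corpus:
        # Count each n-gram at most once per document.
        doc_freq.update(set(char_ngrams(doc, n)))
    return {
        gram: math.log((1 + doc_count) / (1 + df)) + 1.0
        for gram, df in doc_freq.items()
    }


def tfidf_vector(text, idf, n=3):
    """Map a string to a sparse TF-IDF vector, skipping unseen n-grams."""
    counts = Counter(char_ngrams(text, n))
    return {gram: c * idf[gram] for gram, c in counts.items() if gram in idf}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[gram] for gram, w in u.items() if gram in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy questions invented for this example (not taken from the Quora dataset).
questions = [
    "How do I learn Python quickly?",
    "What is the fastest way to learn Python?",
    "How can I lose weight fast?",
]

idf = fit_idf(questions)
vectors = [tfidf_vector(q, idf) for q in questions]

sim_related = cosine(vectors[0], vectors[1])    # paraphrase-like pair
sim_unrelated = cosine(vectors[0], vectors[2])  # different intent
print(sim_related > sim_unrelated)
```

In a setup like the one the abstract describes, vectors of this kind (or similarity scores derived from them) would be inputs to a classifier, not thresholded directly. Character n-grams tend to tolerate spelling variants and inflections better than whole-word tokens.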
