Plagiarism Detection in the Bengali Language: A Text Similarity-Based Approach

Paper Title

Plagiarism Detection in the Bengali Language: A Text Similarity-Based Approach

Authors

Ghosh, Satyajit, Ghosh, Aniruddha, Ghosh, Bittaswer, Roy, Abhishek

Abstract

Plagiarism means taking another person's work without giving them credit for it. It is one of the most serious problems in academia and among researchers. Multiple tools are available to detect plagiarism in a document, but most of them are domain-specific and designed to work on English texts, whereas plagiarism is not limited to a single language. Bengali is the most widely spoken language of Bangladesh and the second most spoken language in India, with 300 million native speakers and 37 million second-language speakers. Plagiarism detection requires a large corpus for comparison. Bengali literature has a history of 1300 years, and most Bengali literature books have not yet been properly digitized. Because no suitable corpus existed for our purpose, we collected Bengali literature books from the National Digital Library of India, extracted their text with a comprehensive methodology, and constructed our own corpus. Our experimental results show an average accuracy between 72.10% and 79.89% in text extraction using OCR. The Levenshtein distance algorithm is used to determine plagiarism. We have built a web application for end users and successfully tested it for plagiarism detection in Bengali texts. In the future, we aim to construct a corpus with more books for more accurate detection.
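The abstract names the Levenshtein distance as the similarity measure but gives no implementation details. The following is a minimal Python sketch of how such a comparison could work, not the authors' code. The function names `levenshtein` and `similarity` are hypothetical, and normalizing by the longer string's length is an assumption made here for illustration. Note also that Python compares Unicode code points, so a Bengali consonant with a vowel sign counts as two symbols.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-code-point insertions, deletions,
    and substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop to save memory
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into "" takes i deletions
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalization: 1.0 for identical strings, 0.0 for
    strings that share nothing position-wise."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein(a, b) / longest


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))  # 3
    print(similarity("বই", "বইটি"))
```

A detector built on this idea would compare a submitted passage against each passage in the corpus and flag pairs whose similarity exceeds some threshold. The threshold the authors used, if any, is not stated in the abstract.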
