iNLTK: Natural Language Toolkit for Indic Languages

Paper Title

iNLTK: Natural Language Toolkit for Indic Languages

Author

Arora, Gaurav

Abstract

We present iNLTK, an open-source NLP library consisting of pre-trained language models and out-of-the-box support for Data Augmentation, Textual Similarity, Sentence Embeddings, Word Embeddings, Tokenization and Text Generation in 13 Indic Languages. By using pre-trained models from iNLTK for text classification on publicly available datasets, we significantly outperform previously reported results. On these datasets, we also show that by using pre-trained models and data augmentation from iNLTK, we can achieve more than 95% of the previous best performance by using less than 10% of the training data. iNLTK is already being widely used by the community and has 40,000+ downloads, 600+ stars and 100+ forks on GitHub. The library is available at https://github.com/goru001/inltk.
