Title

iNLTK: Natural Language Toolkit for Indic Languages

Author

Arora, Gaurav

Abstract

We present iNLTK, an open-source NLP library consisting of pre-trained language models and out-of-the-box support for Data Augmentation, Textual Similarity, Sentence Embeddings, Word Embeddings, Tokenization and Text Generation in 13 Indic Languages. By using pre-trained models from iNLTK for text classification on publicly available datasets, we significantly outperform previously reported results. On these datasets, we also show that by using pre-trained models and data augmentation from iNLTK, we can achieve more than 95% of the previous best performance by using less than 10% of the training data. iNLTK is already being widely used by the community and has 40,000+ downloads, 600+ stars and 100+ forks on GitHub. The library is available at https://github.com/goru001/inltk.
