Paper Title
L3Cube-MahaCorpus and MahaBERT: Marathi Monolingual Corpus, Marathi BERT Language Models, and Resources
Paper Authors
Paper Abstract
We present L3Cube-MahaCorpus, a Marathi monolingual dataset scraped from different internet sources. We expand the existing Marathi monolingual corpus with 24.8M sentences and 289M tokens. We further present MahaBERT, MahaAlBERT, and MahaRoBerta, all BERT-based masked language models, and MahaFT, fastText word embeddings, all trained on the full Marathi corpus with 752M tokens. We show the effectiveness of these resources on downstream Marathi sentiment analysis, text classification, and named entity recognition (NER) tasks. We also release MahaGPT, a generative Marathi GPT model trained on the Marathi corpus. Marathi is a popular language in India but still lacks these resources. This work is a step forward in building open resources for the Marathi language. The data and models are available at https://github.com/l3cube-pune/MarathiNLP.
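
As a minimal sketch of how the released masked language models might be used, the snippet below loads a model through the Hugging Face transformers library. The model ID "l3cube-pune/marathi-bert" and the example sentence are assumptions not stated in the abstract; consult the MarathiNLP repository for the exact model identifiers.

    from transformers import pipeline

    # Load a fill-mask pipeline; the model ID below is an assumed Hugging Face
    # identifier (see https://github.com/l3cube-pune/MarathiNLP for exact IDs).
    fill_mask = pipeline("fill-mask", model="l3cube-pune/marathi-bert")

    # Predict the masked token in an illustrative Marathi sentence
    # ("I am going to [MASK] today.").
    for prediction in fill_mask("मी आज [MASK] जाणार आहे."):
        print(prediction["token_str"], prediction["score"])

The same pipeline call works for any of the BERT-based variants (MahaBERT, MahaAlBERT, MahaRoBerta) by swapping the model ID, though RoBERTa-style models typically use a different mask token.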