Paper Title

L3Cube-HindBERT and DevBERT: Pre-Trained BERT Transformer models for Devanagari based Hindi and Marathi Languages

Paper Authors

Joshi, Raviraj

Paper Abstract

The monolingual Hindi BERT models currently available on the model hub do not perform better than the multi-lingual models on downstream tasks. We present L3Cube-HindBERT, a Hindi BERT model pre-trained on a Hindi monolingual corpus. Further, since the Indic languages Hindi and Marathi share the Devanagari script, we train a single model for both languages. We release DevBERT, a Devanagari BERT model trained on both Marathi and Hindi monolingual datasets. We evaluate these models on downstream Hindi and Marathi text classification and named entity recognition tasks. The HindBERT- and DevBERT-based models show significant improvements over the multi-lingual MuRIL, IndicBERT, and XLM-R. Based on these observations, we also release monolingual BERT models for other Indic languages: Kannada, Telugu, Malayalam, Tamil, Gujarati, Assamese, Odia, Bengali, and Punjabi. These models are shared at https://huggingface.co/l3cube-pune.
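
The released checkpoints can be loaded with the Hugging Face transformers library. The snippet below is a minimal sketch of obtaining contextual embeddings from one of them; the specific checkpoint name "l3cube-pune/hindi-bert-v2" is an assumption, so check the organization page linked above for the exact identifiers.

```python
# Minimal sketch: load a released checkpoint and extract contextual
# embeddings with Hugging Face transformers (requires torch installed).
from transformers import AutoModel, AutoTokenizer

# Assumed checkpoint name; see https://huggingface.co/l3cube-pune
# for the actual list of released models.
model_id = "l3cube-pune/hindi-bert-v2"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

# Encode a Devanagari sentence and run it through the encoder.
inputs = tokenizer("नमस्ते दुनिया", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, seq_len, hidden_size)
```

The same loading pattern applies to the DevBERT checkpoint and the monolingual models for the other Indic languages, substituting the corresponding model identifier.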
