Title


L3Cube-MahaSBERT and HindSBERT: Sentence BERT Models and Benchmarking BERT Sentence Representations for Hindi and Marathi

Authors

Ananya Joshi, Aditi Kajale, Janhavi Gadre, Samruddhi Deode, Raviraj Joshi

Abstract


Sentence representations from vanilla BERT models do not work well on sentence similarity tasks. Sentence-BERT models specifically trained on STS or NLI datasets are shown to provide state-of-the-art performance. However, building these models for low-resource languages is not straightforward due to the lack of these specialized datasets. This work focuses on two low-resource Indian languages, Hindi and Marathi. We train sentence-BERT models for these languages using synthetic NLI and STS datasets prepared using machine translation. We show that the strategy of NLI pre-training followed by STSb fine-tuning is effective in generating high-performance sentence-similarity models for Hindi and Marathi. The vanilla BERT models trained using this simple strategy outperform the multilingual LaBSE trained using a complex training strategy. These models are evaluated on downstream text classification and similarity tasks. We evaluate these models on real text classification datasets to show that embeddings obtained from synthetic-data training generalize to real datasets as well, and thus represent an effective training strategy for low-resource languages. We also provide a comparative analysis of sentence embeddings from FastText models, multilingual BERT models (mBERT, IndicBERT, XLM-RoBERTa, MuRIL), multilingual sentence embedding models (LASER, LaBSE), and monolingual BERT models based on L3Cube-MahaBERT and HindBERT. We release L3Cube-MahaSBERT and HindSBERT, the state-of-the-art sentence-BERT models for Marathi and Hindi respectively. Our work also serves as a guide to building low-resource sentence embedding models.
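The sentence-similarity evaluation the abstract describes comes down to comparing fixed-size sentence embeddings, typically by cosine similarity. As a minimal, self-contained sketch of that comparison step (plain NumPy, with toy low-dimensional vectors standing in for real 768-dimensional SBERT embeddings; the vectors and variable names are illustrative, not from the paper):

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two embedding vectors:
    # 1.0 means identical direction, 0.0 means orthogonal.
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-dimensional "sentence embeddings" (hypothetical values;
# a real sentence-BERT model would produce these from text).
emb_anchor     = [0.20, 0.70, 0.10, 0.40]
emb_paraphrase = [0.25, 0.65, 0.05, 0.45]   # similar direction -> high score
emb_unrelated  = [0.90, -0.30, 0.80, -0.60] # different direction -> low score

print(cosine_similarity(emb_anchor, emb_paraphrase))
print(cosine_similarity(emb_anchor, emb_unrelated))
```

On STS-style benchmarks, these per-pair cosine scores are correlated (e.g. via Spearman correlation) against human similarity judgments to rate an embedding model.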
