TEAM-Atreides at SemEval-2022 Task 11: On leveraging data augmentation and ensemble to recognize complex Named Entities in Bangla

Paper title

TEAM-Atreides at SemEval-2022 Task 11: On leveraging data augmentation and ensemble to recognize complex Named Entities in Bangla

Authors

Nazia Tasnim, Md. Istiak Hossain Shihab, Asif Shahriyar Sushmit, Steven Bethard, Farig Sadeque

Abstract

Many areas, such as the biological and healthcare domains, artistic works, and organization names, have nested, overlapping, discontinuous entity mentions that may even be syntactically or semantically ambiguous in practice. Traditional sequence tagging algorithms are unable to recognize these complex mentions because they may violate the assumptions upon which sequence tagging schemes are founded. In this paper, we describe our contribution to SemEval 2022 Task 11 on identifying such complex Named Entities. We have leveraged an ensemble of multiple ELECTRA-based models that were exclusively pretrained on the Bangla language with the performance of ELECTRA-based models pretrained on English to achieve competitive performance on Track 11. Besides providing a system description, we will also present the outcomes of our experiments on architectural decisions, dataset augmentations, and post-competition findings.
