InDEX: Indonesian Idiom and Expression Dataset for Cloze Test

Paper Title

InDEX: Indonesian Idiom and Expression Dataset for Cloze Test

Authors

Qiu, Xinying; Shi, Guofeng

Abstract

We propose InDEX, an Indonesian Idiom and Expression dataset for cloze tests. The dataset contains 10,438 unique sentences for 289 idioms and expressions, for which we generate 15 different types of distractors, resulting in a large cloze-style corpus. Many baseline models for cloze-test reading comprehension apply BERT with random initialization to learn embedding representations. Idioms and fixed expressions differ, however, in that the literal meaning of a phrase may or may not be consistent with its contextual meaning. We therefore explore different ways of combining static and contextual representations to build a stronger baseline model. Experiments show that combining definitions with random initialization better supports cloze-test model performance for idioms, whether evaluated independently or mixed with fixed expressions. For fixed expressions with no special meaning, static embeddings with random initialization are sufficient for the cloze-test model.

Paper Title