Paper Title

SkillSpan: Hard and Soft Skill Extraction from English Job Postings

Paper Authors

Zhang, Mike, Jensen, Kristian Nørgaard, Sonniks, Sif Dam, Plank, Barbara

Paper Abstract

Skill Extraction (SE) is an important and widely-studied task useful to gain insights into labor market dynamics. However, there is a lacuna of datasets and annotation guidelines; available datasets are few and contain crowd-sourced labels on the span-level or labels from a predefined skill inventory. To address this gap, we introduce SKILLSPAN, a novel SE dataset consisting of 14.5K sentences and over 12.5K annotated spans. We release its respective guidelines created over three different sources annotated for hard and soft skills by domain experts. We introduce a BERT baseline (Devlin et al., 2019). To improve upon this baseline, we experiment with language models that are optimized for long spans (Joshi et al., 2020; Beltagy et al., 2020), continuous pre-training on the job posting domain (Han and Eisenstein, 2019; Gururangan et al., 2020), and multi-task learning (Caruana, 1997). Our results show that the domain-adapted models significantly outperform their non-adapted counterparts, and single-task outperforms multi-task learning.
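The abstract frames SE as annotating spans in job-posting sentences, which in the BERT baseline amounts to token-level sequence labeling. A minimal, illustrative sketch of the decoding step — turning a predicted BIO tag sequence back into labeled spans — is shown below; the tag names (`SKILL`, `KNOWLEDGE`) are assumptions for illustration, not necessarily the dataset's exact label set:

```python
def bio_to_spans(tags):
    """Decode a BIO tag sequence into (label, start, end) spans, end-exclusive."""
    spans, start, label = [], None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            if start is not None:          # close any open span
                spans.append((label, start, i))
            start, label = i, tag[2:]      # open a new span
        elif tag.startswith("I-") and start is not None and tag[2:] == label:
            continue                       # extend the current span
        else:
            if start is not None:          # "O" or inconsistent "I-": close span
                spans.append((label, start, i))
            start, label = None, None
    if start is not None:                  # span running to end of sentence
        spans.append((label, start, len(tags)))
    return spans

tags = ["O", "B-SKILL", "I-SKILL", "O", "B-KNOWLEDGE"]
print(bio_to_spans(tags))  # [('SKILL', 1, 3), ('KNOWLEDGE', 4, 5)]
```

Span-level predictions decoded this way are what span-F1 evaluation compares against the gold annotations.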
