Paper Title

FiNER: Financial Numeric Entity Recognition for XBRL Tagging

Paper Authors

Lefteris Loukas, Manos Fergadiotis, Ilias Chalkidis, Eirini Spyropoulou, Prodromos Malakasiotis, Ion Androutsopoulos, Georgios Paliouras

Paper Abstract

Publicly traded companies are required to submit periodic reports with eXtensive Business Reporting Language (XBRL) word-level tags. Manually tagging the reports is tedious and costly. We, therefore, introduce XBRL tagging as a new entity extraction task for the financial domain and release FiNER-139, a dataset of 1.1M sentences with gold XBRL tags. Unlike typical entity extraction datasets, FiNER-139 uses a much larger label set of 139 entity types. Most annotated tokens are numeric, with the correct tag per token depending mostly on context, rather than the token itself. We show that subword fragmentation of numeric expressions harms BERT's performance, allowing word-level BILSTMs to perform better. To improve BERT's performance, we propose two simple and effective solutions that replace numeric expressions with pseudo-tokens reflecting original token shapes and numeric magnitudes. We also experiment with FIN-BERT, an existing BERT model for the financial domain, and release our own BERT (SEC-BERT), pre-trained on financial filings, which performs best. Through data and error analysis, we finally identify possible limitations to inspire future work on XBRL tagging.
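To make the pseudo-token idea concrete, below is a minimal Python sketch of the two replacement schemes mentioned in the abstract (token shape and numeric magnitude). The exact pseudo-token formats, the numeric-expression matcher, and the function names here are illustrative assumptions, not the paper's implementation.

```python
import re

# Simple matcher for numeric expressions such as "5,000.25" or "2021" (assumption).
NUM_RE = re.compile(r"^\d[\d,]*(\.\d+)?$")

def shape_token(token: str) -> str:
    """Shape scheme: replace each digit with 'X', keeping commas/periods,
    e.g. '5,000.25' -> '[X,XXX.XX]'."""
    return "[" + re.sub(r"\d", "X", token) + "]"

def magnitude_token(token: str) -> str:
    """Magnitude scheme: encode only the number of integer digits,
    e.g. '5,000.25' -> '[NUM4]'."""
    digits = len(re.sub(r"\D", "", token.split(".")[0]))
    return f"[NUM{digits}]"

def pseudo_tokenize(tokens, scheme="shape"):
    """Replace numeric expressions with pseudo-tokens; leave other tokens unchanged."""
    replace = shape_token if scheme == "shape" else magnitude_token
    return [replace(t) if NUM_RE.match(t) else t for t in tokens]

if __name__ == "__main__":
    sent = "Revenue increased by 5,000.25 million in 2021 .".split()
    print(pseudo_tokenize(sent, scheme="shape"))
    # ['Revenue', 'increased', 'by', '[X,XXX.XX]', 'million', 'in', '[XXXX]', '.']
    print(pseudo_tokenize(sent, scheme="magnitude"))
    # ['Revenue', 'increased', 'by', '[NUM4]', 'million', 'in', '[NUM4]', '.']
```

Either scheme keeps numeric tokens as single pieces in BERT's vocabulary, which is the motivation given in the abstract for avoiding subword fragmentation of numbers.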
