Paper Title

A Pipeline for Analysing Grant Applications

Authors

Shuaiqun Pan, Sergio J. Rodríguez Méndez, Kerry Taylor

Abstract

Data mining techniques can transform massive amounts of unstructured data into quantitative data that quickly reveal insights, trends, and patterns behind the original data. In this paper, a data mining model is applied to analyse the 2019 grant applications submitted to an Australian Government research funding agency, to investigate whether grant schemes successfully identify innovative project proposals, as intended. The grant applications are peer-reviewed research proposals that include specific "innovation and creativity" (IC) scores assigned by reviewers. In addition to predicting the IC score for each research proposal, we are particularly interested in understanding the vocabulary of innovative proposals. To address this problem, various data mining models and feature encoding algorithms are studied and explored. As a result, we propose the best-performing model: a Random Forest (RF) classifier over documents encoded with features denoting the presence or absence of unigrams. Specifically, the unigram terms are encoded by a modified Term Frequency-Inverse Document Frequency (TF-IDF) algorithm that implements only the IDF part of TF-IDF. Besides the proposed model, this paper also presents a rigorous experimental pipeline for analysing grant applications, and the experimental results demonstrate its feasibility.
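As an illustration of the encoding the abstract describes, the sketch below shows one way an "IDF-only" unigram representation could look: each feature is the unigram's IDF weight if the term is present in the document and zero otherwise, so repeated occurrences of a term do not change its value. The abstract does not give the exact formula, so the smoothed IDF variant, the tokenizer, the helper names (`tokenize`, `fit_idf`, `encode`), and the toy documents are all assumptions made for this example, not the authors' implementation.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphabetic unigrams (a simplifying assumption)."""
    return re.findall(r"[a-z]+", text.lower())


def fit_idf(documents):
    """Compute an IDF weight per unigram from document frequencies."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        # Count each term once per document, regardless of repetitions.
        doc_freq.update(set(tokenize(doc)))
    # Assumed smoothed variant: ln((1 + N) / (1 + df)) + 1.
    return {
        term: math.log((1 + n_docs) / (1 + df)) + 1.0
        for term, df in doc_freq.items()
    }


def encode(document, idf, vocabulary):
    """Presence/absence encoding: IDF weight if the unigram occurs, else 0."""
    present = set(tokenize(document))
    return [idf[term] if term in present else 0.0 for term in vocabulary]


# Hypothetical toy corpus, invented for illustration only.
docs = [
    "novel method for crop disease detection",
    "novel novel novel sensor for soil monitoring",
    "routine survey of soil samples",
]
idf = fit_idf(docs)
vocabulary = sorted(idf)
vectors = [encode(d, idf, vocabulary) for d in docs]
```

Vectors of this form could then be passed to any Random Forest implementation along with the IC labels. The point of dropping the term-frequency part is that a proposal repeating "novel" three times gets the same feature value as one that uses it once, while rarer terms still receive larger weights than common ones.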
