Paper Title

Beyond Accuracy: ROI-driven Data Analytics of Empirical Data

Authors

Gouri Deshpande, Guenther Ruhe

Abstract

This vision paper demonstrates that it is crucial to consider Return-on-Investment (ROI) when performing Data Analytics (DA). The question "How much analytics is needed?" is hard to answer. ROI can support decisions on the What?, How?, and How Much? of analytics for a given problem. Method: The proposed conceptual framework is validated through two empirical studies that focus on requirements dependency extraction in the Mozilla Firefox project. The two case studies are (i) an evaluation of fine-tuned BERT against Naive Bayes and Random Forest machine learners for binary dependency classification, and (ii) a comparison of Active Learning against passive learning (random sampling) for REQUIRES dependency extraction. For both cases, the analysis investment (cost) is estimated and the achievable benefit from DA is predicted, in order to determine the break-even point of the investigation. Results: In the first study, fine-tuned BERT outperformed Random Forest, provided that more than 40% of the training data was available. In the second, Active Learning achieved a higher F1 score within fewer iterations, and a higher ROI, than the baseline (a Random Forest classifier based on random sampling). In both studies, the break-even point indicated how much analysis would likely pay off for the invested effort. Conclusions: Decisions about the depth and breadth of DA of empirical data should not be based on accuracy measures alone. ROI-driven Data Analytics provides a simple yet effective way to decide when to stop further investigation while accounting for the cost and value of the various types of analysis, and so helps to avoid over-analyzing empirical data.
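
The break-even idea in the abstract can be illustrated with a minimal sketch. It assumes the common definition ROI = (benefit - cost) / cost, which the abstract itself does not spell out, and it treats each analysis step (for example, one Active Learning iteration) as having an estimated cost and benefit in the same unit. The function names `roi` and `break_even_index` are hypothetical and are not taken from the paper.

```python
from typing import Optional, Sequence


def roi(benefit: float, cost: float) -> float:
    """Return-on-Investment, assumed here to be (benefit - cost) / cost."""
    if cost <= 0:
        raise ValueError("cost must be positive")
    return (benefit - cost) / cost


def break_even_index(
    benefits: Sequence[float], costs: Sequence[float]
) -> Optional[int]:
    """Index of the first step at which cumulative benefit covers cumulative cost.

    Returns None if the investment never pays off within the given steps.
    """
    if len(benefits) != len(costs):
        raise ValueError("benefits and costs must have the same length")
    total_benefit = 0.0
    total_cost = 0.0
    for i, (step_benefit, step_cost) in enumerate(zip(benefits, costs)):
        total_benefit += step_benefit
        total_cost += step_cost
        if total_benefit >= total_cost:
            return i
    return None


if __name__ == "__main__":
    # Hypothetical per-step estimates, for illustration only (not from the paper).
    step_costs = [10.0, 10.0, 10.0, 10.0]
    step_benefits = [2.0, 8.0, 15.0, 20.0]
    print("break-even step:", break_even_index(step_benefits, step_costs))
    print("overall ROI:", roi(sum(step_benefits), sum(step_costs)))
```

With these made-up numbers, cumulative benefit first covers cumulative cost at the fourth step (index 3). In this toy setting, stopping earlier would leave the investment unrecovered.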
