Paper Title
Optimizing Data Collection for Machine Learning
Paper Authors
Abstract
Modern deep learning systems require huge data sets to achieve impressive performance, but there is little guidance on how much or what kind of data to collect. Over-collecting data incurs unnecessary present costs, while under-collecting may incur future costs and delay workflows. We propose a new paradigm for modeling the data collection workflow as a formal optimal data collection problem that allows designers to specify performance targets, collection costs, a time horizon, and penalties for failing to meet the targets. Additionally, this formulation generalizes to tasks requiring multiple data sources, such as labeled and unlabeled data used in semi-supervised learning. To solve our problem, we develop Learn-Optimize-Collect (LOC), which minimizes expected future collection costs. Finally, we numerically compare our framework to the conventional baseline of estimating data requirements by extrapolating from neural scaling laws. We significantly reduce the risks of failing to meet desired performance targets on several classification, segmentation, and detection tasks, while maintaining low total collection costs.
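The baseline mentioned above, extrapolating from neural scaling laws, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the dataset sizes, error values, and target are hypothetical, and we assume the common power-law form error ≈ A · n^(−α), fitted by linear regression in log-log space.

```python
import numpy as np

# Hypothetical learning-curve measurements (illustrative, not from the paper):
# dataset sizes and the validation error observed at each size.
sizes  = np.array([1000., 2000., 4000., 8000., 16000.])
errors = np.array([0.38, 0.32, 0.27, 0.23, 0.19])

# Neural scaling laws posit error ~ A * n^(-alpha), which is linear in
# log-log space: log(error) = log(A) - alpha * log(n). Fit with least squares.
slope, log_A = np.polyfit(np.log(sizes), np.log(errors), 1)
A, alpha = np.exp(log_A), -slope

# Extrapolate: invert the fitted law to estimate how many samples are
# needed to reach a hypothetical target error.
target_error = 0.10
n_needed = (A / target_error) ** (1.0 / alpha)
print(f"fitted alpha = {alpha:.3f}, estimated samples needed = {n_needed:.0f}")
```

As the abstract notes, relying on a single extrapolation like this risks under-collecting when the fitted curve is optimistic; the paper's LOC approach instead minimizes expected future collection cost under that uncertainty.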