Paper Title

Metrics of calibration for probabilistic predictions

Paper Authors

Imanol Arrieta-Ibarra, Paman Gujral, Jonathan Tannen, Mark Tygert, Cherie Xu

Abstract

Predictions are often probabilities; e.g., a prediction could be for precipitation tomorrow, but with only a 30% chance. Given such probabilistic predictions together with the actual outcomes, "reliability diagrams" help detect and diagnose statistically significant discrepancies -- so-called "miscalibration" -- between the predictions and the outcomes. The canonical reliability diagrams histogram the observed and expected values of the predictions; replacing the hard histogram binning with soft kernel density estimation is another common practice. But, which widths of bins or kernels are best? Plots of the cumulative differences between the observed and expected values largely avoid this question, by displaying miscalibration directly as the slopes of secant lines for the graphs. Slope is easy to perceive with quantitative precision, even when the constant offsets of the secant lines are irrelevant; there is no need to bin or perform kernel density estimation. The existing standard metrics of miscalibration each summarize a reliability diagram as a single scalar statistic. The cumulative plots naturally lead to scalar metrics for the deviation of the graph of cumulative differences away from zero; good calibration corresponds to a horizontal, flat graph which deviates little from zero. The cumulative approach is currently unconventional, yet offers many favorable statistical properties, guaranteed via mathematical theory backed by rigorous proofs and illustrative numerical examples. In particular, metrics based on binning or kernel density estimation unavoidably must trade off statistical confidence for the ability to resolve variations as a function of the predicted probability, or vice versa. Widening the bins or kernels averages away random noise while giving up some resolving power. Narrowing the bins or kernels enhances resolving power while not averaging away as much noise.
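The cumulative-differences construction described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' reference implementation: the function name, the exact 1/n normalization, and the choice of scalar summaries (a maximum-absolute-deviation statistic and a range statistic, in the spirit of Kolmogorov-Smirnov and Kuiper) are assumptions made here for concreteness. Observations are sorted by predicted probability, and the graph accumulates the differences between observed outcomes and predicted probabilities; miscalibration over a range of probabilities then shows up as a sloped stretch of the graph, while good calibration keeps the graph near zero.

```python
import numpy as np

def cumulative_miscalibration(probs, outcomes):
    """Sketch of the cumulative-differences plot and scalar summaries.

    probs    : predicted probabilities in [0, 1]
    outcomes : actual outcomes in {0, 1}
    Returns (x, cumdiff, max_abs, rng) where cumdiff is the cumulative
    graph plotted against x, max_abs is its maximum absolute deviation
    from zero, and rng is the range of its deviations.
    """
    probs = np.asarray(probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    # Order the observations by predicted probability, so the horizontal
    # axis of the plot sweeps through the predictions from low to high.
    order = np.argsort(probs, kind="stable")
    n = len(probs)
    # Accumulate the normalized differences between observed and
    # expected values; no binning or kernel density estimation needed.
    cumdiff = np.cumsum(outcomes[order] - probs[order]) / n
    x = np.arange(1, n + 1) / n
    max_abs = float(np.max(np.abs(cumdiff)))          # KS-like statistic
    rng = float(np.max(cumdiff) - np.min(cumdiff))    # Kuiper-like statistic
    return x, cumdiff, max_abs, rng
```

For a perfectly calibrated degenerate case (each outcome exactly equals its predicted probability) the graph is identically zero, and both scalar summaries vanish; systematic over- or under-prediction instead produces a sustained downward or upward slope.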
