Should data ever be thrown away? Pooling interval-censored data sets with different precision

Paper title

Should data ever be thrown away? Pooling interval-censored data sets with different precision

Authors

Krasymyr Tretiak, Scott Ferson

Abstract

Data quality is an important consideration in many engineering applications and projects. Data collection procedures do not always involve careful utilization of the most precise instruments and strictest protocols. As a consequence, data are invariably affected by imprecision and sometimes sharply varying levels of quality of the data. Different mathematical representations of imprecision have been suggested, including a classical approach to censored data which is considered optimal when the proposed error model is correct, and a weaker approach called interval statistics based on partial identification that makes fewer assumptions. Maximizing the quality of statistical results is often crucial to the success of many engineering projects, and a natural question that arises is whether data of differing qualities should be pooled together or we should include only precise measurements and disregard imprecise data. Some worry that combining precise and imprecise measurements can depreciate the overall quality of the pooled data. Some fear that excluding data of lesser precision can increase their overall uncertainty about results because lower sample size implies more sampling uncertainty. This paper explores these concerns and describes simulation results that show when it is advisable to combine fairly precise data with rather imprecise data by comparing analyses using different mathematical representations of imprecision. Pooling data sets is preferred when the low-quality data set does not exceed a certain level of uncertainty. However, so long as the data are random, it may be legitimate to reject the low-quality data if its reduction of sampling uncertainty does not counterbalance the effect of its imprecision on the overall uncertainty.
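The trade-off described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' method: it assumes a simple heuristic in which the total uncertainty about the mean is the width of the interval mean (imprecision) plus a normal-theory sampling margin computed from interval midpoints. The function names, the Gaussian population, and the sample sizes are all hypothetical choices made for illustration.

```python
import math
import random
import statistics


def interval_mean(intervals):
    """Return (lower, upper) bounds on the sample mean of interval data."""
    lows = [lo for lo, _ in intervals]
    highs = [hi for _, hi in intervals]
    return statistics.fmean(lows), statistics.fmean(highs)


def uncertainty_width(intervals, z=1.96):
    """Heuristic total uncertainty about the mean (an assumption of this sketch).

    Adds the width of the interval mean (imprecision) to a two-sided
    normal-theory sampling margin computed from the interval midpoints.
    """
    lo, hi = interval_mean(intervals)
    mids = [(a + b) / 2 for a, b in intervals]
    margin = z * statistics.stdev(mids) / math.sqrt(len(mids))
    return (hi - lo) + 2 * margin


def simulate(n, half_width, rng, mu=10.0, sigma=1.0):
    """Draw n true values and wrap each in an interval that contains it.

    The true value sits at a uniformly random position inside an interval
    of width 2 * half_width, so the midpoint is not the true value.
    """
    data = []
    for _ in range(n):
        x = rng.gauss(mu, sigma)
        centre = x + rng.uniform(-half_width, half_width)
        data.append((centre - half_width, centre + half_width))
    return data


if __name__ == "__main__":
    rng = random.Random(0)
    precise = simulate(20, 0.05, rng)
    for half_width in (0.1, 0.5, 2.0):
        imprecise = simulate(20, half_width, rng)
        pooled = precise + imprecise
        print(
            f"half-width {half_width}: "
            f"precise only = {uncertainty_width(precise):.3f}, "
            f"pooled = {uncertainty_width(pooled):.3f}"
        )
```

Under these assumptions, pooling tends to help while the added intervals are narrow, because the larger sample shrinks the sampling margin. Once the intervals become wide enough, their imprecision outweighs that gain.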
