Paper title
Categorical exploratory data analysis on goodness-of-fit issues
Paper authors
Paper abstract
If the aphorism "All models are wrong" (George Box) continues to hold in data analysis, particularly when analyzing real-world data, then we should annotate this wisdom with visible and explainable data-driven patterns. Such annotations can shed invaluable light on both the validity and the limitations of statistical modeling as a data analysis approach. To avoid holding our real data to potentially unattainable or even unrealistic theoretical structures, we propose using the data analysis paradigm called Categorical Exploratory Data Analysis (CEDA). We illustrate the merits of this proposal with two real-world data sets from the perspective of goodness-of-fit. In both data sets, the Normal distribution's bell shape seemingly fits rather well at first glance. We apply CEDA to bring out where and how each data set fits or deviates from the model shape via several important distributional aspects. We also demonstrate that CEDA affords a version of a tree-based p-value, and we compare it with p-values based on traditional statistical approaches. Throughout our data analysis, we invest computational effort in producing graphic displays to illuminate the advantages of using CEDA as a primary way of data analysis in Data Science education.