Paper title
Reflection on modern methods: Good practices for applied statistical learning in epidemiology
Paper authors
Paper abstract
Statistical learning (SL) includes methods that extract knowledge from complex data. SL methods beyond generalized linear models are increasingly being implemented in public health research and epidemiology because they can perform better with complex or high-dimensional data, settings in which traditional statistical methods fail. These novel methods, however, often include random sampling, which may induce variability in results. Best practices in data science can help to ensure robustness. As a case study, we included four SL models that have been applied previously to analyze the relationship between environmental mixtures and health outcomes. We ran each model across 100 initializing values for random number generation, or "seeds," and assessed variability in the resulting estimates and inference. All methods exhibited some seed-dependent variability in results. The degree of variability differed across methods and across exposures of interest. Any SL method reliant on a random seed will exhibit some degree of seed sensitivity. We recommend that researchers repeat their analysis with various seeds as a sensitivity analysis when implementing these methods, to enhance the interpretability and robustness of results.
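The seed sensitivity analysis recommended in the abstract can be sketched as follows. This is a minimal illustration only, not the authors' code or any of the four SL models they studied: the simulated data, the `bagged_slope` estimator (an average of ordinary least squares slopes over bootstrap resamples), and all function names are hypothetical stand-ins chosen because the estimate depends on the random seed. The only elements taken from the abstract are the idea of refitting on the same data across 100 seeds and summarizing the spread of the resulting estimates.

```python
import random
import statistics


def make_data(n=200, data_seed=0):
    """Simulate one fixed dataset (assumption): outcome = 0.5 * exposure + noise."""
    rng = random.Random(data_seed)
    xs = [rng.gauss(0, 1) for _ in range(n)]
    ys = [0.5 * x + rng.gauss(0, 1) for x in xs]
    return xs, ys


def ols_slope(xs, ys):
    """Ordinary least squares slope of ys on xs."""
    mx = statistics.fmean(xs)
    my = statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


def bagged_slope(xs, ys, seed, n_boot=50):
    """Hypothetical seed-dependent estimator: mean OLS slope over bootstrap resamples."""
    rng = random.Random(seed)  # the seed controls which resamples are drawn
    n = len(xs)
    slopes = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        slopes.append(ols_slope([xs[i] for i in idx], [ys[i] for i in idx]))
    return statistics.fmean(slopes)


def seed_sensitivity(xs, ys, seeds):
    """Refit the same estimator on the same data under each seed and summarize the spread."""
    estimates = [bagged_slope(xs, ys, s) for s in seeds]
    return {
        "estimates": estimates,
        "min": min(estimates),
        "max": max(estimates),
        "mean": statistics.fmean(estimates),
        "sd": statistics.stdev(estimates),
    }


# The data are held fixed; only the seed changes between the 100 runs.
xs, ys = make_data()
summary = seed_sensitivity(xs, ys, range(100))
print(f"mean={summary['mean']:.4f} sd={summary['sd']:.4f} "
      f"range=[{summary['min']:.4f}, {summary['max']:.4f}]")
```

Because the dataset is fixed, any spread in `summary["estimates"]` comes from the seed alone. Reporting that spread alongside the point estimate is one way to show readers how much of a result depends on an arbitrary initialization.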