Paper Title
How to Measure Your App: A Couple of Pitfalls and Remedies in Measuring App Performance in Online Controlled Experiments
Paper Authors
Paper Abstract
Effectively measuring, understanding, and improving mobile app performance is of paramount importance for mobile app developers. Across the mobile Internet landscape, companies run online controlled experiments (A/B tests) with thousands of performance metrics in order to understand how app performance causally impacts user retention and to guard against service or app regressions that degrade user experiences. To capture certain characteristics particular to performance metrics, such as enormous observation volume and high skewness in distribution, an industry-standard practice is to construct a performance metric as a quantile over all performance events in control or treatment buckets in A/B tests. In our experience with thousands of A/B tests provided by Snap, we have discovered some pitfalls in this industry-standard way of calculating performance metrics that can lead to unexplained movements in performance metrics and unexpected misalignment with user engagement metrics. In this paper, we discuss two major pitfalls in this industry-standard practice of measuring performance for mobile apps. One arises from strong heterogeneity in both mobile devices and user engagement, and the other arises from self-selection bias caused by post-treatment user engagement changes. To remedy these two pitfalls, we introduce several scalable methods including user-level performance metric calculation and imputation and matching for missing metric values. We have extensively evaluated these methods on both simulation data and real A/B tests, and have deployed them into Snap's in-house experimentation platform.
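The contrast between the industry-standard, event-pooled quantile metric and the user-level calculation mentioned in the abstract can be sketched in a few lines. The snippet below is only an illustration of that distinction on simulated data: the user names, the lognormal latency distributions, and the simple mean-of-per-user-quantiles aggregation are assumptions for exposition, not the exact estimator the paper deploys.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical latency events (ms): one heavy user contributes most events,
# so an event-pooled quantile is dominated by that user's fast device.
users = {
    "heavy_user":   rng.lognormal(mean=5.0, sigma=0.5, size=5000),  # many fast events
    "light_user_1": rng.lognormal(mean=6.0, sigma=0.5, size=50),    # few slow events
    "light_user_2": rng.lognormal(mean=6.2, sigma=0.5, size=30),
}

# Industry-standard practice: one quantile over all events in the bucket.
all_events = np.concatenate(list(users.values()))
event_level_p90 = np.quantile(all_events, 0.90)

# User-level alternative: summarize each user first, then aggregate,
# so every user contributes equally regardless of event volume.
per_user_p90 = np.array([np.quantile(v, 0.90) for v in users.values()])
user_level_p90 = per_user_p90.mean()

print(f"event-pooled p90: {event_level_p90:.1f} ms")
print(f"user-level   p90: {user_level_p90:.1f} ms")
```

Run on this simulated bucket, the two numbers diverge noticeably, which is one way the heterogeneity pitfall described in the abstract can surface: a shift in which users generate events moves the event-pooled quantile even when no individual user's experience changes.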