Title
Multiresolution categorical regression for interpretable cell type annotation
Authors
Abstract
In many categorical response regression applications, the response categories admit a multiresolution structure. That is, subsets of the response categories may naturally be combined into coarser response categories. In such applications, practitioners are often interested in estimating the resolution at which a predictor affects the response category probabilities. In this article, we propose a method for fitting the multinomial logistic regression model in high dimensions that addresses this problem in a unified and data-driven way. In particular, our method allows practitioners to identify which predictors distinguish between coarse categories but not fine categories, which predictors distinguish between fine categories, and which predictors are irrelevant. For model fitting, we propose a scalable algorithm that can be applied when the coarse categories are defined by either overlapping or nonoverlapping sets of fine categories. Statistical properties of our method reveal that it can take advantage of this multiresolution structure in a way existing estimators cannot. We use our method to model cell type probabilities as a function of a cell's gene expression profile (i.e., cell type annotation). Our fitted model provides novel biological insights which may be useful for future automated and manual cell type annotation methodology.
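The three kinds of predictor named in the abstract can be illustrated with a small numerical sketch. This is not the authors' estimator or algorithm. It only shows, under assumed values, what it means in a multinomial logistic model for a predictor to act at the coarse resolution. The coefficient matrix `B`, the grouping `coarse_groups`, and all helper names are hypothetical and chosen for illustration.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Assumed setup: 4 fine categories grouped into 2 nonoverlapping coarse categories.
coarse_groups = [[0, 1], [2, 3]]

# Hypothetical coefficients: rows are predictors, columns are fine categories.
B = np.array([
    [1.5, 1.5, -1.0, -1.0],  # equal within each coarse group: coarse-level effect only
    [0.8, -0.8, 0.3, -0.3],  # varies within groups: fine-level effect
    [0.0, 0.0, 0.0, 0.0],    # all zero: irrelevant predictor
])

def fine_probs(x):
    """Fine-category probabilities under the multinomial logistic model."""
    return softmax(x @ B)

def coarse_probs(x):
    """Coarse-category probabilities: sums of fine probabilities per group."""
    p = fine_probs(x)
    return np.array([p[g].sum() for g in coarse_groups])

def within_group_probs(x, group):
    """Fine-category probabilities conditional on membership in one coarse group."""
    p = fine_probs(x)[group]
    return p / p.sum()

x_base = np.array([0.0, 1.0, 0.0])
x_coarse = np.array([2.0, 1.0, 0.0])  # only the coarse-level predictor changed
x_irrel = np.array([0.0, 1.0, 5.0])   # only the irrelevant predictor changed
```

In this sketch, moving from `x_base` to `x_coarse` shifts probability between the two coarse categories but leaves the conditional probabilities within each coarse category unchanged, because the first row of `B` is constant within each group. Moving from `x_base` to `x_irrel` changes nothing.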