Paper title
Handling highly correlated genes in prediction analysis of genomic studies
Paper authors
Paper abstract
Background: Selecting feature genes to predict phenotypes is a typical task in the analysis of genomics data. Although many general-purpose prediction algorithms have been developed, handling highly correlated genes in a prediction model is still not well addressed. High correlation among genes introduces technical problems, such as multicollinearity, that lead to unreliable prediction models. Furthermore, when a causal gene (one whose variants have an actual biological effect on a phenotype) is highly correlated with other genes, most algorithms select the feature gene from the correlated group in a purely data-driven manner. Because the correlation structure among genes can change substantially when conditions change, a prediction model built on incorrectly selected feature genes is unreliable. We therefore aim to retain the causal biological signal in the prediction process and build a more robust prediction model.
Method: We propose a grouping algorithm that treats highly correlated genes as a group and uses their common pattern to represent the group's biological signal in feature selection. The grouping algorithm can be integrated into existing prediction algorithms to enhance their prediction performance. The proposed grouping method has two advantages. First, using the common pattern of a gene group makes the prediction more robust and reliable when conditions change. Second, it reports whole correlated gene groups as the discovered biomarkers for a prediction task, allowing researchers to conduct follow-up studies to identify the causal genes within the identified groups.
Result: Using real benchmark scRNA-seq datasets with simulated cell phenotypes, we demonstrate that our novel method significantly outperforms standard models in both (1) prediction of cell phenotypes and (2) feature gene selection.
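The grouping idea described in the Method part can be illustrated with a minimal sketch. The abstract does not specify how groups are formed or how the "common pattern" is computed, so everything below is an assumption made for illustration: genes are linked when their Pearson correlation is at least a hypothetical `threshold` of 0.9, groups are the connected components of that link graph, and the common pattern is taken to be the mean of the z-scored expression of the group's genes. The function names `group_correlated_genes` and `group_features` are likewise hypothetical and are not the authors' implementation.

```python
import numpy as np


def group_correlated_genes(X, threshold=0.9):
    """Group the columns (genes) of X, a cells-by-genes matrix.

    Assumption: two genes are linked when their Pearson correlation is
    >= threshold; groups are connected components of the link graph.
    """
    n_genes = X.shape[1]
    corr = np.corrcoef(X, rowvar=False)  # genes-by-genes correlation matrix
    parent = list(range(n_genes))        # union-find forest over gene indices

    def find(i):
        # Follow parent pointers to the root, halving the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n_genes):
        for j in range(i + 1, n_genes):
            if corr[i, j] >= threshold:
                parent[find(i)] = find(j)  # merge the two groups

    groups = {}
    for g in range(n_genes):
        groups.setdefault(find(g), []).append(g)
    return sorted(groups.values())


def group_features(X, groups):
    """Build one feature per group.

    Assumption: the group's common pattern is the mean of the z-scored
    expression of its member genes.
    """
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    return np.column_stack([Z[:, g].mean(axis=1) for g in groups])


# Toy data: genes 0-2 share one underlying signal, genes 3-4 are independent.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
X = np.column_stack([
    base + 0.05 * rng.normal(size=200),
    base + 0.05 * rng.normal(size=200),
    base + 0.05 * rng.normal(size=200),
    rng.normal(size=200),
    rng.normal(size=200),
])

groups = group_correlated_genes(X, threshold=0.9)
F = group_features(X, groups)  # one column per gene group
print(groups)
print(F.shape)
```

The matrix `F` could then be passed to an existing prediction algorithm in place of the raw gene matrix, and any selected column maps back to a whole gene group through `groups`. The pairwise loop is quadratic in the number of genes, so a real scRNA-seq matrix would likely need a more scalable clustering step.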
