Paper title
Unsupervised machine learning framework for discriminating major variants of concern during COVID-19
Authors
Abstract
Due to the high mutation rate of the virus, the COVID-19 pandemic evolved rapidly. Certain variants of the virus, such as Delta and Omicron, emerged with altered viral properties that led to high transmission and death rates. These variants burdened medical systems worldwide and had a major impact on travel, productivity, and the world economy. Unsupervised machine learning methods can compress, characterize, and visualize unlabelled data. This paper presents a framework that uses unsupervised machine learning methods to discriminate between major COVID-19 variants and to visualize the associations among them based on their genome sequences. These methods combine selected dimensionality reduction and clustering techniques. The framework processes the RNA sequences by performing a k-mer analysis on the data, then visualizes and compares the results using selected dimensionality reduction methods: principal component analysis (PCA), t-distributed stochastic neighbour embedding (t-SNE), and uniform manifold approximation and projection (UMAP). Our framework also employs agglomerative hierarchical clustering to visualize, using dendrograms, the mutational differences among major variants of concern and the country-wise mutational differences for selected variants (Delta and Omicron). We find that the proposed framework can effectively distinguish between the major variants and has the potential to identify emerging variants in the future.
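As a rough illustration of two steps named in the abstract, the sketch below turns sequences into k-mer frequency vectors and then applies agglomerative hierarchical clustering, recording the merge order that a dendrogram would display. It uses only the Python standard library. The toy sequences, the function names `kmer_profile` and `agglomerate`, the choice of k = 3, Euclidean distance, and average linkage are all assumptions made for illustration, not details taken from the paper. The dimensionality reduction step (PCA, t-SNE, UMAP) is omitted here.

```python
from itertools import product
from math import dist


def kmer_profile(seq, k=3):
    """Return the normalised k-mer frequency vector of an RNA/DNA sequence."""
    seq = seq.upper().replace("U", "T")  # treat RNA and DNA alphabets alike
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # skip windows with ambiguous bases such as N
            counts[window] += 1
    total = sum(counts.values()) or 1  # avoid division by zero
    return [counts[m] / total for m in kmers]


def agglomerate(profiles):
    """Naive average-linkage agglomerative clustering.

    `profiles` maps a sequence name to its k-mer vector. Returns the merge
    history as (members_a, members_b, distance) tuples, closest pair first.
    """
    clusters = [[name] for name in profiles]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # average linkage: mean pairwise distance between members
                d = sum(dist(profiles[a], profiles[b])
                        for a in clusters[i] for b in clusters[j])
                d /= len(clusters[i]) * len(clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)]
        clusters.append(merged)
    return merges


# Toy sequences (hypothetical, not real viral genomes): seq_b differs from
# seq_a by a single substitution, while seq_c has a different composition.
toy_sequences = {
    "seq_a": "ACGTACGTACGTACGT",
    "seq_b": "ACGTACGTACGAACGT",
    "seq_c": "GGGGCCCCGGGGCCCC",
}
profiles = {name: kmer_profile(s) for name, s in toy_sequences.items()}
merges = agglomerate(profiles)
for left, right, distance in merges:
    print(left, "+", right, f"at distance {distance:.3f}")
```

With these toy inputs, the two near-identical sequences are merged first and the dissimilar one joins last, which mirrors how closely related variants would sit on neighbouring branches of a dendrogram. For real genomes, a library implementation of hierarchical clustering would be a more practical choice than this quadratic loop.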