Paper title
On the verge of life: Distribution of nucleotide sequences in viral RNAs
Paper authors
Paper abstract
The aim of the study is to analyze viruses using parameters obtained from the distributions of nucleotide sequences in viral RNA. To keep the input data homogeneous, we analyze single-stranded RNA viruses only. Two approaches are used to obtain the nucleotide sequences. In the first, chunks of equal length (four nucleotides) are considered. In the second, the whole RNA genome is divided into parts, with adenine or the most frequent nucleotide acting as a "space". Rank-frequency distributions are studied in both cases. Within the first approach, the Pólya and the negative hypergeometric distributions yield the best fit. For the distributions obtained within the second approach, we calculate a set of parameters, including entropy, mean sequence length, and its dispersion. The calculated parameters serve as the basis for the classification of viruses. We observe that proximity of viruses on planes spanned by various pairs of parameters corresponds to related species. In certain cases, such proximity is observed for unrelated species as well, which calls for an expansion of the set of parameters used in the classification. We also observe that the fourth most frequent nucleotide sequences obtained within the second approach differ in nature among human coronaviruses: they consist of different nucleotides for MERS, SARS-CoV, and SARS-CoV-2 but of identical nucleotides for the four other coronaviruses. We expect that our findings will be useful as a supplementary tool in the classification of diseases caused by RNA viruses with respect to severity and contagiousness.
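The two ways of obtaining nucleotide sequences described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several choices are assumptions made only for illustration: the function names (`chunk_counts`, `space_split`, `rank_frequency`, `summary`) are hypothetical, chunks are taken as non-overlapping, empty sequences between adjacent "space" nucleotides are kept as zero-length items, entropy is Shannon entropy with the natural logarithm, and "dispersion" is read as the population variance of sequence lengths.

```python
from collections import Counter
from math import log
from statistics import mean, pvariance
from typing import Dict, List, Optional, Tuple


def chunk_counts(genome: str, k: int = 4) -> Counter:
    """First approach: count non-overlapping chunks of length k.

    Assumption: a trailing chunk shorter than k is dropped.
    """
    return Counter(genome[i:i + k] for i in range(0, len(genome) - k + 1, k))


def space_split(genome: str, space: Optional[str] = None) -> List[str]:
    """Second approach: split the genome at a 'space' nucleotide.

    If no space is given, the most frequent nucleotide is used.
    Assumption: empty pieces between adjacent spaces are kept.
    """
    if space is None:
        space = Counter(genome).most_common(1)[0][0]
    return genome.split(space)


def rank_frequency(items) -> List[Tuple[str, int]]:
    """Return (item, frequency) pairs sorted by decreasing frequency."""
    return Counter(items).most_common()


def summary(sequences: List[str]) -> Dict[str, float]:
    """Entropy of the sequence distribution, mean length, and length variance."""
    counts = Counter(sequences)
    total = sum(counts.values())
    # Shannon entropy with natural logarithm (assumed convention).
    entropy = -sum((c / total) * log(c / total) for c in counts.values())
    lengths = [len(s) for s in sequences]
    return {
        "entropy": entropy,
        "mean_length": mean(lengths),
        "variance": pvariance(lengths),  # assumed meaning of "dispersion"
    }


if __name__ == "__main__":
    toy = "ACGUACGUAAGC"  # toy string, not real viral data
    print(rank_frequency(chunk_counts(toy).elements()))
    print(rank_frequency(space_split(toy, "A")))
    print(summary(space_split(toy, "A")))
```

In this sketch, each virus would be reduced to a few numbers from `summary`, and pairs of those numbers would give the coordinates on the planes mentioned in the abstract. Fitting the Pólya or negative hypergeometric distributions to the rank-frequency data is not covered here.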