Paper title
BioKlustering: a web app for semi-supervised learning of maximally imbalanced genomic data
Paper authors
Paper abstract
Summary: Accurate phenotype prediction from genomic sequences is a highly coveted task in biological and medical research. While machine learning holds the key to accurate prediction in a variety of fields, the complexity of biological data can render many methodologies inapplicable. We introduce BioKlustering, a user-friendly, open-source, and publicly available web app for unsupervised and semi-supervised learning, specialized for cases where sequence alignment and/or experimental phenotyping of all classes is not possible. Among its main advantages, BioKlustering 1) allows for maximally imbalanced settings of partially observed labels, including cases where only one class is observed, which most semi-supervised methods currently prohibit, 2) takes unaligned sequences as input and thus allows learning for widely diverse sequences (impossible to align) such as viruses and bacteria, 3) is easy to use for anyone with little or no programming expertise, and 4) works well with small sample sizes. Availability and Implementation: BioKlustering (https://bioklustering.wid.wisc.edu) is a freely available web app implemented with Django, a Python-based framework, and supports all major browsers. The web app requires no installation and is publicly available and open-source (https://github.com/solislemuslab/bioklustering).
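As an illustration of how learning from unaligned sequences can work in principle, the sketch below turns sequences of different lengths into fixed-length k-mer frequency vectors and compares them with cosine similarity. This is a generic alignment-free representation and is only an assumption here; the abstract does not state which features or models BioKlustering uses. The function names, the choice k=2, and the toy sequences are hypothetical.

```python
from collections import Counter
from itertools import product
from math import sqrt


def kmer_profile(seq, k=2, alphabet="ACGT"):
    """Return normalized k-mer frequencies in a fixed k-mer order.

    Hypothetical helper: sequences of any length map to a vector of
    len(alphabet) ** k entries, so no alignment is required.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy sequences (made up for illustration) with different lengths.
sequences = {
    "at_rich_1": "ATATATTATAATATTA",
    "at_rich_2": "TTATAATATATATTATAAT",
    "gc_rich_1": "GCGCGGCCGCGGGC",
}
profiles = {name: kmer_profile(s) for name, s in sequences.items()}

sim_same = cosine_similarity(profiles["at_rich_1"], profiles["at_rich_2"])
sim_diff = cosine_similarity(profiles["at_rich_1"], profiles["gc_rich_1"])
print(f"AT-rich vs AT-rich: {sim_same:.3f}")
print(f"AT-rich vs GC-rich: {sim_diff:.3f}")
```

Vectors of this kind could then be passed to any clustering or semi-supervised model; which models suit the one-observed-class setting described above is left open here.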