Paper title

A primer on model-guided exploration of fitness landscapes for biological sequence design

Authors

Sam Sinai, Eric D. Kelsic

Abstract

Machine learning methods are increasingly employed to address challenges faced by biologists. One area that will greatly benefit from this cross-pollination is the problem of biological sequence design, which has massive potential for therapeutic applications. However, significant inefficiencies remain in communication between these fields, which result in biologists finding the progress in machine learning inaccessible, and hinder machine learning scientists from contributing to impactful problems in bioengineering. Sequence design can be seen as a search process on a discrete, high-dimensional space, where each sequence is associated with a function. This sequence-to-function map is known as a "Fitness Landscape". Designing a sequence with a particular function is hence a matter of "discovering" such an (often rare) sequence within this space. Today we can build predictive models with good interpolation ability due to impressive progress in the synthesis and testing of biological sequences in large numbers, which enables model training and validation. However, it often remains a challenge to use these models to find useful sequences with the properties that we want. In particular, in this primer we highlight that algorithms for experimental design, what we call "exploration strategies", are a related, yet distinct problem from building good models of sequence-to-function maps. We review advances and insights from current literature -- by no means a complete treatment -- while highlighting desirable features of optimal model-guided exploration, and cover potential pitfalls drawn from our own experience. This primer can serve as a starting point for researchers from different domains who are interested in the problem of searching a sequence space with a model, but are perhaps unaware of approaches that originate outside their field.
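
Illustrative sketch (not from the paper)

The abstract separates two things: a model of the sequence-to-function map, and an "exploration strategy" that decides which sequences to test next. The toy Python loop below may help make that split concrete. Everything in it is an assumption made for illustration: the hidden `TARGET`, the `oracle` that stands in for a lab measurement, the crude site-wise averaging model, and the single-mutant proposal rule are hypothetical and are not the authors' method.

```python
import random

ALPHABET = "ACGT"
TARGET = "GATTACAGATTACA"  # hidden optimum of a toy landscape (assumption)


def oracle(seq):
    """Stand-in for an experiment: fraction of positions matching TARGET."""
    return sum(a == b for a, b in zip(seq, TARGET)) / len(TARGET)


def fit_additive_model(measured):
    """Crude model: mean measured fitness for each (position, letter) pair."""
    sums, counts = {}, {}
    for seq, y in measured.items():
        for key in enumerate(seq):
            sums[key] = sums.get(key, 0.0) + y
            counts[key] = counts.get(key, 0) + 1
    global_mean = sum(measured.values()) / len(measured)

    def predict(seq):
        # Unseen (position, letter) pairs fall back to the global mean.
        scores = [sums[k] / counts[k] if k in counts else global_mean
                  for k in enumerate(seq)]
        return sum(scores) / len(seq)

    return predict


def propose(parents, rng, n_draws):
    """Exploration step: draw random single-site mutants of the parents."""
    out = set()
    for _ in range(n_draws):
        p = rng.choice(parents)
        i = rng.randrange(len(p))
        out.add(p[:i] + rng.choice(ALPHABET) + p[i + 1:])
    return out


def explore(rounds=10, batch=20, seed=0):
    """Alternate between fitting the model and measuring its top-ranked picks."""
    rng = random.Random(seed)
    start = "".join(rng.choice(ALPHABET) for _ in TARGET)
    measured = {start: oracle(start)}
    for _ in range(rounds):
        model = fit_additive_model(measured)
        parents = sorted(measured, key=measured.get, reverse=True)[:5]
        candidates = {c for c in propose(parents, rng, 5 * batch)
                      if c not in measured}
        # Rank by predicted fitness; break ties alphabetically for determinism.
        chosen = sorted(candidates, key=lambda s: (-model(s), s))[:batch]
        for seq in chosen:
            measured[seq] = oracle(seq)  # the "experiment" for this round
    return measured


if __name__ == "__main__":
    results = explore()
    best = max(results, key=results.get)
    print(len(results), "sequences measured; best:", best, results[best])
```

In this sketch the model (`fit_additive_model`) and the exploration strategy (`propose` plus the ranking in `explore`) are separate pieces, so either can be swapped without touching the other, which is the distinction the abstract draws.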
