Paper Title

Data-Efficient Graph Grammar Learning for Molecular Generation

Paper Authors

Minghao Guo, Veronika Thost, Beichen Li, Payel Das, Jie Chen, Wojciech Matusik

Paper Abstract

The problem of molecular generation has received significant attention recently. Existing methods are typically based on deep neural networks and require training on large datasets with tens of thousands of samples. In practice, however, the size of class-specific chemical datasets is usually limited (e.g., dozens of samples) due to labor-intensive experimentation and data collection. This presents a considerable challenge for deep learning generative models to comprehensively describe the molecular design space. Another major challenge is to generate only physically synthesizable molecules. This is a non-trivial task for neural network-based generative models, since the relevant chemical knowledge can only be extracted and generalized from the limited training data. In this work, we propose a data-efficient generative model that can be learned from datasets orders of magnitude smaller than common benchmarks. At the heart of this method is a learnable graph grammar that generates molecules from a sequence of production rules. Without any human assistance, these production rules are automatically constructed from the training data. Furthermore, additional chemical knowledge can be incorporated into the model through further grammar optimization. Our learned graph grammar yields state-of-the-art results on generating high-quality molecules for three monomer datasets that contain only ${\sim}20$ samples each. Our approach also achieves remarkable performance on a challenging polymer generation task with only $117$ training samples, and is competitive against existing methods that use $81$k data points. Code is available at https://github.com/gmh14/data_efficient_grammar.
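To give a feel for what "generating molecules from a sequence of production rules" means, below is a minimal string-rewriting sketch. The paper's grammar operates on molecular graphs and its rules are learned from data; here the nonterminal symbol, the SMILES-like fragments, and the rule set are all made-up illustrations of the general idea, not the paper's learned grammar.

```python
import random

# Hypothetical toy grammar: one nonterminal "X" and a few illustrative
# replacement fragments (some of which contain "X" again, so derivations
# can recurse before terminating).
RULES = {
    "X": ["CC(X)", "c1ccccc1X", "O", "N"],
}

def generate(seed: int, start: str = "X", max_steps: int = 10) -> str:
    """Derive a terminal string by repeatedly applying production rules."""
    rng = random.Random(seed)
    s = start
    for _ in range(max_steps):
        if "X" not in s:
            break
        # Rewrite the leftmost nonterminal with a randomly chosen fragment.
        s = s.replace("X", rng.choice(RULES["X"]), 1)
    # Force termination: rewrite any leftover nonterminals to a terminal atom.
    return s.replace("X", "C")

print(generate(0))
```

Each derivation is a sequence of rule applications, so a grammar with a handful of rules can describe a combinatorially large design space; this is the intuition behind why a learned grammar can be data-efficient compared with training a neural generator directly on samples.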
