Paper Title
Semiempirical Hamiltonians learned from data can have accuracy comparable to Density Functional Theory
Paper Authors
Paper Abstract
Quantum chemistry provides chemists with invaluable information, but the high computational cost limits the size and type of systems that can be studied. Machine learning (ML) has emerged as a means to dramatically lower cost while maintaining high accuracy. However, ML models often sacrifice interpretability by using components, such as the artificial neural networks of deep learning, that function as black boxes. These components impart the flexibility needed to learn from large volumes of data but make it difficult to gain insight into the physical or chemical basis for the predictions. Here, we demonstrate that semiempirical quantum chemical (SEQC) models can learn from large volumes of data without sacrificing interpretability. The SEQC model is that of Density Functional based Tight Binding (DFTB) with fixed atomic orbital energies and interactions that are one-dimensional functions of interatomic distance. This model is trained on ab initio data in a manner analogous to that used to train deep learning models. Using benchmarks that reflect the accuracy of the training data, we show that the resulting model maintains a physically reasonable functional form while achieving an accuracy, relative to coupled cluster energies with a complete basis set extrapolation (CCSD(T)*/CBS), that is comparable to that of density functional theory (DFT). This suggests that trained SEQC models can achieve low computational cost and high accuracy without sacrificing interpretability. Use of a physically motivated model form also substantially reduces the amount of ab initio data needed to train the model compared to that required for deep learning models.
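The core idea above (interactions represented as one-dimensional functions of interatomic distance, with parameters fit to ab initio reference data by gradient-based optimization) can be illustrated with a deliberately simplified toy sketch. This is not the paper's actual DFTB parameterization or code: the Gaussian basis, the synthetic exponential "reference" energies, and all numerical settings below are assumptions chosen only to make the fitting procedure concrete.

```python
import numpy as np

# Toy stand-in for one DFTB-style pair interaction h(r): a linear
# combination of Gaussian basis functions of the interatomic distance r,
# with coefficients fit to reference pair energies by plain gradient
# descent -- the same spirit as training a deep learning model on
# ab initio data, but the learned object stays a 1D function of r.

r = np.linspace(0.8, 3.0, 40)            # interatomic distances (toy units)
E_ref = 5.0 * np.exp(-1.7 * r)           # synthetic "ab initio" pair energies

centers = np.linspace(0.8, 3.0, 6)       # Gaussian centers along r
width = 0.4
# Design matrix: one Gaussian feature per center, evaluated at each r.
Phi = np.exp(-0.5 * ((r[:, None] - centers[None, :]) / width) ** 2)

c = np.zeros(centers.size)               # learnable coefficients of h(r)
lr = 0.3
loss_history = []
for step in range(20000):
    resid = Phi @ c - E_ref              # model minus reference energies
    loss_history.append(np.mean(resid ** 2))
    c -= lr * (2.0 / r.size) * (Phi.T @ resid)   # exact gradient of the MSE

print(f"MSE: {loss_history[0]:.4f} -> {loss_history[-1]:.2e}")
```

Because the model is linear in its parameters, the loss is convex and gradient descent converges reliably; the result is a smooth, inspectable curve h(r) rather than a black-box mapping, which is the interpretability point the abstract emphasizes.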