We aim to extend quantum-level accuracy and insight to high throughput scales. To that end, ab-initio and semi-empirical methods will be combined with Machine Learning (ML) approaches generalizing the accuracy of these tools to scale. We have recently demonstrated fully scalable QM-accurate molecular dynamics of proteins in explicit water [1]. In the context of CACHE6 challenge, we aim to extend our previous work to predict ligand-protein binding affinities at QM accuracy.
The first stage of our pipeline involves the application of a clustering algorithm to an ultra-large library of compounds (PubChem). This algorithm groups compounds based on their structural and physicochemical characteristics. From the obtained clusters, a representative subset of candidates will be selected, ensuring a broad yet viable initial pool of potential candidates. Clusters are ranked based on two properties:
- Binding Score: The binding score will be computed using AutoDock Vina ‘rigid docking’ software.
- Molecular Properties: The biocompatibility and bioavailability of molecules would also be calculated at the same stage as the binding capacity. In particular, the solubility and lipophilicity will be computed using a molecular property prediction ML model based on Density Functional Theory (DFT) and Density Functional Tight Binding theory (DFTB) electronic structure methods [2,3].
All remaining molecules in the most promising clusters would then be evaluated. The Variational Auto-Encoder model [4] will be used to generate new chemical derivatives and effectively extend the number of molecules within selected clusters. We are aiming to have a prospective set of 1,000 to 10,000 candidates for the next step of our methodology. These candidates are re-docked using the ‘flexible’ method from Vina software, to select a set containing multiple initially promising poses for a wide set of compounds for further analysis. This initial wide set of compounds will be sorted by score and truncated to include only those with the best Vina scores (lead finding).
Next, for the retained leads, near ab-initio accuracy free-energy binding predictions will be obtained based on molecular dynamics simulations using the state-of-the-art SO3krates ML force field. This universal force field is trained on a dataset of 4 million (bio)molecular fragments [1]. Three steps of increasing cost are to be carried out at SO3krates level, with a further trimming of the lead set after each step:
- Energy minimization: Gradient descent energy minimization.
- Normal mode analysis: Estimation of a numerical Hessian and normal mode analysis to obtain a harmonic estimate of the free energy of binding, including vibrational components.
- Molecular dynamics: Binding energy profile using enhanced sampling molecular dynamics of the complex in explicit water.
This methodology aims to ensure the identification of the most promising compounds to be submitted for further investigation and potential therapeutic development within CACHE. By combining classical techniques with cutting-edge ML models, we can accelerate the drug discovery process and increase the likelihood of identifying effective candidates.