Challenge #6

Hit Identification
Method type (check all that apply)
Deep learning
Free energy perturbation
High-throughput docking
Machine learning
Physics-based
Hybrid of the above
High-throughput docking combined with state-of-the-art quantum-level machine learning
Description of your approach (min 200 and max 800 words)

We aim to bring quantum-mechanical (QM) accuracy and insight to high-throughput scales. To that end, ab initio and semi-empirical methods will be combined with machine learning (ML) approaches that carry the accuracy of these methods over to large scales. We have recently demonstrated fully scalable, QM-accurate molecular dynamics of proteins in explicit water [1]. In the context of the CACHE6 challenge, we aim to extend this work to predict ligand-protein binding affinities at QM accuracy.

The first stage of our pipeline applies a clustering algorithm to an ultra-large library of compounds (PubChem). The algorithm groups compounds by their structural and physicochemical characteristics. A representative subset is then selected from the resulting clusters, ensuring a broad yet viable initial pool of candidates. Clusters are ranked on two properties:

  • Binding Score: The binding score will be computed using AutoDock Vina in ‘rigid docking’ mode.
  • Molecular Properties: The biocompatibility and bioavailability of the molecules will be assessed at the same stage as the binding score. In particular, solubility and lipophilicity will be computed using an ML property-prediction model based on Density Functional Theory (DFT) and Density Functional Tight Binding (DFTB) electronic structure methods [2,3].
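
The clustering and cluster-ranking stage could be sketched as follows. This is only an illustration under stated assumptions, not the pipeline's actual implementation: the leader-style Tanimoto clustering stands in for the unspecified clustering algorithm, fingerprints are represented as plain sets of bit indices, each cluster is represented by its first member, and the `Compound` fields, the similarity threshold, and the logP/logS windows are hypothetical values chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Compound:
    name: str
    fingerprint: frozenset  # indices of the set bits in a structural fingerprint
    vina_score: float       # kcal/mol; more negative means stronger predicted binding
    log_p: float            # predicted lipophilicity
    log_s: float            # predicted aqueous solubility, log10(mol/L)


def tanimoto(a, b):
    """Tanimoto similarity between two sets of fingerprint bits."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def leader_cluster(compounds, threshold=0.6):
    """Put each compound in the first cluster whose leader is similar enough."""
    clusters = []
    for compound in compounds:
        for cluster in clusters:
            if tanimoto(compound.fingerprint, cluster[0].fingerprint) >= threshold:
                cluster.append(compound)
                break
        else:
            # No sufficiently similar leader found: start a new cluster.
            clusters.append([compound])
    return clusters


def rank_clusters(clusters, log_p_window=(0.0, 5.0), min_log_s=-6.0):
    """Drop clusters whose leader fails the property window (hypothetical
    thresholds), then sort the rest by the leader's docking score."""
    def passes(c):
        return log_p_window[0] <= c.log_p <= log_p_window[1] and c.log_s >= min_log_s

    kept = [cluster for cluster in clusters if passes(cluster[0])]
    return sorted(kept, key=lambda cluster: cluster[0].vina_score)


# Made-up example data, for illustration only.
library = [
    Compound("cpd-a", frozenset({1, 2, 3, 4}), -7.5, 2.1, -3.0),
    Compound("cpd-b", frozenset({1, 2, 3, 4, 5}), -6.0, 2.4, -3.2),
    Compound("cpd-c", frozenset({10, 11, 12}), -9.0, 3.3, -4.1),
    Compound("cpd-d", frozenset({20, 21}), -10.0, 7.5, -8.0),
]
ranked = rank_clusters(leader_cluster(library))
print([[c.name for c in cluster] for cluster in ranked])
```

In this toy library, `cpd-d` has the best docking score but is discarded because its leader falls outside the assumed lipophilicity and solubility windows, which illustrates why the two ranking properties are evaluated together.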

All remaining molecules in the most promising clusters will then be evaluated. A variational autoencoder (VAE) model [4] will be used to generate new chemical derivatives, extending the number of molecules within the selected clusters. We aim for a prospective set of 1,000 to 10,000 candidates for the next step of our methodology. These candidates will be re-docked using the ‘flexible’ method in Vina, yielding multiple initially promising poses for a wide set of compounds. This set will then be sorted by score and truncated to keep only the compounds with the best Vina scores (lead finding).
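
The sort-and-truncate step of lead finding might look like the sketch below. It assumes the re-docking results are available as `(compound_id, pose_id, vina_score)` tuples, ranks compounds by their best pose, and keeps a few poses per retained compound. The function name, the tuple layout, and the `poses_per_compound` parameter are hypothetical.

```python
from collections import defaultdict


def select_leads(poses, max_compounds, poses_per_compound=3):
    """Keep the best-scoring compounds and their top poses.

    poses: iterable of (compound_id, pose_id, vina_score) tuples, where a
    more negative Vina score means stronger predicted binding.
    Returns a dict ordered from best to worst compound, mapping each
    compound_id to its retained (vina_score, pose_id) pairs.
    """
    by_compound = defaultdict(list)
    for compound_id, pose_id, score in poses:
        by_compound[compound_id].append((score, pose_id))

    # Sort the poses of each compound so that the best pose comes first.
    for entries in by_compound.values():
        entries.sort()

    # Rank compounds by their best pose and truncate the list.
    ranked_ids = sorted(by_compound, key=lambda cid: by_compound[cid][0][0])
    return {
        cid: by_compound[cid][:poses_per_compound]
        for cid in ranked_ids[:max_compounds]
    }


# Made-up docking results, for illustration only.
docked = [
    ("A", 1, -7.1), ("A", 2, -8.3),
    ("B", 1, -9.0), ("B", 2, -6.0),
    ("C", 1, -5.2),
]
leads = select_leads(docked, max_compounds=2, poses_per_compound=1)
print(leads)
```

Keeping more than one pose per compound preserves alternative binding modes for the more expensive steps that follow, at the price of a larger set to process.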

Next, for the retained leads, binding free energies of near ab initio accuracy will be predicted from molecular dynamics simulations with the state-of-the-art SO3krates ML force field. This universal force field is trained on a dataset of 4 million (bio)molecular fragments [1]. Three steps of increasing cost will be carried out at the SO3krates level, with the lead set trimmed further after each step:

  1. Energy minimization: Gradient descent energy minimization.
  2. Normal mode analysis: Estimation of a numerical Hessian and normal mode analysis to obtain a harmonic estimate of the free energy of binding, including vibrational components.
  3. Molecular dynamics: Binding energy profile from enhanced-sampling molecular dynamics of the complex in explicit water.
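
As an illustration of step 2, the harmonic estimate can be written in terms of the standard quantum harmonic-oscillator free energy, F_vib = Σ_i [hν_i/2 + k_B T ln(1 − exp(−hν_i / k_B T))], summed over the normal modes obtained from the Hessian. The sketch below assumes that minimized energies (in kcal/mol) and real vibrational wavenumbers (in cm^-1) are already available for the complex, the protein, and the ligand. It omits translational and rotational contributions, and the function names are hypothetical.

```python
import math

KB = 1.987204e-3           # Boltzmann constant in kcal/(mol K)
CM1_TO_KCAL = 2.859144e-3  # energy of 1 cm^-1 in kcal/mol


def vibrational_free_energy(wavenumbers_cm1, temperature=298.15):
    """Quantum harmonic vibrational free energy in kcal/mol.

    Sums the zero-point energy and the thermal term over all modes.
    Imaginary or zero-frequency modes must be removed beforehand.
    """
    kt = KB * temperature
    total = 0.0
    for nu in wavenumbers_cm1:
        if nu <= 0.0:
            raise ValueError("only real, positive wavenumbers are supported")
        quantum = nu * CM1_TO_KCAL  # h * nu for this mode
        total += 0.5 * quantum + kt * math.log1p(-math.exp(-quantum / kt))
    return total


def harmonic_binding_free_energy(e_complex, e_protein, e_ligand,
                                 modes_complex, modes_protein, modes_ligand,
                                 temperature=298.15):
    """Harmonic estimate: difference of minimized energies plus the
    difference of vibrational free energies (complex - protein - ligand)."""
    delta_e = e_complex - e_protein - e_ligand
    delta_f_vib = (
        vibrational_free_energy(modes_complex, temperature)
        - vibrational_free_energy(modes_protein, temperature)
        - vibrational_free_energy(modes_ligand, temperature)
    )
    return delta_e + delta_f_vib
```

Soft modes (low wavenumbers) contribute negative free energy through the thermal term, whereas stiff modes contribute essentially only their zero-point energy, so changes in the low-frequency part of the spectrum on binding dominate the vibrational correction.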

This methodology aims to identify the most promising compounds to be submitted for further investigation and potential therapeutic development within CACHE. By combining classical techniques with cutting-edge ML models, we can accelerate the drug discovery process and increase the likelihood of identifying effective candidates.

What makes your approach stand out from the community? (<100 words)

Our group has pioneered the use of machine learning in quantum chemistry and has extensive experience in chemical compound space exploration. We excelled in the blind tests of organic crystal structure prediction [5-8], a result mainly attributable to a careful multi-physics ranking strategy. This proposal combines advanced ML-powered techniques developed in-house (property prediction, compound generation, physics-inspired ML force fields) with a selection of open-source third-party tools, which should allow us to deliver similar accuracy at biologically relevant scales.

Method Name
QCACHE
Commercial software packages used
  • FHI-aims
Free software packages used
  • AutoDock Vina
  • MGLTools
  • Psi4
  • Atomic Simulation Environment (ASE)
  • VMD
  • RDKit
  • SO3krates
  • MDAnalysis
  • UCSF Chimera
  • DFTB+

Relevant publications of previous uses by your group of this software/method
  1. Unke, O. T., Stöhr, M., Ganscha, S., Unterthiner, T., Maennel, H., Kashubin, S., Ahlin, D., Gastegger, M., Medrano Sandonas, L., Berryman, J. T., Tkatchenko, A., & Müller, K.-R. (2024). Biomolecular dynamics with machine-learned quantum-mechanical force fields trained on diverse chemical fragments. Science Advances, 10(14). https://doi.org/10.1126/sciadv.adn4397 
  2. Medrano Sandonas, L., Hoja, J., Ernst, B. G., Vázquez-Mayagoitia, Á., DiStasio, R. A., & Tkatchenko, A. (2023). “Freedom of design” in chemical compound space: towards rational in silico design of molecules with targeted quantum-mechanical properties. Chemical Science, 14(39), 10702–10717. https://doi.org/10.1039/d3sc03598k 
  3. Medrano Sandonas, L., van Rompaey, D., Fallani, A., Hahn, D., Perez-Benito, L., Verhoeven, J., Tresadern, G., Wegner, K., Ceulemans, H., & Tkatchenko, A. (2024). Aquamarine: Quantum-Mechanical Exploration of Conformers and Solvent Effects in Large Drug-like Molecules. ChemRxiv. https://doi.org/10.26434/chemrxiv-2024-685qb
  4. Fallani, A., Medrano Sandonas, L., & Tkatchenko, A. (2024). Inverse Mapping of Quantum Properties to Structures for Chemical Space of Small Organic Molecules. Nature Communications.
  5. Reilly, A. M., Cooper, R. I., Adjiman, C. S., Bhattacharya, S., Boese, A. D., Brandenburg, J. G., Bygrave, P. J., Bylsma, R., Campbell, J. E., Car, R., Case, D. H., Chadha, R., Cole, J. C., Cosburn, K., Cuppen, H. M., Curtis, F., Day, G. M., DiStasio Jr, R. A., Dzyabchenko, A., … Groom, C. R. (2016). Report on the sixth blind test of organic crystal structure prediction methods. Acta Crystallographica Section B Structural Science, Crystal Engineering and Materials, 72(4), 439–459. https://doi.org/10.1107/S2052520616007447 
  6. Hoja, J., & Tkatchenko, A. (2018). First-principles stability ranking of molecular crystal polymorphs with the DFT+MBD approach. Faraday Discussions, 211, 253–274. https://doi.org/10.1039/C8FD00066B 
  7. Hoja, J., Ko, H.-Y., Neumann, M. A., Car, R., DiStasio, R. A., & Tkatchenko, A. (2019). Reliable and practical computational description of molecular crystal polymorphs. Science Advances, 5(1). https://doi.org/10.1126/sciadv.aau3338 
  8. Firaha, D., Liu, Y. M., van de Streek, J., Sasikumar, K., Dietrich, H., Helfferich, J., Aerts, L., Braun, D. E., Broo, A., DiPasquale, A. G., Lee, A. Y., le Meur, S., Nilsson Lill, S. O., Lunsmann, W. J., Mattei, A., Muglia, P., Putra, O. D., Raoui, M., Reutzel-Edens, S. M., … Neumann, M. A. (2023). Predicting crystal form stability under real-world conditions. Nature, 623(7986), 324–328. https://doi.org/10.1038/s41586-023-06587-3