Our strategy for finding hit compounds is based on de novo design using Logiston, a generative AI model we developed. We used both conventional binding structure prediction models and deep learning-based binding structure generation models, including AutoDock Vina [1], DiffDock [2], and FABind [3]. However, these methods are known to be over-confident about target-ligand binding, so many compounds that are unlikely to bind survive the screening process. This severely reduces the precision of the overall screening process. To address the issue, we developed a graph neural network named Fabione. Fabione rescores the binding affinity of the generated binding structures as a negative log concentration. It was trained to reject the false artifacts produced by binding structure prediction models. As a result, we can retrieve not only highly accurate binding structures but also precise target-ligand binding affinities on the nM scale.
For effective discovery of novel target-binding compounds, our method integrates three essential processes into a single deep learning network, which we named MIN-T (Molecules Inventing Network for Target binding).
1. Logiston: Reaction-based generation of novel compounds
This process generates novel compounds based on building blocks and reaction templates manually curated by medicinal chemists.
A maximum of 5 steps can be used to build a candidate compound, and each reaction step is evaluated by a deep learning model. Two aspects make Logiston unique compared to other reaction-based compound generation methods. i) Logiston is trained by reinforcement learning using a reward that represents practical synthetic accessibility, the RXN2VEC score. ii) Logiston can be trained using multiple rewards, so the generated compounds meet multiple criteria simultaneously. For example, compounds may achieve both high binding affinity to the target and high synthetic accessibility.
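The text does not state how the multiple rewards are combined, so the following is only a minimal sketch under our own assumptions: a weighted sum of an affinity reward and a synthesizability reward, each scaled to [0, 1]. The names `RewardWeights`, `affinity_reward`, and `combined_reward`, the equal weights, and the 5 to 9 scaling range are all hypothetical and are not taken from Logiston.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    """Hypothetical weights; the actual combination rule is not given in the text."""
    affinity: float = 0.5
    synthesis: float = 0.5


def affinity_reward(p_affinity: float, low: float = 5.0, high: float = 9.0) -> float:
    """Scale a predicted negative-log affinity (8.0 for 10 nM) onto [0, 1].

    The 5.0 to 9.0 range is an assumed working range, not a published setting.
    """
    scaled = (p_affinity - low) / (high - low)
    return min(1.0, max(0.0, scaled))


def combined_reward(p_affinity: float, synth_score: float,
                    weights: RewardWeights = RewardWeights()) -> float:
    """Weighted average of the two rewards.

    `synth_score` stands in for a synthesizability score such as RXN2VEC and
    is assumed here to already lie in [0, 1]; it is clamped to be safe.
    """
    synth = min(1.0, max(0.0, synth_score))
    total = weights.affinity + weights.synthesis
    return (weights.affinity * affinity_reward(p_affinity)
            + weights.synthesis * synth) / total
```

With a reward of this shape, a compound scores highly only when it is both predicted to bind and judged synthesizable, which is the behaviour described in ii).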
We developed the RXN2VEC score as a practical measure of synthesis accessibility, using over 1M known chemical reactions represented in the SMARTS format. Vector representations of the reaction SMARTS strings were learned using deep neural networks. As a result, the generated compounds are highly synthesizable from feasible building blocks and chemical reactions. According to evaluation by medicinal chemists, more than 95% of the compounds generated by Logiston are synthesizable.
We usually start by generating 1M compounds with Logiston and then iteratively generate more compounds after each simulation, so that the compounds keep improving until they reach the target binding affinity (about 10 nM). Using known affinity data, we confirmed that this reaches the target at least 100-fold more efficiently than large-scale virtual screening, which requires at least 1B compounds.
2. Fabione: Binding poses and activity prediction of candidate compounds
After binding structures are predicted or generated using tools such as AutoDock Vina, DiffDock, and FABind, Fabione evaluates each structure to remove untrustworthy structures and predict an accurate binding affinity. Although Fabione can be applied to the output of any structure prediction tool, it can be connected directly to the FABind network because the two share some neural network architectures. As FABind itself is extremely fast, we generated structures 100 times for each high-scoring compound, starting from different initial poses. This allows us to quantitatively evaluate the stability of the binding poses and remove unstable artifacts. The binding structures from other tools were also compared to further evaluate structure confidence.
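One way to quantify the stability of repeated poses is sketched below. It is an illustration under our own assumptions, not the Fabione procedure: poses are lists of ligand heavy-atom coordinates in the same protein frame with identical atom ordering, RMSD is computed without superposition or symmetry correction, and stability is defined as the fraction of poses within 2 Å of the medoid pose. The function names and the 2 Å cutoff are hypothetical choices.

```python
import math
from typing import List, Sequence, Tuple

Coord = Tuple[float, float, float]
Pose = Sequence[Coord]


def rmsd(a: Pose, b: Pose) -> float:
    """Plain coordinate RMSD; assumes the same atom order and a shared frame."""
    if len(a) != len(b) or not a:
        raise ValueError("poses must be non-empty and have the same atom count")
    sq = sum((p - q) ** 2 for u, v in zip(a, b) for p, q in zip(u, v))
    return math.sqrt(sq / len(a))


def pose_stability(poses: List[Pose], cutoff: float = 2.0) -> float:
    """Fraction of poses within `cutoff` angstroms of the medoid pose."""
    if not poses:
        raise ValueError("at least one pose is required")
    # The medoid is the pose with the smallest summed RMSD to all poses.
    sums = [sum(rmsd(p, q) for q in poses) for p in poses]
    medoid = poses[sums.index(min(sums))]
    close = sum(1 for p in poses if rmsd(p, medoid) <= cutoff)
    return close / len(poses)
```

A compound whose 100 generated poses mostly fall near one consensus pose would score close to 1.0, while a compound whose poses scatter across the pocket would score low and could be flagged as an unstable artifact.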
The first network of Fabione was trained using more than 200K structures from the Protein Data Bank, together with 1M augmented false artifacts. The model can predict accurate (RMSD < 2 Å) binding structures between targets and ligands even without reference structures. The network is also linked to separate deep neural networks that predict compound activity at the nM level. This second part of Fabione was trained using more than 230K real-world activity data points from the ChEMBL database.
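Because affinity is expressed as a negative log concentration while targets are quoted in nM, the conversion between the two is worth making explicit. The helper names below are hypothetical; the relation itself is the standard one, p = -log10(concentration in mol/L) = 9 - log10(concentration in nM).

```python
import math


def nm_to_p(conc_nm: float) -> float:
    """Convert a concentration in nM to a negative log10 molar value."""
    if conc_nm <= 0:
        raise ValueError("concentration must be positive")
    return 9.0 - math.log10(conc_nm)


def p_to_nm(p_value: float) -> float:
    """Convert a negative log10 molar value back to a concentration in nM."""
    return 10.0 ** (9.0 - p_value)
```

Under this relation, an affinity of 10 nM corresponds to a value of 8, and each additional unit means a tenfold stronger binder.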
3. Active site optimization using structure clustering and 3Cpro
To retrieve highly active candidate compounds, optimizing the compound binding site is crucial.
The binding site must not only be feasible for small molecules but also functionally active.
Precise rescoring of binding sites using physicochemical features ensures the feasibility of the pockets. The next step is to select crucial binding residues based on the binding poses and predicted affinities of 1M compounds. Then, feasible scaffolds of binding compounds are extracted using substructure clustering and used to generate better binding compounds.
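The text does not specify how crucial residues are chosen from poses and predicted affinities. As one plausible illustration, the sketch below ranks residues by an affinity-weighted contact frequency: each compound contributes its predicted negative-log affinity to every residue its pose contacts. The function name, the weighting scheme, the input format, and the residue labels in the usage example are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def rank_binding_residues(
    records: Iterable[Tuple[float, Set[str]]],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Rank residues by affinity-weighted contact frequency.

    Each record is (predicted negative-log affinity, residues contacted by
    the pose). Scores are normalized by the total weight of all compounds.
    """
    weighted: Dict[str, float] = defaultdict(float)
    total = 0.0
    for p_affinity, residues in records:
        weight = max(p_affinity, 0.0)  # ignore negative weights
        total += weight
        for residue in residues:
            weighted[residue] += weight
    if total == 0.0:
        return []
    # Sort by descending score, then by name for a deterministic order.
    ranked = sorted(
        ((name, w / total) for name, w in weighted.items()),
        key=lambda item: (-item[1], item[0]),
    )
    return ranked[:top_k]


# Usage with made-up residue labels and affinities.
example = [(8.0, {"R1", "R2"}), (6.0, {"R2"}), (6.0, {"R3"})]
top = rank_binding_residues(example, top_k=2)
```

Residues contacted mainly by strong predicted binders rise to the top, which matches the stated goal of selecting residues from both poses and affinities.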
In addition, active sites are optimized based on the functionally active region predicted from the variant pathogenicity of protein residues. We developed 3Cpro, a virtual scoring system for alanine scanning of each binding residue. 3Cpro is based on our previously developed method 3Cnet [4] but is further tuned for active site optimization. 3Cpro outperformed AlphaMissense [5] in predicting activity changes caused by genetic variants in the recent CAGI6 challenge.
The above three steps are repeated until hits are found. The generated compounds are iteratively optimized.
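The iteration can be pictured as a generate, score, and select loop. The sketch below is a schematic under our own assumptions, not the MIN-T implementation: `generate` and `score` are placeholder callables standing in for the generator and the affinity predictor, and the default threshold of 8.0 is the negative-log value of the 10 nM target. Batch size, number of retained compounds, and round limit are arbitrary illustrative defaults.

```python
from typing import Callable, List, Tuple


def optimize_until_hit(
    generate: Callable[[List[str], int], List[str]],
    score: Callable[[str], float],
    target_p: float = 8.0,
    batch_size: int = 1000,
    keep: int = 10,
    max_rounds: int = 20,
) -> Tuple[List[Tuple[str, float]], int]:
    """Repeat generation and scoring until a compound reaches `target_p`.

    Returns the best (compound, score) pairs and the number of rounds used.
    """
    seeds: List[str] = []
    best: List[Tuple[str, float]] = []
    for round_idx in range(1, max_rounds + 1):
        candidates = generate(seeds, batch_size)
        scored = [(c, score(c)) for c in candidates]
        # Keep the top-scoring compounds seen so far.
        best = sorted(best + scored, key=lambda x: x[1], reverse=True)[:keep]
        if best and best[0][1] >= target_p:
            return best, round_idx
        # Feed the current best compounds back as seeds for the next round.
        seeds = [c for c, _ in best]
    return best, max_rounds
```

The loop stops as soon as one compound meets the threshold; in practice one would likely continue until enough diverse hits are collected.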
References
[1] AutoDock Vina: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3041641/
[2] DiffDock: https://arxiv.org/abs/2210.01776
[3] FABind: https://arxiv.org/abs/2310.06763
[4] 3Cnet: https://academic.oup.com/bioinformatics/article/37/24/4626/6322986
[5] AlphaMissense: https://www.science.org/doi/10.1126/science.adg7492