In summary, we use a contrastive virtual screening model to rank the molecules in large chemical libraries and retain the top 1%. These top-ranked molecules are then clustered using molecular fingerprints such as ECFP4 or MACCS. The clustered molecules, typically around 200, are docked to the target pocket, and all docking poses are assessed by docking score, RMSD, and expert evaluation.
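The selection and clustering steps can be illustrated with a minimal sketch. It is not the implementation used here: fingerprints are represented as plain sets of on-bit indices (in practice they would come from a cheminformatics toolkit), the greedy leader clustering and the 0.6 Tanimoto threshold are assumptions chosen for illustration, and all function names are hypothetical.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0


def select_top_fraction(scores, fraction=0.01):
    """Return molecule ids of the best-scoring fraction (higher score = better)."""
    n_keep = max(1, round(len(scores) * fraction))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:n_keep]


def leader_cluster(ranked_ids, fingerprints, threshold=0.6):
    """Greedy leader clustering (an assumed scheme, not taken from the text).

    Molecules are visited in rank order. Each one joins the first cluster
    whose leader (its first member) is at least `threshold` similar;
    otherwise it starts a new cluster.
    """
    clusters = []
    for mol_id in ranked_ids:
        for cluster in clusters:
            leader = cluster[0]
            if tanimoto(fingerprints[mol_id], fingerprints[leader]) >= threshold:
                cluster.append(mol_id)
                break
        else:
            clusters.append([mol_id])
    return clusters


# Toy data: model scores and fingerprints for six hypothetical molecules.
scores = {"m1": 0.9, "m2": 0.8, "m3": 0.7, "m4": 0.1, "m5": 0.05, "m6": 0.0}
fingerprints = {
    "m1": {1, 2, 3, 4},
    "m2": {1, 2, 3, 5},
    "m3": {10, 11, 12},
    "m4": {1, 2},
    "m5": {20},
    "m6": {30},
}

top_ids = select_top_fraction(scores, fraction=0.5)  # 1% in the real pipeline
clusters = leader_cluster(top_ids, fingerprints)
print(clusters)
```

Because molecules are visited in rank order, the leader of each cluster is its best-scoring member, which makes it a natural representative to pass on to docking.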
The contrastive learning model is trained in two stages. First, the protein pocket encoder is pre-trained on protein-only data: protein fragments are treated as ligands and their surrounding regions as pockets, and the pocket encoder is trained to align with a pre-trained molecule encoder (Uni-Mol) through contrastive distillation. Second, we fine-tune the contrastive learning model on experimentally determined ligands and their receptors. Fine-tuning uses the PDBbind dataset, which comprises approximately 20,000 manually reviewed ligand-receptor complex structures with recorded affinities, and the BioLiP2 dataset, a larger, semi-automatically collected set of complexes with unknown affinities. The final model can thus perform virtual screening for given pockets and ligands.
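To make the alignment objective concrete, the following sketch computes a symmetric InfoNCE-style loss between pocket and molecule embeddings. The exact loss is not specified in this section, so the InfoNCE form, the temperature of 0.07, and the function names are all assumptions; the sketch uses plain Python lists in place of encoder outputs.

```python
import math


def _dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def _normalize(v):
    norm = math.sqrt(_dot(v, v))
    return [x / norm for x in v]


def _cross_entropy(logits, target):
    """Negative log-softmax of `logits` at index `target` (numerically stable)."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_z - logits[target]


def symmetric_info_nce(pocket_embs, mol_embs, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of matched pocket-ligand pairs.

    Row i of `pocket_embs` and row i of `mol_embs` form a positive pair;
    every other pairing in the batch acts as a negative.
    """
    pockets = [_normalize(v) for v in pocket_embs]
    mols = [_normalize(v) for v in mol_embs]
    n = len(pockets)
    # Cosine similarities scaled by the temperature.
    sim = [[_dot(pockets[i], mols[j]) / temperature for j in range(n)]
           for i in range(n)]
    # Pocket-to-molecule direction: each pocket must pick out its own ligand.
    p2m = sum(_cross_entropy(sim[i], i) for i in range(n)) / n
    # Molecule-to-pocket direction: each ligand must pick out its own pocket.
    m2p = sum(_cross_entropy([sim[i][j] for i in range(n)], j)
              for j in range(n)) / n
    return 0.5 * (p2m + m2p)


# Toy embeddings: aligned pairs give a low loss, swapped pairs a high one.
pocket_embs = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = symmetric_info_nce(pocket_embs, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = symmetric_info_nce(pocket_embs, [[0.0, 1.0], [1.0, 0.0]])
print(aligned_loss < swapped_loss)
```

In the distillation stage described above, the molecule encoder would be the pre-trained one and only the pocket encoder would receive gradients; this sketch shows only the loss value, not the training loop.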
For docking pose evaluation, we apply criteria beyond the docking score. To check that poses are consistent, each ligand is docked with three different docking programs, and a ligand is accepted if the root-mean-square deviations (RMSDs) between the resulting poses are less than 2 Å.
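The consistency criterion can be sketched as follows. The sketch assumes that all poses list the same atoms in the same order and share the receptor coordinate frame, so no superposition is applied, and it ignores ligand symmetry. It also reads the rule as requiring every pair of poses to fall under the cutoff, which is one interpretation of the criterion; the function names are hypothetical.

```python
import math
from itertools import combinations


def rmsd(pose_a, pose_b):
    """RMSD in angstroms between two poses given as lists of (x, y, z) tuples.

    Both poses must list the same atoms in the same order and share the
    receptor coordinate frame; no alignment is performed.
    """
    if len(pose_a) != len(pose_b) or not pose_a:
        raise ValueError("poses must be non-empty and have the same atom count")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(pose_a, pose_b)
    )
    return math.sqrt(squared / len(pose_a))


def poses_are_consistent(poses, cutoff=2.0):
    """True if every pair of poses has an RMSD below `cutoff` (assumed rule)."""
    return all(rmsd(a, b) < cutoff for a, b in combinations(poses, 2))


# Toy two-atom ligand docked by three hypothetical programs.
pose_1 = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
pose_2 = [(1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]   # shifted by 1 angstrom
pose_3 = [(3.0, 0.0, 0.0), (4.0, 0.0, 0.0)]   # shifted by 3 angstroms

print(poses_are_consistent([pose_1, pose_2, pose_2]))  # all pairs under 2 angstroms
print(poses_are_consistent([pose_1, pose_2, pose_3]))  # pose_3 is too far away
```

For ligands with symmetric groups, a symmetry-corrected RMSD would be needed so that equivalent atom orderings are not counted as disagreement.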
Combining a contrastive virtual screening model with fingerprint-based clustering and multi-program docking provides a framework for identifying potential ligands. The two-stage training procedure and the consensus-based pose evaluation are intended to make the screening results more reliable. The approach is designed to improve the efficiency of virtual screening and to increase the likelihood of discovering promising drug candidates.