Challenge #6

Hit Identification
Method type (check all that apply)
Deep learning
High-throughput docking
Hybrid of the above
We used a hybrid approach that combines deep learning scoring with docking.
Description of your approach (min 200 and max 800 words)

In summary, we use a contrastive virtual screening model to rank large chemical libraries and keep the top 1% of molecules. These top-ranked molecules are then clustered by molecular fingerprints such as ECFP4 or MACCS. The molecules selected after clustering, typically around 200, are docked into the target pocket, and all docking poses are assessed by docking score, RMSD between poses, and expert evaluation.
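The selection and clustering steps above can be sketched roughly as follows. This is an illustrative sketch only, not our production code: fingerprints are represented as plain sets of on-bit indices (in practice they would come from a toolkit such as RDKit), and the function names, the greedy leader-clustering scheme, and the 0.6 Tanimoto threshold are assumptions made for the example.

```python
import math


def top_fraction(scores, fraction=0.01):
    """Return molecule ids in the top `fraction` by score (higher is better)."""
    n_keep = max(1, math.ceil(len(scores) * fraction))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:n_keep]


def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    if not fp_a and not fp_b:
        return 0.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)


def leader_cluster(ranked_ids, fingerprints, threshold=0.6):
    """Greedy leader clustering (an assumed scheme, chosen for brevity).

    Molecules are visited in rank order. Each one joins the first cluster
    whose representative is at least `threshold` similar; otherwise it
    starts a new cluster and becomes its representative.
    """
    clusters = []  # list of (representative_id, [member_ids])
    for mol_id in ranked_ids:
        fp = fingerprints[mol_id]
        for rep_id, members in clusters:
            if tanimoto(fp, fingerprints[rep_id]) >= threshold:
                members.append(mol_id)
                break
        else:
            clusters.append((mol_id, [mol_id]))
    return clusters


if __name__ == "__main__":
    # Toy data: hypothetical model scores and hand-made fingerprints.
    scores = {"m1": 0.91, "m2": 0.88, "m3": 0.40, "m4": 0.12}
    fps = {"m1": {1, 2, 3, 4}, "m2": {1, 2, 3, 5}, "m3": {9, 10}, "m4": {11}}
    kept = top_fraction(scores, fraction=0.5)
    print(leader_cluster(kept, fps, threshold=0.5))
```

Taking the best-ranked member of each cluster as its representative is one simple way to reduce the top-ranked set to a diverse docking list.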

The contrastive learning model is trained in two steps. First, protein-only data are used to pre-train the protein pocket encoder: protein fragments are treated as ligands, and their surrounding regions are treated as pockets. The pocket encoder is trained to align with a pre-trained molecule encoder (specifically Uni-Mol) using a contrastive distillation approach. After this alignment, we fine-tune the contrastive learning model on experimentally determined ligands and their receptors. Fine-tuning uses the PDBbind dataset, which comprises approximately 20,000 manually reviewed ligand-receptor complex structures with recorded affinities, and the BioLiP2 dataset, a larger, semi-automatically curated collection of complexes without affinity labels. The final model can then score candidate ligands against a given pocket for virtual screening.
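To make the contrastive objective and the resulting screening step concrete, the sketch below shows a generic CLIP-style symmetric InfoNCE loss over a batch of matched pocket and molecule embeddings, followed by ranking a library by cosine similarity. It is a simplified illustration in plain Python: the function names, the temperature of 0.07, and the use of in-batch negatives are assumptions for the example and are not claimed to match the exact training loss.

```python
import math


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def normalize(v):
    """Scale a vector to unit length so that dot products are cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def log_softmax_at(logits, index):
    """Numerically stable log-softmax value at position `index`."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return logits[index] - log_z


def symmetric_info_nce(pocket_embs, mol_embs, temperature=0.07):
    """Symmetric InfoNCE loss for matched pairs (pocket_i, molecule_i).

    All other pairings within the batch serve as negatives.
    """
    p = [normalize(v) for v in pocket_embs]
    m = [normalize(v) for v in mol_embs]
    n = len(p)
    sim = [[dot(p[i], m[j]) / temperature for j in range(n)] for i in range(n)]
    # Pocket-to-molecule direction: each row should peak on its diagonal.
    loss_p2m = -sum(log_softmax_at(sim[i], i) for i in range(n)) / n
    # Molecule-to-pocket direction: each column should peak on its diagonal.
    loss_m2p = -sum(
        log_softmax_at([sim[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (loss_p2m + loss_m2p)


def screen(pocket_emb, library_embs, top_k=10):
    """Rank library molecules by cosine similarity to one pocket embedding."""
    query = normalize(pocket_emb)
    scored = [(dot(query, normalize(v)), mol_id) for mol_id, v in library_embs.items()]
    scored.sort(reverse=True)
    return [mol_id for _, mol_id in scored[:top_k]]


if __name__ == "__main__":
    pockets = [[1.0, 0.0], [0.0, 1.0]]
    molecules = [[0.9, 0.1], [0.2, 0.8]]
    print(symmetric_info_nce(pockets, molecules))
    print(screen([1.0, 0.0], {"x": [1.0, 0.1], "y": [0.0, 1.0]}, top_k=1))
```

Because molecule embeddings do not depend on the pocket, they can be computed once per library, so screening reduces to similarity lookups against the encoded pocket.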

For docking pose evaluation, we apply criteria beyond docking scores. Each ligand is docked with three different docking programs to check that the predicted poses are consistent, and a ligand is accepted only if the root-mean-square deviations (RMSDs) between the poses are below 2 Å.
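The pose-consistency criterion can be illustrated with the short sketch below. It assumes that each pose is a list of (x, y, z) heavy-atom coordinates in Å in the same receptor frame and with identical atom ordering across programs, that "between poses" means every pairwise comparison, and that no symmetry correction is applied. The function names and the `tool_a`/`tool_b`/`tool_c` labels are hypothetical.

```python
import math
from itertools import combinations


def rmsd(pose_a, pose_b):
    """In-place RMSD between two poses with identical atom ordering (no alignment)."""
    if not pose_a or len(pose_a) != len(pose_b):
        raise ValueError("poses must be non-empty and contain the same atoms")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(pose_a, pose_b)
    )
    return math.sqrt(squared / len(pose_a))


def poses_agree(poses_by_tool, cutoff=2.0):
    """Accept a ligand only if every pair of poses has RMSD below `cutoff` (Å)."""
    return all(
        rmsd(a, b) < cutoff for a, b in combinations(poses_by_tool.values(), 2)
    )


if __name__ == "__main__":
    # Toy two-atom ligand docked by three hypothetical programs.
    poses = {
        "tool_a": [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)],
        "tool_b": [(0.3, 0.1, 0.0), (1.7, 0.2, 0.0)],
        "tool_c": [(0.1, 0.0, 0.2), (1.4, 0.1, 0.1)],
    }
    print(poses_agree(poses))
```

No superposition is performed before computing the RMSD, since poses docked into the same receptor structure already share a coordinate frame; ligands with symmetric groups would need a symmetry-aware RMSD instead.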

Combining the contrastive virtual screening model with fingerprint-based clustering and consensus docking provides a robust framework for identifying potential ligands. The two-step training process and the pose-consistency criteria are intended to make the screening results more reliable. The approach improves the efficiency of virtual screening and increases the likelihood of discovering promising drug candidates.

What makes your approach stand out from the community? (<100 words)

First, our method has been validated with wet-lab experiments on multiple targets, including GPCRs, transporters, ion channels, and chromatin remodelers. Second, it achieves state-of-the-art (SOTA) performance on benchmarks such as DUD-E and LIT-PCBA. Finally, our contrastive-learning-based virtual screening method is extremely fast and can screen billions of compounds within minutes.

Method Name
DrugCLIP+
Commercial software packages used

Schrödinger Suite and CCDC GOLD

Free software packages used

Python, RDKit, Open Babel, AutoDock Vina, Biopython, PyTorch, and Uni-Core.

Relevant publications of previous uses by your group of this software/method

DrugCLIP: Contrastive Protein-Molecule Representation Learning for Virtual Screening. Bowen Gao, Bo Qiang, Haichuan Tan, Yinjun Jia, Minsi Ren, Minsi Lu, Jingjing Liu, Wei-Ying Ma, Yanyan Lan. NeurIPS, 2023.

ProFSA: Self-supervised Pocket Pretraining via Protein Fragment-Surroundings Alignment. Bowen Gao, Yinjun Jia, YuanLe Mo, Yuyan Ni, Wei-Ying Ma, Zhi-Ming Ma, Yanyan Lan. ICLR, 2024.