Challenge #6

Hit Identification
Method type (check all that apply)
Deep learning
Machine learning
Description of your approach (min 200 and max 800 words)

BIOPTIC is a target-agnostic, potency-based molecule search model for finding structurally dissimilar molecules with similar biological activities. We designed a fast retrieval system based on processor-optimized SIMD instructions to screen the 40-billion-molecule Enamine REAL Space with a 100% recall rate.

1. Modeling

Our model is a SMILES-based transformer trained in two phases; Byte-Pair Encoding tokenization is applied to the SMILES strings in both phases. In the first phase, the model is pre-trained in an unsupervised fashion on a large corpus of unlabeled molecular data from PubChem and Enamine REAL Space. Training is performed via masked language modeling following the RoBERTa procedure; this phase allows the model to learn SMILES grammar. In the second phase, the transformer is augmented with a multi-pooling module that aggregates the hidden-layer output in three ways: taking the classification-token output as is, and taking the maximum and the average along the sequence axis of all remaining tokens. Each aggregated vector is passed through its own linear layer to project it to a lower dimension, and the projections are concatenated to form the final embedding. This embedding is normalized and passed to a classification layer whose task is to predict which targets a given molecule is active against. Here, the model is trained on BindingDB. Since most labels are unavailable for BindingDB ligand-target pairs, we mask them out so as not to introduce additional noise into training.
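The multi-pooling step above can be sketched as follows. This is an illustrative NumPy mock-up, not the actual implementation: the projection weights are random stand-ins for the learned linear layers, and `multi_pool_embed` is a hypothetical name.

```python
import numpy as np

rng = np.random.default_rng(0)

HIDDEN = 384   # transformer hidden size (from the description)
PROJ = 20      # per-branch projection size; three branches concatenate to 60

def multi_pool_embed(hidden_states, w_cls, w_max, w_avg):
    """Aggregate transformer outputs into a 60-d embedding.

    hidden_states: (seq_len, HIDDEN) array; row 0 is the [CLS] token.
    w_*: (HIDDEN, PROJ) projection matrices, one per pooling branch.
    """
    cls_vec = hidden_states[0]           # classification-token output as is
    rest = hidden_states[1:]
    max_vec = rest.max(axis=0)           # max along the sequence axis
    avg_vec = rest.mean(axis=0)          # average along the sequence axis
    # Independent linear projections, then concatenation
    emb = np.concatenate([cls_vec @ w_cls, max_vec @ w_max, avg_vec @ w_avg])
    return emb / np.linalg.norm(emb)     # L2-normalization

# Toy example with random weights in place of trained ones
h = rng.standard_normal((12, HIDDEN))
ws = [rng.standard_normal((HIDDEN, PROJ)) * 0.05 for _ in range(3)]
emb = multi_pool_embed(h, *ws)
print(emb.shape)  # (60,)
```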

We use binary cross-entropy loss with masking as the objective for learning potency-based molecular representations. A masking indicator prevents the model from being penalized on unlabeled data pairs, allowing it to learn from labeled ligand-target interactions only.
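A minimal NumPy sketch of this masked objective (the function name and the `eps` constant are illustrative; the real training loop is not shown):

```python
import numpy as np

def masked_bce(logits, labels, mask):
    """Binary cross-entropy averaged over labeled entries only.

    logits, labels, mask: (batch, n_targets) arrays; mask is 1 where a
    ligand-target activity label exists, 0 where the label is unknown.
    """
    p = 1.0 / (1.0 + np.exp(-logits))  # sigmoid
    eps = 1e-7                         # numerical safety for log
    bce = -(labels * np.log(p + eps) + (1 - labels) * np.log(1 - p + eps))
    # Unlabeled pairs contribute zero loss; average over labeled pairs only
    return (bce * mask).sum() / mask.sum()

# At zero logits every labeled entry costs -log(0.5) = log 2
logits = np.zeros((2, 3))
labels = np.array([[1., 0., 1.], [0., 1., 0.]])
mask = np.array([[1., 1., 0.], [0., 1., 1.]])
print(masked_bce(logits, labels, mask))  # ≈ 0.6931
```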

2. Optimization 

The chosen model is RoBERTa with a vocabulary size of 500, 6 hidden layers with a hidden size of 384, 8 attention heads, and an intermediate size of 1024; the remaining parameters are as described in the original RoBERTa work. The model has 8.7 million parameters. It is first pre-trained in an unsupervised fashion on 115 million unique molecules from PubChem and 48 million random molecules from Enamine REAL Space, using masked language modeling following the RoBERTa procedure. After pre-training, the last layer's hidden state is taken from the CLS token, and the remaining tokens are max- and average-pooled. Three independent linear layers with an output size of 20 are then applied, and the reduced representations are concatenated into a vector of size 60. The embedding dimensionality of 60 is a trade-off between quality and storage requirements. Embeddings are L2-normalized and passed to the classification linear layer, where the number of classes depends on the target-specific data split and averages around 6,700.
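These hyperparameters can be expressed as a configuration fragment, sketched here with the Hugging Face `transformers` API (an assumption; the original implementation framework is not specified, and unlisted parameters fall back to RoBERTa defaults):

```python
from transformers import RobertaConfig, RobertaModel

config = RobertaConfig(
    vocab_size=500,          # BPE vocabulary over SMILES strings
    num_hidden_layers=6,
    hidden_size=384,
    num_attention_heads=8,
    intermediate_size=1024,  # feed-forward layer size
)
model = RobertaModel(config)
```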

3. Ultra-fast retrieval system with 100% recall rate

The neural network produces molecular embeddings as float16 vectors in a 60-dimensional space; the dimensionality is tuned to balance quality and retrieval speed. With these compact representations, we construct an index of a given molecular library and store it on SSD disks. To search, we compute the cosine similarity of a query molecule's embedding to that of every indexed molecule in a brute-force fashion. Cosine similarity is implemented in C++ as matrix multiplication using Single-Instruction-Multiple-Data (SIMD) instructions. The main advantage of this method is 100% recall: if an item similar to the query exists in the index, it is guaranteed to be found. This is crucial for large-scale libraries with billions of molecules. The search infrastructure consists of 27 nodes with 2-core CPUs and 4 GB RAM, connected to 270 SSD disks of 21 GB each with 1 Gbit/s throughput per disk. This infrastructure allows us to cost-efficiently screen Enamine REAL Space within seconds.
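The retrieval step can be sketched in NumPy, whose matrix multiplication dispatches to SIMD-optimized BLAS kernels, analogous to the C++ implementation described (function names are illustrative):

```python
import numpy as np

def build_index(embeddings):
    """Store L2-normalized float16 embeddings; cosine similarity then
    reduces to a plain dot product (matrix multiplication)."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    return (embeddings / norms).astype(np.float16)

def search(index, query, k=5):
    """Exhaustive brute-force scan: 100% recall by construction."""
    q = (query / np.linalg.norm(query)).astype(np.float16)
    # SIMD-backed BLAS matmul; accumulate in float32 for accuracy
    scores = index.astype(np.float32) @ q.astype(np.float32)
    top = np.argsort(-scores)[:k]
    return top, scores[top]

# Toy library of 1000 random 60-d embeddings
rng = np.random.default_rng(1)
library = rng.standard_normal((1000, 60))
index = build_index(library)
ids, sims = search(index, library[42], k=5)
print(ids[0])  # 42: an exact copy of the query is always retrieved
```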

4. Finding novel SETDB1 binders using BIOPTIC

All known SETDB1 binders will be used as query molecules to search the 40-billion-molecule Enamine REAL Space for structurally dissimilar molecules with similar binding activity to the queries. The top 100,000 ranking hits with sufficient dissimilarity (ECFP4 Tanimoto < 0.4) to any known SETDB1 binder will be extracted. This set will be passed through the REOS filter and the required 550-Dalton MW filter to remove undesirable compounds. The remaining compounds will be clustered into 1,000 clusters, and the best-scoring compound from each cluster will be retained. This 1,000-compound hit list will be checked for solubility < 100 µM using Simulations Plus software.
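The dissimilarity filter in this pipeline can be illustrated with a toy Tanimoto computation. In practice the ECFP4 fingerprints would come from RDKit; here fingerprints are plain Python sets of on-bit indices, and both function names are illustrative:

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    inter = len(fp_a & fp_b)
    union = len(fp_a) + len(fp_b) - inter
    return inter / union if union else 0.0

def novelty_filter(scored_hits, known_binder_fps, threshold=0.4):
    """Keep hits whose similarity to every known binder is below threshold."""
    return [name for name, fp in scored_hits
            if all(tanimoto(fp, known) < threshold for known in known_binder_fps)]

# Toy example: hit_a shares bits with a known binder, hit_b does not
known = [{2, 3, 4, 7}]
hits = [("hit_a", {1, 2, 3, 4}), ("hit_b", {10, 11, 12})]
print(novelty_filter(hits, known))  # ['hit_b']
```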

More method details:  https://www.bioptic.io/post/bioptic-a-target-agnostic-potency-based-small-molecules-search-engine

 

What makes your approach stand out from the community? (<100 words)

We developed a single global model for target-agnostic, potency-based molecule search that finds structurally dissimilar molecules with similar activities.

We extensively benchmarked our model against several state-of-the-art (SOTA) models in hit-identification settings, comparing their ability to find novel, structurally dissimilar molecules with similar activities.

We developed an efficient retrieval system based on processor-optimized SIMD instructions and deployed it on a cluster of machines connected to a matrix of SSD disks, making it feasible to scan 40-billion-molecule libraries in seconds.

 

Method Name
BIOPTIC: A TARGET-AGNOSTIC POTENCY-BASED SEARCH ENGINE FOR SMALL MOLECULES
Commercial software packages used

N/A

Free software packages used

RDKit

REOS

RoBERTa

Relevant publications of previous uses by your group of this software/method

The method is published in the following blog post:

https://www.bioptic.io/post/bioptic-a-target-agnostic-potency-based-small-molecules-search-engine

We have also submitted the manuscript to arXiv; it should be available there soon.