Jeff Messer, Director of Analytics at GSK, presented insights on integrating AI and machine learning (ML) with DNA-encoded library (DEL) technology to enhance drug discovery. DEL technology is a screening method that uses combinatorial chemistry to create vast libraries of DNA-tagged compounds, facilitating early hit identification. GSK has developed over 100 libraries with 3.5 billion structures, generating 10 clinical candidates using this technology.
The target-based screening process involves selecting targets with affinity tags, separating the compounds that bind, and decoding them via their DNA barcodes. Enrichment analysis using Poisson statistics identifies binders, which are then grouped into chemotypes for further synthesis and testing without DNA tags.
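The enrichment step can be illustrated with a minimal sketch. It assumes a simple null model in which sequencing reads are spread uniformly over the library, so the expected count per compound is total reads divided by library size. The talk summary does not give GSK's actual model; the function names, the example counts, and the significance threshold `alpha` are all hypothetical.

```python
import math

def poisson_tail(k, lam):
    """Return P(X >= k) for X ~ Poisson(lam), with lam > 0."""
    if k <= 0:
        return 1.0
    # First tail term pmf(k), computed in log space to avoid overflow.
    term = math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
    total = 0.0
    i = k
    while True:
        total += term
        i += 1
        term *= lam / i            # pmf(i) derived from pmf(i - 1)
        if term <= total * 1e-17:  # further terms are below float precision
            break
    return min(total, 1.0)

def enriched(counts, total_reads, library_size, alpha=1e-6):
    """Score each compound against a uniform-background Poisson null.

    Assumption: expected reads per compound = total_reads / library_size.
    Returns {compound: (enrichment_ratio, p_value)} and the sorted hit list.
    """
    lam = total_reads / library_size
    results = {}
    for compound, observed in counts.items():
        results[compound] = (observed / lam, poisson_tail(observed, lam))
    hits = sorted(c for c, (_, p) in results.items() if p < alpha)
    return results, hits

# Hypothetical read counts keyed by building-block combination.
counts = {"A-B-C": 25, "A-B-D": 3, "E-F-G": 1}
results, hits = enriched(counts, total_reads=2_000_000, library_size=1_000_000)
print(hits)
```

In practice the expected count would more likely come from a no-target control selection than from a uniform assumption, and the threshold would need to account for testing billions of compounds at once.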
Messer highlighted the integration of ML, where models were trained on DEL selection data to recognise binding patterns and predict new compounds from internal or external sources. This approach scaled efficiently to billions of molecules and aided in triaging chemistry investments. Collaborations such as X-Chem/Google and SGC/Google demonstrated ML models achieving varying hit confirmation rates, including on unprecedented targets without known binders. Challenges included noisy data, overfitting, and difficulty generalising beyond known chemistries.
GSK's ML strategy focuses on exploring chemical space beyond DEL monomers. To reduce overfitting, it employs stratification, clustering training examples by their enriched building blocks. Hold-out libraries and model ensembling further improve generalisation and prediction quality. In practice, ML predictions correlated with traditional DEL hit confirmation rates: strong DEL hits led to successful ML follow-ups, while weak DEL results led to weak ML follow-ups. About half of GSK's projects yielded multiple hit series, while others showed weak or no binding, reflecting typical challenges with low-tractability targets.
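One possible reading of the stratification and ensembling ideas is sketched below. It assumes that "clustering by enriched building blocks" means keeping every compound that shares a building block on the same side of a train/validation split, and that ensembling means averaging the scores of several models. Neither detail is confirmed by the talk summary, and the function names and example records are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def group_folds(examples, group_key, n_folds):
    """Assign whole groups to folds so that no group spans two folds."""
    groups = defaultdict(list)
    for ex in examples:
        groups[group_key(ex)].append(ex)
    folds = [[] for _ in range(n_folds)]
    # Greedy balancing: place the largest groups first, each into the
    # currently smallest fold.
    for _, members in sorted(groups.items(), key=lambda kv: (-len(kv[1]), kv[0])):
        min(folds, key=len).extend(members)
    return folds

def ensemble_mean(per_model_scores):
    """Average the scores several models assign to the same compounds."""
    return [mean(scores) for scores in zip(*per_model_scores)]

# Hypothetical records: (compound id, enriched building block, binder label).
examples = [
    ("c1", "BB1", 1), ("c2", "BB1", 1), ("c3", "BB2", 0),
    ("c4", "BB3", 1), ("c5", "BB3", 0), ("c6", "BB4", 0),
]
folds = group_folds(examples, group_key=lambda ex: ex[1], n_folds=2)

# Scores from two hypothetical models for the same two compounds.
combined = ensemble_mean([[0.25, 0.75], [0.75, 0.25]])
```

Because each building block appears in only one fold, a model validated on a held-out fold is scored on chemistry it has not seen during training, which is a stricter check than a random split.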
A case study from a global health project illustrated the importance of training strategy. Traditional ML training with random splits yielded no hits, whereas using biological replicates as hold-out sets and avoiding overtraining identified two hit series. GSK aims to apply DEL ML to less tractable targets, which produce limited training data, and continues to refine its methods to improve hit identification and expand chemical diversity beyond conventional DEL chemistry.
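A common way to avoid overtraining against a hold-out set is early stopping, sketched below. It assumes the hold-out loss is measured after each epoch on a biological replicate kept out of training. Whether GSK used this exact rule is not stated in the summary; the function name, the `patience` parameter, and the loss values are hypothetical.

```python
def early_stop_epoch(holdout_losses, patience=2):
    """Return the index of the epoch whose model should be kept.

    Training stops once the hold-out loss has failed to improve for
    `patience` consecutive epochs; later epochs are never inspected.
    """
    best_epoch, best_loss = 0, float("inf")
    for epoch, loss in enumerate(holdout_losses):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            break
    return best_epoch

# Hypothetical per-epoch losses on a held-out biological replicate.
losses = [0.90, 0.70, 0.60, 0.65, 0.70, 0.50]
kept = early_stop_epoch(losses, patience=2)
```

Here the loss stops improving after the third epoch, so training halts two epochs later and the third epoch's model is kept.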