Matt Pharris, Senior Scientist at UCB, discussed how UCB is using multiple data types to enable virtual screening efforts. He specifically focused on how small molecule structure data and cell painting data improve virtual screening models and bioactivity predictions. 

The virtual screening models work well because they have been carefully trained on large volumes of data pooled from public and internal sources. They are also highly convenient: a scientist can sketch a molecule, and the model uses this input to create an in-silico representation of it.

However, these in-silico representations are imperfect. They carry human bias in how small molecule structures are drawn, and a structure on its own cannot convey how the molecule will perform in cell-based assays.

Pharris highlighted the cell painting assay, a high-content, multiplexed cell imaging technique that uses five generic stains to capture subcellular structures. It generates hundreds of features per cell, forming rich morphological profiles suitable for machine learning. He also stressed the significance of the JUMP Cell Painting dataset, which includes 120,000 compound perturbations and 8,000 genetic perturbations and enables advanced machine learning applications.
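To make the step from per-cell features to a morphological profile concrete, here is a minimal sketch of one common convention: take the per-feature median across the cells in a well, then scale each feature against control wells using the median and the median absolute deviation. The talk did not describe UCB's actual pipeline, so the function names `well_profile` and `robust_z` and the toy numbers are assumptions made for illustration only.

```python
from statistics import median


def well_profile(cells):
    """Collapse per-cell feature vectors into one per-well profile.

    `cells` is a list of equal-length feature lists, one per cell.
    The per-feature median is used because it is robust to outlier cells.
    """
    return [median(column) for column in zip(*cells)]


def robust_z(profile, control_profiles):
    """Scale a profile feature by feature against control wells (hypothetical helper)."""
    scaled = []
    for i, value in enumerate(profile):
        control_values = [p[i] for p in control_profiles]
        center = median(control_values)
        mad = median(abs(c - center) for c in control_values)
        # 1.4826 makes the MAD comparable to a standard deviation for normal data.
        scale = 1.4826 * mad
        # A feature that does not vary across controls carries no scale; map it to 0.
        scaled.append((value - center) / scale if scale > 0 else 0.0)
    return scaled


# Toy example: three cells with two features each (made-up values).
treated_cells = [[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]]
controls = [[1.0, 10.0], [2.0, 12.0], [3.0, 14.0]]
profile = well_profile(treated_cells)
normalized = robust_z(profile, controls)
```

Real profiles have hundreds of features per cell rather than two, but the aggregation and scaling steps keep the same shape.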

Three key machine learning approaches were outlined in the presentation. First, supervised classification predicts mechanisms of action from phenotypic profiles. Second, unsupervised clustering projects phenotypic data into lower dimensions to differentiate molecules, though with limited accuracy. Finally, variational autoencoders (VAEs) compress high-dimensional data and integrate multiple data types (e.g., structure, phenotype, assay results) into joint embeddings, thereby improving predictions. The VAEs also improve the identification of phenocopy relationships and support multi-assay predictions.
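As an illustration of the first approach, the sketch below classifies a phenotypic profile by mechanism of action using a nearest-centroid rule with cosine similarity. This is a deliberately simple stand-in, not the classifier used in the talk; the labels `moa_a` and `moa_b`, the function names, and the two-feature profiles are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def fit_centroids(profiles, labels):
    """Average the training profiles that share a mechanism-of-action label."""
    grouped = {}
    for profile, label in zip(profiles, labels):
        grouped.setdefault(label, []).append(profile)
    return {
        label: [sum(column) / len(column) for column in zip(*rows)]
        for label, rows in grouped.items()
    }


def predict_moa(profile, centroids):
    """Return the label whose centroid is most similar to the profile."""
    return max(centroids, key=lambda label: cosine(profile, centroids[label]))


# Toy training set: made-up profiles with made-up labels.
train_profiles = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
train_labels = ["moa_a", "moa_a", "moa_b", "moa_b"]
centroids = fit_centroids(train_profiles, train_labels)
prediction = predict_moa([0.8, 0.2], centroids)
```

The same `cosine` function could also be used to rank compounds by similarity to a reference profile, which is one simple way to look for phenocopy relationships.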

Pharris added that the models are trained on both structure and cell painting data but are designed to function even when one data type is missing. This adaptability allows predictions for molecules without in-house cell painting data, thereby preserving the convenience of virtual screening.
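The talk summary does not say how the models cope with a missing data type. One approach used in the multimodal VAE literature is product-of-experts fusion, where each available modality contributes a Gaussian estimate of the latent embedding and absent modalities are simply left out. The sketch below assumes that approach purely for illustration; `fuse_gaussians` and the example encoder outputs are hypothetical.

```python
def fuse_gaussians(experts, latent_dim):
    """Fuse diagonal Gaussian experts with a standard-normal prior.

    `experts` is a list of (mean, variance) pairs, one per available
    modality. A missing modality is handled by omitting its pair.
    Returns the fused (mean, variance) of the joint embedding.
    """
    # Start from the prior N(0, 1): precision 1, and mean 0 adds nothing.
    precision = [1.0] * latent_dim
    weighted_mean = [0.0] * latent_dim
    for mean, variance in experts:
        for i in range(latent_dim):
            precision[i] += 1.0 / variance[i]
            weighted_mean[i] += mean[i] / variance[i]
    fused_variance = [1.0 / p for p in precision]
    fused_mean = [m * v for m, v in zip(weighted_mean, fused_variance)]
    return fused_mean, fused_variance


# Hypothetical encoder outputs for one molecule (made-up numbers).
structure_expert = ([1.0, -1.0], [0.5, 0.5])
phenotype_expert = ([2.0, 0.0], [0.25, 1.0])

# Both modalities available.
joint_mean, joint_variance = fuse_gaussians([structure_expert, phenotype_expert], 2)

# Cell painting data missing: fuse the structure expert alone.
structure_only_mean, structure_only_variance = fuse_gaussians([structure_expert], 2)
```

With fewer experts the fused variance is larger, so the embedding stays usable while reflecting the lower confidence that comes from having structure alone.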

According to the results presented, models that incorporate cell painting data outperform those using structure alone. Even when cell painting data is withheld during testing, models trained with it retain predictive power for many assays.

Despite the promise of these models, the approach faces challenges: false positive rates are high because of the dimensionality of the data, and not all assays benefit equally from cell painting data. Looking ahead, Pharris and his team are focused on methods that avoid the need for in-house cell painting while still improving hit nomination and screening.