Y. Shen
Texas A&M University,
United States
Keywords: compound-protein interaction, de novo protein design, deep learning, model interpretability, deep generative models
Summary:
Rapid quantification of compound-protein interactions (CPI) is an important but daunting task for drug discovery, especially considering the enormous chemical and proteomic spaces. There is a critical need for computational methods for CPI prediction, yet methods combining wide applicability, high accuracy, and mechanistic interpretability have been lacking. We have developed DeepAffinity, which integrates knowledge- and learning-based approaches to address this challenge using chemical identities and protein sequences alone. Specifically, we propose a semi-supervised deep learning model that unifies recurrent and convolutional neural networks to exploit both unlabeled and labeled data, jointly encoding molecular representations and predicting CPI. Furthermore, attention mechanisms are embedded in our model for interpretability, as illustrated in case studies that predict selective drug-target interactions and explain them in terms of binding sites or origins of selectivity. Meanwhile, recent machine learning progress in compound-protein binding affinity prediction has focused on accuracy but leaves much to be desired in interpretability. Using the molecular contacts underlying affinities as ground truth, our large-scale interpretability assessment finds commonly used attention mechanisms inadequate. We thus formulate a hierarchical multi-objective learning problem in which predicted contacts form the basis for predicted affinities. We further design a physics-inspired deep relational network, DeepRelations, with an intrinsically explainable architecture. Specifically, various atomic-level contacts or “relations” lead to molecular-level affinity prediction, and the embedded attentions are regularized with predicted structural contexts and supervised with partially available training contacts. DeepRelations shows superior interpretability to state-of-the-art models without compromising affinity prediction.
It also represents the first dedicated model development and systematic model assessment for interpretable machine learning of compound-protein affinity. Lastly, we report our recent progress in developing novel deep generative models for de novo protein design toward desired structural folds. We have constructed a low-dimensional and generalizable representation of the fold space, exploited sequence data with and without paired structures, and developed an ultra-fast fold predictor that serves as an oracle providing feedback to the model. The resulting semi-supervised gcWGAN (guided, conditional, Wasserstein Generative Adversarial Network) is assessed by the oracle over 100 novel folds not in the training set and is found to achieve higher yields and cover more target folds than a competing data-driven method (cVAE). Assessed by a structure predictor over representative novel folds, including one novel fold that is not even among the basis folds, gcWGAN designs are found to have comparable or better fold accuracy, yet much greater sequence diversity and novelty, than cVAE designs. Furthermore, the ultra-fast gcWGAN is shown to provide useful seed designs for, and thereby accelerate, a physics-based de novo protein design method (Rosetta).