Kousik Kundu
Ph.D. Thesis, University of Freiburg, April 2015
Protein-protein interactions (PPIs) are among the most essential cellular processes in eukaryotes and control many important biological activities, such as signal transduction, differentiation, growth, cell polarity, and apoptosis. Many PPIs in cellular signaling are mediated by modular protein domains. Peptide recognition modules (PRMs) are an important subclass of modular protein domains that specifically recognize short linear peptides to facilitate their biological functions. Hence, it is important to understand the mechanisms by which hundreds of modular domains specifically bind to their target peptides in a complex cellular environment. In recent years, unprecedented progress has been made in high-throughput technologies to describe the binding specificities of a number of modular protein domain families. Given the high binding specificity of PRMs, in silico prediction of their cognate partners is therefore of great interest. In the first part of this thesis, we describe the main high-throughput technologies (e.g., microarray and phage display) that are widely used for defining the binding specificity of PRMs. Several computational methods have been published for the prediction of domain-peptide interactions. Here, we provide a comprehensive review of these methods and their applications. We also describe the major drawbacks of these existing tools (e.g., the linearity problem, the peptide alignment problem, and the data-imbalance problem) that are successfully addressed in our study. In the second part of this thesis, we present three methods for predicting domain-peptide interactions mediated by three diverse PRM families (i.e., SH2, SH3, and PDZ domains). In order to circumvent the linearity problem, our methods use efficient kernel functions, which exploit higher-order dependencies between amino acid positions. For the prediction of SH2-peptide interactions, polynomial kernels are used to train the classifiers. 
In addition, we show how to handle the data-imbalance problem by using an efficient semi-supervised technique. For the prediction of SH3-peptide interactions, graph kernels are used for training the classifiers. The graph kernel feature representation allows us to include the physico-chemical properties of each amino acid in the peptides, which increases the generalization capacity of the classifier. By using this kernel function, we were able to eliminate the need for an initial peptide alignment. This matters because aligning the proline-rich peptides targeted by SH3 domains is a hard task, and an error-prone alignment can severely affect the predictive performance of the classifier. Moreover, we developed a generative approach for refining the high-confidence negative data. In the case of PDZ-peptide interactions, we cluster hundreds of PDZ domains from different organisms (human, mouse, fly, and worm) based on their binding specificity, and build a single comprehensive model for each set of multiple PDZ domains. In this way, we show that the domain coverage can be increased by using an accurate clustering technique. For training the classifier, a Gaussian kernel function is used. As for SH2-peptide interactions, a semi-supervised technique was applied to generate high-confidence negative data. In the third part of this thesis, we describe the applications and performance evaluations of our methods. We compared our methods with several other existing tools and achieved much higher performance, as measured by sensitivity, specificity, precision, AUC PR, and AUC ROC. Our methods were further evaluated on various experimentally verified datasets, where they outperformed the state-of-the-art approaches. To uncover novel and biologically relevant interactions, we performed a genome-wide prediction. Furthermore, a term-centric enrichment analysis was performed to unveil novel functionalities of the predicted interactions. 
In the last part of this thesis, we introduce a new and efficient web server containing three tools (SH2PepInt, SH3PepInt, and PDZPepInt) for the prediction of modular domain-peptide interactions. Currently, we offer 51 and 69 single-domain models for SH2 and SH3 domains, respectively, and 43 multiple-domain models, which cover 227 domains, for PDZ domains across several organisms. In summary, this thesis presents machine learning methods for predicting the binding peptides of three diverse PRM families, where the training data was derived from various high-throughput experiments. Most importantly, this thesis addresses the major computational challenges in the field of modular domain-peptide interactions. We offer the largest set of models to date for the prediction of modular domain-mediated interactions.
Kousik Kundu, Martin Mann, Fabrizio Costa, Rolf Backofen
In: Bioinformatics, 2014, 30(18), 2668-2669
SUMMARY: MoDPepInt (Modular Domain Peptide Interaction) is a new easy-to-use web server for the prediction of binding partners for modular protein domains. Currently, we offer models for SH2, SH3 and PDZ domains via the tools SH2PepInt, SH3PepInt and PDZPepInt, respectively. More specifically, our server offers predictions for 51 SH2 human domains and 69 SH3 human domains via single domain models, and predictions for 226 PDZ domains across several species, via 43 multidomain models. All models are based on support vector machines with different kernel functions ranging from polynomial, to Gaussian, to advanced graph kernels. In this way, we model non-linear interactions between amino acid residues. Results were validated on manually curated datasets achieving competitive performance against various state-of-the-art approaches. Availability and implementation: The MoDPepInt server is available under the URL http://modpepint.informatik.uni-freiburg.de/ CONTACT: backofen@informatik.uni-freiburg.de Supplementary information: Supplementary data are available at Bioinformatics online.
Kousik Kundu, Rolf Backofen
In: BMC Genomics, 2014, 15(Suppl 1), S5
BACKGROUND: PDZ domains are among the most promiscuous protein recognition modules; they bind short linear peptides and play an important role in cellular signaling. Recently, a few high-throughput techniques (e.g. protein microarray screen, phage display) have been applied to determine the in vitro binding specificity of PDZ domains. Currently, many computational methods are available to predict PDZ-peptide interactions, but they often provide domain-specific models and/or have limited domain coverage. RESULTS: Here, we composed the largest set of PDZ domains derived from human, mouse, fly and worm proteomes and defined binding models for PDZ domain families to improve the domain coverage and prediction specificity. For that purpose, we first identified a novel set of 138 PDZ families, comprising 548 PDZ domains from the aforementioned organisms, based on efficient clustering according to their sequence identity. For 43 PDZ families, covering 226 PDZ domains with available interaction data, we built specialized models using a support vector machine approach. The advantage of family-wise models is that they can also be used to determine the binding specificity of a newly characterized PDZ domain with sufficient sequence identity to the known families. Since most current experimental approaches provide only positive data, we have to cope with the class imbalance problem. Thus, to enrich the negative class, we introduced a powerful semi-supervised technique to generate high-confidence non-interaction data. We report competitive predictive performance with respect to state-of-the-art approaches. CONCLUSIONS: Our approach has several contributions. First, we show that domain coverage can be increased by applying an accurate clustering technique. Second, we developed an approach based on a semi-supervised strategy to get high-confidence negative data. Third, we allowed high-order correlations between the amino acid positions in the binding peptides. 
Fourth, our method is general enough to be easily applicable to other peptide recognition modules such as SH2 domains. Finally, we performed a genome-wide prediction for 101 human and 102 mouse PDZ domains and uncovered novel interactions with biological relevance. We make all the predictive models and genome-wide predictions freely available to the scientific community.
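The family-building step described in this abstract (grouping domains by sequence identity) can be illustrated with a minimal sketch. It is not the authors' procedure: the single-linkage rule, the identity measure over pre-aligned equal-length sequences, the 0.7 threshold, and the toy sequences are all assumptions chosen for illustration.

```python
from __future__ import annotations

from itertools import combinations


def sequence_identity(a: str, b: str) -> float:
    """Fraction of identical positions in two pre-aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cluster_by_identity(domains: dict[str, str], threshold: float) -> list[set[str]]:
    """Single-linkage clustering: merge two domains when identity >= threshold."""
    parent = {name: name for name in domains}

    def find(name: str) -> str:
        # Follow parent links to the cluster representative (with path halving).
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in combinations(domains, 2):
        if sequence_identity(domains[a], domains[b]) >= threshold:
            parent[find(a)] = find(b)

    families: dict[str, set[str]] = {}
    for name in domains:
        families.setdefault(find(name), set()).add(name)
    return list(families.values())


# Hypothetical toy sequences, not real PDZ domains.
toy_domains = {
    "dom_A": "GLGFSIVGGK",
    "dom_B": "GLGFSIVGGR",  # differs from dom_A at one position
    "dom_C": "PLGITLKDNE",  # shares only two positions with dom_A
}
families = cluster_by_identity(toy_domains, threshold=0.7)
```

With these toy inputs, dom_A and dom_B end up in one family and dom_C forms its own. One model per family could then be trained on the pooled interaction data of its members, which is the idea behind family-wise models.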
Kousik Kundu, Fabrizio Costa, Rolf Backofen
In: Bioinformatics, 2013, 29(13), i335-i343
MOTIVATION: State-of-the-art experimental data for determining binding specificities of peptide recognition modules (PRMs) is obtained by high-throughput approaches like peptide arrays. Most prediction tools applicable to this kind of data are based on an initial multiple alignment of the peptide ligands. Building an initial alignment can be error-prone, especially in the case of the proline-rich peptides bound by the SH3 domains. RESULTS: Here, we present a machine-learning approach based on an efficient graph-kernel technique to predict the specificity of a large set of 70 human SH3 domains, which are an important class of PRMs. The graph-kernel strategy allows us to (i) integrate several types of physico-chemical information for each amino acid, (ii) consider high-order correlations between these features and (iii) eliminate the need for an initial peptide alignment. We build specialized models for each human SH3 domain and achieve competitive predictive performance of 0.73 area under precision-recall curve, compared with 0.27 area under precision-recall curve for state-of-the-art methods based on position weight matrices. We show that better models can be obtained when we use information on the noninteracting peptides (negative examples), which is currently not used by the state-of-the-art approaches based on position weight matrices. To this end, we analyze two strategies to identify subsets of high-confidence negative data. The techniques introduced here are more general and hence can also be used for any other protein domains that interact with short peptides (i.e. other PRMs). AVAILABILITY: The program with the predictive models can be found at http://www.bioinf.uni-freiburg.de/Software/SH3PepInt/SH3PepInt.tar.gz. We also provide a genome-wide prediction for all 70 human SH3 domains, which can be found under http://www.bioinf.uni-freiburg.de/Software/SH3PepInt/Genome-Wide-Predictions% .tar. gz. 
CONTACT: backofen@informatik.uni-freiburg.de SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
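The two ranking metrics quoted in these abstracts (area under the ROC curve and area under the precision-recall curve) can be computed from scored peptides as in the sketch below. It is an illustration only: average precision is used here as one common step-wise estimate of the precision-recall area, which is an assumption and not necessarily the estimator used in the paper, and the labels and scores are made up.

```python
def auc_roc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values measured at the rank of each positive example.

    Tied scores are ordered arbitrarily by the sort, so results can vary with ties.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    if hits == 0:
        raise ValueError("need at least one positive example")
    return total / hits


# Made-up example: 2 binders among 8 peptides (imbalanced, as in array data).
toy_labels = [1, 0, 1, 0, 0, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.1, 0.05]
roc = auc_roc(toy_labels, toy_scores)
ap = average_precision(toy_labels, toy_scores)
```

On imbalanced data the precision-recall summary tends to be the more sensitive of the two, because a single highly ranked negative lowers precision noticeably while barely changing the ROC area.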
Kousik Kundu, Fabrizio Costa, Michael Huber, Michael Reth, Rolf Backofen
In: PLoS One, 2013, 8(5), e62732
Src homology 2 (SH2) domains are the largest family of the peptide-recognition modules (PRMs) that bind to phosphotyrosine-containing peptides. Knowledge about binding partners of SH2 domains is key for a deeper understanding of different cellular processes. Given the high binding specificity of SH2, in-silico ligand peptide prediction is of great interest. Currently, however, only a few approaches have been published for the prediction of SH2-peptide interactions. Their main shortcomings range from limited coverage, to restrictive modeling assumptions (they are mainly based on position specific scoring matrices and do not take into consideration complex amino acid inter-dependencies) and high computational complexity. We propose a simple yet effective machine learning approach for a large set of known human SH2 domains. We used comprehensive data from micro-array and peptide-array experiments on 51 human SH2 domains. In order to deal with the high data imbalance problem and the high signal-to-noise ratio, we cast the problem in a semi-supervised setting. We report competitive predictive performance w.r.t. the state of the art. Specifically, we obtain 0.83 AUC ROC and 0.93 AUC PR in comparison to 0.71 AUC ROC and 0.87 AUC PR previously achieved by the position specific scoring matrices (PSSMs) based SMALI approach. Our work provides three main contributions. First, we showed that better models can be obtained when the information on the non-interacting peptides (negative examples) is also used. Second, we improve performance when considering high-order correlations between the ligand positions, employing regularization techniques to effectively avoid overfitting issues. Third, we developed an approach to tackle the data imbalance problem using a semi-supervised strategy. Finally, we performed a genome-wide prediction of human SH2-peptide binding, uncovering several findings of biological relevance. 
We make our models and genome-wide predictions, for all the 51 SH2-domains, freely available to the scientific community under the following URLs: http://www.bioinf.uni-freiburg.de/Software/SH2PepInt/SH2PepInt.tar.gz and http://www.bioinf.uni-freiburg.de/Software/SH2PepInt/Genome-wide-predictions% .tar. gz, respectively.
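How a polynomial kernel captures correlations between ligand positions, which position specific scoring matrices ignore, can be seen in a small sketch. The one-hot encoding, the degree of 2, the constant `coef0 = 1.0`, and the toy peptides are assumptions made for illustration and are not the settings reported in the paper; plain `Y` stands in for phosphotyrosine.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(peptide):
    """Encode a peptide as a flat 0/1 vector with 20 entries per position."""
    vector = []
    for residue in peptide:
        if residue not in AMINO_ACIDS:
            raise ValueError(f"unknown residue: {residue}")
        vector.extend(1.0 if residue == aa else 0.0 for aa in AMINO_ACIDS)
    return vector


def polynomial_kernel(x, y, degree=2, coef0=1.0):
    """k(x, y) = (<x, y> + coef0) ** degree."""
    dot = sum(a * b for a, b in zip(x, y))
    return (dot + coef0) ** degree


# Hypothetical peptides of equal length that differ only at the last position.
pep_a = one_hot("EPQYEEI")
pep_b = one_hot("EPQYEEV")
k_same = polynomial_kernel(pep_a, pep_a)
k_diff = polynomial_kernel(pep_a, pep_b)
```

With one-hot vectors the dot product simply counts positions where both peptides carry the same residue. Raising it to the second power expands into products of pairs of such position matches, so a classifier using this kernel implicitly weighs pairwise residue combinations, whereas a linear model or PSSM scores each position independently.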