In silico protein science: from sequence to network

Miguel Correa Marrero

Research output: Thesis (internal PhD, WU)

Abstract

Proteins are the workhorses of the cell, carrying out virtually every biological process and exhibiting a great diversity of functions. They are the devices through which genetic information is expressed. Their study is central to biology; nevertheless, there are so many proteins that it is impractical to experimentally characterize them all. The work described in this thesis contributes both by developing computational methodology that can be used to characterize large numbers of proteins and by exploring novel biology from a data-driven perspective, proposing testable hypotheses based on computational analysis. Chapter 1 gives an overview of developments in protein science, from high-throughput data collection to making sense of this data deluge through protein bioinformatics.

In Chapter 2, we study a long-standing but understudied problem in biochemistry, the regulation of protein degradation. We approach this by training multivariate regression models to predict protein degradation rates. We carefully curated and filtered datasets of protein degradation rates to study how different intrinsic characteristics play a role in this process. We find that disordered regions may play a larger role than suggested by previous studies. Additionally, our results suggest that PEST regions, long thought to be important in protein degradation, may play a role simply due to their intrinsic disorder, rather than as a specific sequence signal.

In Chapter 3, we obtain a systems-level view of the molecular mechanisms through which phytoplasma, a bacterial pathogen, infects its host and manipulates its development. We do so by studying the network of protein-protein interactions (PPIs) between phytoplasma effectors and Arabidopsis thaliana transcriptional regulators. We find widespread interactions between effectors and plant transcription factors, especially those related to development. We observe that many unrelated effectors interact with multiple TCP transcription factors, which are at the crossroads between development, growth and immunity. Comparison with previously determined plant pathogen-host PPI networks shows that phytoplasma effectors target growth and development far more often than the effectors of other pathogens do, indicating that phytoplasmas have evolved idiosyncratic infection strategies. Our analysis and the data provide a basis for detailed studies of individual effectors.

In Chapter 4, we address the problem of predicting residue contacts in protein-protein interaction interfaces. These can be inferred by correlated mutation analysis of multiple sequence alignments of interacting homologs. This requires that interacting pairs are correctly identified across many species, since introducing non-interacting pairs will decrease predictive performance. We introduce Ouroboros, a novel algorithm that combines coevolutionary analysis with expectation-maximization to simultaneously infer protein-protein interaction status and residue contacts. This is done without any prior knowledge of either interactions or contacts. We show that our approach can accurately distinguish between interacting and non-interacting proteins, and that it improves the prediction of intermolecular contact residues.
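
As a rough illustration of what correlated mutation analysis measures, the sketch below computes the mutual information between one alignment column from each of two paired proteins. This is not Ouroboros and not the method used in the thesis; it is only a generic covariation score, and the function name and the four-sequence toy columns are hypothetical.

```python
from collections import Counter
from math import log2


def column_mutual_information(col_a, col_b):
    """Mutual information (in bits) between two alignment columns.

    col_a and col_b hold one residue per paired sequence, in the same order,
    so position i in both columns comes from the same species.
    """
    if len(col_a) != len(col_b):
        raise ValueError("columns must have one residue per paired sequence")
    n = len(col_a)
    count_a = Counter(col_a)              # residue counts in column A
    count_b = Counter(col_b)              # residue counts in column B
    count_ab = Counter(zip(col_a, col_b)) # joint residue-pair counts
    mi = 0.0
    for (a, b), n_ab in count_ab.items():
        p_ab = n_ab / n
        p_a = count_a[a] / n
        p_b = count_b[b] / n
        mi += p_ab * log2(p_ab / (p_a * p_b))
    return mi


# Hypothetical toy columns: in the first pair the residues change together,
# in the second pair they vary independently of each other.
covarying = column_mutual_information("KKEE", "DDRR")
independent = column_mutual_information("KKEE", "DRDR")
print(covarying, independent)
```

A high score for a column pair suggests the two positions mutate together, which is the kind of signal used to propose intermolecular contacts. If sequences from non-interacting proteins are paired in the alignment, the joint counts are diluted and the signal weakens, which is why correct pairing matters.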

In Chapter 5, we study specific regions in an important family of plant transcription factors, the MIKC MADS-box family. MIKCs are amongst the best studied plant proteins, yet they contain poorly studied, highly variable regions. To discover conserved sequence elements with potential novel functions, we use an alignment-free strategy to identify motifs in a large set of MIKCs. We find a broad set of motifs, many of which have been conserved for hundreds of millions of years across plant evolution. Amongst these, we identify the few known functional regions. Additionally, many of these motifs are specific signatures of particular MIKC subfamilies. This study serves as a guide for further characterization of these obscure regions.

The thesis is concluded in Chapter 6, where we discuss exciting challenges and opportunities in protein bioinformatics.

Original language: English
Qualification: Doctor of Philosophy
Awarding Institution
  • Wageningen University
Supervisors/Advisors
  • de Ridder, Dick, Promotor
  • Immink, Richard, Promotor
  • van Dijk, Aalt-Jan, Co-promotor
Award date: 1 Sept 2021
Place of Publication: Wageningen
Publisher
Print ISBNs: 9789463958202
DOIs
Publication status: Published - Sept 2021
