Computational approaches to discover novel enzymes for fragrance and flavour

Janani Durairaj

Research output: Thesis, internal PhD, WU


Plant specialized metabolites (SMs) are crucial to plants and to humanity, with numerous applications in food, healthcare, agriculture, and cosmetics. The enzyme families involved in producing SMs, such as the terpene synthases, are very diverse, both across and within families. Understanding and predicting compound specificity of these enzymes is critical for biotechnological applications and protein engineering. The growing availability of structure data and improved computational modelling techniques puts us in the position to use structural bioinformatics and machine learning (ML) techniques to learn patterns across all enzymes in an SM family, instead of focusing on a few structures or mutants. In this thesis I explore new algorithms and approaches to analyse datasets of SM families and take advantage of their complex structural data.

In Chapter 1 I introduce the terpene synthases and place them in context within the wider field of plant specialized metabolism. Their importance in both the plant and human worlds is discussed along with a history of the elucidation of their catalytic mechanisms via structural and mutational studies. I explore the various opportunities and challenges offered by computational techniques, found in the structural bioinformatics and ML fields, to better understand such elusive SM enzyme families. In Chapter 2 I describe the creation of a database of experimentally characterized plant sesquiterpene synthases (STSs), collected from literature studies, covering over 250 enzymes collectively responsible for the production of over a hundred sesquiterpene compounds. These proteins are analysed from a sequence perspective, leading to interesting results on previously studied motifs, as well as the conclusion that phylogeny plays a larger role in STS sequence similarity than product specificity. This further motivated the need for protein structure information, extracted using homology modelling. In Chapter 3 I put forth an analysis of STS major and minor products, demonstrating that sesquiterpenes produced by an STS tend to be derived from the same reaction path. This enabled us to simplify product prediction to parent cation prediction, where I show that ML on the modelled STS structures outperforms sequence-based approaches.

To make further use of this structural information, in Chapters 4, 5 and 6 I developed structural bioinformatics embeddings for ML applications, resulting in an embedding allowing alignment-free comparison of the topologies and shapes contained in a structure, and a multiple structure alignment algorithm for structural features. The former, termed Geometricus and presented in Chapters 5 and 6, uses a concept from computer vision called rotation invariant moments to extract and count “shape-mers”, structural analogues to sequence k-mers. The latter, Caretta, presented in Chapters 4 and 6, is a multiple structure aligner that incorporates Geometricus shape-mer counting to scale to many thousands of proteins, and includes a feedback loop between single proteins and the progressively created alignment to return accurate and high-coverage alignments. To enable downstream ML analyses, Caretta also extracts and outputs aligned feature matrices, including the moment invariants used by Geometricus as a novel feature source describing protein shape and topology.
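The shape-mer idea can be illustrated with a minimal sketch. This is not the Geometricus implementation: the function names, the fragment length k, the resolution parameter and the log-based binning are all assumptions made for illustration. The sketch slides a window over a chain of 3D coordinates, computes three rotation-invariant quantities from each fragment's second-order central moments (the trace, the sum of principal 2x2 minors and the determinant of the covariance matrix), discretizes them, and counts the resulting keys much as one would count sequence k-mers.

```python
from collections import Counter
from math import log1p


def second_order_invariants(points):
    """Return three rotation-invariant values for a set of 3D points.

    The values are the trace, the sum of principal 2x2 minors and the
    determinant of the covariance matrix of the centred coordinates.
    """
    n = len(points)
    centroid = [sum(p[axis] for p in points) / n for axis in range(3)]
    centred = [tuple(p[axis] - centroid[axis] for axis in range(3)) for p in points]

    def mu(i, j):
        # Second-order central moment between coordinate axes i and j.
        return sum(c[i] * c[j] for c in centred) / n

    xx, yy, zz = mu(0, 0), mu(1, 1), mu(2, 2)
    xy, xz, yz = mu(0, 1), mu(0, 2), mu(1, 2)

    trace = xx + yy + zz
    minors = xx * yy + xx * zz + yy * zz - xy ** 2 - xz ** 2 - yz ** 2
    det = (xx * (yy * zz - yz ** 2)
           - xy * (xy * zz - yz * xz)
           + xz * (xy * yz - yy * xz))
    return trace, minors, det


def shape_mer_counts(coords, k=4, resolution=1.0):
    """Count discretized invariant tuples over all length-k fragments.

    The binning scheme (log1p scaled by `resolution`) is an assumption
    chosen for this sketch, not a documented Geometricus setting.
    """
    counts = Counter()
    for start in range(len(coords) - k + 1):
        fragment = coords[start:start + k]
        invariants = second_order_invariants(fragment)
        # Clamp tiny negative values caused by floating-point error.
        key = tuple(int(resolution * log1p(max(v, 0.0))) for v in invariants)
        counts[key] += 1
    return counts


# Hypothetical coordinates standing in for consecutive backbone atoms.
example_coords = [
    (0.0, 0.0, 0.0), (1.5, 0.0, 0.5), (1.5, 1.5, 1.0),
    (0.0, 1.5, 1.5), (0.0, 0.0, 2.0), (1.5, 0.0, 2.5),
]
example_counts = shape_mer_counts(example_coords, k=4)
```

Because the invariants do not change when a fragment is rotated or translated, two structures can be compared through their count vectors without first superposing or aligning them, which is the property the alignment-free comparison relies on.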

This novel feature extraction and alignment approach is applied in Chapter 7 to the task of predicting STS product specificity. To increase our coverage of STS sequence and compound space we use what we learned in Chapters 2 and 3 to select and experimentally characterize over 60 new STSs. As the number of possible products precludes the classification approach in Chapter 3, I create a joint protein-compound framework combining aligned protein structural features with chemical compound features to both predict product specificity successfully and pinpoint residues involved in the formation of each sesquiterpene.

Many of the analyses and techniques used in this thesis are common across protein biology and bioinformatics. To allow life scientists to explore the interconnected properties of their protein family of interest from a variety of different perspectives, and share these findings across the web, in Chapter 8 I present Turterra, an interactive data visualization portal.

Chapter 9 concludes this thesis by describing ongoing challenges in studying SM enzyme families and their potential solutions from an ML perspective. I expand the discussion to the broader field of protein structure bioinformatics and the many opportunities it holds for enhancing our understanding of biological function.

Original language: English
Qualification: Doctor of Philosophy
Awarding Institution
  • Wageningen University
  • de Ridder, Dick, Promotor
  • van Dijk, Aalt-Jan, Co-promotor
Award date: 15 Sept 2021
Place of Publication: Wageningen
Print ISBNs: 9789463959230
Publication status: Published - Sept 2021


  • Cum laude

