For comparative genomics, relative gene order, or synteny, holds key information for assessing genomic innovations such as gene duplications, gene loss, and transpositions. While the number of reference genomes is growing exponentially, a major challenge is how to detect, represent, and visualize the synteny relations of any genes of interest effectively across a large number of genomes.
In this thesis, I present six chapters centering on a network approach for large-scale phylogenomic synteny analysis, and discuss how such a network approach can enhance our understanding of the evolutionary history of genes and genomes across broad phylogenetic groups and divergence times.
In Chapter 1, I stress that synteny information is becoming more important in the current genomic era, with its rapidly developing DNA sequencing technologies. Synteny provides another layer of data beyond sequences alone, and could potentially be put to better use to improve phylogenies. I also summarize currently available tools and give an example of popular websites for synteny detection.
In Chapter 2, I outline a procedure for synteny network analysis based on three primary steps: pairwise whole-genome comparisons, syntenic block detection and data fusion, and network visualization. Then, by comparison with a previous synteny analysis that used traditional parallel coordinate plots, I show that the network approach yields a much clearer, more robust, and more systematic graph that integrates synteny information from 101 broadly distributed species.
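To make the data fusion step concrete, the following is a minimal sketch, not the pipeline used in the thesis. It assumes that syntenic blocks have already been detected by some upstream tool and are available as lists of anchor gene pairs; the function name `build_synteny_network` and the gene identifiers are hypothetical.

```python
from collections import defaultdict


def build_synteny_network(pairwise_blocks):
    """Fuse syntenic gene pairs from many pairwise genome comparisons
    into one undirected network (adjacency sets keyed by gene ID)."""
    network = defaultdict(set)
    for block in pairwise_blocks:
        for gene_a, gene_b in block:
            if gene_a == gene_b:
                continue  # ignore self-pairs
            network[gene_a].add(gene_b)
            network[gene_b].add(gene_a)
    return dict(network)


# Hypothetical toy input: each inner list is one syntenic block,
# given as anchor gene pairs between two genomes.
blocks = [
    [("ath_G1", "bra_G1"), ("ath_G2", "bra_G2")],
    [("ath_G1", "osa_G1")],
    [("bra_G1", "osa_G1")],
]
net = build_synteny_network(blocks)
```

In this representation each node is a gene and each edge records that two genes were found as anchors in the same syntenic block, so results from any number of pairwise comparisons accumulate in a single graph.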
In Chapter 3, we analyzed synteny networks of the entire MADS-box transcription factor gene family from fifty-one completed plant genomes. We applied a k-clique percolation method to cluster the synteny network. We found lineage-specific clusters that derive from transposition events for the regulators of floral development (APETALA3 and PI) and flowering time (FLC) in the Brassicales, and for the regulators of root development (AGL17) in Poales. We also visualized large differences in synteny properties between Type I and Type II MADS-box genes. We identified two large gene clusters that jointly encompass many key phenotypic regulatory Type II MADS-box gene clades (SEP1, SQUA, TM8, SEP3, FLC, AGL6 and TM3). This allows for a better understanding of how evolution has acted on a key regulatory gene family in the plant kingdom.
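The idea behind k-clique percolation can be illustrated with a naive sketch. This is an assumption-laden toy version suitable only for small graphs, not the software used in the thesis; the function name `percolate_k_cliques` and the node labels are hypothetical. Two k-cliques are treated as adjacent when they share k-1 nodes, and a cluster is the union of all k-cliques connected through such adjacencies.

```python
from itertools import combinations


def percolate_k_cliques(adj, k):
    """Naive clique percolation on an undirected graph given as
    a dict mapping each node to the set of its neighbours."""
    nodes = sorted(adj)
    # Enumerate every k-clique by brute force (small graphs only).
    cliques = [
        frozenset(c)
        for c in combinations(nodes, k)
        if all(v in adj[u] for u, v in combinations(c, 2))
    ]
    parent = list(range(len(cliques)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge cliques that overlap in at least k-1 nodes.
    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) >= k - 1:
            parent[find(i)] = find(j)

    groups = {}
    for i, clique in enumerate(cliques):
        groups.setdefault(find(i), set()).update(clique)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical toy network: two triangles sharing an edge,
# plus a separate triangle attached by a single edge.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("b", "d"), ("c", "d"),
         ("d", "x"), ("x", "y"), ("x", "z"), ("y", "z")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

clusters = percolate_k_cliques(adj, 3)
```

With k = 3, the single edge between `d` and `x` is not part of any triangle, so the two groups stay separate. Nodes that belong to no k-clique are left unclustered, and a node may belong to more than one cluster.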
In Chapter 4, we performed synteny network analysis of the LEA gene family, which includes eight different subfamilies (LEA_1 to LEA_6, SMP, and DHN) and has a relatively chaotic classification. Synteny clusters provide a better picture of genomic innovations and functional diversification. For example, recurrent tandem duplications contributed to the expansion of the LEA_2 family, whereas synteny and protein sequence were highly conserved during the evolution of LEA_5.
In Chapter 5, instead of analyzing a particular gene family, I scale up the analysis to all genes from all available genomes across kingdoms over significant evolutionary timescales. We used the available genomes of 87 mammals and 107 flowering plants. We first compared synteny percentage with popular genome metrics such as BUSCO and N50, which revealed conservation and variation of genomic architecture across kingdoms. We then characterized and compared the properties of the whole networks, using degree distributions and clustering results. Through phylogenomic profiling of the size, degree, and composition of all clusters, we identified many genomic innovations (i.e., duplications, gene transpositions, and gene loss) at the individual gene level in the tested mammal and angiosperm genomes.
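Two of the quantities mentioned above can be sketched briefly. The snippet below uses the standard definition of N50 (the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the assembly) and counts how many nodes have each degree in a network stored as adjacency sets. The function names and toy values are hypothetical and do not come from the thesis.

```python
from collections import Counter


def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("lengths must be non-empty")


def degree_distribution(adj):
    """Map each degree to the number of nodes having that degree."""
    return Counter(len(neighbours) for neighbours in adj.values())


# Hypothetical toy values for illustration only.
assembly_n50 = n50([80, 70, 50, 40, 30, 20])
toy_network = {"g1": {"g2", "g3"}, "g2": {"g1"}, "g3": {"g1"}}
degrees = degree_distribution(toy_network)
```

For the toy assembly the total length is 290, and the two longest sequences (80 and 70) already cover more than half of it, so the N50 is 70.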
In Chapter 6, I summarize the merits of taking a network-based approach to synteny comparisons, and discuss current clustering methods for synteny data. I also note several weaknesses that could be addressed in future work.
|Qualification||Doctor of Philosophy|
|Award date||17 Sept 2018|
|Place of Publication||Wageningen|
|Publication status||Published - 2018|