Comparative genomics of the relationship between gene structure and expression

X. Ren

Research output: Thesis, internal PhD, WU


The relationship between the structure of genes and their expression is a relatively new aspect of genome organization and regulation. With more genome sequences and expression data becoming available, bioinformatics approaches can help to further elucidate the relationships between gene structure and gene expression. This will contribute to our understanding of a yet deeper level of gene regulation in higher eukaryotes. This thesis focuses on two issues of genome organization in relation to expression. The genomic configuration involved in coexpression of neighboring genes is investigated (chapters 3 + 4) and the genome-wide relationships between structural parameters of a gene and its expression are analyzed (chapters 5 + 6). A short introduction (chapter 1) outlines the motivation and structure of this thesis. This is followed by an overview of issues that need to be considered in the study of gene and genome structure in relation to gene expression (chapter 2). DNA configuration in the nucleus is summarized, and concepts such as gene, chromatin and higher order domains are presented in the context of the measurement of gene expression and gene regulation.
Special attention is given to the characteristics and functions of introns in the genomes of higher eukaryotes. Expression of genes in eukaryotic genomes is known to cluster in domains, but domain size is generally loosely defined and highly variable. The concept of a local coexpression domain is introduced and defined as a set of physically adjacent genes that are highly coexpressed (chapter 3). The Arabidopsis thaliana genome was analyzed for the presence of such local coexpression domains and their functional characteristics were investigated. Two sources of expression data were used: public domain data from the Massively Parallel Signature Sequencing (MPSS) repository, which cover a range of different experimental conditions, organs, tissues and cells, and microarray data (Affymetrix) from a detailed analysis of gene expression in root. With these expression data, we identified 689 (MPSS) and 1481 (microarray) local coexpression domains consisting of 2 to 4 genes with a pair-wise Pearson's correlation coefficient larger than 0.7. These numbers are about 2- to 5-fold higher than the numbers expected by chance on the basis of genome randomizations. A small (5-10%) yet significant fraction of genes in the Arabidopsis genome is therefore organized into local coexpression domains. These local coexpression domains were apparently randomly distributed over the genome. Genes in such local domains were for the major part not categorized in the same functional category (GOslim). Neither tandem duplication of genes, nor a shared promoter sequence, nor gene distance fully explained the occurrence of coexpression of genes in such chromosomal domains. This indicates that other parameters in genes or gene positions are important to establish coexpression of genes in local domains of Arabidopsis. The analytical approach was extended to the analysis of the occurrence of local coexpression domains in the genome of rice (Oryza sativa), the monocotyledonous model plant (chapter 4).
Also in the rice genome, there is a small, yet significant number of local coexpression domains that for the major part were not categorized in the same functional category (GOslim). The various configuration parameters studied could not fully explain the occurrence of local coexpression domains. Coexpression is therefore thought to be regulated at the level of chromatin structure. The characteristics of the local coexpression domains in rice are strikingly similar to such domains in the Arabidopsis genome. Yet, no microsynteny between local coexpression domains in Arabidopsis and rice could be identified (chapter 4). Although the rice genome is not yet as extensively annotated as the Arabidopsis genome, the lack of conservation of local coexpression domains indicates that such domains have not played a major role in evolution. In chapter 5, the relationships between the structure of a primary transcript and the expression level of the gene were investigated to identify the parameters and mechanisms that have helped shape such relationships. In both monocotyledonous rice and dicotyledonous Arabidopsis, highly expressed genes were shown to have more and longer introns, as well as a larger primary transcript, than lowly expressed genes. It is concluded that more highly expressed genes tend to be less compact than less highly expressed genes. In animal genomes, the reverse has been reported. Although the length differences in plant genes are much smaller than in animals, these findings indicate that plant genes are in this respect different from animal genes. Explanations for the relationship between gene configuration and gene expression in animals may be (or may have been) less important in plants.
We speculate that selection, if any, on genome configuration has taken a different turn after the divergence of plants and animals. To exclude the possibility that methodological differences were the reason for the reported differences between plant and animal gene structure and expression relationships, a comparative genomics study of five widely diverged genomes was undertaken (chapter 6). The relationships between gene structure and gene expression were analyzed for five genomes (Arabidopsis, rice, worm, mouse, human), using public domain MPSS and Affymetrix microarray (for worm) expression data sets that cover a wide variety of tissues and conditions. Five different parameters of gene structure were examined with the help of rank-based methods: the number of introns, the total length of introns, the combined length of the untranslated regions, the length of the coding sequence and the total length of the primary transcript. In addition, the breadth of expression was evaluated. The methods of analysis were identical for all genomes considered. It was found that tissue-specific genes, defined as genes that are expressed in only one (or at most a few) tissues/conditions, are among the more compact genes in all genomes evaluated. Moreover, in plants the more highly expressed genes tend to be longer and less compact than the less highly expressed genes, whereas in the mammalian genomes analyzed the trend is the opposite. Worm takes an intermediate position. The genomes differ markedly in the details of the relationship between expression and structure for the genes that are in the middle class of expression level. As the major difference in genome configuration is the absolute length of introns, possible explanations for the contrasting trends in plant and mammalian genomes question the role and evolutionary history of introns. Possibly there is a threshold of intron number and/or size upon which selection acts, and this threshold differs between genomes.
Alternatively, some groups of plant introns may have been introduced in plant genomes well after the split between animals and plants. The results of the research presented in this thesis are considered in the context and future prospects of wider, more detailed and more comparative analyses of the relationships between gene structure and gene expression in the genomes of higher eukaryotes (chapter 7).
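
The definition of a local coexpression domain used in the abstract (2 to 4 physically adjacent genes with a pair-wise Pearson's correlation coefficient larger than 0.7, compared against genome randomizations) can be illustrated with a minimal Python sketch. This is not the thesis's actual procedure, which the abstract does not spell out: the greedy left-to-right scan, the shuffle-based estimate of the chance expectation, the function names (`pearson`, `find_domains`, `expected_by_chance`) and the toy expression profiles are all assumptions made for illustration.

```python
import math
import random


def pearson(x, y):
    """Pearson's correlation coefficient of two equally long expression vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # constant profile: treated as uncorrelated (assumption)
    return sxy / math.sqrt(sxx * syy)


def find_domains(profiles, threshold=0.7, max_size=4):
    """Greedy scan over genes in chromosomal order (assumed strategy).

    profiles: list of expression vectors, one per gene, ordered by position.
    Returns (start, end) index pairs with end exclusive; each domain holds
    2..max_size adjacent genes whose pair-wise correlations all exceed threshold.
    """
    domains = []
    n = len(profiles)
    i = 0
    while i < n - 1:
        j = i + 1
        # Extend the window while the next gene correlates with every gene in it.
        while (j < n and j - i < max_size and
               all(pearson(profiles[k], profiles[j]) > threshold
                   for k in range(i, j))):
            j += 1
        if j - i >= 2:
            domains.append((i, j))
            i = j
        else:
            i += 1
    return domains


def expected_by_chance(profiles, n_shuffles=100, seed=0, **kwargs):
    """Mean number of domains after randomly permuting the gene order."""
    rng = random.Random(seed)
    shuffled = list(profiles)
    total = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        total += len(find_domains(shuffled, **kwargs))
    return total / n_shuffles


# Toy profiles (hypothetical values): five genes measured in four conditions.
profiles = [
    [1, 2, 3, 4],
    [2, 4, 6, 8],
    [4, 3, 2, 1],
    [1, 3, 2, 4],
    [5, 1, 4, 2],
]
observed = find_domains(profiles)
chance = expected_by_chance(profiles, n_shuffles=20)
print(observed, chance)
```

On real data, the ratio of the observed number of domains to the shuffled expectation would correspond to the fold-enrichment quoted in the abstract. A per-chromosome scan and a more careful randomization scheme would be needed for a faithful reproduction.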
Original language: English
Qualification: Doctor of Philosophy
Awarding Institution
  • Wageningen University
  • Stiekema, W., Promotor, External person
  • Nap, Jan-Peter (Jp), Co-promotor
Award date: 20 Dec 2006
Place of Publication: [S.l.]
Print ISBNs: 9789085045540
Publication status: Published - 2006


  • gene expression
  • bioinformatics
  • molecular genetics
  • dna sequencing
  • genomics

