Utilization of complete chloroplast genomes for phylogenetic studies

Shairul Izan Binti Ramlee

Research output: Thesis › internal PhD, WU › Academic

Abstract

Chloroplast DNA sequence polymorphisms are a primary source of data in many plant phylogenetic studies. The chloroplast genome is relatively conserved in its evolution, which makes it well suited to retaining phylogenetic signal. It is also largely, but not completely, free from complicating evolutionary processes such as gene duplication, concerted evolution, pseudogene formation and genome rearrangements. This sequence conservation makes it possible to design primers that target regions conserved well beyond species boundaries, and to amplify those targets.

The small size of chloroplast genomes, together with their high copy number in leaf cells, greatly facilitates their sequencing. In this thesis, chloroplast phylogenomics was conducted using complete chloroplast genomes obtained with a newly developed method of de novo assembly. The method is not only cost-effective but also has the potential to extract a wealth of useful information on thousands of chloroplast genomes from Whole Genome Shotgun (WGS) data. We used k-mer frequency tables to identify and extract the chloroplast reads from the WGS reads, and assembled these using a highly integrated and automated custom pipeline. This pipeline includes steps aimed at optimizing assemblies and filling gaps that are left due to coverage variation in the WGS dataset. The pipeline enabled successful de novo assembly across a range of nuclear genome sizes, from Solanum lycopersicum (tomato, 0.9 Gb) to Paphiopedilum henryanum (slipper orchid, 35 Gb).
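The idea behind k-mer based read extraction can be illustrated with a minimal Python sketch. It is not the thesis pipeline: the function names, the value of k, the count threshold and the use of a median are all assumptions made for illustration. The sketch only relies on the point made above, namely that chloroplast DNA is present in high copy number, so k-mers from chloroplast reads should occur far more often in a WGS dataset than k-mers from single-copy nuclear sequence.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Yield every k-mer of seq (forward strand only, for brevity)."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_kmer_table(reads, k):
    """Count how often each k-mer occurs across all reads."""
    table = Counter()
    for read in reads:
        table.update(kmers(read, k))
    return table


def select_high_copy_reads(reads, k=5, min_median_count=3):
    """Keep reads whose median k-mer count reaches the threshold.

    k and min_median_count are hypothetical values chosen for toy data;
    real WGS data would need a much larger k and a threshold derived
    from the observed k-mer frequency distribution.
    """
    table = build_kmer_table(reads, k)
    selected = []
    for read in reads:
        counts = [table[km] for km in kmers(read, k)]
        if counts and median(counts) >= min_median_count:
            selected.append(read)
    return selected
```

Using the median rather than the mean is a design choice of this sketch: a read is then not selected merely because it contains one short repeat that happens to be frequent. A realistic implementation would also have to handle reverse complements and sequencing errors, which are omitted here.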

Unlike the common procedure of mapping reads against a reference genome, the pipeline is suitable for studying structural variation in the chloroplast genome. To support putative rearrangements, a flexible assembly quality comparison tool was created that combines and visualizes read mapping and alignment results in a two-dimensional plot. We evaluated this tool using the de novo assemblies of the S. lycopersicum and P. henryanum chloroplasts. The results show that we can not only immediately select the better of two assembly options, but also determine the location of specific artefacts.

To explore and evaluate the utility of complete chloroplast phylogenomics, tomato and Paphiopedilum spp. were used for phylogenetic inference based on the complete chloroplast genome. In total, 84 tomato chloroplast genomes from section Lycopersicon were assembled and phylogenetic trees were produced. The analyses revealed that, in addition to the chloroplast regions and spacers traditionally used for phylogenetics, further regions of protein-coding and non-coding DNA may be exploited for intraspecific phylogenetic studies. In particular, more than 50% of all phylogenetically relevant information could be captured by using just four genes (ycf1, ndhF, ndhA, and ndhH), with 34% in ycf1 alone. The topology of the phylogenetic tree inferred from ycf1 was the same as that of trees based on all other protein-coding genes, although with lower bootstrap values. The phylogenetic analyses based on 32 complete Paphiopedilum spp. chloroplast genomes confirmed the division of the genus into three subgenera: Parvisepalum, Brachypetalum and Paphiopedilum. The division of subgenus Paphiopedilum into five sections was also recovered. The de novo assemblies revealed several structural rearrangements, including gene loss and inversion. In addition, the chloroplast genome of Paphiopedilum has experienced extreme IR expansion that has included part of or the entire SSC region, resulting in larger IR regions than commonly observed among monocots.
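One way to quantify how much phylogenetic signal each gene contributes is to count parsimony-informative sites per gene alignment and express each count as a share of the total. The abstract does not state how "phylogenetically relevant information" was measured, so the measure used below, the function names and the toy alignments are assumptions for illustration only, not the method or data of the thesis.

```python
from collections import Counter


def is_parsimony_informative(column):
    """True if at least two different bases each occur in >= 2 sequences.

    Gaps and ambiguity codes are ignored (an assumption of this sketch).
    """
    counts = Counter(b for b in column.upper() if b in "ACGT")
    return sum(1 for n in counts.values() if n >= 2) >= 2


def informative_sites(alignment):
    """Count informative columns in a list of equal-length aligned sequences."""
    return sum(is_parsimony_informative("".join(col)) for col in zip(*alignment))


def informative_share(gene_alignments):
    """Map each gene name to its fraction of all informative sites."""
    per_gene = {g: informative_sites(a) for g, a in gene_alignments.items()}
    total = sum(per_gene.values())
    return {g: (n / total if total else 0.0) for g, n in per_gene.items()}


# Toy alignments with hypothetical gene names (four taxa each).
toy = {
    "geneA": ["ACGTG", "ACGAG", "TCGAC", "TCGTC"],
    "geneB": ["AC", "AC", "TC", "TC"],
}
shares = informative_share(toy)
```

In this toy input, geneA has three informative columns and geneB has one, so `shares` assigns them 0.75 and 0.25 respectively. Ranking genes by such shares is the kind of summary that could motivate choosing a small subset of genes for phylogenetic work.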

In conclusion, WGS data offer opportunities to generate partial or entire chloroplast genomes for phylogenetic studies. Species discrimination can already be achieved with partial data (subsets of genes), but evolutionarily young lineages may require more informative characters. It is therefore expected that many complete chloroplast genomes will be produced in the years to come. When generating these genomes, de novo assembly is strongly preferable to mapping against reference genomes, because only de novo assembly can also uncover structural rearrangements in the chloroplast genome.

Original language: English
Qualification: Doctor of Philosophy
Awarding Institution
  • Wageningen University
Supervisors/Advisors
  • Visser, Richard, Promotor
  • Smulders, Rene, Co-promotor
  • Borm, Theo, Co-promotor
Award date: 28 Oct 2016
Place of Publication: Wageningen
Publisher: Wageningen University
Print ISBNs: 9789462579354
DOIs: https://doi.org/10.18174/390196
Publication status: Published - 2016

Fingerprint

Paphiopedilum
phylogeny
genome
chloroplasts
Solanum lycopersicum
tomatoes
chloroplast genome
chloroplast DNA
genes
concerted evolution
pseudogenes
extracts
Solanum
gene duplication
Liliopsida
intergenic DNA
nuclear genome
topology
open reading frames
genetic polymorphism

Keywords

  • phylogenetics
  • genomes
  • chloroplasts
  • models
  • solanum
  • orchidaceae
  • phylogenomics
  • dna sequencing

Cite this

Ramlee, Shairul Izan Binti. / Utilization of complete chloroplast genomes for phylogenetic studies. Wageningen : Wageningen University, 2016. 186 p.
@phdthesis{6717f1a2bdca43fca1224fc699dff5c5,
title = "Utilization of complete chloroplast genomes for phylogenetic studies",
keywords = "phylogenetics, genomes, chloroplasts, models, solanum, orchidaceae, phylogenomics, dna sequencing",
author = "Ramlee, {Shairul Izan Binti}",
note = "WU thesis 6484 Includes bibliographic references. - With summary in English",
year = "2016",
doi = "10.18174/390196",
language = "English",
isbn = "9789462579354",
publisher = "Wageningen University",
school = "Wageningen University",

}

