Exploring the unmapped DNA and RNA reads in a songbird genome

Veronika N. Laine*, Toni I. Gossmann, Kees van Oers, Marcel E. Visser, Martien A.M. Groenen

*Corresponding author for this work

Research output: Contribution to journal › Article › Academic › peer-review

Abstract

Background: A widely used approach in next-generation sequencing projects is the alignment of reads to a reference genome. Despite methodological and hardware improvements which have enhanced the efficiency and accuracy of alignments, a significant percentage of reads frequently remain unmapped. Usually, unmapped reads are discarded from the analysis process, but significant biological information and insights can be uncovered from these data. We explored the unmapped DNA (normal and bisulfite treated) and RNA sequence reads of the great tit (Parus major) reference genome individual. From the unmapped reads we generated de novo assemblies, after which the generated sequence contigs were aligned to the NCBI non-redundant nucleotide database using BLAST, identifying the closest known matching sequence.

Results: Many of the aligned contigs showed sequence similarity to different bird species and genes that were absent in the great tit reference assembly. Furthermore, there were also contigs that represented known P. major pathogenic species. Most interesting were several species of blood parasites such as Plasmodium and Trypanosoma.

Conclusions: Our analyses revealed that meaningful biological information can be found when further exploring unmapped reads. For instance, it is possible to discover sequences that are either absent or misassembled in the reference genome, and sequences that indicate infection or sample contamination. In this study we also propose strategies to aid the capture and interpretation of this information from unmapped reads.
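The first step described above, collecting the reads that did not align to the reference, can be illustrated with a minimal sketch. It assumes the alignments are available as SAM text, where bit 0x4 of the FLAG field marks an unmapped segment. The abstract does not state which tools or file formats the authors used, so the function names (`unmapped_reads`, `to_fasta`) and the two example records are hypothetical.

```python
from typing import Iterable, Iterator, Tuple

UNMAPPED_FLAG = 0x4  # SAM FLAG bit meaning "segment unmapped"


def unmapped_reads(sam_lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (read_name, sequence) for every unmapped record in SAM text."""
    for line in sam_lines:
        # Skip blank lines and header lines, which start with "@".
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        # SAM columns: 1 = QNAME, 2 = FLAG, 10 = SEQ.
        name, flag, seq = fields[0], int(fields[1]), fields[9]
        if flag & UNMAPPED_FLAG:
            yield name, seq


def to_fasta(records: Iterable[Tuple[str, str]]) -> str:
    """Format (name, sequence) pairs as FASTA text for a downstream assembler."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records)


# Hypothetical records: read1 is mapped (FLAG 0), read2 is unmapped (FLAG 4).
example_sam = [
    "@HD\tVN:1.6\tSO:unsorted\n",
    "read1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\n",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tTTGA\tIIII\n",
]
fasta = to_fasta(unmapped_reads(example_sam))
print(fasta)
```

In a workflow like the one in the abstract, the resulting FASTA would be passed to a de novo assembler and the contigs then searched with BLAST; those steps are outside this sketch.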

Original language: English
Article number: 19
Journal: BMC Genomics
Volume: 20
Issue number: 1
DOI: https://doi.org/10.1186/s12864-018-5378-2
Publication status: Published - 8 Jan 2019

Fingerprint

Songbirds
Genome
RNA
DNA
Trypanosoma
Biological Phenomena
Plasmodium
Birds
Parasites
Nucleotides
Databases
Infection
Genes

Keywords

  • Contamination
  • DNA sequencing
  • Pathogens
  • RNA sequencing
  • Unmapped reads

Cite this

@article{e0c0de3457f849f69d56e3fd1d0d73da,
title = "Exploring the unmapped DNA and RNA reads in a songbird genome",
keywords = "Contamination, DNA sequencing, Pathogens, RNA sequencing, Unmapped reads",
author = "Laine, {Veronika N.} and Gossmann, {Toni I.} and {van Oers}, Kees and Visser, {Marcel E.} and Groenen, {Martien A.M.}",
year = "2019",
month = "1",
day = "8",
doi = "10.1186/s12864-018-5378-2",
language = "English",
volume = "20",
journal = "BMC Genomics",
issn = "1471-2164",
publisher = "Springer Verlag",
number = "1",
pages = "19",
}

Exploring the unmapped DNA and RNA reads in a songbird genome. / Laine, Veronika N.; Gossmann, Toni I.; van Oers, Kees; Visser, Marcel E.; Groenen, Martien A.M.

In: BMC Genomics, Vol. 20, No. 1, 19, 08.01.2019.

TY  - JOUR
T1  - Exploring the unmapped DNA and RNA reads in a songbird genome
AU  - Laine, Veronika N.
AU  - Gossmann, Toni I.
AU  - van Oers, Kees
AU  - Visser, Marcel E.
AU  - Groenen, Martien A.M.
PY  - 2019/1/8
Y1  - 2019/1/8
AB  - Background: A widely used approach in next-generation sequencing projects is the alignment of reads to a reference genome. Despite methodological and hardware improvements which have enhanced the efficiency and accuracy of alignments, a significant percentage of reads frequently remain unmapped. Usually, unmapped reads are discarded from the analysis process, but significant biological information and insights can be uncovered from these data. We explored the unmapped DNA (normal and bisulfite treated) and RNA sequence reads of the great tit (Parus major) reference genome individual. From the unmapped reads we generated de novo assemblies, after which the generated sequence contigs were aligned to the NCBI non-redundant nucleotide database using BLAST, identifying the closest known matching sequence. Results: Many of the aligned contigs showed sequence similarity to different bird species and genes that were absent in the great tit reference assembly. Furthermore, there were also contigs that represented known P. major pathogenic species. Most interesting were several species of blood parasites such as Plasmodium and Trypanosoma. Conclusions: Our analyses revealed that meaningful biological information can be found when further exploring unmapped reads. For instance, it is possible to discover sequences that are either absent or misassembled in the reference genome, and sequences that indicate infection or sample contamination. In this study we also propose strategies to aid the capture and interpretation of this information from unmapped reads.
KW  - Contamination
KW  - DNA sequencing
KW  - Pathogens
KW  - RNA sequencing
KW  - Unmapped reads
UR  - https://doi.org/10.6084/m9.figshare.c.4359863
U2  - 10.1186/s12864-018-5378-2
DO  - 10.1186/s12864-018-5378-2
M3  - Article
VL  - 20
JO  - BMC Genomics
JF  - BMC Genomics
SN  - 1471-2164
IS  - 1
M1  - 19
ER  - 