Managing Variant Calling Files the Big Data Way: Using HDFS and Apache Parquet

Research output: Chapter in Book/Report/Conference proceeding › Chapter › Academic › peer-review

1 Citation (Scopus)

Abstract

Big Data has been seen as a remedy for the efficient management of the ever-increasing genomic data. In this paper, we investigate the use of Apache Spark to store and process Variant Calling Files (VCF) on a Hadoop cluster. We demonstrate Tomatula, a software tool for converting VCF files to Apache Parquet storage format, and an application to query variant calling datasets. We evaluate how the wall time (i.e. time until the query answer is returned to the user) scales out on a Hadoop cluster storing VCF files, either in the original flat-file format, or using the Apache Parquet columnar storage format. Apache Parquet can compress the VCF data by around a factor of 10, and supports easier querying of VCF files as it exposes the field structure. We discuss advantages and disadvantages in terms of storage capacity and querying performance with both flat VCF files and Apache Parquet using an open plant breeding dataset. We conclude that Apache Parquet offers benefits for reducing storage size and wall time, and scales out with larger datasets.
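The abstract's point that Parquet "exposes the field structure" of a VCF file can be illustrated with a minimal, dependency-free sketch. This is not the authors' Tomatula tool (which targets Apache Spark and real Parquet output); it only shows the idea of turning flat, tab-separated VCF records into a column-oriented layout, which is what makes per-field queries cheap in a columnar store. The column names follow the standard VCF fixed fields.

```python
# Illustrative sketch (not the authors' Tomatula tool): convert flat VCF
# data lines into a column-oriented layout, mimicking the field structure
# a columnar format such as Apache Parquet exposes.

VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]

def vcf_to_columns(lines):
    """Turn flat VCF data lines into a dict mapping column name -> list of values."""
    columns = {name: [] for name in VCF_COLUMNS}
    for line in lines:
        if line.startswith("#"):            # skip meta and header lines
            continue
        fields = line.rstrip("\n").split("\t")
        for name, value in zip(VCF_COLUMNS, fields):
            columns[name].append(value)
    return columns

# Two hypothetical variant records for demonstration:
records = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t12345\t.\tA\tG\t50\tPASS\tDP=20",
    "2\t67890\trs99\tC\tT\t99\tPASS\tDP=35",
]
cols = vcf_to_columns(records)

# A query that touches only POS now scans one column rather than whole rows:
positions = [int(p) for p in cols["POS"]]   # [12345, 67890]
```

In a flat VCF file, answering even a single-field query means reading and splitting every full line; a columnar layout lets the query engine read only the fields it needs, which is one reason the paper observes lower wall times with Parquet.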
Original language: English
Title of host publication: BDCAT '17 Proceedings of the Fourth IEEE/ACM International Conference on Big Data Computing, Applications and Technologies
Publisher: ACM
Pages: 219-226
ISBN (Electronic): 9781450355490
DOIs: 10.1145/3148055.3148060
Publication status: Published - 2017
Event: Fourth IEEE/ACM International Conference on Big Data Computing, Applications and Technologies - Austin, United States
Duration: 5 Dec 2017 - 8 Dec 2017
Conference number: 4

Conference

Conference: Fourth IEEE/ACM International Conference on Big Data Computing, Applications and Technologies
Country: United States
City: Austin
Period: 5/12/17 - 8/12/17

Keywords

  • Big Data
  • bioinformatics
  • variant calling
  • Hadoop
  • HDFS
  • Apache Spark
  • Apache Parquet

Cite this

Boufea, A., Finkers, H. J., van Kaauwen, M. P. W., Kramer, M. R., & Athanasiadis, I. N. (2017). Managing Variant Calling Files the Big Data Way: Using HDFS and Apache Parquet. In BDCAT '17 Proceedings of the Fourth IEEE/ACM International Conference on Big Data Computing, Applications and Technologies (pp. 219-226). ACM. https://doi.org/10.1145/3148055.3148060