Experimental results of "Managing variant calling datasets the big data way"

Dataset

Description

Tomatula was demonstrated by retrieving the allele frequencies for a given region from the data of Aflitos et al. (2014). We developed scripts that retrieve allele frequencies from either VCF file storage or Apache Parquet. We then ran a series of experiments, each querying a region of 2000 bases, roughly the length of a gene, in the chromosome 6 file. The experiments examine four main factors that can affect the performance of a Big Data cluster: (a) the storage format (VCF files or Parquet), (b) the size of the input files (104 or 1144 individuals), (c) the number of computing nodes in the cluster (between 2 and 150 executor nodes), and (d) the HDFS replication factor (3, 5, 7, or 9). The HDFS block size was kept at the default value of 128 MB. All experiments were executed five times; the detailed results are provided here, along with a script that produces the corresponding figures.
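To make the benchmarked query concrete, the following is a minimal sketch of an allele-frequency lookup over a 2000-base region, written in plain Python against a few toy VCF lines. It is not the Tomatula code, which runs on a Spark cluster over HDFS. The function name `allele_frequencies`, the chromosome label `"6"`, the sample names, and the toy records are all assumptions made for illustration.

```python
from collections import Counter


def allele_frequencies(vcf_lines, chrom, start, end):
    """Return {(pos, allele): frequency} for variants on `chrom` in [start, end].

    Hypothetical helper: assumes tab-separated VCF records that carry a GT
    subfield in the FORMAT column and one genotype column per individual.
    """
    result = {}
    for line in vcf_lines:
        if line.startswith("#"):  # skip meta-information and header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if fields[0] != chrom:
            continue
        pos = int(fields[1])
        if not (start <= pos <= end):
            continue
        # Allele index 0 is REF; indices 1..n are the comma-separated ALT alleles.
        alleles = [fields[3]] + fields[4].split(",")
        gt_index = fields[8].split(":").index("GT")
        counts = Counter()
        for sample in fields[9:]:
            genotype = sample.split(":")[gt_index]
            for code in genotype.replace("|", "/").split("/"):
                if code != ".":  # "." marks a missing call
                    counts[alleles[int(code)]] += 1
        total = sum(counts.values())
        for allele in alleles:
            result[(pos, allele)] = counts[allele] / total if total else 0.0
    return result


# Toy input (assumed data, two individuals); positions 1000..2999 span 2000 bases.
toy_vcf = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2",
    "6\t1500\t.\tA\tG\t.\tPASS\t.\tGT\t0/1\t1/1",
    "6\t5000\t.\tC\tT\t.\tPASS\t.\tGT\t0/0\t0/1",
]
freqs = allele_frequencies(toy_vcf, "6", 1000, 2999)
print(freqs)
```

Only the variant at position 1500 falls inside the queried window; the record at position 5000 is filtered out. In the experiments described above, the same kind of region filter is evaluated either over VCF files or over Parquet, which is what makes the storage format one of the compared factors.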
Date made available: 22 May 2017
Publisher: Zenodo

Research Output

Managing Variant Calling Files the Big Data Way: Using HDFS and Apache Parquet

Boufea, A., Finkers, H. J., van Kaauwen, M. P. W., Kramer, M. R. & Athanasiadis, I. N., 2017, BDCAT '17 Proceedings of the Fourth IEEE/ACM International Conference on Big Data Computing, Applications and Technologies. ACM, pp. 219-226

Research output: Chapter in Book/Report/Conference proceeding › Chapter

Open Access
  • 1 Citation (Scopus)

    Tomatula

    Boufea, K. & Athanasiadis, I. N., 2017

    Research output: Non-textual form › Software › Other research output

    Open Access

    Cite this

    Boufea, K. (Creator), Athanasiadis, I. (Creator) (22 May 2017). Experimental results of "Managing variant calling datasets the big data way". Zenodo. 10.5281/zenodo.582145