TY - CHAP
T1 - Scalable workflows and reproducible data analysis for genomics
AU - Strozzi, Francesco
AU - Janssen, Roel
AU - Wurmus, Ricardo
AU - Crusoe, Michael R.
AU - Githinji, George
AU - Di Tommaso, Paolo
AU - Belhachemi, Dominique
AU - Möller, Steffen
AU - Smant, Geert
AU - de Ligt, Joep
AU - Prins, Pjotr
PY - 2019/7/6
Y1 - 2019/7/6
AB - Biological, clinical, and pharmacological research now often involves analyses of genomes, transcriptomes, proteomes, and interactomes, within and between individuals and across species. Because of the large data volumes, the analysis and integration of data generated by such high-throughput technologies have become computationally intensive, and analysis can no longer happen on a typical desktop computer. In this chapter we show how to describe and execute the same analysis using a number of workflow systems and how these follow different approaches to tackle execution and reproducibility issues. We show how any researcher can create a reusable and reproducible bioinformatics pipeline that can be deployed and run anywhere. We show how to create a scalable, reusable, and shareable workflow using four different workflow engines, each of which can be run in parallel: the Common Workflow Language (CWL), Guix Workflow Language (GWL), Snakemake, and Nextflow. We show how to bundle a number of tools used in evolutionary biology by using the Debian, GNU Guix, and Bioconda software distributions, along with container systems such as Docker, GNU Guix, and Singularity. Together these distributions represent the majority of software packages relevant for biology, including PAML, Muscle, MAFFT, MrBayes, and BLAST. Software bundled in lightweight containers can be deployed on a desktop, in the cloud, and, increasingly, on compute clusters. By bundling software through these public software distributions, and by creating reproducible and shareable pipelines using these workflow engines, bioinformaticians not only spend less time reinventing the wheel but also bring us closer to the ideal of making science reproducible. The examples in this chapter allow a quick comparison of different solutions.
KW - Big data
KW - Bioconda
KW - Bioinformatics
KW - Cloud computing
KW - Cluster computing
KW - Common Workflow Language
KW - CWL
KW - Debian Linux
KW - Evolutionary biology
KW - GNU Guix
KW - Guix Workflow Language
KW - MPI
KW - MrBayes
KW - Nextflow
KW - Parallelization
KW - Snakemake
KW - Virtual machine
U2 - 10.1007/978-1-4939-9074-0_24
DO - 10.1007/978-1-4939-9074-0_24
M3 - Chapter
C2 - 31278683
AN - SCOPUS:85068863791
SN - 9781493990733
T3 - Methods in Molecular Biology
SP - 723
EP - 745
BT - Evolutionary Genomics
PB - Humana Press Inc.
ER -