Fast probabilistic file fingerprinting for big data

K. Tretyakov, S. Laur, G. Smant, J. Vilo, J.C.P. Prins

Research output: Contribution to journal › Article › Academic › peer-review

7 Citations (Scopus)

Abstract

Background: Biological data acquisition is raising new challenges, both in data analysis and handling. Not only is it proving hard to analyze the data at the rate it is generated today, but simply reading and transferring data files can be prohibitively slow due to their size. This primarily concerns logistics within and between data centers, but is also important for workstation users in the analysis phase. Common usage patterns, such as comparing and transferring files, are proving computationally expensive and are tying down shared resources. Results: We present an efficient method for calculating file uniqueness for large scientific data files that takes less computational effort than existing techniques. This method, called Probabilistic Fast File Fingerprinting (PFFF), exploits the variation present in biological data and computes file fingerprints by sampling randomly from the file instead of reading it in full. Consequently, it has a flat performance characteristic, correlated with data variation rather than file size. We demonstrate that probabilistic fingerprinting can be as reliable as existing hashing techniques, with provably negligible risk of collisions. We measure the performance of the algorithm on a number of data storage and access technologies, identifying its strengths as well as limitations. Conclusions: Probabilistic fingerprinting may significantly reduce the use of computational resources when comparing very large files. Utilisation of probabilistic fingerprinting techniques can increase the speed of common file-related workflows, both in the data center and for workbench analysis. The implementation of the algorithm is available as an open-source tool named pfff, as a command-line tool as well as a C library. The tool can be downloaded from http://biit.cs.ut.ee/pfff.
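The abstract's core idea — fingerprint a file by hashing a fixed number of pseudo-randomly chosen chunks plus the file size, rather than reading the file in full — can be illustrated with a short sketch. Note this is not the published PFFF algorithm: the actual pfff tool is implemented in C with its own hash construction, and the function name, parameters, and use of SHA-256 below are assumptions made purely for illustration.

```python
import hashlib
import os
import random

def sample_fingerprint(path, n_samples=64, chunk=8, seed=0):
    """Hash the file size plus n_samples pseudo-randomly chosen chunks,
    instead of reading the whole file (illustrative sketch only)."""
    size = os.path.getsize(path)
    rng = random.Random(seed)             # deterministic offsets for a given seed
    h = hashlib.sha256()
    h.update(size.to_bytes(8, "little"))  # files of different sizes never collide here
    with open(path, "rb") as f:
        for _ in range(n_samples):
            # pick an offset so a full chunk fits, where the file allows it
            offset = rng.randrange(size - chunk + 1) if size > chunk else 0
            f.seek(offset)
            h.update(f.read(chunk))
    return h.hexdigest()
```

Because only `n_samples * chunk` bytes are read regardless of file size, the cost is flat in file size — the "flat performance characteristic" the abstract refers to. The trade-off is probabilistic: two files that differ only in bytes that happen not to be sampled would collide, which is why the paper's analysis of data variation and collision risk matters.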
Original language: English
Article number: S8
Number of pages: 8
Journal: BMC Genomics
Volume: 14
Issue number: suppl. 2
DOIs: 10.1186/1471-2164-14-S2-S8
Publication status: Published - 2013


Cite this

Tretyakov, K., Laur, S., Smant, G., Vilo, J., & Prins, J. C. P. (2013). Fast probabilistic file fingerprinting for big data. BMC Genomics, 14(suppl. 2), [S8]. https://doi.org/10.1186/1471-2164-14-S2-S8
Tretyakov, K.; Laur, S.; Smant, G.; Vilo, J.; Prins, J.C.P. / Fast probabilistic file fingerprinting for big data. In: BMC Genomics. 2013; Vol. 14, No. suppl. 2.
@article{5098444abe51451fab988755172130b9,
title = "Fast probabilistic file fingerprinting for big data",
abstract = "Background: Biological data acquisition is raising new challenges, both in data analysis and handling. Not only is it proving hard to analyze the data at the rate it is generated today, but simply reading and transferring data files can be prohibitively slow due to their size. This primarily concerns logistics within and between data centers, but is also important for workstation users in the analysis phase. Common usage patterns, such as comparing and transferring files, are proving computationally expensive and are tying down shared resources. Results: We present an efficient method for calculating file uniqueness for large scientific data files that takes less computational effort than existing techniques. This method, called Probabilistic Fast File Fingerprinting (PFFF), exploits the variation present in biological data and computes file fingerprints by sampling randomly from the file instead of reading it in full. Consequently, it has a flat performance characteristic, correlated with data variation rather than file size. We demonstrate that probabilistic fingerprinting can be as reliable as existing hashing techniques, with provably negligible risk of collisions. We measure the performance of the algorithm on a number of data storage and access technologies, identifying its strengths as well as limitations. Conclusions: Probabilistic fingerprinting may significantly reduce the use of computational resources when comparing very large files. Utilisation of probabilistic fingerprinting techniques can increase the speed of common file-related workflows, both in the data center and for workbench analysis. The implementation of the algorithm is available as an open-source tool named pfff, as a command-line tool as well as a C library. The tool can be downloaded from http://biit.cs.ut.ee/pfff.",
author = "K. Tretyakov and S. Laur and G. Smant and J. Vilo and J.C.P. Prins",
year = "2013",
doi = "10.1186/1471-2164-14-S2-S8",
language = "English",
volume = "14",
journal = "BMC Genomics",
issn = "1471-2164",
publisher = "BioMed Central",
number = "suppl. 2"
}

Tretyakov, K, Laur, S, Smant, G, Vilo, J & Prins, JCP 2013, 'Fast probabilistic file fingerprinting for big data', BMC Genomics, vol. 14, no. suppl. 2, S8. https://doi.org/10.1186/1471-2164-14-S2-S8

Fast probabilistic file fingerprinting for big data. / Tretyakov, K.; Laur, S.; Smant, G.; Vilo, J.; Prins, J.C.P.

In: BMC Genomics, Vol. 14, No. suppl. 2, S8, 2013.


TY  - JOUR
T1  - Fast probabilistic file fingerprinting for big data
AU  - Tretyakov, K.
AU  - Laur, S.
AU  - Smant, G.
AU  - Vilo, J.
AU  - Prins, J.C.P.
PY  - 2013
Y1  - 2013
N2  - Background: Biological data acquisition is raising new challenges, both in data analysis and handling. Not only is it proving hard to analyze the data at the rate it is generated today, but simply reading and transferring data files can be prohibitively slow due to their size. This primarily concerns logistics within and between data centers, but is also important for workstation users in the analysis phase. Common usage patterns, such as comparing and transferring files, are proving computationally expensive and are tying down shared resources. Results: We present an efficient method for calculating file uniqueness for large scientific data files that takes less computational effort than existing techniques. This method, called Probabilistic Fast File Fingerprinting (PFFF), exploits the variation present in biological data and computes file fingerprints by sampling randomly from the file instead of reading it in full. Consequently, it has a flat performance characteristic, correlated with data variation rather than file size. We demonstrate that probabilistic fingerprinting can be as reliable as existing hashing techniques, with provably negligible risk of collisions. We measure the performance of the algorithm on a number of data storage and access technologies, identifying its strengths as well as limitations. Conclusions: Probabilistic fingerprinting may significantly reduce the use of computational resources when comparing very large files. Utilisation of probabilistic fingerprinting techniques can increase the speed of common file-related workflows, both in the data center and for workbench analysis. The implementation of the algorithm is available as an open-source tool named pfff, as a command-line tool as well as a C library. The tool can be downloaded from http://biit.cs.ut.ee/pfff.
AB  - Background: Biological data acquisition is raising new challenges, both in data analysis and handling. Not only is it proving hard to analyze the data at the rate it is generated today, but simply reading and transferring data files can be prohibitively slow due to their size. This primarily concerns logistics within and between data centers, but is also important for workstation users in the analysis phase. Common usage patterns, such as comparing and transferring files, are proving computationally expensive and are tying down shared resources. Results: We present an efficient method for calculating file uniqueness for large scientific data files that takes less computational effort than existing techniques. This method, called Probabilistic Fast File Fingerprinting (PFFF), exploits the variation present in biological data and computes file fingerprints by sampling randomly from the file instead of reading it in full. Consequently, it has a flat performance characteristic, correlated with data variation rather than file size. We demonstrate that probabilistic fingerprinting can be as reliable as existing hashing techniques, with provably negligible risk of collisions. We measure the performance of the algorithm on a number of data storage and access technologies, identifying its strengths as well as limitations. Conclusions: Probabilistic fingerprinting may significantly reduce the use of computational resources when comparing very large files. Utilisation of probabilistic fingerprinting techniques can increase the speed of common file-related workflows, both in the data center and for workbench analysis. The implementation of the algorithm is available as an open-source tool named pfff, as a command-line tool as well as a C library. The tool can be downloaded from http://biit.cs.ut.ee/pfff.
U2  - 10.1186/1471-2164-14-S2-S8
DO  - 10.1186/1471-2164-14-S2-S8
M3  - Article
VL  - 14
JO  - BMC Genomics
JF  - BMC Genomics
SN  - 1471-2164
IS  - suppl. 2
M1  - S8
ER  -