Chemometrics Analysis of Big Data

José Camacho*, Edoardo Saccenti*

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding, Chapter, Academic, peer-review

1 Citation (Scopus)

Abstract

This article presents the extension of chemometric visualization tools and data analysis procedures to Big Data sets with an unlimited number of observations or variables. We restrict this extension to Principal Component Analysis (PCA) and Partial Least Squares (PLS), and to related tools such as score and loading plots. The solution is based on the iterative computation of cross-product matrices for model fitting. Furthermore, clustering techniques are used to visualize the elements in the Big Data mode, whether that is the rows or the columns of the data matrix. The objective is to retain the visualization capabilities of traditional score and loading plots while making the user-supervised analysis of massive data sets affordable on common computers. Efficient processing and updating approaches, visualization techniques, and challenges for future research are identified throughout the article. The extension is illustrated with several data sets, including data from an industrial process, a well-known multivariate benchmark with 5 million observations, and a DNA methylation data set with more than 500K variables. Examples are run with freely available software: the Multivariate Exploratory Data Analysis (MEDA) Toolbox for Matlab.
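The idea of fitting a model from iteratively computed cross-product matrices can be illustrated with a minimal sketch. It is written in Python with NumPy and is not the MEDA Toolbox implementation: the function names `batched_cross_product` and `pca_from_cross_product` are hypothetical, and the sketch assumes that the Big Data mode is the rows, that only mean-centering is applied, and that the variables-by-variables matrix fits in memory.

```python
import numpy as np


def batched_cross_product(batches):
    """Accumulate the row count, column sums and X'X over batches of rows.

    Hypothetical helper for illustration; each batch is an (n_i x m) array.
    """
    n = 0
    col_sum = None
    xtx = None
    for batch in batches:
        batch = np.asarray(batch, dtype=float)
        if xtx is None:
            m = batch.shape[1]
            col_sum = np.zeros(m)
            xtx = np.zeros((m, m))
        n += batch.shape[0]
        col_sum += batch.sum(axis=0)
        xtx += batch.T @ batch  # only an (m x m) matrix is kept in memory
    return n, col_sum, xtx


def pca_from_cross_product(n, col_sum, xtx, n_components):
    """Fit PCA loadings from the accumulated statistics (mean-centering only)."""
    mean = col_sum / n
    # Centered cross-product: Xc'Xc = X'X - n * mean * mean'
    xtx_centered = xtx - n * np.outer(mean, mean)
    eigvals, eigvecs = np.linalg.eigh(xtx_centered)
    order = np.argsort(eigvals)[::-1][:n_components]  # largest variance first
    return mean, eigvecs[:, order], eigvals[order]


# Synthetic stand-in for a data set too large to load at once (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5)) @ rng.normal(size=(5, 5))
batches = (X[i:i + 100] for i in range(0, X.shape[0], 100))

n, col_sum, xtx = batched_cross_product(batches)
mean, loadings, eigvals = pca_from_cross_product(n, col_sum, xtx, n_components=2)

# Scores can then be computed batch by batch, again without loading all rows.
scores_first_batch = (X[:100] - mean) @ loadings
```

Because each batch contributes only to a matrix whose size depends on the number of variables, memory use does not grow with the number of observations. Subtracting `n * mean * mean'` after accumulation can lose precision when the column means are large relative to the spread of the data, so a production implementation would likely need a more careful centering update.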

Original language: English
Title of host publication: Comprehensive Chemometrics
Subtitle of host publication: Chemical and Biochemical Data Analysis
Editors: Steven Brown, Beata Walczak, Romà Tauler
Publisher: Elsevier
Chapter: 4.18
Pages: 437-458
Number of pages: 22
Volume: 4
Edition: 2
ISBN (Electronic): 9780444641656
ISBN (Print): 9780444641663
Publication status: Published - 2020

Keywords

  • Big Data
  • Chemometrics
  • Clustering
  • Compressed score plots
  • Cross-product
  • Parallelization
  • Partial least squares
  • Principal component analysis
  • Visualization
