Improving hierarchical clustering of genotypic data via principal component analysis

Research output: Contribution to journal; Article; Academic; peer-review

13 Citations (Scopus)

Abstract

Understanding the genetic structure of germplasm collections is a prerequisite for effective and efficient use of crop genetic resources in genebanks. Currently, hierarchical clustering techniques are most popular for describing genetic structure in germplasm collections. Cluster analysis has traditionally been performed using dissimilarities based on raw genotypic data, but recent studies have shown that it can be improved by first condensing the genotypic data using principal component analysis (PCA). Although the two-step approach (PCA followed by cluster analysis) is gaining popularity, no systematic study into its benefits over traditional clustering methods has been performed. In particular, the relationship between the number of principal components (PCs) to be retained and the performance of cluster analysis has not been established. It is also not clear whether genetic data should be scaled before performing PCA. Here we present a detailed study comparing cluster analysis using distances based on data condensed to the significant PCs with clustering based on the full dataset. We also studied the effect of data scaling on PCA-based clustering. Using simulations, we show that in discretely subdivided populations, maximum clustering performance is attained by using a subset of PCs that relate to differentiation between subpopulations, and that scaling of the data is key to achieving improvement in PCA-based clustering. For scaled data, we report consistently higher clustering success for PCA, particularly at lower levels of population differentiation, while gains for unscaled data are minor. This is confirmed by real data, where PCA-based clustering of scaled genotypic data leads to visible improvements in resolving finer patterns of geographic subdivision. Our results show clearly that proper scaling and reduction of genotypic data are key to improving clustering performance.
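The two-step approach described in the abstract (scale the genotypic data, condense it with PCA, then cluster on the retained PCs) can be illustrated with a minimal sketch. This is not the authors' code or their exact procedure. The toy simulation, the per-marker standardization to unit variance, the choice of K - 1 retained PCs for K subpopulations, Ward linkage, and all function names (`simulate_genotypes`, `pca_scores`, `hierarchical_clusters`, `purity`) are assumptions made for illustration only.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster


def simulate_genotypes(n_per_pop=30, n_pops=3, n_markers=500, spread=0.1, seed=0):
    """Toy data (assumption): biallelic markers coded 0/1/2 in discrete subpopulations."""
    rng = np.random.default_rng(seed)
    ancestral = rng.uniform(0.1, 0.9, size=n_markers)
    rows, labels = [], []
    for pop in range(n_pops):
        # Each subpopulation drifts away from the ancestral allele frequencies.
        freqs = np.clip(ancestral + rng.normal(0.0, spread, n_markers), 0.02, 0.98)
        rows.append(rng.binomial(2, freqs, size=(n_per_pop, n_markers)))
        labels += [pop] * n_per_pop
    return np.vstack(rows).astype(float), np.array(labels)


def pca_scores(X, n_components, scale=True):
    """Return PC scores of the centered (and optionally standardized) marker matrix."""
    Xc = X - X.mean(axis=0)
    if scale:
        sd = Xc.std(axis=0)
        polymorphic = sd > 0          # monomorphic markers cannot be standardized
        Xc = Xc[:, polymorphic] / sd[polymorphic]
    U, S, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :n_components] * S[:n_components]


def hierarchical_clusters(data, n_clusters):
    """Ward hierarchical clustering on Euclidean distances, cut into n_clusters groups."""
    tree = linkage(data, method="ward")
    return fcluster(tree, t=n_clusters, criterion="maxclust")


def purity(true_labels, cluster_labels):
    """Fraction of individuals that belong to the majority population of their cluster."""
    hits = 0
    for c in np.unique(cluster_labels):
        hits += np.bincount(true_labels[cluster_labels == c]).max()
    return hits / len(true_labels)


K = 3
X, pops = simulate_genotypes(n_pops=K)

# Traditional approach: cluster on the full (raw) genotype matrix.
raw_clusters = hierarchical_clusters(X, K)

# Two-step approach: scaled PCA, keep K - 1 PCs (assumed rule of thumb), then cluster.
scores = pca_scores(X, n_components=K - 1, scale=True)
pca_clusters = hierarchical_clusters(scores, K)

print("purity, raw data:   ", purity(pops, raw_clusters))
print("purity, scaled PCA: ", purity(pops, pca_clusters))
```

Lowering `spread` makes the simulated subpopulations less differentiated, which is the regime where the abstract reports the largest gains for scaled PCA. Passing `scale=False` gives the unscaled variant for comparison. Results from this toy setup should not be read as a reproduction of the study.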
Original language: English
Pages (from-to): 1546-1554
Journal: Crop Science
Volume: 53
Publication status: Published - 2013

Keywords

  • genome-wide association
  • molecular marker data
  • genetic-structure
  • core collections
