HierarchicalClustering (v8)

Agglomerative hierarchical clustering of samples and/or genes

Author: Edwin Juarez, University of California San Diego

Algorithm Version:

Summary

Cluster analysis is a means of discovering, within a body of data, groups whose members are similar for some property.  Clustering of gene expression data is geared toward finding genes that are expressed or not expressed in similar ways under certain conditions.

Given a set of items to be clustered (items can be either genes or samples), agglomerative hierarchical clustering (HC) recursively merges items with other items or with the result of previous merges, according to the distance between each pair of items, with the closest item pairs being merged first. As a result, it produces a tree structure, referred to as dendogram, whose nodes correspond to:

  • the original items (these are the leaves of the tree)
  • the merging of other nodes (these are the internal nodes of the tree)

The HierarchicalClustering module produces a CDT file that contains the original data, but reordered to reflect the clustering. Additionally, either a dendrogram or two dendrogram files are created (one for clustering rows and one for clustering columns). The row dendrogram has the extension GTR, while the column dendrogram has the extension ATR. These files describe the order in which nodes were joined during the clustering.

The module includes several preprocessing options. The order of the preprocessing operations is:

  1. Row (gene) center
  2. Row (gene) normalize
  3. Column (sample) center
  4. Column (sample) normalize

References

  • Eisen MB, Spellman PT, Brown PO, Botstein D. Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA. 1998;95:14863-14868.
  • de Hoon MJ, Imoto S, Nolan J, Miyano S. Open source clustering software. Bioinformatics. 2004;20:1453-1454.
  • Kim JW et al. Characterizing genomic alterations in cancer by complementary functional associations. Nature Biotechnology volume 34, pages 539–546 (2016). doi:10.1038/nbt.3527
  • Aghagolzadeh M et al. A Hierarchical Clustering Based on Mutual Information Maximization, 2007 IEEE International Conference on Image Processing, San Antonio, TX, 2007, pp. I - 277-I - 280. doi: 10.1109/ICIP.2007.4378945
  • A. Kraskov et al. Hierarchical clustering using mutual information. Europhysics Letters. 2005; 70 278. doi:10.1209/epl/i2004-10483-y

 

Functionality yet to be implemented:

  • Preprocess data by performing Log Base 2 Transform
  • Cluster using pairwise single-linkage:  The distance between two clusters is computed as the distance between the two closest elements in the two clusters.
  • Cluster using pairwise centroid-linkage: The distance between two clusters is computed as the (squared) Euclidean distance between their centroids or means.

Note that all three of these parameters can be found in HierarchicalClustering V6. If you would like to see them added to HierarchicalClustering V7 feel free to request it on the GenePattern help forum: https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!forum/genepattern-help

 

Technical notes:

  • This module has been tested to run in the Docker container genepattern/docker-python36:0.5 which has build code b56kefa53bu9hv2vo2bbpwa
  • To create a conda environment (called GP_hierarchical_clustering_env) with the required dependencies download the requirements.txt file from the github repository named genepattern/docker-python36 (here is the url of the file: https://raw.githubusercontent.com/genepattern/docker-python36/master/requirements.txt) and run this three commands in the same folder where requirements.txt is located:

conda create --name GP_hierarchical_clustering_env pip
source activate GP_hierarchical_clustering_env
pip install -r requirements.txt

  • When calling this module from the command line, takes a few seconds (1-4) of wall time to run to completion in a conda enviroment using default parameters and this input: https://datasets.genepattern.org/data/test_data/BRCA_minimal_60x19.gct
  • When calling this module from the command line, takes a few seconds (1-5) of wall time to run to completion in a conda enviroment using default parameters and this input: https://datasets.genepattern.org/data/test_data/BRCA_large_20783x40.gct
  • When calling this module from the command line, takes a few minutes (3-5) of wall time to run to completion in a conda enviroment using default parameters except for row_distance_metric = Pearson correlation and this input: https://datasets.genepattern.org/data/test_data/BRCA_large_20783x40.gct
  • When attempting row clustering it takes significantly longer especially if using the information coefficient.

Parameters

Name Description
input filename * input data file name - .gct, .res, .pcl
column distance measure *

Distance measure for column (sample) clustering.  Options include:

  • No column clustering
  • Pearson correlation (default): Pearson's correlation coefficient between two variables is defined as the covariance of the two variables divided by the product of their standard deviations. It is a measure for how well a straight line can be fitted to a scatter plot of x and y. If all the points in the scatter plot lie on a straight line, the Pearson correlation coefficient is either +1 or -1, depending on whether the slope of line is positive or negative. If it is equal to zero, there is no correlation between x and y.
  • The information coefficient is a measure of the informaton-theoretic similarity between two variables. The information coefficient (IC) is +1 or -1 if two variables contain essentially the same information (e.g., one variable is a scaled version of the other or they are a mirror of each other). The IC is 0 only when two variables are statistically independent. Unlike the Pearson correlation, this metric is sensitive to nonlinear relationships between variables. Note that this metric is very  computationally intensive, hece it wil take is significantly longer to run than any other metric in this list. For more information refer to: https://doi.org/10.1038/nbt.3527
  • Uncentered correlation: The same as the Pearson correlation, except that the sample means are set to zero in the expression for uncentered correlation. The uncentered correlation coefficient lies between –1 and +1; hence the distance lies between 0 and 2.
  • Uncentered correlation, absolute value: The same as the absolute Pearson correlation, except that the sample means are set to zero in the expression for uncentered correlation. The uncentered correlation coefficient lies between 0 and +1; hence the distance lies between 0 and 1.
  • Pearson correlation, absolute value:  The absolute value of the Pearson correlation coefficient is used; hence the corresponding distance lies between 0 and 1, just like the correlation coefficient.
  • Spearman’s rank correlation: Nonparametric version of the Pearson correlation that measures the strength of association between two ranked variables. To calculate the Spearman rank correlation, each data value is replaced by their rank if the data in each vector is ordered by their value. Then the Pearson correlation between the two rank vectors instead of the data vectors is calculated. It is useful because it is more robust against outliers than the Pearson correlation.
  • Kendall’s tau: The Kendall tau distance is a metric that counts the number of pairwise disagreements between two lists. The larger the distance, the more dissimilar the two lists are.
  • Euclidean distance: Corresponds to the length of the shortest path between two points. Takes into account the difference between two samples directly, based on the magnitude of changes in the sample levels. This distance type is usually used for data sets that are normalized or without any special distribution problem.
  • City-block distance: Also known as the Manhattan or taxi cab distance; the city-block distance is the sum of distances along each dimension between two points.
row distance measure *

Distance measure for row (gene) clustering.  Options include:

  • No row clustering (default)
  • Pearson correlation
  • Information coefficient
  • Uncentered correlation
  • Uncentered correlation, absolute value
  • Pearson correlation, absolute value
  • Spearman’s rank correlation
  • Kendall’s tau
  • Euclidean distance
  • City-block distance
NOTE: Filtering beforehand is recommended since row clustering is computationally intensive.
clustering method *

Hierarchical clustering method to use.  Options include:

  • Pairwise complete-linkage: The distance between two clusters is computed as the maximum distance between a pair of objects, one in one cluster and one in another.
  • Pairwise average-linkage (default): The distance between two clusters is computed as the average distance between the elements in the two clusters.
row center  Specifies whether to center each row (gene) in the data.  Centering each row subtracts the row-wise mean or median from the values in each row of data, so that the mean or median value of each row is 0. Default: no
row normalize  Specifies whether to normalize each row (gene) in the data. Normalizing each row multiplies all values in each row of data by a scale factor S so that the sum of the squares of the values in each row is 1.0 (a separate S is computed for each row).  Default: no
column center  Specifies whether to center each column (sample) in the data. Centering each column subtracts the column-wise mean or median from the values in each column of data, so that the mean or median value of each column is 0. Default: no
column normalize  Specifies whether to normalize each column (sample) in the data. Normalizing each column multiplies all values in each column of data by a scale factor S so that the sum of the squares of the values in each column is 1.0 (a separate S is computed for each column). Default: no
output base name *

Base name for the output files

output distance matrix Whether or not output the pair-wise distance matrix. If true, the distance between each column will be computed, which can be very computationally intensive. If unsure, leave as False. Default: False.

* - required

Output Files

  1. CDT file
    Contains the original data, but reordered to reflect the clustering.
  2. ATR file (if clustering by columns/samples) or GTR file (if clustering by rows/genes)
    These files describe the order in which nodes were joined during the clustering.
  3. distance_matrix.txt
    This is a tab-separated file which contains all the distances used to compute the clustering.

License

HierarchicalClustering is distributed under a modified BSD license available at https://raw.githubusercontent.com/genepattern/HierarchicalClustering/develop/LICENSE

Platform Dependencies

Task Type:
Clustering

CPU Type:
any

Operating System:
any

Language:
Python 3.6

Version Comments

Version Release Date Description
8.1 2018-05-09 Improving performance
8 2018-02-12 Updating the docs to meet automatic build requirements
7 2018-02-01 Ported to Python 3.6
6 2013-03-13 Updated for Java 7
5 2009-02-10 Row clustering turned off by default
4 2008-08-20 Report error when out of memory. Added 64-bit Linux support. Fixed bug that caused mean centering to be performed when median centering was selected.
3 2007-03-08 Changed default distance measure
2 2005-12-16 Fixes bugs in previous version