# Hierarchical Clustering Package

The function

FindClusters finds clusters in a dataset based on a distance or dissimilarity function. This package contains functions for generating cluster hierarchies and visualizing the mergers in the hierarchical clustering.

Agglomerate[data] | collect the elements of data into a hierarchy of clusters |

Hierarchical clustering function.

The

Agglomerate function computes a cluster hierarchy of a dataset.

Agglomerate accepts data in the same forms accepted by

FindClusters. The output from

Agglomerate is a nested

Cluster object representing the hierarchical clustering.

Here is a small numerical dataset.

This constructs a hierarchical clustering of the data.

Out[3]= | |

Here is some Boolean data.

This clusters the Boolean data.

Out[5]= | |

Here is some string data.

This clusters the string data.

Out[7]= | |

Cluster[c_{1},c_{2},d,n_{1},n_{2}] | represent a merger in the cluster hierarchy where the elements and are the subclusters merged with distance or dissimilarity value d, and the subclusters contain and data elements, respectively |

An element of the cluster hierarchy.

ClusterFlatten[cluster] | flatten cluster returning a list of the data elements contained in the cluster |

ClusterSplit[cluster,n] | split cluster into n clusters, effectively undoing the last n-1 mergers |

Functions for manipulating Cluster expressions.

The

ClusterFlatten and

ClusterSplit functions are utilities for manipulating

Cluster objects.

This splits the clustering from the numerical data into the top 3 subclusters.

Out[8]= | |

This flattens the cluster.

Out[9]= | |

Options for Agglomerate.

The

DistanceFunction option is the same as for

FindClusters.

DistanceFunction defines the distance or dissimilarity between data points, while

Linkage defines the dissimilarity between clusters of data points.

"Single" | smallest intercluster dissimilarity |

"Average" | average intercluster dissimilarity |

"Complete" | largest intercluster dissimilarity |

"WeightedAverage" | weighted average intercluster dissimilarity |

"Centroid" | distance from cluster centroids |

"Median" | distance from cluster medians |

"Ward" | Ward's minimum variance dissimilarity |

f | a pure function |

Possible values for the Linkage option.

Linkage methods determine the intercluster dissimilarity, or fusion level, given the dissimilarities between member elements. Common algorithms include single linkage, which selects the smallest distance between elements; complete linkage, which selects the largest distance between elements; and centroid linkage, which uses the dissimilarity between cluster centroids.

With

Linkage->f,

f is a pure function that defines the linkage algorithm. Distances or dissimilarities between clusters are determined recursively using information about the distances or dissimilarities of the unmerged clusters and their counterparts to determine the distances or dissimilarities for the newly merged cluster. If

i and

j represent clusters to be merged, new distances or dissimilarities are recursively calculated between this merged cluster and the remaining

k clusters. The function

f defines the recursion and is passed arguments

, where

d represents the distances or dissimilarities between clusters and

n represents the number of data elements in a cluster. The function returns the dissimilarity between

k and the cluster formed by merging clusters

i and

j.

This clusters

data using the average linkage method.

Out[10]= | |

The following uses a pure function equivalent to the average linkage.

Out[11]= | |

Elements can be labeled using a list of rules or a single rule. Distances or dissimilarities between elements are computed using the expression on the left-hand sides of the rules. The output

Cluster object contains the right-hand sides of the rules.

This clusters the data showing results by label.

Out[12]= | |

It is also possible to build a cluster hierarchy directly using a distance or dissimilarity matrix—a matrix that provides the pairwise distances or dissimilarities between all data elements in lieu of a distance or dissimilarity function. In a distance matrix the element in the

i row and

j column stores the distance value between the

i and

j data elements. Note that only the upper-triangular portion of the matrix is used, that is,

, since distances and dissimilarities are symmetric.

DistanceMatrix[data] | compute the symmetric matrix of distance or dissimilarity |

DirectAgglomerate[mat] | construct a cluster hierarchy based on the square distance matrix mat |

DirectAgglomerate[mat,list] | associate the elements of list with the corresponding rows in the distance matrix mat |

Functions for clustering using distance or dissimilarity matrices.

The

DistanceMatrix function can be given a

DistanceFunction option to define the distance measure.

Here is an example distance matrix.

Out[13]= | |

This clusters the distance matrix directly, representing data elements by row number.

Out[14]= | |

This clusters the distance matrix specifying the corresponding data elements.

Out[15]= | |

The binary tree structure of the hierarchical cluster leads naturally to a graphical representation—the dendrogram. A dendrogram plot shows a visual representation of the merger history by connecting two lines representing clusters with a bar at their fusion level.

Dendrogram plotting functions.

This generates a basic dendrogram plot.

Out[16]= | |

This generates a dendrogram using the Manhattan distance.

Out[17]= | |

Options for DendrogramPlot.

This generates a dendrogram using Ward's linkage method to cluster the data.

Out[18]= | |

The option

LeafLabels is provided to support labeling of the dendrogram. For data, a value of

LeafLabels->Automatic will use the data element position for the label, but the option can also take a list of expressions or a function. The function will be applied to each data element to generate the label expression. For

Cluster objects, the

LeafLabels option can take a function but not a list or

Automatic, as there is no unambiguous mapping between the labels and data points in a

Cluster object.

This generates a dendrogram with automatically generated labels.

Out[19]= | |

This generates a dendrogram using a list of labels.

Out[20]= | |

This generates a dendrogram using a label function.

Out[21]= | |

For large datasets, only a summary of the full dendrogram may be desired. Dendrograms may be truncated using the

TruncateDendrogram option by providing an integer or a list of two integers. If given as a list of two integers, the first value specifies the fusion level above which mergers should not be shown, while the second indicates the fusion level under which the dendrogram should be truncated. For a single integer

n,

TruncateDendrogram->n is equivalent to

TruncateDendrogram. A value of

∞ may be given as a second value to show the full clustering without truncation from below. When a cluster is truncated and labels are specified, a box is substituted for a label indicating the size of the cluster that has been truncated.

Here is a larger dataset to cluster. It is based on Fisher's iris data.

This turns off tie warnings.

This generates a dendrogram truncated to show only the highest 20 mergers.

Out[24]= | |

This highlights the two top clusters.

Out[25]= | |

This generates a dendrogram truncated between levels 2 and 20.

Out[26]= | |