Hierarchical Clustering Package
The function
FindClusters finds clusters in a dataset based on a distance or dissimilarity function. This package contains functions for generating cluster hierarchies and visualizing the mergers in the hierarchical clustering.
| Agglomerate[data] | collects the elements of data into a hierarchy of clusters |
Hierarchical clustering function.
The
Agglomerate function computes a cluster hierarchy of a dataset.
Agglomerate accepts data in the same forms accepted by
FindClusters. The output from
Agglomerate is a nested
Cluster object representing the hierarchical clustering.
Here is a small numerical dataset. |
This constructs a hierarchical clustering of the data.
| Out[3]= |  |
|
Here is some Boolean data. |
This clusters the Boolean data.
| Out[5]= |  |
|
Here is some string data. |
This clusters the string data.
| Out[7]= |  |
|
| Cluster[c1,c2,d,n1,n2] | represents a merger in the cluster hierarchy where the elements c1 and c2 are the subclusters merged with distance or dissimilarity value d, and the subclusters contain n1 and n2 data elements respectively |
An element of the cluster hierarchy.
| ClusterFlatten[cluster] | flattens cluster returning a list of the data elements contained in the cluster |
| ClusterSplit[cluster,n] | splits cluster into n clusters, effectively undoing the last n-1 mergers |
Functions for manipulating Cluster expressions.
The
ClusterFlatten and
ClusterSplit functions are utilities for manipulating
Cluster objects.
This splits the clustering from the numerical data into the top 3 subclusters.
| Out[8]= |  |
|
This flattens the cluster.
| Out[9]= |  |
|
Options for Agglomerate.
The
DistanceFunction option is the same as for
FindClusters.
DistanceFunction defines the distance or dissimilarity between data points, while
Linkage defines the dissimilarity between clusters of data points.
| "Single" | smallest intercluster dissimilarity |
| "Average" | average intercluster dissimilarity |
| "Complete" | largest intercluster dissimilarity |
| "WeightedAverage" | weighted average intercluster dissimilarity |
| "Centroid" | distance from cluster centroids |
| "Median" | distance from cluster medians |
| "Ward" | Ward's minimum variance dissimilarity |
| f | a pure function |
Possible values for the Linkage option.
Linkage methods determine the intercluster dissimilarity, or fusion level, given the dissimilarities between member elements. Common algorithms include single linkage, which selects the smallest distance between elements; complete linkage, which selects the largest distance between elements; and centroid linkage, which uses the dissimilarity between cluster centroids.
With
Linkage->f,
f is a pure function that defines the linkage algorithm. Distances or dissimilarities between clusters are determined recursively using information about the distances or dissimilarities of the unmerged clusters and their counterparts to determine the distances or dissimilarities for the newly merged cluster. If
i and
j represent clusters to be merged, new distances or dissimilarities are recursively calculated between this merged cluster and the remaining
k clusters. The function
f defines the recursion and is passed arguments
f[dik, djk, dij, ni, nj, nk], where
d represents the distances or dissimilarities between clusters and
n represents the number of data elements in a cluster. The function returns the dissimilarity between
k and the cluster formed by merging clusters
i and
j.
This clusters data using the average linkage method.
| Out[10]= |  |
|
The following uses a pure function equivalent to the average linkage.
| Out[11]= |  |
|
Elements can be labeled using a list of rules or a single rule. Distances or dissimilarities between elements are computed using the expression on the left-hand sides of the rules. The output
Cluster object contains the right-hand sides of the rules.
This clusters the data showing results by label.
| Out[12]= |  |
|
It is also possible to build a cluster hierarchy directly using a distance or dissimilarity matrix—a matrix that provides the pairwise distances or dissimilarities between all data elements in lieu of a distance or dissimilarity function. In a distance matrix the element in the
ith row and
jth column stores the distance value between the
ith and
jth data elements. Note that only the upper-triangular portion of the matrix is used, that is
i<j, since distances and dissimilarities are symmetric.
| DistanceMatrix[data] | computes the symmetric matrix of distance or dissimilarity |
| DirectAgglomerate[mat] | constructs a cluster hierarchy based on the square distance matrix mat |
| DirectAgglomerate[mat,list] | associates the elements of list with the corresponding rows in the distance matrix mat |
Functions for clustering using distance or dissimilarity matrices.
The
DistanceMatrix function can be given a
DistanceFunction option to define the distance measure.
Here is an example distance matrix.
| Out[13]= |  |
|
This clusters the distance matrix directly, representing data elements by row number.
| Out[14]= |  |
|
This clusters the distance matrix specifying the corresponding data elements.
| Out[15]= |  |
|
The binary tree structure of the hierarchical cluster leads naturally to a graphical representation—the dendrogram. A dendrogram plot shows a visual representation of the merger history by connecting two lines representing clusters with a bar at their fusion level.
Dendrogram plotting functions.
This generates a basic dendrogram plot.
| Out[16]= |  |
|
This generates a dendrogram using the Manhattan distance.
| Out[17]= |  |
|
Options for DendrogramPlot.
This generates a dendrogram using Ward's linkage method to cluster the data.
| Out[18]= |  |
|
The option
LeafLabels is provided to support labeling of the dendrogram. For data, a value of
LeafLabels->Automatic will use the data element position for the label, but the option can also take a list of expressions or a function. The function will be applied to each data element to generate the label expression. For
Cluster objects, the
LeafLabels option can take a function but not a list or
Automatic, as there is no unambiguous mapping between the labels and data points in a
Cluster object.
This generates a dendrogram with automatically generated labels.
| Out[19]= |  |
|
This generates a dendrogram using a list of labels.
| Out[20]= |  |
|
This generates a dendrogram using a label function.
| Out[21]= |  |
|
For large datasets, only a summary of the full dendrogram may be desired. Dendrograms may be truncated using the
TruncateDendrogram option by providing an integer or a list of two integers. If given as a list of two integers, the first value specifies the fusion level above which mergers should not be shown, while the second indicates the fusion level under which the dendrogram should be truncated. For a single integer
n,
TruncateDendrogram->n is equivalent to
TruncateDendrogram->{1, n}. A value of

may be given as a second value to show the full clustering without truncation from below. When a cluster is truncated and labels are specified, a box is substituted for a label indicating the size of the cluster that has been truncated.
Here is a larger dataset to cluster. It is based on Fisher's iris data. |
This turns off tie warnings. |
This generates a dendrogram truncated to show only the highest 20 mergers.
| Out[24]= |  |
|
This highlights the two top clusters.
| Out[25]= |  |
|
This generates a dendrogram truncated between levels 2 and 20.
| Out[26]= |  |
|