Hierarchical Clustering Package
The function FindClusters finds clusters in a dataset based on a distance or dissimilarity function. This package contains functions for generating cluster hierarchies and visualizing the mergers in the hierarchical clustering.
Agglomerate[data] | collect the elements of data into a hierarchy of clusters |
Hierarchical clustering function.
The Agglomerate function computes a cluster hierarchy of a dataset. Agglomerate accepts data in the same forms accepted by FindClusters. The output from Agglomerate is a nested Cluster object representing the hierarchical clustering.
Cluster[c1,c2,d,n1,n2] | represent a merger in the cluster hierarchy where the elements c1 and c2 are the subclusters merged with distance or dissimilarity value d, and the subclusters contain n1 and n2 data elements, respectively |
An element of the cluster hierarchy.
ClusterFlatten[cluster] | flatten cluster, returning a list of the data elements contained in the cluster |
ClusterSplit[cluster,n] | split cluster into n clusters, effectively undoing the last n-1 mergers |
Functions for manipulating Cluster expressions.
The ClusterFlatten and ClusterSplit functions are utilities for manipulating Cluster objects.
option name | default value | |
DistanceFunction | Automatic | the distance or dissimilarity measure to use |
Linkage | Automatic | the clustering linkage algorithm to use |
Options for Agglomerate.
The DistanceFunction option is the same as for FindClusters. DistanceFunction defines the distance or dissimilarity between data points, while Linkage defines the dissimilarity between clusters of data points.
"Single" | smallest intercluster dissimilarity |
"Average" | average intercluster dissimilarity |
"Complete" | largest intercluster dissimilarity |
"WeightedAverage" | weighted average intercluster dissimilarity |
"Centroid" | distance from cluster centroids |
"Median" | distance from cluster medians |
"Ward" | Ward’s minimum variance dissimilarity |
f | a pure function |
Possible values for the Linkage option.
Linkage methods determine the intercluster dissimilarity, or fusion level, given the dissimilarities between member elements. Common algorithms include single linkage, which selects the smallest distance between elements; complete linkage, which selects the largest distance between elements; and centroid linkage, which uses the dissimilarity between cluster centroids.
With Linkage->f, f is a pure function that defines the linkage algorithm. Distances or dissimilarities between clusters are determined recursively using information about the distances or dissimilarities of the unmerged clusters and their counterparts to determine the distances or dissimilarities for the newly merged cluster. If i and j represent clusters to be merged, new distances or dissimilarities are recursively calculated between this merged cluster and the remaining k clusters. The function f defines the recursion and is passed arguments f[dik,djk,dij,ni,nj,nk], where d represents the distances or dissimilarities between clusters and n represents the number of data elements in a cluster. The function returns the dissimilarity between k and the cluster formed by merging clusters i and j.
Elements can be labeled using a list of rules or a single rule. Distances or dissimilarities between elements are computed using the expression on the left‐hand sides of the rules. The output Cluster object contains the right‐hand sides of the rules.
It is also possible to build a cluster hierarchy directly using a distance or dissimilarity matrix—a matrix that provides the pairwise distances or dissimilarities between all data elements in lieu of a distance or dissimilarity function. In a distance matrix, the element in the i th row and j th column stores the distance value between the i th and j th data elements. Note that only the upper‐triangular portion of the matrix is used, that is, i<j, since distances and dissimilarities are symmetric.
DistanceMatrix[data] | compute the symmetric matrix of distance or dissimilarity |
DirectAgglomerate[mat] | construct a cluster hierarchy based on the square distance matrix mat |
DirectAgglomerate[mat,list] | associate the elements of list with the corresponding rows in the distance matrix mat |
Functions for clustering using distance or dissimilarity matrices.
The DistanceMatrix function can be given a DistanceFunction option to define the distance measure.
The binary tree structure of the hierarchical cluster leads naturally to a graphical representation—the dendrogram. A dendrogram plot shows a visual representation of the merger history by connecting two lines representing clusters with a bar at their fusion level.
DendrogramPlot[list] | plots a dendrogram derived from clustering list |
DendrogramPlot[cluster] | plots a dendrogram for the Cluster object cluster |
Dendrogram plotting functions.
option name | default value | |
TruncateDendrogram | All | the level at which to truncate the dendrogram |
HighlightLevel | None | the level at which to highlight the clusters |
HighlightStyle | Automatic | graphics directives to specify the style of each highlighted cluster |
LeafLabels | None | the leaf labels |
DistanceFunction | Automatic | the distance or dissimilarity measure to use |
Linkage | Automatic | the clustering linkage algorithm to use |
Orientation | Top | the orientation of the dendrogram |
PlotStyle | Automatic | graphics directives to specify the style of the dendrogram |
Options for DendrogramPlot.
The option LeafLabels is provided to support labeling of the dendrogram. For data, a value of LeafLabels->Automatic will use the data element position for the label, but the option can also take a list of expressions or a function. The function will be applied to each data element to generate the label expression. For Cluster objects, the LeafLabels option can take a function, but not a list or Automatic, as there is no unambiguous mapping between the labels and data points in a Cluster object.
For large datasets, only a summary of the full dendrogram may be desired. Dendrograms may be truncated using the TruncateDendrogram option by providing an integer or a list of two integers. If given as a list of two integers, the first value specifies the fusion level above which mergers should not be shown, while the second indicates the fusion level under which the dendrogram should be truncated. For a single integer n, TruncateDendrogram->n is equivalent to TruncateDendrogram->{1,n}. A value of ∞ may be given as a second value to show the full clustering without truncation from below. When a cluster is truncated and labels are specified, a box is substituted for a label indicating the size of the cluster that has been truncated.