Hierarchical Clustering Package

The function FindClusters finds clusters in a dataset based on a distance or dissimilarity function. This package contains functions for generating cluster hierarchies and visualizing the mergers in the hierarchical clustering.

Agglomerate[data]collect the elements of data into a hierarchy of clusters

Hierarchical clustering function.

The Agglomerate function computes a cluster hierarchy of a dataset. Agglomerate accepts data in the same forms accepted by FindClusters. The output from Agglomerate is a nested Cluster object representing the hierarchical clustering.

This loads the package.
In[1]:=
Click for copyable input
Here is a small numerical dataset.
In[2]:=
Click for copyable input
This constructs a hierarchical clustering of the data.
In[3]:=
Click for copyable input
Out[3]=
Here is some Boolean data.
In[4]:=
Click for copyable input
This clusters the Boolean data.
In[5]:=
Click for copyable input
Out[5]=
Here is some string data.
In[6]:=
Click for copyable input
This clusters the string data.
In[7]:=
Click for copyable input
Out[7]=
Cluster[c1,c2,d,n1,n2]represent a merger in the cluster hierarchy where the elements and are the subclusters merged with distance or dissimilarity value d, and the subclusters contain and data elements, respectively

An element of the cluster hierarchy.

ClusterFlatten[cluster]flatten cluster, returning a list of the data elements contained in the cluster
ClusterSplit[cluster,n]split cluster into n clusters, effectively undoing the last n-1 mergers

Functions for manipulating Cluster expressions.

The ClusterFlatten and ClusterSplit functions are utilities for manipulating Cluster objects.

This splits the clustering from the numerical data into the top 3 subclusters.
In[8]:=
Click for copyable input
Out[8]=
This flattens the cluster.
In[9]:=
Click for copyable input
Out[9]=
option name
default value
DistanceFunctionAutomaticstyle for the line
LinkageAutomaticthe clustering linkage algorithm to use

Options for Agglomerate.

The DistanceFunction option is the same as for FindClusters. DistanceFunction defines the distance or dissimilarity between data points, while Linkage defines the dissimilarity between clusters of data points.

"Single"smallest intercluster dissimilarity
"Average"average intercluster dissimilarity
"Complete"largest intercluster dissimilarity
"WeightedAverage"weighted average intercluster dissimilarity
"Centroid"distance from cluster centroids
"Median"distance from cluster medians
"Ward"Ward's minimum variance dissimilarity
fa pure function

Possible values for the Linkage option.

Linkage methods determine the intercluster dissimilarity, or fusion level, given the dissimilarities between member elements. Common algorithms include single linkage, which selects the smallest distance between elements; complete linkage, which selects the largest distance between elements; and centroid linkage, which uses the dissimilarity between cluster centroids.

With Linkage->f, f is a pure function that defines the linkage algorithm. Distances or dissimilarities between clusters are determined recursively using information about the distances or dissimilarities of the unmerged clusters and their counterparts to determine the distances or dissimilarities for the newly merged cluster. If i and j represent clusters to be merged, new distances or dissimilarities are recursively calculated between this merged cluster and the remaining k clusters. The function f defines the recursion and is passed arguments , where d represents the distances or dissimilarities between clusters and n represents the number of data elements in a cluster. The function returns the dissimilarity between k and the cluster formed by merging clusters i and j.

This clusters data using the weighted average linkage method.
In[10]:=
Click for copyable input
Out[10]=
The following uses a pure function equivalent to the weighted average linkage.
In[11]:=
Click for copyable input
Out[11]=

Elements can be labeled using a list of rules or a single rule. Distances or dissimilarities between elements are computed using the expression on the left-hand sides of the rules. The output Cluster object contains the right-hand sides of the rules.

This clusters the data, showing results by label.
In[12]:=
Click for copyable input
Out[12]=

It is also possible to build a cluster hierarchy directly using a distance or dissimilarity matrix—a matrix that provides the pairwise distances or dissimilarities between all data elements in lieu of a distance or dissimilarity function. In a distance matrix, the element in the i^(th) row and j^(th) column stores the distance value between the i^(th) and j^(th) data elements. Note that only the upper-triangular portion of the matrix is used, that is, , since distances and dissimilarities are symmetric.

DistanceMatrix[data]compute the symmetric matrix of distance or dissimilarity
DirectAgglomerate[mat]construct a cluster hierarchy based on the square distance matrix mat
DirectAgglomerate[mat,list]associate the elements of list with the corresponding rows in the distance matrix mat

Functions for clustering using distance or dissimilarity matrices.

The DistanceMatrix function can be given a DistanceFunction option to define the distance measure.

Here is an example distance matrix.
In[13]:=
Click for copyable input
Out[13]=
This clusters the distance matrix directly, representing data elements by row number.
In[14]:=
Click for copyable input
Out[14]=
This clusters the distance matrix, specifying the corresponding data elements.
In[15]:=
Click for copyable input
Out[15]=

The binary tree structure of the hierarchical cluster leads naturally to a graphical representation—the dendrogram. A dendrogram plot shows a visual representation of the merger history by connecting two lines representing clusters with a bar at their fusion level.

DendrogramPlot[list]plots a dendrogram derived from clustering list
DendrogramPlot[cluster]plots a dendrogram for the Cluster object cluster

Dendrogram plotting functions.

This generates a basic dendrogram plot.
In[16]:=
Click for copyable input
Out[16]=
This generates a dendrogram using the Manhattan distance.
In[17]:=
Click for copyable input
Out[17]=
option name
default value
TruncateDendrogramAllthe level at which to truncate the dendrogram
HighlightLevelNonethe level at which to highlight the clusters
HighlightStyleAutomaticgraphics directives to specify the style of each highlighted cluster
LeafLabelsNonethe leaf labels
DistanceFunctionAutomaticthe distance or dissimilarity measure to use
LinkageAutomaticthe clustering linkage algorithm to use
OrientationTopthe orientation of the dendrogram
PlotStyleAutomaticgraphics directives to specify the style of the dendrogram

Options for DendrogramPlot.

This generates a dendrogram using Ward's linkage method to cluster the data.
In[18]:=
Click for copyable input
Out[18]=

The option LeafLabels is provided to support labeling of the dendrogram. For data, a value of LeafLabels->Automatic will use the data element position for the label, but the option can also take a list of expressions or a function. The function will be applied to each data element to generate the label expression. For Cluster objects, the LeafLabels option can take a function, but not a list or Automatic, as there is no unambiguous mapping between the labels and data points in a Cluster object.

This generates a dendrogram with automatically generated labels.
In[19]:=
Click for copyable input
Out[19]=
This generates a dendrogram using a list of labels.
In[20]:=
Click for copyable input
Out[20]=
This generates a dendrogram using a label function.
In[21]:=
Click for copyable input
Out[21]=

For large datasets, only a summary of the full dendrogram may be desired. Dendrograms may be truncated using the TruncateDendrogram option by providing an integer or a list of two integers. If given as a list of two integers, the first value specifies the fusion level above which mergers should not be shown, while the second indicates the fusion level under which the dendrogram should be truncated. For a single integer n, TruncateDendrogram->n is equivalent to TruncateDendrogram->{1, n}. A value of may be given as a second value to show the full clustering without truncation from below. When a cluster is truncated and labels are specified, a box is substituted for a label indicating the size of the cluster that has been truncated.

Here is a larger dataset to cluster. It is based on Fisher's iris data.
In[22]:=
Click for copyable input
This turns off tie warnings.
In[23]:=
Click for copyable input
This generates a dendrogram truncated to show only the highest 20 mergers.
In[24]:=
Click for copyable input
Out[24]=
This highlights the two top clusters.
In[25]:=
Click for copyable input
Out[25]=
This generates a dendrogram truncated between levels 2 and 20.
In[26]:=
Click for copyable input
Out[26]=
New to Mathematica? Find your learning path »
Have a question? Ask support »