Partitioning Data into Clusters

Cluster analysis is an unsupervised learning technique used for classification of data. Data elements are partitioned into groups called clusters that represent proximate collections of data elements based on a distance or dissimilarity function. Identical element pairs have zero distance or dissimilarity, and all others have positive distance or dissimilarity.

FindClusters[data]    partition data into lists of similar elements
FindClusters[data, n]    partition data into at most n lists of similar elements

General clustering function.

The data argument of FindClusters can be a list of data elements, an association, or rules associating data elements with labels.

{e1, e2, …}    data specified as a list of data elements ei
{e1 -> v1, e2 -> v2, …}    data specified as a list of rules between data elements ei and labels vi
{e1, e2, …} -> {v1, v2, …}    data specified as a rule mapping data elements ei to labels vi
<|key1 -> e1, key2 -> e2, …|>    data specified as an association whose keys keyi label the data elements ei

Ways of specifying data in FindClusters.

FindClusters works for a variety of data types, including numerical, textual, and image data, as well as Boolean vectors and dates and times. All data elements ei must have the same dimensions.

Here is a list of numbers.
FindClusters clusters the numbers based on their proximity.
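A minimal sketch of these two steps, with stand-in values (the original notebook's data are not preserved here):

    data = {1, 2, 10, 12, 3, 1, 13, 25};
    FindClusters[data]  (* nearby numbers end up in the same cluster *)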

The rule-based data syntax allows for clustering data elements and returning labels for those elements.

Here two-dimensional points are clustered and labeled with their positions in the data list.
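A sketch with hypothetical points, each labeled by its position via Range:

    pts = {{1., 1.}, {5., 6.}, {1.2, 0.9}, {5.1, 6.2}, {0.9, 1.1}};
    FindClusters[Thread[pts -> Range[Length[pts]]]]
    (* the result contains the position labels, not the points themselves *)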

The rule-based data syntax can also be used to cluster data based on parts of each data entry. For instance, you might want to cluster data in a data table while ignoring particular columns in the table.

Here is a list of data entries.
This clusters the data while ignoring the first two elements in each data entry.
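A sketch, assuming a hypothetical table whose first two columns are an identifier and a name:

    entries = {{"a1", "x", 0.1, 0.2}, {"a2", "y", 0.2, 0.1},
       {"a3", "x", 5.1, 5.3}, {"a4", "y", 5.2, 5.0}};
    FindClusters[Drop[#, 2] -> # & /@ entries]
    (* clusters on columns 3 and 4 but returns the complete entries *)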

In principle, it is possible to cluster points given in an arbitrary number of dimensions. However, it is difficult at best to visualize clusters in more than two or three dimensions. To compare the available options and methods, this documentation uses an easily visualized set of two-dimensional data.

The following commands define a set of 300 two-dimensional data points chosen to group into four somewhat nebulous clusters.
This clusters the data based on the proximity of points.
Here is a plot of the clusters.
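One way to set this up (the cluster centers, spread, and random seed are illustrative assumptions, not the original definition):

    SeedRandom[2];
    centers = {{0, 0}, {4, 0}, {0, 4}, {4, 4}};
    datapairs = Join @@ Table[
        RandomVariate[MultinormalDistribution[c, 0.4 IdentityMatrix[2]], 75],
        {c, centers}];  (* 4 x 75 = 300 points *)
    cls = FindClusters[datapairs];
    ListPlot[cls]  (* each cluster is plotted in its own color *)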

With the default settings, FindClusters has found the four clusters of points.

You can also direct FindClusters to find a specific number of clusters.

This shows the effect of choosing 3 clusters.
This shows the effect of choosing 5 clusters.
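Sketches of the two calls, plotted for comparison:

    ListPlot[FindClusters[datapairs, 3]]
    ListPlot[FindClusters[datapairs, 5]]
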
option name    default value
CriterionFunction    Automatic    criterion for selecting a method
DistanceFunction    Automatic    the distance function to use
Method    Automatic    the clustering method to use
PerformanceGoal    Automatic    aspect of performance to optimize
Weights    Automatic    what weight to give to each example

Options for FindClusters.

In principle, clustering techniques can be applied to any set of data. All that is needed is a measure of how far apart each element in the set is from other elements, that is, a function giving the distance between elements.

FindClusters[{e1, e2, …}, DistanceFunction -> f] treats pairs of elements as being less similar when their distances f[ei, ej] are larger. The function f can be any appropriate distance or dissimilarity function. A dissimilarity function f satisfies the following:

f[ei, ei] = 0
f[ei, ej] >= 0
f[ei, ej] = f[ej, ei]

If the ei are vectors of numbers, FindClusters by default uses a squared Euclidean distance. If the ei are lists of Boolean True and False (or 0 and 1) elements, FindClusters by default uses a dissimilarity based on the normalized fraction of elements that disagree. If the ei are strings, FindClusters by default uses a distance function based on the number of point changes needed to get from one string to another.

EuclideanDistance[u, v]    the Euclidean norm
SquaredEuclideanDistance[u, v]    squared Euclidean norm
ManhattanDistance[u, v]    the Manhattan distance
ChessboardDistance[u, v]    the chessboard or Chebyshev distance
CanberraDistance[u, v]    the Canberra distance
CosineDistance[u, v]    the cosine distance
CorrelationDistance[u, v]    the correlation distance 1 - (u - Mean[u]).(v - Mean[v])/(Norm[u - Mean[u]] Norm[v - Mean[v]])
BrayCurtisDistance[u, v]    the Bray–Curtis distance

Distance functions for numerical data.

This shows the clusters in datapairs found using a Manhattan distance.
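A sketch of the call:

    ListPlot[FindClusters[datapairs, DistanceFunction -> ManhattanDistance]]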

Dissimilarities for Boolean vectors are typically calculated by comparing the elements of two Boolean vectors u and v pairwise. It is convenient to summarize each dissimilarity function in terms of the counts nij, where nij is the number of corresponding pairs of elements in u and v equal to i and j, respectively, with i and j being either 0 or 1. For example, n10 counts the pairs where the element of u is 1 and the corresponding element of v is 0. If the Boolean values are True and False, True is equivalent to 1 and False is equivalent to 0.

MatchingDissimilarity[u, v]    simple matching (n10 + n01)/Length[u]
JaccardDissimilarity[u, v]    the Jaccard dissimilarity (n10 + n01)/(n11 + n10 + n01)
RussellRaoDissimilarity[u, v]    the Russell–Rao dissimilarity (n10 + n01 + n00)/Length[u]
SokalSneathDissimilarity[u, v]    the Sokal–Sneath dissimilarity 2 (n10 + n01)/(n11 + 2 (n10 + n01))
RogersTanimotoDissimilarity[u, v]    the Rogers–Tanimoto dissimilarity 2 (n10 + n01)/(n11 + 2 (n10 + n01) + n00)
DiceDissimilarity[u, v]    the Dice dissimilarity (n10 + n01)/(2 n11 + n10 + n01)
YuleDissimilarity[u, v]    the Yule dissimilarity 2 n10 n01/(n11 n00 + n10 n01)

Dissimilarity functions for Boolean data.

Here is some Boolean data.
These are the clusters found using the default dissimilarity for Boolean data.
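A sketch using randomly generated Boolean vectors (illustrative data):

    SeedRandom[3];
    booldata = RandomChoice[{True, False}, {20, 10}];
    FindClusters[booldata]
    (* the default dissimilarity is based on the fraction of disagreeing elements *)
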
EditDistance[u, v]    the number of edits needed to transform string u into string v
DamerauLevenshteinDistance[u, v]    the Damerau–Levenshtein distance between u and v
HammingDistance[u, v]    the number of elements whose values disagree in u and v

Dissimilarity functions for string data.

The edit distance is determined by counting the number of deletions, insertions, and substitutions required to transform one string into another while preserving the ordering of characters. In contrast, the Damerau–Levenshtein distance counts the number of deletions, insertions, substitutions, and transpositions, while the Hamming distance counts only the number of substitutions.

Here is some string data.
This clusters the string data using the edit distance.
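A sketch with illustrative strings:

    strings = {"bat", "cat", "cats", "hat", "apple", "apples", "ample"};
    FindClusters[strings, DistanceFunction -> EditDistance]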

The Method option can be used to specify different methods of clustering.

"Agglomerate"    find clustering hierarchically
"Optimize"    find clustering by local optimization
"DBSCAN"    density-based spatial clustering of applications with noise
"GaussianMixture"    variational Gaussian mixture algorithm
"JarvisPatrick"    Jarvis–Patrick clustering algorithm
"KMeans"    k-means clustering algorithm
"KMedoids"    partitioning around medoids
"MeanShift"    mean-shift clustering algorithm
"NeighborhoodContraction"    displace examples toward high-density regions
"SpanningTree"    minimum spanning tree-based clustering algorithm
"Spectral"    spectral clustering algorithm

Explicit settings for the Method option.

By default, FindClusters tries different methods and selects the best clustering.

The methods "KMeans" and "KMedoids" determine how to cluster the data for a particular number of clusters k.

The methods "DBSCAN", "JarvisPatrick", "MeanShift", "SpanningTree", "NeighborhoodContraction", and "GaussianMixture" determine how to cluster the data without assuming any particular number of clusters.

The methods "Agglomerate" and "Spectral" can be used in both cases.

This shows the clusters in datapairs found using the "KMeans" method.
This shows the clusters in datapairs found using the "GaussianMixture" method.
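Sketches of the two calls; "KMeans" is given an explicit cluster count, while "GaussianMixture" selects one itself:

    ListPlot[FindClusters[datapairs, 4, Method -> "KMeans"]]
    ListPlot[FindClusters[datapairs, Method -> "GaussianMixture"]]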

Additional Method suboptions are available to allow for more control over the clustering. Available suboptions depend on the Method chosen.

"NeighborhoodRadius"    specifies the average radius of a neighborhood of a point
"NeighborsNumber"    specifies the average number of points in a neighborhood
"InitialCentroids"    specifies the initial centroids/medoids
ClusterDissimilarityFunction    specifies the intercluster dissimilarity

Method suboptions.

The suboption "NeighborhoodRadius" can be used in methods "DBSCAN", "MeanShift", "JarvisPatrick", "NeighborhoodContraction", and "Spectral".

The suboption "NeighborsNumber" can be used in methods "DBSCAN" and "JarvisPatrick".

The suboption "InitialCentroids" can be used in methods "KMeans" and "KMedoids".

The suboption ClusterDissimilarityFunction can be used in the method "Agglomerate".

The "NeighborhoodRadius" suboption can be used to control the average radius of the neighborhood of a generic point.

This shows different clusterings of datapairs found using the "NeighborhoodContraction" method by varying the "NeighborhoodRadius".
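A sketch varying the radius (the particular values 0.5, 1., and 2. are assumptions):

    Table[ListPlot[FindClusters[datapairs,
        Method -> {"NeighborhoodContraction", "NeighborhoodRadius" -> r}],
      PlotLabel -> r], {r, {0.5, 1., 2.}}]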

The "NeighborsNumber" suboption can be used to control the number of neighbors in the neighborhood of a generic point.

This shows different clusterings of datapairs found using the "DBSCAN" method by varying the "NeighborsNumber".
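A sketch varying the neighbor count (the particular values are assumptions):

    Table[ListPlot[FindClusters[datapairs,
        Method -> {"DBSCAN", "NeighborsNumber" -> k}],
      PlotLabel -> k], {k, {3, 10, 25}}]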

The "InitialCentroids" suboption can be used to change the initial configuration in the "KMeans" and "KMedoids" methods. Bad initial configurations may result in bad clusterings.

This shows different clusterings of datapairs found using the "KMeans" method by varying the "InitialCentroids".
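A sketch comparing a reasonable initialization with a deliberately poor one (both sets of centroids are assumptions):

    good = {{0, 0}, {4, 0}, {0, 4}, {4, 4}};
    bad = {{0, 0}, {0.5, 0}, {0, 0.5}, {0.5, 0.5}};
    ListPlot[FindClusters[datapairs, 4,
        Method -> {"KMeans", "InitialCentroids" -> #}]] & /@ {good, bad}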

With Method->{"Agglomerate",ClusterDissimilarityFunction->f}, the specified linkage function f is used for agglomerative clustering.

"Single"    smallest intercluster dissimilarity
"Average"    average intercluster dissimilarity
"Complete"    largest intercluster dissimilarity
"WeightedAverage"    weighted average intercluster dissimilarity
"Centroid"    distance from cluster centroids
"Median"    distance from cluster medians
"Ward"    Ward's minimum variance dissimilarity
f    a pure function

Possible values for the ClusterDissimilarityFunction suboption.

Linkage methods determine the intercluster dissimilarity, or fusion level, given the dissimilarities between member elements.

With ClusterDissimilarityFunction->f, f is a pure function that defines the linkage algorithm. Distances or dissimilarities between clusters are determined recursively using information about the distances or dissimilarities between unmerged clusters to determine the distances or dissimilarities for the newly merged cluster. The function f defines a distance from a cluster k to the new cluster formed by fusing clusters i and j. The arguments supplied to f are dik, djk, dij, ni, nj, and nk, where d is the distance between clusters and n is the number of elements in a cluster.

This shows different clusterings of datapairs found using the "Agglomerate" method by varying the ClusterDissimilarityFunction.
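A sketch varying the linkage; the last entry is a pure function equivalent to single linkage, using the argument convention described above:

    Table[ListPlot[FindClusters[datapairs,
        Method -> {"Agglomerate", ClusterDissimilarityFunction -> f}],
      PlotLabel -> f], {f, {"Single", "Complete", "Ward", Min[#1, #2] &}}]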

The CriterionFunction option can be used to select both the method to use and the best number of clusters.

"StandardDeviation"    root-mean-square standard deviation
"RSquared"    R-squared
"Dunn"    Dunn index
"CalinskiHarabasz"    Calinski–Harabasz index
"DaviesBouldin"    Davies–Bouldin index
Automatic    internal index

Possible settings for the CriterionFunction option.

This shows the result of clustering using different settings for CriterionFunction.
These are the clusters found using the default CriterionFunction with an automatically selected number of clusters.
These are the clusters found using the "CalinskiHarabasz" index.
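Sketches of the three calls (the settings compared in the first are illustrative choices):

    Table[ListPlot[FindClusters[datapairs, CriterionFunction -> cf],
      PlotLabel -> cf], {cf, {"StandardDeviation", "RSquared", "Dunn"}}]
    ListPlot[FindClusters[datapairs]]  (* default CriterionFunction *)
    ListPlot[FindClusters[datapairs, CriterionFunction -> "CalinskiHarabasz"]]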