# Partitioning Data into Clusters

Cluster analysis is an unsupervised learning technique used for classification of data. Data elements are partitioned into groups called clusters that represent proximate collections of data elements based on a distance or dissimilarity function. Identical element pairs have zero distance or dissimilarity, and all others have positive distance or dissimilarity.

| Function | Description |
| --- | --- |
| FindClusters[data] | partition data into lists of similar elements |
| FindClusters[data,n] | partition data into exactly n lists of similar elements |

General clustering function.

The data argument of FindClusters can be a list of data elements or rules indexing elements and labels.

| Form | Meaning |
| --- | --- |
| {e1,e2,...} | data specified as a list of data elements ei |
| {e1→v1,e2→v2,...} | data specified as a list of rules between data elements ei and labels vi |
| {e1,e2,...}→{v1,v2,...} | data specified as a rule mapping data elements ei to labels vi |

Ways of specifying data in FindClusters.

The data elements ei can be numeric lists, matrices, tensors, lists of True and False elements, or lists of strings. All data elements ei must have the same dimensions.
 Here is a list of numbers.
FindClusters clusters the numbers based on their proximity.
The rule-based data syntax allows for clustering data elements and returning labels for those elements.
Here two-dimensional points are clustered and labeled with their positions in the data list.
The rule-based data syntax can also be used to cluster data based on parts of each data entry. For instance, you might want to cluster data in a data table while ignoring particular columns in the table.
 Here is a list of data entries.
This clusters the data while ignoring the first two elements in each data entry.
In principle, it is possible to cluster points given in an arbitrary number of dimensions. However, it is difficult at best to visualize the clusters above two or three dimensions. To compare optional methods in this documentation, an easily visualizable set of two-dimensional data will be used.
 The following commands define a set of 300 two-dimensional data points chosen to group into four somewhat nebulous clusters.
 This clusters the data based on the proximity of points.
Here is a plot of the clusters.
With the default settings, FindClusters has found the four clusters of points.
You can also direct FindClusters to find a specific number of clusters.
This shows the effect of choosing 3 clusters.
This shows the effect of choosing 5 clusters.

| Option name | Default value | Description |
| --- | --- | --- |
| DistanceFunction | Automatic | the distance or dissimilarity measure to use |
| Method | Automatic | the clustering method to use |

Options for FindClusters.

Randomness is used in clustering in two different ways. Some of the methods use a random assignment of some points to a specific number of clusters as a starting point. Randomness may also be used to help determine what seems to be the best number of clusters to use. Changing the random seed with FindClusters[{e1, e2, ...}, Method->{Automatic, "RandomSeed"->s}] may lead to different results in some cases.
In principle, clustering techniques can be applied to any set of data. All that is needed is a measure of how far apart each element in the set is from other elements, that is, a function giving the distance between elements.
FindClusters[{e1, e2, ...}, DistanceFunction->f] treats pairs of elements as being less similar when their distances f[ei, ej] are larger. The function f can be any appropriate distance or dissimilarity function. A dissimilarity function f satisfies the following: f[ei, ei] = 0, f[ei, ej] ≥ 0, and f[ei, ej] = f[ej, ei].
If the ei are vectors of numbers, FindClusters by default uses a squared Euclidean distance. If the ei are lists of Boolean True and False (or 0 and 1) elements, FindClusters by default uses a dissimilarity based on the normalized fraction of elements that disagree. If the ei are strings, FindClusters by default uses a distance function based on the number of point changes needed to get from one string to another.
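
| Function | Description | Formula |
| --- | --- | --- |
| EuclideanDistance[u,v] | the Euclidean norm | $\sqrt{\sum (u-v)^2}$ |
| SquaredEuclideanDistance[u,v] | squared Euclidean norm | $\sum (u-v)^2$ |
| ManhattanDistance[u,v] | the Manhattan distance | $\sum \lvert u-v \rvert$ |
| ChessboardDistance[u,v] | the chessboard or Chebyshev distance | $\max(\lvert u-v \rvert)$ |
| CanberraDistance[u,v] | the Canberra distance | $\sum \lvert u-v \rvert / (\lvert u \rvert + \lvert v \rvert)$ |
| CosineDistance[u,v] | the cosine distance | $1 - u \cdot v / (\lVert u \rVert \lVert v \rVert)$ |
| CorrelationDistance[u,v] | the correlation distance | $1 - (u-\mathrm{Mean}[u]) \cdot (v-\mathrm{Mean}[v]) / (\lVert u-\mathrm{Mean}[u] \rVert \lVert v-\mathrm{Mean}[v] \rVert)$ |
| BrayCurtisDistance[u,v] | the Bray-Curtis distance | $\sum \lvert u-v \rvert / \sum \lvert u+v \rvert$ |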

Distance functions for numerical data.
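
The formulas in the table can be checked outside Mathematica with a few lines of code. The following is a minimal Python sketch, not the built-in implementation; all function names are hypothetical, and the handling of zero components in the Canberra distance is an assumption.

```python
import math

def _norm(u):
    """Euclidean length of a vector."""
    return math.sqrt(sum(a * a for a in u))

def squared_euclidean(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def euclidean(u, v):
    return math.sqrt(squared_euclidean(u, v))

def manhattan(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))

def chessboard(u, v):
    return max(abs(a - b) for a, b in zip(u, v))

def canberra(u, v):
    # Terms where both components are zero are skipped to avoid 0/0
    # (an assumption; the built-in function may treat them differently).
    return sum(abs(a - b) / (abs(a) + abs(b)) for a, b in zip(u, v) if a or b)

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return 1 - dot / (_norm(u) * _norm(v))

def correlation(u, v):
    # Cosine distance of the mean-centered vectors.
    mean_u, mean_v = sum(u) / len(u), sum(v) / len(v)
    return cosine([a - mean_u for a in u], [b - mean_v for b in v])

def bray_curtis(u, v):
    return (sum(abs(a - b) for a, b in zip(u, v))
            / sum(abs(a + b) for a, b in zip(u, v)))

u, v = (1, 2, 3), (4, 6, 3)
print(euclidean(u, v), manhattan(u, v), chessboard(u, v))  # prints 5.0 7 4
```

Note that the squared Euclidean distance is not a metric (it violates the triangle inequality), but it still satisfies the dissimilarity conditions, which is all that clustering requires.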

This shows the clusters in datapairs found using a Manhattan distance.
Dissimilarities for Boolean vectors are typically calculated by comparing the elements of two Boolean vectors u and v pairwise. It is convenient to summarize each dissimilarity function in terms of $n_{ij}$, the number of positions at which u has the value i and v has the value j. That is, $n_{ij}$ counts the pairs {i, j} among {u1, v1}, {u2, v2}, ..., with i and j each being either 0 or 1. If the Boolean values are True and False, True is equivalent to 1 and False is equivalent to 0.

| Function | Description | Formula |
| --- | --- | --- |
| MatchingDissimilarity[u,v] | simple matching | $(n_{10}+n_{01})/\mathrm{Length}[u]$ |
| JaccardDissimilarity[u,v] | the Jaccard dissimilarity | $(n_{10}+n_{01})/(n_{11}+n_{10}+n_{01})$ |
| RussellRaoDissimilarity[u,v] | the Russell-Rao dissimilarity | $(n_{10}+n_{01}+n_{00})/\mathrm{Length}[u]$ |
| SokalSneathDissimilarity[u,v] | the Sokal-Sneath dissimilarity | $2(n_{10}+n_{01})/(n_{11}+2(n_{10}+n_{01}))$ |
| RogersTanimotoDissimilarity[u,v] | the Rogers-Tanimoto dissimilarity | $2(n_{10}+n_{01})/(n_{11}+2(n_{10}+n_{01})+n_{00})$ |
| DiceDissimilarity[u,v] | the Dice dissimilarity | $(n_{10}+n_{01})/(2n_{11}+n_{10}+n_{01})$ |
| YuleDissimilarity[u,v] | the Yule dissimilarity | $2n_{10}n_{01}/(n_{11}n_{00}+n_{10}n_{01})$ |

Dissimilarity functions for Boolean data.
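
To make the pair counts concrete, here is a small Python sketch that tallies n11, n10, n01, and n00 for two Boolean vectors and evaluates the formulas from the table. The function names are hypothetical and no guard is included for zero denominators (for example, the Jaccard formula on two all-False vectors), so treat it as an illustration rather than the built-in behavior.

```python
def pair_counts(u, v):
    """Return (n11, n10, n01, n00) for two equal-length Boolean vectors."""
    n11 = n10 = n01 = n00 = 0
    for a, b in zip(u, v):
        a, b = bool(a), bool(b)  # True/False and 1/0 are treated alike
        if a and b:
            n11 += 1
        elif a:
            n10 += 1
        elif b:
            n01 += 1
        else:
            n00 += 1
    return n11, n10, n01, n00

def matching(u, v):
    n11, n10, n01, n00 = pair_counts(u, v)
    return (n10 + n01) / len(u)

def jaccard(u, v):
    n11, n10, n01, n00 = pair_counts(u, v)
    return (n10 + n01) / (n11 + n10 + n01)

def russell_rao(u, v):
    n11, n10, n01, n00 = pair_counts(u, v)
    return (n10 + n01 + n00) / len(u)

def sokal_sneath(u, v):
    n11, n10, n01, n00 = pair_counts(u, v)
    return 2 * (n10 + n01) / (n11 + 2 * (n10 + n01))

def rogers_tanimoto(u, v):
    n11, n10, n01, n00 = pair_counts(u, v)
    return 2 * (n10 + n01) / (n11 + 2 * (n10 + n01) + n00)

def dice(u, v):
    n11, n10, n01, n00 = pair_counts(u, v)
    return (n10 + n01) / (2 * n11 + n10 + n01)

def yule(u, v):
    n11, n10, n01, n00 = pair_counts(u, v)
    return 2 * n10 * n01 / (n11 * n00 + n10 * n01)

u = [True, True, False, False, True]
v = [True, False, True, False, True]
print(pair_counts(u, v))  # prints (2, 1, 1, 1)
print(matching(u, v), jaccard(u, v))  # prints 0.4 0.5
```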

 Here is some Boolean data.
These are the clusters found using the default dissimilarity for Boolean data.

| Function | Description |
| --- | --- |
| EditDistance[u,v] | the number of edits to transform string u into string v |
| DamerauLevenshteinDistance[u,v] | Damerau-Levenshtein distance between u and v |
| HammingDistance[u,v] | the number of elements whose values disagree in u and v |

Dissimilarity functions for string data.

The edit distance is determined by counting the number of deletions, insertions, and substitutions required to transform one string into another while preserving the ordering of characters. In contrast, the Damerau-Levenshtein distance counts the number of deletions, insertions, substitutions, and transpositions, while the Hamming distance counts only the number of substitutions.
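
The three string measures described above can be sketched in Python as follows. These are textbook dynamic-programming versions with hypothetical names, not the built-in implementations. In particular, the transposition-aware function is the restricted ("optimal string alignment") variant, and whether DamerauLevenshteinDistance uses that variant or the unrestricted one is not stated in this tutorial, so that choice is an assumption.

```python
def edit_distance(s, t):
    """Minimum number of deletions, insertions, and substitutions."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        curr = [i]
        for j, b in enumerate(t, start=1):
            cost = 0 if a == b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def restricted_damerau_levenshtein(s, t):
    """Edit distance that also allows swapping two adjacent characters."""
    rows, cols = len(s) + 1, len(t) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,
                          d[i][j - 1] + 1,
                          d[i - 1][j - 1] + cost)
            if (i > 1 and j > 1
                    and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]

def hamming_distance(s, t):
    """Number of positions at which equal-length strings differ."""
    if len(s) != len(t):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(a != b for a, b in zip(s, t))

print(edit_distance("abcd", "acbd"))                   # prints 2
print(restricted_damerau_levenshtein("abcd", "acbd"))  # prints 1
print(hamming_distance("karolin", "kathrin"))          # prints 3
```

The first two calls show the difference the text describes: swapping the adjacent characters "b" and "c" costs two substitutions under the edit distance but only one transposition under the Damerau-Levenshtein distance.
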
 Here is some string data.
This clusters the string data using the edit distance.
The Method option can be used to specify different methods of clustering.

| Setting | Description |
| --- | --- |
| "Agglomerate" | find clustering hierarchically |
| "Optimize" | find clustering by local optimization |

Explicit settings for the Method option.

The methods "Agglomerate" and "Optimize" determine how to cluster the data for a particular number of clusters k. "Agglomerate" uses an agglomerative hierarchical method starting with each member of the set in a cluster of its own and fusing nearest clusters until there are k remaining. "Optimize" starts by building a set of k representative objects and clustering around those, iterating until a (locally) optimal clustering is found. The default "Optimize" method is based on partitioning around medoids.
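
The idea behind clustering around representative objects can be illustrated with a simplified k-medoids loop. This Python sketch alternates between assigning points to their nearest medoid and re-picking each cluster's medoid; it is not the partitioning-around-medoids algorithm that FindClusters uses, and the deterministic choice of the first k points as starting medoids is an assumption made so the example is reproducible. All names are hypothetical.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def k_medoids(points, k, dist, max_iterations=100):
    """Cluster points around k medoids; returns a list of k lists of points."""
    if max_iterations < 1:
        raise ValueError("max_iterations must be at least 1")
    medoids = list(range(k))  # indices of the starting medoids (assumption)
    for _ in range(max_iterations):
        # Assignment step: each point joins the cluster of its nearest medoid.
        clusters = [[] for _ in range(k)]
        for i, p in enumerate(points):
            nearest = min(range(k), key=lambda c: dist(p, points[medoids[c]]))
            clusters[nearest].append(i)
        # Update step: the new medoid minimizes total distance to its cluster.
        new_medoids = []
        for c, members in enumerate(clusters):
            if not members:  # keep the old medoid if a cluster emptied out
                new_medoids.append(medoids[c])
                continue
            new_medoids.append(min(
                members,
                key=lambda m: sum(dist(points[m], points[o]) for o in members)))
        if new_medoids == medoids:  # locally optimal: nothing changed
            break
        medoids = new_medoids
    return [[points[i] for i in members] for members in clusters]

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(k_medoids(pts, 2, squared_distance))
print(k_medoids(pts, 2, squared_distance, max_iterations=1))
```

With the default iteration limit the two visually obvious groups are recovered. Stopping after one iteration leaves the point (0, 1) in the wrong group, because the starting medoids were both taken from the lower-left group.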
Additional Method suboptions are available to allow for more control over the clustering. Available suboptions depend on the Method chosen.

| Suboption | Description |
| --- | --- |
| "SignificanceTest" | test for identifying the best number of clusters |

Suboption for all methods.

For a given set of data and distance function, the choice of the best number of clusters k may be unclear. With Method->{methodname, "SignificanceTest"->"stest"}, "stest" is used to determine statistically significant clusters to help choose an appropriate number. Possible values of "stest" are "Silhouette" and "Gap". The "Silhouette" test uses the silhouette statistic to test how well the data is clustered. The "Gap" test uses the gap statistic to determine how well the data is clustered.
The "Silhouette" test subdivides the data into successively more clusters looking for the first minimum of the silhouette statistic.
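
For readers unfamiliar with the silhouette statistic, the following Python sketch computes the mean silhouette width of a given clustering using the common textbook definition: for each point, a is the mean distance to the other members of its own cluster, b is the smallest mean distance to any other cluster, and the point's silhouette is (b - a)/max(a, b). Whether this matches the exact statistic and stopping rule used by the "Silhouette" test is an assumption. The names are hypothetical, and the sketch expects at least two clusters.

```python
def abs_diff(p, q):
    """Distance between two numbers."""
    return abs(p - q)

def mean_silhouette(clusters, dist):
    """Mean silhouette width of a clustering given as a list of lists."""
    total, count = 0.0, 0
    for ci, cluster in enumerate(clusters):
        for idx, p in enumerate(cluster):
            count += 1
            if len(cluster) == 1:
                continue  # a singleton contributes a silhouette of 0
            # a: mean distance to the other members of the same cluster
            a = (sum(dist(p, q) for j, q in enumerate(cluster) if j != idx)
                 / (len(cluster) - 1))
            # b: smallest mean distance to the members of another cluster
            b = min(sum(dist(p, q) for q in other) / len(other)
                    for cj, other in enumerate(clusters) if cj != ci)
            total += (b - a) / max(a, b)
    return total / count

good = [[0, 1], [10, 11]]
bad = [[0, 10], [1, 11]]
print(mean_silhouette(good, abs_diff) > mean_silhouette(bad, abs_diff))  # prints True
```

Values near 1 indicate tight, well-separated clusters, while negative values indicate points that sit closer to another cluster than to their own.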
The "Gap" test compares the dispersion of clusters generated from the data to that derived from a sample of null hypothesis sets. The null hypothesis sets are uniformly randomly distributed data in the box defined by the principal components of the input data. The "Gap" test takes two suboptions: "NullSets" and "Tolerance". The suboption "NullSets" sets the number of null hypothesis sets to compare with the input data. The suboption "Tolerance" sets the sensitivity. Typically, larger values of "Tolerance" favor fewer clusters being chosen. The default settings are "NullSets"->5 and "Tolerance"->1.
This shows the result of clustering datapairs using the "Silhouette" test.
Here are the clusters found using the "Gap" test with the tolerance parameter set to 3. The larger value leads to fewer clusters being selected.
Note that the clusters found in these two examples are identical. The only difference is how the number of clusters is chosen.

Suboption for the "Agglomerate" method.

With Method->{"Agglomerate", "Linkage"->f}, the specified linkage function f is used for agglomerative clustering.

| Setting | Description |
| --- | --- |
| "Single" | smallest intercluster dissimilarity |
| "Average" | average intercluster dissimilarity |
| "Complete" | largest intercluster dissimilarity |
| "WeightedAverage" | weighted average intercluster dissimilarity |
| "Centroid" | distance from cluster centroids |
| "Median" | distance from cluster medians |
| "Ward" | Ward's minimum variance dissimilarity |
| f | a pure function |

Possible values for the "Linkage" suboption.

Linkage methods determine the intercluster dissimilarity, or fusion level, given the dissimilarities between member elements.
With "Linkage"->f, f is a pure function that defines the linkage algorithm. Distances or dissimilarities between clusters are determined recursively: the values already known for the unmerged clusters are used to compute the values for each newly merged cluster. The function f defines a distance from a cluster k to the new cluster formed by fusing clusters i and j. The arguments supplied to f are $d_{ik}$, $d_{jk}$, $d_{ij}$, $n_i$, $n_j$, and $n_k$, where d is the distance between clusters and n is the number of elements in a cluster.
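
The recursive update can be sketched in Python as follows. The linkage function receives the six arguments in the order listed above (dik, djk, dij, ni, nj, nk) and returns the distance from cluster k to the merged cluster. The three example linkage functions use the standard Lance-Williams forms; their correspondence to the named "Linkage" settings is an assumption, and all names here are hypothetical rather than part of FindClusters.

```python
def abs_diff(p, q):
    return abs(p - q)

def single(dik, djk, dij, ni, nj, nk):
    return min(dik, djk)  # smallest intercluster dissimilarity

def complete(dik, djk, dij, ni, nj, nk):
    return max(dik, djk)  # largest intercluster dissimilarity

def size_weighted_average(dik, djk, dij, ni, nj, nk):
    return (ni * dik + nj * djk) / (ni + nj)

def agglomerate(points, k, dist, linkage):
    """Fuse the two nearest clusters until k remain (1 <= k <= len(points))."""
    clusters = {i: [p] for i, p in enumerate(points)}
    # d maps a pair of cluster ids (smaller id first) to their distance.
    d = {(a, b): dist(points[a], points[b])
         for a in clusters for b in clusters if a < b}

    def lookup(a, b):
        return d[(a, b)] if a < b else d[(b, a)]

    next_id = len(points)
    while len(clusters) > k:
        i, j = min(d, key=d.get)  # nearest pair of clusters
        dij = d[(i, j)]
        ni, nj = len(clusters[i]), len(clusters[j])
        # Distance from every other cluster c to the cluster about to form.
        to_new = {c: linkage(lookup(i, c), lookup(j, c), dij,
                             ni, nj, len(members))
                  for c, members in clusters.items() if c not in (i, j)}
        # Drop entries that mention the two fused clusters.
        d = {pair: value for pair, value in d.items()
             if i not in pair and j not in pair}
        clusters[next_id] = clusters.pop(i) + clusters.pop(j)
        for c, value in to_new.items():
            d[(c, next_id)] = value  # next_id is always the larger id
        next_id += 1
    return list(clusters.values())

data = [0, 2, 5, 9]
print(sorted(sorted(c) for c in agglomerate(data, 2, abs_diff, single)))
print(sorted(sorted(c) for c in agglomerate(data, 2, abs_diff, complete)))
```

On this small data set the two linkages disagree: single linkage chains 5 onto the cluster {0, 2}, while complete linkage pairs 5 with 9 instead.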
These are the clusters found using complete linkage hierarchical clustering.

| Suboption | Description |
| --- | --- |
| "Iterations" | the maximum number of iterations to use |

Suboption for the "Optimize" method.

Here are the clusters determined from a single iteration of the "Optimize" method.