ClusteringMeasurements
ClusteringMeasurements[{{e1,e2,…},…},meas]
returns the measurement meas for the clustered examples ei.
ClusteringMeasurements[clusters,gt,meas]
assumes the ground truth clustering gt.
Details and Options
- ClusteringMeasurements is used to analyze the result of a clustering process. It can work with the clustered data alone or by comparing it with ground truth information.
- Possible clustering specifications clusters include:
    {{e1,e2,…},…}            a list of clustered examples
    <|l1->{e1,e2,…},…|>      an association of clustered examples with labels li
    {e1->l1,e2->l2,…}        a list of examples and their corresponding cluster labels
    {e1,e2,…}->{l1,l2,…}     two separate lists for examples and labels
    {e1,e2,…}->cfun          an implicit classification via ClassifierFunction
- Possible ground truth specifications gt include:
    {{e1,e2,…},…}            a list of example clusters
    <|l1->{e1,e2,…},…|>      an association of example lists labeled by cluster
    {e1->l1,e2->l2,…}        a list of examples and their corresponding clusters
    {e1,e2,…}->{l1,l2,…}     separate lists for examples and clusters
    {l1,l2,…}                a list with cluster labels for each example
- The measurements meas can have the form:
    "Summary"                a summary table of measurements
    "name"                   a specific measurement "name"
    {"name1","name2",…}      a list of measurements
    All                      all possible measurements
    "Properties"             a list of possible measurement names
- Measurements can be divided into internal and external measurements.
- Internal measurements generally assume that good clusters have high separation and low dispersion.
- Common separation definitions (intercluster distances):
- Common dispersion definitions (intracluster distances):
- The notations 〈ei〉 and 〈e〉 represent the average over a cluster and over the whole dataset, respectively.
- Supported internal measurements meas include:
    "CalinskiHarabasz"       ratio between average separation and average centroid dispersion (maximize)
    "DaviesBouldin"          average maximal ratio of the sum of centroid dispersions over centroid separation for a cluster pair (minimize)
    "Dunn"                   ratio of smallest minimal separation to dataset maximal dispersion (maximize)
    "RSquared"               ratio of mean average dispersion to dataset centroid dispersion (elbow)
    "Silhouette"             average difference between intracluster distance and intercluster distance for the next nearest cluster (maximize)
    "StandardDeviation"      mean average dispersion (elbow)
- Internal measurements that return a result per cluster or per example include:
    "DaviesBouldinScore"       maximal cluster similarity
    "RSquaredScore"            ratio between cluster and overall dataset dispersion
    "SilhouetteScore"          difference between intracluster distance and intercluster distance for the next nearest cluster
    "SilhouetteScoreList"      per-example silhouette value
    "StandardDeviationScore"   average dispersion
- External measurements compare the cluster assignment of an example ei with its ground truth value gt.
- Supported external measurements include:
    "Purity"                 fraction of examples with the most common ground truth assignment in their cluster (maximize)
    "Rand"                   fraction of (ei,ej) pairs that correctly share or do not share the same ground truth assignment (maximize)
- External measurements that return a result per cluster or per example include:
    "PurityScore"            largest fraction of examples sharing the same ground truth assignment in each cluster
    "RandScore"              fraction of (ei,ej) pairs that correctly share or do not share the same ground truth assignment in each cluster
- ClusteringMeasurements[…,{"prop1","prop2",…}] can be used to compute multiple properties.
- ClusteringMeasurements supports the following options:
    DistanceFunction         Automatic    the distance function to use
    FeatureExtractor         Identity     how to extract features from the examples
- By default, the following distance functions are used for different types of elements:
    EuclideanDistance                 numeric data
    ImageDistance                     images
    JaccardDissimilarity              Boolean data
    EditDistance                      text and nominal sequences
    Abs[DateDifference[#1,#2]]&       dates and times
    ColorDistance                     colors
    GeoDistance                       geospatial data
    Boole[SameQ[#1,#2]]&              nominal data
    HammingDistance                   nominal vector data
    WarpingDistance                   numerical sequences
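To make the two external measurements described above concrete, the following is a minimal Python sketch that follows their verbal definitions directly. It is an illustration only, not the ClusteringMeasurements implementation; the helper names `purity` and `rand_index`, the example identifiers, and the `truth` mapping are assumptions introduced here.

```python
from collections import Counter
from itertools import combinations


def purity(clusters, truth):
    """Fraction of examples that carry the most common ground truth label of their cluster."""
    total = sum(len(cluster) for cluster in clusters)
    majority = sum(
        Counter(truth[e] for e in cluster).most_common(1)[0][1]
        for cluster in clusters
    )
    return majority / total


def rand_index(clusters, truth):
    """Fraction of example pairs on which the clustering and the ground truth agree."""
    # Map every example to the index of the cluster it was assigned to.
    assigned = {e: i for i, cluster in enumerate(clusters) for e in cluster}
    pairs = list(combinations(assigned, 2))
    # A pair agrees when it is together in both partitions or apart in both.
    agreeing = sum(
        (assigned[a] == assigned[b]) == (truth[a] == truth[b]) for a, b in pairs
    )
    return agreeing / len(pairs)


# Hypothetical data: five examples, two clusters, two ground truth classes.
clusters = [["a", "b", "c"], ["d", "e"]]
truth = {"a": 0, "b": 0, "c": 1, "d": 1, "e": 1}

print(purity(clusters, truth))      # 0.8
print(rand_index(clusters, truth))  # 0.6
```

Here example "c" sits in the first cluster but belongs to the second ground truth class, which costs one example in the purity count and four of the ten pairs in the Rand count.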
Examples
Basic Examples (2)
Scope (9)
Data Formats (5)
Specify the clusters explicitly in a list:
Specify the clusters explicitly in an association:
Specify the clusters by a list of rules between examples and assignments:
Specify the clusters by a rule between examples and assignments:
Specify the clusters by a rule between examples and a ClassifierFunction[…]:
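The data formats above all describe the same partition of the examples. As a rough illustration of that equivalence, the Python sketch below regroups two parallel lists (examples and labels) into the labeled and unlabeled cluster forms. The function name `group_by_label` and the sample values are hypothetical, and this is not how ClusteringMeasurements itself parses its input.

```python
from collections import defaultdict


def group_by_label(examples, labels):
    """Turn parallel example and label lists into a label -> examples mapping."""
    if len(examples) != len(labels):
        raise ValueError("examples and labels must have the same length")
    groups = defaultdict(list)
    for example, label in zip(examples, labels):
        groups[label].append(example)
    return dict(groups)


# Two parallel lists, analogous to the {e1,e2,…}->{l1,l2,…} form.
by_label = group_by_label([1.0, 1.2, 5.1, 5.3], ["a", "a", "b", "b"])

print(by_label)                 # {'a': [1.0, 1.2], 'b': [5.1, 5.3]}
print(list(by_label.values()))  # [[1.0, 1.2], [5.1, 5.3]]
```

The first printed value corresponds to the association form and the second to the plain list of clusters.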
Options (2)
Applications (2)
Find the optimal cluster number on a synthetic dataset:
Combine the different groups in a random permutation:
Compute the k-means clustering for different values of k:
Measure the Dunn index of each set of clusters:
The optimal clustering is at 5 clusters:
The clustering process was able to recover all the original groups:
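The selection procedure in this application (score each candidate clustering, keep the one with the largest Dunn index) can be sketched in Python as follows. This is a sketch under stated assumptions: separation is taken as the smallest distance between points of different clusters and dispersion as the largest cluster diameter, which is one classical reading of the Dunn index and may differ in detail from what ClusteringMeasurements computes. The candidate clusterings are hypothetical hand-written data, not the synthetic dataset of the example above.

```python
from itertools import combinations
from math import dist


def dunn_index(clusters):
    """Smallest distance between clusters divided by the largest cluster diameter."""
    separation = min(
        dist(p, q)
        for a, b in combinations(clusters, 2)
        for p in a
        for q in b
    )
    diameter = max(
        dist(p, q) for cluster in clusters for p, q in combinations(cluster, 2)
    )
    return separation / diameter


# Hypothetical candidate clusterings of the same six points, keyed by cluster count.
candidates = {
    2: [[(0, 0), (0, 1), (5, 0), (5, 1)], [(10, 0), (10, 1)]],
    3: [[(0, 0), (0, 1)], [(5, 0), (5, 1)], [(10, 0), (10, 1)]],
}

scores = {k: dunn_index(c) for k, c in candidates.items()}
best_k = max(scores, key=scores.get)
print(best_k)  # 3
```

Merging two well-separated groups inflates the largest diameter, so the two-cluster candidate scores below the three-cluster one.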
Visualize the Silhouette score for each point in a clustering:
Compute the k-means clustering for a given k:
Visualize the Silhouette score:
Compute the k-means clustering for different values of k:
Plot each set of clusters with the corresponding Silhouette profile:
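The per-point values behind such a Silhouette profile can be sketched in Python using the classical definition s = (b - a) / max(a, b), where a is the mean distance from a point to the other members of its cluster and b is the mean distance to the nearest other cluster. The normalization by max(a, b), the zero value for singleton clusters, and the function name `silhouette_scores` are assumptions of this sketch; the exact formula used by "SilhouetteScoreList" may differ.

```python
from math import dist
from statistics import mean


def silhouette_scores(clusters):
    """Return one classical silhouette value per example, cluster by cluster."""
    scores = []
    for i, cluster in enumerate(clusters):
        for idx, p in enumerate(cluster):
            others = cluster[:idx] + cluster[idx + 1:]
            if not others:
                scores.append(0.0)  # assumed convention for singleton clusters
                continue
            # a: mean distance to the other members of the same cluster
            a = mean(dist(p, q) for q in others)
            # b: mean distance to the members of the nearest other cluster
            b = min(
                mean(dist(p, q) for q in other)
                for j, other in enumerate(clusters)
                if j != i
            )
            scores.append((b - a) / max(a, b))
    return scores


# Hypothetical data: two tight, well-separated clusters (at least two are required).
clusters = [[(0, 0), (0, 1)], [(4, 0), (4, 1)]]
scores = silhouette_scores(clusters)

print([round(s, 3) for s in scores])  # [0.754, 0.754, 0.754, 0.754]
print(round(mean(scores), 3))         # 0.754
```

Values close to 1 indicate points that sit well inside their cluster; values near 0 or below indicate points that are about as close to a neighboring cluster as to their own.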
Text
Wolfram Research (2022), ClusteringMeasurements, Wolfram Language function, https://reference.wolfram.com/language/ref/ClusteringMeasurements.html.