ClusteringMeasurements

ClusteringMeasurements[{{e₁,e₂,…},…},meas]

returns the measurement meas for the clustered examples e_i.

ClusteringMeasurements[clusters,gt,meas]

assumes the ground truth clustering gt.

Details and Options

ClusteringMeasurements is used to analyze the result of a clustering process. It can work with the clustered data alone or by comparing it with ground truth information.

Possible clustering specifications clusters include:

	{{e₁,e₂,…},…}	a list of clustered examples
	<|l₁→{e₁,e₂,…},…|>	an association of clustered examples with labels l_i
	{e₁→l₁,e₂→l₂,…}	a list of examples and their corresponding cluster labels
	{e₁,e₂,…}→{l₁,l₂,…}	two separate lists for examples and labels
	{e₁,e₂,…}→cfun	an implicit classification via ClassifierFunction

Possible ground truth specifications gt include:

	{{e₁,e₂,…},…}	a list of example clusters
	<|l₁→{e₁,e₂,…},…|>	an association of example lists labeled by cluster
	{e₁→l₁,e₂→l₂,…}	a list of examples and their corresponding clusters
	{e₁,e₂,…}→{l₁,l₂,…}	separate lists for examples and clusters
	{l₁,l₂,…}	a list with cluster labels for each example

The measurements meas can have the form:

	"Summary"	a summary table of measurements
	"name"	a specific measurement "name"
	{"name₁","name₂",…}	a list of measurements
	All	all possible measurements
	"Properties"	a list of possible measurement names

Measurements can be divided into internal and external measurements.
Internal measurements generally assume that good clusters have high separation and low dispersion.

Common separation definitions (intercluster distances):

Common dispersion definitions (intracluster distances):

The notations 〈e_i〉 and 〈e〉 represent the average over a cluster and over the whole dataset, respectively.

Supported internal measurements meas include:

	"CalinskiHarabasz"	ratio between average separation and average centroid dispersion (maximize)
	"DaviesBouldin"	average maximal ratio of the sum of centroid dispersions over centroid separation for a cluster pair (minimize)
	"Dunn"	ratio of smallest minimal separation to dataset maximal dispersion (maximize)
	"RSquared"	ratio of mean average dispersion to dataset centroid dispersion (elbow)
	"Silhouette"	average difference between the intracluster distance and the intercluster distance to the next nearest cluster (maximize)
	"StandardDeviation"	mean average dispersion (elbow)
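
As an illustration of how separation and dispersion combine into a single index, the following is a minimal Python sketch of a Dunn-style ratio. It is not the implementation used by ClusteringMeasurements. It assumes Euclidean distance, takes "minimal separation" to be the closest pair of points from two different clusters, and takes "maximal dispersion" to be the widest pair of points inside one cluster. The name dunn_index is hypothetical.

```python
import math
from itertools import combinations


def dunn_index(clusters):
    """Smallest between-cluster distance divided by the largest within-cluster distance.

    Assumes at least two clusters and at least one cluster with two distinct points.
    """
    # Minimal separation: closest pair of points drawn from two different clusters.
    min_separation = min(
        math.dist(p, q)
        for a, b in combinations(clusters, 2)
        for p in a
        for q in b
    )
    # Maximal dispersion: widest pair of points inside any single cluster.
    max_dispersion = max(
        math.dist(p, q)
        for cluster in clusters
        for p, q in combinations(cluster, 2)
    )
    return min_separation / max_dispersion


# Two tight, well-separated clusters give a large ratio.
clusters = [[(0.0, 0.0), (0.0, 1.0)], [(5.0, 0.0), (5.0, 1.0)]]
print(dunn_index(clusters))  # 5.0
```

A larger value indicates clusters that are far apart relative to their own extent, which is why the index is maximized.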

Internal measurements that return a result per cluster or per example include:

	"DaviesBouldinScore"	maximal cluster similarity
	"RSquaredScore"	ratio between cluster and overall dataset dispersion
	"SilhouetteScore"	difference between the intracluster distance and the intercluster distance to the next nearest cluster
	"SilhouetteScoreList"	per-example silhouette value
	"StandardDeviationScore"	average dispersion
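
The relation between per-example, per-cluster and global silhouette values can be sketched in Python as follows. This uses the standard silhouette definition (b - a)/max(a, b) with Euclidean distance, which is an assumption about what the measurements above compute, not a statement of the built-in implementation. The function name silhouette_values and the convention of returning 0 for singleton clusters are also assumptions.

```python
import math


def silhouette_values(clusters):
    """Return, for each cluster, the list of silhouette values of its examples.

    Assumes at least two clusters.
    """
    result = []
    for i, cluster in enumerate(clusters):
        values = []
        for p in cluster:
            if len(cluster) == 1:
                values.append(0.0)  # assumed convention for singleton clusters
                continue
            # a: mean distance from p to the other members of its own cluster
            # (the term for p itself is zero, hence the division by len - 1).
            a = sum(math.dist(p, q) for q in cluster) / (len(cluster) - 1)
            # b: mean distance from p to the members of the nearest other cluster.
            b = min(
                sum(math.dist(p, q) for q in other) / len(other)
                for j, other in enumerate(clusters)
                if j != i
            )
            denom = max(a, b)
            values.append((b - a) / denom if denom else 0.0)
        result.append(values)
    return result


clusters = [[(0.0,), (1.0,)], [(10.0,), (11.0,)]]
per_example = silhouette_values(clusters)                   # one value per example
per_cluster = [sum(v) / len(v) for v in per_example]        # one value per cluster
overall = sum(map(sum, per_example)) / sum(map(len, per_example))  # single global value
```

Values close to 1 indicate examples that sit much closer to their own cluster than to any other; values near 0 or below indicate examples on or across a cluster boundary.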

External measurements compare the cluster assignment of an example e_i with its ground truth value gt.

Supported external measurements include:

	"Purity"	fraction of examples with the most common ground truth assignment in their cluster (maximize)
	"Rand"	fraction of (e_i,e_j) pairs that correctly share or do not share the same ground truth assignment (maximize)
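
The two external measurements described above can be sketched in a few lines of Python. This is an illustrative computation from the textual definitions, not the built-in implementation. It assumes the clustering and the ground truth are each given as one label per example, and the function names purity and rand_index are hypothetical.

```python
from collections import Counter
from itertools import combinations


def purity(assigned, truth):
    """Fraction of examples carrying the most common ground truth label of their cluster."""
    labels_by_cluster = {}
    for a, t in zip(assigned, truth):
        labels_by_cluster.setdefault(a, []).append(t)
    # For each cluster, count the examples that hold its majority ground truth label.
    majority = sum(
        Counter(labels).most_common(1)[0][1] for labels in labels_by_cluster.values()
    )
    return majority / len(assigned)


def rand_index(assigned, truth):
    """Fraction of example pairs on which the clustering and the ground truth agree."""
    pairs = list(combinations(range(len(assigned)), 2))
    # A pair agrees when it is together in both labelings or apart in both.
    agree = sum(
        (assigned[i] == assigned[j]) == (truth[i] == truth[j]) for i, j in pairs
    )
    return agree / len(pairs)


assigned = [0, 0, 0, 1, 1, 1]
truth = ["a", "a", "b", "b", "b", "b"]
print(purity(assigned, truth))      # 5 of 6 examples match their cluster majority
print(rand_index(assigned, truth))  # 10 of 15 pairs agree
```

Both values lie between 0 and 1, and both equal 1 when the clustering reproduces the ground truth exactly, up to a renaming of the cluster labels.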

External measurements that return a result per cluster or per example include:

	"PurityScore"	largest fraction of examples sharing the same ground truth assignment in each cluster
	"RandScore"	fraction of (e_i,e_j) pairs that correctly share or do not share the same ground truth assignment in each cluster
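
A per-cluster purity in the sense described above can be sketched in Python as follows. The sketch assumes one cluster label and one ground truth label per example. The name purity_scores is hypothetical and the code is not the built-in implementation.

```python
from collections import Counter


def purity_scores(assigned, truth):
    """Map each cluster label to the largest fraction of its examples sharing one ground truth label."""
    labels_by_cluster = {}
    for a, t in zip(assigned, truth):
        labels_by_cluster.setdefault(a, []).append(t)
    return {
        cluster: Counter(labels).most_common(1)[0][1] / len(labels)
        for cluster, labels in labels_by_cluster.items()
    }


assigned = [0, 0, 0, 1, 1, 1]
truth = ["a", "a", "b", "b", "b", "b"]
scores = purity_scores(assigned, truth)  # cluster 0 is mixed, cluster 1 is pure
```

Averaging these per-cluster values weighted by cluster size gives the fraction of examples that match their cluster majority, which is the global purity under the same assumptions.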

ClusteringMeasurements[…,{"prop₁","prop₂",…}] can be used to compute multiple properties.

ClusteringMeasurements supports the following options:

	DistanceFunction	Automatic	the distance function to use
	FeatureExtractor	Identity	how to extract features from the examples

By default, the following distance functions are used for different types of elements:

	EuclideanDistance	numeric data
	ImageDistance	images
	JaccardDissimilarity	Boolean data
	EditDistance	text and nominal sequences
	Abs[DateDifference[#1,#2]]&	dates and times
	ColorDistance	colors
	GeoDistance	geospatial data
	Boole[SameQ[#1,#2]]&	nominal data
	HammingDistance	nominal vector data
	WarpingDistance	numerical sequences

Examples


Basic Examples (2)

Get a summary of the clustering measurements:

Compute the silhouette score for a group of clusters:

Visualize the scores in a bar chart:

Compute and chart the silhouette score for individual examples:

Scope (9)

Data Formats (5)

Specify the clusters explicitly in a list:

Specify the clusters explicitly in an association:

Specify the clusters by a list of rules between examples and assignments:

Specify the clusters by a rule between examples and assignments:

Specify the clusters by a rule between examples and a ClassifierFunction[…]:

Measurements (4)

Compute a clustering property:

Compute a list of properties:

Compute a summary of the global measurements:

Get a list of available properties:

Get a list of available properties when a ground truth is specified:

Options (2)

DistanceFunction (1)

Specify a custom distance function:

FeatureExtractor (1)

Specify a custom feature extractor to pre-process the examples:

Applications (2)

Find the optimal cluster number on a synthetic dataset:

Combine the different groups in a random permutation:

Compute the k-means clustering for different values of k:

Measure the Dunn index of each set of clusters:

The optimal clustering is at 5 clusters:

The clustering process was able to recover all the original groups:

Visualize the Silhouette score for each point in a clustering:

Compute the k-means clustering for a given k:

Visualize the Silhouette score:

Compute the k-means clustering for different values of k:

Plot each set of clusters with the corresponding Silhouette profile:

Possible Issues (1)

External measurements require a ground truth specification:

Interactive Examples (1)

Cluster a list of points interactively measuring the Calinski–Harabasz index:
