ClusteringMeasurements

ClusteringMeasurements[{{e1,e2,…},…},meas]

returns the measurement meas for the clustered examples ei.

ClusteringMeasurements[clusters,gt,meas]

assumes the ground truth clustering gt.

Details and Options

  • ClusteringMeasurements is used to analyze the result of a clustering process. It can work with the clustered data alone or by comparing it with ground truth information.
  • Possible clustering specifications clusters include:
    {{e1,e2,…},…}             a list of clustered examples
    <|l1->{e1,e2,…},…|>       an association of clustered examples with labels li
    {e1->l1,e2->l2,…}         a list of examples and their corresponding cluster labels
    {e1,e2,…}->{l1,l2,…}      two separate lists for examples and labels
    {e1,e2,…}->cfun           an implicit classification via ClassifierFunction
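The explicit specification forms carry the same information and differ only in layout. The following Python sketch illustrates that equivalence by grouping labeled examples into clusters. It is not how ClusteringMeasurements parses its input, and the helper names `clusters_from_pairs` and `clusters_from_lists` are hypothetical.

```python
from collections import defaultdict


def clusters_from_pairs(pairs):
    """Group (example, label) pairs into a dict mapping label -> examples."""
    groups = defaultdict(list)
    for example, label in pairs:
        groups[label].append(example)
    return dict(groups)


def clusters_from_lists(examples, labels):
    """Same grouping, starting from two parallel lists."""
    if len(examples) != len(labels):
        raise ValueError("examples and labels must have the same length")
    return clusters_from_pairs(zip(examples, labels))


# Rule-list form: each example paired with its cluster label.
pairs = [(1.0, "a"), (1.2, "a"), (5.0, "b")]

by_label = clusters_from_pairs(pairs)   # association-like form
as_lists = list(by_label.values())      # plain list-of-clusters form
```

Here `by_label` plays the role of the association form and `as_lists` the role of the plain list of clusters; the labels are dropped in the latter, which is why measurements that only need the grouping can accept either.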
  • Possible ground truth specifications gt include:
    {{e1,e2,…},…}             a list of example clusters
    <|l1->{e1,e2,…},…|>       an association of example lists labeled by cluster
    {e1->l1,e2->l2,…}         a list of examples and their corresponding clusters
    {e1,e2,…}->{l1,l2,…}      separate lists for examples and clusters
    {l1,l2,…}                 a list with cluster labels for each example
  • The measurements meas can have the form:
    "Summary"                 a summary table of measurements
    "name"                    a specific measurement "name"
    {"name1","name2",…}       a list of measurements
    All                       all possible measurements
    "Properties"              a list of possible measurement names
  • Measurements can be divided into internal and external measurements.
  • Internal measurements generally assume that good clusters have high separation and low dispersion.
  • Common separation definitions (intercluster distances):
  • Common dispersion definitions (intracluster distances):
  • The notations ei and e represent the averages over a cluster and over the whole dataset, respectively.
  • Supported internal measurements meas include:
    "CalinskiHarabasz"        ratio between average separation and average centroid dispersion (maximize)
    "DaviesBouldin"           average maximal ratio of the sum of centroid dispersions over centroid separation for a cluster pair (minimize)
    "Dunn"                    ratio of smallest minimal separation to dataset maximal dispersion (maximize)
    "RSquared"                ratio of mean average dispersion to dataset centroid dispersion (elbow)
    "Silhouette"              average difference between intracluster distance and intercluster distance for the next nearest cluster (maximize)
    "StandardDeviation"       mean average dispersion (elbow)
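As an illustration of the separation-over-dispersion idea, here is a minimal Python sketch of the Dunn index in its textbook form: the smallest distance between points of different clusters divided by the largest distance between points of the same cluster. It assumes Euclidean distance, at least two clusters, and at least one cluster with two distinct points. The exact separation and dispersion definitions used by ClusteringMeasurements may differ, and `dunn_index` is a hypothetical name.

```python
import math
from itertools import combinations


def dunn_index(clusters):
    """Smallest between-cluster distance over largest within-cluster distance."""
    # Separation: closest pair of points taken from two different clusters.
    separation = min(
        math.dist(p, q)
        for a, b in combinations(clusters, 2)
        for p in a
        for q in b
    )
    # Dispersion: largest cluster diameter (farthest pair inside one cluster).
    diameter = max(
        math.dist(p, q)
        for cluster in clusters
        for p, q in combinations(cluster, 2)
    )
    return separation / diameter


# Two compact, well-separated clusters versus two stretched, close ones.
tight = [[(0, 0), (0, 1)], [(10, 0), (10, 1)]]
loose = [[(0, 0), (0, 4)], [(5, 0), (5, 4)]]

print(dunn_index(tight))  # separation 10, diameter 1
print(dunn_index(loose))  # separation 5, diameter 4
```

The compact clustering scores higher, which is why the measurement is marked "maximize".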
  • Internal measurements that return a result per cluster or per example include:
    "DaviesBouldinScore"      maximal cluster similarity
    "RSquaredScore"           ratio between cluster and overall dataset dispersion
    "SilhouetteScore"         difference between intracluster distance and intercluster distance for the next nearest cluster
    "SilhouetteScoreList"     per-example silhouette value
    "StandardDeviationScore"  average dispersion
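The per-example silhouette value can be sketched in Python as follows. The sketch uses the standard normalized definition s = (b - a) / max(a, b), where a is the mean distance from an example to the other members of its cluster and b is the mean distance to the nearest other cluster. Whether ClusteringMeasurements applies the same normalization is an assumption, as are the Euclidean distance, the value 0 for singleton clusters, and the hypothetical name `silhouette_values`. At least two clusters are required, and the points must not all coincide.

```python
import math


def silhouette_values(clusters):
    """Return one silhouette value per example, in cluster order."""
    values = []
    for i, own in enumerate(clusters):
        for p in own:
            if len(own) == 1:
                values.append(0.0)  # common convention for singleton clusters
                continue
            # a: mean distance to the other members of the same cluster
            # (the distance from p to itself contributes 0 to the sum).
            a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
            # b: mean distance to the members of the nearest other cluster.
            b = min(
                sum(math.dist(p, q) for q in other) / len(other)
                for j, other in enumerate(clusters)
                if j != i
            )
            values.append((b - a) / max(a, b))
    return values


clusters = [[(0.0,), (1.0,)], [(10.0,), (11.0,)]]
per_example = silhouette_values(clusters)
overall = sum(per_example) / len(per_example)
```

For these two well-separated clusters every value is close to 0.9. Averaging the values within each cluster gives a per-cluster score, and averaging over all examples gives a single global figure.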
  • External measurements compare the cluster assignment of an example ei with its ground truth value gt.
  • Supported external measurements include:
    "Purity"                  fraction of examples with the most common ground truth assignment in their cluster (maximize)
    "Rand"                    fraction of (ei,ej) pairs that correctly share or do not share the same ground truth assignment (maximize)
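Both external measurements can be computed from two parallel label lists. The Python sketch below follows the descriptions above; the function names `purity` and `rand_index` and the sample data are hypothetical, and the sketch makes no claim about how ClusteringMeasurements computes these values internally.

```python
from collections import Counter
from itertools import combinations


def purity(assigned, truth):
    """Fraction of examples carrying the majority ground truth label of their cluster."""
    by_cluster = {}
    for cluster, label in zip(assigned, truth):
        by_cluster.setdefault(cluster, Counter())[label] += 1
    majority = sum(max(counts.values()) for counts in by_cluster.values())
    return majority / len(assigned)


def rand_index(assigned, truth):
    """Fraction of example pairs on which clustering and ground truth agree."""
    pairs = list(combinations(range(len(assigned)), 2))
    agree = sum(
        (assigned[i] == assigned[j]) == (truth[i] == truth[j])
        for i, j in pairs
    )
    return agree / len(pairs)


assigned = [0, 0, 0, 1, 1, 1]              # cluster assignment per example
truth = ["x", "x", "y", "y", "y", "y"]     # ground truth label per example

print(purity(assigned, truth))      # 5 of 6 examples match their cluster majority
print(rand_index(assigned, truth))  # 10 of 15 pairs agree
```

A clustering that reproduces the ground truth exactly, up to a renaming of labels, gives 1 for both measurements.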
  • External measurements that return a result per cluster or per example include:
    "PurityScore"             largest fraction of examples sharing the same ground truth assignment in each cluster
    "RandScore"               fraction of (ei,ej) pairs that correctly share or do not share the same ground truth assignment in each cluster
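A per-cluster purity can be sketched in the same spirit. The Python example below returns one value per cluster and shows that the size-weighted mean of those values equals the overall fraction of majority-labeled examples. The name `purity_per_cluster` and the data are hypothetical, and the output layout of "PurityScore" in ClusteringMeasurements is not reproduced here.

```python
from collections import Counter


def purity_per_cluster(assigned, truth):
    """Map each cluster to the share of its examples carrying its majority label."""
    counts = {}
    for cluster, label in zip(assigned, truth):
        counts.setdefault(cluster, Counter())[label] += 1
    return {
        cluster: max(labels.values()) / sum(labels.values())
        for cluster, labels in counts.items()
    }


assigned = [0, 0, 0, 1, 1, 1]
truth = ["x", "x", "y", "y", "y", "y"]

scores = purity_per_cluster(assigned, truth)

# Weight each cluster by its size to recover the overall purity.
sizes = Counter(assigned)
overall = sum(scores[c] * sizes[c] for c in scores) / len(assigned)
```

Cluster 0 mixes two ground truth labels and scores 2/3, while cluster 1 is pure and scores 1, which makes it easy to see which cluster lowers the global value.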
  • ClusteringMeasurements[…,{"prop1","prop2",…}] can be used to compute multiple properties.
  • ClusteringMeasurements supports the following options:
    DistanceFunction          Automatic     the distance function to use
    FeatureExtractor          Identity      how to extract features from the examples
  • By default, the following distance functions are used for different types of elements:
    EuclideanDistance                   numeric data
    ImageDistance                       images
    JaccardDissimilarity                Boolean data
    EditDistance                        text and nominal sequences
    Abs[DateDifference[#1,#2]]&         dates and times
    ColorDistance                       colors
    GeoDistance                         geospatial data
    Boole[SameQ[#1,#2]]&                nominal data
    HammingDistance                     nominal vector data
    WarpingDistance                     numerical sequences
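To make three of these defaults concrete, here is a small Python sketch of the usual definitions of Euclidean distance, Jaccard dissimilarity on Boolean vectors, and Hamming distance. These are plain reimplementations for illustration, not the Wolfram Language functions themselves, and the return value 0 for two all-False vectors is an assumption of this sketch.

```python
import math


def euclidean(u, v):
    """Straight-line distance between two numeric vectors."""
    return math.dist(u, v)


def jaccard_dissimilarity(u, v):
    """Mismatching positions divided by positions where at least one entry is True."""
    either = sum(1 for a, b in zip(u, v) if a or b)
    differ = sum(1 for a, b in zip(u, v) if bool(a) != bool(b))
    return differ / either if either else 0.0  # assumption: 0 when both are all False


def hamming(u, v):
    """Number of positions at which two equal-length vectors differ."""
    return sum(1 for a, b in zip(u, v) if a != b)


print(euclidean((0, 0), (3, 4)))
print(jaccard_dissimilarity([True, False, True, False], [True, True, False, False]))
print(hamming(["a", "b", "c"], ["a", "x", "c"]))
```

Internal measurements depend on the chosen distance, so changing it can change every separation and dispersion value and therefore the resulting scores.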

Examples

Basic Examples  (2)

Get a summary of the clustering measurements:

Compute the silhouette score for a group of clusters:

Visualize the scores in a bar chart:

Compute and chart the silhouette score for individual examples:

Scope  (9)

Data Formats  (5)

Specify the clusters explicitly in a list:

Specify the clusters explicitly in an association:

Specify the clusters by a list of rules between examples and assignments:

Specify the clusters by a rule between examples and assignments:

Specify the clusters by a rule between examples and a ClassifierFunction[…]:

Measurements  (4)

Compute a clustering property:

Compute a list of properties:

Compute a summary of the global measurements:

Get a list of available properties:

Get a list of available properties when a ground truth is specified:

Options  (2)

DistanceFunction  (1)

Specify a custom distance function:

FeatureExtractor  (1)

Specify a custom feature extractor to pre-process the examples:

Applications  (2)

Find the optimal cluster number on a synthetic dataset:

Combine the different groups in a random permutation:

Compute the k-means clustering for different values of k:

Measure the Dunn index of each set of clusters:

The optimal clustering is at 5 clusters:

The clustering process was able to recover all the original groups:

Visualize the Silhouette score for each point in a clustering:

Compute the k-means clustering for a given k:

Visualize the Silhouette score:

Compute the k-means clustering for different values of k:

Plot each set of clusters with the corresponding Silhouette profile:

Possible Issues  (1)

External measurements require a ground truth specification:

Interactive Examples  (1)

Cluster a list of points interactively while measuring the "CalinskiHarabasz" index:

Wolfram Research (2022), ClusteringMeasurements, Wolfram Language function, https://reference.wolfram.com/language/ref/ClusteringMeasurements.html.
