ClusterClassify

ClusterClassify[data]

generates a ClassifierFunction[…] by partitioning data into clusters of similar elements.

ClusterClassify[data,n]

generates a ClassifierFunction[…] with n clusters.

Details and Options

ClusterClassify works for a variety of data types, including numerical, textual, and image, as well as dates and times and combinations of these.
The number of clusters can be specified in the following ways:
Automatic find the number of clusters automatically

n find exactly n clusters

UpTo[n] find at most n clusters
The following options can be given:

CriterionFunction	Automatic	criterion for selecting a method
DistanceFunction	Automatic	the distance function to use
FeatureExtractor	Identity	how to extract features from which to learn
FeatureNames	Automatic	feature names to assign for input data
FeatureTypes	Automatic	feature types to assume for input data
Method	Automatic	what method to use
MissingValueSynthesis	Automatic	how to synthesize missing values
PerformanceGoal	Automatic	aspect of performance to optimize
RandomSeeding	1234	what seeding of pseudorandom generators should be done internally
Weights	Automatic	what weight to give to each example

By default, ClusterClassify will preprocess the data automatically unless a DistanceFunction is specified.
The setting for DistanceFunction can be any distance or dissimilarity function, or a function f defining a distance between two values.
Possible settings for PerformanceGoal include:

	Automatic	automatic tradeoff among speed, accuracy, and memory
	"Memory"	minimize the storage requirements of the classifier
	"Quality"	maximize the accuracy of the classifier
	"Speed"	maximize the speed of the classifier
	"TrainingSpeed"	minimize the time spent producing the classifier

Possible settings for Method include:

	Automatic	automatically select a method
	"Agglomerate"	single linkage clustering algorithm
	"DBSCAN"	density-based spatial clustering of applications with noise
	"GaussianMixture"	variational Gaussian mixture algorithm
	"JarvisPatrick"	Jarvis–Patrick clustering algorithm
	"KMeans"	k-means clustering algorithm
	"KMedoids"	partitioning around medoids
	"MeanShift"	mean-shift clustering algorithm
	"NeighborhoodContraction"	shift data points toward high-density regions
	"SpanningTree"	minimum spanning tree-based clustering algorithm
	"Spectral"	spectral clustering algorithm

The methods "KMeans" and "KMedoids" can only be used when the number of clusters is specified.
The methods "DBSCAN", "GaussianMixture", "JarvisPatrick", "MeanShift" and "NeighborhoodContraction" can only be used when the number of clusters is Automatic.
The following plots show results of common methods on toy datasets:

Possible settings for CriterionFunction include:

	"StandardDeviation"	root-mean-square standard deviation
	"RSquared"	R-squared
	"Dunn"	Dunn index
	"CalinskiHarabasz"	Calinski–Harabasz index
	"DaviesBouldin"	Davies–Bouldin index
	"Silhouette"	Silhouette score
	Automatic	internal index

Possible settings for RandomSeeding include:

	Automatic	automatically reseed every time the function is called
	Inherited	use externally seeded random numbers
	seed	use an explicit integer or strings as a seed

ClusterClassify[…,FeatureExtractor"Minimal"] indicates that the internal preprocessing should be as simple as possible.

Examples

open allclose all

Basic Examples (3)

Train the ClassifierFunction on some numerical data:

Use the classifier function to classify a new unlabeled example:

Obtain classification probabilities for this example:

Classify multiple examples:

Plot the probabilities for the two different classes in the interval {-5,5}:

Train the ClassifierFunction on some colors by requiring the number of classes to be 5:

Use the ClassifierFunction on some unlabeled data:

Gather the elements by their class number:

Train the ClassifierFunction on some strings:

Gather the elements by their class number:

Scope (11)

Classify real numbers:

Classify vectors:

Classify Boolean vectors:

Use the classifier to assign clusters to a new Boolean True, False vector:

Use the classifier to assign clusters to a Boolean 1, 0 vector:

Look at their probabilities:

Classify images:

Use the classifier to cluster new images:

Classify 3D images:

Classify colors:

Classify strings:

Use the classifier to cluster new strings:

Classify heterogeneous data:

Classify times:

Use the classifier to cluster the data:

Classify random reals:

Look at the classifier information:

Get a description for the specific method used:

Generate random points in the plane and visualize them:

Classify the data:

Classify new random points in the place:

Visualize the resulting clustering:

Classify the same test data using IndeterminateThreshold:

Visualize the resulting clustering including the Indeterminate cluster:

Options (10)

CriterionFunction (1)

Generate some separated data and visualize it:

Construct a classifier function using the Automatic CriterionFunction:

Construct a classifier function using the Calinski–Harabasz index as CriterionFunction:

Compare the two clusterings of the data:

FeatureExtractor (1)

Create a ClassifierFunction from a list of images and classify new examples:

Create a custom FeatureExtractor to extract features:

FeatureNames (1)

Generate a classifier function and give a name to each feature:

Use the association format to assign cluster to a new example:

The list format can still be used:

FeatureTypes (1)

Generate a classifier function assuming numerical and nominal feature types:

Generate a classifier function assuming nominal feature types instead:

Compare the result on new examples:

Method (2)

Generate some data using uniform distributions:

Classify the data:

Use Information to obtain a method description:

Look at the clustered data:

Classify the data using k-means:

Look at the clustered data:

Generate a large dataset using multinormal distributions and visualize it:

Use ClusterClassify to find clusters by specifying the method to use and look at the AbsoluteTiming:

Look at the resulting clustering:

Use ClusterClassify to find clusters without specifying the method to use and look at the AbsoluteTiming:

MissingValueSynthesis (1)

Generate a large dataset using multinormal distributions and visualize it:

Use ClusterClassify to find clusters:

Get the top cluster probabilities for a point with missing data:

Set the missing value synthesis to replace each missing variable with its estimated most likely value given known values (which is the default behavior):

Replace missing variables with random samples conditioned on known values:

Get the distribution of likely clusters for the point by replacing missing variables repeatedly with the random sampling strategy:

PerformanceGoal (1)

Generate a uniformly distributed dataset and visualize it:

Obtain a classifier from this data, with an emphasis on training speed:

Assign clusters to some randomly generated data and look at the AbsoluteTiming:

Obtain a classifier from this data, with an emphasis on the speed:

Assign clusters to some randomly generated data and look at the AbsoluteTiming compared to the one above:

Visualize the two clusterings for the test data and note how the setting "TrainingSpeed" gives better results:

RandomSeeding (1)

Train several classifiers on random colors:

Compute the classifiers on a new color and observe that the result is always the same:

Train several classifiers on the same colors by using different values of the RandomSeeding option:

Compute the classifiers on and observe how the classifier differs:

Weights (1)

Generate some separated data containing outliers:

Clusterize the data:

Use the classifier function to classify the outlier together with another point:

Clusterize the data, adding a big weight on the outlier:

Use the classifier function to classify the same points:

Applications (3)

Train several classifiers on a small, uniformly distributed dataset:

Divide a triangle into segments by using the classifiers on a large number of uniformly distributed random points:

Generate some normally distributed data:

Clusterize the data without specifying the number of classes:

Clusterize the data, specifying the number of classes:

Find dominant colors in an image:

Cluster the data given by the array of pixel values of the image:

Use the classifier to assign clusters to each pixel:

Use the classifier function to find four dominant colors:

Use the classifier to get binary masks for each dominant color:

Top

More Learning

Tech Support

Wolfram Solutions

Wolfram Solutions For Education

Get Started

Grow Your Skills

Work with Us

Educational Programs for Adults

Educational Programs for Youth

Read

ClusterClassify

Details and Options

Examples

Basic Examples (3)

Scope (11)

Options (10)

CriterionFunction (1)

FeatureExtractor (1)

FeatureNames (1)

FeatureTypes (1)

Method (2)

MissingValueSynthesis (1)

PerformanceGoal (1)

RandomSeeding (1)

Weights (1)

Applications (3)

Text

CMS

APA

BibTeX

BibLaTeX

	Automatic	find the number of clusters automatically
	n	find exactly n clusters
	UpTo[n]	find at most n clusters

ClusterClassify

Details and Options

Examples

Basic Examples (3)

Scope (11)

Options (10)

CriterionFunction (1)

FeatureExtractor (1)

FeatureNames (1)

FeatureTypes (1)

Method (2)

MissingValueSynthesis (1)

PerformanceGoal (1)

RandomSeeding (1)

Weights (1)

Applications (3)

See Also

Tech Notes

Related Guides

History

Text

CMS

APA

BibTeX

BibLaTeX