"DBSCAN" (Machine Learning Method)

Details & Suboptions

  • "DBSCAN" (density-based spatial clustering of applications with noise) is a density-based clustering method where the density is estimated using a neighbor-based approach. "DBSCAN" works for arbitrary cluster shapes and sizes but requires clusters to have similar densities.
  • The following plots show the results of the "DBSCAN" method applied to toy datasets (black points indicate outliers):
  • "DBSCAN" defines "core points" as data points that have more than k neighbors within a ball of radius ϵ (i.e., data points in high-density regions). Core points that lie at a distance of less than ϵ from each other belong to the same cluster. Furthermore, any point that lies at a distance of less than ϵ from a core point belongs to the cluster of that core point. Any point that is not within ϵ of a core point is considered noise.
  • This results in each cluster containing one or more core points at its core and some non-core points at its "edge". Overall, "DBSCAN" defines clusters as connected high-density regions. In the following figure, core points are red, edge points are yellow and noise points are blue:
  • In ClusteringComponents and ClusterClassify, noise points are labeled Missing["Anomalous"].
  • In FindClusters, noise points are returned as a cluster.
  • The option DistanceFunction can be used to define which distance to use.
  • The following suboptions can be given:
    "NeighborhoodRadius"     Automatic    radius ϵ
    "NeighborsNumber"        Automatic    number of neighbors k
    "DropAnomalousValues"    False        whether to drop outliers

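The core, edge, and noise definitions given in the notes above can be sketched directly in Wolfram Language. This only illustrates the definitions and is not the built-in implementation. The data, the values of eps and k, and all variable names are assumptions chosen for the example. Whether a point counts as its own neighbor is also an assumption (here it does not).

```wolfram
(* Hypothetical illustration data and parameters; not taken from the original page *)
data = RandomReal[1, {200, 2}];
eps = 0.1; k = 5;

(* Pairwise Euclidean distances between all points *)
dist = DistanceMatrix[data];

(* Number of neighbors closer than eps to each point, excluding the point itself *)
neighborCounts = (Count[#, d_ /; d < eps] - 1) & /@ dist;

(* Core points: more than k neighbors within eps *)
all = Range[Length[data]];
coreIndices = Pick[all, (# > k) & /@ neighborCounts];

(* Edge points: non-core points closer than eps to at least one core point *)
edgeIndices = Select[Complement[all, coreIndices],
   Function[i, AnyTrue[coreIndices, dist[[i, #]] < eps &]]];

(* Noise points: everything else *)
noiseIndices = Complement[all, coreIndices, edgeIndices];
```

Grouping the core points into clusters would additionally require finding connected components among core points closer than ϵ to each other, which is omitted here to keep the sketch short.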
Examples


Basic Examples  (3)

Find clusters of nearby values using the "DBSCAN" method:
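A minimal sketch of such a call, assuming a short list of illustrative numbers (the values are not from the original page):

```wolfram
(* Hypothetical input values chosen for illustration *)
values = {1, 2, 3, 10, 11, 12, 30};

(* Cluster the values with the "DBSCAN" method *)
FindClusters[values, Method -> "DBSCAN"]
```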

Train the ClassifierFunction on a list of colors using the "DBSCAN" method:

Gather the elements by their class number:
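The training and gathering steps might look as follows. The random colors and the variable names are assumptions made for this sketch, so the grouping obtained will differ from run to run.

```wolfram
(* Hypothetical input: a list of random colors *)
colors = RandomColor[30];

(* Train a ClassifierFunction on the colors using the "DBSCAN" method *)
c = ClusterClassify[colors, Method -> "DBSCAN"];

(* Group the colors by the class that the classifier assigns to each of them *)
GatherBy[colors, c]
```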

Create random 2D vectors:

Plot clusters in data found using the "DBSCAN" method:
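One way to sketch these two steps, assuming uniformly distributed random points (the sample size and the variable name are illustrative):

```wolfram
(* Hypothetical data: 300 random points in the unit square *)
data = RandomReal[1, {300, 2}];

(* Each cluster returned by FindClusters is plotted as a separate dataset *)
ListPlot[FindClusters[data, Method -> "DBSCAN"]]
```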

Scope  (2)

Obtain a random list of times:

Train the ClassifierFunction using the "DBSCAN" method:

Obtain the cluster assignment and cluster the data:

Train the ClassifierFunction using the "DBSCAN" method:

Noise points are labeled as Missing["Anomalous"]:
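A possible way to observe this labeling, assuming a dense group of points plus one isolated point that is intended to act as noise (the data is illustrative, and whether the isolated point is actually flagged depends on the automatically chosen parameters):

```wolfram
(* Hypothetical data: a tight group around the origin and one distant point *)
data = Join[RandomVariate[NormalDistribution[0, 0.1], {100, 2}], {{5., 5.}}];

(* Train a ClassifierFunction using the "DBSCAN" method *)
c = ClusterClassify[data, Method -> "DBSCAN"];

(* Class assignments for the training points *)
c[data]
```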

Options  (7)

DistanceFunction  (1)

Cluster string data using edit distance:
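A sketch of such a call, assuming a small list of illustrative strings that is not from the original page:

```wolfram
(* Hypothetical strings: two groups of similar words *)
words = {"cat", "bat", "hat", "rat", "house", "mouse", "louse"};

(* Use EditDistance as the distance between strings *)
FindClusters[words, DistanceFunction -> EditDistance, Method -> "DBSCAN"]
```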

Cluster data using Manhattan distance:
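The same pattern applies to numeric vectors. The following sketch assumes random 2D points as input:

```wolfram
(* Hypothetical data: 200 random points in the unit square *)
data = RandomReal[1, {200, 2}];

(* Use ManhattanDistance instead of the default distance *)
FindClusters[data, DistanceFunction -> ManhattanDistance, Method -> "DBSCAN"]
```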

"NeighborhoodRadius"  (2)

Find clusters by specifying the "NeighborhoodRadius" suboption:
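Suboptions are passed in a list together with the method name. In this sketch the input values and the radius of 3 are assumptions chosen for illustration:

```wolfram
(* Hypothetical input values *)
values = {1, 2, 10, 12, 3, 1, 13, 25};

(* Set the radius ϵ explicitly instead of leaving it Automatic *)
FindClusters[values, Method -> {"DBSCAN", "NeighborhoodRadius" -> 3}]
```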

Define a set of two-dimensional data points, characterized by four somewhat nebulous clusters:

Plot clusters in data found using the "DBSCAN" method:

Plot different clusterings of data using the "DBSCAN" method by varying the "NeighborhoodRadius":
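A sketch of such a comparison, assuming four noisy groups of points generated around hypothetical centers and three illustrative radius values:

```wolfram
(* Hypothetical data: 50 noisy points around each of four centers *)
centers = {{0, 0}, {3, 0}, {0, 3}, {3, 3}};
data = Flatten[
   Table[c + RandomVariate[NormalDistribution[0, 0.5], 2], {c, centers}, {50}], 1];

(* One plot per radius value, labeled with the radius that was used *)
Table[
 ListPlot[
  FindClusters[data, Method -> {"DBSCAN", "NeighborhoodRadius" -> r}],
  PlotLabel -> r],
 {r, {0.2, 0.4, 0.8}}]
```

Following the definitions in the Details section, a smaller radius is expected to leave more points without enough neighbors, while a larger one tends to merge neighboring groups.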

"NeighborsNumber"  (3)

Find clusters by specifying the "NeighborsNumber" suboption:
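A minimal sketch, assuming illustrative input values and a neighbor count of 2:

```wolfram
(* Hypothetical input values *)
values = {1, 2, 10, 12, 3, 1, 13, 25};

(* Set the number of neighbors k explicitly instead of leaving it Automatic *)
FindClusters[values, Method -> {"DBSCAN", "NeighborsNumber" -> 2}]
```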

Create random 2D vectors:

Plot clusters in data found using the "DBSCAN" method:

Plot different clusterings of data using the "DBSCAN" method by varying the "NeighborsNumber":
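Such a comparison could be sketched as follows. The random data and the three values of k are assumptions made for this example.

```wolfram
(* Hypothetical data: 300 random points in the unit square *)
data = RandomReal[1, {300, 2}];

(* One plot per value of k, labeled with the value that was used *)
Table[
 ListPlot[
  FindClusters[data, Method -> {"DBSCAN", "NeighborsNumber" -> k}],
  PlotLabel -> k],
 {k, {2, 5, 10}}]
```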

Define a set of two-dimensional data points, characterized by four somewhat nebulous clusters:

Plot clusters in data using the "DBSCAN" method:

Plot different clusterings of data using the "DBSCAN" method by varying the "NeighborsNumber":

"DropAnomalousValues"  (1)

Train the ClassifierFunction, which labels outliers as Missing["Anomalous"]:

Use the trained ClassifierFunction to identify the outliers:

Train the ClassifierFunction by dropping outliers and finding new cluster assignments:

Similarly, find clusters of nearby values with outliers:

Remove outliers using the "DropAnomalousValues" suboption:
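A sketch contrasting the two settings, assuming illustrative values with one isolated number intended to act as an outlier (whether it is actually treated as noise depends on the automatically chosen parameters):

```wolfram
(* Hypothetical input values; 100 is intended to be an outlier *)
values = {1, 2, 3, 10, 11, 12, 100};

(* Default behavior: noise points are returned as a cluster *)
FindClusters[values, Method -> "DBSCAN"]

(* With the suboption set to True, outliers are dropped from the result *)
FindClusters[values, Method -> {"DBSCAN", "DropAnomalousValues" -> True}]
```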