Density-Based Clustering in Data Mining

Balaji. K

Density-based clustering is a clustering technique used in data mining and machine learning. It
groups data points based on how closely they are located to each other. Points that are in
dense regions form clusters, while points in sparse regions are considered noise or outliers.

What is Density-Based Clustering?

Density-based clustering is a popular unsupervised learning method used to discover patterns
in data without predefined labels.

In this method:
  •  Data points that are closely packed together form a cluster.
  •  Areas with very few points separate clusters.
  •  Points located in these low-density areas are treated as noise.

The neighborhood around a point within a radius ε (epsilon) is called the ε-neighborhood.

If the number of points in this neighborhood is greater than or equal to a minimum value called
MinPts, the point is called a core point.

Important Parameters

Density-based clustering mainly depends on two parameters:

1. EPS (ε – Epsilon)
It is the maximum distance between two points to be considered neighbors.

It defines the radius of the neighborhood.

2. MinPts
It is the minimum number of points required inside the ε-neighborhood to form a dense region.

Mathematically, the ε-neighborhood of point i is defined as:

N_ε(i) = { k ∈ D | distance(i, k) ≤ ε }

Where D represents the dataset.
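
The two parameters and the neighborhood definition above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names, the toy points, and the choice of Euclidean distance are assumptions, and the point itself is counted as a member of its own ε-neighborhood (which follows from the definition, since distance(i, i) = 0).

```python
import math

def eps_neighborhood(points, i, eps):
    """Indices k with distance(points[i], points[k]) <= eps (i itself included)."""
    return [k for k, q in enumerate(points) if math.dist(points[i], q) <= eps]

def is_core_point(points, i, eps, min_pts):
    """A core point has at least min_pts points in its eps-neighborhood."""
    return len(eps_neighborhood(points, i, eps)) >= min_pts

# Hypothetical toy data: three nearby points and one far away.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0)]

neighbors_of_0 = eps_neighborhood(points, 0, eps=1.0)
core_flags = [is_core_point(points, i, eps=1.0, min_pts=3) for i in range(len(points))]

print(neighbors_of_0)  # [0, 1, 2]
print(core_flags)      # [True, False, False, False]
```

Only the first point is a core point here: the second and third each have just two points within ε = 1.0 (themselves and the first point), and the fourth has only itself.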

Key Concepts in Density-Based Clustering

1. Directly Density Reachable

A point i is directly density reachable from point k if:

  •  i lies within the ε-neighborhood of k, and
  •  k is a core point (it has at least MinPts points in its neighborhood).

2. Density Reachable
A point i is density reachable from point j if there exists a chain of points:

j → i1 → i2 → ... → i

Where each point in the chain is directly density reachable from the previous point.

This means clusters can grow through connected dense regions.

3. Density Connected

Two points i and j are density connected if there exists a point o such that:

Both i and j are density reachable from o.

This concept helps identify points belonging to the same cluster.
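
The three relations above can be checked directly on a small dataset. The following is a brute-force sketch under stated assumptions: Euclidean distance, a point counted in its own neighborhood, and hypothetical function names and toy data. It also shows that density reachability is not symmetric, while density connectivity is.

```python
import math
from collections import deque

def neighbors(points, i, eps):
    """Indices of all points within eps of points[i] (i itself included)."""
    return [k for k, q in enumerate(points) if math.dist(points[i], q) <= eps]

def is_core(points, i, eps, min_pts):
    return len(neighbors(points, i, eps)) >= min_pts

def directly_density_reachable(points, i, k, eps, min_pts):
    """i is directly density reachable from k: k is core and i is in k's neighborhood."""
    return is_core(points, k, eps, min_pts) and i in neighbors(points, k, eps)

def density_reachable(points, i, j, eps, min_pts):
    """i is density reachable from j: a chain of direct steps leads from j to i."""
    seen, queue = {j}, deque([j])
    while queue:
        k = queue.popleft()
        if not is_core(points, k, eps, min_pts):
            continue  # only core points can extend the chain
        for n in neighbors(points, k, eps):
            if n == i:
                return True
            if n not in seen:
                seen.add(n)
                queue.append(n)
    return False

def density_connected(points, i, j, eps, min_pts):
    """i and j are density connected: both are density reachable from some point o."""
    return any(
        density_reachable(points, i, o, eps, min_pts)
        and density_reachable(points, j, o, eps, min_pts)
        for o in range(len(points))
    )

# Hypothetical toy data: four points on a line plus one distant point.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (10.0, 0.0)]
eps, min_pts = 1.0, 3

forward = density_reachable(points, 0, 1, eps, min_pts)   # point 0 from point 1
backward = density_reachable(points, 1, 0, eps, min_pts)  # point 1 from point 0
ends_connected = density_connected(points, 0, 3, eps, min_pts)
outlier_connected = density_connected(points, 0, 4, eps, min_pts)
```

With these values, points 1 and 2 are core points and points 0 and 3 are not. Point 0 is reachable from point 1, but point 1 is not reachable from point 0, because a chain cannot start at a non-core point. Points 0 and 3 are still density connected through point 1, which is why they would end up in the same cluster.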

Working of Density-Based Clustering

Consider a dataset D containing multiple data points.

  •  The algorithm starts by selecting an unprocessed point.
  •  It checks whether the point is a core point by counting its neighbors within ε.
  •  If it is a core point, a new cluster is started from it.
  •  Neighboring points are added to the cluster, and the cluster keeps growing through any
     neighbors that are themselves core points.
  •  Points that are not density reachable from any core point are marked as noise.

This process continues until all points in the dataset are processed.
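
The procedure above can be sketched as a short DBSCAN-style function. This is a simplified illustration rather than an optimized or official implementation: it uses a brute-force neighbor search with Euclidean distance, and the names `dbscan`, `NOISE`, and the toy points are assumptions made for the example.

```python
import math

NOISE = -1  # label used for points that belong to no cluster

def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, NOISE for noise."""
    def region(i):
        # Brute-force eps-neighborhood of point i (i itself included).
        return [k for k, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)  # None means "not processed yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = region(i)
        if len(nbrs) < min_pts:
            labels[i] = NOISE  # may be relabeled later as a border point
            continue
        cluster += 1  # i is a core point: start a new cluster
        labels[i] = cluster
        frontier = list(nbrs)
        while frontier:
            k = frontier.pop()
            if labels[k] == NOISE:
                labels[k] = cluster  # border point: joins but does not expand
            if labels[k] is not None:
                continue
            labels[k] = cluster
            k_nbrs = region(k)
            if len(k_nbrs) >= min_pts:
                frontier.extend(k_nbrs)  # k is core, so the cluster grows through it
    return labels

# Hypothetical toy data: two tight groups and one isolated point.
points = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
    (8.0, 8.0), (8.0, 9.0), (9.0, 8.0), (9.0, 9.0),
    (4.0, 20.0),
]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

Each neighborhood query here scans the whole dataset, so the sketch runs in quadratic time. Practical implementations usually rely on a spatial index to speed up that step.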

Major Features of Density-Based Clustering

  • It scans the dataset to detect dense regions.
  • It uses density parameters (ε and MinPts) to form clusters.
  • It can handle noise and outliers effectively.
  • It can detect clusters of any shape and size.
  • It works well with spatial datasets.

Density-Based Clustering Methods

1. DBSCAN

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is the most widely used
density-based clustering algorithm.

Features:

  •  Detects clusters based on density of points.
  •  Identifies outliers automatically.
  •  Can discover clusters with arbitrary shapes.

2. OPTICS

OPTICS (Ordering Points To Identify the Clustering Structure) is an extension of DBSCAN.

Features:
  •  Orders points based on their density relationships.
  •  Works well with datasets having varying densities.
  •  Helps identify the clustering structure of data more clearly.
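
The text above does not say how OPTICS orders points. In the standard formulation, the ordering is built from two quantities: the core distance of a point and the reachability distance between two points. The sketch below computes both by brute force. The function names and toy data are assumptions, and the point itself is counted among its MinPts nearest points, which is one common convention.

```python
import math

def core_distance(points, p, eps, min_pts):
    """Distance from p to its min_pts-th nearest point (p itself counted).

    Returns None when p is not a core point within eps.
    """
    dists = sorted(math.dist(points[p], q) for q in points)
    if min_pts > len(dists):
        return None
    d = dists[min_pts - 1]
    return d if d <= eps else None

def reachability_distance(points, o, p, eps, min_pts):
    """Reachability distance of o with respect to p, or None if p is not core."""
    cd = core_distance(points, p, eps, min_pts)
    if cd is None:
        return None
    return max(cd, math.dist(points[p], points[o]))

# Hypothetical toy data: three points close together and one farther away.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (6.0, 0.0)]
eps, min_pts = 3.0, 3

cd_0 = core_distance(points, 0, eps, min_pts)                # 2.0
cd_1 = core_distance(points, 1, eps, min_pts)                # 1.0
cd_3 = core_distance(points, 3, eps, min_pts)                # None (not core)
rd_1_from_0 = reachability_distance(points, 1, 0, eps, min_pts)  # 2.0
```

Point 1 lies at distance 1.0 from point 0, but its reachability distance with respect to point 0 is 2.0, because it can never be smaller than the core distance of point 0. Plotting these values in processing order is what lets OPTICS reveal clusters of different densities.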

3. DENCLUE

DENCLUE (DENsity-based CLUstEring) is another density-based clustering method.

Features:

  •  Uses mathematical density functions to identify clusters.
  •  Can detect clusters of complex shapes.
  •  Performs well with high-dimensional data and datasets containing large amounts of
     noise.
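
To make the idea of a "mathematical density function" concrete, the sketch below sums a Gaussian influence function over all data points, which is one common choice in DENCLUE. It only evaluates the density at a few locations and does not implement the full method. The function name, the smoothing parameter `sigma`, and the toy points are assumptions made for the example.

```python
import math

def gaussian_density(x, points, sigma):
    """Sum of the Gaussian influence of every data point, evaluated at location x."""
    return sum(
        math.exp(-math.dist(x, p) ** 2 / (2 * sigma ** 2))
        for p in points
    )

# Hypothetical toy data: three points close together and one isolated point.
points = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (10.0, 10.0)]
sigma = 1.0

d_dense = gaussian_density((0.0, 0.0), points, sigma)       # inside the tight group
d_isolated = gaussian_density((10.0, 10.0), points, sigma)  # at the isolated point
d_empty = gaussian_density((5.0, 5.0), points, sigma)       # empty region in between
```

The density is highest inside the tight group, close to 1 at the isolated point (only its own contribution is significant), and near zero in the empty region. Regions whose density stays below a chosen threshold can then be treated as noise.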