Grid-Based Method in Data Mining
The grid-based clustering method is a technique used in data mining to
group data points. Instead of looking at each data point individually, it
divides the entire data space into a fixed number of small regions called
cells (or grids).
All clustering operations are performed on these grid cells, not directly
on the data points. This makes the method very fast, because the processing
time depends on the number of grid cells, not the number of data
points
Key Idea:
- Divide data space into grids
- Perform clustering on grids instead of individual points
- Faster and efficient for large datasets
Examples of Grid-Based Methods
Some popular grid-based approaches include:
STING (Statistical Information Grid) – Uses statistical information
stored in grid cells
WaveCluster – Uses wavelet transformation for clustering
CLIQUE – Works for high-dimensional data using grid and density concepts
Basics of Grid-Based Methods
Grid-based methods are especially useful when working with
multi-dimensional data, such as:
- Geographic data
- Image data
- Data with many attributes
1. Data Partitioning
Data is divided into multiple groups (clusters) based on
similarity
Each partition represents a cluster
The number of clusters (K) is often predefined
Examples of partitioning algorithms:
K-Means
PAM (K-Medoids)
CLARA
2. Data Reduction
Reduces the size of large datasets
Removes unnecessary or duplicate data
Keeps only important information
This helps in faster processing and better performance.
3. Local Pattern Discovery
Helps find patterns within specific regions (cells)
Useful for identifying hidden trends in small areas
Good for detecting localized behaviors in data
4. Scalability
Works well with large datasets
Efficient for high-dimensional data
Grid structure simplifies complex data
5. Density Estimation
Groups are formed based on dense regions of data
Sparse regions are treated as noise
Key concepts:
ε (epsilon) → Radius of neighborhood
MinPts → Minimum number of points required
If a region has enough points → it forms a cluster
Otherwise → considered noise
6. Clustering and Classification
Uses grid cells instead of individual data points
Improves processing speed
Efficient for large datasets
7. Grid-Based Indexing
Organizes data using grid structure
Helps in faster data retrieval and searching
Improves query performance
Popular Grid-Based Methods
1. K-Means Clustering
K-Means is a simple and widely used clustering algorithm.
Key Points:
- It is an unsupervised learning algorithm
- The number of clusters (K) is predefined
- Each cluster has a centroid (center point)
- Goal: Minimize distance between points and their centroid
How K-Means Works
Step 1: Choose number of clusters (K)
Step 2: Select K random centroids
Step 3: Assign each data point to nearest centroid
Step 4: Recalculate centroids
Step 5: Repeat steps 3 & 4 until no change
2. DBSCAN (Density-Based Clustering)
DBSCAN groups data based on density.
Key Concepts:
Eps (ε): Radius of neighborhood
MinPts: Minimum number of points required
Types of Points:
Core Point: Has enough neighbors
Border Point: Close to core point
Noise: Outliers
Important Condition:
A point is a core point if:
Number of points in its ε-neighborhood ≥ MinPts
3. STING (Statistical Information Grid)
STING divides space into multiple grid levels.
Key Features:
- Uses rectangular grid cells
- Each cell stores statistical values like:
- Mean
- Min
- Max
- Works in a hierarchical structure (top → bottom levels)
How STING Works
Step 1: Select a starting layer
Step 2: Calculate probability for each cell
Step 3: Mark cells as relevant or not
Step 4: Move to lower level if needed
Step 5: Repeat for relevant cells
Step 6: Retrieve data from selected cells
Step 7: Return final result
Summary
- Grid-based methods divide data into cells
- They are fast and efficient for large datasets
- Work well for multi-dimensional data
- Popular methods include K-Means, DBSCAN, and STING