Binning in Data Mining
What is Binning?
Binning (also called data discretization or bucketing) is a data
preprocessing technique used in data mining to reduce noise in data.
In this method, large sets of numerical data are divided into smaller groups
called bins. All the values in each bin are then replaced with a
representative value such as the mean, median, or boundary value.
This process smooths the data and can improve the performance of data
analysis and machine learning models.
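As a rough illustration of replacing values with a representative value, the sketch below smooths data by bin means. The function name `smooth_by_bin_means`, the fixed bin size, and the sample values are assumptions made for this example only; they do not come from a particular library or dataset.

```python
def smooth_by_bin_means(values, bin_size):
    """Replace every value with the mean of the bin it falls into.

    Bins are formed by sorting the values and taking them bin_size at a time.
    """
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        bin_mean = sum(bin_values) / len(bin_values)
        # Every value in the bin is replaced by the bin mean
        smoothed.extend([bin_mean] * len(bin_values))
    return smoothed


# Illustrative sample data (assumed for this sketch)
sample = [4, 8, 15, 21, 21, 24, 25, 28, 34]
print(smooth_by_bin_means(sample, 3))
# [9.0, 9.0, 9.0, 22.0, 22.0, 22.0, 29.0, 29.0, 29.0]
```

Swapping the mean for the median, or for the nearest bin boundary, gives the other smoothing variants mentioned above.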
Simple Explanation:
Binning groups similar values together into intervals.
For example, instead of storing individual ages like:
21, 23, 25, 27, 29
We can group them into bins such as:
20–25
26–30
This makes the data easier to analyze.
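The age grouping above could be sketched as follows. The helper `age_bin` and its `edges` parameter are hypothetical names chosen for this example; the edges (20, 26, 31) are simply one way to encode the bins 20–25 and 26–30 for whole-number ages.

```python
def age_bin(age, edges=(20, 26, 31)):
    """Return the label of the bin containing age, or None if it is out of range."""
    for lo, hi in zip(edges, edges[1:]):
        # Each bin includes its lower edge and excludes its upper edge
        if lo <= age < hi:
            return f"{lo}-{hi - 1}"
    return None


ages = [21, 23, 25, 27, 29]
print([age_bin(a) for a in ages])
# ['20-25', '20-25', '20-25', '26-30', '26-30']
```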
Why is Binning Used?
Binning is used for several reasons:
- Reduce noise in the data
- Simplify complex datasets
- Improve model performance in some cases
- Reduce the risk of overfitting, especially in small datasets
- Convert numerical data into categorical data
- Make outliers and missing values easier to spot and handle, for example by giving them their own bin
Purpose of Binning
The main purpose of binning is to reduce the number of distinct data values
by grouping similar values together.
This helps in:
- Faster data processing
- Better visualization
- Clearer relationships between input variables and the target in machine learning models
Binning in Image Processing
In image processing, binning refers to combining multiple pixels into a
single larger pixel.
For example:
In 2 × 2 binning, four pixels are merged into one pixel.
Advantages:
- Reduces the amount of image data
- Improves image brightness, because each output pixel combines the signal of several input pixels
- Reduces noise in images
Disadvantage:
- Image resolution becomes lower.
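A minimal sketch of 2 × 2 pixel binning on a small grayscale image stored as a list of rows is shown below. The function `bin_pixels` and its parameters are assumptions for illustration. Summing the block is used here because it shows why the binned image is brighter; averaging instead would keep the original brightness scale.

```python
def bin_pixels(image, factor=2, combine=sum):
    """Merge each factor x factor block of pixels into a single value."""
    rows, cols = len(image), len(image[0])
    binned = []
    # Rows or columns that do not fill a complete block are dropped
    for r in range(0, rows - rows % factor, factor):
        out_row = []
        for c in range(0, cols - cols % factor, factor):
            block = [image[r + dr][c + dc]
                     for dr in range(factor)
                     for dc in range(factor)]
            out_row.append(combine(block))
        binned.append(out_row)
    return binned


# Illustrative 4 x 4 image (assumed pixel values)
image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
print(bin_pixels(image))
# [[14, 22], [46, 54]]
```

The 4 × 4 input becomes a 2 × 2 output, which is the loss of resolution listed as the disadvantage.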
Supervised Binning
Supervised binning is an advanced binning method used in machine learning.
In this method:
- The bin boundaries are created using the target variable.
- A decision tree is often used to determine the best bin divisions.
This helps improve prediction accuracy because it considers the relationship
between input features and the target variable.
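To make the idea concrete, the sketch below finds a single bin boundary the way a depth-one decision tree would, by choosing the threshold with the lowest weighted Gini impurity. This is a simplified stand-in, not a full decision tree; the function names `gini` and `best_split` and the sample data are assumptions for this example.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split(values, targets):
    """Return the threshold that minimizes the weighted Gini impurity."""
    pairs = sorted(zip(values, targets))
    best_threshold, best_score = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary can separate two equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [t for _, t in pairs[:i]]
        right = [t for _, t in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold


# Hypothetical data: hours studied and whether the student passed (1) or failed (0)
hours = [1, 2, 3, 4, 6, 7, 8, 9]
passed = [0, 0, 0, 0, 1, 1, 1, 1]
print(best_split(hours, passed))
# 5.0
```

Applying the same search again inside each resulting bin would produce more boundaries, which is roughly what a deeper tree does.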
Example of Binning
A common example of binning is a Histogram.
A histogram groups data into intervals and shows how frequently values fall
within each interval.
Example:
Marks of students:
45, 50, 52, 55, 60, 65, 70
Bins may be:
40–50
50–60
60–70
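Because neighboring bins share a boundary (50 and 60), a convention is needed: typically each bin includes its lower bound and excludes its upper bound, and the last bin also includes its upper bound (70).
This helps visualize the distribution of marks.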
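The counts behind such a histogram could be computed as in the sketch below. The function `histogram_counts` is a hypothetical helper. It assumes each bin includes its lower edge and excludes its upper edge, except the last bin, which also includes the final edge.

```python
def histogram_counts(values, edges):
    """Count how many values fall into each bin defined by consecutive edges."""
    counts = [0] * (len(edges) - 1)
    last = len(counts) - 1
    for v in values:
        for i in range(len(counts)):
            lo, hi = edges[i], edges[i + 1]
            # Half-open bins [lo, hi); the last bin also includes its upper edge
            if lo <= v < hi or (i == last and v == hi):
                counts[i] += 1
                break
    return counts


marks = [45, 50, 52, 55, 60, 65, 70]
edges = [40, 50, 60, 70]
counts = histogram_counts(marks, edges)
for i, n in enumerate(counts):
    print(f"{edges[i]}-{edges[i + 1]}: {'*' * n}")
# 40-50: *
# 50-60: ***
# 60-70: ***
```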
Methods of Binning
There are two main methods used to divide data into bins.
1. Equal Frequency Binning
In this method, the data is sorted and split so that each bin contains the same number of data values (or nearly the same, when the count does not divide evenly).
Example:
Input data:
[5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]
Output bins:
Bin 1: [5, 10, 11, 13]
Bin 2: [15, 35, 50, 55]
Bin 3: [72, 92, 204, 215]
Each bin contains four values.
2. Equal Width Binning
In this method, each bin has the same range (width).
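The bin width is calculated using the formula:
$$w = \frac{\max - \min}{m}$$
where max and min are the largest and smallest values in the data and m is the number of bins.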
Example:
Input data:
[5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]
Output bins:
Bin 1: [5, 10, 11, 13, 15, 35, 50, 55, 72]
Bin 2: [92]
Bin 3: [204, 215]
Each bin covers the same range of values: the width is (215 − 5) / 3 = 70, so the bins span 5–75, 75–145, and 145–215.
Implementation of Binning (Python Example)
Below is a simple Python program that demonstrates binning techniques.
# Equal Frequency Binning
def equifreq(arr1, m):
    arr1 = sorted(arr1)  # bins must hold consecutive values
    a = len(arr1)
    n = a // m  # number of values per bin
    for i in range(m):
        # The last bin also takes any leftover values when a is not divisible by m
        end = (i + 1) * n if i < m - 1 else a
        print(arr1[i * n:end])


# Equal Width Binning
def equiwidth(arr1, m):
    min1 = min(arr1)
    # Float division keeps the last edge at the maximum value
    w = (max(arr1) - min1) / m
    bins = [min1 + w * i for i in range(m + 1)]
    result = []
    for i in range(m):
        temp = []
        for j in arr1:
            # Each bin is [lower, upper); the last bin has no upper limit,
            # so the maximum is included and boundary values are counted once
            if j >= bins[i] and (j < bins[i + 1] or i == m - 1):
                temp.append(j)
        result.append(temp)
    print(result)


# Data to be binned
data = [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]
# Number of bins
m = 3
print("Equal Frequency Binning")
equifreq(data, m)
print("\nEqual Width Binning")
equiwidth(data, m)
Output:
Equal Frequency Binning
[5, 10, 11, 13]
[15, 35, 50, 55]
[72, 92, 204, 215]
Equal Width Binning
[[5, 10, 11, 13, 15, 35, 50, 55, 72], [92], [204, 215]]
