Discretization in Data Mining

shareef

Data discretization is the process of converting a large number of continuous data values into a smaller number of intervals or groups. This makes the data easier to analyze, understand, and manage.

In simple terms, discretization transforms continuous numerical data into a finite set of ranges or categories while trying to keep the loss of information as small as possible.

There are two main types of discretization:

1. Supervised Discretization

In supervised discretization, the class label (target variable) is used while dividing the data into intervals.
This means the discretization process considers how the data is related to the class values.

2. Unsupervised Discretization

In unsupervised discretization, class labels are not used.
The method depends only on the distribution of the data itself.

Whether supervised or unsupervised, a discretization method can proceed in one of two directions:
  • Top-down splitting – starting with one large interval and dividing it into smaller intervals.
  • Bottom-up merging – starting with many small intervals and combining them into larger ones.

Techniques of Data Discretization

Several techniques are used in data mining to perform discretization.

1. Histogram Analysis

A histogram is a graphical representation that shows the frequency distribution of continuous data.

It helps analysts understand the structure of the data, such as:
  • Outliers
  • Skewness
  • Normal distribution
By observing the histogram, suitable intervals for discretization can be created.
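
As a rough illustration of reading intervals off a histogram, the sketch below counts values in equal-width intervals and prints a text histogram. The `ages` data, the choice of 4 intervals, and the helper name `histogram_counts` are assumptions made for this example only.

```python
def histogram_counts(values, n_bins):
    """Count how many values fall into each of n_bins equal-width intervals.

    Assumes the values are not all identical (otherwise the width is zero).
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value goes into the last interval instead of a new one.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(n_bins + 1)]
    return edges, counts


# Hypothetical sample data with one obvious outlier (90).
ages = [22, 25, 27, 31, 33, 35, 38, 41, 45, 52, 58, 90]
edges, counts = histogram_counts(ages, 4)

for i, count in enumerate(counts):
    print(f"{edges[i]:5.1f} to {edges[i + 1]:5.1f}: {'#' * count}")
```

In this sample most values pile up in the first interval and the last interval holds only the outlier, which suggests that equal-width intervals may be a poor fit for skewed data like this.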

2. Binning

Binning is a data smoothing technique that groups a large number of continuous values into smaller intervals called bins.

This method helps:
  • Reduce noise in data
  • Simplify data representation
  • Create concept hierarchies
Example:
Marks from 0–100 can be grouped into bins like:

0–40 → Low

41–70 → Medium

71–100 → High
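
The marks example above can be sketched in a few lines. The function name `bin_mark` is hypothetical, and treating a fractional mark such as 40.5 as Medium is an assumption, since the ranges above are given only for whole numbers.

```python
from bisect import bisect_left

# Upper bound of each bin, taken from the example: 0-40, 41-70, 71-100.
UPPER_BOUNDS = [40, 70, 100]
LABELS = ["Low", "Medium", "High"]


def bin_mark(mark):
    """Return the bin label for a mark between 0 and 100."""
    if not 0 <= mark <= 100:
        raise ValueError(f"mark out of range: {mark}")
    # bisect_left finds the first upper bound that is >= mark.
    return LABELS[bisect_left(UPPER_BOUNDS, mark)]


marks = [12, 40, 41, 70, 71, 100]
binned = [bin_mark(m) for m in marks]
print(list(zip(marks, binned)))
```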

3. Cluster Analysis

Cluster analysis is another technique used for discretization.

In this method:
  • A clustering algorithm divides data into groups (clusters).
  • Each cluster contains values that are similar to each other.
  • These clusters can then be used as intervals for discretization.
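
A minimal sketch of this idea, assuming a simple one-dimensional k-means as the clustering algorithm (the text above does not name a specific one). The sample data, the value of `k`, and the way the starting centres are picked are all assumptions for illustration.

```python
def kmeans_1d(values, k, iterations=20):
    """Group one-dimensional values into k clusters with plain k-means."""
    data = sorted(values)
    # Deterministic starting centres spread through the sorted data
    # (an arbitrary choice made for this sketch).
    centers = [data[(2 * i + 1) * len(data) // (2 * k)] for i in range(k)]
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in data:
            nearest = min(range(k), key=lambda j: abs(v - centers[j]))
            clusters[nearest].append(v)
        # Move each centre to the mean of its cluster; keep it if the cluster is empty.
        centers = [sum(c) / len(c) if c else centers[j] for j, c in enumerate(clusters)]
    return clusters


# Hypothetical values that form three visible groups.
values = [1, 2, 3, 20, 21, 22, 50, 51, 52]
clusters = kmeans_1d(values, k=3)

# Each non-empty cluster becomes one interval, described by its smallest and largest value.
intervals = [(min(c), max(c)) for c in clusters if c]
print(intervals)
```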

4. Discretization using Decision Tree Analysis

Decision trees can also be used for discretization.

This method works using a top-down splitting approach and is usually supervised.

Steps involved:
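  • Select the split point of the attribute that gives the lowest class entropy (the highest information gain).
  • Divide the attribute values into two intervals at that point.
  • Apply the same process recursively to each interval.
  • Continue splitting until a stopping criterion is met and meaningful ranges are formed.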
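
One step of entropy-based splitting can be sketched as follows: try every boundary between sorted values and keep the cut whose two sides have the lowest weighted class entropy. The data, labels, and function names are hypothetical, and a full method would repeat this step recursively on each side with a stopping rule.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def best_split(values, labels):
    """Return (cut_point, weighted_entropy) for the best single split."""
    pairs = sorted(zip(values, labels))
    best_cut, best_score = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary between two equal values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if score < best_score:
            # Place the cut halfway between the two neighbouring values.
            best_cut = (pairs[i - 1][0] + pairs[i][0]) / 2
            best_score = score
    return best_cut, best_score


# Hypothetical attribute values with their class labels.
ages = [22, 25, 30, 35, 40, 45]
bought = ["no", "no", "no", "yes", "yes", "yes"]
cut, score = best_split(ages, bought)
print(cut)
```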

5. Discretization using Correlation Analysis

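In this method, a measure of correlation between the intervals of an attribute and the class label is used to find the best neighboring intervals to merge. A well-known example is ChiMerge, which uses the chi-square (χ²) test. It is a bottom-up, supervised method.

Steps include:
  • Start with many small intervals.
  • Find the adjacent intervals whose class distributions are most similar and merge them into a larger interval.
  • Repeat until a stopping criterion is met, leaving the final set of intervals.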
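
A well-known instance of this bottom-up approach is ChiMerge, which scores each pair of adjacent intervals with a chi-square statistic and merges the pair with the lowest score. The sketch below shows only that scoring step. The per-class counts and the function name `chi_square` are assumptions for illustration, and a complete method would repeat the merge until a threshold is reached.

```python
def chi_square(counts_a, counts_b):
    """Chi-square statistic for two adjacent intervals.

    Each argument is a list of per-class counts, for example [n_yes, n_no].
    """
    n_classes = len(counts_a)
    total = sum(counts_a) + sum(counts_b)
    col_totals = [counts_a[j] + counts_b[j] for j in range(n_classes)]
    chi2 = 0.0
    for row in (counts_a, counts_b):
        row_total = sum(row)
        for j in range(n_classes):
            expected = row_total * col_totals[j] / total
            if expected > 0:  # skip classes absent from both intervals
                chi2 += (row[j] - expected) ** 2 / expected
    return chi2


# Hypothetical class counts for three adjacent intervals.
intervals = [[4, 1], [3, 1], [0, 5]]
scores = [chi_square(intervals[i], intervals[i + 1]) for i in range(len(intervals) - 1)]

# The pair with the lowest score has the most similar class distributions.
merge_at = scores.index(min(scores))
print(merge_at)
```

Here the first two intervals have nearly the same class mix, so they are the pair that would be merged first.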

Data Discretization and Concept Hierarchy Generation

A concept hierarchy represents data in a structured order from specific to general concepts.

It organizes data into levels of abstraction.

For example:

City → Country → Continent

Example:

New Delhi → India → Asia

Concept hierarchies help in data summarization and data mining tasks.
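
As a small sketch of how such a hierarchy supports summarization, the lookup tables below map each city to its country and each country to its continent. The table names, the extra cities, and the visit list are assumptions added for this example.

```python
from collections import Counter

# Two levels of the hierarchy stored as plain lookup tables.
CITY_TO_COUNTRY = {"New Delhi": "India", "Mumbai": "India", "Paris": "France"}
COUNTRY_TO_CONTINENT = {"India": "Asia", "France": "Europe"}


def roll_up(city):
    """Return the path from a city up to its continent."""
    country = CITY_TO_COUNTRY[city]
    return [city, country, COUNTRY_TO_CONTINENT[country]]


path = roll_up("New Delhi")
print(" → ".join(path))

# Summarize hypothetical city-level records at the continent level.
visits = ["New Delhi", "Mumbai", "Paris", "New Delhi"]
by_continent = Counter(roll_up(city)[2] for city in visits)
print(dict(by_continent))
```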

Types of Hierarchy

1. Top-Down Mapping

Top-down mapping starts with general information and moves toward more specific details.

Example:
Continent → Country → City

2. Bottom-Up Mapping

Bottom-up mapping starts with specific data and gradually moves toward more general information.

Example:
City → Country → Continent

Data Discretization and Binarization

Data discretization converts continuous data into intervals or categories.
Data binarization converts an attribute into binary values (0 or 1), usually by comparing it with a threshold.

Example of binarization (temperature):

Above 30°C → 1 (Hot)
30°C or below → 0 (Not Hot)
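
The temperature example can be written as a one-line threshold test. The function name `binarize` and the sample readings are hypothetical, and mapping a reading of exactly 30°C to 0 is an assumption made for this sketch.

```python
def binarize(temps_c, threshold=30.0):
    """Map each temperature to 1 (Hot) if it is above the threshold, else 0 (Not Hot)."""
    return [1 if t > threshold else 0 for t in temps_c]


# Hypothetical readings in °C.
readings = [25, 30, 30.5, 41]
flags = binarize(readings)
print(flags)
```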

Importance of Discretization

Discretization is important in data mining because:
  • It simplifies complex continuous data.
  • It reduces the effect of noise in the dataset, which improves the signal-to-noise ratio.
  • It makes many machine learning algorithms work more efficiently.
  • It helps in better data interpretation and visualization.