Numerosity Reduction in Data Mining
Data reduction produces a smaller representation of a dataset so that it
becomes easier and faster to analyze. The reduced data should preserve, as
far as possible, the quality and meaning of the original data.
There are two main types of data reduction:
- Dimensionality Reduction
- Numerosity Reduction
What is Numerosity Reduction?
Numerosity reduction replaces the original data with a smaller
representation, so the entire dataset does not have to be stored.
Instead of keeping all raw data, we store:
- A model, or
- A summary of the data
There are two types of numerosity reduction:
- Parametric methods
- Non-parametric methods
Types of Numerosity Reduction
1. Parametric Methods
In parametric methods, we assume that the data fits a model. Instead of
storing all the data, we store only the model parameters (and, where
needed, any outliers).
Examples:
- Regression
- Log-linear models
Regression
Regression models the relationship between a dependent variable and one or
more independent variables.
- Simple Linear Regression → One independent variable
- Multiple Linear Regression → More than one independent variable
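For simple linear regression, the model is:
y = wx + b
Where:
y = dependent variable
x = independent variable
w, b = regression coefficients (w is the slope, b is the y-intercept)
Once w and b have been estimated, y can be approximated for any x from
these two numbers, without storing every data point.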
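As a minimal sketch of this idea, the Python snippet below fits y = wx + b by ordinary least squares and then keeps only w and b. The data points and the helper names fit_line and predict are made up for illustration.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = w*x + b; returns (w, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    w = sxy / sxx
    b = mean_y - w * mean_x
    return w, b


# Hypothetical data: six points lying close to y = 2x + 1
xs = [0, 1, 2, 3, 4, 5]
ys = [1.1, 2.9, 5.2, 6.8, 9.1, 10.9]

# After fitting, only these two numbers need to be stored.
w, b = fit_line(xs, ys)


def predict(x):
    """Approximate y for a given x from the stored parameters."""
    return w * x + b


print(f"w = {w:.3f}, b = {b:.3f}, predicted y at x = 6: {predict(6):.2f}")
```

The predictions are approximations: the closer the data is to a straight line, the less information is lost by discarding the original points.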
Log-Linear Model
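A log-linear model approximates the probability distribution of data
described by multiple discrete variables (dimensions).
- It estimates the probability of each point (cell) in the
  multidimensional space
- Works well with discrete data
- Helps represent high-dimensional data using fewer parameters than
  storing a count for every combination of values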
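A small sketch may help here. It uses the simplest log-linear model, which assumes the attributes are mutually independent, so the log of a cell's probability is a sum of one term per attribute. The independence assumption, the sample records, and the helper names marginals and log_prob are illustrative only; practical log-linear models usually add interaction terms between attributes.

```python
import math
from collections import Counter
from itertools import product

# Hypothetical records over three discrete attributes:
# (region, product, channel)
records = (
    [("north", "tea", "web")] * 4
    + [("north", "coffee", "store")] * 2
    + [("south", "tea", "store")] * 3
    + [("south", "coffee", "web")] * 1
)


def marginals(rows):
    """Return one {value: probability} table per attribute."""
    n = len(rows)
    n_attrs = len(rows[0])
    return [
        {value: count / n
         for value, count in Counter(row[i] for row in rows).items()}
        for i in range(n_attrs)
    ]


def log_prob(cell, margs):
    """log p(cell) as a sum of per-attribute terms (independence model)."""
    return sum(math.log(m[value]) for m, value in zip(margs, cell))


margs = marginals(records)

# Estimated probability of one cell, computed from the marginals alone
estimate = math.exp(log_prob(("north", "tea", "web"), margs))

# The full table has one entry per cell; the model stores far fewer numbers.
cells = list(product(*(m.keys() for m in margs)))
stored_values = sum(len(m) for m in margs)
total = sum(math.exp(log_prob(cell, margs)) for cell in cells)

print(f"cells: {len(cells)}, stored values: {stored_values}")
print(f"estimated p(north, tea, web) = {estimate:.2f}")
```

With three two-valued attributes the saving is small, but the number of cells grows multiplicatively with each added attribute while the number of marginal values grows only additively.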
2. Non-Parametric Methods
Non-parametric methods do not assume any model. Instead, they reduce data
by creating summaries or groups.
These methods are:
- Easier to apply
- More flexible
- But may achieve less reduction than parametric methods
Types of Non-Parametric Methods
1. Histograms
- Represent data using frequency counts
- Data is divided into bins (ranges)
- Helps understand data distribution quickly
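The binning step can be sketched as below, assuming equal-width bins (other schemes, such as equal-frequency bins, are also common). The function name equal_width_histogram and the sample values are hypothetical.

```python
def equal_width_histogram(values, num_bins):
    """Return (bin_edges, counts) for equal-width bins over the value range."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    if width == 0:
        # All values are identical: a single bin holds everything.
        return [lo, hi], [len(values)]
    counts = [0] * num_bins
    for v in values:
        # The maximum value would land one past the last bin, so clamp it.
        idx = min(int((v - lo) / width), num_bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(num_bins + 1)]
    return edges, counts


# Hypothetical attribute values
values = [0, 2, 5, 8, 10, 11, 14, 15, 18, 19, 22, 25, 27, 30]

# Only the bin edges and counts are kept, not the 14 raw values.
edges, counts = equal_width_histogram(values, 3)

for left, right, count in zip(edges, edges[1:], counts):
    print(f"[{left:.0f}, {right:.0f}): {count}")
```

The last bin is treated as closed on the right so that the maximum value is counted.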
2. Clustering
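- Groups similar data into clusters; the cluster representations are
  stored in place of the raw data
- Data inside a cluster is similar
- Data in different clusters is different
Cluster quality is measured using:
- Diameter → Maximum distance between any two points in the cluster
- Centroid Distance → Average distance of each point from the cluster
  centroid (center)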
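The two quality measures can be computed as in the sketch below, which assumes Euclidean distance and a cluster that has already been formed by some clustering algorithm. The points and the function names are hypothetical.

```python
import math
from itertools import combinations


def centroid(points):
    """Coordinate-wise mean of the points in a cluster."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def diameter(points):
    """Maximum Euclidean distance between any two points in the cluster."""
    return max(
        (math.dist(p, q) for p, q in combinations(points, 2)),
        default=0.0,
    )


def mean_centroid_distance(points):
    """Average Euclidean distance of the points from the cluster centroid."""
    c = centroid(points)
    return sum(math.dist(p, c) for p in points) / len(points)


# Hypothetical cluster: the four corners of a 2 x 2 square
cluster = [(0.0, 0.0), (0.0, 2.0), (2.0, 0.0), (2.0, 2.0)]

# A compact summary that could be stored instead of the raw points
summary = {
    "centroid": centroid(cluster),
    "size": len(cluster),
    "diameter": diameter(cluster),
    "mean_centroid_distance": mean_centroid_distance(cluster),
}
print(summary)
```

Smaller values of both measures indicate a tighter cluster, which means the centroid is a better stand-in for the points it replaces.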
3. Sampling
Sampling reduces data by selecting a small subset of the
dataset.
Types of sampling:
- Simple Random Sampling (with replacement)
- Simple Random Sampling (without replacement)
- Cluster Sampling
- Stratified Sampling
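The four sampling schemes can be sketched with Python's random module as below. The dataset, the sample sizes, the grouping into clusters of 20 records, and the helper stratified_sample are all assumptions made for illustration.

```python
import random
from collections import defaultdict

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical dataset: (record_id, age_group), 60 "young" and 40 "senior"
data = [(i, "young" if i < 60 else "senior") for i in range(100)]

# Simple random sampling WITH replacement: a record may be drawn twice.
srswr = rng.choices(data, k=10)

# Simple random sampling WITHOUT replacement: every drawn record is distinct.
srswor = rng.sample(data, k=10)

# Cluster sampling: group records into clusters (here, blocks of 20),
# then draw whole clusters at random.
clusters = [data[i:i + 20] for i in range(0, len(data), 20)]
cluster_sample = [row for c in rng.sample(clusters, k=2) for row in c]


def stratified_sample(rows, key, fraction, rng):
    """Draw the same fraction from every stratum (group) of rows."""
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)
    sample = []
    for group in strata.values():
        k = max(1, round(len(group) * fraction))  # at least one per stratum
        sample.extend(rng.sample(group, k))
    return sample


# Stratified sampling: each age group keeps its share of the sample.
strat = stratified_sample(data, key=lambda r: r[1], fraction=0.1, rng=rng)
young = sum(1 for _, group in strat if group == "young")
senior = sum(1 for _, group in strat if group == "senior")
print(len(srswr), len(srswor), len(cluster_sample), young, senior)
```

Stratified sampling is the usual choice when the data is skewed, because it guarantees that small groups are still represented in the sample.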
4. Data Cube Aggregation
- Data is summarized at a higher level of abstraction (for example,
  quarterly sales rolled up to annual totals)
- Reduces detailed data into aggregated form
- Commonly used in data warehouses
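A roll-up of this kind can be sketched as follows, using made-up quarterly sales figures that are aggregated to one total per year.

```python
from collections import defaultdict

# Hypothetical detailed data: (year, quarter, sales amount)
quarterly_sales = [
    (2022, "Q1", 224), (2022, "Q2", 408), (2022, "Q3", 350), (2022, "Q4", 586),
    (2023, "Q1", 300), (2023, "Q2", 410), (2023, "Q3", 390), (2023, "Q4", 600),
]

# Roll up from the quarter level to the year level by summing amounts.
totals = defaultdict(int)
for year, _quarter, amount in quarterly_sales:
    totals[year] += amount
annual_sales = dict(totals)

# Eight detailed rows are reduced to two aggregated rows.
print(annual_sales)
```

The aggregated table answers questions about annual sales directly, but the quarterly detail can no longer be recovered from it.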
5. Data Compression
- Reduces data size by encoding it efficiently
- Removes redundancy
Types:
- Lossless Compression → Original data can be recovered exactly
- Lossy Compression → Only an approximation of the original data can be
  recovered
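The difference between the two types can be sketched as below. The lossless part uses Python's zlib module; the lossy part uses simple rounding (quantization) as a stand-in for real lossy techniques. The sample data is hypothetical.

```python
import zlib

# Lossless: highly repetitive text compresses well and is restored exactly.
raw = ("2024-01-01,sensor_a,21.5\n" * 500).encode("utf-8")
compressed = zlib.compress(raw)
restored = zlib.decompress(compressed)
print(len(raw), "bytes ->", len(compressed), "bytes; exact:", restored == raw)

# Lossy: rounding readings to whole numbers makes them cheaper to encode,
# but the discarded decimals cannot be recovered afterwards.
readings = [21.53, 21.49, 22.01, 19.98]
lossy = [round(r) for r in readings]
print(readings, "->", lossy)
```

Lossless compression suits data that must be reproduced exactly, while lossy compression trades some accuracy for a larger size reduction.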
Numerosity Reduction vs Dimensionality Reduction
Numerosity Reduction focuses on reducing the volume of data records.
Instead of storing the entire dataset, it uses techniques like regression,
log-linear models, histograms, clustering, and sampling to represent the
data in a smaller form. The idea is to approximate the original data using
models or summaries, which helps in saving storage space and improving
processing speed. However, since it represents data in a simplified way,
some detailed information may be lost.
Dimensionality Reduction, on the other hand, focuses on reducing the
number of attributes or features in the dataset. It selects or transforms
features so that irrelevant or redundant attributes can be removed.
Techniques like feature selection and transformations (e.g., wavelet
transform) are commonly used. This simplifies the dataset, can improve
model performance, and reduces complexity, while trying to preserve the
most important information.