Numerosity Reduction in Data Mining

Jeevadharshan

Data reduction shrinks a dataset so that it becomes easier and faster to analyze, while preserving as much of the quality and meaning of the original data as possible.

There are two main types of data reduction:
  • Dimensionality Reduction 
  • Numerosity Reduction

What is Numerosity Reduction? 

Numerosity Reduction reduces the amount of data by representing it in a smaller form instead of storing the entire dataset.

Instead of keeping all raw data, we store:
  • A model, or 
  • A summary of the data

There are two types of numerosity reduction:
  • Parametric methods 
  • Non-parametric methods

Types of Numerosity Reduction 

1. Parametric Methods

In parametric methods, we assume that data follows a model. Instead of storing all data, we only store the model parameters.

Examples:

  • Regression 
  • Log-linear models

Regression

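Regression models the relationship between a dependent variable and one or more independent variables.
  • Simple Linear Regression → One independent variable
  • Multiple Linear Regression → More than one independent variable

For simple linear regression, the model is:

y = wx + b

Where:
y = dependent variable
x = independent variable
w, b = regression coefficients (w is the slope, b is the intercept)

Once w and b have been estimated from the data, only these two values need to be stored, and y can be approximated for any x.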
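
As a minimal sketch of this idea, the following fits y = wx + b by ordinary least squares and then keeps only w and b. The data points and the names fit_line and predict are made up for illustration; real data would rarely fit a line exactly.

```python
from statistics import mean

def fit_line(xs, ys):
    """Least-squares fit of y = w*x + b; returns (w, b)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    w = sxy / sxx            # slope
    b = y_bar - w * x_bar    # intercept
    return w, b

# Hypothetical data generated from y = 2x + 1
xs = [1, 2, 3, 4, 5, 6]
ys = [3, 5, 7, 9, 11, 13]

# After fitting, the 12 raw numbers can be replaced by 2 parameters
w, b = fit_line(xs, ys)

def predict(x):
    """Approximate y for any x using only the stored parameters."""
    return w * x + b

print(w, b, predict(10))
```

With noisy data the predictions are approximations, so the saving in storage comes at the cost of some detail.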

Log-Linear Model

A log-linear model is used when dealing with multiple variables (dimensions).
  • It estimates the probability of each point in a multidimensional space from a smaller subset of attribute combinations
  • Works well with discrete (or discretized) data
  • Helps represent high-dimensional data using fewer parameters
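
One way to see this is the simplest log-linear model, the independence model for a two-way table, where log p(i, j) = log p(i) + log p(j). The sketch below stores only the row and column totals and estimates each cell from them. The table values and the function name are hypothetical, and practical log-linear models usually add interaction terms.

```python
# Hypothetical 2x3 table of counts (rows: attribute A, columns: attribute B)
table = [
    [20, 30, 50],
    [10, 15, 25],
]

# The reduced representation: marginal totals instead of every cell
total = sum(sum(row) for row in table)
row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]

def estimated_probability(i, j):
    """P(A=i, B=j) under the independence model: p(i) * p(j)."""
    return (row_totals[i] / total) * (col_totals[j] / total)

# Estimated count for cell (0, 0), rebuilt from the marginals only
print(estimated_probability(0, 0) * total)
```

For a table with r rows and c columns this keeps r + c totals instead of r × c cells, and the saving grows quickly as more dimensions are added.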

2. Non-Parametric Methods

Non-parametric methods do not assume any model. Instead, they reduce data by creating summaries or groups.

These methods are:
  • Easier to apply 
  • More flexible 
  • But may achieve less reduction than parametric methods

Types of Non-Parametric Methods

1. Histograms

  • Represent data using frequency counts
  • Data is divided into bins (ranges)
  • Helps understand data distribution quickly
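
A minimal sketch of an equal-width histogram, assuming a small made-up list of prices. The function name and data are illustrative only, and the sketch assumes the values are not all identical (otherwise the bin width would be zero).

```python
def equal_width_histogram(values, num_bins):
    """Return (bin_edges, counts) for equal-width bins over the value range."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for v in values:
        # Clamp so that the maximum value falls into the last bin
        idx = min(int((v - lo) / width), num_bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(num_bins + 1)]
    return edges, counts

# Hypothetical data: 16 prices
prices = [1, 1, 5, 5, 5, 8, 10, 10, 12, 14, 15, 18, 20, 21, 25, 30]

# The 16 values are summarized by 3 counts plus the bin edges
edges, counts = equal_width_histogram(prices, 3)
print(counts)  # [8, 5, 3]
```

Only the bin edges and counts are kept, so individual values inside a bin can no longer be told apart.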

2. Clustering

  • Groups similar data into clusters
  • Data inside a cluster is similar
  • Data in different clusters is different

Cluster quality is measured using:
  • Diameter → Maximum distance between any two points in the cluster
  • Centroid Distance → Average distance of each point in the cluster from the cluster centroid
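
The two quality measures can be sketched as follows for a single cluster of 2-D points, using Euclidean distance. The point values and function names are hypothetical. In a reduction setting, a summary such as the centroid would typically be stored in place of the cluster's raw points.

```python
from itertools import combinations
from math import dist

def centroid(points):
    """Mean position of the points, coordinate by coordinate."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def diameter(points):
    """Maximum distance between any two points in the cluster."""
    return max(dist(p, q) for p, q in combinations(points, 2))

def centroid_distance(points):
    """Average distance of each point from the cluster centroid."""
    c = centroid(points)
    return sum(dist(p, c) for p in points) / len(points)

# Hypothetical cluster: the four corners of a 2x2 square
cluster = [(0, 0), (0, 2), (2, 0), (2, 2)]

print(centroid(cluster))           # (1.0, 1.0)
print(diameter(cluster))           # length of the square's diagonal
print(centroid_distance(cluster))  # half of the diagonal
```

Smaller values of both measures indicate a tighter cluster, which means the centroid is a better stand-in for the points it replaces.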

3. Sampling 

Sampling reduces data by selecting a small subset of the dataset.

Types of sampling:
  • Simple Random Sampling (with replacement) 
  • Simple Random Sampling (without replacement) 
  • Cluster Sampling 
  • Stratified Sampling
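
The four sampling types can be sketched with Python's standard random module. The records, the "group" attribute, the cluster size of 10, and the stratified_sample helper are all assumptions made for illustration.

```python
import random
from collections import defaultdict

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical dataset: 100 records, 60 "young" and 40 "senior"
records = [{"id": i, "group": "young" if i < 60 else "senior"}
           for i in range(100)]

# Simple random sampling without replacement: no record is picked twice
srswor = rng.sample(records, 10)

# Simple random sampling with replacement: a record may be picked again
srswr = rng.choices(records, k=10)

# Cluster sampling: split the data into clusters, then keep whole clusters
clusters = [records[i:i + 10] for i in range(0, len(records), 10)]
cluster_sample = [row for c in rng.sample(clusters, 2) for row in c]

def stratified_sample(rows, key, fraction, rng):
    """Sample the same fraction from every stratum (group) of rows."""
    strata = defaultdict(list)
    for row in rows:
        strata[row[key]].append(row)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

# Stratified sampling: 10% of each group, so small groups stay represented
stratified = stratified_sample(records, "group", 0.1, rng)
print(len(srswor), len(srswr), len(cluster_sample), len(stratified))
```

Stratified sampling is the usual choice when the data is skewed, because a plain random sample could miss a small group entirely.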

4. Data Cube Aggregation 

  • Data is summarized at higher levels
  • Reduces detailed data into aggregated form
  • Commonly used in data warehouses
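
As a small illustration, the sketch below rolls quarterly sales up to annual totals, which is one typical aggregation step. The sales figures are invented for the example.

```python
from collections import defaultdict

# Hypothetical detailed data: (year, quarter, sales amount)
quarterly_sales = [
    (2022, "Q1", 100), (2022, "Q2", 150), (2022, "Q3", 200), (2022, "Q4", 250),
    (2023, "Q1", 120), (2023, "Q2", 180), (2023, "Q3", 220), (2023, "Q4", 280),
]

# Aggregate to the higher level (year), dropping the quarter detail
annual_sales = defaultdict(int)
for year, _quarter, amount in quarterly_sales:
    annual_sales[year] += amount

print(dict(annual_sales))  # {2022: 700, 2023: 800}
```

Eight rows become two. The quarterly detail is gone, so this level is only suitable when the analysis needs yearly figures.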

5. Data Compression 

  • Reduces data size by encoding it efficiently
  • Removes redundancy

Types:
  • Lossless Compression → Original data can be recovered exactly
  • Lossy Compression → Only an approximation of the original data can be recovered
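
The difference between the two types can be sketched as follows. The lossless part uses Python's standard zlib module. The lossy part uses rounding as a deliberately simple stand-in for a real lossy scheme. The sample data is hypothetical.

```python
import zlib

# Lossless: highly redundant text compresses well and is recovered exactly
raw = ("temperature=21.5;" * 200).encode("utf-8")
compressed = zlib.compress(raw)
restored = zlib.decompress(compressed)

print(len(raw), len(compressed))
print(restored == raw)  # True

# Lossy (illustrative only): rounding keeps an approximation of each reading
readings = [21.5312, 21.5298, 21.5341]
approximate = [round(r, 1) for r in readings]

print(approximate)  # [21.5, 21.5, 21.5]
```

After rounding, the original readings cannot be reconstructed, which is what makes the second approach lossy.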

Numerosity Reduction vs Dimensionality Reduction 

Numerosity Reduction focuses on reducing the volume of data records. Instead of storing the entire dataset, it uses techniques like regression, log-linear models, histograms, clustering, and sampling to represent the data in a smaller form. The idea is to approximate the original data using models or summaries, which helps in saving storage space and improving processing speed. However, since it represents data in a simplified way, some detailed information may be lost.

Dimensionality Reduction, on the other hand, focuses on reducing the number of attributes or features in the dataset. It transforms or selects important features so that irrelevant or redundant data can be removed. Techniques like feature selection and transformations (e.g., wavelet transform) are commonly used. This method helps in simplifying the dataset, improving model performance, and reducing complexity, while trying to preserve the most important information.