Numerosity Reduction in Data Mining

Jeevadharshan

Data reduction shrinks a dataset so that it is easier and faster to analyze. The reduced data should still produce the same, or nearly the same, analytical results as the original data.

There are two main types of data reduction:

Dimensionality Reduction
Numerosity Reduction

What is Numerosity Reduction?

Numerosity Reduction reduces the amount of data by representing it in a smaller form instead of storing the entire dataset.

Instead of keeping all raw data, we store:

A model, or
A summary of the data

There are two types of numerosity reduction:

Parametric methods
Non-parametric methods

Types of Numerosity Reduction

1. Parametric Methods

In parametric methods, we assume that the data fits a model. We estimate the model parameters and store only those parameters (and possibly any outliers) instead of the actual data.

Examples:

Regression
Log-linear models

Regression

Regression models the relationship between a dependent variable and one or more independent variables.

Simple Linear Regression → One independent variable
Multiple Linear Regression → More than one independent variable

y = wx + b

Where:

y = dependent variable (the value being predicted)

x = independent variable (the predictor)

w = slope and b = y-intercept (the regression coefficients)

Once w and b have been estimated, the model can approximate y for any x, so only the two coefficients need to be stored instead of every record.
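As a minimal sketch of this idea, the snippet below fits y = wx + b by ordinary least squares and keeps only w and b. The data values and the helper names fit_line and predict are made up for illustration; they are not from any particular library.

```python
from statistics import mean

def fit_line(xs, ys):
    """Least-squares fit of y = w*x + b; returns (w, b)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    w = sxy / sxx
    b = y_bar - w * x_bar
    return w, b

# Hypothetical data: six (x, y) records that lie roughly on a line.
xs = [1, 2, 3, 4, 5, 6]
ys = [3.1, 4.9, 7.2, 9.0, 10.8, 13.1]

# The reduced representation: two numbers instead of six records.
w, b = fit_line(xs, ys)

def predict(x):
    """Approximate y from the stored coefficients only."""
    return w * x + b

print(f"w = {w:.3f}, b = {b:.3f}, predict(4) = {predict(4):.2f}")
```

The predictions are approximations, so the saving in storage comes at the cost of the small differences between each original y and the fitted line.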

Log-Linear Model

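A log-linear model approximates the probability distribution of discrete data with multiple variables (dimensions).

It estimates the probability of each point in the multidimensional space from a smaller subset of dimension combinations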
Works well with discrete data
Helps represent high-dimensional data using fewer parameters
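The simplest log-linear model is the independence model, in which log P(a, b) = log P(a) + log P(b). The sketch below assumes that model and uses hypothetical records with two discrete attributes. It stores only the per-attribute counts and estimates each joint probability from them. Real log-linear models can also include interaction terms, which this sketch leaves out.

```python
from collections import Counter

# Hypothetical records with two discrete attributes: (region, product).
records = [
    ("north", "tea"), ("north", "tea"), ("north", "coffee"),
    ("south", "tea"), ("south", "coffee"), ("south", "coffee"),
    ("south", "coffee"), ("north", "tea"),
]

n = len(records)

# Stored parameters: one count per attribute value (the marginals),
# not one count per combination of values.
region_counts = Counter(region for region, _ in records)
product_counts = Counter(item for _, item in records)

def estimated_probability(region, item):
    """Independence model: P(region, item) = P(region) * P(item)."""
    return (region_counts[region] / n) * (product_counts[item] / n)

# Compare the estimate with the observed relative frequency.
observed = Counter(records)[("north", "tea")] / n
estimate = estimated_probability("north", "tea")
print(f"observed = {observed:.3f}, estimated = {estimate:.3f}")
```

With two attributes of two values each the saving is nil (four marginal counts versus four cells), but the number of cells grows multiplicatively with each added attribute while the number of marginal counts grows only additively.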

2. Non-Parametric Methods

Non-parametric methods do not assume any model. Instead, they reduce data by creating summaries or groups.

These methods are:

Easier to apply
More flexible
But they may achieve less reduction than parametric methods

Types of Non-Parametric Methods

1. Histograms

Represent data using frequency counts

Data is divided into bins (ranges), and only the bin boundaries and the count for each bin are stored

Helps understand data distribution quickly
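A minimal sketch of an equal-width histogram follows. The function name equal_width_histogram and the price values are hypothetical, and the code assumes the values are not all identical (otherwise the bin width would be zero).

```python
def equal_width_histogram(values, num_bins):
    """Return (edges, counts) for bins of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for v in values:
        # Clamp so the maximum value lands in the last bin.
        idx = min(int((v - lo) / width), num_bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(num_bins + 1)]
    return edges, counts

# Hypothetical attribute values (for example, item prices).
prices = [1, 1, 5, 5, 5, 8, 8, 10, 10, 12, 14, 14, 15, 18, 20, 21, 25, 30]

edges, counts = equal_width_histogram(prices, 3)
print(edges)
print(counts)
```

Here 18 values are reduced to four bin edges and three counts. Equal-frequency bins, where each bin holds roughly the same number of values, are a common alternative.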

2. Clustering

Groups similar data into clusters

Data inside a cluster is similar

Data in different clusters is different

Cluster quality is measured using:

Diameter → Maximum distance between any two points in the cluster
Centroid Distance → Average distance of each point in the cluster from the cluster center (centroid)
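The two quality measures can be sketched as follows, assuming Euclidean distance and a cluster given as a list of 2-D points. The function names and the sample cluster are hypothetical.

```python
from math import dist

def centroid(points):
    """Coordinate-wise mean of the points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def diameter(points):
    """Largest distance between any two points in the cluster."""
    return max(dist(p, q) for p in points for q in points)

def mean_centroid_distance(points):
    """Average distance from each point to the cluster centroid."""
    c = centroid(points)
    return sum(dist(p, c) for p in points) / len(points)

# Hypothetical cluster: the four corners of a 2 x 2 square.
cluster = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]

print(centroid(cluster))
print(round(diameter(cluster), 3))
print(round(mean_centroid_distance(cluster), 3))
```

For data reduction, a cluster can then be represented by a compact description such as its centroid and size in place of its individual points. Smaller values of both measures indicate a tighter cluster, for which that description is a better approximation.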

3. Sampling

Sampling reduces data by selecting a small subset of the dataset.

Types of sampling:

Simple Random Sampling (with replacement)
Simple Random Sampling (without replacement)
Cluster Sampling
Stratified Sampling
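The four sampling types can be sketched with Python's standard random module. The dataset, the 10% sampling fraction, the cluster size of 10, and the helper stratified_sample are all assumptions made for illustration.

```python
import random
from collections import defaultdict

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical dataset of 100 records: (record_id, age_group).
data = [(i, "young" if i < 60 else "senior") for i in range(100)]

# Simple random sampling without replacement: no record appears twice.
srswor = rng.sample(data, 10)

# Simple random sampling with replacement: a record may be drawn again.
srswr = rng.choices(data, k=10)

# Cluster sampling: split the data into groups ("clusters") of 10
# records, then pick whole clusters at random.
clusters = [data[i:i + 10] for i in range(0, len(data), 10)]
cluster_sample = [row for c in rng.sample(clusters, 2) for row in c]

def stratified_sample(rows, key, fraction, rng):
    """Sample the same fraction from each stratum (group)."""
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)
    sample = []
    for group in strata.values():
        k = max(1, round(len(group) * fraction))  # at least one per stratum
        sample.extend(rng.sample(group, k))
    return sample

stratified = stratified_sample(data, key=lambda row: row[1], fraction=0.1, rng=rng)
print(len(srswor), len(srswr), len(cluster_sample), len(stratified))
```

Stratified sampling keeps the proportions of each group in the sample (here 6 "young" and 4 "senior" records), which helps when the data is skewed.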

4. Data Cube Aggregation

Data is summarized at a higher level of abstraction, such as daily records rolled up into monthly totals

Reduces detailed data into aggregated form

Commonly used in data warehouses
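A minimal sketch of aggregation along one dimension follows. It rolls hypothetical quarterly sales figures up to annual totals; the numbers are invented for illustration.

```python
from collections import defaultdict

# Hypothetical detailed data: (year, quarter, sales amount).
quarterly_sales = [
    (2022, "Q1", 224), (2022, "Q2", 408), (2022, "Q3", 350), (2022, "Q4", 586),
    (2023, "Q1", 310), (2023, "Q2", 420), (2023, "Q3", 380), (2023, "Q4", 610),
]

# Aggregate away the quarter dimension: one total per year.
annual_sales = defaultdict(int)
for year, _quarter, amount in quarterly_sales:
    annual_sales[year] += amount

print(dict(annual_sales))
```

Eight detailed rows become two aggregated rows. If the analysis only needs annual figures, no required information is lost, although the quarterly detail cannot be recovered from the totals.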

5. Data Compression

Reduces data size by encoding it efficiently

Removes redundancy

Types:

Lossless Compression → Original data can be recovered exactly
Lossy Compression → Only an approximation of the original data can be recovered
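The difference between the two types can be sketched with Python's standard zlib module. The sensor readings are hypothetical, and rounding to one decimal place stands in here for a lossy scheme; practical lossy methods are usually more sophisticated.

```python
import zlib

# Hypothetical sensor readings with three decimal places.
readings = [20.113, 20.127, 20.131, 20.142, 20.118] * 200
raw = ",".join(f"{r:.3f}" for r in readings).encode("utf-8")

# Lossless: the decompressed bytes match the original exactly.
packed = zlib.compress(raw)
restored = zlib.decompress(packed)
print(len(raw), len(packed), restored == raw)

# Lossy (illustrative): keep one decimal place only. The stored form is
# smaller, but the dropped digits cannot be recovered.
rounded = [round(r, 1) for r in readings]
lossy_raw = ",".join(f"{r:.1f}" for r in rounded).encode("utf-8")
print(len(lossy_raw), rounded[0], readings[0])
```

The lossless step removes redundancy in the encoding, while the lossy step discards precision that the analysis is assumed not to need.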

Numerosity Reduction vs Dimensionality Reduction

Numerosity Reduction focuses on reducing the volume of data records. Instead of storing the entire dataset, it uses techniques like regression, log-linear models, histograms, clustering, and sampling to represent the data in a smaller form. The idea is to approximate the original data using models or summaries, which helps in saving storage space and improving processing speed. However, since it represents data in a simplified way, some detailed information may be lost.

Dimensionality Reduction, on the other hand, focuses on reducing the number of attributes or features in the dataset. It transforms or selects important features so that irrelevant or redundant data can be removed. Techniques like feature selection and transformations (e.g., wavelet transform) are commonly used. This method helps in simplifying the dataset, improving model performance, and reducing complexity, while trying to preserve the most important information.
