Redundancy and Correlation in Data Mining

Sabareshwari

What is Data Redundancy?

In data mining, data is often collected from different sources during data integration. Because of this, the same information may appear more than once in the dataset. This situation is called data redundancy.

An attribute is considered redundant if its value can be calculated or derived from other attributes in the dataset.

For example, imagine a dataset with 20 attributes. If one attribute can be determined using other attributes, then that attribute does not provide any new information. Such attributes are called redundant attributes.

Redundancy can also occur because of inconsistent naming of attributes or dimensions in different data sources.

Example of Data Redundancy

Consider a dataset with three attributes:

pizza_name
is_veg
is_nonveg

Definitions:

is_veg = 1 if the selected pizza is vegetarian, otherwise 0.
is_nonveg = 1 if the selected pizza is non-vegetarian, otherwise 0.

Since a pizza can only be veg or non-veg, the two attributes are directly related.

For example:

If is_veg = 0, then the pizza must be non-veg.
If is_veg = 1, then is_nonveg = 0.

This means one attribute can be derived from the other. Therefore, one of them is redundant, and we can remove either is_veg or is_nonveg without losing information.
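The check described above can be sketched in a few lines of Python. The sample rows and the helper names is_complement and drop_column are assumptions made for illustration; they are not part of any particular library.

```python
# Hypothetical sample rows; the values are made up for illustration.
rows = [
    {"pizza_name": "Margherita", "is_veg": 1, "is_nonveg": 0},
    {"pizza_name": "Pepperoni", "is_veg": 0, "is_nonveg": 1},
    {"pizza_name": "Farmhouse", "is_veg": 1, "is_nonveg": 0},
    {"pizza_name": "Chicken Tikka", "is_veg": 0, "is_nonveg": 1},
]


def is_complement(rows, a, b):
    """Return True if column b equals 1 - column a in every row."""
    return all(row[b] == 1 - row[a] for row in rows)


def drop_column(rows, name):
    """Return a copy of the rows without the given column."""
    return [{k: v for k, v in row.items() if k != name} for row in rows]


# is_nonveg is fully determined by is_veg, so it can be dropped.
redundant = is_complement(rows, "is_veg", "is_nonveg")
reduced = drop_column(rows, "is_nonveg") if redundant else rows
print(redundant, reduced[0])
```

If even one row broke the rule (for example, a record with both flags set to 0), is_complement would return False and both columns would be kept.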

Detecting Data Redundancy

Some common methods used to detect redundancy between attributes are:

1. Chi-Square (χ²) Test

The Chi-Square test is used for categorical or qualitative data.

Suppose we have two attributes X and Y in a dataset. A contingency table is created to represent the frequency of combinations of these attributes.

The Chi-Square test compares:

Observed Values → Actual frequency in the data
Expected Values → Frequency expected if the attributes were independent

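The test checks the null hypothesis that X and Y are independent.

If the hypothesis is rejected, X and Y are statistically associated.
A strong association suggests that one attribute may be redundant, but the test alone does not show that one attribute can be derived from the other, so check how strong the relationship is before removing either.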
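As a rough sketch, the statistic can be computed by hand from a contingency table: for each cell, take (observed - expected)² / expected and sum the results. The counts below are hypothetical, and the comparison uses the standard 5% critical value for 1 degree of freedom (3.841); a statistics library would normally report a p-value instead.

```python
def chi_square(observed):
    """Return the chi-square statistic and degrees of freedom
    for a contingency table given as a list of rows of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)

    stat = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Expected count if X and Y were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (obs - expected) ** 2 / expected

    dof = (len(observed) - 1) * (len(observed[0]) - 1)
    return stat, dof


# Hypothetical counts: rows are the values of X, columns the values of Y.
observed = [
    [30, 10],
    [10, 30],
]

stat, dof = chi_square(observed)

# Critical value of the chi-square distribution at the 5% level, 1 degree of freedom.
CRITICAL_5PCT_DOF1 = 3.841
reject_independence = stat > CRITICAL_5PCT_DOF1
print(stat, dof, reject_independence)
```

Here every expected count is 20, so each cell contributes 5 and the statistic is 20 with 1 degree of freedom, well above the critical value.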

2. Correlation Coefficient for Numeric Data

For numeric attributes, redundancy can be detected using the correlation coefficient.

The relationship between two attributes A and B is calculated using Pearson’s Product-Moment Correlation Coefficient.

The correlation coefficient measures how strongly two variables are related.

Its value ranges from –1 to +1.

Meaning of values:

+1 → Perfect positive correlation (both variables increase together)

–1 → Perfect negative correlation (one increases while the other decreases)

0 → No linear relationship between the variables
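The three cases can be reproduced with a small, unoptimized implementation of Pearson's coefficient. The function name pearson and the sample values are assumptions chosen for illustration.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length numeric lists.
    Assumes neither list is constant (otherwise the denominator is zero)."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


a = [1, 2, 3, 4, 5]
r_pos = pearson(a, [2, 4, 6, 8, 10])   # b doubles a: perfect positive
r_neg = pearson(a, [10, 8, 6, 4, 2])   # b falls as a rises: perfect negative

# y = x squared on a symmetric range: the coefficient is 0
# even though y is fully determined by x.
r_zero = pearson([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])
print(r_pos, r_neg, r_zero)
```

The last pair is a reminder that a coefficient of 0 only rules out a linear relationship: y depends entirely on x there, yet the coefficient is 0.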

Common correlation methods include:

Pearson Correlation → Used for continuous numeric variables

Spearman Rank Correlation → Used for ordinal (ranked) data, or for numeric data whose relationship is monotonic but not necessarily linear
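One way to see the difference between the two methods is to compute Spearman's coefficient as Pearson's coefficient applied to ranks. The sketch below assumes that formulation, with tied values sharing the average of their rank positions; the helper names and sample data are illustrative.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def ranks(values):
    """Assign 1-based ranks; tied values get the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))


x = [1, 2, 3, 4, 5]
y = [1, 8, 27, 64, 125]   # y = x cubed: monotonic but not linear

rho = spearman(x, y)
r = pearson(x, y)
print(rho, r)
```

Because y always increases with x, the ranks match exactly and Spearman's coefficient is 1, while Pearson's coefficient stays below 1 because the relationship is curved.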

Interpretation

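If the absolute value of the correlation is high (close to +1 or –1), the attributes are strongly related, and one attribute may be removed.
If the correlation value is 0, the attributes have no linear relationship. This does not by itself mean they are independent, because a nonlinear relationship may still exist.
If the correlation value is negative, when one attribute increases, the other tends to decrease.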
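In practice, this interpretation is often applied by scanning every pair of numeric attributes and flagging those whose absolute correlation exceeds a cutoff. The sketch below assumes a cutoff of 0.9, which is an arbitrary illustrative choice, and uses hypothetical column names and values.

```python
import math
from itertools import combinations


def pearson(a, b):
    """Pearson correlation of two equal-length numeric lists.
    Assumes neither list is constant."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def redundant_pairs(columns, threshold=0.9):
    """Return (name_a, name_b, r) for every pair of columns with |r| >= threshold."""
    flagged = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson(a, b)
        if abs(r) >= threshold:
            flagged.append((name_a, name_b, r))
    return flagged


# Hypothetical data: height is recorded twice in different units.
columns = {
    "height_cm": [150, 160, 170, 180, 190],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "age": [41, 23, 35, 52, 30],
}

pairs = redundant_pairs(columns)
print(pairs)
```

Only the two height columns are flagged, so one of them is a candidate for removal. A flagged pair is a prompt to look closer rather than proof of redundancy.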

Thus, identifying correlation helps in reducing redundant attributes and improving data quality in data mining.
