Redundancy and Correlation in Data Mining

Sabareshwari

What is Data Redundancy?

In data mining, data is often collected from different sources during data integration. Because of this, the same information may appear more than once in the dataset. This situation is called data redundancy.

An attribute is considered redundant if its value can be calculated or derived from other attributes in the dataset.

For example, imagine a dataset with 20 attributes. If one attribute can be determined using other attributes, then that attribute does not provide any new information. Such attributes are called redundant attributes.

Redundancy can also occur because of inconsistent naming of attributes or dimensions in different data sources.

Example of Data Redundancy

Consider a dataset with three attributes:
  • pizza_name
  • is_veg
  • is_nonveg

Definitions:

  • is_veg = 1 if the selected pizza is vegetarian, otherwise 0.
  • is_nonveg = 1 if the selected pizza is non-vegetarian, otherwise 0.
Since a pizza can only be veg or non-veg, the two attributes are directly related.

For example:

  • If is_veg = 0, then the pizza must be non-veg.
  • If is_veg = 1, then is_nonveg = 0.
This means one attribute can be derived from the other. Therefore, one of them is redundant, and we can remove either is_veg or is_nonveg without losing information.
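The pizza example can be checked mechanically. The sketch below is only an illustration: the rows and the helper name is_complement are made up for this example, and it assumes both flags are stored as 0/1 integers.

```python
# Hypothetical rows: (pizza_name, is_veg, is_nonveg)
pizzas = [
    ("Margherita", 1, 0),
    ("Pepperoni", 0, 1),
    ("Farmhouse", 1, 0),
    ("Chicken Tikka", 0, 1),
]

def is_complement(rows, i, j):
    """Return True if column j always equals 1 - column i."""
    return all(row[j] == 1 - row[i] for row in rows)

# is_nonveg (index 2) is redundant if it is always 1 - is_veg (index 1).
redundant = is_complement(pizzas, 1, 2)
print(redundant)

if redundant:
    # Drop is_nonveg; it can be rebuilt as 1 - is_veg whenever needed.
    reduced = [(name, veg) for name, veg, _ in pizzas]
    print(reduced)
```

If even one row breaks the rule (for example, both flags set to 1), the check fails and neither attribute should be dropped without a closer look at the data.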

Detecting Data Redundancy

Some common methods used to detect redundancy between attributes are:
  • Chi-Square (χ²) Test
  • Correlation Coefficient and Covariance

1. Chi-Square (χ²) Test

The Chi-Square test is used for categorical or qualitative data.

Suppose we have two attributes X and Y in a dataset. A contingency table is created to represent the frequency of combinations of these attributes.

The Chi-Square test compares:

  • Observed Values → Actual frequency in the data
  • Expected Values → Frequency expected if the attributes were independent
The test checks the null hypothesis that X and Y are independent.
  • If the hypothesis is rejected, X and Y are statistically associated.
  • In that case, one attribute may be redundant and is a candidate for removal.
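As a rough sketch of the comparison described above, the statistic sums (observed - expected)² / expected over every cell of the contingency table, where the expected count of a cell is (row total × column total) / grand total. The 2×2 counts below are invented for illustration, and the p-value shortcut with math.erfc is valid only for 1 degree of freedom, which is the case for a 2×2 table.

```python
import math

def chi_square(observed):
    """Return (statistic, degrees_of_freedom) for a contingency table.

    Assumes every row total and column total is greater than zero.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Count expected in this cell if X and Y were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (obs - expected) ** 2 / expected
    dof = (len(observed) - 1) * (len(observed[0]) - 1)
    return stat, dof

# Hypothetical 2x2 table: rows are categories of X, columns are categories of Y.
table = [[30, 10],
         [10, 30]]
stat, dof = chi_square(table)

# For 1 degree of freedom only: P(chi2 > stat) = erfc(sqrt(stat / 2)).
p_value = math.erfc(math.sqrt(stat / 2))
print(stat, dof, p_value)

# Reject independence at the 5% level if the p-value is below 0.05.
related = p_value < 0.05
```

For larger tables the degrees of freedom exceed 1, so the p-value would have to come from a chi-square table or a statistics library instead of the erfc shortcut.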

2. Correlation Coefficient for Numeric Data

For numeric attributes, redundancy can be detected using the correlation coefficient.

The relationship between two attributes A and B is calculated using Pearson's Product-Moment Correlation Coefficient.

The correlation coefficient measures how strongly two variables are related.

Its value ranges from –1 to +1.

Meaning of values:

+1 → Perfect positive correlation (both variables increase together)
–1 → Perfect negative correlation (one increases while the other decreases)
0 → No linear relationship between the variables

Common correlation methods include:

Pearson Correlation → Used for continuous numeric variables
Spearman Rank Correlation → Used for ranked (ordinal) data, or when the relationship is monotonic but not linear
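To make the difference between the two methods concrete, here is a minimal sketch written with the standard library only. The function names (pearson, ranks, spearman) and the sample values are assumptions for this illustration. Spearman is computed here as Pearson applied to average ranks, which is one common way to handle ties.

```python
import math

def pearson(a, b):
    """Pearson correlation of two equal-length, non-constant numeric lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))

# Hypothetical attributes: B grows with A, but not in a straight line.
A = [1, 2, 3, 4, 5]
B = [1, 4, 9, 16, 25]
r_pearson = pearson(A, B)
r_spearman = spearman(A, B)
print(r_pearson, r_spearman)
```

On this data Spearman reports a perfect monotonic relationship, while Pearson comes out slightly lower because the points do not lie on a straight line.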

Interpretation

  • If the absolute correlation value is close to 1, the attributes are strongly related, and one attribute may be removed.
  • If the correlation value is 0, the attributes have no linear relationship; they may still be related in a non-linear way.
  • If the correlation value is negative, one attribute decreases as the other increases.
Thus, identifying correlation helps in reducing redundant attributes and improving data quality in data mining.
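One possible way to apply this in practice is a simple greedy filter: keep an attribute only if its absolute correlation with every attribute already kept stays below a chosen cutoff. The sketch below is an assumption-laden illustration, not a standard procedure. The 0.9 threshold, the function name drop_correlated, and the sample columns are all hypothetical, and constant columns are assumed to be absent.

```python
import math

def pearson(a, b):
    """Pearson correlation of two equal-length, non-constant numeric lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def drop_correlated(columns, threshold=0.9):
    """Return the names of attributes to keep, in their original order.

    An attribute is kept only if |r| with every already-kept attribute
    is below the threshold, so strong negative correlation also counts.
    """
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical dataset: height_in is height_cm converted to inches.
data = {
    "height_cm": [150, 160, 170, 180, 190],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "age": [34, 21, 45, 29, 52],
}
kept = drop_correlated(data)
print(kept)
```

Which attribute of a correlated pair survives depends on the column order, so in a real project that choice is better made with domain knowledge than left to the loop.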
