Redundancy and Correlation in Data Mining
What is Data Redundancy?
In data mining, data is often collected from different sources during
data integration. Because of this, the same information may appear more than once in the dataset. This
situation is called data redundancy.
An attribute is considered redundant if its value can be calculated or
derived from other attributes in the dataset.
For example, imagine a dataset with 20 attributes. If one attribute can be
determined using other attributes, then that attribute does not provide any
new information. Such attributes are called redundant attributes.
Redundancy can also occur because of inconsistent naming of attributes or
dimensions in different data sources.
Example of Data Redundancy
Consider a dataset with three attributes:
- pizza_name
- is_veg
- is_nonveg
Definitions:
- is_veg = 1 if the selected pizza is vegetarian, otherwise 0.
- is_nonveg = 1 if the selected pizza is non-vegetarian, otherwise 0.
Since a pizza can only be veg or non-veg, the two attributes are directly
related.
For example:
- If is_veg = 0, then the pizza must be non-veg.
- If is_veg = 1, then is_nonveg = 0.
This means one attribute can be derived from the other. Therefore, one of
them is redundant, and we can remove either is_veg or is_nonveg without losing
information.
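The pizza example can be checked mechanically. The sketch below is a minimal illustration, not a general redundancy detector: the sample rows and the helper names is_complement and drop_attribute are hypothetical, introduced only for this example. It verifies that is_nonveg always equals 1 - is_veg, drops is_nonveg, and then rebuilds it to show that no information was lost.

```python
# Hypothetical sample rows; the pizza names and flags are made up for illustration.
rows = [
    {"pizza_name": "Margherita", "is_veg": 1, "is_nonveg": 0},
    {"pizza_name": "Pepperoni", "is_veg": 0, "is_nonveg": 1},
    {"pizza_name": "Farmhouse", "is_veg": 1, "is_nonveg": 0},
    {"pizza_name": "Chicken Tikka", "is_veg": 0, "is_nonveg": 1},
]


def is_complement(rows, a, b):
    """Return True if attribute b equals 1 - a in every row."""
    return all(row[b] == 1 - row[a] for row in rows)


def drop_attribute(rows, name):
    """Return a copy of the rows without the named attribute."""
    return [{k: v for k, v in row.items() if k != name} for row in rows]


# is_nonveg is derivable from is_veg, so it carries no new information.
redundant = is_complement(rows, "is_veg", "is_nonveg")
reduced = drop_attribute(rows, "is_nonveg") if redundant else rows

# The dropped attribute can be rebuilt from is_veg whenever it is needed.
restored = [dict(row, is_nonveg=1 - row["is_veg"]) for row in reduced]

print(redundant)
print(reduced[0])
```

The check only proves redundancy for the rows it sees. If the data could contain a pizza with both flags set to 0, the two attributes would no longer be interchangeable.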
Detecting Data Redundancy
Some common methods used to detect redundancy between attributes
are:
- Chi-Square (χ²) Test
- Correlation Coefficient and Covariance
1. Chi-Square (χ²) Test
The Chi-Square test is used for categorical or qualitative
data.
Suppose we have two attributes X and Y in a dataset. A contingency
table is created to represent the frequency of combinations of these attributes.
The Chi-Square test compares:
- Observed Values → Actual frequency in the data
- Expected Values → Frequency expected if the attributes were independent
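The null hypothesis is that X and Y are independent. The larger the gap between the observed and expected values, the larger the χ² statistic.
- If the null hypothesis is rejected, it means X and Y are statistically related.
- In that case, one attribute may be redundant, and we may consider removing one of them.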
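The comparison of observed and expected values can be sketched in plain Python. The statistic is χ² = Σ (O - E)² / E, where the expected count for a cell is (row total × column total) / grand total. The 2x2 contingency table below is hypothetical, and the function name chi_square_statistic is an assumption made for this example. The decision uses the standard critical value 3.841 for 1 degree of freedom at the 0.05 significance level.

```python
def chi_square_statistic(observed):
    """Compute the chi-square statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Count expected in this cell if the two attributes were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (obs - expected) ** 2 / expected
    return stat


# Hypothetical counts: rows are the categories of X, columns the categories of Y.
observed = [
    [30, 10],
    [10, 30],
]

stat = chi_square_statistic(observed)

# Degrees of freedom = (rows - 1) * (columns - 1) = 1 for a 2x2 table.
CRITICAL_VALUE_DF1_ALPHA_0_05 = 3.841
related = stat > CRITICAL_VALUE_DF1_ALPHA_0_05

print(stat, related)
```

For tables larger than 2x2 the degrees of freedom change, so the critical value must be looked up again. A significant result shows that the attributes are associated; whether one can actually be dropped is still a judgment call.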
2. Correlation Coefficient for Numeric Data
For numeric attributes, redundancy can be detected using the
correlation coefficient.
The strength of the relationship between two attributes A and B is measured
using Pearson's Product-Moment Correlation Coefficient.
The correlation coefficient measures how strongly two variables are
related.
Its value ranges from –1 to +1.
Meaning of values:
+1 → Perfect positive correlation (both variables increase
together)
–1 → Perfect negative correlation (one increases while the other
decreases)
0 → No linear relationship between the variables
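The two extreme values can be reproduced with a small from-scratch implementation. This is a minimal sketch of the standard Pearson formula, r = Σ (a - mean_a)(b - mean_b) / sqrt(Σ (a - mean_a)² × Σ (b - mean_b)²). The function name pearson and the sample lists are assumptions made for this example.

```python
import math


def pearson(a, b):
    """Pearson product-moment correlation of two equal-length numeric lists."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    # Sum of products of deviations (numerator of r).
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    # Sums of squared deviations (denominator of r).
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Hypothetical values chosen so the relationships are exactly linear.
a = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]    # increases together with a
falling = [10, 8, 6, 4, 2]   # decreases as a increases

r_up = pearson(a, rising)
r_down = pearson(a, falling)

print(r_up, r_down)
```

The function divides by zero if either list is constant, because a constant attribute has no variance. A real pipeline would need to handle that case separately.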
Common correlation methods include:
Pearson Correlation → Used for continuous numeric variables with a
linear relationship
Spearman Rank Correlation → Used for ranked (ordinal) data, or when the
relationship is monotonic but not linear
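The difference between the two methods shows up on data that rises steadily but not in a straight line. Spearman's coefficient can be computed as Pearson's coefficient applied to the ranks of the values. The sketch below assumes that definition; the helper names average_ranks and spearman and the sample data are hypothetical.

```python
import math


def pearson(a, b):
    """Pearson product-moment correlation of two equal-length numeric lists."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def average_ranks(values):
    """Rank values from 1 upward; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average 1-based rank of positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(a), average_ranks(b))


# Hypothetical data: y grows with x, but as a cube rather than a line.
x = [1, 2, 3, 4, 5]
y = [1, 8, 27, 64, 125]

r_pearson = pearson(x, y)
r_spearman = spearman(x, y)

print(r_pearson, r_spearman)
```

Here the Spearman value reaches 1 because the ordering of y matches the ordering of x exactly, while the Pearson value stays below 1 because the points do not lie on a straight line.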
Interpretation
- If the correlation value is close to +1 or –1, the attributes are strongly related, and one attribute may be removed.
- If the correlation value is 0, the attributes have no linear relationship. This alone does not prove that they are independent.
- If the correlation value is negative, one attribute decreases as the other increases.
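These interpretation rules can be turned into a simple screening step that flags attribute pairs whose correlation is strong in either direction. The sketch below is one possible approach under stated assumptions: the column values are hypothetical, the 0.9 threshold is an arbitrary choice for illustration, and highly_correlated_pairs is a made-up helper name. The last lines also check a known caveat: a correlation of 0 rules out only a linear relationship.

```python
import math
from itertools import combinations


def pearson(a, b):
    """Pearson product-moment correlation of two equal-length numeric lists."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def highly_correlated_pairs(columns, threshold):
    """Return the attribute pairs whose |r| is at or above the threshold."""
    flagged = []
    for name_a, name_b in combinations(columns, 2):
        r = pearson(columns[name_a], columns[name_b])
        # Use the absolute value: strong negative correlation also signals redundancy.
        if abs(r) >= threshold:
            flagged.append((name_a, name_b))
    return flagged


# Hypothetical dataset: temp_f is temp_c converted to Fahrenheit.
columns = {
    "temp_c": [0, 10, 20, 30, 40],
    "temp_f": [32, 50, 68, 86, 104],
    "humidity": [30, 45, 20, 50, 35],
}

flagged = highly_correlated_pairs(columns, threshold=0.9)
print(flagged)

# Caveat: squares is fully determined by x, yet the correlation is 0.
x = [-2, -1, 0, 1, 2]
squares = [v * v for v in x]
r_nonlinear = pearson(x, squares)
print(r_nonlinear)
```

A flagged pair is only a candidate for removal. Which attribute to keep, and whether a lower threshold is appropriate, depends on the dataset and the mining task.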