Attribute Selection Measures in Data Mining
In this article, we will learn about attribute selection measures in a
simple way.
What is Attribute Selection?
Attribute selection is also known as:
- Feature selection
- Variable selection
It is an important concept in data mining, especially when building decision
trees.
Attributes (or features) are the columns in a dataset. Sometimes, datasets
contain:
- Irrelevant data
- Duplicate information
- Noise (unwanted data)
These can reduce the performance of a model and make learning difficult.
Attribute selection helps by:
- Removing unnecessary data
- Improving model accuracy
- Reducing overfitting
- Making the model easier to understand
Why is it Important in Decision Trees?
In a decision tree, we need to decide:
- Which attribute should be used first (root node)
- How to split the data at each step
Attribute selection measures help us choose the best attribute for splitting
the data.
Types of Attribute Selection Measures
There are three main measures:
- Entropy
- Information Gain
- Gini Index
1. Entropy
Entropy measures the impurity or disorder in a dataset.
High entropy → Data is mixed (impure)
Low entropy → Data is uniform (pure)
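The article does not give a formula for entropy, so the sketch below assumes the standard Shannon definition, H(D) = −∑ p_i log2(p_i), where p_i is the proportion of records in class i. The function name `entropy` and the two sample label lists are illustrative, not part of the original text.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    # Sum -p * log2(p) over the classes that actually occur.
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())

pure = ["Yes"] * 10                 # a single class: no disorder
mixed = ["Yes"] * 5 + ["No"] * 5    # evenly mixed: maximum disorder for two classes

print(entropy(pure))   # 0.0
print(entropy(mixed))  # 1.0
```

With two classes, entropy ranges from 0 (pure) to 1 bit (an even split), which matches the "low" and "high" cases described above.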
2. Information Gain
Information Gain tells us how much entropy decreases after splitting the
dataset.
In simple words:
It shows how useful an attribute is for classification.
Formula:
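$$IG(D, A) = H(D) - \sum_{v} \frac{|D_v|}{|D|} H(D_v)$$
Where:
H(D) = entropy of the dataset D
D_v = subset of D in which attribute A takes the value v
|D_v| = number of records in D_v
|D| = number of records in D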
Rule:
Higher Information Gain = Better attribute
Example:
For attribute Gender:
IG ≈ 0.000
An Information Gain of about zero means that splitting on Gender leaves the entropy essentially unchanged, so Gender is not useful for splitting.
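A minimal sketch of how Information Gain could be computed and used to pick a splitting attribute is shown below. The dataset is a hypothetical toy table made up for illustration (it is not the data behind the Gender figure above), and the column names `Student` and `Buys` as well as the helper names are assumptions. Entropy is assumed to be the standard Shannon entropy in bits.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return sum(-(c / total) * math.log2(c / total)
               for c in Counter(labels).values())

def information_gain(rows, attribute, target):
    """IG(D, A) = H(D) - sum over v of (|D_v| / |D|) * H(D_v)."""
    labels = [row[target] for row in rows]
    # Group the target labels by the value each row has for the attribute.
    subsets = defaultdict(list)
    for row in rows:
        subsets[row[attribute]].append(row[target])
    weighted = sum(len(s) / len(rows) * entropy(s) for s in subsets.values())
    return entropy(labels) - weighted

# Hypothetical toy data: Gender is unrelated to the class, Student determines it.
rows = [
    {"Gender": "M", "Student": "Yes", "Buys": "Yes"},
    {"Gender": "M", "Student": "Yes", "Buys": "Yes"},
    {"Gender": "M", "Student": "No",  "Buys": "No"},
    {"Gender": "M", "Student": "No",  "Buys": "No"},
    {"Gender": "F", "Student": "Yes", "Buys": "Yes"},
    {"Gender": "F", "Student": "Yes", "Buys": "Yes"},
    {"Gender": "F", "Student": "No",  "Buys": "No"},
    {"Gender": "F", "Student": "No",  "Buys": "No"},
]

for attribute in ("Gender", "Student"):
    print(attribute, f"{information_gain(rows, attribute, 'Buys'):.3f}")

# The attribute with the highest Information Gain is the preferred split.
best = max(("Gender", "Student"),
           key=lambda a: information_gain(rows, a, "Buys"))
print("Best attribute:", best)
```

In this toy table each Gender group has the same Yes/No mix as the whole dataset, so its gain is zero, while Student separates the classes completely and would be chosen for the split.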
3. Gini Index
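Gini Index also measures impurity, but instead of using logarithms it gives the probability that two records drawn at random (with replacement) from the dataset belong to different classes.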
Formula:
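$$Gini(D) = 1 - \sum_{i=1}^{n} p_i^2$$
Where:
p_i = proportion of records in D that belong to class i
n = number of classes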
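Value Range:
0 → Pure dataset (all records belong to one class)
0.5 → Maximum impurity for two classes (an even 50/50 split)
1 − 1/n → Maximum impurity for n classes; this approaches 1 as n grows but never reaches it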
Example:
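For a dataset with 6 Yes and 4 No:
Gini = 1 − (0.6² + 0.4²) = 1 − 0.52 = 0.48
This is close to the two-class maximum of 0.5, so the dataset is highly impure.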
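The calculation above can be checked with a short sketch. The function name `gini` is an illustrative choice; the 6 Yes / 4 No labels come from the example, and the pure and evenly split lists are added only to show the two ends of the two-class range.

```python
from collections import Counter

def gini(labels):
    """Gini index: 1 minus the sum of squared class proportions."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

example = ["Yes"] * 6 + ["No"] * 4   # the 6 Yes, 4 No case from the text
pure = ["Yes"] * 10                  # a single class
even = ["Yes"] * 5 + ["No"] * 5      # 50/50 split

print(round(gini(example), 2))  # 0.48
print(gini(pure))               # 0.0
print(gini(even))               # 0.5
```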
Summary
Attribute selection measures help us choose the best feature when building a decision tree.
- Entropy → Measures disorder
- Information Gain → Measures improvement after split
- Gini Index → Measures impurity
Using these measures can:
- Improve model accuracy
- Make decision trees more efficient
- Support better decision-making