Attribute Selection Measures in Data Mining
shareef

In this article, we will learn about attribute selection measures in a simple way.

What is Attribute Selection?

Attribute selection is also known as:
  • Feature selection
  • Variable selection
It is an important concept in data mining, especially when building decision trees.

Attributes (or features) are the columns in a dataset. Sometimes, datasets contain:
  • Irrelevant data
  • Duplicate information
  • Noise (unwanted data)
These can reduce the performance of a model and make learning difficult.

Attribute selection helps by:
  • Removing unnecessary data
  • Improving model accuracy
  • Reducing overfitting
  • Making the model easier to understand

Why is it Important in Decision Trees?

In a decision tree, we need to decide:
  • Which attribute should be used first (root node)
  • How to split the data at each step
Attribute selection measures help us choose the best attribute for splitting the data.

Types of Attribute Selection Measures

There are three main measures:
  • Entropy
  • Information Gain
  • Gini Index

1. Entropy

Entropy measures the impurity or disorder in a dataset.
High entropy → Data is mixed (impure)
Low entropy → Data is uniform (pure)
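
The article does not give the entropy formula itself. The conventional definition is H(D) = −Σ p_i log2(p_i), where p_i is the proportion of records in class i. The following is a minimal sketch of that definition, not a reference implementation; the function name `entropy` and the sample label lists are assumptions made for illustration.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0  # assumption: treat an empty set as having no disorder
    counts = Counter(labels)
    # p * log2(1/p) is the same as -p * log2(p), written to avoid a "-0.0" result
    return sum((c / total) * log2(total / c) for c in counts.values())

# Hypothetical label sets, for illustration only
pure = ["Yes"] * 10                 # all records in one class
mixed = ["Yes"] * 5 + ["No"] * 5    # evenly mixed classes

print(entropy(pure))   # 0.0 -> low entropy, pure
print(entropy(mixed))  # 1.0 -> high entropy, impure
```

For two classes, entropy ranges from 0 (pure) to 1 bit (evenly mixed).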

2. Information Gain

Information Gain tells us how much entropy decreases after splitting the dataset.

In simple words:
It shows how useful an attribute is for classification.

Formula:
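$$IG(D, A) = H(D) - \sum_{v} \frac{|D_v|}{|D|} \, H(D_v)$$
Where:
$H(D)$ = entropy of the dataset $D$
$D_v$ = subset of $D$ in which attribute $A$ takes the value $v$
$|D_v|$ = size of that subset
$|D|$ = size of the whole dataset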

Rule:
Higher Information Gain = Better attribute

Example:
For attribute Gender:
IG ≈ 0.000
An Information Gain close to zero means that splitting on Gender leaves the entropy almost unchanged, so Gender is not useful for splitting.
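
The calculation can be sketched in Python as follows. The article does not show the dataset behind the Gender example, so the four rows below are a hypothetical toy dataset, and the names `information_gain`, `rows`, `Student`, and `Buys` are assumptions made for illustration. The sketch assumes a non-empty list of rows.

```python
from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a non-empty sequence of class labels."""
    total = len(labels)
    return sum((c / total) * log2(total / c) for c in Counter(labels).values())

def information_gain(rows, attribute, target):
    """IG(D, A) = H(D) - sum over v of |D_v|/|D| * H(D_v)."""
    labels = [row[target] for row in rows]
    # Group the target labels by the value the attribute takes in each row
    subsets = defaultdict(list)
    for row in rows:
        subsets[row[attribute]].append(row[target])
    weighted = sum(len(s) / len(rows) * entropy(s) for s in subsets.values())
    return entropy(labels) - weighted

# Hypothetical toy data (not taken from the article)
rows = [
    {"Gender": "M", "Student": "Yes", "Buys": "Yes"},
    {"Gender": "M", "Student": "No",  "Buys": "No"},
    {"Gender": "F", "Student": "Yes", "Buys": "Yes"},
    {"Gender": "F", "Student": "No",  "Buys": "No"},
]

print(information_gain(rows, "Gender", "Buys"))   # 0.0 -> not useful
print(information_gain(rows, "Student", "Buys"))  # 1.0 -> separates the classes

# Pick the attribute with the highest Information Gain
best = max(["Gender", "Student"], key=lambda a: information_gain(rows, a, "Buys"))
print(best)  # Student
```

In this toy data each Gender group is still half Yes and half No, so the split removes no entropy, while each Student group is pure.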

3. Gini Index

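Gini Index also measures impurity, but it uses squared class proportions instead of logarithms. It can be read as the probability that two records drawn at random (with replacement) from the dataset belong to different classes.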

Formula:
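$$Gini(D) = 1 - \sum_{i=1}^{n} p_i^2$$
Where:
$p_i$ = proportion of records in $D$ that belong to class $i$
$n$ = number of classes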

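Value Range:
0 → Pure dataset (all records in one class)
0.5 → Maximum impurity for two classes (evenly split)
1 − 1/n → Maximum impurity for n classes (approaches 1 only as the number of classes grows)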

Example:
For:
6 Yes, 4 No

Gini = 0.48
This is close to the two-class maximum of 0.5, so the set is highly impure.
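
The arithmetic behind this example is 1 − (0.6² + 0.4²) = 1 − (0.36 + 0.16) = 0.48. As a minimal sketch under the formula above (the function name `gini` is an assumption made for illustration, and the input is assumed to be non-empty):

```python
from collections import Counter

def gini(labels):
    """Gini index of a non-empty sequence of class labels."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

# The 6 Yes / 4 No example from the text
labels = ["Yes"] * 6 + ["No"] * 4
print(round(gini(labels), 2))  # 0.48

# Boundary cases for two classes
print(gini(["Yes"] * 10))               # 0.0 -> pure
print(gini(["Yes"] * 5 + ["No"] * 5))   # 0.5 -> two-class maximum
```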

Attribute selection measures help us choose the best feature when building a decision tree.

  • Entropy → Measures disorder
  • Information Gain → Measures improvement after split
  • Gini Index → Measures impurity

Using these measures can:

  • Improve model accuracy
  • Make decision trees more efficient
  • Help in better decision-making