Tree Pruning in Data Mining
Tree pruning is a technique used in decision trees to make them smaller and
more efficient.
- It removes branches that add little or no predictive value
- It often improves accuracy on new, unseen data
- It makes the model easier to understand
What is a Decision Tree?
A decision tree is a model used for classification and prediction. It is
shaped like a tree and helps in making decisions step by step.
Structure of a Decision Tree:
- Root Node → Starting point (main question)
- Branches → Possible choices or conditions
- Leaf Nodes → Final result (answer like Yes/No)
Example:
Think of deciding: “Can we play cricket?”
Root: Weather
Branches: Sunny, Rainy, Cloudy
Leaf: Yes or No
Important Concepts
- 1. Information Gain
- 2. Entropy
1. Information Gain
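Measures how much a split (a question about one feature) reduces uncertainty about the outcome
Helps choose the best split: at each node, the feature with the highest gain is chosen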
2. Entropy
Measures the uncertainty or randomness (mix of outcomes) in a set of examples
Lower entropy = the examples mostly share one outcome, so the decision is more certain
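The two measures can be computed in a few lines. The following is a minimal sketch using the standard Shannon entropy formula (in bits). The function names `entropy` and `information_gain` and the tiny Yes/No label lists are made up for illustration only.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())

def information_gain(labels, groups):
    """Parent entropy minus the size-weighted entropy of the child groups."""
    total = len(labels)
    weighted = sum(len(g) / total * entropy(g) for g in groups)
    return entropy(labels) - weighted

# Illustrative labels: two "Yes" and two "No" examples.
parent = ["Yes", "Yes", "No", "No"]
perfect_split = [["Yes", "Yes"], ["No", "No"]]   # each group is pure
useless_split = [["Yes", "No"], ["Yes", "No"]]   # each group is still mixed

print(entropy(parent))                          # 1.0 (maximum uncertainty for two classes)
print(information_gain(parent, perfect_split))  # 1.0 (all uncertainty removed)
print(information_gain(parent, useless_split))  # 0.0 (nothing learned)
```

A split that separates the outcomes cleanly has a high gain, while a split that leaves every group as mixed as before has a gain of zero.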
Example: Cricket Decision
We have data with weather and temperature to decide whether cricket can be
played.
How the Decision Is Made:
Day 1 → Weather is Sunny → Temperature is Mild → Play cricket
Day 2 → Weather is Rainy → Do not play
Day 3 → Weather is Cloudy → Temperature is Mild → Play cricket
Day 4 → Weather is Sunny → Temperature is Cool → Play cricket
Day 5 → Weather is Sunny → Temperature is Hot → Do not play
This shows how a decision tree makes a prediction step by step: it checks one condition at a time until it reaches a leaf.
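The five days above can be written down as a small tree and checked in code. This is only a sketch: the nested-dictionary layout, the `predict` function, and the key names are assumptions made for illustration. The Cloudy branch is treated as a "Play" leaf because the only Cloudy day in the data was played.

```python
# Internal nodes are dicts: {"feature": name, "branches": {value: subtree}}.
# Leaves are plain strings holding the final decision.
tree = {
    "feature": "weather",
    "branches": {
        "Rainy": "Do not play",
        "Cloudy": "Play",  # assumption: only one Cloudy day was observed
        "Sunny": {
            "feature": "temperature",
            "branches": {"Mild": "Play", "Cool": "Play", "Hot": "Do not play"},
        },
    },
}

def predict(node, sample):
    """Walk from the root to a leaf, following the branch for each feature value."""
    while isinstance(node, dict):
        value = sample[node["feature"]]
        node = node["branches"][value]
    return node

# The five days from the example (Day 2 has no temperature recorded).
days = [
    ({"weather": "Sunny", "temperature": "Mild"}, "Play"),
    ({"weather": "Rainy"}, "Do not play"),
    ({"weather": "Cloudy", "temperature": "Mild"}, "Play"),
    ({"weather": "Sunny", "temperature": "Cool"}, "Play"),
    ({"weather": "Sunny", "temperature": "Hot"}, "Do not play"),
]

for number, (sample, expected) in enumerate(days, start=1):
    print(f"Day {number}: predicted={predict(tree, sample)!r}, actual={expected!r}")
```

On a Rainy day the tree answers without looking at the temperature, which is why Day 2 needs no temperature value.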
Why Do We Need Tree Pruning?
When a tree grows too large:
- It becomes complex
- It may overfit (fits training data too closely)
- It performs poorly on new data
Tree pruning solves this by removing unnecessary branches.
What is Overfitting?
The model learns too many details of the training data, including noise
It works well on the training data
It performs poorly on new data
Goal of Pruning
- Reduce tree size
- Improve accuracy on new data
- Avoid overfitting
- Make the model simpler
Types of Tree Pruning
- 1. Pre-Pruning (Early Stopping)
- 2. Post-Pruning (Backward Pruning)
1. Pre-Pruning (Early Stopping)
Stops the tree before it becomes too large
How it works:
- Set limits while building the tree
- Examples of limits:
  - Maximum depth
  - Minimum information gain
  - Entropy threshold
Example:
Consider customer data with age, salary, and purchase decision.
If we set maximum depth = 3:
The tree stops growing after 3 levels, even if more splits are possible
Advantage: Limits overfitting early and keeps training fast
Disadvantage: May stop too soon and miss important patterns
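Early stopping can be sketched with a small tree builder that checks its limits before every split. The following is an illustrative implementation for categorical features, not a production algorithm. The names `build_tree`, `max_depth`, and `min_gain` are assumptions, and the customer table is invented for the example. The table is tiny, so the demo uses a depth limit of 1 instead of 3 to make the effect visible.

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return sum(-(c / total) * math.log2(c / total) for c in Counter(labels).values())

def majority(labels):
    """Most common label; used as the prediction of a leaf."""
    return Counter(labels).most_common(1)[0][0]

def build_tree(rows, labels, features, depth=0, max_depth=3, min_gain=0.0):
    """Grow a tree on categorical features, stopping early when a limit is hit."""
    # Stop if the node is pure, the depth limit is reached, or no feature is left.
    if len(set(labels)) == 1 or depth >= max_depth or not features:
        return majority(labels)

    def gain(feature):
        groups = {}
        for row, label in zip(rows, labels):
            groups.setdefault(row[feature], []).append(label)
        remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
        return entropy(labels) - remainder

    best = max(features, key=gain)
    # Stop if even the best split gains too little information.
    if gain(best) <= min_gain:
        return majority(labels)

    remaining = [f for f in features if f != best]
    branches = {}
    for value in sorted(set(row[best] for row in rows)):
        sub_rows = [r for r in rows if r[best] == value]
        sub_labels = [l for r, l in zip(rows, labels) if r[best] == value]
        branches[value] = build_tree(sub_rows, sub_labels, remaining,
                                     depth + 1, max_depth, min_gain)
    return {"feature": best, "branches": branches}

def depth_of(node):
    """Number of decision levels below (and including) this node."""
    if not isinstance(node, dict):
        return 0
    return 1 + max(depth_of(child) for child in node["branches"].values())

# Invented customer data: only young customers with a high salary buy.
rows = [
    {"age": "young", "salary": "low"},
    {"age": "young", "salary": "high"},
    {"age": "old", "salary": "low"},
    {"age": "old", "salary": "high"},
]
labels = ["No", "Yes", "No", "No"]

full_tree = build_tree(rows, labels, ["salary", "age"])                   # default limit of 3 is not reached
shallow_tree = build_tree(rows, labels, ["salary", "age"], max_depth=1)   # stopped after one level

print(depth_of(full_tree), depth_of(shallow_tree))
```

The shallow tree stops after one level, so it no longer separates young from old customers with a high salary. This is the trade-off named above: a smaller tree, at the risk of missing a real pattern.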
2. Post-Pruning (Backward Pruning)
First grow the full tree, then remove unnecessary branches
How it works:
- Build the full tree
- Remove branches that do not improve accuracy (usually checked on data not used for training)
- Replace each removed branch with a leaf node
Example:
Consider student data based on study hours and sleep hours.
After building the full tree:
Some rules may not be useful
Example: “If sleep hours are high, then fail”
If this rule does not improve prediction:
Remove that branch
Result:
Simpler tree
Accuracy on new data that is the same or better
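One common way to carry out these steps is reduced-error pruning: a branch is replaced by a leaf whenever the leaf does at least as well on a separate validation set. The sketch below assumes that technique. The tree layout, the `majority` field (the most common training label at a node), and the student records are all invented for illustration.

```python
def predict(node, sample):
    """Follow branches until a leaf (a plain string) is reached."""
    while isinstance(node, dict):
        node = node["branches"][sample[node["feature"]]]
    return node

def count_correct(node, samples):
    return sum(predict(node, x) == y for x, y in samples)

def prune(node, samples):
    """Return a pruned copy of `node`, judged on the validation `samples`."""
    if not isinstance(node, dict):
        return node  # already a leaf
    # Work bottom-up: prune each child using only the samples that reach it.
    branches = {}
    for value, child in node["branches"].items():
        reaching = [(x, y) for x, y in samples if x[node["feature"]] == value]
        branches[value] = prune(child, reaching)
    candidate = {"feature": node["feature"], "majority": node["majority"],
                 "branches": branches}
    # Replace the subtree with a leaf if the leaf is at least as accurate.
    leaf_correct = sum(y == node["majority"] for _, y in samples)
    if leaf_correct >= count_correct(candidate, samples):
        return node["majority"]
    return candidate

# Hypothetical fully grown tree containing the rule
# "if study hours are high and sleep hours are high, then fail".
full_tree = {
    "feature": "study", "majority": "pass",
    "branches": {
        "low": "fail",
        "high": {
            "feature": "sleep", "majority": "pass",
            "branches": {"low": "pass", "high": "fail"},
        },
    },
}

# Hypothetical validation records that were not used to grow the tree.
validation = [
    ({"study": "high", "sleep": "high"}, "pass"),
    ({"study": "high", "sleep": "low"}, "pass"),
    ({"study": "low", "sleep": "high"}, "fail"),
    ({"study": "low", "sleep": "low"}, "fail"),
    ({"study": "high", "sleep": "high"}, "pass"),
]

pruned_tree = prune(full_tree, validation)
print(count_correct(full_tree, validation), "of", len(validation), "correct before pruning")
print(count_correct(pruned_tree, validation), "of", len(validation), "correct after pruning")
```

In this sketch the sleep-hours branch is replaced by a single "pass" leaf because it misclassifies validation records, while the study-hours split is kept because removing it would lower accuracy. A branch that no validation record reaches is also collapsed, which is a design choice of this sketch.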
Pre-Pruning vs Post-Pruning
Pre-Pruning
Done before the tree is fully built
Faster and simpler
May miss useful patterns
Post-Pruning
Done after building the full tree
Usually more accurate, but slower because the full tree must be built first
Removes unnecessary branches effectively
Summary
Decision trees make decisions step by step and are easy to follow
Large trees can overfit the training data
Tree pruning removes unnecessary branches
It makes the model:
- Simpler
- Faster
- Often more accurate on new data
Two methods:
- Pre-pruning (early stopping)
- Post-pruning (after full tree is built)