Decision Tree Induction
A decision tree is a supervised learning method used in data mining for
both classification and regression problems. It is a tree-like model that
helps in making decisions based on data.
A decision tree divides a dataset into smaller groups step by step.
During this process, the model builds a tree structure made up of a
root node, decision nodes, and leaf nodes.
- Root Node – The top node of the tree; it tests the attribute that best splits the whole dataset.
- Decision Node – A node where the data is split into two or more branches based on a condition.
- Leaf Node – The final node that shows the final decision or class label. No further splitting occurs at this stage.
Decision trees can work with both categorical data (such as Yes/No) and
numerical data (such as age, income, etc.).
Key Concepts
1. Entropy
Entropy is used to measure the impurity or randomness in a dataset.
- High entropy → The classes are mixed, so the outcome is uncertain.
- Low entropy → Most records belong to one class, so the data is purer.
In a decision tree, entropy helps determine how well a feature
separates the data.
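As a minimal sketch, entropy can be computed with the standard Shannon formula H = −Σ p_i log2(p_i), where p_i is the fraction of records in class i. The function name and the toy label lists below are assumptions made for illustration only.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    # p * log2(1/p) equals -p * log2(p); summed over the classes present.
    return sum((n / total) * log2(total / n) for n in counts.values())

mixed = ["Yes", "No", "Yes", "No"]    # evenly mixed: highest impurity for two classes
pure = ["Yes", "Yes", "Yes", "Yes"]   # one class only: no impurity

print(entropy(mixed))  # 1.0
print(entropy(pure))   # 0.0
```

With two classes, entropy ranges from 0 (pure) to 1 bit (an even mix).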
2. Information Gain
Information Gain measures the reduction in entropy after splitting the
dataset using a particular attribute.
It helps identify the best attribute for splitting the data.
The attribute with the highest information gain is chosen to create the
next branch in the tree.
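A small sketch of this calculation follows: information gain is the entropy of the whole set minus the size-weighted entropy of the subsets produced by the split. The dataset, the attribute names, and the function names are hypothetical and chosen only to illustrate the idea.

```python
from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    total = len(labels)
    return sum((n / total) * log2(total / n) for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Entropy before the split minus the weighted entropy after it."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[attribute]].append(row[target])
    before = entropy([row[target] for row in rows])
    after = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return before - after

# Hypothetical toy data: "outlook" separates the classes perfectly, "windy" does not.
rows = [
    {"outlook": "sunny", "windy": "no",  "play": "yes"},
    {"outlook": "sunny", "windy": "yes", "play": "yes"},
    {"outlook": "rainy", "windy": "no",  "play": "no"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
]

gains = {a: information_gain(rows, a, "play") for a in ("outlook", "windy")}
best = max(gains, key=gains.get)
print(gains)  # {'outlook': 1.0, 'windy': 0.0}
print(best)   # outlook
```

Here "outlook" removes all uncertainty (gain of 1 bit), so it would be chosen for the split.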
Simple Explanation
- A decision tree works like a flowchart.
- Start with the complete dataset.
- Calculate entropy to measure impurity.
- Choose the attribute that gives the highest information gain.
- Split the dataset based on that attribute.
- Repeat the process on each group until all the data in a group belongs to the same class or no attributes are left to split on.
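The flowchart idea can be made concrete with a short sketch. The tree below is written by hand as nested dictionaries (a representation assumed here for illustration, not a standard format), and `classify` is a hypothetical helper that follows branches from the root until it reaches a leaf.

```python
# Hypothetical hand-built tree: inner nodes are dicts naming the attribute to
# test and one branch per attribute value; leaves are plain class labels.
tree = {
    "attribute": "outlook",
    "branches": {
        "sunny": "play",
        "rainy": {
            "attribute": "windy",
            "branches": {"yes": "stay home", "no": "play"},
        },
    },
}

def classify(node, record):
    """Walk from the root to a leaf, following the branch that matches the record."""
    while isinstance(node, dict):
        value = record[node["attribute"]]
        node = node["branches"][value]
    return node

print(classify(tree, {"outlook": "rainy", "windy": "yes"}))  # stay home
print(classify(tree, {"outlook": "sunny", "windy": "yes"}))  # play
```

This sketch raises a KeyError for an attribute value that has no branch; a fuller version would need a fallback such as the majority class.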
Why Are Decision Trees Useful?
Decision trees are useful because they:
- Help analyze the possible outcomes of a decision.
- Provide a structure to measure probabilities and results.
- Make it easier to choose the best decision based on available data.
Decision Tree Structure
A decision tree is a hierarchical structure that uses a set of simple
rules to divide a large dataset into smaller and more homogeneous groups.
The attributes used for splitting can be:
- Nominal values (e.g., color, gender)
- Ordinal values (e.g., small, medium, large)
- Binary values (Yes/No)
- Numerical values (e.g., salary, age)
For classification trees the output class is categorical; regression trees predict a numerical value instead.
Each split in the tree creates a segment called a node. As the splitting
continues, the data in each node becomes more similar.
This repeated splitting process is called recursive partitioning.
One common algorithm used to build decision trees is CART
(Classification and Regression Trees).
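CART builds binary trees, and for classification it commonly scores splits with the Gini index. As a rough sketch (not the full CART procedure), the code below searches for the threshold on one numerical attribute that gives the lowest weighted Gini impurity. The function names and the age data are assumptions for illustration.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, weighted_gini) for the best split 'value <= threshold'."""
    pairs = sorted(zip(values, labels))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # equal values cannot be separated by a threshold
        threshold = (lo + hi) / 2  # midpoint between neighbouring values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best[1]:
            best = (threshold, score)
    return best

# Hypothetical data: does a customer of a given age buy the product?
ages = [22, 25, 30, 41, 46, 52]
buys = ["no", "no", "no", "yes", "yes", "yes"]

print(best_threshold(ages, buys))  # (35.5, 0.0)
```

A weighted Gini of 0 means both sides of the split are pure. Applying the same search again inside each side is the recursive partitioning described above.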
Example
1. Factory Expansion Decision
Consider a factory that needs to decide whether to expand
production or not.
Option 1: Expand the Factory
Cost of expansion = $3 million
Probability of good economy = 0.6 (60%) → Profit = $8 million
Probability of bad economy = 0.4 (40%) → Profit = $6 million
Expected net profit (in $ million):
Net Expand = (0.6 × 8 + 0.4 × 6) − 3 = 7.2 − 3
So, Net Expand = $4.2 million
Option 2: Do Not Expand
Cost = $0
Probability of good economy = 0.6 (60%) → Profit = $4 million
Probability of bad economy = 0.4 (40%) → Profit = $2 million
Expected net profit (in $ million):
Net Not Expand = (0.6 × 4 + 0.4 × 2) − 0 = 2.4 + 0.8
So, Net Not Expand = $3.2 million
Final Decision
Since:
4.2M > 3.2M
The factory should expand.
This decision can be clearly visualized using a decision
tree model.
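The expected-value arithmetic for the two options can be checked with a few lines of code. This is only a sketch: `expected_net` is a hypothetical helper, and the figures (in $ millions) are the ones given in the example.

```python
def expected_net(outcomes, cost):
    """outcomes: list of (probability, profit) pairs; returns expected profit minus cost."""
    return sum(p * profit for p, profit in outcomes) - cost

# Figures in $ millions, taken from the factory example.
expand = expected_net([(0.6, 8), (0.4, 6)], cost=3)
no_expand = expected_net([(0.6, 4), (0.4, 2)], cost=0)

decision = "expand" if expand > no_expand else "do not expand"
print(f"expand: {expand:.1f}M, do not expand: {no_expand:.1f}M -> {decision}")
# expand: 4.2M, do not expand: 3.2M -> expand
```

Each option is a branch from the decision node, and each economic outcome is a chance branch weighted by its probability.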
Decision Tree Algorithm
The decision tree algorithm is based on three main parameters:
1. Dataset (D)
- Represents the training dataset. Initially, it contains all the training data with class labels.
2. Attribute List
- A list of attributes or features used to describe the data.
3. Attribute Selection Method
- A method used to select the best attribute for splitting.
- It uses measures such as Information Gain or Gini Index.
The algorithm repeatedly chooses the best attribute and splits the data
until a stopping condition is met, for example when every partition
contains a single class or no attributes remain.
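The three parameters can be sketched as the arguments of one recursive function. This is a simplified illustration for categorical attributes only, with information gain as the selection method. The function names, the `target` argument, and the toy data are assumptions, and the sketch leaves out pruning and numerical attributes.

```python
from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    total = len(labels)
    return sum((n / total) * log2(total / n) for n in Counter(labels).values())

def select_by_information_gain(D, attribute_list, target):
    """Attribute selection method: return the attribute with the highest gain."""
    def gain(attribute):
        groups = defaultdict(list)
        for row in D:
            groups[row[attribute]].append(row[target])
        remainder = sum(len(g) / len(D) * entropy(g) for g in groups.values())
        return entropy([row[target] for row in D]) - remainder
    return max(attribute_list, key=gain)

def generate_decision_tree(D, attribute_list, attribute_selection_method, target="label"):
    labels = [row[target] for row in D]
    # Stop when the partition is pure or no attributes remain; return a leaf.
    if len(set(labels)) == 1 or not attribute_list:
        return Counter(labels).most_common(1)[0][0]
    best = attribute_selection_method(D, attribute_list, target)
    remaining = [a for a in attribute_list if a != best]
    partitions = defaultdict(list)
    for row in D:
        partitions[row[best]].append(row)
    # One branch per observed value of the chosen attribute.
    return {
        "attribute": best,
        "branches": {
            value: generate_decision_tree(subset, remaining, attribute_selection_method, target)
            for value, subset in partitions.items()
        },
    }

# Hypothetical training data with class labels.
D = [
    {"outlook": "sunny", "windy": "no",  "label": "play"},
    {"outlook": "sunny", "windy": "yes", "label": "play"},
    {"outlook": "rainy", "windy": "no",  "label": "play"},
    {"outlook": "rainy", "windy": "yes", "label": "stay home"},
    {"outlook": "rainy", "windy": "yes", "label": "stay home"},
    {"outlook": "sunny", "windy": "yes", "label": "play"},
]

tree = generate_decision_tree(D, ["outlook", "windy"], select_by_information_gain)
print(tree)
```

Because the selection method is passed in as a function, a Gini-based measure could be substituted without changing the recursive part.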
Advantages of Decision Trees
Decision trees provide several benefits:
- They do not require data scaling or normalization.
- Missing values can often be tolerated, although how they are handled depends on the algorithm and implementation.
- The model is easy to understand and explain.
- It requires less data preprocessing compared to many other algorithms.
- The tree structure makes decision-making clear and transparent.