Decision Tree Induction

Vinithra


A decision tree is a supervised learning method used in data mining for both classification and regression problems. It is a tree-like model that reaches a decision by applying a sequence of tests to the data.

A decision tree divides a dataset into smaller groups step by step. During this process, the model creates a tree structure that contains decision nodes and leaf nodes.
  • Root Node – The top node of the tree that represents the best predictor or main attribute.
  • Decision Node – A node where the data is split into two or more branches based on a condition.
  • Leaf Node – The final node that shows the final decision or class label. No further splitting occurs at this stage.
Decision trees can work with both categorical data (such as Yes/No) and numerical data (such as age, income, etc.).

Key Concepts

1. Entropy

Entropy is used to measure the impurity or randomness in a dataset.
  • High entropy → Data is mixed and uncertain.
  • Low entropy → Data is more pure or similar.
In a decision tree, entropy helps determine how well a feature separates the data.
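
To make the idea of impurity concrete, here is a minimal sketch of Shannon entropy over a list of class labels. The function name `entropy` and the sample labels are illustrative assumptions, not part of the original text.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    # sum of p * log2(1/p), which equals -sum of p * log2(p)
    return sum((n / total) * log2(total / n) for n in counts.values())

# Hypothetical label sets, for illustration only.
mixed = ["yes", "no", "yes", "no"]    # evenly mixed classes
pure = ["yes", "yes", "yes", "yes"]   # a single class

print(entropy(mixed))  # 1.0 (maximum for two classes)
print(entropy(pure))   # 0.0 (no uncertainty)
```

An evenly mixed two-class set gives the maximum of 1 bit, while a set with only one class gives 0.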

2. Information Gain

Information Gain measures the reduction in entropy after splitting the dataset using a particular attribute.

It helps identify the best attribute for splitting the data.

The attribute with the highest information gain is chosen to create the next branch in the tree.
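
As a rough sketch, information gain can be computed as the entropy of the class labels before a split minus the size-weighted entropy of the groups after it. The toy rows and the attribute names `outlook`, `windy`, and `play` below are hypothetical and only serve the illustration.

```python
from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a non-empty sequence of class labels."""
    total = len(labels)
    return sum((n / total) * log2(total / n) for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Entropy of the target before the split minus weighted entropy after it."""
    parent = entropy([row[target] for row in rows])
    groups = defaultdict(list)
    for row in rows:
        groups[row[attribute]].append(row[target])
    weighted = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return parent - weighted

# Hypothetical toy dataset, for illustration only.
rows = [
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "rainy", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "yes"},
]

print(information_gain(rows, "outlook", "play"))  # 1.0: the split yields pure groups
print(information_gain(rows, "windy", "play"))    # 0.0: the split leaves groups as mixed as before
```

In this toy data, `outlook` would be chosen for the split because it has the higher gain.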

Simple Explanation

  • A decision tree works like a flowchart.
  • Start with the complete dataset.
  • Calculate entropy to measure impurity.
  • Choose the attribute that gives the highest information gain.
  • Split the dataset based on that attribute.
  • Repeat the process on each subset until the data in each group belongs to the same class or no attributes remain to split on.

Why Are Decision Trees Useful?

Decision trees are useful because they:
  • Help analyze the possible outcomes of a decision.
  • Provide a structure to measure probabilities and results.
  • Make it easier to choose the best decision based on available data.

Decision Tree Structure

A decision tree is a hierarchical structure used to divide a large dataset into smaller and more similar groups.

The tree uses a set of simple rules to split the data.

These rules separate a large population into smaller and more homogeneous groups.

The attributes used for splitting can be:
  • Nominal values (e.g., color, gender)
  • Ordinal values (e.g., small, medium, large)
  • Binary values (Yes/No)
  • Numerical values (e.g., salary, age)
In a classification tree, the final output is a categorical class label; in a regression tree, it is a numerical value.

Each split in the tree creates a segment called a node. As the splitting continues, the data in each node becomes more similar.

This repeated splitting process is called Recursive Partitioning.

One common algorithm used to build decision trees is CART (Classification and Regression Trees).
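
CART builds binary splits and commonly uses Gini impurity to score them for classification. The sketch below shows only that one step, choosing a threshold on a single numerical attribute; it is not a full CART implementation (no recursion, pruning, or regression). The names `gini` and `best_threshold` and the sample data are assumptions made for illustration.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, weighted_gini) for the best split 'value <= threshold'."""
    distinct = sorted(set(values))
    best = (None, float("inf"))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best

# Hypothetical data: customer ages and whether each customer bought a product.
ages = [22, 25, 30, 41, 47, 52]
bought = ["no", "no", "no", "yes", "yes", "yes"]

print(best_threshold(ages, bought))  # (35.5, 0.0): both sides of the split are pure
```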

Example

1. Factory Expansion Decision

Consider a factory that needs to decide whether to expand production or not.

Option 1: Expand the Factory

Cost of expansion = $3 million

Probability of good economy = 0.6 (60%) → Profit = $8 million

Probability of bad economy = 0.4 (40%) → Profit = $6 million

Expected profit:
Net Expand = (0.6 × 8 + 0.4 × 6) − 3

So, Net Expand = $4.2 million

Option 2: Do Not Expand

Cost = $0

Good economy (probability 0.6) → Profit = $4 million

Bad economy (probability 0.4) → Profit = $2 million

Expected profit:
Net Not Expand = (0.6 × 4 + 0.4 × 2) − 0

So, Net Not Expand = $3.2 million

Final Decision

Since:
4.2M > 3.2M

The factory should expand.

This decision can be clearly visualized using a decision tree model.
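
The expected-value arithmetic in this example can be checked with a few lines of code. This is a small sketch using the probabilities, profits, and costs stated above; the helper name `expected_net` is an assumption.

```python
def expected_net(outcomes, cost):
    """Probability-weighted profit minus the up-front cost (all in $ millions)."""
    return sum(p * profit for p, profit in outcomes) - cost

# (probability, profit) pairs for a good and a bad economy.
expand = expected_net([(0.6, 8), (0.4, 6)], cost=3)
no_expand = expected_net([(0.6, 4), (0.4, 2)], cost=0)

print(f"Expand: {expand:.1f}M, do not expand: {no_expand:.1f}M")
print("Decision:", "expand" if expand > no_expand else "do not expand")
```

Each option is a branch of the tree, each economic outcome is a chance branch beneath it, and the option with the larger expected net value is selected.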

Decision Tree Algorithm

The decision tree algorithm is based on three main parameters:

1. Dataset (D)

  • Represents the training dataset. Initially, it contains all the training data with class labels.

2. Attribute List

  • A list of attributes or features used to describe the data.

3. Attribute Selection Method

  • A method used to select the best attribute for splitting.
  • It uses measures such as Information Gain or Gini Index.
The algorithm repeatedly chooses the best attribute and splits the data until the tree is complete.
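
The three inputs above map naturally onto a recursive function. The following is a minimal ID3-style sketch for categorical attributes only, assuming information gain as the selection method; it omits pruning, numerical splits, and handling of unseen values. The names `build_tree`, `select`, and the toy rows are hypothetical.

```python
from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a non-empty sequence of class labels."""
    total = len(labels)
    return sum((n / total) * log2(total / n) for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Entropy before the split minus the weighted entropy after it."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[attribute]].append(row[target])
    after = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[target] for row in rows]) - after

def build_tree(rows, attributes, target, select=information_gain):
    """rows = dataset D, attributes = attribute list, select = selection method."""
    labels = [row[target] for row in rows]
    majority = Counter(labels).most_common(1)[0][0]
    # Leaf: every row has the same class, or no attributes remain to split on.
    if len(set(labels)) == 1 or not attributes:
        return majority
    best = max(attributes, key=lambda a: select(rows, a, target))
    remaining = [a for a in attributes if a != best]
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[best]].append(row)
    # Decision node: one branch per observed value of the chosen attribute.
    return {best: {value: build_tree(subset, remaining, target, select)
                   for value, subset in partitions.items()}}

# Hypothetical toy dataset, for illustration only.
rows = [
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "rainy", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "yes"},
]

tree = build_tree(rows, ["outlook", "windy"], "play")
print(tree)  # {'outlook': {'sunny': 'no', 'rainy': 'yes'}}
```

Passing the selection method as a parameter means a different measure, such as one based on the Gini Index, could be swapped in without changing the recursion.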

Advantages of Decision Trees

Decision trees provide several benefits:
  • They do not require data scaling or normalization.
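  • Some decision tree algorithms can handle missing values directly (for example, C4.5, or CART with surrogate splits), so missing data often has limited impact on the model.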
  • The model is easy to understand and explain.
  • It requires less data preprocessing compared to many other algorithms.
  • The tree structure makes decision-making clear and transparent.