Apriori Algorithm
The Apriori Algorithm is a popular data mining algorithm used to find relationships between items in a dataset. It helps identify patterns showing which items are frequently purchased together.
For example, in a supermarket, customers who buy pizza often buy soft
drinks and breadsticks as well. Because of this pattern, shops create combo
offers, which make shopping easier for customers and increase store
sales.
Similarly, in large stores like Big Bazaar, products such as biscuits,
chips, and chocolates are often placed together because customers usually
buy them together. These relationships are expressed as association
rules, and the Apriori algorithm helps find these rules.
What is Apriori Algorithm?
The Apriori Algorithm is used to identify frequent itemsets (groups of
items that appear together frequently in transactions) and to generate
association rules from those itemsets.
It works on large databases containing many transactions, such as
customer purchase records.
For example, if many customers buy biscuits and chocolates together,
the algorithm identifies this pattern, and stores can use it for product
placement or recommendations.
Components of Apriori Algorithm
The Apriori algorithm mainly uses three important measures:
- Support
- Confidence
- Lift
Let us understand them using an example.
Suppose a supermarket has 4000 transactions.
- 400 transactions include Biscuits
- 600 transactions include Chocolate
- 200 transactions include both Biscuits and Chocolate
1. Support
Support shows how frequently an item appears in the dataset.
Formula
Support = (Number of transactions containing the item) / (Total
number of transactions)
Example:
Support (Biscuits)
= 400 / 4000
= 10%
This means 10% of all transactions contain biscuits.
2. Confidence
Confidence measures how often the second item is purchased when the first item is purchased.
Formula
Confidence = (Transactions containing both items) / (Transactions
containing the first item)
Example:
Confidence (Biscuits → Chocolate)
= 200 / 400
= 50%
This means 50% of customers who bought biscuits also bought
chocolate.
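The support and confidence formulas above can be checked with a few lines of Python. This is a minimal sketch: the function names `support` and `confidence` are illustrative and do not come from any library, and the counts are the ones from the supermarket example.

```python
def support(item_count: int, total_transactions: int) -> float:
    """Fraction of all transactions that contain the item."""
    return item_count / total_transactions


def confidence(both_count: int, first_item_count: int) -> float:
    """Fraction of transactions with the first item that also contain the second."""
    return both_count / first_item_count


# Counts from the supermarket example
TOTAL = 4000
BISCUITS = 400
BOTH = 200  # transactions with both Biscuits and Chocolate

print(f"Support(Biscuits) = {support(BISCUITS, TOTAL):.0%}")
print(f"Confidence(Biscuits -> Chocolate) = {confidence(BOTH, BISCUITS):.0%}")
```

Note that confidence is directional: Confidence (Chocolate → Biscuits) would divide by the 600 Chocolate transactions instead.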
3. Lift
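Lift measures how strongly two items are related by comparing the confidence of a rule with the support of the second item.
Formula
Lift = Confidence (first item → second item) / Support (second item)
Example:
Support (Chocolate) = 600 / 4000 = 15%
Lift (Biscuits → Chocolate) = 50% / 15% ≈ 3.33
This means customers who buy biscuits are about 3.33 times as likely to
buy chocolate as customers in general.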
Interpretation:
Lift = 1 → No relationship
Lift > 1 → Positive relationship
Lift < 1 → Negative relationship
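As a rough sketch of the lift calculation and its interpretation: the standard definition divides the confidence of the rule by the support of the second item (here Chocolate, 600 / 4000 = 15%). The function names `lift` and `interpret`, and the small tolerance used to test for "equal to 1", are assumptions made for this illustration.

```python
def lift(both_count: int, a_count: int, b_count: int, total: int) -> float:
    """Lift(A -> B) = Confidence(A -> B) / Support(B)."""
    confidence_a_to_b = both_count / a_count
    support_b = b_count / total
    return confidence_a_to_b / support_b


def interpret(lift_value: float, tolerance: float = 1e-9) -> str:
    """Map a lift value to the three cases listed above."""
    if abs(lift_value - 1.0) <= tolerance:
        return "no relationship"
    if lift_value > 1.0:
        return "positive relationship"
    return "negative relationship"


# Counts from the supermarket example: A = Biscuits, B = Chocolate
value = lift(both_count=200, a_count=400, b_count=600, total=4000)
print(f"Lift(Biscuits -> Chocolate) = {value:.2f}")
print(interpret(value))
```

With the example counts this gives a lift of about 3.33, a positive relationship.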
History of Apriori Algorithm
The Apriori Algorithm was introduced in 1994 by Rakesh Agrawal and
Ramakrishnan Srikant.
The name Apriori comes from the idea of using prior knowledge of
frequent itemsets to find larger patterns.
The algorithm first finds frequent k-itemsets, and then uses them to
generate candidate (k+1)-itemsets.
Applications of Apriori Algorithm
The Apriori algorithm is used in many fields.
1. Mobile E-Commerce
Online shopping platforms use it to recommend products frequently
bought together, improving customer experience and increasing sales.
2. Education
Educational institutions analyze student data such as grades,
performance, and demographic information.
3. Forestry
It helps analyze and manage environmental data related to plants and
wildlife.
4. Medical Field
Hospitals use it to analyze patient records and identify patterns in
medical data.
5. Market Basket Analysis
Retail stores analyze customer purchase patterns to understand which
products are bought together.
6. Website Design
It helps analyze user navigation patterns to improve website structure
and user experience.
7. Tourism Industry
Tour companies analyze booking patterns to understand tourist
preferences.
How Apriori Algorithm Works
Let us consider a simple example.
Products:
P = {Rice, Pulse, Oil, Milk, Apple}
The database contains several transactions showing which products were
purchased.
Assumptions of Apriori Algorithm
- All subsets of a frequent itemset must also be frequent.
- If an itemset is infrequent, all its supersets will also be infrequent.
- A minimum support threshold is set.
Assume minimum support = 50%.
Step 1: Find Frequent Single Items
Create a frequency table.
Product   Frequency
Rice      4
Pulse     5
Oil       4
Milk      4
Only items whose support meets the minimum threshold are kept for the
next step. Apple, which is absent from the table, is not carried forward.
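Step 1 can be sketched in Python as follows. The text does not list the actual transactions, so the six transactions below are hypothetical, chosen only so that the single-item counts match the frequency table. The sketch treats an item as frequent when its support is at least the threshold.

```python
from collections import Counter

# Hypothetical transactions (not given in the text), constructed so the
# single-item counts match the frequency table above.
TRANSACTIONS = [
    {"Rice", "Pulse", "Oil"},
    {"Pulse", "Oil", "Milk"},
    {"Milk", "Apple"},
    {"Rice", "Pulse", "Milk"},
    {"Rice", "Pulse", "Oil", "Apple"},
    {"Rice", "Pulse", "Oil", "Milk"},
]
MIN_SUPPORT = 0.5

# Count how many transactions contain each item.
item_counts = Counter(item for t in TRANSACTIONS for item in t)

# Minimum number of transactions an item needs: 50% of 6 = 3.
min_count = MIN_SUPPORT * len(TRANSACTIONS)

frequent_items = {item: n for item, n in item_counts.items() if n >= min_count}

for item in sorted(frequent_items):
    print(item, frequent_items[item])
```

In this hypothetical data Apple appears in only 2 of the 6 transactions, so it is dropped.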
Step 2: Create Item Pairs
Possible pairs (R = Rice, P = Pulse, O = Oil, M = Milk):
RP, RO, RM, PO, PM, OM
Itemset   Frequency
RP        4
RO        3
RM        2
PO        4
PM        3
OM        2
Step 3: Apply Support Threshold
Frequent pairs:
RP, RO, PO, PM
RM and OM, each with a frequency of 2, fall below the threshold and are
dropped.
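Steps 2 and 3 can be sketched together. As before, the transactions are hypothetical (the text does not list them) and were chosen so that the pair counts match the table; the minimum count of 3 corresponds to 50% of these six transactions.

```python
from itertools import combinations

# Hypothetical transactions, chosen to reproduce the pair counts in the table.
TRANSACTIONS = [
    {"Rice", "Pulse", "Oil"},
    {"Pulse", "Oil", "Milk"},
    {"Milk", "Apple"},
    {"Rice", "Pulse", "Milk"},
    {"Rice", "Pulse", "Oil", "Apple"},
    {"Rice", "Pulse", "Oil", "Milk"},
]
MIN_COUNT = 3  # 50% of 6 transactions
FREQUENT_ITEMS = ["Rice", "Pulse", "Oil", "Milk"]  # result of Step 1

# Step 2: build every pair of frequent items and count the transactions
# that contain both items.
pair_counts = {
    pair: sum(1 for t in TRANSACTIONS if set(pair) <= t)
    for pair in combinations(FREQUENT_ITEMS, 2)
}

# Step 3: keep the pairs that meet the minimum count.
frequent_pairs = [pair for pair, n in pair_counts.items() if n >= MIN_COUNT]

for pair, n in pair_counts.items():
    status = "frequent" if n >= MIN_COUNT else "dropped"
    print(pair, n, status)
```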
Step 4: Generate 3-Itemsets
Join frequent pairs that share their first item:
RP + RO gives RPO
PO + PM gives POM
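Candidate generation is usually described as a join followed by a prune. The sketch below is one way to write it, assuming itemsets are stored as tuples in a fixed item order; the helper names `join` and `prune` are illustrative. The join pairs up frequent k-itemsets that share their first k-1 items, and the prune drops any candidate that has an infrequent k-item subset.

```python
from itertools import combinations

ORDER = ["R", "P", "O", "M"]  # fixed item order: Rice, Pulse, Oil, Milk


def join(frequent, k):
    """Join frequent k-itemsets (tuples in ORDER) that share their first k-1 items."""
    itemsets = sorted(frequent, key=lambda s: [ORDER.index(i) for i in s])
    candidates = []
    for a, b in combinations(itemsets, 2):
        if a[:k - 1] == b[:k - 1]:
            candidates.append(a + (b[-1],))
    return candidates


def prune(candidates, frequent, k):
    """Keep candidates whose k-item subsets are all frequent (Apriori property)."""
    frequent_set = set(frequent)
    return [
        c for c in candidates
        if all(subset in frequent_set for subset in combinations(c, k))
    ]


frequent_pairs = [("R", "P"), ("R", "O"), ("P", "O"), ("P", "M")]

joined = join(frequent_pairs, 2)
pruned = prune(joined, frequent_pairs, 2)

print("After join: ", joined)
print("After prune:", pruned)
```

The join produces RPO and POM. The prune then removes POM, because its subset OM is not a frequent pair, so only RPO needs to be counted against the database.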
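Step 5: Calculate Frequency
Itemset   Frequency
RPO       3
POM       at most 2
An itemset can never occur more often than any of its subsets, so RPO is
limited by RO (3) and POM by OM (2). POM also contains the infrequent
pair OM, so the Apriori property rules it out.
The only frequent 3-itemset is RPO.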
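The five steps can be combined into one level-wise loop. This is a compact sketch, not a tuned implementation: the function name `apriori` is illustrative, and the six transactions are hypothetical, chosen to reproduce the single-item and pair counts in the tables. With this data RPO occurs 3 times and POM 2 times; a 3-itemset can never be more frequent than its least frequent pair.

```python
from itertools import combinations

# Hypothetical transactions, chosen to match the item and pair counts above.
TRANSACTIONS = [
    {"Rice", "Pulse", "Oil"},
    {"Pulse", "Oil", "Milk"},
    {"Milk", "Apple"},
    {"Rice", "Pulse", "Milk"},
    {"Rice", "Pulse", "Oil", "Apple"},
    {"Rice", "Pulse", "Oil", "Milk"},
]


def apriori(transactions, min_support):
    """Return {itemset: count} for every itemset that meets min_support."""
    min_count = min_support * len(transactions)

    def count(candidates):
        # One pass over the database per level.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        return {c: n for c, n in counts.items() if n >= min_count}

    items = {item for t in transactions for item in t}
    level = count({frozenset([item]) for item in items})
    frequent = dict(level)
    k = 1
    while level:
        # Join: unions of two frequent k-itemsets that have exactly k + 1 items.
        candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
        # Prune: every k-item subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k))
        }
        level = count(candidates)
        frequent.update(level)
        k += 1
    return frequent


result = apriori(TRANSACTIONS, min_support=0.5)
for itemset, n in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```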
Improving Apriori Efficiency
Some techniques improve performance.
1. Hash-Based Itemset Counting
Uses hashing to reduce the number of candidate itemsets.
2. Transaction Reduction
Transactions that do not contain frequent itemsets are removed from
further analysis.
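Transaction reduction can be sketched as a simple filter. A transaction that contains no frequent k-itemset cannot contain a frequent (k+1)-itemset, so it can be skipped in later passes. The function name `reduce_transactions` and the six transactions are hypothetical and used only for illustration.

```python
def reduce_transactions(transactions, frequent_k_itemsets):
    """Keep only transactions that contain at least one frequent k-itemset."""
    return [
        t for t in transactions
        if any(itemset <= t for itemset in frequent_k_itemsets)
    ]


# Hypothetical transactions, for illustration only.
TRANSACTIONS = [
    {"Rice", "Pulse", "Oil"},
    {"Pulse", "Oil", "Milk"},
    {"Milk", "Apple"},
    {"Rice", "Pulse", "Milk"},
    {"Rice", "Pulse", "Oil", "Apple"},
    {"Rice", "Pulse", "Oil", "Milk"},
]

# Frequent pairs from the worked example: RP, RO, PO, PM.
frequent_pairs = [
    frozenset({"Rice", "Pulse"}),
    frozenset({"Rice", "Oil"}),
    frozenset({"Pulse", "Oil"}),
    frozenset({"Pulse", "Milk"}),
]

reduced = reduce_transactions(TRANSACTIONS, frequent_pairs)
print(len(TRANSACTIONS), "->", len(reduced))
```

Here the transaction containing only Milk and Apple holds no frequent pair, so it is dropped before 3-itemsets are counted.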
Finding Association Rules
To find association rules:
1. Brute Force Method
Analyze all possible rules and calculate support and confidence.
2. Two-Step Approach
Step 1: Find frequent itemsets.
Step 2: Generate association rules from these itemsets.
Example from itemset RPO:
Possible rules:
RP → O
RO → P
PO → R
O → RP
P → RO
R → PO
For a frequent itemset with n items, the number of possible rules is:
2^n − 2
For RPO, n = 3, which gives 2^3 − 2 = 6 rules, matching the list above.
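The rule count can be verified by enumerating the rules directly. In this sketch every non-empty proper subset of the itemset becomes the left-hand side of a rule and the remaining items become the right-hand side, which gives 2^n − 2 rules (2 to the power n, minus 2) for an itemset of n items. The function name `candidate_rules` is illustrative.

```python
from itertools import combinations


def candidate_rules(itemset):
    """Yield (antecedent, consequent) for every non-empty proper subset."""
    items = tuple(itemset)
    for size in range(1, len(items)):
        for antecedent in combinations(items, size):
            consequent = tuple(i for i in items if i not in antecedent)
            yield antecedent, consequent


rules = list(candidate_rules(("R", "P", "O")))
for antecedent, consequent in rules:
    print("".join(antecedent), "->", "".join(consequent))

print("Number of rules:", len(rules))
```

In the two-step approach, each of these candidate rules would then be kept or discarded according to its confidence.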
Advantages of Apriori Algorithm
1. High Scalability
Works well with large datasets.
2. Extensions Available
Many improved versions exist for different applications.
3. Easy to Understand
Simple logic and easy implementation.
4. Works with Unlabeled Data
Useful when data is not categorized.
Disadvantages of Apriori Algorithm
1. High Computational Cost
Requires scanning the entire database multiple times.
2. Large Number of Candidate Itemsets
Can generate many possible combinations, increasing processing
time.