Stemming in Data Mining
What is Stemming?
Stemming is the process of reducing a word to its base or root form.
Example:
- running → run
- fishing → fish
A program that performs this task is called a stemmer.
Stemming strips affixes, mostly suffixes (like -ing, -ed, -ly), to find the root word.
It is widely used in Natural Language Processing (NLP) and data mining.
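The idea of suffix removal can be shown with a minimal sketch. The suffix list and the function name naive_stem are made up for illustration; real stemmers use much larger rule sets.

```python
# Hypothetical suffix list for illustration only.
SUFFIXES = ("ing", "ed", "ly", "es", "s")

def naive_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["fishing", "fished", "quickly", "running"]:
    print(w, "->", naive_stem(w))
```

Note that this sketch turns running into runn, not run, because it does not undo the doubled consonant. Handling such cases is what the rule-based algorithms described later add.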
Why is Stemming Important?
In English, one word can have many forms:
- connect, connected, connecting, connection
If we treat them as different words, the data fills up with near-duplicate entries.
Stemming helps to:
- Reduce data size
- Improve search results
- Make machine learning models more efficient
For example, searching for “fish” should also return:
- fishing
- fishes
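A search that matches on stems could look roughly like the sketch below. The helper names (simple_stem, build_index, search) and the sample documents are assumptions for illustration, not part of any particular search engine.

```python
from collections import defaultdict

def simple_stem(word):
    """Toy stemmer (assumption): strips a few common suffixes."""
    word = word.lower()
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def build_index(documents):
    """Map each stem to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(documents):
        for token in text.split():
            index[simple_stem(token)].add(doc_id)
    return index

def search(index, query):
    """Stem the query the same way, then look it up."""
    return sorted(index.get(simple_stem(query), set()))

docs = ["He went fishing", "Two fishes swam", "A bird sang"]
index = build_index(docs)
print(search(index, "fish"))  # [0, 1]
```

Because both the documents and the query pass through the same stemmer, the query fish finds the documents containing fishing and fishes.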
Where is Stemming Used?
Stemming is used in:
- Search engines (like Google)
- Text mining
- SEO (Search Engine Optimization)
- Data analysis
- Information retrieval systems
Errors in Stemming
1. Over-Stemming
Different words are incorrectly reduced to the same root.
Example:
universe and university → same stem
This is called a false positive.
2. Under-Stemming
Words that should share a root are left with different stems.
Example:
connect and connection → different stems
This is called a false negative.
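Both error types can be checked with a small test harness. The sketch below assumes a deliberately weak toy stemmer (toy_stem) and a hypothetical helper (classify) that is told whether two words are really related.

```python
def toy_stem(word):
    """Toy stemmer (assumption): knows only a handful of suffixes."""
    for suffix in ("ity", "ing", "ed", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def classify(stem, a, b, related):
    """Compare the stems of a and b against the expected relationship."""
    same = stem(a) == stem(b)
    if same and not related:
        return "over-stemming"   # false positive: unrelated words merged
    if not same and related:
        return "under-stemming"  # false negative: related words kept apart
    return "ok"

print(classify(toy_stem, "universe", "university", related=False))
print(classify(toy_stem, "connect", "connection", related=True))
print(classify(toy_stem, "connected", "connecting", related=True))
```

With this toy stemmer the three calls report over-stemming, under-stemming, and ok, in that order.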
History of Stemming
- The first published stemmer was developed by Julie Beth Lovins (1968)
- Martin Porter published his own algorithm in 1980
- The Porter stemmer became the most widely used method
Types of Stemming Algorithms
1. Truncation (Rule-Based Methods)
These methods remove affixes (mostly suffixes) using fixed rules.
a. Lovins Stemmer
- Removes the longest suffix first
- Very fast
Example:
- sitting → sitt → sit
Advantages:
- Fast execution
- Handles irregular words
Limitations:
- Can produce incorrect stems
- Requires large suffix list
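The longest-match idea can be sketched as follows. The ending list here is a tiny made-up subset and the recoding step is simplified, so this is only an illustration in the style of Lovins, not the published algorithm.

```python
# Hypothetical tiny subset of endings; the real list is far larger.
ENDINGS = ("ational", "ation", "ing", "ed", "s")

def lovins_like_stem(word):
    word = word.lower()
    # Try endings from longest to shortest so the longest match wins.
    for ending in sorted(ENDINGS, key=len, reverse=True):
        if word.endswith(ending) and len(word) - len(ending) >= 2:
            word = word[: -len(ending)]
            break
    # Simplified recoding step: undouble a final doubled consonant.
    if len(word) >= 2 and word[-1] == word[-2] and word[-1] in "bdglmnprt":
        word = word[:-1]
    return word

print(lovins_like_stem("sitting"))     # sit
print(lovins_like_stem("relational"))  # rel
```

The second call shows why the longest ending matters: relational loses -ational in one pass, and it also shows how easily this approach produces a stem that is not a real word.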
b. Porter Stemmer
- Most popular algorithm
- Uses step-by-step rules
Example:
agreed → agree (rule EED → EE; later steps can shorten the stem further, e.g. to agre)
Advantages:
- Good accuracy
- Widely used
Limitations:
- Output may not always be a real word
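Porter's rules are guarded by a "measure" m, the number of vowel-consonant sequences in the stem. The sketch below computes m and applies only the single rule "(m > 0) EED → EE", the one that turns agreed into agree. It is not the full algorithm, and the function names are chosen here for illustration.

```python
def is_consonant(word, i):
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # 'y' is a consonant at the start of a word or after a vowel,
        # and a vowel when it follows a consonant.
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """Count vowel-to-consonant transitions: the m in [C](VC)^m[V]."""
    pattern = "".join(
        "c" if is_consonant(stem, i) else "v" for i in range(len(stem))
    )
    return pattern.count("vc")

def step_eed(word):
    """Apply only the '(m > 0) EED -> EE' rule."""
    if word.endswith("eed") and measure(word[:-3]) > 0:
        return word[:-1]
    return word

print(measure("tree"), measure("trouble"), measure("troubles"))  # 0 1 2
print(step_eed("agreed"))  # agree
print(step_eed("feed"))    # feed
```

The measure condition is what protects short words: feed is left alone because the part before -eed contains no vowel-consonant sequence.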
c. Paice/Husk Stemmer
- Uses iterative rules
- Repeatedly applies transformations
Advantages:
- Flexible and powerful
Limitations:
- Can cause over-stemming
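The iterative style can be sketched with a small rule table in which each rule says whether stemming should continue afterwards. The rules below are invented for illustration and are not the real Paice/Husk table.

```python
# (ending, replacement, keep_going): hypothetical rules for illustration.
RULES = [
    ("ies", "y", False),
    ("ing", "", True),
    ("ed", "", True),
    ("s", "", True),
    ("tt", "t", False),
]

def iterative_stem(word, min_len=3):
    word = word.lower()
    changed = True
    while changed:
        changed = False
        for ending, replacement, keep_going in RULES:
            candidate = word[: len(word) - len(ending)] + replacement
            if word.endswith(ending) and len(candidate) >= min_len:
                word = candidate
                changed = keep_going  # a "stop" rule ends the loop
                break
    return word

print(iterative_stem("sittings"))  # sit
print(iterative_stem("bleeding"))  # ble
```

The second result illustrates the over-stemming risk: bleeding first loses -ing, then the repeated pass also strips -ed from bleed.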
d. Dawson Stemmer
- Improved version of Lovins
- Uses large suffix database
Advantages:
- High accuracy
Limitations:
- Complex to implement
2. Statistical Methods
These methods use data patterns instead of rules.
a. N-Gram Stemmer
- Breaks words into character groups
Example (n=2):
INTRO → IN, NT, TR, RO
Advantages:
- Language independent
Limitations:
- Requires more memory and time
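A minimal sketch of the n-gram idea is shown below. Words are split into character n-grams and compared by how many n-grams they share. The Dice coefficient is one common similarity measure for this; its use here, and the function names, are assumptions for illustration.

```python
def ngrams(word, n=2):
    """Return the overlapping character n-grams of a word."""
    word = word.upper()
    return [word[i : i + n] for i in range(len(word) - n + 1)]

def dice(a, b, n=2):
    """Dice coefficient over the sets of n-grams of a and b."""
    grams_a, grams_b = set(ngrams(a, n)), set(ngrams(b, n))
    if not grams_a or not grams_b:
        return 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))

print(ngrams("INTRO"))  # ['IN', 'NT', 'TR', 'RO']
print(round(dice("connect", "connected"), 3))  # 0.857
print(dice("connect", "fish"))  # 0.0
```

Words whose similarity is above a chosen threshold can then be treated as variants of the same stem, without any language-specific rules.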
b. HMM Stemmer (Hidden Markov Model)
- Uses probability to split words into root + suffix
Advantages:
- No language rules required
Limitations:
- Complex method
- May over-stem
c. YASS Stemmer
- Groups similar words using clustering
Advantages:
- No linguistic knowledge needed
Limitations:
- Depends heavily on dataset
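The clustering idea can be sketched as follows. The distance used here (based on shared prefix length) is a simple stand-in and not the distance measure defined by YASS, and the greedy grouping and the threshold are assumptions for illustration.

```python
def prefix_distance(a, b):
    """1 - (shared prefix length / longer word length); 0 means identical."""
    if not a and not b:
        return 0.0
    shared = 0
    for x, y in zip(a, b):
        if x != y:
            break
        shared += 1
    return 1 - shared / max(len(a), len(b))

def cluster(words, threshold=0.5):
    """Greedy grouping: a word joins a group only if it is close to every member."""
    clusters = []
    for word in words:
        for group in clusters:
            if all(prefix_distance(word, other) <= threshold for other in group):
                group.append(word)
                break
        else:
            clusters.append([word])
    return clusters

words = ["connect", "connected", "connection", "fish", "fishing", "fishes"]
print(cluster(words))
```

On this word list the sketch produces two groups, one for the connect family and one for the fish family. Each group can then be represented by a single stem.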
3. Linguistic (Advanced Methods)
a. Krovetz Stemmer
- Converts words into real dictionary words
Steps:
- Plural → singular
- Past tense → present tense
- Remove -ing
Advantages:
- Produces meaningful words
Limitations:
- Needs dictionary
- Cannot handle unknown words
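The dictionary check can be sketched like this. The tiny lexicon and the candidate rewrites are assumptions for illustration. The sketch only shows the idea of accepting a stem when it is a known word, not the actual Krovetz rules.

```python
# Hypothetical tiny lexicon; a real system needs a full dictionary.
DICTIONARY = {"run", "fish", "agree", "city", "walk"}

def dictionary_stem(word):
    word = word.lower()
    if word in DICTIONARY:
        return word
    # Candidate rewrites in the order of the steps: plural, past tense, -ing.
    candidates = []
    if word.endswith("ies"):
        candidates.append(word[:-3] + "y")
    if word.endswith("es"):
        candidates.append(word[:-2])
    if word.endswith("s"):
        candidates.append(word[:-1])
    if word.endswith("ed"):
        candidates += [word[:-2], word[:-1]]
    if word.endswith("ing"):
        stem = word[:-3]
        candidates += [stem, stem + "e"]
        if len(stem) >= 2 and stem[-1] == stem[-2]:
            candidates.append(stem[:-1])  # running -> runn -> run
    # Accept the first candidate that is a real dictionary word.
    for candidate in candidates:
        if candidate in DICTIONARY:
            return candidate
    return word  # unknown words are left unchanged

for w in ["cities", "agreed", "running", "blorfing"]:
    print(w, "->", dictionary_stem(w))
```

The last word shows the limitation listed above: blorfing is not covered by the dictionary, so it comes back unchanged.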
b. Xerox Analyzer
- Uses linguistic databases
- Converts words to proper base forms
Example:
children → child
better → good
c. Corpus-Based Stemmer
- Uses real text data (corpus) to decide stems
Advantages:
- More accurate for specific datasets
Limitations:
- Time-consuming
- Needs large data
d. Context-Sensitive Stemmer
- Uses context (sentence meaning) before stemming
Advantages:
- Improves search accuracy
Limitations:
- Complex
- High processing time