Stemming in Data Mining

Balaji. K

What is Stemming?

Stemming is the process of reducing a word to its base or root form.

Example:
  •  running → run
  •  fishing → fish
A program that performs this task is called a stemmer.

Stemming strips affixes from a word, most often suffixes (like -ing, -ed, -ly), to find the root. It is widely
used in Natural Language Processing (NLP) and data mining.
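
As a minimal sketch of the idea, the function below strips the three suffixes mentioned above. The name naive_stem and the min_stem_len parameter are made up for this illustration; real stemmers use many more rules.

```python
# Illustrative suffix list taken from the text above; not a complete rule set.
SUFFIXES = ("ing", "ed", "ly")


def naive_stem(word: str, min_stem_len: int = 3) -> str:
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


for w in ("fishing", "jumped", "quickly", "running"):
    print(w, "->", naive_stem(w))
```

Note that "running" comes out as "runn" rather than "run". Plain suffix stripping is not enough on its own, which is why practical stemmers add extra rules such as removing a doubled consonant.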

Why is Stemming Important?

In English, one word can have many forms:
  •  connect, connected, connecting, connection
If we treat them as different words, the same idea is stored and counted several times.

Stemming helps to:
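  •  Reduce vocabulary size (fewer distinct terms to store and index)
  •  Improve search results by matching different forms of the query word
  •  Make machine learning models more efficient by shrinking the number of text features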
For example, searching for “fish” should also return:
  •  fishing
  •  fishes
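
The search example can be sketched with a small index keyed by stem instead of by the raw word. The helper names (toy_stem, build_index, search) and the three sample documents are invented for this sketch.

```python
from collections import defaultdict


def toy_stem(word):
    """Very small suffix stripper, only meant for this example."""
    word = word.lower()
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_index(docs):
    """Map each stem to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[toy_stem(token)].add(doc_id)
    return index


def search(index, query):
    """Stem the query the same way as the documents, then look it up."""
    return sorted(index.get(toy_stem(query), set()))


docs = {1: "fishing at dawn", 2: "three fishes", 3: "fresh bread"}
index = build_index(docs)
print(search(index, "fish"))  # [1, 2]
```

Because the documents and the query go through the same stemmer, "fish" finds both "fishing" and "fishes".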

Where is Stemming Used?

Stemming is used in:
  •  Search engines (like Google)
  •  Text mining
  •  SEO (Search Engine Optimization)
  •  Data analysis
  •  Information retrieval systems

Errors in Stemming

1. Over-Stemming
Words with different meanings are wrongly reduced to the same stem.

Example:
universe and university → same stem
This is called a false positive.

2. Under-Stemming
Words that should share a stem are reduced to different stems.

Example:
connect and connection → different stems
This is called a false negative.
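
Both kinds of error can be counted if we know which words really belong together. The sketch below assumes a hand-made table of "true" groups (GROUPS) and a deliberately weak stemmer (toy_stem); both are hypothetical and exist only to trigger each error once.

```python
from itertools import combinations


def toy_stem(word):
    """Deliberately weak stemmer: knows -ity, -ing, -ed, -e but not -ion."""
    for suffix in ("ity", "ing", "ed", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hypothetical gold labels: word -> the concept it really belongs to.
GROUPS = {
    "universe": "universe",
    "university": "university",
    "connect": "connect",
    "connection": "connect",
    "connected": "connect",
}


def stemming_errors(stem, groups):
    """Return (over_stemmed_pairs, under_stemmed_pairs)."""
    over, under = [], []
    for a, b in combinations(sorted(groups), 2):
        same_stem = stem(a) == stem(b)
        same_group = groups[a] == groups[b]
        if same_stem and not same_group:
            over.append((a, b))    # false positive
        elif same_group and not same_stem:
            under.append((a, b))   # false negative
    return over, under


over, under = stemming_errors(toy_stem, GROUPS)
print("over-stemming:", over)
print("under-stemming:", under)
```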

History of Stemming

  •  The first published stemmer was developed by Julie Beth Lovins (1968)
  •  Martin Porter published his own stemmer in 1980
  •  The Porter Stemmer became the most widely used method

Types of Stemming Algorithms

1. Truncation (Rule-Based Methods)

These methods remove affixes (prefixes or suffixes) from the word.

a. Lovins Stemmer
  •  Removes the longest matching suffix first, in a single pass
  •  Then recodes the remaining stem (for example, undoubling a final consonant)
Example:
  •  sitting → sitt → sit
Advantages:
  •  Fast execution
  •  Handles irregular words
Limitations:
  •  Can produce incorrect stems
  •  Requires large suffix list
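
The longest-suffix-first idea can be sketched as follows. The ending list here is a tiny made-up sample, not the published Lovins list, and the function name lovins_like_stem is only for illustration.

```python
# Tiny illustrative ending list, sorted so the longest ending is tried first.
# The real Lovins stemmer uses a far larger list with a condition per ending.
ENDINGS = sorted(("ational", "ation", "ings", "ing", "ed", "s"), key=len, reverse=True)

# Recoding step: a doubled final consonant loses one letter.
DOUBLES = ("bb", "dd", "gg", "ll", "mm", "nn", "pp", "rr", "ss", "tt")


def lovins_like_stem(word, min_stem_len=2):
    word = word.lower()
    # Step 1: remove the longest ending that leaves a long enough stem.
    for ending in ENDINGS:
        if word.endswith(ending) and len(word) - len(ending) >= min_stem_len:
            word = word[: -len(ending)]
            break
    # Step 2: recode the stem (here: only undoubling).
    if word.endswith(DOUBLES):
        word = word[:-1]
    return word


print(lovins_like_stem("sitting"))   # sit
print(lovins_like_stem("sittings"))  # sit
```

Sorting the endings by length is what makes "ings" win over "s" for "sittings".
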
b. Porter Stemmer
  •  Most popular algorithm
  •  Applies suffix rules in five sequential steps
Example:
agreed → agree (step 1b rule: EED → EE)

Advantages:
  •  Good accuracy
  •  Widely used
Limitations:
  •  Output may not always be a real word (for example, ponies → poni)
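
To show what step-by-step rules look like, here is a sketch of two small pieces of the Porter algorithm: step 1a (plural endings) and the EED → EE rule from step 1b, which depends on Porter's measure m (the number of vowel-consonant sequences in the stem). It is only a fragment written for illustration, so it will not match a complete Porter stemmer on many words.

```python
VOWELS = "aeiou"


def is_consonant(word, i):
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        # 'y' is a vowel when it follows a consonant, otherwise a consonant.
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Porter's m: how many times a vowel is followed by a consonant run."""
    m = 0
    prev_vowel = False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_vowel:
            m += 1
        prev_vowel = not cons
    return m


def step_1a(word):
    """SSES -> SS, IES -> I, SS -> SS, S -> (nothing)."""
    if word.endswith("sses"):
        return word[:-2]
    if word.endswith("ies"):
        return word[:-2]
    if word.endswith("ss"):
        return word
    if word.endswith("s"):
        return word[:-1]
    return word


def step_1b_eed(word):
    """(m > 0) EED -> EE. The other step 1b rules are left out."""
    if word.endswith("eed") and measure(word[:-3]) > 0:
        return word[:-1]
    return word


print(step_1a("caresses"), step_1a("ponies"), step_1a("cats"))  # caress poni cat
print(step_1b_eed("agreed"), step_1b_eed("feed"))               # agree feed
```

"feed" is left alone because the stem "f" has m = 0, while "agr" has m = 1.
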
c. Paice/Husk Stemmer
  •  Applies rules from a rule table iteratively
  •  Each rule removes or replaces an ending and says whether to continue or stop
Advantages:
  •  Flexible and powerful
Limitations:
  •  Can cause over-stemming
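
A rough sketch of the iterative approach is shown below. The rule table is invented for this example and is not the published Paice/Husk rule set; each rule is (ending, replacement, keep_going).

```python
# Hypothetical rules: (ending, replacement, keep_going).
# keep_going=True means "apply the rule, then scan the table again".
RULES = [
    ("ies", "y", False),
    ("ing", "", True),
    ("ed", "", True),
    ("ion", "", True),
    ("at", "ate", False),
    ("s", "", True),
]


def iterative_stem(word, min_len=3):
    word = word.lower()
    changed = True
    while changed:
        changed = False
        for ending, replacement, keep_going in RULES:
            if word.endswith(ending):
                candidate = word[: len(word) - len(ending)] + replacement
                if len(candidate) >= min_len:
                    word = candidate
                    changed = keep_going  # a "stop" rule ends the loop
                    break
    return word


print(iterative_stem("connections"))  # connect
print(iterative_stem("creations"))    # create
print(iterative_stem("stations"))     # state
```

The last line shows the over-stemming risk: repeated rule application turns "stations" into "state", which is a different word.
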
d. Dawson Stemmer
  •  Extension of the Lovins stemmer
  •  Uses a much larger suffix list
Advantages:
  •  High accuracy
Limitations:
  •  Complex to implement

2. Statistical Methods

These methods use data patterns instead of rules.

a. N-Gram Stemmer
  •  Breaks words into overlapping groups of n consecutive characters
Example (n=2):
INTRO → IN, NT, TR, RO

Advantages:
  •  Language independent
Limitations:
  •  Requires more memory and time
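
A short sketch of how n-grams can be used to compare words: split each word into character n-grams and measure the overlap. Dice's coefficient is used here as one common choice of similarity; the function names are made up for this example.

```python
def char_ngrams(word, n=2):
    """Return the overlapping character n-grams of a word."""
    word = word.upper()
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def dice_similarity(a, b, n=2):
    """Dice coefficient on the sets of unique n-grams: 2 * shared / total."""
    grams_a, grams_b = set(char_ngrams(a, n)), set(char_ngrams(b, n))
    if not grams_a and not grams_b:
        return 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


print(char_ngrams("INTRO"))                      # ['IN', 'NT', 'TR', 'RO']
print(dice_similarity("connect", "connected"))   # high: most bigrams are shared
print(dice_similarity("connect", "fishing"))     # 0.0: no bigrams in common
```

Words with a high similarity score can then be grouped under one stem. No suffix list is needed, which is why the method is language independent.
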
b. HMM Stemmer (Hidden Markov Model)
  •  Uses state-transition probabilities to find the most likely split of a word into root + suffix

Advantages:
  •  No language rules required

Limitations:
  •  Complex method
  •  May over-stem

c. YASS Stemmer
  •  Groups similar-looking words using clustering; each cluster is treated as sharing one stem

Advantages:
  •  No linguistic knowledge needed

Limitations:
  •  Depends heavily on dataset
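
The clustering idea can be sketched like this. The distance function below (based on the shared prefix) is a simple stand-in chosen for this example, not one of the distance measures defined for YASS, and the threshold value is arbitrary. Words are assumed to be non-empty.

```python
def prefix_distance(a, b):
    """0.0 for identical words, 1.0 when the first letters already differ."""
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return 1 - common / max(len(a), len(b))


def cluster_words(words, threshold=0.5):
    """Merge the two closest clusters until none is within the threshold."""
    clusters = [[w] for w in words]
    while True:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Complete linkage: use the distance of the farthest pair.
                d = max(prefix_distance(a, b)
                        for a in clusters[i] for b in clusters[j])
                if d <= threshold and (best is None or d < best[0]):
                    best = (d, i, j)
        if best is None:
            return clusters
        _, i, j = best
        clusters[i] += clusters.pop(j)


words = ["connect", "connected", "connection", "fish", "fishing", "fishes"]
clusters = cluster_words(words)
print(clusters)
```

With these six words the result is two clusters, one for the connect family and one for the fish family. A different word list or threshold can change the grouping, which is the dataset dependence mentioned above.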

3. Linguistic (Advanced Methods)


a. Krovetz Stemmer
  •  Converts words into real dictionary words

Steps:
  •  Plural → singular
  •  Past tense → present tense
  •  Remove -ing
After each step, the result is checked against the dictionary.

Advantages:
  •  Produces meaningful words

Limitations:
  •  Needs dictionary
  •  Cannot handle unknown words
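
The steps above can be sketched as a dictionary-checked stemmer. The six-word DICTIONARY and the function name krovetz_like_stem are stand-ins for this example; the real Krovetz stemmer uses a full lexicon and more detailed rules.

```python
# Tiny stand-in dictionary; a real system would load a full word list.
DICTIONARY = {"cat", "walk", "run", "make", "pony", "box"}


def krovetz_like_stem(word):
    word = word.lower()
    if word in DICTIONARY:
        return word
    candidates = []
    # 1. Plural -> singular
    if word.endswith("ies"):
        candidates.append(word[:-3] + "y")      # ponies -> pony
    if word.endswith("es"):
        candidates.append(word[:-2])            # boxes -> box
    if word.endswith("s"):
        candidates.append(word[:-1])            # cats -> cat
    # 2. Past -> present
    if word.endswith("ed"):
        candidates += [word[:-2], word[:-1]]    # walked -> walk, baked -> bake
    # 3. Remove -ing
    if word.endswith("ing"):
        base = word[:-3]
        candidates += [base, base + "e"]        # making -> make
        if len(base) >= 2 and base[-1] == base[-2]:
            candidates.append(base[:-1])        # running -> run
    # Accept a candidate only if it is a real dictionary word.
    for candidate in candidates:
        if candidate in DICTIONARY:
            return candidate
    return word  # unknown word: leave it unchanged


for w in ("ponies", "boxes", "walked", "making", "running", "blorping"):
    print(w, "->", krovetz_like_stem(w))
```

The made-up word "blorping" comes back unchanged because no candidate is in the dictionary, which illustrates the unknown-word limitation.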

b. Xerox Analyzer
  •  Uses linguistic databases
  •  Converts words to proper base forms
Example: 
children → child
better → good 

c. Corpus-Based Stemmer
  •  Uses real text data (corpus) to decide stems

Advantages:
  •  More accurate for specific datasets

Limitations:
  •  Time-consuming
  •  Needs large data

d. Context-Sensitive Stemmer
  •  Looks at the context of a word (the surrounding words and their meaning) before deciding how to stem it

Advantages:
  •  Improves search accuracy

Limitations:
  •  Complex
  •  High processing time