Stemming in Data Mining

Balaji. K

Stemming is the process of reducing a word to its base or root form.

Example:

running → run
fishing → fish

A program that performs this task is called a stemmer.

Stemming strips affixes, mainly suffixes such as -ing, -ed, and -ly, to find the root of a word. It is widely used in Natural Language Processing (NLP) and data mining.
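As a rough illustration of suffix stripping, the sketch below removes a few common endings and undoubles a final consonant. The function name naive_stem and the short suffix list are assumptions made for this example; this is not a published algorithm.

```python
# Hypothetical suffix list for illustration; real stemmers use far larger rule sets.
SUFFIXES = ("ing", "ed", "ly", "es", "s")


def naive_stem(word):
    """Strip the first matching suffix; undouble a consonant left by -ing/-ed."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so short words like "sing" survive.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # "runn" -> "run"; l and s are skipped so "fall" and "pass" stay intact.
            if (
                suffix in ("ing", "ed")
                and len(stem) >= 2
                and stem[-1] == stem[-2]
                and stem[-1] not in "aeiouls"
            ):
                stem = stem[:-1]
            return stem
    return word


print(naive_stem("running"))  # run
print(naive_stem("fishing"))  # fish
```

Even this small version shows why real stemmers need many rules: every exception (short words, doubled letters) has to be handled explicitly.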

Why is Stemming Important?

In English, one word can have many forms:

connect, connected, connecting, connection

If each form is treated as a separate word, the vocabulary fills up with near-duplicate entries that all carry the same meaning.

Stemming helps to:

Reduce data size
Improve search results
Make machine learning models more efficient

For example, searching for “fish” should also return:

fishing
fishes
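One way to get this behavior is to index documents by stem instead of by the exact word. The sketch below is a minimal illustration only: toy_stem is a stand-in stemmer invented for this example, and the three documents are made up.

```python
from collections import defaultdict


def toy_stem(word):
    """Stand-in stemmer (an assumption for this sketch): strips -ing, -es, -s."""
    word = word.lower()
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_index(docs):
    """Map each stem to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[toy_stem(token)].add(doc_id)
    return index


# Made-up documents for the example.
docs = {1: "he went fishing", 2: "three fishes swam", 3: "a bird sang"}
index = build_index(docs)

# The query is stemmed the same way as the documents before lookup.
print(sorted(index[toy_stem("fish")]))  # [1, 2]
```

The key design point is that the query and the documents pass through the same stemmer, so all forms meet at one index entry.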

Where is Stemming Used?

Stemming is used in:

Search engines (like Google)
Text mining
SEO (Search Engine Optimization)
Data analysis
Information retrieval systems

Errors in Stemming

1. Over-Stemming

Different words are reduced to the same root incorrectly.

Example:

Universe and University → same stem

This is called a false positive.

2. Under-Stemming

Words that should have the same root are not reduced properly.

Example:

Connect and Connection → different stems

This is called a false negative.
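Both error types can be counted by comparing a stemmer's output with hand-labeled word groups. The sketch below does this for a deliberately crude stemmer. The names crude_stem and count_errors, the group labels, and the word list are all assumptions for illustration.

```python
from itertools import combinations


def crude_stem(word):
    """Deliberately crude stand-in stemmer: keep only the first six letters."""
    return word.lower()[:6]


# Hand-labeled concept groups (a hypothetical gold standard).
gold = {
    "universe": "UNIVERSE",
    "universal": "UNIVERSE",
    "university": "UNIVERSITY",
    "connect": "CONNECT",
    "connection": "CONNECT",
    "run": "RUN",
    "running": "RUN",
}


def count_errors(stem, gold):
    """Return (over_stemming_pairs, under_stemming_pairs) over all word pairs."""
    over = under = 0
    for a, b in combinations(gold, 2):
        same_stem = stem(a) == stem(b)
        same_group = gold[a] == gold[b]
        if same_stem and not same_group:
            over += 1  # false positive: unrelated words merged
        elif same_group and not same_stem:
            under += 1  # false negative: related words kept apart
    return over, under


print(count_errors(crude_stem, gold))  # (2, 1)
```

Here the two over-stemming pairs come from university being merged with universe and universal, and the one under-stemming pair is run and running.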

History of Stemming

The first published stemmer was developed by Julie Beth Lovins (1968)
Martin Porter published a separate algorithm, the Porter Stemmer (1980)
The Porter Stemmer became the most widely used method

Types of Stemming Algorithms

1. Truncation (Rule-Based Methods)

These methods remove prefixes and suffixes.

a. Lovins Stemmer

Removes the longest suffix first
Very fast

Example:

sitting → sitt → sit

Advantages:

Fast execution
Handles irregular words

Limitations:

Can produce incorrect stems
Requires large suffix list
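The longest-suffix-first idea can be sketched as follows. The ending list here is a tiny hypothetical sample, not the actual Lovins table, and the second step only undoubles a final consonant. The function name and the minimum stem length are also assumptions.

```python
# Tiny hypothetical ending list, sorted so the longest ending is tried first.
ENDINGS = sorted(["ational", "ation", "ings", "ing", "ed", "s"], key=len, reverse=True)
MIN_STEM = 2  # assumed minimum number of letters left after removal


def longest_suffix_stem(word):
    word = word.lower()
    # Step 1: remove the single longest matching ending.
    for ending in ENDINGS:
        if word.endswith(ending) and len(word) - len(ending) >= MIN_STEM:
            word = word[: -len(ending)]
            break
    # Step 2: undouble a final doubled consonant, e.g. "sitt" -> "sit".
    if len(word) >= 2 and word[-1] == word[-2] and word[-1] in "bdgmnprt":
        word = word[:-1]
    return word


print(longest_suffix_stem("sitting"))     # sit
print(longest_suffix_stem("relational"))  # rel
```

The second output shows the limitation noted above: the stem "rel" is not a real word.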

b. Porter Stemmer

Most popular algorithm
Uses step-by-step rules

Example:

agreed → agree

Advantages:

Good accuracy
Widely used

Limitations:

Output may not always be a real word
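Porter's rules are guarded by a measure m, the number of vowel-consonant sequences in the candidate stem. The sketch below computes m and applies one rule from Porter's paper, (m>0) EED → EE. It covers this single rule only and is not a full Porter implementation; the function names are chosen for this example.

```python
def is_consonant(word, i):
    """Porter's definition: 'y' counts as a vowel only when it follows a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Count the VC sequences when the stem is written as [C](VC)^m[V]."""
    m = 0
    prev_vowel = False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_vowel:
            m += 1  # a vowel run has just been closed by a consonant
        prev_vowel = not cons
    return m


def apply_eed_rule(word):
    """(m>0) EED -> EE: fires only if the part before 'eed' has m > 0."""
    if word.endswith("eed") and measure(word[:-3]) > 0:
        return word[:-1]
    return word


print(measure("tree"), measure("trouble"), measure("troubles"))  # 0 1 2
print(apply_eed_rule("agreed"))  # agree
print(apply_eed_rule("feed"))    # feed
```

The measure is what stops short words such as "feed" from being cut down, since the part before the suffix is too short to count as a stem.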

c. Paice/Husk Stemmer

Uses iterative rules
Repeatedly applies transformations

Advantages:

Flexible and powerful

Limitations:

Can cause over-stemming
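The iterative idea can be sketched with a small rule table in which each rule says whether stemming should continue after it fires. The rules, the acceptability check, and the function names below are hypothetical and much simpler than the real Paice/Husk rule set.

```python
# Hypothetical rule table: (ending, replacement, keep_going).
RULES = [
    ("ies", "y", False),
    ("ing", "", True),
    ("ness", "", True),
    ("ful", "", True),
    ("ed", "", True),
    ("s", "", True),
]


def acceptable(stem):
    """Assumed check: at least three letters and at least one vowel."""
    return len(stem) >= 3 and any(c in "aeiou" for c in stem)


def iterative_stem(word):
    word = word.lower()
    keep_going = True
    while keep_going:
        keep_going = False
        for ending, replacement, cont in RULES:
            if word.endswith(ending):
                candidate = word[: len(word) - len(ending)] + replacement
                if acceptable(candidate):
                    word = candidate
                    keep_going = cont  # repeat the loop only if the rule allows it
                    break
    return word


print(iterative_stem("hopefulness"))  # hope
print(iterative_stem("meetings"))     # meet
print(iterative_stem("speeding"))     # spe
```

The last output shows the over-stemming risk: after -ing is removed, the loop also strips -ed from "speed".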

d. Dawson Stemmer

Improved version of Lovins
Uses large suffix database

Advantages:

High accuracy

Limitations:

Complex to implement

2. Statistical Methods

These methods use data patterns instead of rules.

a. N-Gram Stemmer

Breaks words into overlapping groups of n consecutive characters

Example (n=2):

INTRO → IN, NT, TR, RO

Advantages:

Language independent

Limitations:

Requires more memory and time
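A minimal sketch of the n-gram approach: split each word into character n-grams, then score two words by how many unique n-grams they share. The Dice coefficient is used here as the similarity score; the function names are chosen for this example.

```python
def char_ngrams(word, n=2):
    """Return the overlapping character n-grams of a word, in order."""
    word = word.upper()
    return [word[i : i + n] for i in range(len(word) - n + 1)]


def dice_similarity(a, b, n=2):
    """Dice coefficient over the sets of unique n-grams of two words."""
    grams_a = set(char_ngrams(a, n))
    grams_b = set(char_ngrams(b, n))
    if not grams_a or not grams_b:
        return 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


print(char_ngrams("intro"))  # ['IN', 'NT', 'TR', 'RO']
print(dice_similarity("statistics", "statistical"))  # 0.8
```

Words with a high score can then be grouped together and treated as one term. Because only characters are compared, no language-specific rules are needed, but every pair of words has to be scored, which explains the memory and time cost.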

b. HMM Stemmer (Hidden Markov Model)

Uses probability to split words into root + suffix

Advantages:

No language rules required

Limitations:

Complex method
May over-stem

c. YASS Stemmer

Groups similar words using clustering

Advantages:

No linguistic knowledge needed

Limitations:

Depends heavily on dataset
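The clustering idea can be sketched with a toy string distance that rewards a long shared prefix. This is an assumed, simplified distance and a greedy grouping step chosen for illustration; it is not the distance measure or the clustering procedure defined for YASS.

```python
def prefix_distance(a, b):
    """Assumed toy distance: 0 for identical words, 1 when the first letters differ."""
    if not a and not b:
        return 0.0
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return 1 - common / max(len(a), len(b))


def cluster(words, threshold=0.5):
    """Greedy grouping: a word joins a cluster only if it is within the
    threshold of every word already in that cluster."""
    clusters = []
    for word in words:
        for group in clusters:
            if all(prefix_distance(word, member) <= threshold for member in group):
                group.append(word)
                break
        else:
            clusters.append([word])
    return clusters


words = ["connect", "connected", "connection", "fish", "fishing", "fishes"]
print(cluster(words))
# [['connect', 'connected', 'connection'], ['fish', 'fishing', 'fishes']]
```

Each cluster stands for one stem. The result changes with the threshold and with the words present, which is why this kind of method depends heavily on the dataset.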

3. Linguistic (Advanced Methods)

a. Krovetz Stemmer

Converts words into real dictionary words

Steps:

Plural → Singular

Past → Present

Remove -ing

Advantages:

Produces meaningful words

Limitations:

Needs dictionary
Cannot handle unknown words
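The dictionary check can be sketched as follows: a suffix is removed only if the result is a known word. The toy dictionary, the candidate rules, and the function name are assumptions for this example and are far smaller than what a real Krovetz stemmer uses.

```python
# Toy dictionary (an assumption); a real system would load a full lexicon.
DICTIONARY = {"fish", "run", "hope", "city", "connect", "make"}


def dictionary_stem(word):
    word = word.lower()
    if word in DICTIONARY:
        return word
    candidates = []
    # Plural -> singular
    if word.endswith("ies"):
        candidates.append(word[:-3] + "y")  # cities -> city
    if word.endswith("es"):
        candidates.append(word[:-2])  # fishes -> fish
    if word.endswith("s"):
        candidates.append(word[:-1])
    # Past -> present
    if word.endswith("ed"):
        candidates += [word[:-2], word[:-1]]  # connected -> connect, hoped -> hope
    # Remove -ing
    if word.endswith("ing"):
        base = word[:-3]
        candidates += [base, base + "e"]  # making -> make
        if len(base) >= 2 and base[-1] == base[-2]:
            candidates.append(base[:-1])  # running -> run
    # Accept the first candidate that is a real dictionary word.
    for candidate in candidates:
        if candidate in DICTIONARY:
            return candidate
    return word  # unknown word: left unchanged


print(dictionary_stem("cities"))    # city
print(dictionary_stem("hoped"))     # hope
print(dictionary_stem("running"))   # run
print(dictionary_stem("blorfing"))  # blorfing
```

The last call shows the limitation listed above: a word whose root is not in the dictionary comes back unchanged.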

b. Xerox Analyzer

Uses linguistic databases
Converts words to proper base forms

Example:

children → child

better → good

c. Corpus-Based Stemmer

Uses real text data (corpus) to decide stems

Advantages:

More accurate for specific datasets

Limitations:

Time-consuming
Needs large data

d. Context-Sensitive Stemmer

Uses context (sentence meaning) before stemming

Advantages:

Improves search accuracy

Limitations:

Complex
High processing time
