Difference between stemming and lemmatization and where to use each

Stemming and lemmatization are important, basic techniques for any Natural Language Processing project.

You may have noticed that when you type something into Google Search, it shows relevant results not only for the exact expression you typed but also for other possible forms of the words you used.

For example, if you type “mobiles” in the search bar, you likely also want to see results containing “mobile”.

This is done by finding out the root word of a given word. Here “mobile” is the root word of “mobiles”.

This can be done by two possible methods: stemming and lemmatization.


In this post I will cover the following topics:
  • What is stemming
  • How to do Stemming in Python
  • What is Lemmatization
  • How to do Lemmatization in Python
  • Which one is best: lemmatization or stemming?
  • Where to use stemming and where to use Lemmatization

What is Stemming




Stemming converts a word into its stem (root form).

Stemming is a rule-based approach: it strips common prefixes and suffixes from inflected words.

For example, common affixes include the suffixes “es” and “ing” and the prefix “pre”.

If you apply stemming to the word “reading”, it is converted to “read”: the stemmer simply strips the suffix “ing”, which appears in its list of suffix rules.

The same idea can be applied to prefixes.

For Example: “pregame” to “game”
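To make the rule-based idea concrete, here is a minimal sketch of a toy affix stripper. The rule lists `SUFFIXES` and `PREFIXES`, the function `toy_stem` and its `min_stem_len` guard are all made up for illustration; real stemmers such as Porter's use a far more careful rule set.

```python
# Toy rule-based stemmer: an illustration only, not Porter's algorithm.
SUFFIXES = ("ing", "es", "s")  # hypothetical suffix rules, checked in order
PREFIXES = ("pre",)            # hypothetical prefix rules


def toy_stem(word, min_stem_len=3):
    """Strip at most one known prefix and one known suffix from a word."""
    word = word.lower()
    for prefix in PREFIXES:
        # Only strip if enough of the word is left afterwards.
        if word.startswith(prefix) and len(word) - len(prefix) >= min_stem_len:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            word = word[:-len(suffix)]
            break
    return word


for w in ["reading", "pregame", "mobiles", "sing"]:
    print(w, "->", toy_stem(w))
# reading -> read
# pregame -> game
# mobiles -> mobil
# sing -> sing
```

Note that “mobiles” becomes “mobil”, which is not a real word, and that “sing” is left alone only because of the minimum-length guard. Blind rules like these are fast but crude.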

The root form generated by stemming is not necessarily a word by itself, but it can be used to generate words by concatenating the right suffix.

For example: the words study, studies and studying all stem to studi, which is not an English word.

The most common algorithm for stemming is Porter’s algorithm (Porter, 1980). It strips only the suffixes of a word.

Stemming in Python NLTK

NLTK provides interfaces to several well-known stemmers, such as the Porter stemmer, the Lancaster stemmer and the Snowball stemmer.

Here I am using the Porter stemmer.

# Import only the class we need instead of using a wildcard import
import nltk
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# For single word
print(stemmer.stem('builders'))
print(stemmer.stem('good'))
print(stemmer.stem('better'))
print(stemmer.stem('run'))
print(stemmer.stem('ran'))
print(stemmer.stem('running'))

Output:
builder
good
better
run
ran
run


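# For a sentence
# word_tokenize needs NLTK's tokenizer data, e.g. nltk.download('punkt')
sent = 'I have seen this yesterday'
tokens = nltk.word_tokenize(sent)
print('Word')
print(tokens)
# Stem every token of the sentence
stemd_word = [stemmer.stem(token) for token in tokens]
print('Stemmed Form')
print(stemd_word)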

Output:
Word
['I', 'have', 'seen', 'this', 'yesterday']
Stemmed Form
['I', 'have', 'seen', 'thi', 'yesterday']

(Recent NLTK releases lowercase tokens before stemming by default, so 'I' may be printed as 'i'.)
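Since NLTK ships more than one stemmer, it can help to run them side by side. The following is a small sketch rather than a benchmark: the word list and the `stemmers` and `results` dictionaries are my own choices for illustration, and the exact stems may vary between algorithms and NLTK versions.

```python
from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer

# Three stemmers that ship with NLTK; none of them needs extra data downloads.
stemmers = {
    "porter": PorterStemmer(),
    "snowball": SnowballStemmer("english"),
    "lancaster": LancasterStemmer(),
}

words = ["study", "studies", "studying", "pregame"]

# Collect the stem each algorithm produces for each word.
results = {
    name: [stemmer.stem(word) for word in words]
    for name, stemmer in stemmers.items()
}

for name, stems in results.items():
    print(f"{name:10s}", stems)
```

With the Porter stemmer the first three words should all come out as studi, as described earlier, and the “pre” of “pregame” is left in place because Porter strips only suffixes.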

What is Lemmatization




Lemmatization converts a word into its lemma (root form).

Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words. It considers the context and part of speech of a word before stripping anything.

For example, consider the two entries for “saw” listed below:

1. saw [verb] - Past tense of see
2. saw [noun] - Cutting instrument


It normally aims to remove only the inflectional endings of a word.

For the word “saw”, stemming might return just “s”, whereas lemmatization would attempt to return either “see” or “saw” depending on whether the token was used as a verb or a noun.

Lemmatization in Python NLTK


The NLTK lemmatization method is based on WordNet’s built-in morphy function.

This lemmatizer removes affixes only if the resulting word is found in the WordNet lexical resource.

# Requires the WordNet data, e.g. nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

# For single word; pos is the part of speech ('v' = verb, 'a' = adjective)
print(lemmatizer.lemmatize('good', pos='v'))
print(lemmatizer.lemmatize('better', pos='a'))
print(lemmatizer.lemmatize('run', pos='v'))
print(lemmatizer.lemmatize('ran', pos='v'))
print(lemmatizer.lemmatize('running', pos='v'))

Output:
good
good
run
run
run

# For a sentence
import nltk  # needed for word_tokenize

sent = 'I have seen this yesterday'
tokens = nltk.word_tokenize(sent)
# Print the tokens
print(tokens)

# No pos argument is given, so every token is treated as a noun (the default)
lemma_word = [lemmatizer.lemmatize(token) for token in tokens]

# Print lemmatized sentence
print(' '.join(lemma_word))

Output:
['I', 'have', 'seen', 'this', 'yesterday']
I have seen this yesterday

Which one is best: lemmatization or stemming?

Stemming and lemmatization each have their own way of normalizing words.

The difference is that a stemmer operates on a single word without knowledge of the context, and therefore cannot discriminate between words which have different meanings depending on part of speech.

Stemming is usually much faster than lemmatization, since it only applies string rules and needs no vocabulary lookup.

The accuracy of stemming is much lower than that of lemmatization.

Where to use stemming and where to use Lemmatization


It depends on your requirements.

If you are handling a huge amount of text and only want to normalize and analyze it, not visualize it, then you may go with stemming.

But if you want to visualize your normalized text, then you should choose lemmatization, as stems are not necessarily real words.

You can also apply anti-stemming to your stemmed words to get real words back, but in my experience the anti-stemming task takes a huge amount of time to execute.
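One simple way to approach anti-stemming is to remember, for every stem, the most frequent original word that produced it. This is only a sketch of that idea and not necessarily the method I used; the function `build_unstem_table` and the tiny `corpus` are hypothetical names and data for illustration.

```python
from collections import Counter, defaultdict

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()


def build_unstem_table(words):
    """Map each stem to the most frequent original word that produced it."""
    seen = defaultdict(Counter)
    for word in words:
        seen[stemmer.stem(word)][word.lower()] += 1
    return {stem: counts.most_common(1)[0][0] for stem, counts in seen.items()}


# Tiny made-up corpus; a real one would be the text you stemmed.
corpus = ["studies", "studies", "study", "studying",
          "mobiles", "mobile", "mobile"]
table = build_unstem_table(corpus)

# Replace every stem with its most frequent surface form.
stems = [stemmer.stem(word) for word in corpus]
print([table[stem] for stem in stems])
```

Building the table requires a full pass over the original text in addition to the stemming itself, which is part of why this step is expensive on large corpora.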

Conclusion:

In this post I have tried to explain:
  • What is stemming
  • What is Lemmatization
  • How to do Stemming in Python
  • How to do Lemmatization in Python
  • How stemming and lemmatization work
  • Which one to choose based on your requirements

Have Questions?

If you have any question regarding this topic, feel free to comment. I will try my best to answer your questions.
