Extract Custom Keywords using NLTK POS tagger in Python

Parts-of-Speech tagging (POS tagging) is a basic component of almost any NLP task. Parts of speech are also known as word classes or lexical categories. A POS tagger can be used for word indexing, information retrieval and many other applications.

Back in elementary school you learnt the difference between nouns, pronouns, verbs, adjectives etc. These are the parts of speech that combine to form a sentence.

The task of POS tagging is to label each word of a sentence with its appropriate part of speech (noun, pronoun, verb, adjective …).

Related Article: Word similarity matching using soundex in python

How does POS tagging work?

Here’s a simple example:

from nltk import word_tokenize, pos_tag
print(pos_tag(word_tokenize("I love NLP")))

Output:

[('I', 'PRP'), ('love', 'VBP'), ('NLP', 'RB')]

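Here, for the sentence “I love NLP”, the NLTK POS tagger assigned these tags:

I as PRP (pronoun, personal)
love as VBP (verb, present tense, not 3rd person singular)
NLP as RB (adverb)

The first two tags are correct. The RB tag for NLP is a mis-tag, since NLP is used as a noun here; the tagger is statistical and can mislabel unfamiliar words.

How did I get the full forms of those POS tags?

Here is the code to view all possible POS tags in NLTK.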

import nltk
nltk.help.upenn_tagset()

Note: Don’t forget to download the help data/corpus from NLTK.

Related Article: How to download NLTK corpus Manually

If you don’t want to run the code to see them all, I have done it for you.

Here are all the possible NLTK tags with their full forms:

POS tags	Full Form of POS tags	Example
CC	conjunction, coordinating	&, ‘n, and, both etc.
CD	numeral, cardinal	mid-1890, nine-thirty, zero, two etc.
DT	determiner	all, an, another, any, both etc.
EX	existential there	there
FW	foreign word	gemeinschaft, quibusdam, fille etc.
IN	preposition or conjunction, subordinating	astride, among, uppon, whether etc.
JJ	adjective or numeral, ordinal	battery-powered, pre-war, multi-disciplinary etc.
JJR	adjective, comparative	braver, cleaner, brighter etc.
JJS	adjective, superlative	cheapest, closest, darkest etc.
LS	list item marker	SP-44002, SP-4005 etc.
MD	modal auxiliary	can, cannot, could, couldn’t, shouldn’t etc.
NN	noun, common, singular or mass	cabbage, afghan, slick etc.
NNP	noun, proper, singular	Ranzer, Shannon, CTCA, Light etc.
NNPS	noun, proper, plural	Americans, Indians, Australians etc.
NNS	noun, common, plural	undergraduates, scotches, bodyguards etc.
PDT	pre-determiner	all, many, quite, such etc.
POS	genitive marker	', 's etc.
PRP	pronoun, personal	hers, herself, him, himself etc.
PRP$	pronoun, possessive	her, his, mine, my etc.
RB	adverb	occasionally, adventurously, professedly etc.
RBR	adverb, comparative	further, longer, louder etc.
RBS	adverb, superlative	best, biggest, largest etc.
RP	particle	about, along, apart etc.
SYM	symbol	%, &, ‘, ”,*,+ etc.
TO	“to” as preposition or infinitive marker	to
UH	interjection	Goodbye, Gosh, Wow etc.
VB	verb, base form	ask, assemble, assign etc.
VBD	verb, past tense	dipped, halted, registered etc.
VBG	verb, present participle or gerund	telegraphing, judging, erasing etc.
VBN	verb, past participle	used, unsettled, dubbed etc.
VBP	verb, present tense, not 3rd person singular	glisten, obtain, comprise etc.
VBZ	verb, present tense, 3rd person singular	marks, mixes, seals etc.
WDT	WH-determiner	that, what, whatever, which and whichever
WP	WH-pronoun	that, what, whatever, whatsoever, which, who, whom and whosoever
WP$	WH-pronoun, possessive	whose
WRB	Wh-adverb	how, however, whence, whenever, where, whereby, whereever, wherein, whereof and why

Application of POS:

POS tagging has a number of applications: you can use it for indexing words, and you can use the tags as features of a sentence for tasks such as sentiment analysis and entity extraction.
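As a rough illustration of using tags as features, here is a minimal sketch that counts how often each tag occurs in a tagged sentence. The helper name tag_counts is my own, and the tagged sentence is the “I love NLP” output from above, hard-coded so the snippet runs without NLTK.

```python
from collections import Counter

# Tagged output for "I love NLP", copied from the example above
tagged = [('I', 'PRP'), ('love', 'VBP'), ('NLP', 'RB')]

def tag_counts(tagged_sentence):
    """Count how often each POS tag occurs in one tagged sentence.

    Hypothetical helper, not part of NLTK.
    """
    return Counter(tag for _, tag in tagged_sentence)

features = tag_counts(tagged)
print(features['VBP'])  # 1
print(features['NN'])   # 0 (a Counter returns 0 for tags that never occur)
```

Counts like these could be fed to a classifier as one feature per tag, though which tags are useful depends on the task.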

Okay, now back to the topic.

Now you know how to tag the parts of speech of a sentence. But what can you do with the tags?
In this post I will only explain how to extract custom keywords from a sentence using POS tagging.

Extract Custom keywords by POS tagging:

Let’s say you have some sentences like:

“I am using Mi note5 it is working great”

“My Samsung s7 is hanging very often”

“My friend is using Motorola g5 for last 5 years, he is happy with it”

In this case, let’s say I want to extract only the company names (Mi, Samsung and Motorola).

Let’s do this.

First, let me check the tags for those sentences:

from nltk import word_tokenize, pos_tag

comment = ["I am using Mi note5 it is working great",
           "My Samsung s7 is hanging very often",
           "My friend is using Motorola g5 for last 5 years, he is happy with it"]

# Print the POS tags of each sentence, followed by blank lines
for sentence in comment:
    print(pos_tag(word_tokenize(sentence)))
    print('\n')

Output:

[('I', 'PRP'), ('am', 'VBP'), ('using', 'VBG'), ('Mi', 'NNP'), ('note5', 'NN'), ('it', 'PRP'), ('is', 'VBZ'), ('working', 'VBG'), ('great', 'JJ')]


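[('My', 'PRP$'), ('Samsung', 'NNP'), ('s7', 'NN'), ('is', 'VBZ'), ('hanging', 'VBG'), ('very', 'RB'), ('often', 'RB')]


[('My', 'PRP$'), ('friend', 'NN'), ('is', 'VBZ'), ('using', 'VBG'), ('Motorola', 'NNP'), ('g5', 'NN'), ('for', 'IN'), ('last', 'JJ'), ('5', 'CD'), ('years', 'NNS'), (',', ','), ('he', 'PRP'), ('is', 'VBZ'), ('happy', 'JJ'), ('with', 'IN'), ('it', 'PRP')]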

You can see that all the entities I want to extract fall under the “NNP” tag. So I will extract only the “NNP” tagged words.

Extracting all proper nouns (NNP) from the sentences using NLTK

# Keep only the tokens tagged as proper nouns (NNP) in each sentence
for sentence in comment:
    tagged_comment = pos_tag(word_tokenize(sentence))
    print([(word, tag) for word, tag in tagged_comment if tag == 'NNP'])

Output:

[('Mi', 'NNP')]
[('Samsung', 'NNP')]
[('Motorola', 'NNP')]

See, now I am able to extract the entities (Mi, Samsung and Motorola) that I was looking for.
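The same filter works for any tag or set of tags. Below is a minimal sketch of that idea; extract_by_tags is a hypothetical helper of my own, and the tags NLTK produced for the third example sentence are hard-coded so the snippet runs on its own.

```python
def extract_by_tags(tagged_sentence, wanted_tags):
    """Return the words whose POS tag is in wanted_tags, in sentence order.

    Hypothetical helper, not part of NLTK.
    """
    wanted = set(wanted_tags)
    return [word for word, tag in tagged_sentence if tag in wanted]

# Tags for "My friend is using Motorola g5 for last 5 years, he is happy with it"
tagged = [('My', 'PRP$'), ('friend', 'NN'), ('is', 'VBZ'), ('using', 'VBG'),
          ('Motorola', 'NNP'), ('g5', 'NN'), ('for', 'IN'), ('last', 'JJ'),
          ('5', 'CD'), ('years', 'NNS'), (',', ','), ('he', 'PRP'),
          ('is', 'VBZ'), ('happy', 'JJ'), ('with', 'IN'), ('it', 'PRP')]

print(extract_by_tags(tagged, {'NNP'}))       # ['Motorola']
print(extract_by_tags(tagged, {'CD', 'JJ'}))  # ['last', '5', 'happy']
```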

Automatic Keyword extraction using Topica in Python

Automatic Keyword extraction using RAKE in Python

Extract patterns from lists of POS tagged words in NLTK:

Now I also want to extract the model numbers of those phones. Let’s do this.

“I am using Mi note5 it is working great”

“My Samsung s7 is hanging very often”

“My friend is using Motorola g5 for last 5 years, he is happy with it”

If you look at the POS tags carefully, you’ll find that all the model numbers fall under the “NN” tag.

[('I', 'PRP'), ('am', 'VBP'), ('using', 'VBG'), ('Mi', 'NNP'), ('note5', 'NN'), ('it', 'PRP'), ('is', 'VBZ'), ('working', 'VBG'), ('great', 'JJ')]


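[('My', 'PRP$'), ('Samsung', 'NNP'), ('s7', 'NN'), ('is', 'VBZ'), ('hanging', 'VBG'), ('very', 'RB'), ('often', 'RB')]


[('My', 'PRP$'), ('friend', 'NN'), ('is', 'VBZ'), ('using', 'VBG'), ('Motorola', 'NNP'), ('g5', 'NN'), ('for', 'IN'), ('last', 'JJ'), ('5', 'CD'), ('years', 'NNS'), (',', ','), ('he', 'PRP'), ('is', 'VBZ'), ('happy', 'JJ'), ('with', 'IN'), ('it', 'PRP')]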

This time, extracting every “NN” tag will give us some unwanted words.

(As you can see, in the third sentence “friend” is also tagged “NN”.)
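To make the problem concrete, here is a small sketch that keeps only the “NN” words. The NN and NNP entries are copied from the tagger output for the three sentences (other tokens are omitted for brevity), and words_with_tag is a hypothetical helper name.

```python
def words_with_tag(tagged_sentence, tag):
    """Return the words carrying exactly the given POS tag (hypothetical helper)."""
    return [word for word, t in tagged_sentence if t == tag]

# Only the noun entries of each tagged sentence, in their original order
tagged_comments = [
    [('Mi', 'NNP'), ('note5', 'NN')],
    [('Samsung', 'NNP'), ('s7', 'NN')],
    [('friend', 'NN'), ('Motorola', 'NNP'), ('g5', 'NN')],
]

for tagged in tagged_comments:
    print(words_with_tag(tagged, 'NN'))
# ['note5']
# ['s7']
# ['friend', 'g5']
```

The third result contains “friend” alongside the model number, so the tag alone is not enough.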

But if you look at it another way, you will find a pattern:

Model numbers appear just after the company name.

So we can use logic like this:

If an “NN” comes just after an “NNP”, then extract both the “NNP” and the “NN”.

Let’s do that.

# Function to extract pairs of adjacent words that match two POS tags
def match2(token_pos, pos1, pos2):
    for subsen in token_pos:
        # Pair each (word, tag) with the one that follows it
        for (word1, tag1), (word2, tag2) in zip(subsen, subsen[1:]):
            if tag1 == pos1 and tag2 == pos2:
                yield "{} {}".format(word1, word2)

# Print company and model no. for each sentence
for sentence in comment:
    tokens = word_tokenize(sentence)  # Generate list of tokens
    tokens_pos = pos_tag(tokens)      # Tag each token
    print(list(match2([tokens_pos], 'NNP', 'NN')))

Output:

['Mi note5']
['Samsung s7']
['Motorola g5']

Yes, now we got exactly what we wanted.
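The same idea can be extended to tag patterns of any length. The sketch below is one possible way to do it with a sliding window; match_pattern is a hypothetical helper of my own, and the tagged third sentence is hard-coded so the snippet runs without NLTK.

```python
def match_pattern(tagged_sentence, pattern):
    """Yield each run of adjacent words whose tags equal `pattern`.

    Hypothetical helper: slides a window of len(pattern) over the sentence.
    """
    pattern = list(pattern)
    n = len(pattern)
    if n == 0:
        return
    for start in range(len(tagged_sentence) - n + 1):
        window = tagged_sentence[start:start + n]
        if [tag for _, tag in window] == pattern:
            yield " ".join(word for word, _ in window)

# Tags for "My friend is using Motorola g5 for last 5 years, he is happy with it"
tagged = [('My', 'PRP$'), ('friend', 'NN'), ('is', 'VBZ'), ('using', 'VBG'),
          ('Motorola', 'NNP'), ('g5', 'NN'), ('for', 'IN'), ('last', 'JJ'),
          ('5', 'CD'), ('years', 'NNS'), (',', ','), ('he', 'PRP'),
          ('is', 'VBZ'), ('happy', 'JJ'), ('with', 'IN'), ('it', 'PRP')]

print(list(match_pattern(tagged, ['NNP', 'NN'])))        # ['Motorola g5']
print(list(match_pattern(tagged, ['JJ', 'CD', 'NNS'])))  # ['last 5 years']
```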

Full code:

from nltk import word_tokenize, pos_tag

# Define a list of comments to test.
comment = ["I am using Mi note5 it is working great",
           "My Samsung s7 is hanging very often",
           "My friend is using Motorola g5 for last 5 years, he is happy with it"]

# Print POS tags of all sentences.
for sentence in comment:
    print(pos_tag(word_tokenize(sentence)))
    print('\n')

# Extract only the company name from all sentences.
for sentence in comment:
    tagged_comment = pos_tag(word_tokenize(sentence))
    print([(word, tag) for word, tag in tagged_comment if tag == 'NNP'])

# Extract both the company name and the model no.

# Function to extract pairs of adjacent words that match two POS tags
def match2(token_pos, pos1, pos2):
    for subsen in token_pos:
        # Pair each (word, tag) with the one that follows it
        for (word1, tag1), (word2, tag2) in zip(subsen, subsen[1:]):
            if tag1 == pos1 and tag2 == pos2:
                yield "{} {}".format(word1, word2)

# Print company and model no. for each sentence
for sentence in comment:
    tokens = word_tokenize(sentence)  # Generate list of tokens
    tokens_pos = pos_tag(tokens)      # Tag each token
    print(list(match2([tokens_pos], 'NNP', 'NN')))

Also Read: Recurrent Neural Network tutorial for Beginners

Conclusion:

In this tutorial you have discovered what POS tagging is and how to apply it with NLTK.

You learned:

How to tokenize a sentence
How to tag Parts-of-Speech
How to extract only nouns (you can apply the same approach to any tag, such as CD, JJ etc.)
How to extract a pattern from a list of POS tagged words

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.


Anindya Naskar
