Extract Custom Keywords Using the NLTK POS Tagger in Python

                                             

                                                      

Parts-of-speech tagging (POS tagging) is one of the basic components of almost any NLP task. Parts of speech are also known as word classes or lexical categories. A POS tagger can be used for word indexing, information retrieval and many other applications.

Back in elementary school you learnt the difference between nouns, pronouns, verbs, adjectives etc. These are the parts of speech that make up a sentence.

The task of POS tagging is to label each word of a sentence with its appropriate part of speech (noun, pronoun, verb, adjective, ...).

Related Article: Word similarity matching using soundex in python

How does POS tagging work?

Here’s a simple example:

from nltk import word_tokenize, pos_tag
print(pos_tag(word_tokenize("I love NLP")))

Output:

[('I', 'PRP'), ('love', 'VBP'), ('NLP', 'RB')]


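For the sentence “I love NLP”, the NLTK POS tagger tagged:

  • I as PRP (pronoun, personal)
  • love as VBP (verb, present tense, not 3rd person singular)
  • NLP as RB (adverb)

Note that the RB tag for NLP is a tagger mistake: in this sentence NLP is a noun, not an adverb. Tagger output is not always correct.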

How did I get the full forms of those POS tags?

Here is the code to view all possible POS tags in NLTK.

import nltk
nltk.help.upenn_tagset()

Note: Don’t forget to download the help data/corpus from NLTK first.

Related Article: How to download NLTK corpus Manually 


If you don’t want to run the code to see them all, I have listed them for you.

Here are all the possible NLTK tags with their full forms:
POS tag | Full form | Examples
CC | conjunction, coordinating | &, 'n, and, both etc.
CD | numeral, cardinal | mid-1890, nine-thirty, zero, two etc.
DT | determiner | all, an, another, any, both etc.
EX | existential there | there
FW | foreign word | gemeinschaft, quibusdam, fille etc.
IN | preposition or conjunction, subordinating | astride, among, uppon, whether etc.
JJ | adjective or numeral, ordinal | battery-powered, pre-war, multi-disciplinary etc.
JJR | adjective, comparative | braver, cleaner, brighter etc.
JJS | adjective, superlative | cheapest, closest, darkest etc.
LS | list item marker | SP-44002, SP-4005 etc.
MD | modal auxiliary | can, cannot, could, couldn't, shouldn't etc.
NN | noun, common, singular or mass | cabbage, afghan, slick etc.
NNP | noun, proper, singular | Ranzer, Shannon, CTCA, Light etc.
NNPS | noun, proper, plural | Americans, Indians, Australians etc.
NNS | noun, common, plural | undergraduates, scotches, bodyguards etc.
PDT | pre-determiner | all, many, quite, such etc.
POS | genitive marker | ', 's etc.
PRP | pronoun, personal | hers, herself, him, himself etc.
PRP$ | pronoun, possessive | her, his, mine, my etc.
RB | adverb | occasionally, adventurously, professedly etc.
RBR | adverb, comparative | further, longer, louder etc.
RBS | adverb, superlative | best, biggest, largest etc.
RP | particle | about, along, apart etc.
SYM | symbol | %, &, ', '', *, + etc.
TO | “to” as preposition or infinitive marker | to
UH | interjection | Goodbye, Gosh, Wow etc.
VB | verb, base form | ask, assemble, assign etc.
VBD | verb, past tense | dipped, halted, registered etc.
VBG | verb, present participle or gerund | telegraphing, judging, erasing etc.
VBN | verb, past participle | used, unsettled, dubbed etc.
VBP | verb, present tense, not 3rd person singular | glisten, obtain, comprise etc.
VBZ | verb, present tense, 3rd person singular | marks, mixes, seals etc.
WDT | WH-determiner | that, what, whatever, which and whichever
WP | WH-pronoun | that, what, whatever, whatsoever, which, who, whom and whosoever
WP$ | WH-pronoun, possessive | whose
WRB | WH-adverb | how, however, whence, whenever, where, whereby, whereever, wherein, whereof and why
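
If you would rather look the full forms up in code than in the table, a small dictionary is enough. This is only a sketch: TAG_DESCRIPTIONS and describe_tags are hypothetical names that are not part of NLTK, and the dictionary holds just a few rows copied from the table above.

```python
# Hypothetical lookup table (not part of NLTK): a few rows copied from the table above.
TAG_DESCRIPTIONS = {
    "PRP": "pronoun, personal",
    "VBP": "verb, present tense, not 3rd person singular",
    "RB": "adverb",
    "NNP": "noun, proper, singular",
    "NN": "noun, common, singular or mass",
}


def describe_tags(tagged_tokens):
    """Return (word, tag, description) triples for a list of (word, tag) pairs."""
    return [
        (word, tag, TAG_DESCRIPTIONS.get(tag, "unknown tag"))
        for word, tag in tagged_tokens
    ]


# Tagger output copied from the "I love NLP" example earlier in this article.
tagged = [("I", "PRP"), ("love", "VBP"), ("NLP", "RB")]

for word, tag, description in describe_tags(tagged):
    print("{}: {} ({})".format(word, tag, description))
```

Tags that are missing from the dictionary come back as "unknown tag", so you can extend the dictionary gradually.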

Applications of POS tagging:

POS tagging has a number of applications: you can use it for indexing words, and you can use the tags as features of a sentence for sentiment analysis, entity extraction etc.
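
To give a rough idea of what “tags as features” can mean, the sketch below counts how often each tag occurs in one tagged sentence. The tag_counts helper is a hypothetical name, the tagged list is hard-coded sample tagger output (the same sentence is used later in this article), and a real sentiment model would need far more than these counts.

```python
from collections import Counter


def tag_counts(tagged_tokens):
    """Count how many times each POS tag occurs in a list of (word, tag) pairs."""
    return Counter(tag for _, tag in tagged_tokens)


# Hard-coded sample of tagger output, so the sketch runs without NLTK installed.
tagged = [("My", "PRP$"), ("Samsung", "NNP"), ("s7", "NN"), ("is", "VBZ"),
          ("hanging", "VBG"), ("very", "RB"), ("often", "RB")]

features = tag_counts(tagged)
print(features["RB"])  # 2 adverbs in this sentence
print(features["JJ"])  # 0, a Counter returns 0 for tags that never occur
```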

Okay, now back to the topic.

Now you know how to tag the parts of speech of a sentence. But what can you do with the tags?
In this article I will only explain how to extract custom keywords from a sentence using POS tagging.

Extract Custom keywords by POS tagging:

Let’s say you have some sentences like:


“I am using Mi note5 it is working great”

“My Samsung s7 is hanging very often”

“My friend is using Motorola g5 for last 5 years, he is happy with it”

In this case, let’s say I want to extract only the company names (Mi, Samsung and Motorola).

Let’s do this.

First, let me check the tags for those sentences:

from nltk import word_tokenize, pos_tag

comment = ["I am using Mi note5 it is working great",
           "My Samsung s7 is hanging very often",
           "My friend is using Motorola g5 for last 5 years, he is happy with it"]

# Tag every sentence and print blank lines after each result
for sentence in comment:
    print(pos_tag(word_tokenize(sentence)))
    print('\n')

Output:

[('I', 'PRP'), ('am', 'VBP'), ('using', 'VBG'), ('Mi', 'NNP'), ('note5', 'NN'), ('it', 'PRP'), ('is', 'VBZ'), ('working', 'VBG'), ('great', 'JJ')]


[('My', 'PRP$'), ('Samsung', 'NNP'), ('s7', 'NN'), ('is', 'VBZ'), ('hanging', 'VBG'), ('very', 'RB'), ('often', 'RB')]


[('My', 'PRP$'), ('friend', 'NN'), ('is', 'VBZ'), ('using', 'VBG'), ('Motorola', 'NNP'), ('g5', 'NN'), ('for', 'IN'), ('last', 'JJ'), ('5', 'CD'), ('years', 'NNS'), (',', ','), ('he', 'PRP'), ('is', 'VBZ'), ('happy', 'JJ'), ('with', 'IN'), ('it', 'PRP')]

You can see that all the entities I wanted to extract are tagged “NNP”. So I will extract only the “NNP”-tagged words.

Extracting all proper nouns (NNP) from text using NLTK:

# Keep only the (word, tag) pairs tagged as proper nouns (NNP)
for sentence in comment:
    token_comment = word_tokenize(sentence)
    tagged_comment = pos_tag(token_comment)
    print([(word, tag) for word, tag in tagged_comment if tag == 'NNP'])

Output:

[('Mi', 'NNP')]
[('Samsung', 'NNP')]
[('Motorola', 'NNP')]

See, now I am able to extract those entities (Mi, Samsung and Motorola), which is what I was trying to do.
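
The same filter works for any tag, or for several tags at once. Here is a minimal sketch, assuming the tagger output is already available as a list of (word, tag) pairs. The filter_by_tags helper is a hypothetical name, and the tagged list is copied from the output of the third sentence above so that the sketch runs without NLTK.

```python
def filter_by_tags(tagged_tokens, wanted_tags):
    """Return the words whose tag is in wanted_tags, keeping sentence order."""
    wanted = set(wanted_tags)
    return [word for word, tag in tagged_tokens if tag in wanted]


# Tagger output for the third sentence, copied from the output shown above.
tagged = [("My", "PRP$"), ("friend", "NN"), ("is", "VBZ"), ("using", "VBG"),
          ("Motorola", "NNP"), ("g5", "NN"), ("for", "IN"), ("last", "JJ"),
          ("5", "CD"), ("years", "NNS"), (",", ","), ("he", "PRP"),
          ("is", "VBZ"), ("happy", "JJ"), ("with", "IN"), ("it", "PRP")]

print(filter_by_tags(tagged, {"NNP"}))       # ['Motorola']
print(filter_by_tags(tagged, {"JJ", "CD"}))  # ['last', '5', 'happy']
```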


Extract patterns from lists of POS tagged words in NLTK:

Now I also want to extract the model numbers of those phones. Let’s do this.
“I am using Mi note5 it is working great”
“My Samsung s7 is hanging very often”
“My friend is using Motorola g5 for last 5 years, he is happy with it”

If you look at the POS tags carefully, you’ll find that all the model numbers are tagged “NN”.
[('I', 'PRP'), ('am', 'VBP'), ('using', 'VBG'), ('Mi', 'NNP'), ('note5', 'NN'), ('it', 'PRP'), ('is', 'VBZ'), ('working', 'VBG'), ('great', 'JJ')]


[('My', 'PRP$'), ('Samsung', 'NNP'), ('s7', 'NN'), ('is', 'VBZ'), ('hanging', 'VBG'), ('very', 'RB'), ('often', 'RB')]


[('My', 'PRP$'), ('friend', 'NN'), ('is', 'VBZ'), ('using', 'VBG'), ('Motorola', 'NNP'), ('g5', 'NN'), ('for', 'IN'), ('last', 'JJ'), ('5', 'CD'), ('years', 'NNS'), (',', ','), ('he', 'PRP'), ('is', 'VBZ'), ('happy', 'JJ'), ('with', 'IN'), ('it', 'PRP')]

This time, extracting the “NN” tag alone would give us some unwanted words
(as you can see, in the third sentence “friend” is also tagged “NN”).
But if you look at it another way, you will find a pattern:
model numbers appear just after the company name.
So we can apply logic like this:

If an “NN” comes just after an “NNP”, then both the “NNP” and the “NN” are extracted.

Let’s do that.

# Function to extract pairs of adjacent words that match two tags

def match2(token_pos, pos1, pos2):
    for subsen in token_pos:
        # Pair each (word, tag) with the one that follows it; zip stops at
        # the last pair, so no index error can occur
        for (word1, tag1), (word2, tag2) in zip(subsen, subsen[1:]):
            if tag1 == pos1 and tag2 == pos2:
                yield "{} {}".format(word1, word2)

# Print company and model no. for each sentence

for sentence in comment:
    tokens = word_tokenize(sentence)  # Generate list of tokens
    tokens_pos = pos_tag(tokens)
    a = [tokens_pos]
    print(list(match2(a, 'NNP', 'NN')))

Output:

['Mi note5']
['Samsung s7']
['Motorola g5']

Yes, now we have exactly what we wanted.
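
The two-tag idea extends to a tag pattern of any length. The sketch below is one possible way to do it, not the only one. The match_pattern helper is a hypothetical name, and the tagged list is copied from the tagger output of the third sentence, so the sketch runs without NLTK.

```python
def match_pattern(tagged_tokens, pattern):
    """Yield the words of every run of consecutive tokens whose tags equal pattern."""
    pattern = list(pattern)
    n = len(pattern)
    if n == 0:
        return  # an empty pattern matches nothing
    # Slide a window of n tokens over the sentence, including the last n tokens
    for start in range(len(tagged_tokens) - n + 1):
        window = tagged_tokens[start:start + n]
        if [tag for _, tag in window] == pattern:
            yield " ".join(word for word, _ in window)


# Tagger output for the third sentence, copied from the output shown earlier.
tagged = [("My", "PRP$"), ("friend", "NN"), ("is", "VBZ"), ("using", "VBG"),
          ("Motorola", "NNP"), ("g5", "NN"), ("for", "IN"), ("last", "JJ"),
          ("5", "CD"), ("years", "NNS"), (",", ","), ("he", "PRP"),
          ("is", "VBZ"), ("happy", "JJ"), ("with", "IN"), ("it", "PRP")]

print(list(match_pattern(tagged, ["NNP", "NN"])))        # ['Motorola g5']
print(list(match_pattern(tagged, ["JJ", "CD", "NNS"])))  # ['last 5 years']
```

With a pattern of two tags this gives the same result as the company and model example above; longer patterns let you pick up phrases such as “last 5 years”.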


Full code:

from nltk import word_tokenize, pos_tag

# Define a list of comments to test.

comment = ["I am using Mi note5 it is working great",
           "My Samsung s7 is hanging very often",
           "My friend is using Motorola g5 for last 5 years, he is happy with it"]

# Print POS tags of all sentences.

for sentence in comment:
    print(pos_tag(word_tokenize(sentence)))
    print('\n')

# Extract only company name from all sentences

for sentence in comment:
    token_comment = word_tokenize(sentence)
    tagged_comment = pos_tag(token_comment)
    print([(word, tag) for word, tag in tagged_comment if tag == 'NNP'])

# Extract company name and model no. both.

# Function to extract pairs of adjacent words that match two tags
def match2(token_pos, pos1, pos2):
    for subsen in token_pos:
        # Pair each (word, tag) with the one that follows it; zip stops at
        # the last pair, so no index error can occur
        for (word1, tag1), (word2, tag2) in zip(subsen, subsen[1:]):
            if tag1 == pos1 and tag2 == pos2:
                yield "{} {}".format(word1, word2)

# Print company and model no. for each sentence

for sentence in comment:
    tokens = word_tokenize(sentence)  # Generate list of tokens
    tokens_pos = pos_tag(tokens)
    a = [tokens_pos]
    print(list(match2(a, 'NNP', 'NN')))

Conclusion:

In this tutorial you have discovered what POS tagging is and how to apply it with NLTK.

You learned:
  • How to tokenize a sentence
  • How to tag parts of speech
  • How to extract only nouns (you can apply the same approach to any other tag, such as CD, JJ etc.)
  • How to extract a pattern from a list of POS-tagged words



Do you have any questions?

Ask your question in the comments below and I will do my best to answer.
