Parts-of-Speech tagging (POS tagging) is one of the basic components of almost any NLP task. Parts of speech are also known as word classes or lexical categories. A POS tagger can be used for word indexing, information retrieval and many other applications.
Back in elementary school you learnt the difference between nouns, pronouns, verbs, adjectives etc. These are the parts of speech that make up a sentence.
The task of POS tagging is to label each word of a sentence with its appropriate part of speech (noun, pronoun, verb, adjective …).
How does POS tagging work?
Here’s a simple example:
from nltk import word_tokenize, pos_tag
print(pos_tag(word_tokenize("I love NLP")))
Output:
[('I', 'PRP'), ('love', 'VBP'), ('NLP', 'RB')]
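Here, for the sentence “I love NLP”, the NLTK POS tagger tagged
- I as PRP (pronoun, personal)
- love as VBP (verb, present tense, not 3rd person singular)
- NLP as RB (adverb)
The first two tags are correct. The RB tag for “NLP” in this output is not: “NLP” is a noun here, so keep in mind that the tagger can mislabel words.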
Now, how did I get the full forms of those POS tags?
Here is the code to view all possible POS tags in NLTK.
import nltk
nltk.help.upenn_tagset()
Note: Don’t forget to download the help data and corpora from NLTK first.
If you don’t want to run the code yourself, I have done it for you.
Here are all the possible NLTK tags with their full forms:
POS tags | Full Form of POS tags | Example |
---|---|---|
CC | conjunction, coordinating | &, ‘n, and, both etc. |
CD | numeral, cardinal | mid-1890, nine-thirty, zero, two etc. |
DT | determiner | all, an, another, any, both etc. |
EX | existential there | there |
FW | foreign word | gemeinschaft, quibusdam, fille etc. |
IN | preposition or conjunction, subordinating | astride, among, uppon, whether etc. |
JJ | adjective or numeral, ordinal | battery-powered, pre-war, multi-disciplinary etc. |
JJR | adjective, comparative | braver, cleaner, brighter etc. |
JJS | adjective, superlative | cheapest, closest, darkest etc. |
LS | list item marker | SP-44002, SP-4005 etc. |
MD | modal auxiliary | can, cannot, could, couldn’t, shouldn’t etc. |
NN | noun, common, singular or mass | cabbage, afghan, slick etc. |
NNP | noun, proper, singular | Ranzer, Shannon, CTCA, Light etc. |
NNPS | noun, proper, plural | Americans, Indians, Australians etc. |
NNS | noun, common, plural | undergraduates, scotches, bodyguards etc. |
PDT | pre-determiner | all, many, quite, such etc. |
POS | genitive marker | ' and 's |
PRP | pronoun, personal | hers, herself, him, himself etc. |
PRP$ | pronoun, possessive | her, his, mine, my etc. |
RB | adverb | occasionally, adventurously, professedly etc. |
RBR | adverb, comparative | further, longer, louder etc. |
RBS | adverb, superlative | best, biggest, largest etc. |
RP | particle | about, along, apart etc. |
SYM | symbol | %, &, ‘, ”,*,+ etc. |
TO | “to” as preposition or infinitive marker | to |
UH | interjection | Goodbye, Gosh, Wow etc. |
VB | verb, base form | ask, assemble, assign etc. |
VBD | verb, past tense | dipped, halted, registered etc. |
VBG | verb, present participle or gerund | telegraphing, judging, erasing etc. |
VBN | verb, past participle | used, unsettled, dubbed etc. |
VBP | verb, present tense, not 3rd person singular | glisten, obtain, comprise etc. |
VBZ | verb, present tense, 3rd person singular | marks, mixes, seals etc. |
WDT | WH-determiner | that, what, whatever, which and whichever |
WP | WH-pronoun | that, what, whatever, whatsoever, which, who, whom and whosoever |
WP$ | WH-pronoun, possessive | whose |
WRB | WH-adverb | how, however, whence, whenever, where, whereby, whereever, wherein, whereof and why |
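If you want the full forms next to the tagger output, a small lookup table is enough. The sketch below is only an illustration: `TAG_DESCRIPTIONS` and `describe_tags` are hypothetical names of my own, not part of NLTK, and the dictionary holds just a hand-copied subset of the table above. The tagged pairs are hard-coded from the “I love NLP” output shown earlier, so the snippet runs without NLTK installed.

```python
# A small, hand-copied subset of the tag table above (not the full tagset).
TAG_DESCRIPTIONS = {
    "PRP": "pronoun, personal",
    "VBP": "verb, present tense, not 3rd person singular",
    "RB": "adverb",
    "NNP": "noun, proper, singular",
    "NN": "noun, common, singular or mass",
}


def describe_tags(tagged_tokens):
    """Return (word, tag, description) triples for a list of (word, tag) pairs."""
    return [
        (word, tag, TAG_DESCRIPTIONS.get(tag, "unknown tag"))
        for word, tag in tagged_tokens
    ]


# Tagger output copied from the "I love NLP" example earlier in the article.
tagged = [("I", "PRP"), ("love", "VBP"), ("NLP", "RB")]
for word, tag, description in describe_tags(tagged):
    print("{:<5} {:<4} {}".format(word, tag, description))
```

Tags that are missing from the dictionary are reported as "unknown tag" instead of raising an error, which is handy while the table is still incomplete.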
Applications of POS tagging:
POS tagging has a number of applications. For example, you can index words by their tags, or use the tags as features of a sentence for sentiment analysis, entity extraction etc.
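As a rough illustration of using tags as features, one simple option is to count how often each tag occurs in a sentence. This is only a sketch of one possible feature, not a recommended sentiment model; `tag_counts` is a hypothetical helper name, and the tagged pairs are hard-coded from the “I love NLP” output above so the snippet does not need NLTK.

```python
from collections import Counter


def tag_counts(tagged_tokens):
    """Count how many times each POS tag occurs in a tagged sentence."""
    return Counter(tag for _, tag in tagged_tokens)


# (word, tag) pairs in the same shape that pos_tag returns.
tagged = [("I", "PRP"), ("love", "VBP"), ("NLP", "RB")]

counts = tag_counts(tagged)
print(counts["VBP"])  # number of VBP tokens in the sentence
print(counts["JJ"])   # a Counter returns 0 for tags that never occur
```

Counts like these can be placed in a feature vector together with other signals; which tags are useful depends on the task.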
Okay, now back to the topic.
Now you know how to tag the parts of speech of a sentence. But what can you do with the tags?
In this post I will only explain how to extract custom keywords from a sentence using POS tagging.
Extract Custom keywords by POS tagging:
Let’s say you have some sentences like:
“I am using Mi note5 it is working great”
“My Samsung s7 is hanging very often”
“My friend is using Motorola g5 for last 5 years, he is happy with it”
In this case, let’s say I want to extract only the brand names (Mi, Samsung and Motorola).
Let’s do this.
First let me check tags for those sentences:
from nltk import word_tokenize, pos_tag

comment = ["I am using Mi note5 it is working great",
           "My Samsung s7 is hanging very often",
           "My friend is using Motorola g5 for last 5 years, he is happy with it"]

# Tag every sentence and print its (word, tag) pairs
for sentence in comment:
    print(pos_tag(word_tokenize(sentence)))
    print('\n')
Output:
[('I', 'PRP'), ('am', 'VBP'), ('using', 'VBG'), ('Mi', 'NNP'), ('note5', 'NN'), ('it', 'PRP'), ('is', 'VBZ'), ('working', 'VBG'), ('great', 'JJ')]
[('My', 'PRP$'), ('Samsung', 'NNP'), ('s7', 'NN'), ('is', 'VBZ'), ('hanging', 'VBG'), ('very', 'RB'), ('often', 'RB')]
[('My', 'PRP$'), ('friend', 'NN'), ('is', 'VBZ'), ('using', 'VBG'), ('Motorola', 'NNP'), ('g5', 'NN'), ('for', 'IN'), ('last', 'JJ'), ('5', 'CD'), ('years', 'NNS'), (',', ','), ('he', 'PRP'), ('is', 'VBZ'), ('happy', 'JJ'), ('with', 'IN'), ('it', 'PRP')]
You can see that all the entities I want to extract come under the “NNP” tag, so I will extract only the words tagged “NNP”.
Extracting all proper nouns (NNP) from the sentences using NLTK
# Extract all proper nouns (NNP) from each sentence using NLTK
for sentence in comment:
    token_comment = word_tokenize(sentence)
    tagged_comment = pos_tag(token_comment)
    print([(word, tag) for word, tag in tagged_comment if tag == 'NNP'])
Output:
[('Mi', 'NNP')]
[('Samsung', 'NNP')]
[('Motorola', 'NNP')]
Now I am able to extract the entities I was looking for (Mi, Samsung and Motorola).
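The same filtering idea works for any tag or set of tags. Here is a minimal sketch, assuming a hypothetical helper named `filter_by_tags` (it is not an NLTK function). The tagged pairs are hard-coded from the output of the third sentence above, so the snippet runs on its own.

```python
def filter_by_tags(tagged_tokens, wanted_tags):
    """Return the words whose POS tag is one of wanted_tags, in sentence order."""
    wanted = set(wanted_tags)
    return [word for word, tag in tagged_tokens if tag in wanted]


# Tagged output of the third example sentence, copied from above.
tagged = [('My', 'PRP$'), ('friend', 'NN'), ('is', 'VBZ'), ('using', 'VBG'),
          ('Motorola', 'NNP'), ('g5', 'NN'), ('for', 'IN'), ('last', 'JJ'),
          ('5', 'CD'), ('years', 'NNS'), (',', ','), ('he', 'PRP'),
          ('is', 'VBZ'), ('happy', 'JJ'), ('with', 'IN'), ('it', 'PRP')]

print(filter_by_tags(tagged, ['NNP']))       # ['Motorola']
print(filter_by_tags(tagged, ['JJ', 'CD']))  # ['last', '5', 'happy']
```

Passing a list of tags instead of a single tag keeps the helper reusable when you later want, say, both adjectives and numerals.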
Extract patterns from lists of POS tagged words in NLTK:
Now I also want to extract the model numbers of those phones. Let’s do this.
“I am using Mi note5 it is working great”
“My Samsung s7 is hanging very often”
“My friend is using Motorola g5 for last 5 years, he is happy with it”
If you look at the POS tags carefully, you’ll find that all the model numbers come under the “NN” tag.
[('I', 'PRP'), ('am', 'VBP'), ('using', 'VBG'), ('Mi', 'NNP'), ('note5', 'NN'), ('it', 'PRP'), ('is', 'VBZ'), ('working', 'VBG'), ('great', 'JJ')]
[('My', 'PRP$'), ('Samsung', 'NNP'), ('s7', 'NN'), ('is', 'VBZ'), ('hanging', 'VBG'), ('very', 'RB'), ('often', 'RB')]
[('My', 'PRP$'), ('friend', 'NN'), ('is', 'VBZ'), ('using', 'VBG'), ('Motorola', 'NNP'), ('g5', 'NN'), ('for', 'IN'), ('last', 'JJ'), ('5', 'CD'), ('years', 'NNS'), (',', ','), ('he', 'PRP'), ('is', 'VBZ'), ('happy', 'JJ'), ('with', 'IN'), ('it', 'PRP')]
This time, extracting every “NN” tag will give us some unwanted words.
(As you can see, in the third sentence “friend” is also tagged “NN”.)
If you look at it another way, you will find a pattern:
the model number appears just after the company name.
So we can apply logic like this:
If an “NN” comes just after an “NNP”, extract both the “NNP” and the “NN”.
Let’s do that.
# Function to extract pairs of adjacent words whose tags are pos1 followed by pos2
def match2(token_pos, pos1, pos2):
    for subsen in token_pos:
        # Pair every token with the token that follows it, up to the final pair
        for (word1, tag1), (word2, tag2) in zip(subsen, subsen[1:]):
            if tag1 == pos1 and tag2 == pos2:
                yield "{} {}".format(word1, word2)

# Print company and model no. for each sentence
for sentence in comment:
    tokens = word_tokenize(sentence)  # Generate list of tokens
    tokens_pos = pos_tag(tokens)
    a = [tokens_pos]
    print(list(match2(a, 'NNP', 'NN')))
Output:
['Mi note5']
['Samsung s7']
['Motorola g5']
Yes, now we have exactly what we wanted.
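If you later need patterns longer than two tags, the same sliding-window idea generalises. The sketch below assumes a hypothetical function named `match_pattern` that accepts a tag sequence of any length; it is my own illustration, not an NLTK API. The tagged pairs are again hard-coded from the third sentence above so the snippet is self-contained.

```python
def match_pattern(tagged_tokens, pattern):
    """Return every run of consecutive words whose tags equal `pattern`."""
    n = len(pattern)
    if n == 0:
        return []
    matches = []
    # Slide a window of len(pattern) tokens across the sentence.
    for start in range(len(tagged_tokens) - n + 1):
        window = tagged_tokens[start:start + n]
        if all(tag == wanted for (_, tag), wanted in zip(window, pattern)):
            matches.append(" ".join(word for word, _ in window))
    return matches


# Tagged output of the third example sentence, copied from above.
tagged = [('My', 'PRP$'), ('friend', 'NN'), ('is', 'VBZ'), ('using', 'VBG'),
          ('Motorola', 'NNP'), ('g5', 'NN'), ('for', 'IN'), ('last', 'JJ'),
          ('5', 'CD'), ('years', 'NNS'), (',', ','), ('he', 'PRP'),
          ('is', 'VBZ'), ('happy', 'JJ'), ('with', 'IN'), ('it', 'PRP')]

print(match_pattern(tagged, ['NNP', 'NN']))        # ['Motorola g5']
print(match_pattern(tagged, ['JJ', 'CD', 'NNS']))  # ['last 5 years']
```

The window runs right up to the last token, so a match at the very end of a sentence is also found.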
Full code:
from nltk import word_tokenize, pos_tag

# Define a list of comments to test.
comment = ["I am using Mi note5 it is working great",
           "My Samsung s7 is hanging very often",
           "My friend is using Motorola g5 for last 5 years, he is happy with it"]

# Print POS tags of all sentences.
for sentence in comment:
    print(pos_tag(word_tokenize(sentence)))
    print('\n')

# Extract only the company name from all sentences
for sentence in comment:
    token_comment = word_tokenize(sentence)
    tagged_comment = pos_tag(token_comment)
    print([(word, tag) for word, tag in tagged_comment if tag == 'NNP'])

# Extract both company name and model no.
# Function to extract pairs of adjacent words whose tags are pos1 followed by pos2
def match2(token_pos, pos1, pos2):
    for subsen in token_pos:
        # Pair every token with the token that follows it, up to the final pair
        for (word1, tag1), (word2, tag2) in zip(subsen, subsen[1:]):
            if tag1 == pos1 and tag2 == pos2:
                yield "{} {}".format(word1, word2)

# Print company and model no. for each sentence
for sentence in comment:
    tokens = word_tokenize(sentence)  # Generate list of tokens
    tokens_pos = pos_tag(tokens)
    a = [tokens_pos]
    print(list(match2(a, 'NNP', 'NN')))
Conclusion:
In this tutorial you have seen what POS tagging is and how to apply it using NLTK.
You learned:
- How to tokenize a sentence
- How to tag Parts-of-Speech
- How to extract only proper nouns (you can apply the same approach to any other tag, such as CD or JJ)
- How to extract a pattern from a list of POS-tagged words
Do you have any questions?
Ask them in the comments below and I will do my best to answer.