You can extract keyword or important words or phrases by various methods like TF-IDF of word, TF-IDF of n-grams, Rule based POS tagging etc.
But all of those need manual effort to find proper logic.
In this topic I will show you how to extract keywords or important terms of given text automatically using package called RAKE in Python which is based on Rapid Automatic Keyword Extraction technique.
Also Read:
After reading this post you will know:
- Why to extract keywords
- Setting up RAKE in python
- Extract keyword using RAKE in Python
- What is happening at background?
Why to extract keywords:
- You can judge a comment or sentence within a second just by looking at keyword of a sentence.
- You can make decision whether the comment or sentence is worth reading or not.
- Further you can categorize the sentence to any category. For example whether a certain comment is about mobile or hotel etc.
- You can also use keywords or entity as a feature for your supervised model to train.
Setting up RAKE in python:
Type pip install python-rake==1.4.4
Or
- 1. Clone https://github.com/zelandiya/RAKE-tutorialthis github repo
- 2. Go to cmd (Windows-key + R-key then type “cmd” hit enter) and type cd the_cloned_folder_name
- 3. Inside this directory (the_cloned_folder_name) work with python.
Extract keyword using RAKE in Python:
Download SmartStoplist.txt from
# Reka
import RAKE
import operator
# Reka setup with stopword directory
stop_dir = "SmartStoplist.txt"
rake_object = RAKE.Rake(stop_dir)
# Sample text to test RAKE
text = """Google quietly rolled out a new way for Android users to listen
to podcasts and subscribe to shows they like, and it already works on
your phone. Podcast production company Pacific Content got the exclusive
on it.This text is taken from Google news."""
# Extract keywords
keywords = rake_object.run(text)
print ("keywords: ", keywords)
Output:
(‘keywords: ‘, [(‘podcast production company pacific content’, 25.0), (‘google quietly rolled’, 8.5), (‘google news’, 4.5), (‘android users’, 4.0), (‘exclusive’, 1.0), (‘works’, 1.0), (‘phone’, 1.0), (‘text’, 1.0), (‘podcasts’, 1.0), (‘subscribe’, 1.0), (‘listen’, 1.0), (‘shows’, 1.0)])
Great! Looks like RAKE is working fine and its giving great accuracy isn’t it?
Now what is happening at background?
How RAKE algorithm works?
Step 1:
First convert all text to lower case (ex: ‘Google’ -> ‘google’ or ‘GOOGLE’ -> ‘google’)
Then split into array of words (tokens) by the specified word delimiters (space, comma, dot etc.)
google quietly rolled out a new way for android users to listen to podcasts and subscribe to shows they like, and it already works on your phone. Podcast production company pacific content got the exclusive on it.this text is taken from google news.
[‘google’,
‘quietly’,
‘rolled’,
‘out’,
‘a’,
‘new’,
‘way’,
‘for’,
‘android’,
‘users’,
‘to’,
‘listen’,
‘to’,
‘podcasts’,
‘and’,
‘subscribe’,
‘to’,
‘shows’,
‘they’,
‘like’,
‘,’,
‘and’,
‘it’,
‘already’,
‘works’,
‘on’,
‘your’,
‘phone’,
‘.’,
‘podcast’,
‘production’,
‘company’,
‘pacific’,
‘Content’,
‘got’,
‘the’,
‘exclusive’,
‘on’,
‘it’,
‘.’,
‘this’,
‘text’,
‘is’,
‘taken’,
‘from’,
‘google’,
‘news’,
‘.’]
Step 2:
Now this array is then split into sequences of contiguous words by phrase delimiters and stop word positions.
Words within a sequence are assigned the same position in the text and together are considered a candidate keyword.
Split by delimiters Split by stop word Candidate Keyword
[‘google’, ‘google’ [‘google’, ‘quietly’]
‘quietly’, ‘quietly’
‘rolled’,
‘out’,
‘a’,
‘new’,
‘way’,
‘for’,
‘android’, ‘android’ [‘android’, ‘users’]
‘users’, ‘users’
‘to’,
‘listen’, ‘listen’ [‘listen’]
‘to’,
‘podcasts’, ‘podcasts’ [‘podcasts’]
‘and’,
‘subscribe’, ‘subscribe’ [‘subscribe’]
‘to’,
‘shows’, ‘shows’ [‘shows’]
‘they’,
‘like’,
‘,’,
‘and’,
‘it’,
‘already’,
‘works’, ‘works’ [‘works’]
‘on’,
‘your’,
‘phone’, ‘phone’ [‘phone’]
‘.’,
‘podcast’, ‘podcast’ [‘podcast’,’production’,’company’,’pacific’,’content’]
‘production’, ‘production’
‘company’, ‘company’
‘pacific’, ‘pacific’
‘content’, ‘content’
‘got’,
‘the’,
‘exclusive’, ‘exclusive’ [‘exclusive’]
‘on’,
‘it’,
‘.’,
‘this’,
‘text’, ‘text’ [‘text’]
‘is’,
‘taken’,
‘from’,
‘google’, ‘google’ [‘google’, ‘news’]
‘news’, ‘news’
‘.’]
Note: Red font character/ words are either stop words or punctuation to be removed.
Now from the above table we have all candidate words:
‘google’, ‘quietly’, ‘rolled’
‘android’, ‘users’
‘listen’
‘podcasts’
‘subscribe’
‘shows’
‘works’
‘phone’
‘podcast’, ‘production’, ‘company’, ‘pacific’, ‘content’
‘exclusive’
‘text’
‘google’, ‘news’
These candidate keywords are nothing but output keywords by RAKE (ex: ‘google quietly rolled’). But RAKE is doing one step further, calculate score for each keyword.
Step 3 (Calculate Keyword Score):
Let’s have a look at individual word appearing to candidate keyword, which are as follows.
‘google’, ‘quietly’, ‘rolled’, ‘android’, ‘users’, ‘listen’, ‘podcasts’, ‘subscribe’, ‘shows’, ‘works’, ‘phone’,‘podcast’, ‘production’, ‘company’, ‘pacific’, ‘content’, ‘exclusive’, ‘text’, ‘google’, ‘news’
RAKE is calculating keyword by taking ratio of degree to frequency or words. Let’s see how RAKE is calculating keyword score.
To do that first we need to draw a word co-occurrence graph.
If you don’t know what is co-occurrence graph:
It is same like term document matrix with one extra count of each word coming in a phrase.
Let’s see how its looks.
Explanation of keyword co-occurrence graph of RAKE:
Let’s take an example; first candidate keyword is ‘google’, ‘quietly’, ‘rolled’
So in table:
google-google = 2 (as ‘google’ appeared 2 time at all content keywords i.e. ‘google guilty rolled’ and ‘google news’)
(google-quietly) / (quietly/google) = 1 (as ‘google’ and ‘quietly’ together appeared 1 time at all content keywords)
Same logic is applicable for each word in the table.
Now to calculate score, we need to calculate two things for each word:
1. Word Frequency (freq(w))
2. Word Degree (deg(w))
Calculate Word frequency (freq(w)):
This is the count says how many times a particular word appeared among all candidate keywords.
Simply take the value of that word-word row.
Example for word ‘google’
freq(google) = 2 (value of row name ‘google’ and column name ‘google’)
Calculate Word Degree (deg(w)):
To calculate word degree for a particular word in the above table sum all numbers row wise.
Example for word ‘google’
deg(google)) = (2 + 1 + 1 + 1) => 5
Now finally we can calculate keyword score.
Keyword score = (deg(w)/freq(w)).
Let’s test it for a one keyword ‘google quietly rolled’
|
google
|
quietly
|
rolled
|
deg(w)
|
5
|
3
|
3
|
freq(w)
|
2
|
1
|
1
|
Keyword score (‘google quietly rolled’) = (5/2 +3/1 +3/1) = (2.5+3+3) = 8.5
Can we recall what was the score RAKE has given?
(‘keywords: ‘, [(‘podcast production company pacific content’, 25.0), (‘google quietly rolled’, 8.5), (‘google news’, 4.5), (‘android users’, 4.0), (‘exclusive’, 1.0), (‘works’, 1.0), (‘phone’, 1.0), (‘text’, 1.0), (‘podcasts’, 1.0), (‘subscribe’, 1.0), (‘listen’, 1.0), (‘shows’, 1.0)])
Yes it is same!!!!
Conclusion:
In this tutorial you learned:
- What is keyword?
- Why to extract keyword?
- Setting up RAKE for python
- Extract keywords using RAKE in python
- How Rapid Automatic Keyword Extraction (RAKE) works
Related Article:
Note:
Though you have seen that RAKE is almost accurately but sometime it won’t if some keyword contain some stop words.
For example, ‘new’ is listed in RAKE’s stopword list. This means that neither ‘New York’ nor ‘New Zealand’ can be ever a keyword.
If you have some question about this topic place those in comment box. I will try my best to answer those.
Ԍreat work! This is the type of infοrmation that
shoulⅾ be shared around the net. Shamе on the search engіnes for
now not positioning thіs post higher! Come on over and dіscuss with my webѕite .
Thank you =)
Thank you so much!
Apρreciating tһe time and effort you put intо your site
and in depth information you proviԁe. It’ѕ great to come across a blog every once
in a while that іsn’t the same old rehashed materiaⅼ.
Fantastic read!
Tһis iѕ my first time visit at here and і am genuinely happу to read all at single place.
Good read about extract keywords from text.
Best article for keyword extraction tutorial.
Fantaѕtic post for nlp for text extraction.
Ⲩou made some dеcent points there for text extraction using nlp.
Appгeciate this post about extract keywords from text using NLP. Will try іt out.
Greate post for keyword extraction. Thanks
Nice post rake keyword extraction
Good article about rake keyword extraction