Automatic Keyword extraction using RAKE in Python

Keyword extraction or entity extraction is widely used to define queries within Information Retrieval (IR) in the field of Natural Language Processing (NLP).

You can extract keywords, that is, important words or phrases, by various methods such as TF-IDF of words, TF-IDF of n-grams, rule-based POS tagging, etc.

But all of these need manual effort to find the proper logic.

In this post I will show you how to extract keywords or important terms from a given text automatically, using a Python package called RAKE, which is based on the Rapid Automatic Keyword Extraction technique.


After reading this post you will know:

Why to extract keywords
Setting up RAKE in Python
Extracting keywords using RAKE in Python
What is happening in the background?

Why to extract keywords:

You can judge a comment or sentence within a second just by looking at its keywords.
You can decide whether the comment or sentence is worth reading or not.
Further, you can assign the sentence to a category, for example whether a certain comment is about a mobile phone or a hotel.
You can also use keywords or entities as features to train a supervised model.

Setting up RAKE in Python:

Type pip install python-rake==1.4.4

Or you can download and install RAKE from https://pypi.org/project/python-rake/#files

1. Clone this GitHub repo: https://github.com/zelandiya/RAKE-tutorial
2. Go to cmd (Windows key + R, then type "cmd" and hit Enter) and type cd the_cloned_folder_name
3. Inside this directory (the_cloned_folder_name), work with Python.

Extract keyword using RAKE in Python:

Download SmartStoplist.txt from

https://raw.githubusercontent.com/zelandiya/RAKE-tutorial/master/data/stoplists/SmartStoplist.txt

# RAKE
import RAKE

# RAKE setup with the path to the stop word file
stop_dir = "SmartStoplist.txt"
rake_object = RAKE.Rake(stop_dir)

# Sample text to test RAKE
text = """Google quietly rolled out a new way for Android users to listen 
to podcasts and subscribe to shows they like, and it already works on 
your phone. Podcast production company Pacific Content got the exclusive 
on it.This text is taken from Google news."""

# Extract keywords as (phrase, score) pairs
keywords = rake_object.run(text)
print("keywords: ", keywords)

Output:

('keywords: ', [('podcast production company pacific content', 25.0), ('google quietly rolled', 8.5), ('google news', 4.5), ('android users', 4.0), ('exclusive', 1.0), ('works', 1.0), ('phone', 1.0), ('text', 1.0), ('podcasts', 1.0), ('subscribe', 1.0), ('listen', 1.0), ('shows', 1.0)])
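Single-word keywords with a score of 1.0 are often less useful than the multi-word phrases. As a minimal sketch, the pairs from the output above can be filtered by score; the helper name filter_keywords and the threshold min_score are assumptions made for illustration, not part of python-rake.

```python
# Keyword/score pairs copied from the output above.
keywords = [
    ("podcast production company pacific content", 25.0),
    ("google quietly rolled", 8.5),
    ("google news", 4.5),
    ("android users", 4.0),
    ("exclusive", 1.0),
    ("works", 1.0),
    ("phone", 1.0),
    ("text", 1.0),
    ("podcasts", 1.0),
    ("subscribe", 1.0),
    ("listen", 1.0),
    ("shows", 1.0),
]


def filter_keywords(pairs, min_score=2.0):
    """Keep only the phrases whose score is at least min_score (hypothetical helper)."""
    return [phrase for phrase, score in pairs if score >= min_score]


top_phrases = filter_keywords(keywords)
print(top_phrases)
```

A sensible threshold depends on your text, so treat 2.0 only as a starting point.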

Great! It looks like RAKE is working fine and it's giving great accuracy, isn't it?


Now, what is happening in the background?

How does the RAKE algorithm work?

Step 1:

First, convert all text to lower case (e.g. 'Google' -> 'google' or 'GOOGLE' -> 'google').

Then split it into an array of words (tokens) by the specified word delimiters (space, comma, dot, etc.).

google quietly rolled out a new way for android users to listen to podcasts and subscribe to shows they like, and it already works on your phone. podcast production company pacific content got the exclusive on it.this text is taken from google news.

['google', 'quietly', 'rolled', 'out', 'a', 'new', 'way', 'for', 'android', 'users',
 'to', 'listen', 'to', 'podcasts', 'and', 'subscribe', 'to', 'shows', 'they', 'like',
 ',', 'and', 'it', 'already', 'works', 'on', 'your', 'phone', '.',
 'podcast', 'production', 'company', 'pacific', 'content', 'got', 'the', 'exclusive',
 'on', 'it', '.', 'this', 'text', 'is', 'taken', 'from', 'google', 'news', '.']
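Step 1 can be sketched in a few lines. This is an illustration only, not the tokenizer that python-rake uses internally; the function name tokenize and the regular expression are assumptions chosen to reproduce the token list above.

```python
import re


def tokenize(text):
    """Lowercase the text, then return words and punctuation marks in order."""
    # \w+ matches a run of word characters; [^\w\s] matches one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text.lower())


sample = ("Podcast production company Pacific Content got the exclusive "
          "on it.This text is taken from Google news.")
tokens = tokenize(sample)
print(tokens)
```

Because the dot is matched as its own token, 'it.This' is split into 'it', '.', 'this' even though there is no space after the dot.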

Step 2:

This array is then split into sequences of contiguous words at phrase delimiters and stop word positions.

Words within a sequence are assigned the same position in the text and together are considered a candidate keyword.

Split by delimiters	Split by stop word	Candidate keyword
'google'	'google'	['google', 'quietly', 'rolled']
'quietly'	'quietly'
'rolled'	'rolled'
'out'	(removed)
'a'	(removed)
'new'	(removed)
'way'	(removed)
'for'	(removed)
'android'	'android'	['android', 'users']
'users'	'users'
'to'	(removed)
'listen'	'listen'	['listen']
'to'	(removed)
'podcasts'	'podcasts'	['podcasts']
'and'	(removed)
'subscribe'	'subscribe'	['subscribe']
'to'	(removed)
'shows'	'shows'	['shows']
'they'	(removed)
'like'	(removed)
','	(removed)
'and'	(removed)
'it'	(removed)
'already'	(removed)
'works'	'works'	['works']
'on'	(removed)
'your'	(removed)
'phone'	'phone'	['phone']
'.'	(removed)
'podcast'	'podcast'	['podcast', 'production', 'company', 'pacific', 'content']
'production'	'production'
'company'	'company'
'pacific'	'pacific'
'content'	'content'
'got'	(removed)
'the'	(removed)
'exclusive'	'exclusive'	['exclusive']
'on'	(removed)
'it'	(removed)
'.'	(removed)
'this'	(removed)
'text'	'text'	['text']
'is'	(removed)
'taken'	(removed)
'from'	(removed)
'google'	'google'	['google', 'news']
'news'	'news'
'.'	(removed)

Note: tokens marked (removed) are either stop words or punctuation.
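The splitting in Step 2 can be sketched as follows. This is not python-rake's own code: the function name candidate_keywords is hypothetical, and STOP_WORDS is a hand-picked subset of SmartStoplist.txt that only covers the stop words occurring in the sample text.

```python
import re

# Assumption: a small subset of SmartStoplist.txt, enough for the sample text.
STOP_WORDS = {
    "out", "a", "new", "way", "for", "to", "and", "they", "like", "it",
    "already", "on", "your", "got", "the", "this", "is", "taken", "from",
}


def candidate_keywords(text, stop_words):
    """Return runs of contiguous words that are neither stop words nor punctuation."""
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    candidates, current = [], []
    for token in tokens:
        is_word = re.fullmatch(r"\w+", token) is not None
        if is_word and token not in stop_words:
            current.append(token)
        elif current:
            # A stop word or punctuation mark closes the current run.
            candidates.append(current)
            current = []
    if current:
        candidates.append(current)
    return candidates


text = """Google quietly rolled out a new way for Android users to listen
to podcasts and subscribe to shows they like, and it already works on
your phone. Podcast production company Pacific Content got the exclusive
on it.This text is taken from Google news."""

candidates = candidate_keywords(text, STOP_WORDS)
for words in candidates:
    print(words)
```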


Now, from the above table, we have all the candidate keywords:

'google', 'quietly', 'rolled'
'android', 'users'
'listen'
'podcasts'
'subscribe'
'shows'
'works'
'phone'
'podcast', 'production', 'company', 'pacific', 'content'
'exclusive'
'text'
'google', 'news'

These candidate keywords are exactly the keywords that RAKE outputs (e.g. 'google quietly rolled'). But RAKE goes one step further and calculates a score for each keyword.

Step 3 (Calculate Keyword Score):

Let's have a look at the individual words appearing in the candidate keywords, which are as follows.

'google', 'quietly', 'rolled', 'android', 'users', 'listen', 'podcasts', 'subscribe', 'shows', 'works', 'phone', 'podcast', 'production', 'company', 'pacific', 'content', 'exclusive', 'text', 'google', 'news'

RAKE scores each word by the ratio of its degree to its frequency. Let's see how RAKE calculates the keyword score from that.

To do that first we need to draw a word co-occurrence graph.

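If you don't know what a co-occurrence graph is:

It is like a term-document matrix, except that both the rows and the columns are words: each cell counts how many candidate keywords contain both words, and each diagonal cell counts how often the word itself appears among the candidate keywords.

Let's see how it looks. The part of the table that covers the words of 'google quietly rolled' and 'google news' is shown here; every other cell in these four rows is 0.

	google	quietly	rolled	news
google	2	1	1	1
quietly	1	1	1	0
rolled	1	1	1	0
news	1	0	0	1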

Explanation of keyword co-occurrence graph of RAKE:

Let's take an example: the first candidate keyword is 'google', 'quietly', 'rolled'.

So in the table:

google-google = 2 (as 'google' appears 2 times among all candidate keywords, i.e. in 'google quietly rolled' and 'google news')


google-quietly and quietly-google = 1 (as 'google' and 'quietly' appear together in 1 candidate keyword)

The same logic applies to every word in the table.
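The co-occurrence counts can be built with two nested loops over each candidate keyword. This is a minimal sketch rather than python-rake's implementation; the function name cooccurrence is hypothetical, and the code assumes that no word is repeated inside a single candidate keyword.

```python
from collections import defaultdict

# Candidate keywords from Step 2.
candidates = [phrase.split() for phrase in [
    "google quietly rolled", "android users", "listen", "podcasts",
    "subscribe", "shows", "works", "phone",
    "podcast production company pacific content", "exclusive", "text",
    "google news",
]]


def cooccurrence(candidates):
    """Count how often each pair of words shares a candidate keyword."""
    matrix = defaultdict(lambda: defaultdict(int))
    for words in candidates:
        for row_word in words:
            for col_word in words:
                # When row_word == col_word this counts the word's own occurrence.
                matrix[row_word][col_word] += 1
    return matrix


matrix = cooccurrence(candidates)
print(matrix["google"]["google"])   # diagonal cell
print(matrix["google"]["quietly"])  # off-diagonal cell
print(dict(matrix["google"]))       # the whole row for 'google'
```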

Now, to calculate the score, we need to calculate two things for each word:

1. Word Frequency (freq(w))

2. Word Degree (deg(w))

Calculate Word frequency (freq(w)):

This count says how many times a particular word appears among all candidate keywords.

Simply take the diagonal value of the table, i.e. the cell whose row and column are both that word.

Example for the word 'google':

freq(google) = 2 (the value in row 'google', column 'google')

Calculate Word Degree (deg(w)):

To calculate the word degree of a particular word, sum all the numbers in its row of the above table.

Example for the word 'google':

deg(google) = (2 + 1 + 1 + 1) = 5
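Both quantities can also be computed directly from the candidate keywords, without building the full table. In this sketch the row sum is obtained by adding the length of every candidate keyword that contains the word; the function names word_frequency and word_degree are hypothetical, and the code again assumes no word repeats inside one candidate keyword.

```python
from collections import Counter

# Candidate keywords from Step 2.
candidates = [phrase.split() for phrase in [
    "google quietly rolled", "android users", "listen", "podcasts",
    "subscribe", "shows", "works", "phone",
    "podcast production company pacific content", "exclusive", "text",
    "google news",
]]


def word_frequency(candidates):
    """freq(w): how many times w occurs across all candidate keywords."""
    return Counter(word for words in candidates for word in words)


def word_degree(candidates):
    """deg(w): row sum of the co-occurrence table for w."""
    degree = Counter()
    for words in candidates:
        for word in words:
            # Each occurrence contributes one count per word in its keyword.
            degree[word] += len(words)
    return degree


freq = word_frequency(candidates)
deg = word_degree(candidates)
print(freq["google"], deg["google"])
print(freq["quietly"], deg["quietly"])
```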

Now, finally, we can calculate the keyword score.

Word score = deg(w) / freq(w), and the keyword score is the sum of the word scores of all words in the keyword.

Let's test it for one keyword, 'google quietly rolled'.

	google	quietly	rolled
deg(w)	5	3	3
freq(w)	2	1	1

Keyword score (‘google quietly rolled’) = (5/2 +3/1 +3/1) = (2.5+3+3) = 8.5

Can you recall the score RAKE gave for this keyword?

Yes, it is the same: 8.5.
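To check the remaining scores as well, the whole of Step 3 can be put into one small function. This is a sketch of the scoring rule described above, not the python-rake source; the name rake_scores is hypothetical.

```python
from collections import Counter


def rake_scores(phrases):
    """Score each phrase as the sum of deg(w) / freq(w) over its words."""
    candidates = [phrase.split() for phrase in phrases]
    freq, deg = Counter(), Counter()
    for words in candidates:
        for word in words:
            freq[word] += 1
            deg[word] += len(words)
    scores = {
        " ".join(words): sum(deg[w] / freq[w] for w in words)
        for words in candidates
    }
    # Highest score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


phrases = [
    "google quietly rolled", "android users", "listen", "podcasts",
    "subscribe", "shows", "works", "phone",
    "podcast production company pacific content", "exclusive", "text",
    "google news",
]
scores = rake_scores(phrases)
for phrase, score in scores:
    print(phrase, score)
```

The four multi-word phrases come out with 25.0, 8.5, 4.5 and 4.0, matching the RAKE output shown earlier; every single-word keyword gets 1.0.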

Conclusion:

In this tutorial you learned:

What a keyword is
Why to extract keywords
Setting up RAKE for Python
Extracting keywords using RAKE in Python
How Rapid Automatic Keyword Extraction (RAKE) works


Note:

Though you have seen that RAKE is accurate most of the time, it sometimes fails when a keyword contains a stop word.

For example, 'new' is listed in RAKE's stop word list. This means that neither 'New York' nor 'New Zealand' can ever be a keyword.
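The effect is easy to reproduce with a tiny stop word set. In the sketch below, the function split_on_stop_words and the three-word stop list are assumptions made for illustration. Removing 'new' from your own copy of the stop list is one possible workaround, not something RAKE does for you.

```python
def split_on_stop_words(text, stop_words):
    """Return the runs of contiguous non-stop words (whitespace tokenization)."""
    phrases, current = [], []
    for word in text.lower().split():
        if word in stop_words:
            if current:
                phrases.append(" ".join(current))
            current = []
        else:
            current.append(word)
    if current:
        phrases.append(" ".join(current))
    return phrases


stop_words = {"from", "to", "new"}  # hypothetical mini stop list
sentence = "flights from New York to New Zealand"

with_new = split_on_stop_words(sentence, stop_words)
without_new = split_on_stop_words(sentence, stop_words - {"new"})
print(with_new)
print(without_new)
```

With 'new' in the stop list the phrases are cut down to 'york' and 'zealand'. Without it, 'new york' and 'new zealand' survive as candidates, at the cost of 'new' also appearing in phrases where it is just an ordinary adjective.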

If you have any questions about this topic, post them in the comment box. I will try my best to answer them.

Anindya Naskar
