Automatic Keyword extraction using Python TextRank

Keywords or entities are a condensed form of a document's content and are widely used to define queries within information retrieval (IR).
Keyword or key phrase extraction can be done with various methods such as TF-IDF of words, TF-IDF of n-grams, rule-based POS tagging, and so on.
But all of those need manual effort to find the proper logic.
In this post I will show you how to do keyword or key phrase extraction automatically using a Python package called PyTextRank, which is based on the TextRank algorithm, a graph-based ranking model inspired by Google's PageRank.
After reading this post you will know:
  • Why to do keyword extraction
  • How to do keyword extraction using TextRank in Python
  • What is happening in the background

Why to do keyword extraction:

  • You can judge a comment or sentence within a second just by looking at its keywords.
  • You can decide whether the comment or sentence is worth reading or not.
  • You can further categorize the sentence, for example whether a certain comment is about mobiles or hotels.
  • You can also use keywords, entities or key phrases as features to train a supervised model.

Setting up PyTextRank in Python:

Type python -m pip install pytextrank in cmd.
Download SmartStoplist.txt from
Keep SmartStoplist.txt inside your working directory (the folder where you are saving your Python scripts).
 

Keyword extraction using PyTextRank in Python:

While doing key phrase extraction, PyTextRank processes the text in two stages.

Stage 1:

In stage 1 it does some text cleaning and preprocessing:
  • Index each word of the text.
  • Lemmatize each word.
  • POS tagging.
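To make these three steps concrete, here is a toy, dependency-free illustration. The LEMMAS and POS tables below are hand-made stand-ins for what spaCy computes inside PyTextRank; they exist purely for illustration.

```python
# Toy version of stage 1: index each word, lemmatize and lowercase it,
# and attach a POS tag. Real lemmas/tags come from spaCy, not these dicts.
LEMMAS = {"Like": "like", "likes": "like", "is": "be"}
POS = {"I": "PRP", "Like": "VBP", "Flipkart": "NNP", ".": "."}

def stage1(tokens):
    rows = []
    for idx, word in enumerate(tokens):
        lemma = LEMMAS.get(word, word).lower()  # lemmatize + lowercase
        tag = POS.get(word, "NN")               # POS tag (default noun)
        rows.append((idx, word, lemma, tag))
    return rows

print(stage1(["I", "Like", "Flipkart", "."]))
# [(0, 'I', 'i', 'PRP'), (1, 'Like', 'like', 'VBP'),
#  (2, 'Flipkart', 'flipkart', 'NNP'), (3, '.', '.', '.')]
```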

Stage 2:

In stage 2, based on some logic, it comes up with important keywords, entities or key phrases together with their ranks.
First let's try to extract key phrases from a sample text in Python; then we will move on to understand how the PyTextRank algorithm works.

Note: If you are getting an error saying:

TypeError                                 Traceback (most recent call last)
<ipython-input-6-7dbb39972cd2> in <module>()
     13
     14 with open(path_stage1, 'w') as f:
---> 15     for graf in pytextrank.parse_doc(pytextrank.json_iter(path_stage0)):
     16         f.write("%s\n" % pytextrank.pretty_print(graf._asdict()))
     17         print(pytextrank.pretty_print(graf._asdict()))

C:\Users\anindya\Anaconda2\lib\site-packages\pytextrank\pytextrank.pyc in parse_doc(json_iter)
    259                 print("graf_text:", graf_text)
    260
--> 261             grafs, new_base_idx = parse_graf(meta["id"], graf_text, base_idx)
    262             base_idx = new_base_idx
    263

C:\Users\anindya\Anaconda2\lib\site-packages\pytextrank\pytextrank.pyc in parse_graf(doc_id, graf_text, base_idx, spacy_nlp)
    191     markup = []
    192     new_base_idx = base_idx
--> 193     doc = spacy_nlp(graf_text, parse=True)
    194
    195     for span in doc.sents:

TypeError: __call__() got an unexpected keyword argument 'parse'

Go to your Anaconda directory and then Lib -> site-packages -> pytextrank.
(For me the full path was C:\Users\anindya\Anaconda2\Lib\site-packages\pytextrank)
Open pytextrank.py and search for "parse=True". You should find this term twice in the whole script.
Remove "parse=True" from those lines.
For example:
Previously:
doc = spacy_nlp(graf_text, parse=True)
After Removing:

doc = spacy_nlp(graf_text)

 
Stage 1:
 
# Pytextrank
import pytextrank
import json

# Sample text
sample_text = 'I Like Flipkart. He likes Amazone. she likes Snapdeal. Flipkart and amazone is on top of google search.'

# Create dictionary to feed into json file

file_dic = {"id" : 0,"text" : sample_text}
file_dic = json.dumps(file_dic)
loaded_file_dic = json.loads(file_dic)

# Create test.json and feed file_dic into it.
with open('test.json', 'w') as outfile:
        json.dump(loaded_file_dic, outfile)

path_stage0 = "test.json"
path_stage1 = "o1.json"

# Extract keyword using pytextrank
with open(path_stage1, 'w') as f:
    for graf in pytextrank.parse_doc(pytextrank.json_iter(path_stage0)):
        f.write("%s\n" % pytextrank.pretty_print(graf._asdict()))
        print(pytextrank.pretty_print(graf._asdict()))



Output:
{"graf": [[0, "I", "i", "PRP", 0, 0], [1, "Like", "like", "VBP", 1, 1], [2, "Flipkart", "flipkart", "NNP", 1, 2], [0, ".", ".", ".", 0, 3]], "id": 0, "sha1": "c09f142dabeb8465f5c30d1c4ef7a0d842ffb99c"}
{"graf": [[0, "He", "he", "PRP", 0, 4], [1, "likes", "like", "VBZ", 1, 5], [3, "Amazone", "amazone", "NNP", 1, 6], [0, ".", ".", ".", 0, 7]], "id": 0, "sha1": "307041869f5cba8a798708d64ee83f290098e00a"}
{"graf": [[0, "she", "she", "PRP", 0, 8], [1, "likes", "like", "VBZ", 1, 9], [4, "Snapdeal", "snapdeal", "NNP", 1, 10], [0, ".", ".", ".", 0, 11]], "id": 0, "sha1": "cfd16187a0ddb43d4e412ae83acc9312fc0da922"}
{"graf": [[2, "Flipkart", "flipkart", "NN", 1, 12], [0, "and", "and", "CC", 0, 13], [3, "amazone", "amazone", "JJ", 1, 14], [5, "is", "be", "VBZ", 1, 15], [0, "on", "on", "IN", 0, 16], [6, "top", "top", "NN", 1, 17], [0, "of", "of", "IN", 0, 18], [7, "google", "google", "NNP", 1, 19], [8, "search", "search", "NN", 1, 20], [0, ".", ".", ".", 0, 21]], "id": 0, "sha1": "f70b4cb49e04ac5586a4b07898212a7bbb673649"}

Stage 2:
path_stage1 = "o1.json"
path_stage2 = "o2.json"

graph, ranks = pytextrank.text_rank(path_stage1)
pytextrank.render_ranks(graph, ranks)

with open(path_stage2, 'w') as f:
    for rl in pytextrank.normalize_key_phrases(path_stage1, ranks):
        f.write("%s\n" % pytextrank.pretty_print(rl._asdict()))
        print(pytextrank.pretty_print(rl))

Output:
["google search", 0.32500626838725255, [7, 8], "np", 1]
["search", 0.16250313419362628, [8], "nn", 1]
["top", 0.12701509697944047, [6], "nn", 1]
["amazone", 0.12383387876447217, [3], "np", 1]
["google", 0.08783938094837528, [7], "nnp", 1]
["snapdeal", 0.08690112036341668, [4], "np", 2]
["flipkart", 0.08690112036341668, [2], "np", 3]



How Does the PyTextRank Algorithm Work?

Stage 1:
Let's recall what happened at stage 1.
Input:

"I Like Flipkart."
Output:

[[0, "I", "i", "PRP", 0, 0], [1, "Like", "like", "VBP", 1, 1], [2, "Flipkart", "flipkart", "NNP", 1, 2], [0, ".", ".", ".", 0, 3]]

Now, what does [0, "I", "i", "PRP", 0, 0] mean?

It implies:

[word_id, raw_word, lemmatize_lower_word, POS_tag, keep, idx]

Now let's understand what those fields are.

word_id: ID from the dictionary of unique words whose part-of-speech tag starts with 'v', 'n' or 'j'.

For example, our sample text was:

'I Like Flipkart. He likes Amazone. she likes Snapdeal. Flipkart and amazone is on top of google search.'

So the dictionary of unique words whose part-of-speech (POS) starts with 'v', 'n' or 'j' will look like:
Word_Id Raw_Word Lemma_Word POS Keep Idx
0 "I" "i" "PRP" 0 0
1 "Like" "like" "VBP" 1 1
2 "Flipkart" "flipkart" "NNP" 1 2
0 "." "." "." 0 3
0 "He" "he" "PRP" 0 4
1 "likes" "like" "VBZ" 1 5
3 "Amazone" "amazone" "NNP" 1 6
0 "." "." "." 0 7
0 "she" "she" "PRP" 0 8
1 "likes" "like" "VBZ" 1 9
4 "Snapdeal" "snapdeal" "NNP" 1 10
0 "." "." "." 0 11
2 "Flipkart" "flipkart" "NN" 1 12
0 "and" "and" "CC" 0 13
3 "amazone" "amazone" "JJ" 1 14
5 "is" "be" "VBZ" 1 15
0 "on" "on" "IN" 0 16
6 "top" "top" "NN" 1 17
0 "of" "of" "IN" 0 18
7 "google" "google" "NNP" 1 19
8 "search" "search" "NN" 1 20
0 "." "." "." 0 21


word_id: After filtering the data based on part of speech (tags starting with v, n or j), word_id is the ID of each unique kept word, assigned in order of appearance in the text.
raw_word: The raw input word, e.g. "Flipkart", "Amazone".


lemmatize_lower_word:

       1. The lowercased version of each word, e.g. "Flipkart" to "flipkart".

       2. The lemmatized version of each word, e.g. "is" to "be", "likes" to "like"; i/he/she are all converted to the same token, whose POS will be -PRON-, etc.

POS_tag: The part-of-speech tag, e.g. verb, adjective.
keep: If the POS starts with 'v', 'n' or 'j' then keep = 1, else 0.
idx: A running index that increases word by word.
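The assignment of word_id, keep and idx described above can be sketched in a few lines of plain Python. This is an illustrative reconstruction of the rule, not PyTextRank's actual code, operating on already-tagged (raw, lemma, POS) tuples:

```python
# Sketch of the stage 1 bookkeeping: lemmas whose POS starts with
# v, n or j get a fresh word_id (shared across repeats) and keep=1;
# everything else gets word_id 0 and keep=0. idx just counts tokens.
def assign_ids(tagged):
    """tagged: list of (raw_word, lemma, pos) tuples."""
    word_ids, rows, next_id = {}, [], 1
    for idx, (raw, lemma, pos) in enumerate(tagged):
        keep = 1 if pos[0].lower() in "vnj" else 0
        if keep:
            if lemma not in word_ids:       # first time we see this lemma
                word_ids[lemma] = next_id
                next_id += 1
            wid = word_ids[lemma]
        else:
            wid = 0
        rows.append([wid, raw, lemma, pos, keep, idx])
    return rows

rows = assign_ids([("I", "i", "PRP"), ("Like", "like", "VBP"),
                   ("Flipkart", "flipkart", "NNP"), (".", ".", ".")])
print(rows)
# [[0, 'I', 'i', 'PRP', 0, 0], [1, 'Like', 'like', 'VBP', 1, 1],
#  [2, 'Flipkart', 'flipkart', 'NNP', 1, 2], [0, '.', '.', '.', 0, 3]]
```

Note that this reproduces the first row of the stage 1 output shown earlier.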
Stage 2:
Stage 2 does its job in two steps.
Step 1:
  • First it draws a weighted graph.
  • Then it calculates the rank of each word using Google's PageRank algorithm.
import networkx as nx
import matplotlib.pyplot as plt

# Draw network plot
nx.draw(graph, with_labels=True)
plt.show()

Output:

 

Word ranks from Google's PageRank algorithm:

ranks = nx.pagerank(graph)
print(ranks)



Output:

{u'be': 0.12312834625371089, u'search': 0.25443914067175, u'google': 0.13753443412982308, u'like': 0.053012554310560484, u'top': 0.19887377734684572, u'snapdeal': 0.06803267671851249, u'amazone': 0.09694639385028486, u'flipkart': 0.06803267671851249}
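To see what nx.pagerank is doing under the hood, here is a minimal, dependency-free PageRank power iteration over a tiny hand-made word graph. The edges below are made up for illustration; in the real pipeline the graph comes from pytextrank.text_rank.

```python
# Minimal PageRank power iteration over a small undirected graph,
# mirroring the idea behind nx.pagerank (damping factor 0.85).
def pagerank(edges, damping=0.85, iters=50):
    nodes = sorted({n for e in edges for n in e})
    nbrs = {n: [] for n in nodes}
    for a, b in edges:                # undirected: add both directions
        nbrs[a].append(b)
        nbrs[b].append(a)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        # Each node redistributes its rank equally among its neighbours.
        rank = {n: (1 - damping) / len(nodes)
                   + damping * sum(rank[m] / len(nbrs[m]) for m in nbrs[n])
                for n in nodes}
    return rank

# Hand-made edges for illustration; "search" is the best-connected node,
# so it ends up with the highest rank.
ranks = pagerank([("search", "google"), ("search", "top"), ("search", "be")])
print(max(ranks, key=ranks.get))  # search
```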

Step 2:

1. Collect keywords:

Select a word if its word_id > 0, the word is present in the PageRank results (the words labelled in the network plot above), the word is a noun or verb (i.e. its POS starts with "N" or "V"), and it is not a stop word.
Word_Id Raw_Word Lemma_Word POS Is_Stop
0 "I" "i" "PRP" False
1 "Like" "like" "VBP" True
2 "Flipkart" "flipkart" "NNP" False
0 "." "." "." False
0 "He" "he" "PRP" False
1 "likes" "like" "VBZ" True
3 "Amazone" "amazone" "NNP" False
0 "." "." "." False
0 "she" "she" "PRP" True
1 "likes" "like" "VBZ" True
4 "Snapdeal" "snapdeal" "NNP" False
0 "." "." "." False
2 "Flipkart" "flipkart" "NN" False
0 "and" "and" "CC" True
3 "amazone" "amazone" "JJ" False
5 "is" "be" "VBZ" True
0 "on" "on" "IN" True
6 "top" "top" "NN" True
0 "of" "of" "IN" True
7 "google" "google" "NNP" False
8 "search" "search" "NN" False
0 "." "." "." False
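The keyword-selection rule can be sketched as a simple filter. The rows and scores below are made up for illustration and this is a reconstruction of the rule described above, not the library's actual code:

```python
# Keep tokens with word_id > 0 whose lemma appears in the PageRank
# results, whose POS starts with N or V, and which are not stop words.
def collect_keywords(rows, ranks):
    """rows: [word_id, raw, lemma, pos, is_stop]; ranks: lemma -> score."""
    return sorted({row[2] for row in rows
                   if row[0] > 0 and row[2] in ranks
                   and row[3][0] in ("N", "V") and not row[4]})

rows = [[2, "Flipkart", "flipkart", "NNP", False],
        [1, "likes", "like", "VBZ", True],       # stop word, dropped
        [3, "amazone", "amazone", "JJ", False],  # adjective, dropped
        [8, "search", "search", "NN", False]]
ranks = {"flipkart": 0.068, "like": 0.053, "amazone": 0.097, "search": 0.254}
print(collect_keywords(rows, ranks))  # ['flipkart', 'search']
```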

2. Collect entities:

Select words after removing cardinal-type entities and stop words.

Word_Id Raw_Word Lemma_Word POS Is_Stop
2 "Flipkart" "flipkart" "NNP" False
3 "Amazone" "amazone" "NNP" False
4 "Snapdeal" "snapdeal" "NNP" False
2 "Flipkart" "flipkart" "NN" False
7 "google" "google" "NNP" False
8 "search" "search" "NN" False

 

3. Collect phrases:

a) Select pairs of words that come one after another, where each word has word_id > 0 and is in the rank list (PageRank).

b) The rank of a phrase will be the rank of the first word of that phrase from PageRank (Google's PageRank).

Word_Id Raw_Word Lemma_Word POS Is_Stop
0 "I" "i" "PRP" False
1 "Like" "like" "VBP" True
2 "Flipkart" "flipkart" "NNP" False
0 "." "." "." False
0 "He" "he" "PRP" False
1 "likes" "like" "VBZ" True
3 "Amazone" "amazone" "NNP" False
0 "." "." "." False
0 "she" "she" "PRP" True
1 "likes" "like" "VBZ" True
4 "Snapdeal" "snapdeal" "NNP" False
0 "." "." "." False
2 "Flipkart" "flipkart" "NN" False
0 "and" "and" "CC" True
3 "amazone" "amazone" "JJ" False
5 "is" "be" "VBZ" True
0 "on" "on" "IN" True
6 "top" "top" "NN" True
0 "of" "of" "IN" True
7 "google" "google" "NNP" False
8 "search" "search" "NN" False
0 "." "." "." False
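The phrase-collection rule can be sketched as follows. This is an illustrative reconstruction of rules (a) and (b) above with made-up scores, not the library's actual code:

```python
# Two consecutive tokens, both with word_id > 0 and both present in the
# PageRank results, form a phrase; per rule (b) the phrase inherits the
# rank of its first word.
def collect_phrases(rows, ranks):
    """rows: [word_id, lemma] in text order; ranks: lemma -> score."""
    phrases = []
    for (id1, w1), (id2, w2) in zip(rows, rows[1:]):
        if id1 > 0 and id2 > 0 and w1 in ranks and w2 in ranks:
            phrases.append((f"{w1} {w2}", ranks[w1]))
    return phrases

rows = [[0, "of"], [7, "google"], [8, "search"], [0, "."]]
ranks = {"google": 0.138, "search": 0.254}
print(collect_phrases(rows, ranks))  # [('google search', 0.138)]
```

This is how "google search" emerges as a key phrase from the sample text: "google" and "search" are adjacent kept words.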
Combine all:

a) Scale all ranks between 0 and 1.
b) Combine everything together.
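The scaling in step (a) can be sketched as min-max normalization. This is an assumption for illustration; PyTextRank's exact normalization formula may differ:

```python
# Min-max scaling of raw PageRank scores into [0, 1]: the highest-ranked
# word maps to 1.0 and the lowest to 0.0.
def scale_ranks(ranks):
    lo, hi = min(ranks.values()), max(ranks.values())
    return {w: (r - lo) / (hi - lo) for w, r in ranks.items()}

scaled = scale_ranks({"search": 0.254, "top": 0.199, "like": 0.053})
print(scaled["search"], scaled["like"])  # 1.0 0.0
```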
import pytextrank
import sys
import json

# Stage 1:
sample_text = 'I Like Flipkart. He likes Amazone. she likes Snapdeal. Flipkart and amazone is on top of google search.'

# Create dictionary to feed into json file

file_dic = {"id" : 0,"text" : sample_text}
file_dic = json.dumps(file_dic)

loaded_file_dic = json.loads(file_dic)

# Create test.json and feed file_dic into it.
with open('test.json', 'w') as outfile:
        json.dump(loaded_file_dic, outfile)

path_stage0 = "test.json"
path_stage1 = "o1.json"

with open(path_stage1, 'w') as f:
    for graf in pytextrank.parse_doc(pytextrank.json_iter(path_stage0)):
        f.write("%s\n" % pytextrank.pretty_print(graf._asdict()))
        print(pytextrank.pretty_print(graf._asdict()))

# Stage 2 extract keywords
path_stage1 = "o1.json"
path_stage2 = "o2.json"

graph, ranks = pytextrank.text_rank(path_stage1)
pytextrank.render_ranks(graph, ranks)

with open(path_stage2, 'w') as f:
    for rl in pytextrank.normalize_key_phrases(path_stage1, ranks):
        f.write("%s\n" % pytextrank.pretty_print(rl._asdict()))
        print(pytextrank.pretty_print(rl))

        
## Google Page Rank        
import networkx as nx
import matplotlib.pyplot as plt

nx.draw(graph, with_labels=True) 
plt.show()


Conclusion:

In this tutorial you learned:
  • What keywords are
  • Why to do keyword extraction
  • Setting up PyTextRank in Python
  • Keyword extraction using PyTextRank in Python
  • What is happening in the background of PyTextRank

If you have any questions, type them in the comment box. I will try my best to answer them.
