Automatic Keyword extraction using Python TextRank

Keywords or entities are a condensed form of a document's content and are widely used to define queries within information retrieval (IR).
Keyword or key phrase extraction can be done with various methods such as TF-IDF of words, TF-IDF of n-grams, rule-based POS tagging, and so on.
But all of those need manual effort to find the proper logic.
In this post I will show you how to do keyword or key phrase extraction automatically using a Python package called PyTextRank, which is based on the TextRank algorithm, a graph-based ranking model inspired by Google's PageRank.
After reading this post you will know:
  • Why to do keyword extraction
  • How to do keyword extraction using TextRank in Python
  • What is happening in the background

Why to do keyword extraction:

  • You can judge a comment or sentence within a second just by looking at its keywords.
  • You can decide whether the comment or sentence is worth reading or not.
  • You can further categorize the sentence, for example whether a certain comment is about mobiles or hotels.
  • You can also use keywords, entities or key phrases as features to train a supervised model.

Setting up PyTextRank in Python:

Type python -m pip install pytextrank in cmd.
Download SmartStoplist.txt from
Keep SmartStoplist.txt inside your working directory (the folder where you are saving your Python scripts).
 

Keyword extraction using PyTextRank in Python:

While doing key phrase extraction, PyTextRank processes the text in two stages.

Stage 1:

In stage 1 it does some text cleaning and preprocessing:
  • Index each word of the text.
  • Lemmatize each word.
  • POS tagging.
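To make these three steps concrete, here is a toy, dependency-free illustration. The LEMMAS and POS tables below are hand-made stand-ins for what spaCy computes inside PyTextRank; they exist purely for illustration.

```python
# Toy version of stage 1: index each word, lemmatize and lowercase it,
# and attach a POS tag. Real lemmas/tags come from spaCy, not these dicts.
LEMMAS = {"Like": "like", "likes": "like", "is": "be"}
POS = {"I": "PRP", "Like": "VBP", "Flipkart": "NNP", ".": "."}

def stage1(tokens):
    rows = []
    for idx, word in enumerate(tokens):
        lemma = LEMMAS.get(word, word).lower()  # lemmatize + lowercase
        tag = POS.get(word, "NN")               # POS tag (default noun)
        rows.append((idx, word, lemma, tag))
    return rows

print(stage1(["I", "Like", "Flipkart", "."]))
# [(0, 'I', 'i', 'PRP'), (1, 'Like', 'like', 'VBP'),
#  (2, 'Flipkart', 'flipkart', 'NNP'), (3, '.', '.', '.')]
```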

Stage 2:

In stage 2, based on some logic, it comes up with important keywords, entities or key phrases together with their ranks.
First let's try to extract key phrases from a sample text in Python; then we will move on to understand how the PyTextRank algorithm works.

Note: If you are getting an error saying:

TypeError                                 Traceback (most recent call last)
<ipython-input-6-7dbb39972cd2> in <module>()
     13
     14 with open(path_stage1, 'w') as f:
---> 15     for graf in pytextrank.parse_doc(pytextrank.json_iter(path_stage0)):
     16         f.write("%s\n" % pytextrank.pretty_print(graf._asdict()))
     17         print(pytextrank.pretty_print(graf._asdict()))

C:\Users\anindya\Anaconda2\lib\site-packages\pytextrank\pytextrank.pyc in parse_doc(json_iter)
    259                 print("graf_text:", graf_text)
    260
--> 261             grafs, new_base_idx = parse_graf(meta["id"], graf_text, base_idx)
    262             base_idx = new_base_idx
    263

C:\Users\anindya\Anaconda2\lib\site-packages\pytextrank\pytextrank.pyc in parse_graf(doc_id, graf_text, base_idx, spacy_nlp)
    191     markup = []
    192     new_base_idx = base_idx
--> 193     doc = spacy_nlp(graf_text, parse=True)
    194
    195     for span in doc.sents:

TypeError: __call__() got an unexpected keyword argument 'parse'

Go to your Anaconda directory and then Lib -> site-packages -> pytextrank.
(For me the full path was C:\Users\anindya\Anaconda2\Lib\site-packages\pytextrank)
Open pytextrank.py and search for "parse=True". You should find this term twice in the whole script.
Remove "parse=True" from those lines.
For example:
Previously:
doc = spacy_nlp(graf_text, parse=True)
After Removing:

doc = spacy_nlp(graf_text)

 
Stage 1:
 
# Pytextrank
import pytextrank
import json

# Sample text
sample_text = 'I Like Flipkart. He likes Amazone. she likes Snapdeal. Flipkart and amazone is on top of google search.'

# Create dictionary to feed into json file

file_dic = {"id" : 0,"text" : sample_text}
file_dic = json.dumps(file_dic)
loaded_file_dic = json.loads(file_dic)

# Create test.json and feed file_dic into it.
with open('test.json', 'w') as outfile:
        json.dump(loaded_file_dic, outfile)

path_stage0 = "test.json"
path_stage1 = "o1.json"

# Extract keyword using pytextrank
with open(path_stage1, 'w') as f:
    for graf in pytextrank.parse_doc(pytextrank.json_iter(path_stage0)):
        f.write("%s\n" % pytextrank.pretty_print(graf._asdict()))
        print(pytextrank.pretty_print(graf._asdict()))



Output:
{"graf": [[0, "I", "i", "PRP", 0, 0], [1, "Like", "like", "VBP", 1, 1], [2, "Flipkart", "flipkart", "NNP", 1, 2], [0, ".", ".", ".", 0, 3]], "id": 0, "sha1": "c09f142dabeb8465f5c30d1c4ef7a0d842ffb99c"}
{"graf": [[0, "He", "he", "PRP", 0, 4], [1, "likes", "like", "VBZ", 1, 5], [3, "Amazone", "amazone", "NNP", 1, 6], [0, ".", ".", ".", 0, 7]], "id": 0, "sha1": "307041869f5cba8a798708d64ee83f290098e00a"}
{"graf": [[0, "she", "she", "PRP", 0, 8], [1, "likes", "like", "VBZ", 1, 9], [4, "Snapdeal", "snapdeal", "NNP", 1, 10], [0, ".", ".", ".", 0, 11]], "id": 0, "sha1": "cfd16187a0ddb43d4e412ae83acc9312fc0da922"}
{"graf": [[2, "Flipkart", "flipkart", "NN", 1, 12], [0, "and", "and", "CC", 0, 13], [3, "amazone", "amazone", "JJ", 1, 14], [5, "is", "be", "VBZ", 1, 15], [0, "on", "on", "IN", 0, 16], [6, "top", "top", "NN", 1, 17], [0, "of", "of", "IN", 0, 18], [7, "google", "google", "NNP", 1, 19], [8, "search", "search", "NN", 1, 20], [0, ".", ".", ".", 0, 21]], "id": 0, "sha1": "f70b4cb49e04ac5586a4b07898212a7bbb673649"}

Stage 2:
path_stage1 = "o1.json"
path_stage2 = "o2.json"

graph, ranks = pytextrank.text_rank(path_stage1)
pytextrank.render_ranks(graph, ranks)

with open(path_stage2, 'w') as f:
    for rl in pytextrank.normalize_key_phrases(path_stage1, ranks):
        f.write("%s\n" % pytextrank.pretty_print(rl._asdict()))
        print(pytextrank.pretty_print(rl))

Output:
["google search", 0.32500626838725255, [7, 8], "np", 1]
["search", 0.16250313419362628, [8], "nn", 1]
["top", 0.12701509697944047, [6], "nn", 1]
["amazone", 0.12383387876447217, [3], "np", 1]
["google", 0.08783938094837528, [7], "nnp", 1]
["snapdeal", 0.08690112036341668, [4], "np", 2]
["flipkart", 0.08690112036341668, [2], "np", 3]



How Does the PyTextRank Algorithm Work?

Stage 1:
Let's recall what happened at stage 1.
Input:

"I Like Flipkart."
Output:

[[0, "I", "i", "PRP", 0, 0], [1, "Like", "like", "VBP", 1, 1], [2, "Flipkart", "flipkart", "NNP", 1, 2], [0, ".", ".", ".", 0, 3]]

Now, what does [0, "I", "i", "PRP", 0, 0] mean?

It implies:

[word_id, raw_word, lemmatize_lower_word, POS_tag, keep, idx]

Now let's understand what those fields are.

word_id: ID from the dictionary of unique words whose part-of-speech tag starts with 'v', 'n' or 'j'.

For example, our sample text was:

'I Like Flipkart. He likes Amazone. she likes Snapdeal. Flipkart and amazone is on top of google search.'

So the dictionary of unique words whose part-of-speech (POS) starts with 'v', 'n' or 'j' will look like:
Word_Id Raw_Word Lemma_Word POS Keep Idx
0 "I" "i" "PRP" 0 0
1 "Like" "like" "VBP" 1 1
2 "Flipkart" "flipkart" "NNP" 1 2
0 "." "." "." 0 3
0 "He" "he" "PRP" 0 4
1 "likes" "like" "VBZ" 1 5
3 "Amazone" "amazone" "NNP" 1 6
0 "." "." "." 0 7
0 "she" "she" "PRP" 0 8
1 "likes" "like" "VBZ" 1 9
4 "Snapdeal" "snapdeal" "NNP" 1 10
0 "." "." "." 0 11
2 "Flipkart" "flipkart" "NN" 1 12
0 "and" "and" "CC" 0 13
3 "amazone" "amazone" "JJ" 1 14
5 "is" "be" "VBZ" 1 15
0 "on" "on" "IN" 0 16
6 "top" "top" "NN" 1 17
0 "of" "of" "IN" 0 18
7 "google" "google" "NNP" 1 19
8 "search" "search" "NN" 1 20
0 "." "." "." 0 21


word_id: After filtering the data based on part of speech (tags starting with v, n or j), word_id is the ID of each unique kept word, assigned in order of appearance in the text.
raw_word: The raw input word, e.g. "Flipkart", "Amazone".


lemmatize_lower_word:

       1. The lowercased version of each word, e.g. "Flipkart" to "flipkart".

       2. The lemmatized version of each word, e.g. "is" to "be", "likes" to "like"; i/he/she are all converted to the same token, whose POS will be -PRON-, etc.

POS_tag: The part-of-speech tag, e.g. verb, adjective.
keep: If the POS starts with 'v', 'n' or 'j' then keep = 1, else 0.
idx: A running index that increases word by word.
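The assignment of word_id, keep and idx described above can be sketched in a few lines of plain Python. This is an illustrative reconstruction of the rule, not PyTextRank's actual code, operating on already-tagged (raw, lemma, POS) tuples:

```python
# Sketch of the stage 1 bookkeeping: lemmas whose POS starts with
# v, n or j get a fresh word_id (shared across repeats) and keep=1;
# everything else gets word_id 0 and keep=0. idx just counts tokens.
def assign_ids(tagged):
    """tagged: list of (raw_word, lemma, pos) tuples."""
    word_ids, rows, next_id = {}, [], 1
    for idx, (raw, lemma, pos) in enumerate(tagged):
        keep = 1 if pos[0].lower() in "vnj" else 0
        if keep:
            if lemma not in word_ids:       # first time we see this lemma
                word_ids[lemma] = next_id
                next_id += 1
            wid = word_ids[lemma]
        else:
            wid = 0
        rows.append([wid, raw, lemma, pos, keep, idx])
    return rows

rows = assign_ids([("I", "i", "PRP"), ("Like", "like", "VBP"),
                   ("Flipkart", "flipkart", "NNP"), (".", ".", ".")])
print(rows)
# [[0, 'I', 'i', 'PRP', 0, 0], [1, 'Like', 'like', 'VBP', 1, 1],
#  [2, 'Flipkart', 'flipkart', 'NNP', 1, 2], [0, '.', '.', '.', 0, 3]]
```

Note that this reproduces the first row of the stage 1 output shown earlier.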
Stage 2:
Stage 2 does its job in two steps.
Step 1:
  • First it draws a weighted graph.
  • Then it calculates the rank of each word using Google's PageRank algorithm.
import networkx as nx
import matplotlib.pyplot as plt

# Draw network plot
nx.draw(graph, with_labels=True)
plt.show()

Output:

 

Word ranks from Google's PageRank algorithm:

ranks = nx.pagerank(graph)
print(ranks)



Output:

{u'be': 0.12312834625371089, u'search': 0.25443914067175, u'google': 0.13753443412982308, u'like': 0.053012554310560484, u'top': 0.19887377734684572, u'snapdeal': 0.06803267671851249, u'amazone': 0.09694639385028486, u'flipkart': 0.06803267671851249}
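To see what nx.pagerank is doing under the hood, here is a minimal, dependency-free PageRank power iteration over a tiny hand-made word graph. The edges below are made up for illustration; in the real pipeline the graph comes from pytextrank.text_rank.

```python
# Minimal PageRank power iteration over a small undirected graph,
# mirroring the idea behind nx.pagerank (damping factor 0.85).
def pagerank(edges, damping=0.85, iters=50):
    nodes = sorted({n for e in edges for n in e})
    nbrs = {n: [] for n in nodes}
    for a, b in edges:                # undirected: add both directions
        nbrs[a].append(b)
        nbrs[b].append(a)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        # Each node redistributes its rank equally among its neighbours.
        rank = {n: (1 - damping) / len(nodes)
                   + damping * sum(rank[m] / len(nbrs[m]) for m in nbrs[n])
                for n in nodes}
    return rank

# Hand-made edges for illustration; "search" is the best-connected node,
# so it ends up with the highest rank.
ranks = pagerank([("search", "google"), ("search", "top"), ("search", "be")])
print(max(ranks, key=ranks.get))  # search
```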

Step 2:

1. Collect keywords:

Select a word if its word_id > 0, the word is present in the PageRank results (the words labelled in the network plot above), the word is a noun or verb (i.e. its POS starts with "N" or "V"), and it is not a stop word.
Word_Id Raw_Word Lemma_Word POS Is_Stop
0 "I" "i" "PRP" False
1 "Like" "like" "VBP" True
2 "Flipkart" "flipkart" "NNP" False
0 "." "." "." False
0 "He" "he" "PRP" False
1 "likes" "like" "VBZ" True
3 "Amazone" "amazone" "NNP" False
0 "." "." "." False
0 "she" "she" "PRP" True
1 "likes" "like" "VBZ" True
4 "Snapdeal" "snapdeal" "NNP" False
0 "." "." "." False
2 "Flipkart" "flipkart" "NN" False
0 "and" "and" "CC" True
3 "amazone" "amazone" "JJ" False
5 "is" "be" "VBZ" True
0 "on" "on" "IN" True
6 "top" "top" "NN" True
0 "of" "of" "IN" True
7 "google" "google" "NNP" False
8 "search" "search" "NN" False
0 "." "." "." False
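The keyword-selection rule can be sketched as a simple filter. The rows and scores below are made up for illustration and this is a reconstruction of the rule described above, not the library's actual code:

```python
# Keep tokens with word_id > 0 whose lemma appears in the PageRank
# results, whose POS starts with N or V, and which are not stop words.
def collect_keywords(rows, ranks):
    """rows: [word_id, raw, lemma, pos, is_stop]; ranks: lemma -> score."""
    return sorted({row[2] for row in rows
                   if row[0] > 0 and row[2] in ranks
                   and row[3][0] in ("N", "V") and not row[4]})

rows = [[2, "Flipkart", "flipkart", "NNP", False],
        [1, "likes", "like", "VBZ", True],       # stop word, dropped
        [3, "amazone", "amazone", "JJ", False],  # adjective, dropped
        [8, "search", "search", "NN", False]]
ranks = {"flipkart": 0.068, "like": 0.053, "amazone": 0.097, "search": 0.254}
print(collect_keywords(rows, ranks))  # ['flipkart', 'search']
```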

2. Collect entities:

Select words after removing cardinal-type entities and stop words.

Word_Id Raw_Word Lemma_Word POS Is_Stop
2 "Flipkart" "flipkart" "NNP" False
3 "Amazone" "amazone" "NNP" False
4 "Snapdeal" "snapdeal" "NNP" False
2 "Flipkart" "flipkart" "NN" False
7 "google" "google" "NNP" False
8 "search" "search" "NN" False

 

3. Collect phrases:

a) Select pairs of words that come one after another, where each word has word_id > 0 and is in the rank list (PageRank).

b) The rank of a phrase will be the rank of the first word of that phrase from PageRank (Google's PageRank).

Word_Id Raw_Word Lemma_Word POS Is_Stop
0 "I" "i" "PRP" False
1 "Like" "like" "VBP" True
2 "Flipkart" "flipkart" "NNP" False
0 "." "." "." False
0 "He" "he" "PRP" False
1 "likes" "like" "VBZ" True
3 "Amazone" "amazone" "NNP" False
0 "." "." "." False
0 "she" "she" "PRP" True
1 "likes" "like" "VBZ" True
4 "Snapdeal" "snapdeal" "NNP" False
0 "." "." "." False
2 "Flipkart" "flipkart" "NN" False
0 "and" "and" "CC" True
3 "amazone" "amazone" "JJ" False
5 "is" "be" "VBZ" True
0 "on" "on" "IN" True
6 "top" "top" "NN" True
0 "of" "of" "IN" True
7 "google" "google" "NNP" False
8 "search" "search" "NN" False
0 "." "." "." False
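The phrase-collection rule can be sketched as follows. This is an illustrative reconstruction of rules (a) and (b) above with made-up scores, not the library's actual code:

```python
# Two consecutive tokens, both with word_id > 0 and both present in the
# PageRank results, form a phrase; per rule (b) the phrase inherits the
# rank of its first word.
def collect_phrases(rows, ranks):
    """rows: [word_id, lemma] in text order; ranks: lemma -> score."""
    phrases = []
    for (id1, w1), (id2, w2) in zip(rows, rows[1:]):
        if id1 > 0 and id2 > 0 and w1 in ranks and w2 in ranks:
            phrases.append((f"{w1} {w2}", ranks[w1]))
    return phrases

rows = [[0, "of"], [7, "google"], [8, "search"], [0, "."]]
ranks = {"google": 0.138, "search": 0.254}
print(collect_phrases(rows, ranks))  # [('google search', 0.138)]
```

This is how "google search" emerges as a key phrase from the sample text: "google" and "search" are adjacent kept words.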
Combine all:

a) Scale all ranks between 0 and 1.
b) Combine everything together.
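The scaling in step (a) can be sketched as min-max normalization. This is an assumption for illustration; PyTextRank's exact normalization formula may differ:

```python
# Min-max scaling of raw PageRank scores into [0, 1]: the highest-ranked
# word maps to 1.0 and the lowest to 0.0.
def scale_ranks(ranks):
    lo, hi = min(ranks.values()), max(ranks.values())
    return {w: (r - lo) / (hi - lo) for w, r in ranks.items()}

scaled = scale_ranks({"search": 0.254, "top": 0.199, "like": 0.053})
print(scaled["search"], scaled["like"])  # 1.0 0.0
```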
import pytextrank
import sys
import json

# Stage 1:
sample_text = 'I Like Flipkart. He likes Amazone. she likes Snapdeal. Flipkart and amazone is on top of google search.'

# Create dictionary to feed into json file

file_dic = {"id" : 0,"text" : sample_text}
file_dic = json.dumps(file_dic)

loaded_file_dic = json.loads(file_dic)

# Create test.json and feed file_dic into it.
with open('test.json', 'w') as outfile:
        json.dump(loaded_file_dic, outfile)

path_stage0 = "test.json"
path_stage1 = "o1.json"

with open(path_stage1, 'w') as f:
    for graf in pytextrank.parse_doc(pytextrank.json_iter(path_stage0)):
        f.write("%s\n" % pytextrank.pretty_print(graf._asdict()))
        print(pytextrank.pretty_print(graf._asdict()))

# Stage 2 extract keywords
path_stage1 = "o1.json"
path_stage2 = "o2.json"

graph, ranks = pytextrank.text_rank(path_stage1)
pytextrank.render_ranks(graph, ranks)

with open(path_stage2, 'w') as f:
    for rl in pytextrank.normalize_key_phrases(path_stage1, ranks):
        f.write("%s\n" % pytextrank.pretty_print(rl._asdict()))
        print(pytextrank.pretty_print(rl))

        
## Google Page Rank        
import networkx as nx
import matplotlib.pyplot as plt

nx.draw(graph, with_labels=True) 
plt.show()


Conclusion:

In this tutorial you learned:
  • What keywords are
  • Why to do keyword extraction
  • Setting up PyTextRank in Python
  • Keyword extraction using PyTextRank in Python
  • What is happening in the background of PyTextRank

If you have any questions, type them in the comment box. I will try my best to answer them.
