Automatic Keyword extraction using RAKE in Python


Keyword extraction or entity extraction is widely used to define queries within Information Retrieval (IR) in the field of Natural Language Processing (NLP).
You can extract keywords or important words and phrases by various methods, such as TF-IDF of words, TF-IDF of n-grams, or rule-based POS tagging.
But all of those need manual effort to find the proper logic.
In this post I will show you how to extract keywords or important terms from a given text automatically, using a Python package called RAKE, which is based on the Rapid Automatic Keyword Extraction technique.

After reading this post you will know:
  • Why to extract keywords
  • How to set up RAKE in Python
  • How to extract keywords using RAKE in Python
  • What is happening in the background

Why extract keywords:

  • You can judge a comment or sentence within a second just by looking at its keywords.
  • You can decide whether the comment or sentence is worth reading.
  • You can further assign the sentence to a category, for example whether a certain comment is about a mobile phone or a hotel.
  • You can also use keywords or entities as features to train a supervised model.

Setting up RAKE in Python:

Type pip install python-rake==1.4.4
Or you can download and install RAKE from https://pypi.org/project/python-rake/#files
Or
    1. Clone this GitHub repo: https://github.com/zelandiya/RAKE-tutorial
    2. Open cmd (Windows key + R, then type “cmd” and hit Enter) and type cd the_cloned_folder_name
    3. Work with Python inside this directory (the_cloned_folder_name).

    Extract keywords using RAKE in Python:

    Download SmartStoplist.txt from
    # RAKE
    import RAKE

    # RAKE setup with the path to the stop word file
    stop_dir = "SmartStoplist.txt"
    rake_object = RAKE.Rake(stop_dir)

    # Sample text to test RAKE
    text = """Google quietly rolled out a new way for Android users to listen
    to podcasts and subscribe to shows they like, and it already works on
    your phone. Podcast production company Pacific Content got the exclusive
    on it.This text is taken from Google news."""

    # Extract keywords
    keywords = rake_object.run(text)
    print ("keywords: ", keywords)

    Output:

    ('keywords: ', [('podcast production company pacific content', 25.0), ('google quietly rolled', 8.5), ('google news', 4.5), ('android users', 4.0), ('exclusive', 1.0), ('works', 1.0), ('phone', 1.0), ('text', 1.0), ('podcasts', 1.0), ('subscribe', 1.0), ('listen', 1.0), ('shows', 1.0)])


    Great! It looks like RAKE is working fine and picking sensible keywords, doesn’t it?

    Now, what is happening in the background?

    How does the RAKE algorithm work?

    Step 1:

    First, convert all text to lower case (ex: ‘Google’ -> ‘google’ or ‘GOOGLE’ -> ‘google’).
    Then split it into an array of words (tokens) by the specified word delimiters (space, comma, dot, etc.).
    google quietly rolled out a new way for android users to listen to podcasts and subscribe to shows they like, and it already works on your phone. podcast production company pacific content got the exclusive on it.this text is taken from google news.

    [‘google’,
     ‘quietly’,
     ‘rolled’,
     ‘out’,
     ‘a’,
     ‘new’,
     ‘way’,
     ‘for’,
     ‘android’,
     ‘users’,
     ‘to’,
     ‘listen’,
     ‘to’,
     ‘podcasts’,
     ‘and’,
     ‘subscribe’,
     ‘to’,
     ‘shows’,
     ‘they’,
     ‘like’,
     ‘,’,
     ‘and’,
     ‘it’,
     ‘already’,
     ‘works’,
     ‘on’,
     ‘your’,
     ‘phone’,
     ‘.’,
     ‘podcast’,
     ‘production’,
     ‘company’,
     ‘pacific’,
     ‘content’,
     ‘got’,
     ‘the’,
     ‘exclusive’,
     ‘on’,
     ‘it’,
     ‘.’,
     ‘this’,
     ‘text’,
     ‘is’,
     ‘taken’,
     ‘from’,
     ‘google’,
     ‘news’,
     ‘.’]
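Step 1 can be sketched in a few lines of plain Python. This is only an illustration, not python-rake's actual tokenizer: the `tokenize` helper and its regular expression are assumptions chosen to reproduce the token list above.

```python
import re

text = ("Google quietly rolled out a new way for Android users to listen "
        "to podcasts and subscribe to shows they like, and it already works on "
        "your phone. Podcast production company Pacific Content got the exclusive "
        "on it.This text is taken from Google news.")


def tokenize(raw):
    """Lowercase the text, then return words and punctuation marks as separate tokens."""
    # Hypothetical pattern: runs of letters/digits are words, and each
    # punctuation mark becomes its own token.
    return re.findall(r"[a-z0-9]+|[.,!?;:]", raw.lower())


tokens = tokenize(text)
print(len(tokens))   # 48
print(tokens[:8])    # ['google', 'quietly', 'rolled', 'out', 'a', 'new', 'way', 'for']
```

Keeping the punctuation marks as tokens matters for the next step, because they act as phrase delimiters just like stop words do.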

    Step 2:
    This array is then split into sequences of contiguous words at phrase delimiters and stop word positions.
    Words within a sequence are assigned the same position in the text and together are considered a candidate keyword.
    Split by delimiters       Split by stop word        Candidate Keyword
    [‘google’,                ‘google’                  [‘google’, ‘quietly’, ‘rolled’]
     ‘quietly’,               ‘quietly’
     ‘rolled’,                ‘rolled’
     ‘out’,
     ‘a’,
     ‘new’,
     ‘way’,
     ‘for’,
     ‘android’,               ‘android’                 [‘android’, ‘users’]
     ‘users’,                 ‘users’
     ‘to’,
     ‘listen’,                ‘listen’                  [‘listen’]
     ‘to’,
     ‘podcasts’,              ‘podcasts’                [‘podcasts’]
     ‘and’,
     ‘subscribe’,             ‘subscribe’               [‘subscribe’]
     ‘to’,
     ‘shows’,                 ‘shows’                   [‘shows’]
     ‘they’,
     ‘like’,
     ‘,’,
     ‘and’,
     ‘it’,
     ‘already’,
     ‘works’,                 ‘works’                   [‘works’]
     ‘on’,
     ‘your’,
     ‘phone’,                 ‘phone’                   [‘phone’]
     ‘.’,
     ‘podcast’,               ‘podcast’                 [‘podcast’, ‘production’, ‘company’, ‘pacific’, ‘content’]
     ‘production’,            ‘production’
     ‘company’,               ‘company’
     ‘pacific’,               ‘pacific’
     ‘content’,               ‘content’
     ‘got’,
     ‘the’,
     ‘exclusive’,             ‘exclusive’               [‘exclusive’]
     ‘on’,
     ‘it’,
     ‘.’,
     ‘this’,
     ‘text’,                  ‘text’                    [‘text’]
     ‘is’,
     ‘taken’,
     ‘from’,
     ‘google’,                ‘google’                  [‘google’, ‘news’]
     ‘news’,                  ‘news’
     ‘.’]

    Note: Words with no entry in the second column are stop words or punctuation, and are removed.
    From the above table we get all the candidate keywords:

    ‘google’, ‘quietly’, ‘rolled’
    ‘android’, ‘users’
    ‘listen’
    ‘podcasts’
    ‘subscribe’
    ‘shows’
    ‘works’
    ‘phone’
    ‘podcast’, ‘production’, ‘company’, ‘pacific’, ‘content’
    ‘exclusive’
    ‘text’
    ‘google’, ‘news’

    These candidate keywords are exactly the keywords that RAKE outputs (ex: ‘google quietly rolled’). But RAKE goes one step further and calculates a score for each keyword.
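Step 2 can be sketched as follows. This is a minimal illustration rather than the package's own code: `STOP_WORDS` is a hand-picked assumption containing only the stop words that occur in this sample (the real SmartStoplist.txt is much longer), and `candidate_keywords` is a hypothetical helper name.

```python
# Tokens from Step 1 (lowercased words and punctuation marks).
tokens = ("google quietly rolled out a new way for android users to listen "
          "to podcasts and subscribe to shows they like , and it already works "
          "on your phone . podcast production company pacific content got the "
          "exclusive on it . this text is taken from google news .").split()

# Assumption: only the stop words needed for this sample text.
STOP_WORDS = {"out", "a", "new", "way", "for", "to", "and", "they", "like", "it",
              "already", "on", "your", "got", "the", "this", "is", "taken", "from"}
PUNCTUATION = {".", ",", "!", "?", ";", ":"}


def candidate_keywords(tokens):
    """Group contiguous tokens, breaking at every stop word or punctuation mark."""
    candidates, current = [], []
    for tok in tokens:
        if tok in STOP_WORDS or tok in PUNCTUATION:
            if current:                 # close the phrase collected so far
                candidates.append(current)
            current = []
        else:
            current.append(tok)
    if current:                         # phrase that runs to the end of the text
        candidates.append(current)
    return candidates


for phrase in candidate_keywords(tokens):
    print(phrase)
```

Running it prints the same twelve candidate keywords listed above, starting with ['google', 'quietly', 'rolled'].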

    Step 3 (Calculate Keyword Score):

    Let’s have a look at the individual words appearing in the candidate keywords, which are as follows.
    ‘google’, ‘quietly’, ‘rolled’, ‘android’, ‘users’, ‘listen’, ‘podcasts’, ‘subscribe’, ‘shows’, ‘works’, ‘phone’, ‘podcast’, ‘production’, ‘company’, ‘pacific’, ‘content’, ‘exclusive’, ‘text’, ‘google’, ‘news’
    RAKE calculates the keyword score from the ratio of degree to frequency of words. Let’s see how RAKE calculates the keyword score.
    To do that, we first need to build a word co-occurrence graph.
    If you don’t know what a co-occurrence graph is:

    It is similar to a term-document matrix, except that both the rows and the columns are words: each cell counts how often the two words appear together in a candidate keyword, and the diagonal cell counts how often the word itself appears.

    Let’s see how it looks.


    Explanation of the keyword co-occurrence graph of RAKE:

    Let’s take an example: the first candidate keyword is ‘google’, ‘quietly’, ‘rolled’.

    So in the table:

     google-google = 2 (as ‘google’ appears 2 times across all candidate keywords, i.e. in ‘google quietly rolled’ and ‘google news’)



    google-quietly (and quietly-google) = 1 (as ‘google’ and ‘quietly’ appear together 1 time across all candidate keywords)


    The same logic applies to each word in the table.
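The co-occurrence counts described here can be rebuilt from the candidate keywords with a short sketch. The nested-dictionary layout and the name `cooccurrence` are assumptions made for illustration; they are not taken from the RAKE package.

```python
from collections import defaultdict

# Candidate keywords from Step 2.
candidates = [p.split() for p in [
    "google quietly rolled", "android users", "listen", "podcasts", "subscribe",
    "shows", "works", "phone", "podcast production company pacific content",
    "exclusive", "text", "google news"]]

# cooccurrence[w1][w2] = number of candidate keywords containing both w1 and w2.
# The diagonal cooccurrence[w][w] is how often w itself appears.
cooccurrence = defaultdict(lambda: defaultdict(int))
for phrase in candidates:
    for w1 in phrase:
        for w2 in phrase:
            cooccurrence[w1][w2] += 1

print(cooccurrence["google"]["google"])    # 2
print(cooccurrence["google"]["quietly"])   # 1
print(dict(cooccurrence["google"]))        # {'google': 2, 'quietly': 1, 'rolled': 1, 'news': 1}
```

Word pairs that never share a candidate keyword, such as ‘google’ and ‘android’, simply have a count of 0.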


    Now, to calculate the score, we need to calculate two things for each word:

    1. Word Frequency (freq(w))

    2. Word Degree (deg(w))

    Calculate Word frequency (freq(w)):

    This count says how many times a particular word appears among all candidate keywords.


    Simply take the value in that word’s own row and column (the diagonal cell).


    Example for the word ‘google’:


    freq(google) = 2 (value of row name ‘google’ and column name ‘google’)


    Calculate Word Degree (deg(w)):

    To calculate the word degree for a particular word, sum all the numbers in that word’s row of the above table.


    Example for the word ‘google’:


    deg(google) = 2 + 1 + 1 + 1 = 5
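Both quantities can also be computed directly from the candidate keywords, without building the full table. The sketch below assumes that no word repeats inside a single candidate keyword (true for this sample), in which case a word's row sum equals the total length of the candidate keywords it appears in. The names `freq` and `degree` simply mirror freq(w) and deg(w).

```python
from collections import Counter

# Candidate keywords from Step 2.
candidates = [p.split() for p in [
    "google quietly rolled", "android users", "listen", "podcasts", "subscribe",
    "shows", "works", "phone", "podcast production company pacific content",
    "exclusive", "text", "google news"]]

freq = Counter()     # freq(w): how many times w appears among the candidates
degree = Counter()   # deg(w): row sum of w in the co-occurrence table
for phrase in candidates:
    for word in phrase:
        freq[word] += 1
        # Each occurrence co-occurs with every word in its phrase, itself included.
        degree[word] += len(phrase)

print(freq["google"], degree["google"])     # 2 5
print(freq["quietly"], degree["quietly"])   # 1 3
```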
    Now we can finally calculate the keyword score.

    Keyword score = sum of deg(w)/freq(w) over every word w in the keyword.

    Let’s test it for one keyword, ‘google quietly rolled’:

                 google      quietly     rolled
    deg(w)       5           3           3
    freq(w)      2           1           1

    Keyword score (‘google quietly rolled’) = (5/2 + 3/1 + 3/1) = (2.5 + 3 + 3) = 8.5

    Do you recall the score RAKE gave?

    ('keywords: ', [('podcast production company pacific content', 25.0), ('google quietly rolled', 8.5), ('google news', 4.5), ('android users', 4.0), ('exclusive', 1.0), ('works', 1.0), ('phone', 1.0), ('text', 1.0), ('podcasts', 1.0), ('subscribe', 1.0), ('listen', 1.0), ('shows', 1.0)])


    Yes, it is the same!
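The same check can be run for every keyword at once. This is a minimal sketch of the scoring step as described in this post, not the package's internal code; `keyword_score` and `ranked` are hypothetical names.

```python
from collections import Counter

# Candidate keywords from Step 2.
candidates = [p.split() for p in [
    "google quietly rolled", "android users", "listen", "podcasts", "subscribe",
    "shows", "works", "phone", "podcast production company pacific content",
    "exclusive", "text", "google news"]]

# Word frequency and word degree, as defined in Step 3.
freq, degree = Counter(), Counter()
for phrase in candidates:
    for word in phrase:
        freq[word] += 1
        degree[word] += len(phrase)


def keyword_score(phrase):
    """Sum deg(w) / freq(w) over the words of one candidate keyword."""
    return sum(degree[w] / freq[w] for w in phrase)


scores = {" ".join(p): keyword_score(p) for p in candidates}
ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

for keyword, score in ranked[:4]:
    print(keyword, score)
# podcast production company pacific content 25.0
# google quietly rolled 8.5
# google news 4.5
# android users 4.0
```

The remaining eight single-word keywords all score 1.0, which matches the RAKE output shown earlier.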


    Conclusion:

    In this tutorial you learned:

    • What a keyword is
    • Why to extract keywords
    • How to set up RAKE for Python
    • How to extract keywords using RAKE in Python
    • How Rapid Automatic Keyword Extraction (RAKE) works


    Note:

    Although you have seen that RAKE is fairly accurate, it can fail when a keyword itself contains stop words.


    For example, ‘new’ is listed in RAKE’s stopword list. This means that neither ‘New York’ nor ‘New Zealand’ can ever be a keyword.
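This limitation is easy to reproduce with a small sketch of the splitting step. The stop word set below is a tiny assumption for illustration (it only needs to contain ‘new’ and the other function words in the sentence), and `candidate_keywords` is a hypothetical helper, not part of the RAKE package.

```python
# Assumption: a tiny stop word set; 'new' is included because it is on
# RAKE's stop word list, as noted above.
STOP_WORDS = {"she", "from", "to", "new"}


def candidate_keywords(text):
    """Split lowercased text into candidate keywords at stop word positions."""
    candidates, current = [], []
    for word in text.lower().split():
        if word in STOP_WORDS:
            if current:
                candidates.append(" ".join(current))
            current = []
        else:
            current.append(word)
    if current:
        candidates.append(" ".join(current))
    return candidates


print(candidate_keywords("She moved from New York to New Zealand"))
# ['moved', 'york', 'zealand']
```

If such names matter for your data, one option is to pass RAKE a stop word file with ‘new’ removed, at the cost of more noise in other keywords.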


    If you have any questions about this topic, post them in the comment box. I will try my best to answer them.
