Automatic Keyword extraction using Topica in Python



Keywords or entities are a condensed form of the content and are widely used to define queries in information retrieval (IR).
You can extract keywords or entities by various methods, such as TF-IDF of words, TF-IDF of n-grams, or rule-based POS tagging.
But all of those need manual effort to find the proper logic.
In this post I will show you how to extract important keywords or entities automatically using a Python package called Topica (installed as topia.termextract), which is based on a POS tagging technique.
After reading this post you will know:
  • Why extract keywords
  • How to extract keywords using Topica in Python
  • What is happening in the background

Why extract keywords:

  • You can judge a comment or sentence within a second just by looking at its keywords.
  • You can decide whether the comment or sentence is worth reading or not.
  • You can then assign the sentence to a category, for example whether a certain comment is about a mobile phone or a hotel.
  • You can also use keywords or entities as features to train a supervised model.

Setting up Topica for Python:

Type python -m pip install topia.termextract in cmd to install Topica. That's it.

Extract keyword using Topica in Python:

from topia.termextract import extract
from topia.termextract import tag

# Setup Term Extractor
extractor = extract.TermExtractor()

# Some sample text
text ='''
... Police shut Palestinian theatre in Jerusalem.
...
... Israeli police have shut down a Palestinian theatre in East Jerusalem.
...
... The action, on Thursday, prevented the closing event of an international
... literature festival from taking place.
...
... Police said they were acting on a court order, issued after intelligence
... indicated that the Palestinian Authority was involved in the event.
...
... Israel has occupied East Jerusalem since 1967 and has annexed the
... area. This is not recognised by the international community.
...
... The British consul-general in Jerusalem , Richard Makepeace, was
... attending the event.
...
... "I think all lovers of literature would regard this as a very
... regrettable moment and regrettable decision," he added.
...
... Mr Makepeace said the festival's closing event would be reorganised to
... take place at the British Council in Jerusalem.
...
... The Israeli authorities often take action against events in East
... Jerusalem they see as connected to the Palestinian Authority.
...
... Saturday's opening event at the same theatre was also shut down.
...
... A police notice said the closure was on the orders of Israel's internal
... security minister on the grounds of a breach of interim peace accords
... from the 1990s.
...
... These laid the framework for talks on establishing a Palestinian state
... alongside Israel, but left the status of Jerusalem to be determined by
... further negotiation.
...
... Israel has annexed East Jerusalem and declares it part of its eternal
... capital.
...
... Palestinians hope to establish their capital in the area.
... '''

# Extract Keywords
keywords_topica = extractor(text)
print(keywords_topica)

Output:

[('East Jerusalem', 3, 2), ('Palestinian Authority', 2, 2), ('Palestinian', 6, 1), ('peace accords', 1, 2), ('East', 4, 1), ('event', 6, 1), ('literature festival', 1, 2), ('Israel', 4, 1), ('security minister', 1, 2), ('Israeli police', 1, 2), ('Palestinian state', 1, 2), ('police notice', 1, 2), ('Richard Makepeace', 1, 2), ('theatre', 3, 1), ('Palestinians hope', 1, 2), ('Palestinian theatre', 2, 2), ('British consul-general', 1, 2), ('British Council', 1, 2), ('court order', 1, 2), ('Mr Makepeace', 1, 2), ('Israeli authorities', 1, 2), ('opening event', 1, 2), ('Jerusalem', 8, 1)]

How Topica works:

Topica extracts entities from text in two steps.

Step 1 (POS tagging of text):

from topia.termextract import tag

# Setup POS tagger of topica
tagger = tag.Tagger()
tagger.initialize()

# POS tagging using topica
print(tagger(text))

Output: 
[['...', ':', '...'],
 ['Police', 'NNP', 'Police'],
 ['shut', 'VBN', 'shut'],
 ['Palestinian', 'JJ', 'Palestinian'],
 ['theatre', 'NN', 'theatre'],
 ['in', 'IN', 'in'],
 ['Jerusalem', 'NNP', 'Jerusalem'],
 ['.', '.', '.'],
 ['...', ':', '...'],
 ['...', ':', '...'],
 ['Israeli', 'JJ', 'Israeli'],
 ['police', 'NN', 'police'],
 ['have', 'VBP', 'have'],
 ['shut', 'VBN', 'shut'],
 ['down', 'RB', 'down'],
 ['a', 'DT', 'a'],
 ['Palestinian', 'JJ', 'Palestinian'],
 ['theatre', 'NN', 'theatre'],
 ['in', 'IN', 'in'],
 ['East', 'NNP', 'East'],
 ['Jerusalem', 'NNP', 'Jerusalem'],
 ['.', '.', '.'],
 ['...', ':', '...'],
 ['...', ':', '...'],
 ['The', 'DT', 'The'],
 ['action', 'NN', 'action'],
 [',', ',', ','],
 ['on', 'IN', 'on'],
 ['Thursday', 'NNP', 'Thursday'],
 [',', ',', ','],
 ['prevented', 'VBN', 'prevented'],
 ['the', 'DT', 'the'],
 ['closing', 'VBG', 'closing'],
 ['event', 'NN', 'event'],
 ['of', 'IN', 'of'],
 ['an', 'DT', 'an'],
 ['international', 'JJ', 'international'],
 ['...', ':', '...'],
 ['literature', 'NN', 'literature'],
 ['festival', 'NN', 'festival'],
 ['from', 'IN', 'from'],
 ['taking', 'VBG', 'taking'],
 ['place', 'NN', 'place'],
 ['.', '.', '.'],
 ['...', ':', '...'],
 ['...', ':', '...'],
 ['Police', 'NNP', 'Police'],
 ['said', 'VBD', 'said'],
 ['they', 'PRP', 'they'],
 ['were', 'VBD', 'were'],
 ['acting', 'VBG', 'acting'],
 ['on', 'IN', 'on'],
 ['a', 'DT', 'a'],
 ['court', 'NN', 'court'],
 ['order', 'NN', 'order'],
 [',', ',', ','],
 ['issued', 'VBN', 'issued'],
 ['after', 'IN', 'after'],
 ['intelligence', 'NN', 'intelligence'],
 ['...', ':', '...'],
 ['indicated', 'VBD', 'indicated'],
 ['that', 'IN', 'that'],
 ['the', 'DT', 'the'],
 ['Palestinian', 'JJ', 'Palestinian'],
 ['Authority', 'NNP', 'Authority'],
 ['was', 'VBD', 'was'],
 ['involved', 'VBN', 'involved'],
 ['in', 'IN', 'in'],
 ['the', 'DT', 'the'],
 ['event', 'NN', 'event'],
 ['.', '.', '.'],
 ['...', ':', '...'],
 ['...', ':', '...'],
 ['Israel', 'NNP', 'Israel'],
 ['has', 'VBZ', 'has'],
 ['occupied', 'VBN', 'occupied'],
 ['East', 'NNP', 'East'],
 ['Jerusalem', 'NNP', 'Jerusalem'],
 ['since', 'IN', 'since'],
 ['1967', 'NN', '1967'],
 ['and', 'CC', 'and'],
 ['has', 'VBZ', 'has'],
 ['annexed', 'VBD', 'annexed'],
 ['the', 'DT', 'the'],
 ['...', ':', '...'],
 ['area', 'NN', 'area'],
 ['.', '.', '.'],
 ['This', 'DT', 'This'],
 ['is', 'VBZ', 'is'],
 ['not', 'RB', 'not'],
 ['recognised', 'VBD', 'recognised'],
 ['by', 'IN', 'by'],
 ['the', 'DT', 'the'],
 ['international', 'JJ', 'international'],
 ['community', 'NN', 'community'],
 ['.', '.', '.'],
 ['...', ':', '...'],
 ['...', ':', '...'],
 ['The', 'DT', 'The'],
 ['British', 'JJ', 'British'],
 ['consul-general', 'NN', 'consul-general'],
 ['in', 'IN', 'in'],
 ['Jerusalem', 'NNP', 'Jerusalem'],
 [',', ',', ','],
 ['Richard', 'NNP', 'Richard'],
 ['Makepeace', 'NNP', 'Makepeace'],
 [',', ',', ','],
 ['was', 'VBD', 'was'],
 ['...', ':', '...'],
 ['attending', 'VBG', 'attending'],
 ['the', 'DT', 'the'],
 ['event', 'NN', 'event'],
 ['.', '.', '.'],
 ['...', ':', '...'],
 ['...', ':', '...'],
 ['"', '"', '"'],
 ['I', 'PRP', 'I'],
 ['think', 'VBP', 'think'],
 ['all', 'DT', 'all'],
 ['lovers', 'NNS', 'lover'],
 ['of', 'IN', 'of'],
 ['literature', 'NN', 'literature'],
 ['would', 'MD', 'would'],
 ['regard', 'VB', 'regard'],
 ['this', 'DT', 'this'],
 ['as', 'IN', 'as'],
 ['a', 'DT', 'a'],
 ['very', 'RB', 'very'],
 ['...', ':', '...'],
 ['regrettable', 'JJ', 'regrettable'],
 ['moment', 'NN', 'moment'],
 ['and', 'CC', 'and'],
 ['regrettable', 'JJ', 'regrettable'],
 ['decision', 'NN', 'decision'],
 [',"', ',', ',"'],
 ['he', 'PRP', 'he'],
 ['added', 'VBD', 'added'],
 ['.', '.', '.'],
 ['...', ':', '...'],
 ['...', ':', '...'],
 ['Mr', 'NNP', 'Mr'],
 ['Makepeace', 'NNP', 'Makepeace'],
 ['said', 'VBD', 'said'],
 ['the', 'DT', 'the'],
 ['festival', 'NN', 'festival'],
 ["'s", 'POS', "'s"],
 ['closing', 'VBG', 'closing'],
 ['event', 'NN', 'event'],
 ['would', 'MD', 'would'],
 ['be', 'VB', 'be'],
 ['reorganised', 'NN', 'reorganised'],
 ['to', 'TO', 'to'],
 ['...', ':', '...'],
 ['take', 'VB', 'take'],
 ['place', 'NN', 'place'],
 ['at', 'IN', 'at'],
 ['the', 'DT', 'the'],
 ['British', 'JJ', 'British'],
 ['Council', 'NNP', 'Council'],
 ['in', 'IN', 'in'],
 ['Jerusalem', 'NNP', 'Jerusalem'],
 ['.', '.', '.'],
 ['...', ':', '...'],
 ['...', ':', '...'],
 ['The', 'DT', 'The'],
 ['Israeli', 'JJ', 'Israeli'],
 ['authorities', 'NNS', 'authority'],
 ['often', 'RB', 'often'],
 ['take', 'VB', 'take'],
 ['action', 'NN', 'action'],
 ['against', 'IN', 'against'],
 ['events', 'NNS', 'event'],
 ['in', 'IN', 'in'],
 ['East', 'NNP', 'East'],
 ['...', ':', '...'],
 ['Jerusalem', 'NNP', 'Jerusalem'],
 ['they', 'PRP', 'they'],
 ['see', 'VB', 'see'],
 ['as', 'IN', 'as'],
 ['connected', 'VBN', 'connected'],
 ['to', 'TO', 'to'],
 ['the', 'DT', 'the'],
 ['Palestinian', 'JJ', 'Palestinian'],
 ['Authority', 'NNP', 'Authority'],
 ['.', '.', '.'],
 ['...', ':', '...'],
 ['...', ':', '...'],
 ['Saturday', 'NNP', 'Saturday'],
 ["'s", 'POS', "'s"],
 ['opening', 'NN', 'opening'],
 ['event', 'NN', 'event'],
 ['at', 'IN', 'at'],
 ['the', 'DT', 'the'],
 ['same', 'JJ', 'same'],
 ['theatre', 'NN', 'theatre'],
 ['was', 'VBD', 'was'],
 ['also', 'RB', 'also'],
 ['shut', 'VBN', 'shut'],
 ['down', 'RB', 'down'],
 ['.', '.', '.'],
 ['...', ':', '...'],
 ['...', ':', '...'],
 ['A', 'DT', 'A'],
 ['police', 'NN', 'police'],
 ['notice', 'NN', 'notice'],
 ['said', 'VBD', 'said'],
 ['the', 'DT', 'the'],
 ['closure', 'NN', 'closure'],
 ['was', 'VBD', 'was'],
 ['on', 'IN', 'on'],
 ['the', 'DT', 'the'],
 ['orders', 'NNS', 'order'],
 ['of', 'IN', 'of'],
 ['Israel', 'NNP', 'Israel'],
 ["'s", 'POS', "'s"],
 ['internal', 'JJ', 'internal'],
 ['...', ':', '...'],
 ['security', 'NN', 'security'],
 ['minister', 'NN', 'minister'],
 ['on', 'IN', 'on'],
 ['the', 'DT', 'the'],
 ['grounds', 'NNS', 'ground'],
 ['of', 'IN', 'of'],
 ['a', 'DT', 'a'],
 ['breach', 'NN', 'breach'],
 ['of', 'IN', 'of'],
 ['interim', 'JJ', 'interim'],
 ['peace', 'NN', 'peace'],
 ['accords', 'NNS', 'accord'],
 ['...', ':', '...'],
 ['from', 'IN', 'from'],
 ['the', 'DT', 'the'],
 ['1990', 'NN', '1990'],
 ['s', 'PRP', 's'],
 ['.', '.', '.'],
 ['...', ':', '...'],
 ['...', ':', '...'],
 ['These', 'DT', 'These'],
 ['laid', 'VBN', 'laid'],
 ['the', 'DT', 'the'],
 ['framework', 'NN', 'framework'],
 ['for', 'IN', 'for'],
 ['talks', 'NNS', 'talk'],
 ['on', 'IN', 'on'],
 ['establishing', 'VBG', 'establishing'],
 ['a', 'DT', 'a'],
 ['Palestinian', 'JJ', 'Palestinian'],
 ['state', 'NN', 'state'],
 ['...', ':', '...'],
 ['alongside', 'IN', 'alongside'],
 ['Israel', 'NNP', 'Israel'],
 [',', ',', ','],
 ['but', 'CC', 'but'],
 ['left', 'VBN', 'left'],
 ['the', 'DT', 'the'],
 ['status', 'NN', 'status'],
 ['of', 'IN', 'of'],
 ['Jerusalem', 'NNP', 'Jerusalem'],
 ['to', 'TO', 'to'],
 ['be', 'VB', 'be'],
 ['determined', 'VBN', 'determined'],
 ['by', 'IN', 'by'],
 ['...', ':', '...'],
 ['further', 'JJ', 'further'],
 ['negotiation', 'NN', 'negotiation'],
 ['.', '.', '.'],
 ['...', ':', '...'],
 ['...', ':', '...'],
 ['Israel', 'NNP', 'Israel'],
 ['has', 'VBZ', 'has'],
 ['annexed', 'VBD', 'annexed'],
 ['East', 'NNP', 'East'],
 ['Jerusalem', 'NNP', 'Jerusalem'],
 ['and', 'CC', 'and'],
 ['declares', 'VBZ', 'declares'],
 ['it', 'PRP', 'it'],
 ['part', 'NN', 'part'],
 ['of', 'IN', 'of'],
 ['its', 'PRP$', 'its'],
 ['eternal', 'JJ', 'eternal'],
 ['...', ':', '...'],
 ['capital', 'NN', 'capital'],
 ['.', '.', '.'],
 ['...', ':', '...'],
 ['...', ':', '...'],
 ['Palestinians', 'NNPS', 'Palestinian'],
 ['hope', 'NN', 'hope'],
 ['to', 'TO', 'to'],
 ['establish', 'VB', 'establish'],
 ['their', 'PRP$', 'their'],
 ['capital', 'NN', 'capital'],
 ['in', 'IN', 'in'],
 ['the', 'DT', 'the'],
 ['area', 'NN', 'area'],
 ['.', '.', '.'],
 ['...', ':', '...']]

Step 2:

Treat every token whose POS tag starts with "N" as a noun, and also treat every token tagged "JJ" whose word starts with a capital (upper-case) letter as a noun.

Now select all nouns as keywords, treating back-to-back nouns as a single keyword phrase.
Let's see what the above logic extracts.
['Palestinian', 'JJ', 'Palestinian'],   --> POS 'JJ' and the word starts with a capital letter
['theatre', 'NN', 'theatre'],
['Israeli', 'JJ', 'Israeli'],
['police', 'NN', 'police'],
['Palestinian', 'JJ', 'Palestinian'],
['theatre', 'NN', 'theatre'],
['East', 'NNP', 'East'],
['Jerusalem', 'NNP', 'Jerusalem'],
['literature', 'NN', 'literature'],
['festival', 'NN', 'festival'],
['court', 'NN', 'court'],
['order', 'NN', 'order'],
['Palestinian', 'JJ', 'Palestinian'],
['Authority', 'NNP', 'Authority'],
['East', 'NNP', 'East'],
['Jerusalem', 'NNP', 'Jerusalem'],
['British', 'JJ', 'British'],
['consul-general', 'NN', 'consul-general'],
['Richard', 'NNP', 'Richard'],
['Makepeace', 'NNP', 'Makepeace'],
['Mr', 'NNP', 'Mr'],
['Makepeace', 'NNP', 'Makepeace'],
['Palestinian', 'JJ', 'Palestinian'],
['Authority', 'NNP', 'Authority'],
['opening', 'NN', 'opening'],
['event', 'NN', 'event'],
['police', 'NN', 'police'],
['notice', 'NN', 'notice'],
['security', 'NN', 'security'],
['minister', 'NN', 'minister'],
['peace', 'NN', 'peace'],
['accords', 'NNS', 'accord'],
['Palestinian', 'JJ', 'Palestinian'],
['state', 'NN', 'state'],
['East', 'NNP', 'East'],
['Jerusalem', 'NNP', 'Jerusalem'],
['Palestinians', 'NNPS', 'Palestinian'],
['hope', 'NN', 'hope'],

Yes, these are the same multi-word keywords that appear in the output Topica extracted.
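
The Step 2 rule can also be sketched in plain Python, without Topica installed. This is only an approximation of the rule as described in this post, not the library's own source code: the names is_noun_like, extract_terms and min_single_count are hypothetical, and the single-word threshold is an assumption suggested by the output above, where single words appear only with a count of 3 or more. The sample tokens are copied from the Step 1 output (the '...' tokens are left out).

```python
from collections import Counter


def is_noun_like(word, tag):
    """Step 2 rule: any N* tag, or an adjective (JJ) that starts with a capital."""
    return tag.startswith("N") or (tag == "JJ" and word[:1].isupper())


def extract_terms(tagged, min_single_count=3):
    """Return {term: (count, words_in_term)} from (word, tag, norm) triples."""
    counts = Counter()
    lengths = {}
    run = []  # consecutive noun-like tokens collected so far

    def flush():
        # Count every word of the run on its own, by its normalized form ...
        for _, norm in run:
            counts[norm] += 1
            lengths[norm] = 1
        # ... and count the whole run as one phrase if it has several words.
        if len(run) > 1:
            phrase = " ".join(word for word, _ in run)
            counts[phrase] += 1
            lengths[phrase] = len(run)
        del run[:]

    for word, tag, norm in tagged:
        if is_noun_like(word, tag):
            run.append((word, norm))
        else:
            flush()
    flush()  # close a run that reaches the end of the input

    # Keep every phrase; keep a single word only if it occurs often enough
    # (assumed threshold, see the note above).
    return {
        term: (count, lengths[term])
        for term, count in counts.items()
        if lengths[term] > 1 or count >= min_single_count
    }


# First two sentences of the tagged sample text
sample = [
    ("Police", "NNP", "Police"), ("shut", "VBN", "shut"),
    ("Palestinian", "JJ", "Palestinian"), ("theatre", "NN", "theatre"),
    ("in", "IN", "in"), ("Jerusalem", "NNP", "Jerusalem"), (".", ".", "."),
    ("Israeli", "JJ", "Israeli"), ("police", "NN", "police"),
    ("have", "VBP", "have"), ("shut", "VBN", "shut"), ("down", "RB", "down"),
    ("a", "DT", "a"), ("Palestinian", "JJ", "Palestinian"),
    ("theatre", "NN", "theatre"), ("in", "IN", "in"),
    ("East", "NNP", "East"), ("Jerusalem", "NNP", "Jerusalem"), (".", ".", "."),
]

# A lower threshold is used here because the sample is only two sentences long.
terms = extract_terms(sample, min_single_count=2)
print(terms)
```

On this short sample the sketch keeps the phrases 'Palestinian theatre', 'Israeli police' and 'East Jerusalem', plus the single words that occur twice. Edge cases may be handled differently by Topica itself.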

Can we recall our output again?

Output:

[('East Jerusalem', 3, 2),
 ('Palestinian Authority', 2, 2),
 ('Palestinian', 6, 1),
 ('peace accords', 1, 2),
 ('East', 4, 1),
 ('event', 6, 1),
 ('literature festival', 1, 2),
 ('Israel', 4, 1),
 ('security minister', 1, 2),
 ('Israeli police', 1, 2),
 ('Palestinian state', 1, 2),
 ('police notice', 1, 2),
 ('Richard Makepeace', 1, 2),
 ('theatre', 3, 1),
 ('Palestinians hope', 1, 2),
 ('Palestinian theatre', 2, 2),
 ('British consul-general', 1, 2),
 ('British Council', 1, 2),
 ('court order', 1, 2),
 ('Mr Makepeace', 1, 2),
 ('Israeli authorities', 1, 2),
 ('opening event', 1, 2),
 ('Jerusalem', 8, 1)]

Now, what are the numbers that come with each keyword?
  1. The first number is how many times the keyword appears in the document as a noun.
  2. The second number is how many words make up the keyword phrase.
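
Once you know what the two numbers mean, the result is easy to work with as ordinary Python tuples. The snippet below is a small sketch that uses the output from this post as a literal list, so it runs without Topica; the variable names phrases, ranked and top5 are just illustrative.

```python
# Output of extractor(text) from above: (keyword, occurrences, words_in_keyword)
keywords_topica = [
    ('East Jerusalem', 3, 2), ('Palestinian Authority', 2, 2),
    ('Palestinian', 6, 1), ('peace accords', 1, 2), ('East', 4, 1),
    ('event', 6, 1), ('literature festival', 1, 2), ('Israel', 4, 1),
    ('security minister', 1, 2), ('Israeli police', 1, 2),
    ('Palestinian state', 1, 2), ('police notice', 1, 2),
    ('Richard Makepeace', 1, 2), ('theatre', 3, 1),
    ('Palestinians hope', 1, 2), ('Palestinian theatre', 2, 2),
    ('British consul-general', 1, 2), ('British Council', 1, 2),
    ('court order', 1, 2), ('Mr Makepeace', 1, 2),
    ('Israeli authorities', 1, 2), ('opening event', 1, 2),
    ('Jerusalem', 8, 1),
]

# Keep only multi-word phrases (second number greater than 1)
phrases = [kw for kw, count, n_words in keywords_topica if n_words > 1]

# Rank by number of occurrences, then by phrase length, highest first
ranked = sorted(keywords_topica, key=lambda item: (item[1], item[2]), reverse=True)
top5 = [kw for kw, count, n_words in ranked[:5]]

print(phrases)
print(top5)
```

Sorting on the first number puts the most frequent keywords at the top, while filtering on the second number separates names and phrases from single words.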

Conclusion:

In this tutorial you learned:
  • What a keyword is
  • Why extract keywords
  • Setting up Topica for Python
  • Extracting keywords automatically using Topica in Python
  • What is happening in the background of Topica (how Topica works)
