NLTK is one of the most popular NLP packages available for Python. It can be used for NLP tasks ranging from basic to advanced.
An important point is that NLTK needs data (corpora and trained models) for most NLP tasks. Without that data, many NLTK functions cannot run.
For example, suppose you are trying to do POS tagging (one NLP task) with the following code:
from nltk.tokenize import word_tokenize
from nltk import pos_tag

s = 'There is a problem with Traffic Light'
tokens = word_tokenize(s)  # Generate list of tokens
tokens_pos = pos_tag(tokens)
print(tokens_pos)
If you do not have the required corpus (data), you will get a LookupError like:
LookupError:
**********************************************************************
Resource u'taggers/averaged_perceptron_tagger/averaged_perceptron_tagger.pickle' not found.
Please use the NLTK Downloader to obtain the resource: >>> nltk.download()
Searched in:
- 'C:\Users\anindya/nltk_data'
- 'C:\nltk_data'
- 'D:\nltk_data'
- 'E:\nltk_data'
- 'C:\Users\anindya\Anaconda2\nltk_data'
- 'C:\Users\anindya\Anaconda2\lib\nltk_data'
- 'C:\Users\anindya\AppData\Roaming\nltk_data'
**********************************************************************
How to find the required corpus name by looking at the error?
The first line of the LookupError has that information. Let's look at it again:
Resource 'taggers/averaged_perceptron_tagger/averaged_perceptron_tagger.pickle' not found.
It says:
The “averaged_perceptron_tagger.pickle” resource is required to run my script. It should be inside the taggers/averaged_perceptron_tagger folder (folder “averaged_perceptron_tagger” inside folder “taggers”). The package to download is therefore “averaged_perceptron_tagger”.
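Reading the resource path by eye works fine, but the same step can be sketched in code. The helper below is hypothetical (it is not part of NLTK) and assumes the error message has the quoted `Resource '...' not found` form shown in this article; other NLTK versions may format the message differently.

```python
import re


def resource_from_lookup_error(message):
    """Extract the missing resource path from a LookupError message.

    Hypothetical helper: assumes the message contains a quoted path in the
    form  Resource u'category/package/file' not found  as shown above.
    Returns None when no such pattern is present.
    """
    match = re.search(
        r"Resource\s+u?['\"‘’]([^'\"‘’]+)['\"‘’]\s+not found", message
    )
    if match is None:
        return None
    path = match.group(1)
    parts = path.split("/")
    # The first path component is the category folder (e.g. "taggers"),
    # the second is the package name to download.
    category = parts[0] if len(parts) > 1 else None
    package = parts[1] if len(parts) > 1 else parts[0]
    return {"path": path, "category": category, "package": package}


message = (
    "Resource u'taggers/averaged_perceptron_tagger/"
    "averaged_perceptron_tagger.pickle' not found."
)
info = resource_from_lookup_error(message)
print(info["category"], info["package"])
```

The `package` value is the name to pass to the downloader, and `category` is the folder it belongs in under nltk_data.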
How to download an NLTK corpus from Python?
There are three ways to download NLTK corpora automatically:
- By GUI (select the corpus name in the GUI to download it)
- By corpus name
- Download all corpora
By GUI
Type the following code in Python:
import nltk
nltk.download()
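A window called “NLTK Downloader” should pop up.
Click on the “Corpora” tab (taggers such as averaged_perceptron_tagger are listed under “Models”), select the package you need, and click “Download”.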
Download by NLTK corpus name:
import nltk
nltk.download('averaged_perceptron_tagger')
Download all NLTK corpora:
import nltk
nltk.download('all')
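Downloading everything can take a long time, so it may be preferable to fetch a package only when it is actually missing. The following is a minimal sketch of that idea. `download_if_missing` is a hypothetical helper name, and its `find` and `download` parameters exist only so the logic can be tried with stand-ins; by default it uses `nltk.data.find` and `nltk.download`. The resource names are the ones from this article, so use whatever name your own error message shows.

```python
def download_if_missing(resource_path, package_id, find=None, download=None):
    """Download an NLTK package only if its resource cannot be found.

    Hypothetical helper. `find` and `download` default to nltk.data.find
    and nltk.download; they are parameters only so that stand-ins can be
    passed in. Returns True if the resource is available afterwards.
    """
    if find is None or download is None:
        import nltk  # imported lazily, only when the defaults are needed

        find = find or nltk.data.find
        download = download or nltk.download
    try:
        find(resource_path)  # raises LookupError when the data is absent
        return True
    except LookupError:
        # nltk.download returns True on success and False on failure
        return bool(download(package_id))
```

With NLTK installed, a call such as `download_if_missing('taggers/averaged_perceptron_tagger', 'averaged_perceptron_tagger')` at the top of a script should avoid both the LookupError and repeated downloads.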
Sometimes, if you are doing all this on an office computer, you might get an error like:
[nltk_data] Error loading all: <urlopen error [WinError 10060] A
[nltk_data] connection attempt failed because the connected party
[nltk_data] did not properly respond after a period of time or
[nltk_data] established connection failed because connected host
[nltk_data] has failed to respond>
False
This usually happens because a proxy or firewall is blocking the connection, so Python cannot reach the download server. In this situation you have to download the NLTK corpus manually.
How to download an NLTK corpus manually?
Step 1:
Go to http://www.nltk.org/nltk_data/, search for “tagger”, and download “averaged_perceptron_tagger”.
If you unzip the downloaded file, you can see that the “averaged_perceptron_tagger” folder contains “averaged_perceptron_tagger.pickle”, which is the required resource.
Step 2 (find the folder to move it to):
Recall our error once again:
LookupError:
**********************************************************************
Resource u'taggers/averaged_perceptron_tagger/averaged_perceptron_tagger.pickle' not found.
Please use the NLTK Downloader to obtain the resource: >>> nltk.download()
Searched in:
- 'C:\Users\anindya/nltk_data'
- 'C:\nltk_data'
- 'D:\nltk_data'
- 'E:\nltk_data'
- 'C:\Users\anindya\Anaconda2\nltk_data'
- 'C:\Users\anindya\Anaconda2\lib\nltk_data'
- 'C:\Users\anindya\AppData\Roaming\nltk_data'
**********************************************************************
Choose any of the folders listed above. I am choosing 'C:\nltk_data'.
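If the error text has been saved (for example, copied from a log), the candidate folders can also be pulled out of it programmatically. This is only a sketch: `searched_directories` is a hypothetical helper, and it assumes the “Searched in:” list looks like the one in this article, with one quoted path per line after a dash.

```python
import re


def searched_directories(message):
    """Return the folders listed under 'Searched in:' in a LookupError message.

    Hypothetical helper: assumes each folder appears on its own line as a
    dash followed by a quoted path, as in the error shown above. Paths are
    returned exactly as they are written in the message.
    """
    # Only look at the text that follows the "Searched in:" marker.
    _, _, tail = message.partition("Searched in:")
    return re.findall(
        r"^\s*[-–]\s*['\"‘’](.+?)['\"‘’]\s*$", tail, flags=re.MULTILINE
    )


example = "\n".join(
    [
        "Resource 'taggers/averaged_perceptron_tagger' not found.",
        "Searched in:",
        r"  - 'C:\nltk_data'",
        r"  - 'D:\nltk_data'",
        "*" * 70,
    ]
)
for folder in searched_directories(example):
    print(folder)
```

Any one of the returned folders is a valid place for the nltk_data directory.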
So let's create a folder called nltk_data on the C drive.
Then create a folder called “taggers” inside the nltk_data folder, as the first line of the error indicates.
Note:
The “averaged_perceptron_tagger.pickle” resource is required to run the script above. It should be inside the taggers/averaged_perceptron_tagger folder (folder “averaged_perceptron_tagger” inside folder “taggers”).
Now copy “averaged_perceptron_tagger.zip” (without unzipping it) into the “taggers” folder.
Full path:
C:\nltk_data\taggers\averaged_perceptron_tagger.zip
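The folder creation and copy steps above can also be done with a few lines of standard-library Python. This is a minimal sketch, not an NLTK feature: `place_nltk_zip` is a hypothetical helper, and the Downloads path in the commented usage is an assumption about where the browser saved the file.

```python
import shutil
from pathlib import Path


def place_nltk_zip(zip_path, nltk_data_dir, category):
    """Copy a manually downloaded NLTK zip into <nltk_data_dir>/<category>/.

    Hypothetical helper: creates the folders if they do not exist, copies
    the zip file without unzipping it, and returns the destination path.
    """
    zip_path = Path(zip_path)
    target_dir = Path(nltk_data_dir) / category
    target_dir.mkdir(parents=True, exist_ok=True)
    destination = target_dir / zip_path.name
    shutil.copy2(zip_path, destination)
    return destination


# Example usage (the source path is an assumption, adjust it to your system):
# place_nltk_zip(
#     r"C:\Users\anindya\Downloads\averaged_perceptron_tagger.zip",
#     r"C:\nltk_data",
#     "taggers",
# )
```

The same helper should work for other packages by changing the category, for example “tokenizers” for punkt.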
That’s all!!
This is how you can deal with any other NLTK data related issue.
Now you can test with the same code that we tried at the beginning of this page.
from nltk.tokenize import word_tokenize
from nltk import pos_tag

s = 'There is a problem with Traffic Light'
tokens = word_tokenize(s)  # Generate list of tokens
tokens_pos = pos_tag(tokens)
print(tokens_pos)
This time it should work as expected.
[('There', 'EX'), ('is', 'VBZ'), ('a', 'DT'), ('problem', 'NN'), ('with', 'IN'), ('Traffic', 'NNP'), ('Light', 'NNP')]
Do you have any questions?
Ask your question in the comments below and I will do my best to answer.
This site can’t be reached
raw.githubusercontent.com took too long to respond.
This is happening when I try to download punkt manually.
Maybe you are using your office computer, where an antivirus or firewall is blocking access to that site. Try it on your personal computer.
I am using my personal laptop and I am still getting this error: “This site can’t be reached
raw.githubusercontent.com took too long to respond.”