How to download NLTK corpus manually

NLTK is one of the most popular NLP packages available for Python. It can be used for NLP tasks ranging from basic to advanced.
An important point is that NLTK needs data (corpora and trained models) for most NLP tasks. Without that data, those tasks will fail.
For example, suppose you try to do POS tagging (one NLP task) with the following code:
from nltk.tokenize import word_tokenize
from nltk import pos_tag

s = 'There is a problem with Traffic Light'
tokens = word_tokenize(s) # Generate list of tokens
tokens_pos = pos_tag(tokens)
print(tokens_pos)
If you do not have the required corpus (data), you will get a LookupError like this:

LookupError:
**********************************************************************
  Resource u'taggers/averaged_perceptron_tagger/averaged_perceptron_tagger.pickle' not found.
  Please use the NLTK Downloader to obtain the resource: >>> nltk.download()
  Searched in:
    - 'C:\Users\anindya/nltk_data'
    - 'C:\nltk_data'
    - 'D:\nltk_data'
    - 'E:\nltk_data'
    - 'C:\Users\anindya\Anaconda2\nltk_data'
    - 'C:\Users\anindya\Anaconda2\lib\nltk_data'
    - 'C:\Users\anindya\AppData\Roaming\nltk_data'
**********************************************************************
How to find the required corpus name by looking at the error?
The first line of the LookupError has that information. Let’s look at it again.
Resource 'taggers/averaged_perceptron_tagger/averaged_perceptron_tagger.pickle' not found.
It says:
The “averaged_perceptron_tagger.pickle” resource is required to run the script, and it should be inside the taggers/averaged_perceptron_tagger folder (folder “averaged_perceptron_tagger” inside folder “taggers”).
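Reading the folder and file name out of the error can also be done in code. The following is only a sketch: `parse_missing_resource` is a hypothetical helper, not part of NLTK, and the pattern assumes the message format shown in this article (other NLTK versions may word the error differently).

```python
import re

# Matches the message format shown above, e.g.
#   Resource u'taggers/.../file.pickle' not found.
# Assumption: other NLTK versions may phrase this line differently.
RESOURCE_RE = re.compile(r"Resource\s+u?['\"]([^'\"]+)['\"]\s+not found")


def parse_missing_resource(message):
    """Return (folder, filename) from a LookupError message, or None."""
    match = RESOURCE_RE.search(message)
    if match is None:
        return None
    # Split at the last "/" so the folder part keeps its sub-folders.
    folder, _, filename = match.group(1).rpartition("/")
    return folder, filename


message = (
    "Resource u'taggers/averaged_perceptron_tagger/"
    "averaged_perceptron_tagger.pickle' not found."
)
folder, filename = parse_missing_resource(message)
print(folder)    # taggers/averaged_perceptron_tagger
print(filename)  # averaged_perceptron_tagger.pickle
```

The folder part tells you where the data must end up inside nltk_data, which is what the manual steps later on this page rely on.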
How to download NLTK corpus from Python?
There are three ways to download an NLTK corpus from Python:
  • By GUI (select the corpus name in the GUI to download it)
  • By corpus name
  • Download all corpora
By GUI
Run the following code in Python:

import nltk
nltk.download()
A window called “NLTK Downloader” should pop up.
Click on “Corpora” and select the corpus name to download.

Download by NLTK corpus name:

import nltk
nltk.download('averaged_perceptron_tagger')  # use the package name, without the .pickle extension

Download all NLTK corpus:

import nltk
nltk.download('all')
Sometimes, if you are doing all this on your office computer, you might get an error like:
[nltk_data] Error loading all: <urlopen error [WinError 10060] A
[nltk_data]     connection attempt failed because the connected party
[nltk_data]     did not properly respond after a period of time or
[nltk_data]     established connection failed because connected host
[nltk_data]     has failed to respond>
False
This is usually because of the proxy settings on the office network: Python cannot reach the download server, so nothing can be downloaded through it. In this situation you have to download the NLTK corpus manually.
How to download NLTK corpus manually?
Step 1:
Go to http://www.nltk.org/nltk_data/, search for “tagger” and download “averaged_perceptron_tagger”.

If you unzip the downloaded file, you can see that the “averaged_perceptron_tagger” folder contains “averaged_perceptron_tagger.pickle”, which is the required file.
 
Step 2 (Find folder where to move):
Recall our error once again.
LookupError:
**********************************************************************
  Resource u'taggers/averaged_perceptron_tagger/averaged_perceptron_tagger.pickle' not found.
  Please use the NLTK Downloader to obtain the resource: >>> nltk.download()
  Searched in:
    - 'C:\Users\anindya/nltk_data'
    - 'C:\nltk_data'
    - 'D:\nltk_data'
    - 'E:\nltk_data'
    - 'C:\Users\anindya\Anaconda2\nltk_data'
    - 'C:\Users\anindya\Anaconda2\lib\nltk_data'
    - 'C:\Users\anindya\AppData\Roaming\nltk_data'
**********************************************************************
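The folders under “Searched in” can also be listed without triggering the error. A minimal sketch, assuming NLTK is installed: `nltk.data.path` is the list of folders NLTK searches, and the exists/missing check is added here only for illustration.

```python
import os

import nltk

# nltk.data.path holds the folders NLTK searches for data, in order.
for folder in nltk.data.path:
    status = "exists" if os.path.isdir(folder) else "missing"
    print(f"{folder} ({status})")
```

Any folder from this list can be used as the target for the manual copy; the entries will differ from machine to machine.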
Choose any of the above folders. I’m choosing 'C:\nltk_data'.
So let’s create a folder called nltk_data inside the C drive.
Then create a folder called “taggers” inside the nltk_data folder, as the first line of the error says.
Note:
The “averaged_perceptron_tagger.pickle” resource is required to run the above script, and it should be inside the taggers/averaged_perceptron_tagger folder (folder “averaged_perceptron_tagger” inside folder “taggers”).
Now just copy and paste “averaged_perceptron_tagger.zip” (without unzipping) inside the “taggers” folder.

Full path:
nltk_data → taggers → averaged_perceptron_tagger.zip
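The folder creation and copy can also be scripted with the standard library. This is a sketch under stated assumptions: `install_package_zip` is a hypothetical helper name, and the Downloads location is only a guess at where the browser saved the file. The target folder and the “taggers” sub-folder are the ones chosen above.

```python
import shutil
from pathlib import Path


def install_package_zip(zip_path, nltk_data_dir, category):
    """Copy a downloaded package zip into <nltk_data_dir>/<category>/."""
    target_dir = Path(nltk_data_dir) / category
    # Creates nltk_data and the category folder (e.g. taggers) if missing.
    target_dir.mkdir(parents=True, exist_ok=True)
    # The zip is copied as-is, without unzipping.
    return Path(shutil.copy2(zip_path, target_dir))


if __name__ == "__main__":
    # Assumed download location; adjust to where the zip was actually saved.
    downloaded = Path.home() / "Downloads" / "averaged_perceptron_tagger.zip"
    if downloaded.exists():
        print(install_package_zip(downloaded, r"C:\nltk_data", "taggers"))
    else:
        print("Zip not found:", downloaded)
```

Passing a different category, such as the folder named in another LookupError, would place other packages the same way.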
That’s all!
This is how you can deal with any other NLTK data related issue.
Now you can test with the same code we used at the beginning of this page.
from nltk.tokenize import word_tokenize
from nltk import pos_tag

s = 'There is a problem with Traffic Light'
tokens = word_tokenize(s) # Generate list of tokens
tokens_pos = pos_tag(tokens)
print(tokens_pos)
This time it should work as expected.
[('There', 'EX'), ('is', 'VBZ'), ('a', 'DT'), ('problem', 'NN'), ('with', 'IN'), ('Traffic', 'NNP'), ('Light', 'NNP')]
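To check whether NLTK can see a resource without running the whole script, something like the following can be used. It is a minimal sketch: `is_installed` is a hypothetical helper, while `nltk.data.find` is the NLTK call that raises LookupError when a resource is missing. The resource path is the folder named in the error above.

```python
import nltk


def is_installed(resource_path):
    """Return True if NLTK can locate the resource on its search path."""
    try:
        nltk.data.find(resource_path)
        return True
    except LookupError:
        return False


# Path taken from the error message in this article.
print(is_installed("taggers/averaged_perceptron_tagger"))
```

If this prints False after the manual copy, the zip is most likely in a folder that NLTK does not search, or in the wrong sub-folder.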
Do you have any questions?
Ask in the comments below and I will do my best to answer.

4 thoughts on “How to download NLTK corpus manually”

  1. Mauli Bhalotiya

    This site can’t be reached
    raw.githubusercontent.com took too long to respond.
    This is happening when I try to download punkt manually.

  2. I am using my personal laptop and I am still getting this error: “This site can’t be reached
    raw.githubusercontent.com took too long to respond.”
