How to download NLTK corpus manually

NLTK is one of the most popular NLP packages available for Python. It can be used to solve all kinds of NLP tasks, from basic to advanced. The important thing to know is that NLTK requires a lot of data (corpora) to perform any NLP task. Without that data, NLTK can't do anything.


For example, suppose you are trying to do POS tagging (one common NLP task) with the following code.

from nltk.tokenize import word_tokenize
from nltk import pos_tag

s = 'There is a problem with Traffic Light'
tokens = word_tokenize(s) # Generate list of tokens
tokens_pos = pos_tag(tokens)
print(tokens_pos)

If you do not have the required corpus (data), you will get a LookupError like the one below:

LookupError:
**********************************************************************
  Resource u'taggers/averaged_perceptron_tagger/averaged_perceptron_tagger.pickle' not found.
  Please use the NLTK Downloader to obtain the resource:  >>> nltk.download()
  Searched in:
    - 'C:\Users\anindya/nltk_data'
    - 'C:\nltk_data'
    - 'D:\nltk_data'
    - 'E:\nltk_data'
    - 'C:\Users\anindya\Anaconda2\nltk_data'
    - 'C:\Users\anindya\Anaconda2\lib\nltk_data'
    - 'C:\Users\anindya\AppData\Roaming\nltk_data'
**********************************************************************

How to find the required corpus name from the error?

The first line of the LookupError contains that information. Let's look back at the first line.

Resource u'taggers/averaged_perceptron_tagger/averaged_perceptron_tagger.pickle' not found.

It says:

The "averaged_perceptron_tagger.pickle" corpus is required to execute the script, and it should be inside the taggers/averaged_perceptron_tagger folder (a folder "averaged_perceptron_tagger" inside a folder "taggers").
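As a quick sketch, you can pull the missing resource path out of a LookupError message with plain string handling. The helper name and the regular expression below are my own, not part of NLTK:

```python
import re

# Hypothetical helper: extract the missing resource path from an
# NLTK LookupError message like the one shown above.
def missing_resource(error_text):
    match = re.search(r"Resource u?'([^']+)' not found", error_text)
    return match.group(1) if match else None

msg = "Resource u'taggers/averaged_perceptron_tagger/averaged_perceptron_tagger.pickle' not found."
path = missing_resource(msg)
print(path)                # full resource path
print(path.split('/')[0])  # 'taggers' - the top-level folder under nltk_data
```

The first path component tells you which subfolder of nltk_data the resource belongs in.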

How to download NLTK corpus from Python?

There are mainly three ways to download an NLTK corpus automatically:

  •     By GUI (select the corpus name in the GUI to download it)
  •     By corpus name
  •     Download all corpora

By GUI

Type the following code in Python to download NLTK corpora using the default GUI application.

import nltk
nltk.download()

After executing the above Python code, a window called "NLTK Downloader" should pop up.

[Screenshot: NLTK Downloader window opened from Python code]

Click on the corpus you need in the list to download it.

Download by NLTK corpus name:

You can also download a specific NLTK corpus by executing the Python code below.

import nltk
nltk.download('averaged_perceptron_tagger')

Output:

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Anindya\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger.zip.
True
Download all NLTK corpora:

Now, if you are not sure which corpus you need for your NLTK project, you can download the entire collection of NLTK corpora using the Python code below.

import nltk
nltk.download('all')
[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to
[nltk_data]    |     C:\Users\Anindya\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping corpora\abc.zip.
[nltk_data]    | Downloading package alpino to
[nltk_data]    |     C:\Users\Anindya\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping corpora\alpino.zip.
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     C:\Users\Anindya\AppData\Roaming\nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger is already up-
[nltk_data]    |       to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger_ru to
[nltk_data]    |     C:\Users\Anindya\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping
[nltk_data]    |       taggers\averaged_perceptron_tagger_ru.zip.
[nltk_data]    | Downloading package basque_grammars to
[nltk_data]    |     C:\Users\Anindya\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping grammars\basque_grammars.zip.
[nltk_data]    | Downloading package bcp47 to
[nltk_data]    |     C:\Users\Anindya\AppData\Roaming\nltk_data...
[nltk_data]    | Downloading package biocreative_ppi to
[nltk_data]    |     C:\Users\Anindya\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping corpora\biocreative_ppi.zip.
[nltk_data]    | Downloading package bllip_wsj_no_aux to
[nltk_data]    |     C:\Users\Anindya\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping models\bllip_wsj_no_aux.zip.
[nltk_data]    | Downloading package book_grammars to
[nltk_data]    |     C:\Users\Anindya\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping grammars\book_grammars.zip.
[nltk_data]    | Downloading package brown to
[nltk_data]    |     C:\Users\Anindya\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping corpora\brown.zip.
[nltk_data]    | Downloading package brown_tei to
[nltk_data]    |     C:\Users\Anindya\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping corpora\brown_tei.zip.
[nltk_data]    | Downloading package cess_cat to
[nltk_data]    |     C:\Users\Anindya\AppData\Roaming\nltk_data...
[nltk_data]    |   Unzipping corpora\cess_cat.zip.
...
...
...

There are many NLTK corpora; the output above shows only some of them.

So you now understand how to download NLTK corpora automatically using Python code. But if you are doing all this on your office computer, you might get an error like:

[nltk_data] Error loading all: <urlopen error [WinError 10060] A
[nltk_data]     connection attempt failed because the connected party
[nltk_data]     did not properly respond after a period of time or
[nltk_data]     established connection failed because connected host
[nltk_data]     has failed to respond>
False

This happens because of proxy settings: you will not be able to download anything through Python. In this situation, you have to download the NLTK corpus (or data, for example stopwords) manually.
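If the problem really is just the proxy, NLTK also ships a nltk.set_proxy() helper you can try before resorting to a manual download. The proxy URL below is a placeholder; use your organisation's actual proxy address:

```python
import nltk

# Placeholder proxy address - replace with your office proxy host and port.
nltk.set_proxy('http://proxy.example.com:3128')

# Subsequent downloads are then routed through the proxy, e.g.:
# nltk.download('averaged_perceptron_tagger')
```

If the proxy needs authentication, set_proxy also accepts user and password arguments.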


Manually download NLTK corpus

I will divide the entire process of downloading an NLTK corpus manually into two steps:

Step 1: Download the exact NLTK corpus manually

In this manual process you need to know which NLTK corpus you want to download. For this example I am going to download "averaged_perceptron_tagger". To do that, go to http://www.nltk.org/nltk_data/, search for "tagger", and download "averaged_perceptron_tagger".

[Screenshot: downloading the corpus manually from the NLTK data page]

Now, if you unzip the downloaded file, you can see that the required "averaged_perceptron_tagger.pickle" corpus is inside the "averaged_perceptron_tagger" folder.
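If you prefer to verify the archive from Python instead of unzipping it by hand, the standard-library zipfile module can list its contents. The helper below is just an illustrative sketch:

```python
import os
import zipfile

def archive_members(zip_path):
    """Return the list of files inside a corpus zip, or None if the zip is missing."""
    if not os.path.exists(zip_path):
        return None
    with zipfile.ZipFile(zip_path) as zf:
        return zf.namelist()

# For the downloaded archive, archive_members('averaged_perceptron_tagger.zip')
# should include 'averaged_perceptron_tagger/averaged_perceptron_tagger.pickle'.
```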

[Screenshot: contents of the downloaded zip file]

Step 2: Find the folder to move it to

Let’s recall our error once again.

LookupError:
**********************************************************************
  Resource u'taggers/averaged_perceptron_tagger/averaged_perceptron_tagger.pickle' not found.
  Please use the NLTK Downloader to obtain the resource:  >>> nltk.download()
  Searched in:
    - 'C:\Users\anindya/nltk_data'
    - 'C:\nltk_data'
    - 'D:\nltk_data'
    - 'E:\nltk_data'
    - 'C:\Users\anindya\Anaconda2\nltk_data'
    - 'C:\Users\anindya\Anaconda2\lib\nltk_data'
    - 'C:\Users\anindya\AppData\Roaming\nltk_data'
**********************************************************************

Choose any of the folders listed in the error. I'm choosing 'C:\nltk_data'.
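You can also print the same search list programmatically: nltk.data.path holds the directories NLTK searches, in order, and any of them is a valid place for the manual copy.

```python
import nltk

# NLTK looks for data in each of these directories, in order.
for p in nltk.data.path:
    print(p)
```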

So let's create a folder called nltk_data inside the C:\ drive, and then create a folder called "taggers" inside the nltk_data folder, as the first line of the error says:

Note: The "averaged_perceptron_tagger.pickle" corpus is required to execute the above script, and it should be inside the taggers/averaged_perceptron_tagger folder (a folder "averaged_perceptron_tagger" inside a folder "taggers").

Now just copy and paste "averaged_perceptron_tagger.zip" (without unzipping it) inside the "taggers" folder.

nltk_data —–> taggers —–> averaged_perceptron_tagger.zip
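The copy step can also be scripted with only the standard library. This is a minimal sketch; the function name and the example paths are assumptions, so adjust them to your machine:

```python
import os
import shutil

def install_corpus(zip_path, nltk_data_dir, subfolder):
    """Copy a downloaded corpus zip (without unzipping) into
    <nltk_data_dir>/<subfolder>, creating the folders if needed."""
    dest_dir = os.path.join(nltk_data_dir, subfolder)
    os.makedirs(dest_dir, exist_ok=True)
    return shutil.copy(zip_path, dest_dir)

# Example (paths are illustrative):
# install_corpus(r'C:\Downloads\averaged_perceptron_tagger.zip',
#                r'C:\nltk_data', 'taggers')
```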

That’s all!!

This is how you can deal with any other NLTK data related issue and download the data manually. Now you can test with the same code we were running at the beginning of this article.

from nltk.tokenize import word_tokenize
from nltk import pos_tag

s = 'There is a problem with Traffic Light'
tokens = word_tokenize(s) # Generate list of tokens
tokens_pos = pos_tag(tokens)
print(tokens_pos)

This time it should work as expected. Below is the output for the above Python code.

[('There', 'EX'), ('is', 'VBZ'), ('a', 'DT'), ('problem', 'NN'), ('with', 'IN'), ('Traffic', 'NNP'), ('Light', 'NNP')]

Wrap-up

I faced an NLTK corpus related error while working on my office computer, so I thought I would write an article about how to solve it.


In this article, I first showed different techniques for downloading NLTK corpora automatically, and then I showed how you can download and place an NLTK corpus manually, without any code.

In the manual process, you first download a specific NLTK corpus from the NLTK data archive in zip format, then place that corpus file inside a specific folder.

This is it for this article. If you have any questions or suggestions regarding this article, please let me know in the comment section below.
