How To Remove Stop words In Python

Stop word removal is a basic but important step in most Natural Language Processing tasks. In this tutorial I will walk you through different techniques for removing stop words from text in Python.

What are stop words in NLP?

Stop words are usually the most common words in a language (e.g. a, the, is, shall). There is no standard list of stop words.
The list needs to be built to suit your own requirements.
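
To make the idea concrete before turning to the libraries, here is a minimal sketch of stop word removal with a hand-made list in plain Python. The list `CUSTOM_STOP_WORDS` and the helper `remove_stop_words` are hypothetical names chosen for this illustration, and the regex tokenizer is a simplifying assumption, not what any of the libraries below do internally.

```python
import re

# Hypothetical, task-specific stop list (an assumption for illustration).
CUSTOM_STOP_WORDS = {"a", "is", "of", "the", "this", "to", "which"}

def remove_stop_words(text, stop_words=CUSTOM_STOP_WORDS):
    """Return the words of `text` that are not in `stop_words`."""
    # Lowercase, then keep runs of letters and apostrophes as tokens.
    tokens = re.findall(r"[a-z']+", text.lower())
    return [tok for tok in tokens if tok not in stop_words]

result = remove_stop_words("This is a sample sentence, to explain filtration of stopwords")
print(result)
```

Keeping the list in a set makes each membership check cheap, and passing it as a parameter makes it easy to swap in a different list per task.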

In this article I will show you different techniques of stop word removal in Python:

  1. How to remove stop words using Spacy
  2. How to remove stop words using NLTK
  3. How to remove stop words using Gensim
  4. How to add words to or remove words from the default stop word list in Spacy, NLTK and Gensim

Remove Stop Words Python Spacy

To remove stop words using Spacy you need to install Spacy along with one of its models (I am using the small English model).
Commands to install Spacy with its small model:

$ pip install -U spacy
$ python -m spacy download en_core_web_sm

Now let’s see how to remove stop words from text in Python with Spacy.

import spacy
import en_core_web_sm

nlp = en_core_web_sm.load()

# Sample text
txt = 'This is a sample sentence, to explain filtration of stopwords, which is part of text normalization'

# Convert text into spacy formatted document
doc = nlp(txt)

clean_token = []
for token in doc:
    if not (token.is_stop):
        clean_token.append(token.text)
        
print('Before:-------')
print(doc,'\n')

# Join sentence without stop words and print
print('After:-------')
' '.join(clean_token)
Before:-------
This is a sample sentence, to explain filtration of stopwords, which is part of text normalization 

After:-------
'sample sentence , explain filtration stopwords , text normalization'

The stop word removal above works by dropping every token that appears in Spacy’s default stop word list (the Spacy stopword dictionary). Now let’s look at the complete default list.

# Print complete list of stop words dictionary of Spacy
# Spacy stopword dictionary
all_stopwords = nlp.Defaults.stop_words
all_stopwords
{"'d",
 "'ll",
 "'m",
 "'re",
 "'s",
 "'ve",
 'a',
 'about',
 'above',
  .....
  .....

Adding Stop Words in Spacy Default Stop Word List

In the example above, the words “sample” and “explain” were not removed, because they are not in Spacy’s default stop word list. Now let’s say you also want to eliminate those words from your text during stop word removal.

In this case you can add those words to the Spacy default stop word list, using the code below.

# Add multiple stopwords to spacy stopword list
nlp = en_core_web_sm.load()
nlp.Defaults.stop_words |= {'explain','sample'}

# load text again to spacy doc
doc = nlp(txt)

# Print text without stopwords
no_stop_txt = []
for token in doc:
    if not (token.is_stop):
        no_stop_txt.append(token.text)
        
print('Before:-------')
print(doc,'\n')

# Join tokens without stop words and print
print('After:-------')
' '.join(no_stop_txt)
Raw text:-------
This is a sample sentence, to explain filtration of stopwords, which is part of text normalization 

After Default Stop word removal Spacy:-------
'sample sentence , explain filtration stopwords , text normalization'

After Custom Stop word removal Spacy:-------
'sentence , filtration stopwords , text normalization'

Removing Stop Words from Spacy Default Stop Word List

Now let’s say you want to keep the words “which” and “this” after stop word removal. In this case you need to remove those words from the default Spacy stop word list, using the code below:

# Remove multiple stopwords at once from spacy stopword list
nlp = en_core_web_sm.load()
nlp.Defaults.stop_words -= {"which", "this"}

# load text again to spacy doc
doc = nlp(txt)

# Print text without stopwords
no_stop_txt = []
for token in doc:
    if not (token.is_stop):
        no_stop_txt.append(token.text)
        
print('Before:-------')
print(doc,'\n')

# Join tokens without stop words and print
print('After:-------')
' '.join(no_stop_txt)
Raw text:-------
This is a sample sentence, to explain filtration of stopwords, which is part of text normalization 

After Default Stop word removal Spacy:-------
'sample sentence , explain filtration stopwords , text normalization'

After Custom Stop word removal Spacy:-------
'This sentence , filtration stopwords , which text normalization'
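
Notice that “sample” and “explain” are still missing from the last output, even though the model was loaded again. This suggests the stop word set is shared state that `|=` and `-=` modify in place, so earlier changes persist for the rest of the session. The sketch below illustrates that behavior in plain Python; `FakeDefaults` and `FakePipeline` are hypothetical stand-ins, not Spacy classes.

```python
class FakeDefaults:
    # Hypothetical stand-in for a library's class-level stop word set.
    stop_words = {"a", "is", "this", "which"}

class FakePipeline:
    # Every pipeline object points at the same Defaults class.
    Defaults = FakeDefaults

first = FakePipeline()
first.Defaults.stop_words |= {"sample"}   # in-place update of the shared set

second = FakePipeline()                   # a "fresh" object
second.Defaults.stop_words -= {"which"}   # another in-place update

print("sample" in second.Defaults.stop_words)  # the earlier addition is still there
print("which" in first.Defaults.stop_words)    # the later removal is visible here too
```

If you need the original list back, one option is to copy it with `set(...)` before making any changes and filter against your own copy.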

You have seen how to remove stop words using Spacy. Now let me show you how to remove stop words with NLTK, one of the most popular Python libraries for Natural Language Processing.

Stopword Removal using NLTK

To remove stop words using NLTK in Python, you need to install NLTK and its datasets. To download the required NLTK datasets from within Python, use the code below:

# For stopword removal
import nltk
nltk.download('stopwords')
# For tokenization
nltk.download('punkt')

As with Spacy, let’s first look at the entire NLTK stopwords list using the code below:

from nltk.corpus import stopwords
stop_word = set(stopwords.words('english'))

# Print complete list of stop words dictionary of NLTK
stop_word
{'a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
  ....
  ....

Now let’s remove stop words with NLTK using the code below:

txt = 'This is a sample sentence, to explain filtration of stopwords, which is part of text normalization'

# Print sentence after stopword removal
' '.join([i for i in txt.lower().split() if i not in stop_word])
Before:-------
This is a sample sentence, to explain filtration of stopwords, which is part of text normalization 

After:-------
'sample sentence, explain filtration stopwords, part text normalization'
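
Note that `split()` leaves punctuation attached to words (“sentence,” in the output above), so a stop word followed by a comma or full stop would not match the list. NLTK’s `word_tokenize` (which relies on the `punkt` data downloaded earlier) is the usual way to separate punctuation. The sketch below uses a simple regex instead so that it runs without extra data. The short `stop_word` set and the sample sentence are stand-ins assumed for illustration.

```python
import re

# Small stand-in stop list; swap in set(stopwords.words('english')) in practice.
stop_word = {"this", "is", "a", "as", "it", "to"}

txt = "Simple as it is, this works."

# Whitespace splitting keeps punctuation glued to the word, so "is," slips through.
by_split = [w for w in txt.lower().split() if w not in stop_word]

# Separating words from punctuation first lets every stop word match.
tokens = re.findall(r"\w+|[^\w\s]", txt.lower())
by_regex = [w for w in tokens if w not in stop_word]

print(by_split)
print(by_regex)
```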

As with Spacy, you can add words to the NLTK stopwords list.

Adding Words in NLTK stopwords list

# Adding multiple words to nltk stoplist
# A set gives fast membership tests; | returns the union with the new words
stop = set(stopwords.words('english')) | {'explain', 'sample'}

print('Before:-------')
print(txt,'\n')

print('After:-------')
# Print sentence after stopword removal
' '.join([i for i in txt.lower().split() if i not in stop])

Raw text:-------
This is a sample sentence, to explain filtration of stopwords, which is part of text normalization 

After Default Stop word removal NLTK:-------
'sample sentence, explain filtration stopwords, part text normalization'

After Custom Stop word removal NLTK:-------
'sentence, filtration stopwords, part text normalization'

Removing Words from NLTK stopwords list

# Remove multiple stopword from NLTK stopword list
# Set difference drops the words we want to keep in the text
stop = set(stopwords.words('english')) - {'this', 'is'}

print('Before:-------')
print(txt,'\n')

print('After:-------')
# Print sentence after stopword removal
' '.join([i for i in txt.lower().split() if i not in stop])
Raw text:-------
This is a sample sentence, to explain filtration of stopwords, which is part of text normalization 

After Default Stop word removal NLTK:-------
'sample sentence, explain filtration stopwords, part text normalization'

After Custom Stop word removal NLTK:-------
'this is sample sentence, explain filtration stopwords, is part text normalization'

Remove Stop words using Gensim

Gensim is one of the most important Python libraries for advanced Natural Language Processing. It is popular for NLP tasks such as topic modeling, Word2vec and document indexing.

In this section I will show you how to use Gensim to remove stop words from text.

Like Spacy and NLTK, Gensim also has its own stopwords list. Let’s print all of them.

from gensim.parsing.preprocessing import remove_stopwords
from gensim.parsing.preprocessing import STOPWORDS

# Print complete list of stop words dictionary of Gensim
STOPWORDS
frozenset({'a',
           'about',
           'above',
           'across',
           'after',
           'afterwards',
           'again',
           'against',
           'all',
           'almost',
           'alone',
           'along',
           'already',
            ............
            ............

Now let’s remove stop words from a sample text.

from gensim.parsing.preprocessing import remove_stopwords
from gensim.parsing.preprocessing import STOPWORDS

txt = 'This is a sample sentence, to explain filtration of stopwords, which is part of text normalization'
print('Raw Text:-----')
print(txt)
print('\n')
print('After Default Stop word removal Gensim:-----')
remove_stopwords(txt)
Raw Text:-----
This is a sample sentence, to explain filtration of stopwords, which is part of text normalization


After Default Stop word removal Gensim:-----
'This sample sentence, explain filtration stopwords, text normalization'
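
In the output above, “This” survives even though “this” is a common stop word, which suggests that matching is case-sensitive in the Gensim version used here. If casing does not matter for your task, one option is to lowercase the text before passing it to `remove_stopwords`. Otherwise you can compare lowercase forms while keeping the original words, as in this minimal sketch. `STOP` and `remove_stopwords_ci` are hypothetical names for this illustration, not part of Gensim.

```python
# Stand-in for a library stop list with lowercase entries (an assumption for illustration).
STOP = frozenset({"this", "is", "a", "to", "of", "which"})

def remove_stopwords_ci(text, stop_words=STOP):
    """Drop whitespace-separated words whose lowercase form is a stop word."""
    return " ".join(w for w in text.split() if w.lower() not in stop_words)

cleaned = remove_stopwords_ci("This is a sample sentence")
print(cleaned)
```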

Adding Words in Gensim stopwords list

As with Spacy and NLTK, you can also add words to the Gensim stopwords list using the code below:

from nltk.tokenize import word_tokenize
from gensim.parsing.preprocessing import STOPWORDS

# Adding multiple words to gensim stoplist
new_stopword = STOPWORDS.union(set(['this', 'sample']))

txt_tokens = word_tokenize(txt)
tok_without_sw = [word for word in txt_tokens if word.lower() not in new_stopword]

print('Raw Text:-----')
print(txt)
print('\n')
print('After Default Stop word removal Gensim:-----')
print(remove_stopwords(txt))
print('\n')
# Print sentence after stopword removal
print('After Custom Stop word removal Gensim:-----')
' '.join(tok_without_sw)
Raw Text:-----
This is a sample sentence, to explain filtration of stopwords, which is part of text normalization


After Default Stop word removal Gensim:-----
This sample sentence, explain filtration stopwords, text normalization


After Custom Stop word removal Gensim:-----
'sentence , explain filtration stopwords , text normalization'

Removing Words in Gensim Stopwords list

You can also remove words from the Gensim stopwords list using the code below:

from nltk.tokenize import word_tokenize

# Remove multiple words from gensim stoplist
remove_list = {'is', 'to'}
new_stopword = STOPWORDS.difference(remove_list)

txt_tokens = word_tokenize(txt)
tok_without_sw = [word for word in txt_tokens if word.lower() not in new_stopword]

print('Raw Text:-----')
print(txt)
print('\n')
print('After Default Stop word removal Gensim:-----')
print(remove_stopwords(txt))
print('\n')
# Print sentence after stopword removal
print('After Custom Stop word removal Gensim:-----')
' '.join(tok_without_sw)
Raw Text:-----
This is a sample sentence, to explain filtration of stopwords, which is part of text normalization


After Default Stop word removal Gensim:-----
This sample sentence, explain filtration stopwords, text normalization


After Custom Stop word removal Gensim:-----
'is sample sentence , to explain filtration stopwords , is text normalization'
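
The Gensim list printed earlier is a `frozenset`, which cannot be changed in place. That is why the examples build a new set with `union` and `difference` and filter against that. A minimal sketch of this behavior with a stand-in `frozenset` (`BASE` is a hypothetical name, not the real Gensim list):

```python
# Stand-in for an immutable library stop list such as the frozenset printed earlier.
BASE = frozenset({"a", "is", "this", "to"})

extended = BASE.union({"sample"})        # new frozenset with one extra word
reduced = BASE.difference({"is", "to"})  # new frozenset with two words removed

print(sorted(extended))
print(sorted(reduced))
print(sorted(BASE))                      # the original is unchanged
```

Because the original is never modified, custom lists built this way do not affect other code that uses the default list.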

Conclusion

In this article you learned how to remove stop words from text in Python using different libraries. You saw that NLTK is not the only option: Spacy and Gensim can remove stop words as well. You also learned how to add words to, or remove words from, the default stop word list of each library.
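
The examples in this article work on a string held in memory. If your text lives in a file, the same filtering applies once the file has been read. Here is a minimal sketch, assuming a UTF-8 text file. The stop list, the file name and the helper `remove_stop_words_from_file` are hypothetical, and the regex tokenizer is a simplification.

```python
import re
import tempfile
from pathlib import Path

# Stand-in stop list; replace with the NLTK, Spacy or Gensim list as needed.
STOP_WORDS = {"this", "is", "a", "to", "of", "which"}

def remove_stop_words_from_file(path, stop_words=STOP_WORDS):
    """Read a UTF-8 text file and return its words minus the stop words."""
    text = Path(path).read_text(encoding="utf-8")
    words = re.findall(r"[A-Za-z']+", text)
    return [w for w in words if w.lower() not in stop_words]

# Demo on a throwaway file so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "sample.txt"
    sample_path.write_text("This is a sample sentence, to explain filtration", encoding="utf-8")
    file_words = remove_stop_words_from_file(sample_path)

print(file_words)
```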
