Word similarity matching using Soundex algorithm in python

Word similarity matching is an essential part for text cleaning or text analysis.

Let’s say in your text there are lots of spelling mistakes for any proper nouns like name, place etc. and you need to convert all similar names or places in a standard form.

This is where Soundex algorithm is needed to match similarity between two words.

Soundex is a phonetic algorithm which can find similar sounding terms.

A Soundex search algorithm takes a word, such as a person’s name, as input, and produces a character string that identifies a set of words that are (roughly) phonetically alike or sound (roughly) is equal.

The Soundex method is based on six phonetic types of human speech sounds (bilabial, labiodental, dental, alveolar, velar, and glottal). These are based on where you put your lips and tongue to make the sounds.

Let’s test something in python.

(Note: I have used fuzzy==1.1 version)

import fuzzy
# Convert up to 10 characters to phonetic character
soundex = fuzzy.Soundex(10)
# Text to process
word = 'phone'
soundex(word)

Output:

‘P500000000’

Here P is for first letter of word ‘phone’

Now if someone misspelled ‘phone’ to ‘fone’ let’s see if Soundex can identify or not.

import fuzzy
# Convert up to 10 characters to phonetic character
soundex = fuzzy.Soundex(10)
# Text to process
word = 'fone'
soundex(word)

Output:

‘F500000000’

You can see apart from first character phonetic characters are same for both ‘phone’ and ‘fone’. So we can use Soundex algorithm to solve this kind of problems.

word = 'fone'
soundex(word)[1:]
word = 'phone'
soundex(word)[1:]

Output:

‘500000000’

Also Read: Natural Language Processing Using TextBlob

How Soundex algorithm works?

Step1:

Store the first letter of the word and keep it separate .Convert it to upper case.

‘phone’=> ‘P’, ‘hone’ => ‘hone’

Step2:

Drop all a, e, i, o, u, y, h, w from word.

‘hone’=> ‘hone’ => ‘n’

Step3:

Replace characters with digits as follows (after the first letter):

b, f, p, v → 1

c, g, j, k, q, s, x, z → 2

d, t → 3

l → 4

m, n → 5

r → 6

‘n’ => 5

Step4:

Join uppercase first character with step 3 value.

So phonetic value of word ‘phone’ will be P5.

As we had given 10 characters to convert [soundex = fuzzy.Soundex(10); soundex(word)]

8 zeros will be filled after P5 (total 10 characters) so final value will be P500000000

Let’s test for one more word.

Word = ‘Rupert‘

Step1:

‘Rupert’ => ‘R’, ‘upert’

Step2:

‘upert’ => ‘prt’

Step3:

‘prt’ => 163

Step4:

R163 => R163000000 (6 zeros added to make character length to 10)

Conclusion:

In this tutorial you learned:

What is word similarity matching

Why word similarity matching is important for text analysis.

How to do word similarity matching using Soundex algorithm in python.

How Soundex algorithm works.

Anindya Naskar

114 thoughts on “Word similarity matching using Soundex algorithm in python”

Pianino Tapety

February 5, 2021 at 8:05 pm

Thanks for sharing, please keep an update about this info. love to read it more. i like this site too much. Good theme .
www.Filozofia.XMC.pl

February 19, 2021 at 2:15 am

Your weblog is excellent. Thank you greatly for delivering a lot of very helpful information and facts.
Diabetyka

March 1, 2021 at 2:45 am

Someone I work with visits your site regularly and recommended it in my experience to read as well. The writing style is great and also the content is top-notch. Thanks for the insight you supply the readers!
- Anindya
  
  March 27, 2021 at 11:28 am
  
  Thanks

How Soundex algorithm works?

Conclusion:

Related Posts

114 thoughts on “Word similarity matching using Soundex algorithm in python”

Leave a comment Cancel reply