Accurate Language Detection Using FastText & Python


Are you looking for an easy and efficient way to detect languages in Python? Then FastText is a good choice. FastText is a machine learning library that can quickly and accurately identify the language of a text. In this blog post, I will explain how FastText works and how you can use it for language detection.

Application of Language Detection

Language detection is used in various real-world NLP projects such as:

Machine Translation: Google Translate is one of the most famous applications for this kind of project.
- Google uses language detection to determine the source language before translating it to the target language
- Facebook uses language identification techniques to translate posts and comments automatically. This technique is also used to suggest relevant language-specific groups to join.
Text Classification: While working on an NLP project, you may find that users have written some comments in their regional language. In that case, language identification helps you detect those comments and translate them into English, since the classification model was trained on English text.
Spell Checking: If the spell checker cannot recognize the language of the text, it may apply the wrong rules and produce incorrect corrections.
Customer Support: Language detection can help route customer support requests to the appropriate agent who speaks the customer’s language.

What is FastText?

FastText is an open-source and lightweight library developed by Facebook AI Research. It handles large datasets efficiently, both when training and when classifying text. FastText also provides pre-trained word vectors for 157 languages, allowing for transfer learning and quick integration into NLP models. With its simple interface and high performance, FastText has become a popular choice for NLP tasks such as sentiment analysis, spam filtering, and language identification.

The main strength of FastText is its training speed. According to its authors, it can train on more than a billion words in less than 10 minutes on a standard multicore CPU.

If you are interested in how to implement FastText word embedding in Python then read this post.

Why is FastText so fast?

There are several reasons why FastText is fast while maintaining performance. Some of them are:

It is implemented in C++
It can use multiple CPU threads during training
It is based on a simple, shallow neural network architecture

How does FastText work?

FastText works by representing words as vectors in a high-dimensional space, using a neural network architecture. It takes into account both the internal structure of words (n-grams) and their distribution in the training corpus to create a dense vector representation. This vector representation can then be used for a variety of NLP tasks such as text classification, sentiment analysis, and language identification.
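To make the idea of the "internal structure of words" more concrete, here is a minimal sketch of how character n-grams can be extracted from a word. The function name char_ngrams and its parameters are my own for illustration and are not part of the fastText API. The real library also hashes the n-grams into a fixed number of buckets, which this sketch leaves out.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word`, padded with boundary markers."""
    # "<" and ">" mark the start and end of the word, so a prefix such as
    # "<wh" is a different feature from the same letters inside a word.
    padded = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(padded) - n + 1):
            grams.append(padded[start:start + n])
    return grams


print(char_ngrams("where", min_n=3, max_n=3))
# ['<wh', 'whe', 'her', 'ere', 're>']
```

Because many of these character sequences are typical of one language, they are useful signals for language identification even when the text is short.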

Why use FastText for language detection?

There are several libraries for language identification in Python, such as:

Langdetect
Polyglot
TextBlob
Pycld2
langid

Each of the above-mentioned libraries has its own advantages and limitations. In this example, we will be using the FastText library for language detection in Python because it is both highly accurate and extremely fast.

FastText language detection Python

To use fastText for language detection in Python, you will first need to install the library. This can be done using the following command:

Install Library

pip install fasttext

FastText language detection model

fastText provides two pre-trained models for language identification. Both recognize the same 176 languages; they differ in file size, speed, and accuracy. Let’s explore them before loading any model blindly.

lid.176.bin: The full model. It is the faster and slightly more accurate of the two, with a file size of about 126 MB.
lid.176.ftz: A compressed version of the same model. Its file size is under 1 MB, at the cost of slightly lower accuracy.

Load pre-trained model

For this fastText language detection example, I am going to use the lid.176.ftz model. You can try the other model and let me know your experience in the comment section. Now let’s load the pre-trained model for language identification.

import fasttext

# Load pre-trained model
model = fasttext.load_model('lid.176.ftz')

The above code loads a pre-trained model that has been trained on text in 176 different languages.


Note: If you get an error saying ValueError: lid.176.ftz cannot be opened for loading!, the model file is not in your working directory. fasttext.load_model() does not download the model for you, so download lid.176.ftz manually from the fastText language identification page and put it in your working directory.
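If you prefer to script the download, a small helper like the one below can fetch the file only when it is missing. This is a sketch under assumptions: the function name ensure_model is hypothetical, and the URL is the location I believe the official file is served from, so please verify it on the fastText website before relying on it.

```python
import os
import urllib.request

# Assumed download location of the official lid.176.ftz file; check it
# against the fastText website before using it.
MODEL_URL = "https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.ftz"


def ensure_model(path="lid.176.ftz", url=MODEL_URL):
    """Return `path`, downloading the model first if the file is missing."""
    if not os.path.exists(path):
        urllib.request.urlretrieve(url, path)
    return path


# Typical use (requires the fasttext package):
# model = fasttext.load_model(ensure_model())
```

The helper does nothing when the file already exists, so it is safe to call at the top of every script.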

Language identification from text

Once the pre-trained model has loaded successfully, you can recognize the language of any input text by calling the predict function.

# Find the language of a given text
language = model.predict('This is a sentence in English')
# Print predicted language
print(language)

(('__label__en',), array([0.96859217]))

The predict function returns a tuple with two elements: a tuple of predicted labels and an array of matching probabilities. Each label is a language code prefixed with __label__, and the probability is the model's confidence in that prediction, not a measured accuracy.

For the above example, the model correctly predicted the input text as English with a confidence of about 0.97.
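In practice you usually want a plain language code and a number rather than the raw tuple. Below is a minimal sketch of a wrapper that does this. The name detect_language is my own, and model is assumed to be any object with a fastText-style predict() method, such as the one loaded above. Note that predict() works on a single line of text, so the wrapper replaces newlines before calling it.

```python
LABEL_PREFIX = "__label__"


def detect_language(model, text):
    """Return (language_code, confidence) for the top prediction."""
    # predict() handles one line at a time, so newlines are replaced first.
    single_line = text.replace("\n", " ")
    labels, scores = model.predict(single_line)
    # Strip the "__label__" prefix to get the bare language code.
    code = labels[0].replace(LABEL_PREFIX, "", 1)
    return code, float(scores[0])


# Typical use with the model loaded earlier:
# code, confidence = detect_language(model, "This is a sentence in English")
```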

FastText language codes

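As you can see in the above example, fastText returns each prediction as a language code with a __label__ prefix. So we need to know which language name belongs to each code. Below is a reference table of language codes and their language names. Treat it as a convenience lookup: the exact set of labels a loaded model supports is available through its labels property (model.labels).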

af – Afrikaans	lv – Latvian	ho – Hiri Motu
am – Amharic	mg – Malagasy	hz – Herero
an – Aragonese	mi – Maori	ii – Sichuan Yi
ar – Arabic	mk – Macedonian	inh – Ingush
as – Assamese	ml – Malayalam	jbo – Lojban
az – Azerbaijani	mn – Mongolian	kl – Kalaallisut
be – Belarusian	mr – Marathi	ks – Kashmiri
bg – Bulgarian	ms – Malay	ku-Latn – Central Kurdish (Latin)
bn – Bengali	mt – Maltese	kv – Komi
br – Breton	my – Burmese	lb – Luxembourgish
bs – Bosnian	ne – Nepali	lg – Ganda
ca – Catalan	nl – Dutch	mh – Marshallese
ceb – Cebuano	no – Norwegian	mi – Māori
co – Corsican	ny – Chichewa	mrj – Hill Mari
cs – Czech	pa – Punjabi	nah – Nahuatl
cy – Welsh	pl – Polish	nap – Neapolitan
da – Danish	pt – Portuguese	nb – Norwegian Bokmål
de – German	ro – Romanian	nn – Norwegian Nynorsk
el – Greek	ru – Russian	no – Norwegian
en – English	sd – Sindhi	oc – Occitan
eo – Esperanto	si – Sinhala	os – Ossetian
es – Spanish	sk – Slovak	pi – Pali
et – Estonian	sl – Slovenian	ps – Pashto
eu – Basque	sm – Samoan	qu – Quechua
fa – Persian	sn – Shona	rm – Romansh
fi – Finnish	so – Somali	rn – Rundi
fr – French	sq – Albanian	rw – Kinyarwanda
fy – Western Frisian	sr – Serbian	sc – Sardinian
ga – Irish	st – Sotho	se – Northern Sami
gd – Scottish Gaelic	su – Sundanese	sg – Sango
gl – Galician	sv – Swedish	tk – Turkmen
gu – Gujarati	sw – Swahili	tlh – Klingon
ha – Hausa	ta – Tamil	tn – Tswana
haw – Hawaiian	te – Telugu	to – Tongan
hi – Hindi	tg – Tajik	tw – Twi
hmn – Hmong	th – Thai	ty – Tahitian
hr – Croatian	tl – Tagalog	ug – Uighur
ht – Haitian Creole	tr – Turkish	ve – Venda
hu – Hungarian	uk – Ukrainian	vo – Volapük
hy – Armenian	ur – Urdu	wa – Walloon
id – Indonesian	uz – Uzbek	wo – Wolof
ig – Igbo	vi – Vietnamese	xog – Soga
is – Icelandic	xh – Xhosa	yue – Cantonese
it – Italian	yi – Yiddish	za – Zhuang
he – Hebrew	yo – Yoruba	ace – Acehnese
ja – Japanese	zh – Chinese	ach – Acoli
jv – Javanese	zu – Zulu	ady – Adyghe
ka – Georgian	ak – Akan	aln – Gheg Albanian
kk – Kazakh	bh – Bihari	alt – Southern Altai
km – Khmer	bi – Bislama	anp – Angika
kn – Kannada	bm – Bambara	arn – Mapudungun
ko – Korean	chr – Cherokee	arq – Algerian Arabic
ku – Kurdish	dv – Divehi	ast – Asturian
ky – Kyrgyz	dz – Dzongkha	av – Avar
la – Latin	ee – Ewe	azb – South Azerbaijani
lb – Luxembourgish	ff – Fulah	ba – Bashkir
lo – Lao	gv – Manx	bar – Bavarian
lt – Lithuanian	hak – Hakka Chinese	bbc – Batak Toba
bcl – Central Bicolano	be-tarask – Belarusian (Taraškievica)
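A table like this is easy to turn into a lookup in code. The sketch below uses a handful of entries from the table. The names LANGUAGE_NAMES and language_name are hypothetical, and you would extend the dictionary with the codes you care about.

```python
# A few entries copied from the table above; extend as needed.
LANGUAGE_NAMES = {
    "en": "English",
    "fr": "French",
    "de": "German",
    "es": "Spanish",
    "hi": "Hindi",
    "bn": "Bengali",
}


def language_name(label, names=LANGUAGE_NAMES):
    """Translate a fastText label such as '__label__en' to a readable name."""
    code = label.replace("__label__", "", 1)
    # Fall back to the bare code when the lookup has no entry for it.
    return names.get(code, code)


print(language_name("__label__en"))  # English
print(language_name("__label__xx"))  # xx
```

Falling back to the bare code keeps the function usable for labels that are missing from your dictionary.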


Train custom language detection model with fastText

You can also train your own language detection model using fastText. This flexibility is one of fastText's main advantages among language detection libraries in Python.

To train your custom language identification model, you need to prepare a labelled training dataset that covers each language. The dataset is a plain text file with one example per line, and each line starts with a label that indicates the language of the text. Sample training data should look like below.

__label__en This is an example of English text.
__label__fr Voici un exemple de texte en français.
__label__de Hier ist ein Beispiel für deutschen Text.

Format of the training data:

__label__<language_code> <text>

Where <language_code> is the code for the language of the text and <text> is the actual text.
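If your examples live in Python as (language code, text) pairs, a small helper can write them in this format. This is a minimal sketch: the function name write_training_file is my own, and the three samples are the ones shown above.

```python
def write_training_file(samples, path):
    """Write (language_code, text) pairs in fastText's supervised format."""
    with open(path, "w", encoding="utf-8") as handle:
        for code, text in samples:
            # Each example must fit on one line, so collapse line breaks
            # and repeated whitespace into single spaces.
            one_line = " ".join(text.split())
            handle.write(f"__label__{code} {one_line}\n")


samples = [
    ("en", "This is an example of English text."),
    ("fr", "Voici un exemple de texte en français."),
    ("de", "Hier ist ein Beispiel für deutschen Text."),
]
write_training_file(samples, "data.train")
```

Writing the file as UTF-8 matters here, because the training text will contain non-ASCII characters for most languages.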

By following the above format, create your own training data and save it as a plain UTF-8 text file. fastText does not require a particular file extension; this example uses the file name data.train. Once your dataset is ready, you can train your custom language detection model using the following code:

import fasttext

# Train the model
model = fasttext.train_supervised('data.train')

# Save the model in binary format
model.save_model('model.bin')

In the above code, we are using the save_model() function of fastText to save your trained language detection model.

You can later load the saved model using the load_model() function:

import fasttext

# Load the saved model
model = fasttext.load_model('model.bin')

# Predict the language of a given text
language = model.predict('This is a sentence in English')
print(language)

(('__label__en',), array([0.33344826]))

As you can see, the confidence is very low (about 0.33). With three labels, that is no better than a uniform guess, because the model was trained only on the 3 sample lines shown above.

Make sure that you have a good variety of text for each language, at least several thousand sentences for each language, in order to get a robust model that can provide good accuracy.
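A confidence score on one sentence does not tell you how good the model is. To measure that, keep some labelled lines out of training and check how often the top prediction matches the true label. The sketch below assumes the held-out lines use the same __label__ format as the training file. The function name holdout_accuracy is hypothetical, and predict can be any fastText-style predict function.

```python
def holdout_accuracy(predict, labelled_lines):
    """Share of lines whose top predicted label matches the true label."""
    correct = 0
    total = 0
    for line in labelled_lines:
        # Split "__label__en some text" into the label and the text.
        true_label, _, text = line.strip().partition(" ")
        if not text:
            continue  # skip blank or malformed lines
        labels, _scores = predict(text)
        correct += labels[0] == true_label
        total += 1
    return correct / total if total else 0.0


# Typical use with a trained model and a held-out file (assumed name):
# with open("data.valid", encoding="utf-8") as f:
#     print(holdout_accuracy(model.predict, f))
```

fastText models also have a built-in test() method that reports precision and recall on such a file, which is a good alternative once your validation file is in place.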

FAQs

Now let me list some common questions and concerns about language detection with fastText and Python.

.ftz or .bin: which model is better to use?

The choice between using a .ftz or .bin model depends on your specific use case and requirements. Both are binary files, and neither is human-readable.

.bin models are the full, uncompressed models. For language identification, lid.176.bin is the faster and slightly more accurate option, but the file is much larger (about 126 MB).
.ftz models are compressed versions of a .bin model. They are far smaller (lid.176.ftz is under 1 MB), which makes them easy to download and ship, at the cost of slightly lower accuracy.

If you want the best accuracy and speed and file size is not a concern, you should use the .bin model. But if you need a small download or a small footprint, you should go with the .ftz model.

What is aggressive language detection?

Language detection can be considered “aggressive” when it is designed to identify the language of a text with a high degree of confidence, even in cases where the text is short, noisy, or written in a mixed or unknown language. This can be useful in situations where the language of a text needs to be identified quickly or with a high degree of accuracy, such as in text-to-speech systems or machine translation applications.

Aggressive detection relies on a model that stays reliable on such input. Character-level features such as n-grams help here, because even a short or noisy text contains character sequences that are typical of its language. This type of model can also be retrained on your own dataset or for your specific languages.

fastText does not have a dedicated aggressive mode, but its predict() method takes two parameters that control how many predictions you get back. By default, k=1 and threshold=0.0, so predict() returns only the top-scoring language, whatever its confidence score. If you want to be more aggressive in your language detection, you can pass k=-1 as an argument to the predict() function. It then returns the candidate languages ranked from most to least likely, each with its confidence score, instead of only the best one.

Here is an example of how you can use aggressive language detection with fastText in Python:

import fasttext

# Load the pre-trained model
model = fasttext.load_model("lid.176.ftz")

# Use the model to make a prediction
text = "This is a sample text in English"

# Use aggressive language detection
predictions = model.predict(text, k=-1)

# Print the predictions
print(predictions)

(('__label__en', '__label__de', '__label__te', '__label__th', '__label__bn', '__label__id', '__label__ru', '__label__fa', '__label__ta', '__label__kn', '__label__es', '__label__hi', '__label__it', '__label__ml', '__label__zh', '__label__ja', '__label__pt', '__label__nl', '__label__km', '__label__vi', '__label__ms', '__label__tr', '__label__ro', '__label__my', '__label__el', '__label__si', '__label__ne', '__label__gu', '__label__fr', '__label__he', '__label__ar', '__label__mk', '__label__pl', '__label__uk', '__label__or', '__label__sw', '__label__as', '__label__sco', '__label__mr', '__label__sa', '__label__hu', '__label__ko', '__label__tl', '__label__pa', '__label__ca', '__label__fi', '__label__bg', '__label__hy', '__label__ka', '__label__af', '__label__az', '__label__sr', '__label__tg', '__label__jv', '__label__mn', '__label__da', '__label__sv', '__label__cy', '__label__azb', '__label__cs', '__label__no', '__label__sh', '__label__arz', '__label__su', '__label__uz', '__label__ur', '__label__sl', '__label__sq', '__label__eu', '__label__pam', '__label__sd', '__label__bh', '__label__gom', '__label__new', '__label__pnb', '__label__bo', '__label__lv', '__label__ps', '__label__gd', '__label__la', '__label__br', '__label__sah', '__label__so', '__label__bar', '__label__hif', '__label__ba', '__label__xmf', '__label__mai', '__label__ckb', '__label__bs', '__label__hr', '__label__eo', '__label__kk', '__label__yo', '__label__ast', '__label__gl', '__label__et', '__label__min', '__label__fy', '__label__ceb', '__label__ku', '__label__diq', '__label__dty', '__label__sk', '__label__cv', '__label__mg', '__label__be', '__label__ky', '__label__frr', '__label__am', '__label__lb', '__label__lt', '__label__yi', '__label__ug', '__label__tt', '__label__dv', '__label__ce'), array([9.64780271e-01, 3.08053242e-03, 1.97254401e-03, 1.81053276e-03,
       1.61182613e-03, 1.56028639e-03, 1.54142629e-03, 1.24220003e-03,
       1.04459631e-03, 1.04103796e-03, 9.76120180e-04, 9.31559014e-04,
       8.80905485e-04, 8.77901795e-04, 8.41499481e-04, 8.40898661e-04,
       8.13336519e-04, 7.24010402e-04, 7.14088266e-04, 6.21536397e-04,
       5.70030359e-04, 5.37957530e-04, 5.11918159e-04, 5.06383483e-04,
       4.91981569e-04, 4.86537203e-04, 4.74088913e-04, 3.76311538e-04,
       3.74388386e-04, 3.67953617e-04, 3.48389411e-04, 3.32949829e-04,
       2.93608144e-04, 2.88937095e-04, 2.72503763e-04, 2.09198479e-04,
       2.09184713e-04, 2.02362629e-04, 1.93820830e-04, 1.84631805e-04,
       1.76725473e-04, 1.75931098e-04, 1.75901558e-04, 1.66520083e-04,
       1.65673104e-04, 1.63752324e-04, 1.59383111e-04, 1.57409100e-04,
       1.57213028e-04, 1.44165824e-04, 1.42352059e-04, 1.34115253e-04,
       1.31551846e-04, 1.22665064e-04, 1.22609621e-04, 1.21908888e-04,
       1.07098684e-04, 9.73919668e-05, 9.39379825e-05, 9.12295509e-05,
       8.10417187e-05, 7.88749312e-05, 7.74870496e-05, 7.04112026e-05,
       6.85247214e-05, 6.81404345e-05, 6.60218066e-05, 6.56088669e-05,
       6.48355344e-05, 6.12535296e-05, 6.11406576e-05, 6.02545588e-05,
       6.02264117e-05, 5.31237238e-05, 4.74484914e-05, 4.72255233e-05,
       4.37675772e-05, 4.16412986e-05, 4.06144245e-05, 4.02175829e-05,
       3.72897302e-05, 3.60325903e-05, 3.51678318e-05, 3.04828318e-05,
       2.97485421e-05, 2.79343385e-05, 2.75584916e-05, 2.73486967e-05,
       2.66454481e-05, 2.60915804e-05, 2.42980277e-05, 2.35217030e-05,
       2.33252595e-05, 2.25593740e-05, 2.14378433e-05, 2.00410086e-05,
       1.99838669e-05, 1.98738817e-05, 1.87539226e-05, 1.86188790e-05,
       1.69365539e-05, 1.67025282e-05, 1.66536047e-05, 1.46203838e-05,
       1.38812202e-05, 1.32007854e-05, 1.28564443e-05, 1.28536249e-05,
       1.28255597e-05, 1.19186952e-05, 1.19031674e-05, 1.18591252e-05,
       1.15433177e-05, 1.11187519e-05, 1.10003712e-05, 1.09521461e-05,
       1.07508677e-05]))
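Most of the scores in this output are tiny, so in practice you may want only a few strong candidates. The sketch below combines the k and threshold parameters of predict() for that. The function name candidate_languages and the default cut-off of 0.01 are my own choices for illustration, not values recommended by fastText.

```python
def candidate_languages(model, text, k=3, min_confidence=0.01):
    """Return up to `k` (code, confidence) pairs scoring at least `min_confidence`."""
    labels, scores = model.predict(
        text.replace("\n", " "),  # predict() expects a single line
        k=k,
        threshold=min_confidence,
    )
    return [
        (label.replace("__label__", "", 1), float(score))
        for label, score in zip(labels, scores)
    ]


# Typical use with the model loaded above:
# print(candidate_languages(model, "This is a sample text in English"))
```

A short list like this is handy for mixed-language text, where the second candidate often carries useful information.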

What is automatic language detection?

Automatic language detection is a computational task that involves determining the language of a given text automatically, without any human intervention. This technology is used in various natural language processing applications such as search engines and machine translation systems. fastText is a good example of an automatic language detection library in Python.

What is language identification chart?

Figure: One example of an LID chart

A language identification chart (LID chart) is a graphical representation of the results of a language identification system. This kind of chart is typically used to evaluate the performance of a language identification model. The chart displays the predicted language of a text or document on the y-axis, and the true language of the text or document on the x-axis, so it is in effect a confusion matrix.

Using an LID chart, you can quickly analyze the performance of a language identification system, such as which languages are being misclassified, which languages have low accuracy, or which languages are most similar to each other. It can also be used to compare the performance of different language identification systems or models.
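The numbers behind such a chart are simple to compute. Here is a minimal sketch that counts how often each (true, predicted) pair occurs and prints the counts as a small text table. The labels are made-up sample data and the function name confusion_counts is hypothetical.

```python
from collections import Counter


def confusion_counts(true_labels, predicted_labels):
    """Count how often each (true, predicted) language pair occurs."""
    return Counter(zip(true_labels, predicted_labels))


# Made-up sample data: one English text was misclassified as German.
true_labels = ["en", "en", "fr", "de", "de"]
predicted = ["en", "de", "fr", "de", "de"]
counts = confusion_counts(true_labels, predicted)

languages = sorted(set(true_labels) | set(predicted))
# Rows are predicted languages and columns are true languages, as in the chart.
print("pred\\true", *languages, sep="\t")
for pred in languages:
    row = [counts[(true, pred)] for true in languages]
    print(pred, *row, sep="\t")
```

Large values off the diagonal point to language pairs that the model confuses with each other.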

End Note

In conclusion, fastText is a powerful and flexible library for language detection in Python. With its pre-trained language identification models, fastText can identify the language of a text quickly and with high accuracy.

The Python bindings make the library easy to use, and the code samples provided in this article demonstrate how to train and use a language detection model using fastText.

Whether you’re a beginner or an experienced practitioner in NLP, fastText is a great tool to implement in your project. Try it out today and see how it can help you solve your language detection needs.

If you have any questions or suggestions regarding this tutorial, please let me know in the comment section below.

Anindya

Hi there, I’m Anindya Naskar, Data Science Engineer. I created this website to show you what I believe is the best possible way to get your start in the field of Data Science.