Accurate Language Detection Using FastText & Python

Are you looking for an easy and efficient way to detect languages in Python? Then FastText is a strong choice. FastText is a machine learning library that can identify the language of a text quickly and accurately. In this blog post, I will explain how FastText works and how you can use it for language detection.

Application of Language Detection

Language detection is used in various real-world NLP projects such as:

  • Machine Translation: Google Translate is one of the most famous applications for this kind of project.
    • Google uses language detection to determine the source language before translating it to the target language
    • Facebook uses language identification techniques to translate posts and comments automatically. This technique is also used to suggest relevant language-specific groups to join.
  • Text Classification: While working on an NLP project, you may find that users have written some comments in their regional language. In that case, language identification helps you spot those comments and translate them into English, since the classification model was trained on English text.
  • Spell Checking: If the language of the text is not recognized correctly, your spell checker will apply the wrong rules.
  • Customer Support: Language detection can help route customer support requests to the appropriate agent who speaks the customer’s language.

What is FastText?

FastText is an open-source and lightweight library developed by Facebook AI Research. It handles large datasets efficiently and classifies text quickly. FastText also provides pre-trained word vectors for 157 languages, allowing for transfer learning and quick integration into NLP models. With its user-friendly interface and high performance, FastText has become a popular choice for NLP tasks such as sentiment analysis, spam filtering, and language identification.

The key selling point of FastText is that it trains word vector models extremely fast. Its authors report training on more than a billion words in less than 10 minutes on a standard multicore CPU.

If you are interested in how to implement FastText word embedding in Python then read this post.

Why is FastText so fast?

There are several reasons why FastText is fast while maintaining performance. Some of them are:

  • It is implemented in C++
  • It can use multiple threads during training
  • It is based on a simple (shallow) neural network architecture

How does FastText work?

FastText works by representing words as vectors in a high-dimensional space, using a neural network architecture. It takes into account both the internal structure of words (n-grams) and their distribution in the training corpus to create a dense vector representation. This vector representation can then be used for a variety of NLP tasks such as text classification, sentiment analysis, and language identification.
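
To make the n-gram idea concrete, here is a minimal sketch of how a word can be split into character n-grams. The function name char_ngrams and its parameters are my own for illustration and are not part of the fastText API. The real library also hashes these n-grams into a fixed number of buckets, which I leave out here.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of a word, including boundary markers."""
    # "<" and ">" mark the start and end of the word, so prefixes and
    # suffixes get n-grams that differ from the same letters mid-word.
    wrapped = f"<{word}>"
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams


# Trigrams only, to keep the output short.
print(char_ngrams("where", min_n=3, max_n=3))
# ['<wh', 'whe', 'her', 'ere', 're>']
```

Because languages differ strongly in which short character sequences they use, features like these are a natural fit for language identification, even for words the model has never seen.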

Why use FastText for language detection?

There are several libraries for language identification in Python, such as:

  • Langdetect
  • Polyglot
  • TextBlob
  • Pycld2
  • langid

Each of the above-mentioned libraries has its own advantages and limitations. In this example, we will use the FastText library for language detection in Python because it is both accurate and extremely fast.

FastText language detection Python

To use fastText for language detection in Python, you will first need to install the library. This can be done using the following command:

Install Library

pip install fasttext

FastText language detection model

fastText provides two pre-trained models for language identification. Both recognize the same 176 languages and differ only in size, speed, and accuracy. Let’s explore those before loading any model blindly.

  • lid.176.bin: The full model. It is faster and slightly more accurate, but the file is about 126 MB.
  • lid.176.ftz: A compressed (quantized) version of the same model. The file is about 917 kB, at the cost of slightly lower accuracy.

Load pre-trained model

For this fastText language detection example, I am going to use the lid.176.ftz model. You can try the other model and let me know your experience in the comment section. Now let’s load the pre-trained model for language identification.

import fasttext

# Load pre-trained model
model = fasttext.load_model('lid.176.ftz')

The above code loads a pre-trained model that has been trained to recognize 176 different languages.

Note: fasttext.load_model() does not download the model for you. If you get an error saying ValueError: lid.176.ftz cannot be opened for loading!, the file is not at the path you passed. In that case, download the model manually and put it in your working directory. You can download lid.176.ftz by clicking this link.
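
If you prefer to script the download, a small helper like the one below can fetch the file once and reuse it afterwards. This is only a sketch: ensure_model is my own helper name, and the URL is the download location as I know it, so please verify it against the official fastText language identification page before relying on it.

```python
import os
import urllib.request

# Assumed download location; verify it on the official fastText site.
MODEL_URL = "https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.ftz"


def ensure_model(path="lid.176.ftz", url=MODEL_URL):
    """Download the model to `path` unless a file is already there."""
    if not os.path.exists(path):
        urllib.request.urlretrieve(url, path)
    return path


# Typical use (needs network access on the first run):
# model = fasttext.load_model(ensure_model())
```

The helper only checks that the file exists, not that the download completed, so delete the file and retry if loading still fails.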

Language identification from text

Once the pre-trained model is loaded, you can recognize the language of any input text by calling the predict function.

# Find the language of a given text
language = model.predict('This is a sentence in English')
# Print predicted language
print(language)
(('__label__en',), array([0.96859217]))

The predict function returns a tuple with two elements: a tuple of predicted labels and a NumPy array with the probability of each label. Each label is the prefix __label__ followed by the language code.

For the above example, it correctly predicted English with a confidence of about 0.97. Note that this number is the model's confidence in this single prediction, not the accuracy of the model.
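
In practice you usually want a plain language code and a float instead of the raw tuple. The helpers below are a minimal sketch of that. The names normalize, parse_prediction, and detect_language are my own, and model is assumed to be whatever fasttext.load_model() returned. I also fold newlines into spaces first, because predict() processes one line at a time and raises an error on text containing a newline.

```python
def normalize(text):
    """Collapse all whitespace, including newlines, into single spaces."""
    return " ".join(text.split())


def parse_prediction(prediction, prefix="__label__"):
    """Turn (labels, probabilities) into a list of (code, confidence) pairs."""
    labels, probs = prediction
    pairs = []
    for label, prob in zip(labels, probs):
        code = label[len(prefix):] if label.startswith(prefix) else label
        pairs.append((code, float(prob)))
    return pairs


def detect_language(model, text):
    """Return (code, confidence) for the top prediction of a loaded model."""
    return parse_prediction(model.predict(normalize(text)))[0]


# Parsing the output shown above, without needing the model file:
print(parse_prediction((("__label__en",), [0.96859217])))
# [('en', 0.96859217)]
```

With a loaded model, detect_language(model, "This is a sentence in English") returns the same kind of pair.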

FastText language codes

As you can see in the above example, fastText reports languages by code rather than by name, so we need to know which language name each code stands for. Below is a list of language codes with their language names for the fastText lid.176.ftz model. If a code in your output does not appear in the table, check model.labels, which returns every label the loaded model can predict.

af – Afrikaans | lv – Latvian | ho – Hiri Motu
am – Amharic | mg – Malagasy | hz – Herero
an – Aragonese | mi – Maori | ii – Sichuan Yi
ar – Arabic | mk – Macedonian | inh – Ingush
as – Assamese | ml – Malayalam | jbo – Lojban
az – Azerbaijani | mn – Mongolian | kl – Kalaallisut
be – Belarusian | mr – Marathi | ks – Kashmiri
bg – Bulgarian | ms – Malay | ku-Latn – Central Kurdish (Latin)
bn – Bengali | mt – Maltese | kv – Komi
br – Breton | my – Burmese | lb – Luxembourgish
bs – Bosnian | ne – Nepali | lg – Ganda
ca – Catalan | nl – Dutch | mh – Marshallese
ceb – Cebuano | no – Norwegian | mi – Māori
co – Corsican | ny – Chichewa | mrj – Hill Mari
cs – Czech | pa – Punjabi | nah – Nahuatl
cy – Welsh | pl – Polish | nap – Neapolitan
da – Danish | pt – Portuguese | nb – Norwegian Bokmål
de – German | ro – Romanian | nn – Norwegian Nynorsk
el – Greek | ru – Russian | no – Norwegian
en – English | sd – Sindhi | oc – Occitan
eo – Esperanto | si – Sinhala | os – Ossetian
es – Spanish | sk – Slovak | pi – Pali
et – Estonian | sl – Slovenian | ps – Pashto
eu – Basque | sm – Samoan | qu – Quechua
fa – Persian | sn – Shona | rm – Romansh
fi – Finnish | so – Somali | rn – Rundi
fr – French | sq – Albanian | rw – Kinyarwanda
fy – Western Frisian | sr – Serbian | sc – Sardinian
ga – Irish | st – Sotho | se – Northern Sami
gd – Scottish Gaelic | su – Sundanese | sg – Sango
gl – Galician | sv – Swedish | tk – Turkmen
gu – Gujarati | sw – Swahili | tlh – Klingon
ha – Hausa | ta – Tamil | tn – Tswana
haw – Hawaiian | te – Telugu | to – Tongan
hi – Hindi | tg – Tajik | tw – Twi
hmn – Hmong | th – Thai | ty – Tahitian
hr – Croatian | tl – Tagalog | ug – Uighur
ht – Haitian Creole | tr – Turkish | ve – Venda
hu – Hungarian | uk – Ukrainian | vo – Volapük
hy – Armenian | ur – Urdu | wa – Walloon
id – Indonesian | uz – Uzbek | wo – Wolof
ig – Igbo | vi – Vietnamese | xog – Soga
is – Icelandic | xh – Xhosa | yue – Cantonese
it – Italian | yi – Yiddish | za – Zhuang
iw – Hebrew | yo – Yoruba | ace – Acehnese
ja – Japanese | zh – Chinese | ach – Akan
jw – Javanese | zu – Zulu | ady – Adyghe
ka – Georgian | ak – Akan | aln – Gheg Albanian
kk – Kazakh | bh – Bihari | alt – Southern Altai
km – Khmer | bi – Bislama | anp – Angika
kn – Kannada | bm – Bambara | arn – Mapudungun
ko – Korean | chr – Cherokee | arq – Algerian Arabic
ku – Kurdish | dv – Divehi | ast – Asturian
ky – Kyrgyz | dz – Dzongkha | av – Avar
la – Latin | ee – Ewe | azb – South Azerbaijani
lb – Luxembourgish | ff – Fulah | ba – Bashkir
lo – Lao | gv – Manx | bar – Bavarian
lt – Lithuanian | hak – Hakka Chinese | bbc – Batak Toba
bcl – Central Bicolano | be-tarask – Belarusian (Taraškievica)
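
To show readable names in your application, you can keep a small lookup table next to the model. The snippet below is a sketch with a hand-picked subset of the codes above. LANGUAGE_NAMES and language_name are my own names, not something fastText ships. Codes missing from the dictionary are returned unchanged, so nothing breaks when the model predicts a language you have not listed.

```python
# A small, hand-picked subset of the table above; extend it as needed.
LANGUAGE_NAMES = {
    "en": "English",
    "fr": "French",
    "de": "German",
    "es": "Spanish",
    "hi": "Hindi",
    "ja": "Japanese",
}


def language_name(label, prefix="__label__"):
    """Map a fastText label such as '__label__fr' to a readable name."""
    code = label[len(prefix):] if label.startswith(prefix) else label
    # Fall back to the bare code when the name is not in the table.
    return LANGUAGE_NAMES.get(code, code)


print(language_name("__label__fr"))  # French
print(language_name("__label__xx"))  # xx
```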

Train custom language detection model with fastText

You can also train your own language detection model using fastText. This ability sets fastText apart from many other language detection libraries in Python.

To train your custom language identification model, you need to prepare a training dataset covering the languages you care about. The dataset must be a plain text file in a specific format: one example per line, with each line starting with a label that indicates the language of the text. Sample training data should look like below.

__label__en This is an example of English text.
__label__fr Voici un exemple de texte en français.
__label__de Hier ist ein Beispiel für deutschen Text.

Format of the training data:

__label__<language_code> <text>

Where <language_code> is the code for the language of the text and <text> is the actual text.
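
If your raw data is a list of (language code, text) pairs, a few lines of Python are enough to write it in this format. This is a minimal sketch: to_fasttext_line and write_training_file are helper names I made up, and the file name data.train is just the one used in this post. Since fastText reads one example per line, the helper also collapses any newlines inside a text.

```python
def to_fasttext_line(language_code, text):
    """Format one example as '__label__<language_code> <text>'."""
    # Fold newlines and repeated spaces so each example stays on one line.
    cleaned = " ".join(text.split())
    return f"__label__{language_code} {cleaned}"


def write_training_file(samples, path):
    """Write (language_code, text) pairs to a fastText training file."""
    with open(path, "w", encoding="utf-8") as handle:
        for language_code, text in samples:
            handle.write(to_fasttext_line(language_code, text) + "\n")


samples = [
    ("en", "This is an example of English text."),
    ("fr", "Voici un exemple de texte en français."),
    ("de", "Hier ist ein Beispiel für deutschen Text."),
]
write_training_file(samples, "data.train")
```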

Create your own training data in the above format and save it as a plain UTF-8 text file. The name data.train used here is only a convention; fastText does not require a particular file extension. Once the file is ready, you can train your custom language detection model using the following code:

import fasttext

# Train the model
model = fasttext.train_supervised('data.train')

# Save the model in binary format
model.save_model('model.bin')

In the above code, we are using the save_model() function of fastText to save your trained language detection model.

You can later load the saved model using the load_model() function:

import fasttext

# Load the saved model
model = fasttext.load_model('model.bin')

# Predict the language of a given text
language = model.predict('This is a sentence in English')
print(language)
(('__label__en',), array([0.33344826]))

As you can see, the confidence is very low (about 0.33). This is because we trained on only the 3 example lines shown above, so the model does little better than splitting the probability evenly across its three labels.

Make sure that you have a good variety of text for each language, at least several thousand sentences per language, in order to get a robust model that provides good accuracy.
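
To know whether your model is actually good, hold back part of the data for validation instead of judging it on a single sentence. The sketch below splits labeled lines into a training and a validation list while keeping every language represented in both. split_by_label is my own helper, not a fastText function, and the 80/20 split is just a common starting point.

```python
import random
from collections import defaultdict


def split_by_label(lines, valid_fraction=0.2, seed=0):
    """Split '__label__xx text' lines into train and validation lists."""
    by_label = defaultdict(list)
    for line in lines:
        label = line.split(" ", 1)[0]  # the label is the first token
        by_label[label].append(line)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, valid = [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        n_valid = int(len(group) * valid_fraction)
        valid.extend(group[:n_valid])
        train.extend(group[n_valid:])
    return train, valid


# Made-up demo data: 10 lines per language.
lines = [f"__label__en sample sentence {i}" for i in range(10)]
lines += [f"__label__fr phrase d'exemple {i}" for i in range(10)]
train, valid = split_by_label(lines)
print(len(train), len(valid))
```

After writing both lists to files, you can train on the first and call model.test() with the path of the second, which returns the number of examples together with precision and recall.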

FAQs

Now let me list some common questions and concerns about language detection with fastText and Python.

.ftz or .bin: which model is better to use?

The choice between using a .ftz or .bin model depends on your specific use case and requirements. Both are binary files, and neither is meant to be read or edited by hand.

  • .bin models are the full, uncompressed models. For language identification, lid.176.bin is faster and slightly more accurate, but the file is about 126 MB.
  • .ftz models are compressed (quantized) versions of the same model. lid.176.ftz is about 917 kB, which makes it easy to ship and quick to download, at the cost of slightly lower accuracy.

If disk space, memory, or download size matters, for example on a mobile device or in a small container, you should use the .ftz model. But if you want the best accuracy and speed and can afford the larger file, you should go with the .bin model.

What is aggressive language detection?

Language detection can be considered “aggressive” when it always commits to an answer, even in cases where the text is short, noisy, or written in a mixed or unknown language. This can be useful in situations where every input needs to be assigned a language quickly, such as in text-to-speech systems or machine translation applications. The answer for such inputs is less reliable, so it should be read together with its confidence score.

Models used this way typically rely on character-level features such as n-grams, which still carry a signal in very short inputs. This type of model can also be retrained on your own dataset or for your specific languages.

fastText has no separate “aggressive” switch, but predict() has two parameters that control how many candidates come back: k, the number of labels to return (default 1), and threshold, the minimum probability a label needs (default 0.0). By default, the predict() method returns only the single top-scoring language. If you want to be more aggressive in your language detection, you can pass k=-1 as an argument to the predict() function. It will return all candidate labels that clear the threshold, sorted from most to least likely, instead of just the best one.

Here is an example of how you can use aggressive language detection with fastText in Python:

import fasttext

# Load the pre-trained model
model = fasttext.load_model("lid.176.ftz")

# Use the model to make a prediction
text = "This is a sample text in English"

# Use aggressive language detection
predictions = model.predict(text, k=-1)

# Print the predictions
print(predictions)
(('__label__en', '__label__de', '__label__te', '__label__th', '__label__bn', '__label__id', '__label__ru', '__label__fa', '__label__ta', '__label__kn', '__label__es', '__label__hi', '__label__it', '__label__ml', '__label__zh', '__label__ja', '__label__pt', '__label__nl', '__label__km', '__label__vi', '__label__ms', '__label__tr', '__label__ro', '__label__my', '__label__el', '__label__si', '__label__ne', '__label__gu', '__label__fr', '__label__he', '__label__ar', '__label__mk', '__label__pl', '__label__uk', '__label__or', '__label__sw', '__label__as', '__label__sco', '__label__mr', '__label__sa', '__label__hu', '__label__ko', '__label__tl', '__label__pa', '__label__ca', '__label__fi', '__label__bg', '__label__hy', '__label__ka', '__label__af', '__label__az', '__label__sr', '__label__tg', '__label__jv', '__label__mn', '__label__da', '__label__sv', '__label__cy', '__label__azb', '__label__cs', '__label__no', '__label__sh', '__label__arz', '__label__su', '__label__uz', '__label__ur', '__label__sl', '__label__sq', '__label__eu', '__label__pam', '__label__sd', '__label__bh', '__label__gom', '__label__new', '__label__pnb', '__label__bo', '__label__lv', '__label__ps', '__label__gd', '__label__la', '__label__br', '__label__sah', '__label__so', '__label__bar', '__label__hif', '__label__ba', '__label__xmf', '__label__mai', '__label__ckb', '__label__bs', '__label__hr', '__label__eo', '__label__kk', '__label__yo', '__label__ast', '__label__gl', '__label__et', '__label__min', '__label__fy', '__label__ceb', '__label__ku', '__label__diq', '__label__dty', '__label__sk', '__label__cv', '__label__mg', '__label__be', '__label__ky', '__label__frr', '__label__am', '__label__lb', '__label__lt', '__label__yi', '__label__ug', '__label__tt', '__label__dv', '__label__ce'), array([9.64780271e-01, 3.08053242e-03, 1.97254401e-03, 1.81053276e-03,
       1.61182613e-03, 1.56028639e-03, 1.54142629e-03, 1.24220003e-03,
       1.04459631e-03, 1.04103796e-03, 9.76120180e-04, 9.31559014e-04,
       8.80905485e-04, 8.77901795e-04, 8.41499481e-04, 8.40898661e-04,
       8.13336519e-04, 7.24010402e-04, 7.14088266e-04, 6.21536397e-04,
       5.70030359e-04, 5.37957530e-04, 5.11918159e-04, 5.06383483e-04,
       4.91981569e-04, 4.86537203e-04, 4.74088913e-04, 3.76311538e-04,
       3.74388386e-04, 3.67953617e-04, 3.48389411e-04, 3.32949829e-04,
       2.93608144e-04, 2.88937095e-04, 2.72503763e-04, 2.09198479e-04,
       2.09184713e-04, 2.02362629e-04, 1.93820830e-04, 1.84631805e-04,
       1.76725473e-04, 1.75931098e-04, 1.75901558e-04, 1.66520083e-04,
       1.65673104e-04, 1.63752324e-04, 1.59383111e-04, 1.57409100e-04,
       1.57213028e-04, 1.44165824e-04, 1.42352059e-04, 1.34115253e-04,
       1.31551846e-04, 1.22665064e-04, 1.22609621e-04, 1.21908888e-04,
       1.07098684e-04, 9.73919668e-05, 9.39379825e-05, 9.12295509e-05,
       8.10417187e-05, 7.88749312e-05, 7.74870496e-05, 7.04112026e-05,
       6.85247214e-05, 6.81404345e-05, 6.60218066e-05, 6.56088669e-05,
       6.48355344e-05, 6.12535296e-05, 6.11406576e-05, 6.02545588e-05,
       6.02264117e-05, 5.31237238e-05, 4.74484914e-05, 4.72255233e-05,
       4.37675772e-05, 4.16412986e-05, 4.06144245e-05, 4.02175829e-05,
       3.72897302e-05, 3.60325903e-05, 3.51678318e-05, 3.04828318e-05,
       2.97485421e-05, 2.79343385e-05, 2.75584916e-05, 2.73486967e-05,
       2.66454481e-05, 2.60915804e-05, 2.42980277e-05, 2.35217030e-05,
       2.33252595e-05, 2.25593740e-05, 2.14378433e-05, 2.00410086e-05,
       1.99838669e-05, 1.98738817e-05, 1.87539226e-05, 1.86188790e-05,
       1.69365539e-05, 1.67025282e-05, 1.66536047e-05, 1.46203838e-05,
       1.38812202e-05, 1.32007854e-05, 1.28564443e-05, 1.28536249e-05,
       1.28255597e-05, 1.19186952e-05, 1.19031674e-05, 1.18591252e-05,
       1.15433177e-05, 1.11187519e-05, 1.10003712e-05, 1.09521461e-05,
       1.07508677e-05]))
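
A long list like this is rarely what you want to pass on. One common pattern, sketched below, is to accept the top language only when its probability is high enough and to fall back to "undetermined" otherwise. pick_language is my own helper name, the 0.5 cut-off is an arbitrary example value, and "und" is the ISO 639 code for an undetermined language. If you only need a handful of candidates, calling predict() with something like k=3 and a small threshold achieves a similar effect inside fastText itself.

```python
def pick_language(prediction, min_confidence=0.5, fallback="und"):
    """Return (code, confidence), or the fallback code when unsure."""
    labels, probs = prediction
    if len(labels) == 0:
        return fallback, 0.0
    # Pair each probability with its label and take the most likely one.
    best_prob, best_label = max(zip(probs, labels))
    best_prob = float(best_prob)
    if best_prob < min_confidence:
        return fallback, best_prob
    return best_label.replace("__label__", "", 1), best_prob


# First three entries of the output above.
confident = (("__label__en", "__label__de", "__label__te"),
             [0.964780271, 0.00308053242, 0.00197254401])
print(pick_language(confident))   # ('en', 0.964780271)

# A made-up ambiguous prediction, for illustration only.
ambiguous = (("__label__en", "__label__de"), [0.41, 0.38])
print(pick_language(ambiguous))   # ('und', 0.41)
```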

What is automatic language detection?

Automatic language detection is a computational task that involves determining the language of a given text automatically, without any human intervention. This technology is used in many natural language processing applications, such as search engines and machine translation systems. fastText is a good example of an automatic language detection library in Python.

What is a language identification chart?

[Figure: One example of an LID chart]

A language identification chart (LID chart) is a graphical representation of the results of a language identification system. This kind of chart is typically used to evaluate the performance of a language identification model. The chart displays the predicted language of a text or document on the y-axis, and the true language of the text or document on the x-axis.

Using an LID chart, you can quickly analyze the performance of a language identification system, such as which languages are being misclassified, which languages have low accuracy, or which languages are most similar to each other. It can also be used to compare the performance of different language identification systems or models.
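
The numbers behind such a chart are simply counts of (true language, predicted language) pairs, which is often called a confusion matrix. Here is a minimal sketch that builds those counts and prints them as a text grid with predictions as rows and true languages as columns. The function names and the tiny example data are made up for illustration. For a real chart you would feed in the predictions of your model on a labeled test set.

```python
from collections import Counter


def confusion_counts(true_langs, predicted_langs):
    """Count how often each (true, predicted) language pair occurs."""
    return Counter(zip(true_langs, predicted_langs))


def format_chart(counts):
    """Render the counts as text: rows are predicted, columns are true."""
    langs = sorted({lang for pair in counts for lang in pair})
    rows = ["pred\\true " + " ".join(f"{lang:>4}" for lang in langs)]
    for predicted in langs:
        cells = " ".join(f"{counts[(true, predicted)]:>4}" for true in langs)
        rows.append(f"{predicted:<9} {cells}")
    return "\n".join(rows)


# Made-up example: one French text was misclassified as Spanish.
true_langs = ["en", "en", "fr", "fr", "de"]
predicted_langs = ["en", "en", "fr", "es", "de"]
counts = confusion_counts(true_langs, predicted_langs)
print(format_chart(counts))
```

Cells on the diagonal are correct predictions. Everything off the diagonal shows which language was confused with which.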

End Note

In conclusion, fastText is a powerful and flexible library for language detection in Python. With its pre-trained language identification models, fastText can identify the language of a text quickly and accurately.

The library implementation in Python makes it easy to use, and the code samples provided in this article demonstrate how to train and use a language detection model using fastText.

Whether you’re a beginner or an experienced practitioner in NLP, fastText is a great tool to implement in your project. Try it out today and see how it can help you solve your language detection needs.

If you have any questions or suggestions regarding this tutorial, please let me know in the comment section below.
