Are you looking for an easy and efficient way to detect languages in Python? Then FastText is a strong choice. FastText is a machine learning library that can identify the language of a text quickly and accurately. In this blog post, I will explain how FastText works and how you can use it for language detection.
Application of Language Detection
Language detection is used in various real-world NLP projects such as:
- Machine Translation: Google Translate is one of the best-known applications of this kind.
  - Google uses language detection to determine the source language before translating it into the target language.
  - Facebook uses language identification to translate posts and comments automatically. The same technique is also used to suggest relevant language-specific groups to join.
- Text Classification: In an NLP project, you may find that some users have written comments in their regional language. If your classification model was trained on English text, language identification lets you spot those comments and translate them into English first.
- Spell Checking: If the language of the text is not recognized correctly, the spell checker may apply the wrong dictionary and produce wrong corrections.
- Customer Support: Language detection can help route customer support requests to the appropriate agent who speaks the customer’s language.
What is FastText?
FastText is an open-source and lightweight library developed by Facebook AI Research. It is built to handle large datasets efficiently and can classify large volumes of text quickly on a standard CPU. FastText also provides pre-trained word vectors for 157 languages, allowing for transfer learning and quick integration into NLP models. With its simple interface and high performance, FastText has become a popular choice for NLP tasks such as sentiment analysis, spam filtering, and language identification.
The key point of FastText is that it is extremely fast at training word vector models. According to its authors, it can train on more than a billion words in under 10 minutes on a standard multicore CPU.
If you are interested in how to implement FastText word embedding in Python then read this post.
Why is FastText so fast?
There are several reasons why FastText is fast while maintaining performance. Some of them are:
- It is implemented in C++
- It trains on multiple threads in parallel
- It is based on a simple, shallow neural network architecture
How does FastText work?
FastText works by representing words as vectors in a high-dimensional space, using a neural network architecture. It takes into account both the internal structure of words (n-grams) and their distribution in the training corpus to create a dense vector representation. This vector representation can then be used for a variety of NLP tasks such as text classification, sentiment analysis, and language identification.
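To make the idea of character n-grams concrete, here is a minimal sketch of how a word can be broken into n-grams. The function name and parameters are my own for illustration, and the real library also hashes these n-grams into buckets, which is left out here. It assumes the word is wrapped in < and > boundary markers, as described in the fastText paper.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word`, wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of length n over the wrapped word.
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams


print(char_ngrams("where", min_n=3, max_n=3))
# ['<wh', 'whe', 'her', 'ere', 're>']
```

Because these fragments are shared across words, short pieces such as "the" or "sch" carry a strong hint about the language, which is one reason the approach works well for language identification.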
Why use FastText for language detection?
There are several libraries for language identification in Python, and each has its own advantages and limitations. In this example, we will use the FastText library for language detection in Python because it is both highly accurate and extremely fast.
FastText language detection Python
To use fastText for language detection in Python, you will first need to install the library. This can be done using the following command:
pip install fasttext
FastText language detection model
fastText distributes two official pre-trained models for language identification. Both recognize the same 176 languages and differ in file size and accuracy, so let’s look at them before loading one blindly.
lid.176.bin: The full model. It is faster and slightly more accurate, but the file is about 126 MB.
lid.176.ftz: A compressed (quantized) version of the same model. The file is under 1 MB (about 917 kB), at the cost of slightly lower accuracy.
Load pre-trained model
For this fastText language detection example, I am going to use the lid.176.ftz model. You can use a different model and let me know your experience in the comment section. Now let’s load the pre-trained model for language identification.
import fasttext

# Load pre-trained model
model = fasttext.load_model('lid.176.ftz')
The above code loads a pre-trained model that has been trained on text in 176 different languages.
Note: If you get an error saying ValueError: lid.176.ftz cannot be opened for loading!, the model file was not found. In that case, download the model manually and put it in your working directory. You can download lid.176.ftz by clicking this link.
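If you prefer to fetch the file from code instead, the sketch below downloads it only when it is missing. The ensure_model name and its fetch parameter are hypothetical helpers of my own. The URL is the public download location of lid.176.ftz at the time of writing, so treat it as an assumption that may change.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Assumed public download location of the compressed language identification model.
MODEL_URL = "https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.ftz"


def ensure_model(path="lid.176.ftz", url=MODEL_URL, fetch=urlretrieve):
    """Download the model to `path` if it is not already there, then return the path."""
    target = Path(path)
    if not target.exists():
        # Network access is needed the first time this runs.
        fetch(url, str(target))
    return str(target)
```

You could then load the model with fasttext.load_model(ensure_model()).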
Language identification from text
Once the pre-trained model has loaded successfully, you can recognize the language of any input text by calling the predict() function.
# Find the language of a given text
language = model.predict('This is a sentence in English')

# Print predicted language
print(language)
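The predict() function returns a tuple with two items: the predicted labels (each one a language code prefixed with __label__) and an array with the probability of each label. By default, only the single most likely language is returned.
For the above example, it correctly predicted the input text as English with a probability of about 97%. Note that this number is the model's confidence in this one prediction, not its overall accuracy.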
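In practice it is handy to wrap predict() in a small helper that returns a plain language code and a float. The sketch below is one way to do it. detect_language is a hypothetical name of my own, and it assumes the model object behaves like the one loaded above. It also flattens newlines first, because predict() processes one line at a time and raises an error on text that contains a newline.

```python
def detect_language(model, text):
    """Return (language_code, probability) for the most likely language of `text`."""
    # predict() handles a single line, so collapse newlines and repeated spaces.
    single_line = " ".join(text.split())
    labels, probabilities = model.predict(single_line, k=1)
    # Labels look like '__label__en'; strip the prefix to get the bare code.
    code = labels[0].replace("__label__", "", 1)
    return code, float(probabilities[0])
```

With the model loaded earlier, detect_language(model, 'Bonjour tout le monde') should return a pair whose first item is a code such as 'fr'.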
FastText language codes
As you can see in the above example, fastText reports each language as a language code rather than a language name. So we need to know the available fastText language codes and their language names. Below is the list of 176 language codes used in fastText
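To turn a predicted label into a readable name, a small lookup table is enough. The sketch below covers only a handful of codes as an illustration. LANGUAGE_NAMES and language_name are names I made up, and you would extend the dictionary with the codes you care about.

```python
# Illustrative subset only; extend it with the language codes you need.
LANGUAGE_NAMES = {
    "en": "English",
    "fr": "French",
    "de": "German",
    "es": "Spanish",
    "hi": "Hindi",
    "bn": "Bengali",
}


def language_name(label):
    """Map a fastText label such as '__label__en' to a readable language name."""
    code = label.replace("__label__", "", 1)
    # Fall back to the bare code when the name is not in the table.
    return LANGUAGE_NAMES.get(code, code)


print(language_name("__label__en"))
# English
```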
Train custom language detection model with fastText
You can also train your own language detection model using fastText. This feature sets fastText apart from many other language detection libraries in Python.
To train your custom language identification model, you need to prepare a training dataset that covers the languages you want to detect. The dataset should be a plain text file with one example per line. Each line starts with a label that indicates the language, followed by the text. Sample training data should look like below.
__label__en This is an example of English text.
__label__fr Voici un exemple de texte en français.
__label__de Hier ist ein Beispiel für deutschen Text.
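Lines in this shape are easy to generate from (language code, text) pairs. Here is a minimal sketch. write_training_file is a hypothetical helper of my own, and the demo writes to a temporary folder so it does not touch your working directory.

```python
import tempfile
from pathlib import Path


def write_training_file(samples, path):
    """Write (language_code, text) pairs as fastText supervised training lines."""
    lines = []
    for code, text in samples:
        # Each example must fit on one line, so collapse newlines and extra spaces.
        clean_text = " ".join(text.split())
        if clean_text:
            lines.append(f"__label__{code} {clean_text}")
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines)


samples = [
    ("en", "This is an example of English text."),
    ("fr", "Voici un exemple de texte en français."),
    ("de", "Hier ist ein Beispiel für deutschen Text."),
]

with tempfile.TemporaryDirectory() as tmp_dir:
    train_path = Path(tmp_dir) / "data.train"
    written = write_training_file(samples, train_path)
    content = train_path.read_text(encoding="utf-8")

print(content)
```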
Format of the training data:
__label__<language_code> <text>
Here <language_code> is the code for the language of the text and <text> is the actual text.
By following the above format, you need to create your own training data. Once your training data is ready, save it as a plain text file, for example data.train (the .train extension is only a naming convention, not a special format). You can then train your custom language detection model using the following code:
import fasttext

# Train the model
model = fasttext.train_supervised('data.train')

# Save the model in binary format
model.save_model('model.bin')
In the above code, we use the save_model() function of fastText to save the trained language detection model.
You can later load the saved model using the load_model() function:
import fasttext

# Load the saved model
model = fasttext.load_model('model.bin')

# Predict the language of a given text
language = model.predict('This is a sentence in English')
print(language)
As you can see, the confidence is very low (about 33%). This is because we trained the model on only the 3 example sentences mentioned above.
Make sure that you have a good variety of text for each language, at least several thousand sentences per language, in order to get a robust model that can provide good accuracy.
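To check whether your model is actually accurate, hold back part of the labelled data for validation instead of training on all of it. The sketch below shows one simple way to split the lines. split_train_valid and the file names in the comments are my own assumptions. The comments also show how fastText's test() method could then be used on the held-out file.

```python
import random


def split_train_valid(lines, valid_fraction=0.1, seed=42):
    """Shuffle labelled lines and split them into training and validation lists."""
    shuffled = list(lines)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_valid = max(1, int(len(shuffled) * valid_fraction))
    return shuffled[n_valid:], shuffled[:n_valid]


# Hypothetical usage, assuming the two lists are saved as data.train and data.valid:
#   model = fasttext.train_supervised('data.train')
#   n_examples, precision, recall = model.test('data.valid')
```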
Now let me list down some common questions and concerns about language detection with fastText and Python.
.ftz vs .bin: which model is better to use?
The choice between using a .ftz or a .bin model depends on your specific use case and requirements.
- .bin models are the full, uncompressed models. They keep all parameters at full precision, so they are slightly more accurate and fast at prediction time. The downside is that they take up more disk space and memory (about 126 MB for lid.176.bin)
- .ftz models are compressed versions of the same models, produced by quantization. They are much smaller (under 1 MB for lid.176.ftz), which makes them easy to ship and deploy. The trade-off is a slight loss in accuracy
Both formats are binary files, so neither is meant to be read or edited by hand.
If you want the best accuracy and file size is not a concern, you should use .bin models. But if you need a small model, for example on a mobile device or a memory-constrained server, you should go with .ftz models.
What is aggressive language detection?
Language detection can be considered “aggressive” when it is designed to commit to a language for every text, even in cases where the text is short, noisy, or written in a mixed or unknown language. This can be useful in situations where the language of a text needs to be identified quickly, such as in text-to-speech systems or machine translation applications.
An aggressive language detection model typically relies on techniques such as character-level embeddings or character n-grams, which stay informative even for short or noisy text. This type of model can also be customized for your own dataset or for your specific languages.
fastText lets you make language detection more aggressive through the arguments of predict(). By default, the predict() method returns only the single top-scoring language (k=1), with no minimum confidence score (threshold=0.0). If you want to be more aggressive in your language detection, you can pass k=-1 as an argument to the predict() function. It will then return all candidate languages at or above the threshold together with their confidence scores, sorted from most to least likely.
Here is an example of how you can use aggressive language detection with fastText in Python:
import fasttext

# Load the pre-trained model
model = fasttext.load_model("lid.176.ftz")

# Use the model to make a prediction
text = "This is a sample text in English"

# Use aggressive language detection
predictions = model.predict(text, k=-1)

# Print the predictions
print(predictions)
(('__label__en', '__label__de', '__label__te', '__label__th', '__label__bn', '__label__id', '__label__ru', '__label__fa', '__label__ta', '__label__kn', '__label__es', '__label__hi', '__label__it', '__label__ml', '__label__zh', '__label__ja', '__label__pt', '__label__nl', '__label__km', '__label__vi', '__label__ms', '__label__tr', '__label__ro', '__label__my', '__label__el', '__label__si', '__label__ne', '__label__gu', '__label__fr', '__label__he', '__label__ar', '__label__mk', '__label__pl', '__label__uk', '__label__or', '__label__sw', '__label__as', '__label__sco', '__label__mr', '__label__sa', '__label__hu', '__label__ko', '__label__tl', '__label__pa', '__label__ca', '__label__fi', '__label__bg', '__label__hy', '__label__ka', '__label__af', '__label__az', '__label__sr', '__label__tg', '__label__jv', '__label__mn', '__label__da', '__label__sv', '__label__cy', '__label__azb', '__label__cs', '__label__no', '__label__sh', '__label__arz', '__label__su', '__label__uz', '__label__ur', '__label__sl', '__label__sq', '__label__eu', '__label__pam', '__label__sd', '__label__bh', '__label__gom', '__label__new', '__label__pnb', '__label__bo', '__label__lv', '__label__ps', '__label__gd', '__label__la', '__label__br', '__label__sah', '__label__so', '__label__bar', '__label__hif', '__label__ba', '__label__xmf', '__label__mai', '__label__ckb', '__label__bs', '__label__hr', '__label__eo', '__label__kk', '__label__yo', '__label__ast', '__label__gl', '__label__et', '__label__min', '__label__fy', '__label__ceb', '__label__ku', '__label__diq', '__label__dty', '__label__sk', '__label__cv', '__label__mg', '__label__be', '__label__ky', '__label__frr', '__label__am', '__label__lb', '__label__lt', '__label__yi', '__label__ug', '__label__tt', '__label__dv', '__label__ce'), array([9.64780271e-01, 3.08053242e-03, 1.97254401e-03, 1.81053276e-03, 1.61182613e-03, 1.56028639e-03, 1.54142629e-03, 1.24220003e-03, 1.04459631e-03, 1.04103796e-03, 9.76120180e-04, 9.31559014e-04, 8.80905485e-04, 
8.77901795e-04, 8.41499481e-04, 8.40898661e-04, 8.13336519e-04, 7.24010402e-04, 7.14088266e-04, 6.21536397e-04, 5.70030359e-04, 5.37957530e-04, 5.11918159e-04, 5.06383483e-04, 4.91981569e-04, 4.86537203e-04, 4.74088913e-04, 3.76311538e-04, 3.74388386e-04, 3.67953617e-04, 3.48389411e-04, 3.32949829e-04, 2.93608144e-04, 2.88937095e-04, 2.72503763e-04, 2.09198479e-04, 2.09184713e-04, 2.02362629e-04, 1.93820830e-04, 1.84631805e-04, 1.76725473e-04, 1.75931098e-04, 1.75901558e-04, 1.66520083e-04, 1.65673104e-04, 1.63752324e-04, 1.59383111e-04, 1.57409100e-04, 1.57213028e-04, 1.44165824e-04, 1.42352059e-04, 1.34115253e-04, 1.31551846e-04, 1.22665064e-04, 1.22609621e-04, 1.21908888e-04, 1.07098684e-04, 9.73919668e-05, 9.39379825e-05, 9.12295509e-05, 8.10417187e-05, 7.88749312e-05, 7.74870496e-05, 7.04112026e-05, 6.85247214e-05, 6.81404345e-05, 6.60218066e-05, 6.56088669e-05, 6.48355344e-05, 6.12535296e-05, 6.11406576e-05, 6.02545588e-05, 6.02264117e-05, 5.31237238e-05, 4.74484914e-05, 4.72255233e-05, 4.37675772e-05, 4.16412986e-05, 4.06144245e-05, 4.02175829e-05, 3.72897302e-05, 3.60325903e-05, 3.51678318e-05, 3.04828318e-05, 2.97485421e-05, 2.79343385e-05, 2.75584916e-05, 2.73486967e-05, 2.66454481e-05, 2.60915804e-05, 2.42980277e-05, 2.35217030e-05, 2.33252595e-05, 2.25593740e-05, 2.14378433e-05, 2.00410086e-05, 1.99838669e-05, 1.98738817e-05, 1.87539226e-05, 1.86188790e-05, 1.69365539e-05, 1.67025282e-05, 1.66536047e-05, 1.46203838e-05, 1.38812202e-05, 1.32007854e-05, 1.28564443e-05, 1.28536249e-05, 1.28255597e-05, 1.19186952e-05, 1.19031674e-05, 1.18591252e-05, 1.15433177e-05, 1.11187519e-05, 1.10003712e-05, 1.09521461e-05, 1.07508677e-05]))
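The full output is long, and most candidates have a negligible probability. A small filter makes it easier to read. The sketch below is a hypothetical helper of my own (top_candidates) that keeps only candidates above a minimum probability. The sample input reuses the first three labels and scores from the output above.

```python
def top_candidates(predictions, min_probability=0.001, limit=5):
    """Keep the most likely (code, probability) pairs from a predict() result."""
    labels, probabilities = predictions
    candidates = [
        (label.replace("__label__", "", 1), float(probability))
        for label, probability in zip(labels, probabilities)
        if probability >= min_probability
    ]
    # Sort from most to least likely and keep at most `limit` entries.
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return candidates[:limit]


# First three entries of the output shown above.
sample = (
    ("__label__en", "__label__de", "__label__te"),
    [9.64780271e-01, 3.08053242e-03, 1.97254401e-03],
)
print(top_candidates(sample, min_probability=0.002))
```

You can also let fastText do the filtering for you, since predict() accepts both arguments directly, for example model.predict(text, k=3, threshold=0.01).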
What is automatic language detection?
Automatic language detection is the computational task of determining the language of a given text without any human intervention. It is used in natural language processing applications such as search engines and machine translation systems. fastText is a good example of an automatic language detection library in Python.
What is a language identification chart?
A language identification chart (LID chart) is a graphical representation of the results of a language identification system, essentially a confusion matrix. This kind of chart is typically used to evaluate the performance of a language identification model. The chart displays the predicted language of a text or document on the y-axis, and the true language of the text or document on the x-axis.
Using an LID chart, you can quickly analyze the performance of a language identification system: which languages are being misclassified, which languages have low accuracy, and which languages are most often confused with each other. It can also be used to compare the performance of different language identification systems or models.
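The numbers behind such a chart are simple to compute. The sketch below counts (true language, predicted language) pairs and derives a per-language accuracy from them. The function names and the tiny label lists are made up for illustration, and in a real evaluation the predictions would come from your model.

```python
from collections import Counter


def confusion_counts(true_labels, predicted_labels):
    """Count how often each (true language, predicted language) pair occurs."""
    return Counter(zip(true_labels, predicted_labels))


def per_language_accuracy(counts):
    """Return the share of correctly identified texts for each true language."""
    totals = Counter()
    correct = Counter()
    for (true_lang, predicted_lang), n in counts.items():
        totals[true_lang] += n
        if true_lang == predicted_lang:
            correct[true_lang] += n
    return {lang: correct[lang] / totals[lang] for lang in totals}


# Made-up labels for illustration only.
true_labels = ["en", "en", "fr", "fr", "de", "de"]
predicted_labels = ["en", "en", "fr", "en", "de", "nl"]

counts = confusion_counts(true_labels, predicted_labels)
accuracy = per_language_accuracy(counts)
print(counts)
print(accuracy)
```

Each cell of the chart is one entry of counts, so a plotting library of your choice can draw the chart as a heatmap from this data.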
In conclusion, fastText is a powerful and flexible library for language detection in Python. With its pre-trained language identification models, fastText can identify the language of a text quickly and with high accuracy.
Its Python bindings make it easy to use, and the code samples provided in this article demonstrate how to use a pre-trained model and how to train your own language detection model using fastText.
Whether you’re a beginner or an experienced practitioner in NLP, fastText is a great tool to implement in your project. Try it out today and see how it can help you solve your language detection needs.
If you have any questions or suggestions regarding this tutorial, please let me know in the comment section below.
Hi there, I’m Anindya Naskar, Data Science Engineer. I created this website to show you what I believe is the best possible way to get your start in the field of Data Science.