Improve Text Classification with Python and FastText

In my last few posts, I showed you how to use fastText for word embeddings and language detection. In this tutorial, I will show you how to use fastText for text classification in Python.

If you want to know the basics of fastText, such as what fastText is, how it works, and how to use fastText embeddings, please read my earlier tutorials on those topics first.

Text Classification example in Python

I will divide the entire process of text classification using fastText into some steps so that you can easily implement it in Python with simple code.

Step1: Install fastText

You can install fastText by running the below command in cmd:

pip install fasttext

Step2: Download Dataset

For this demonstration, I will use the IMDb movie review dataset, which consists of 50,000 movie reviews. The dataset contains two columns:

review: This column contains actual text comments from users or customers about a movie
sentiment: This column contains the label of that comment as positive or negative

The size of the dataset is around 63 MB. Click here to download this dataset.

Step3: Load the Dataset

Once you have downloaded the dataset, you need to load it so that it can be used as training data for our custom fastText text classifier.

import fasttext
import pandas as pd

# Load the dataset
df = pd.read_csv("IMDB Dataset.csv")
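Before going further, it can help to confirm that the file really has the two columns described above and to see how many reviews carry each label. The sketch below does this with the standard-library csv module on a tiny in-memory sample. The helper name label_counts and the sample rows are made up for illustration and are not part of the dataset.

```python
import csv
import io
from collections import Counter

def label_counts(csv_file, label_column="sentiment"):
    """Count how many rows carry each label in a CSV that has a header row."""
    reader = csv.DictReader(csv_file)
    if label_column not in (reader.fieldnames or []):
        raise ValueError(f"column {label_column!r} not found in {reader.fieldnames}")
    return Counter(row[label_column] for row in reader)

# Tiny made-up sample with the same two columns as the IMDb CSV
sample = io.StringIO(
    "review,sentiment\n"
    '"Loved it, great acting.",positive\n'
    '"Dull and far too long.",negative\n'
    '"A wonderful little film.",positive\n'
)
counts = label_counts(sample)
print(counts)
```

To run it on the real file, open it with open("IMDB Dataset.csv", encoding="utf-8", newline="") and pass the file object to label_counts. With the DataFrame loaded above, df["sentiment"].value_counts() gives the same information.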

Step4: Split the Dataset into train and test

As you know, a basic step in training any machine learning or deep learning model is to divide the dataset into train and test sets. The train set is used to train the model, and the test set is used to evaluate it.

# Split the dataset into train and test
train_data = df[:40000]
test_data = df[40000:]

Out of the total 50,000 rows in our dataset, we will use 40,000 as training data and 10,000 as test data.
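Slicing by position works when the rows are already in random order. If a file happens to be sorted (for example, all positive reviews first), a shuffled split is safer. Here is a minimal sketch of that idea using only the standard library; the function name split_indices, the 20% test fraction, and the seed are assumptions chosen to mirror the 40,000/10,000 split above.

```python
import random

def split_indices(n_rows, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx) after shuffling the row positions."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    n_test = round(n_rows * test_fraction)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(50000, test_fraction=0.2)
print(len(train_idx), len(test_idx))
```

With the DataFrame from the previous step, the two index lists could be applied as train_data = df.iloc[train_idx] and test_data = df.iloc[test_idx].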

Step4: Convert dataset to fastText format

fastText expects training data in a specific format. The data should be plain text, where each line is one sample consisting of the label (prefixed by __label__) followed by the text. The format looks like this:

__label__positive This is a great product!
__label__negative This product is not worth the money.
__label__neutral The product is average.

Our IMDb dataset is a CSV file, so we need to convert it to match the fastText training data format. The code below does that:

# Write a DataFrame to a text file in fastText format: "__label__<sentiment> <review>"
def write_fasttext_file(data, path):
    with open(path, "w", encoding="utf-8") as f:
        for sentiment, review in zip(data["sentiment"], data["review"]):
            # Collapse newlines and repeated spaces so each sample stays on one line
            review = " ".join(str(review).split())
            f.write(f"__label__{sentiment} {review}\n")

# Write the train data to file
write_fasttext_file(train_data, "train_data.txt")

# Write the test data to file
write_fasttext_file(test_data, "test_data.txt")

In this code, we write the train and test datasets to separate text files, which are saved in your working directory. Each line contains a label and the review text separated by a space. The labels are prefixed with “__label__” to match the fastText training data format.


In the above code, we use encoding="utf-8" to avoid this error: UnicodeEncodeError: 'charmap' codec can't encode character '\x97' in position 1166: character maps to <undefined>
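The line format itself is easy to get subtly wrong, for example when a review contains a line break or leftover HTML such as <br /> tags. The following sketch shows one way to build and read back a single line. The helper names to_fasttext_line and parse_fasttext_line and the cleaning rules are my own assumptions for illustration; fastText does not require this particular cleaning.

```python
import re

PREFIX = "__label__"

def to_fasttext_line(label, text):
    """Format one sample as '__label__<label> <text>' on a single line."""
    text = re.sub(r"<br\s*/?>", " ", text)  # replace HTML line breaks with a space
    text = " ".join(text.split())            # collapse newlines, tabs and repeated spaces
    return f"{PREFIX}{label} {text}"

def parse_fasttext_line(line):
    """Split a fastText-format line back into (label, text)."""
    tag, _, text = line.partition(" ")
    if not tag.startswith(PREFIX):
        raise ValueError(f"line does not start with {PREFIX!r}")
    return tag[len(PREFIX):], text

line = to_fasttext_line("positive", "Great film!<br /><br />Would\nwatch again.")
print(line)
print(parse_fasttext_line(line))
```

Keeping the formatting in one small function makes it easy to apply exactly the same cleaning to the text you later pass to the model for prediction.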


Step5: Train the model

Now that our data is ready, we can train the model. fastText provides a simple API for training a text classification model in Python. We will set the number of epochs to 50.

# Train the model
model = fasttext.train_supervised(input="train_data.txt", epoch=50)
# Save the model
model.save_model('text_classification_model.bin')

There are various parameters you can tune to improve the accuracy of your custom fastText text classifier, such as:

lr: learning rate
minCount: minimum number of word occurrences
minn: minimum character n-gram size
maxn: maximum character n-gram size
epoch: number of epochs
loss: loss function {softmax, ns, hs}
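One simple way to explore these parameters is a small grid search. The sketch below builds every combination of a few candidate values and defines a helper that trains and scores one combination. The candidate values, the helper names parameter_grid and train_and_score, and the use of test_data.txt for scoring are all assumptions for illustration; a separate validation file would be a better choice for picking parameters.

```python
from itertools import product

# Candidate values are illustrative guesses, not tuned recommendations
search_space = {
    "lr": [0.1, 0.5],
    "epoch": [10, 25],
    "loss": ["softmax", "hs"],
}

def parameter_grid(space):
    """Return one dict per combination of the candidate values."""
    names = list(space)
    return [dict(zip(names, values)) for values in product(*(space[n] for n in names))]

def train_and_score(params, train_path="train_data.txt", eval_path="test_data.txt"):
    """Train one model with the given parameters and return its precision at 1."""
    import fasttext  # imported here so the grid can be built without fastText installed
    model = fasttext.train_supervised(input=train_path, **params)
    return model.test(eval_path)[1]

grid = parameter_grid(search_space)
print(len(grid), grid[0])
```

Once the two text files from the previous step exist, best = max(grid, key=train_and_score) would train one model per combination and keep the best-scoring parameters. This can take a while, since every combination is a full training run.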

Step6: Test the model

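Once the model is trained, we can evaluate its performance on the test set using the test() method. It returns a tuple of (number of samples, precision at 1, recall at 1). Because every review has exactly one label, precision at 1 is the same as accuracy here.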

# Test the model
result = model.test("test_data.txt")
print("Test Accuracy:", result[1])

Test Accuracy: 0.8877

As you can see, the accuracy of our custom text classification model is around 88%, which is a little low. One likely reason is the size of the data: the whole dataset has only 50,000 observations, of which we used 40,000 for training. If you want to achieve accuracy over 95%, you will probably need far more training data, perhaps around 50 lakh (5 million) observations.
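The values returned by test() can also be combined into an F1 score. Here is a minimal sketch, assuming the usual fastText result layout of (number of samples, precision, recall). The function name summarize_test is made up, and the tuple is a stand-in built from the accuracy reported above, with recall assumed equal to precision because each review has one label.

```python
def summarize_test(result):
    """Turn a (samples, precision, recall) tuple into a small metrics dict."""
    n_samples, precision, recall = result
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"samples": n_samples, "precision": precision, "recall": recall, "f1": f1}

# Stand-in for model.test("test_data.txt"), using the accuracy reported above
metrics = summarize_test((10000, 0.8877, 0.8877))
print(metrics)
```

In a real run you would pass the result of model.test("test_data.txt") directly to summarize_test.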

Step7: Make predictions

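Finally, we can use our trained model to make predictions on new text. Pass a text string to the predict() method of fastText, and it returns the predicted label and its probability. The probability can come out marginally above 1, as in the output below; this is a small numerical artifact (fastText adds a tiny constant internally when it computes the log-probability), not a sign of a problem.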

# Make predictions on new data
text = "The movie was fantastic! I loved every minute of it."
labels, prob = model.predict(text)
print("Label:", labels[0])
print("Probability:", prob[0])

Label: __label__positive
Probability: 1.0000100135803223
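The raw output is a little awkward to use directly: the label still carries its __label__ prefix, and the probability may sit slightly above 1. The sketch below tidies both. The helper names clean_prediction and single_line are assumptions for illustration, and the input values are copied from the output above rather than produced by a live model.

```python
def clean_prediction(labels, probs, prefix="__label__"):
    """Strip the label prefix and clamp the top probability into [0, 1]."""
    label = labels[0]
    if label.startswith(prefix):
        label = label[len(prefix):]
    prob = min(max(float(probs[0]), 0.0), 1.0)
    return label, prob

def single_line(text):
    """predict() works on one line at a time, so fold any newlines into spaces."""
    return " ".join(text.split())

# Stand-in values copied from the output above instead of calling model.predict()
print(clean_prediction(("__label__positive",), [1.0000100135803223]))
```

With a trained model, this could be used as clean_prediction(*model.predict(single_line(text))).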

End Note

In this post, you learned how to use fastText for text classification and step-by-step implementation in Python. Try it out in your NLP project and let me know how much improvement you get.

fastText is a lightweight machine learning approach to sentiment analysis or text classification (a shallow linear model rather than a deep network). If you are looking for a simple, rule-based approach to sentiment analysis, read this article.


FAQs

Finally, let me list some frequently asked questions about implementing text classification with fastText in Python.

What are the FastText model methods?

The following are some of the popular methods that are available in the FastText library:

train_supervised: trains a supervised text classification model using the given training data.
load_model: loads a pre-trained FastText model from a file.
predict: predicts the label of a text using the trained model.
test: tests the accuracy of the trained model on a set of test data.
get_nearest_neighbors: retrieves the nearest neighbors of a word in the embedding space.
get_word_vector: retrieves the vector representation of a word in the embedding space.
get_sentence_vector: retrieves the vector representation of a sentence in the embedding space.
get_dimension: retrieves the dimensionality of the embedding space.
save_model: saves the trained model to a file.
get_labels: retrieves the labels of the trained model.
get_input_matrix: retrieves the word vectors of the trained model.
get_subwords: retrieves the character n-grams (subwords) of a word and their indices.

Which algorithm does fastText use for text classification?

fastText combines two ideas for text classification: the bag of n-grams approach and, optionally, the hierarchical softmax.

The bag of n-grams approach is a traditional method for text classification. A text is represented as a bag of its words and n-grams (contiguous sequences of n words, plus character n-grams if minn and maxn are set). Each of these features has an embedding, and the embeddings are averaged into a single vector. This vector is then fed into a linear classifier, similar to a logistic regression model, to make predictions.

The hierarchical softmax is an optimization that speeds up the computation of the softmax function when there are many labels. It decomposes the softmax computation into a binary tree to reduce the number of calculations required for each prediction. It is selected with loss="hs"; the default loss for supervised training is the regular softmax.

Together, the bag of n-grams approach and the hierarchical softmax enable fastText to handle large-scale text classification problems efficiently.
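To make the bag of n-grams idea concrete, here is a toy sketch in plain Python: it extracts word n-grams, averages their embeddings, and applies a linear layer followed by a softmax. The embeddings and weights are made-up numbers, and the real library differs in important ways (it learns these values during training and hashes n-grams into buckets), so treat this as an illustration of the idea rather than fastText's actual implementation.

```python
import math

def word_ngrams(tokens, max_n=2):
    """Return all word n-grams of length 1..max_n."""
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

def softmax(scores):
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up 2-dimensional embeddings and classifier weights, for illustration only
EMBEDDINGS = {"great": [2.0, 0.0], "boring": [0.0, 2.0]}
WEIGHTS = {"positive": [1.0, -1.0], "negative": [-1.0, 1.0]}

def classify(text):
    """Score a non-empty text with the toy model and return label probabilities."""
    features = word_ngrams(text.lower().split())
    vectors = [EMBEDDINGS.get(f, [0.0, 0.0]) for f in features]  # unknown features get zeros
    # Average the feature embeddings into one vector for the whole text
    hidden = [sum(v[d] for v in vectors) / len(vectors) for d in range(2)]
    labels = list(WEIGHTS)
    scores = [sum(w * h for w, h in zip(WEIGHTS[label], hidden)) for label in labels]
    return dict(zip(labels, softmax(scores)))

probs = classify("great movie")
print(word_ngrams(["great", "movie"]), probs)
```

Because the text vector is just an average followed by one matrix multiplication, both training and prediction stay cheap even with a large vocabulary.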

What are the advantages of FastText over Word2Vec?

fastText and Word2Vec are both tools for learning text representations, but they differ in a few ways:

Word Representation: fastText can learn both word-level and subword (character n-gram) representations, whereas Word2Vec only learns word-level representations. This makes fastText more versatile, and it can handle out-of-vocabulary words more effectively.
Text Classification: fastText includes a supervised mode built for text classification and can be used for binary, multi-class, and multi-label classification. Word2Vec, on the other hand, is mainly used for unsupervised learning tasks such as word similarity, word embedding, and analogy.
Speed: fastText's classifier trains quickly, even on larger datasets. This is due to the bag of n-grams approach and the optional hierarchical softmax, which keep the model simple and cheap to compute.
Handling of Rare Words: fastText handles rare words better than Word2Vec by representing them as combinations of their character n-grams. This allows fastText to learn useful representations for words that are infrequent in the training data.


FastText is a good choice for text classification tasks, while Word2Vec is better suited for unsupervised learning tasks such as word similarity and analogy.
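The way fastText copes with rare and unseen words comes down to character n-grams. The sketch below splits a word into n-grams after wrapping it in < and > boundary markers, which is the scheme described for fastText's subword embeddings. The function name char_ngrams is my own, and the real library additionally hashes the n-grams and keeps the whole word as a feature, so this only illustrates the splitting step.

```python
def char_ngrams(word, min_n=3, max_n=3):
    """Character n-grams of a word wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams

print(char_ngrams("where"))

# An unseen word still shares n-grams with words the model has seen
shared = set(char_ngrams("unseen")) & set(char_ngrams("seen"))
print(sorted(shared))
```

Because a new word shares many of its n-grams with known words, its vector can be assembled from pieces the model has already learned, which Word2Vec cannot do.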

Which is better for text classification: fastText or BERT?

When it comes to text classification, both fastText and BERT have their own strengths and weaknesses. fastText is a fast and efficient text classification library that works well when training data and computational resources are limited. It uses word-level and subword (character n-gram) information to perform text classification.

On the other hand, BERT is a transformer-based pre-trained language model that can be fine-tuned for various NLP tasks, including text classification. It uses attention mechanisms and transformer blocks to capture long-range dependencies and contextual information about the input text. This makes it particularly well-suited for tasks involving large amounts of text data.

In general, BERT is more accurate than fastText, especially on tasks that require a deeper understanding of the input text, and it is a good choice if you have a large amount of training data and the compute to fine-tune it. fastText, on the other hand, is very handy for small training sets, small projects, and situations where computational resources are scarce.


Anindya

Hi there, I’m Anindya Naskar, Data Science Engineer. I created this website to show you what I believe is the best possible way to get your start in the field of Data Science.