Fine-tune BERT for Question Answering


BERT is a transformer-based masked language model. We can fine-tune it for downstream NLP tasks such as Named Entity Recognition (NER), Question Answering, and Text Classification.

In this tutorial, I will show you a way to fine-tune popular transformer-based models such as BERT, RoBERTa, and XLNet. So you are not limited to BERT: you can train other large language models with a few lines of code. The same code runs on both CPU and GPU, which makes this approach very flexible.

Why Is Fine-tuning Required?

Fine-tuning is a common technique in machine learning. Modern deep learning models are trained on huge datasets.

You can also train those models on your own dataset. For example, BERT was pre-trained on English Wikipedia and the BooksCorpus. You can train a BERT model on a custom dataset such as consumer data or banking data.

But training a model from scratch on a large volume of data needs huge computational power. You will not be able to do it on an everyday computer.

This is why we use pre-trained models. Someone has already trained these models and published them. You can find many pre-trained models on Hugging Face and TensorFlow Hub.

These pre-trained models have a good understanding of the language (e.g., English) on which they were trained. You can take such a model, already trained on a large dataset, and further train (or fine-tune) it on a smaller, task-specific dataset.

In this example, I am going to fine-tune a BERT model for question answering using an Amazon product description dataset.

A fine-tuned BERT model for question answering is already available, which I discussed in my previous tutorial. Its name is bert-large-uncased-whole-word-masking-finetuned-squad.

Why do we need to fine-tune our own model if a fine-tuned BERT model for question answering already exists? The reason is that the 'bert-large-uncased-whole-word-masking-finetuned-squad' model was fine-tuned on the SQuAD dataset, so it may not perform accurately on your specific dataset (in this case, the Amazon product description dataset).

What happens when you fine-tune BERT?

In transfer learning, we typically divide a model's layers into two parts: top-level layers and lower-level layers. I explained this in detail while fine-tuning a YOLOv3 model, and that tutorial is a good read for understanding transfer learning.

The lower layers capture generic and lower-level features. We can reuse these layers for any task. In simple words, lower layers are used for feature extraction.

A model like BERT has many lower layers and only a few top layers that we need to fine-tune. The top layers determine what kind of output we get from the model, whether it is Classification, Question Answering, or Named Entity Recognition.

This approach allows effective transfer learning because the model can make use of the knowledge it gained during pre-training and adjust to the specific details of the task at hand.
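For question answering, the task-specific top layer is usually a small span-prediction head that scores every token as a possible answer start and a possible answer end. The sketch below is a simplified, pure-Python illustration of how an answer span could be picked from such scores. The function name best_span, the tokens, and the logit values are all made up for illustration; real libraries add further steps such as n-best lists and handling of unanswerable questions.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end) token indices with the highest combined score."""
    best = (0, 0)
    best_score = float("-inf")
    for i, start_score in enumerate(start_logits):
        # Only consider ends at or after the start, within the length limit
        for j in range(i, min(i + max_answer_len, len(end_logits))):
            score = start_score + end_logits[j]
            if score > best_score:
                best_score = score
                best = (i, j)
    return best


# Hypothetical tokens and scores, not produced by a real model
tokens = ["the", "phone", "is", "smoky", "teal", "."]
start_logits = [0.1, 0.2, 0.0, 2.5, 0.3, -1.0]
end_logits = [0.0, 0.1, 0.2, 0.4, 3.0, -0.5]

i, j = best_span(start_logits, end_logits)
print(tokens[i:j + 1])  # ['smoky', 'teal']
```

The lower layers produce the token representations, and only this small scoring step is specific to question answering.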

Fine-tune BERT Model

Now we come to the main part. I will break the process of fine-tuning the BERT model into several steps:

Step1: Prepare Dataset

In this tutorial, I am going to use an Amazon product description dataset. In my last tutorial, I prepared this data to fine-tune a BERT model for question answering. The dataset format looks like below:
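As a rough illustration of the SQuAD-style layout that simpletransformers expects for question answering, a single training record might look like the sketch below. The product text, question, and id are invented for illustration and are not taken from the actual dataset; the field names (context, qas, id, question, is_impossible, answers, text, answer_start) follow the library's input format as I understand it.

```python
# Hypothetical record in the SQuAD-style format used for question answering.
# The context, question, and id are invented for illustration only.
context = "Acme Wireless Mouse (Black) | 2.4 GHz | 12-month battery life"
answer_text = "Black"

sample_train = [
    {
        "context": context,
        "qas": [
            {
                "id": "00001",
                "question": "What is the color of the mouse?",
                "is_impossible": False,
                "answers": [
                    {
                        "text": answer_text,
                        # Character offset of the answer inside the context
                        "answer_start": context.index(answer_text),
                    }
                ],
            }
        ],
    }
]

# The offset must point at the exact answer text inside the context
answer = sample_train[0]["qas"][0]["answers"][0]
start = answer["answer_start"]
print(context[start:start + len(answer["text"])])  # Black
```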


In my last tutorial, I created two datasets: amazon_data_train.json and amazon_data_test.json. I am going to use those two datasets for this tutorial.

Step2: Install Libraries

To fine-tune BERT or any other popular transformer-based model, you just need to install one package. Below is the command to install that library.

pip install simpletransformers

Step3: Read Dataset

Let’s first read our domain-specific data to fine-tune the BERT model for question answering. I am going to use the same data that I prepared in my last tutorial: Training Data Preparation for Custom BERT Model.

# Import required libraries to fine-tune transformer models like BERT
import json
import logging
from simpletransformers.question_answering import QuestionAnsweringModel, QuestionAnsweringArgs

# Read dataset to fine tune BERT model

# Training data
with open(r"amazon_data_train.json", "r") as read_file:
    train = json.load(read_file)

# Validation dataset
with open(r"amazon_data_test.json", "r") as read_file:
    test = json.load(read_file)
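Before training, it can be worth checking that every answer_start offset really points at the answer text, since a misaligned offset gives the model a wrong target. The helper below is a minimal sketch under the assumption that the data follows the SQuAD-style layout (context, qas, answers, text, answer_start). The function name find_misaligned_answers and the demo record are hypothetical.

```python
def find_misaligned_answers(data):
    """Return ids of questions whose answer_start does not match the context."""
    bad_ids = []
    for record in data:
        context = record["context"]
        for qa in record["qas"]:
            for ans in qa.get("answers", []):
                start = ans["answer_start"]
                if context[start:start + len(ans["text"])] != ans["text"]:
                    bad_ids.append(qa["id"])
    return bad_ids


# Tiny made-up example; with the real data you would pass `train` or `test`
demo = [
    {
        "context": "Battery: 6000 mAh",
        "qas": [
            {"id": "ok", "question": "Battery size?",
             "answers": [{"text": "6000 mAh", "answer_start": 9}]},
            {"id": "bad", "question": "Battery size?",
             "answers": [{"text": "6000 mAh", "answer_start": 0}]},
        ],
    }
]
print(find_misaligned_answers(demo))  # ['bad']
```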

Step4: Define Model

At this stage, we need to select and define the model that we want to fine-tune (in our case, BERT). There are two parameters we need to define: model type and model name.

Below is the list of models for question answering supported by simpletransformers library:

Model | Model code in library

For this demo project, I am going to use the BERT model type (model code: bert), and the model I will use is bert-base-cased. You can choose other BERT models for fine-tuning. Available BERT models are listed below:

Model name | No. of parameters | Language

Okay, now let’s define the model in our Python code.

# Model type (architecture code) and pre-trained model name
model_type = "bert"
model_name = "bert-base-cased"

Step5: Create Output Directory

In this step, we need to create a folder where we can save the fine-tuned BERT model. You can create it manually, but I am doing it through Python code.

# Create folder to save fine-tuned BERT model inside working directory

import os
output_dir = 'bert_outputs'
os.makedirs(output_dir, exist_ok=True)  # does nothing if the folder already exists

This code will create a folder named “bert_outputs” inside your working directory.

Step6: Specify Training Arguments

Now let’s define some important training arguments. Below is the Python code to do that.

# Define transformer model arguments before training the BERT model

train_args = {
    "reprocess_input_data": True,
    "overwrite_output_dir": True,
    "use_cached_eval_features": True,
    "output_dir": f"{output_dir}/{model_type}",
    "best_model_dir": f"{output_dir}/{model_type}/best_model",
    "evaluate_during_training": True,
    "max_seq_length": 128,
    "num_train_epochs": 30,
    "evaluate_during_training_steps": 1000,
    "save_model_every_epoch": False,
    "save_eval_checkpoints": False,
    "train_batch_size": 16,
    "eval_batch_size": 16,
}

# use_cuda=False trains on CPU; set it to True to train on a GPU
model = QuestionAnsweringModel(model_type, model_name, args=train_args, use_cuda=False)

I am using a batch size of 16 and training for 30 epochs. You can change these values as you wish.

For this tutorial, I trained the model on CPU. It took me around 20 hours to train the BERT model for 30 epochs on CPU. You can also train it on GPU by setting use_cuda=True.
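To get a feel for how the batch size, the number of epochs, and evaluate_during_training_steps interact, the total number of training steps can be estimated with simple arithmetic. This is a back-of-the-envelope sketch: the dataset size of 2000 examples is a made-up number, and it assumes one training feature per example and no gradient accumulation (long contexts that are split into several windows would increase the step count).

```python
import math


def training_steps(num_examples, batch_size, epochs):
    """Approximate optimizer steps, assuming one feature per example."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch, steps_per_epoch * epochs


# 2000 examples is a hypothetical dataset size used only for illustration
per_epoch, total = training_steps(num_examples=2000, batch_size=16, epochs=30)
print(per_epoch, total)  # 125 3750

# With an evaluation every 1000 steps, this run would be evaluated 3 times
print(total // 1000)     # 3
```

If the total comes out below the evaluation interval, the model would not be evaluated during training at all, so a smaller interval may be a better fit for small datasets.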

Step7: Train Model

Now comes the training itself. You just need to run the Python code below to fine-tune the BERT model for question answering.

# Train the model
model.train_model(train, eval_data=test)

The final best model will be saved inside this directory: bert_outputs/bert/best_model

Step8: Evaluate Model

To evaluate the trained transformer model, you can run the Python code below.

# Evaluate the model
result, texts = model.eval_model(test)
{'correct': 5, 'similar': 2, 'incorrect': 2, 'eval_loss': -6.625}
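The counts returned by eval_model can be turned into simple rates. The helper below is a minimal sketch that uses the result dictionary shown above. The name summarize is hypothetical, and reading 'similar' as a partial match (a prediction that overlaps the reference answer without matching it exactly) is my understanding of the library's behaviour rather than something verified in this tutorial.

```python
def summarize(result):
    """Turn the correct/similar/incorrect counts into rates."""
    total = result["correct"] + result["similar"] + result["incorrect"]
    if total == 0:
        return {"total": 0, "exact_rate": 0.0, "partial_rate": 0.0}
    return {
        "total": total,
        # Share of predictions counted as 'correct'
        "exact_rate": result["correct"] / total,
        # Share of predictions counted as 'correct' or 'similar'
        "partial_rate": (result["correct"] + result["similar"]) / total,
    }


# Result dictionary copied from the evaluation output above
result = {"correct": 5, "similar": 2, "incorrect": 2, "eval_loss": -6.625}
summary = summarize(result)
print(summary["total"], round(summary["exact_rate"], 2), round(summary["partial_rate"], 2))
# 9 0.56 0.78
```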

Step9: Model Inference

Now let’s do the interesting part of this tutorial. Let’s test whether our model performs well on our dataset. We first need to load the trained model from this directory: bert_outputs/bert/best_model.

Let’s try for a question: What is the model name of the Samsung smartphone?

# Load model from training checkpoint
from simpletransformers.question_answering import QuestionAnsweringModel, QuestionAnsweringArgs

# use_cuda=False runs inference on CPU; set it to True if a GPU is available
model = QuestionAnsweringModel("bert", "bert_outputs/bert/best_model", use_cuda=False)

# Make predictions with the model
to_predict = [
    {
        "context": "Samsung Galaxy M14 5G (Smoky Teal, 6GB, 128GB Storage) | 50MP Triple Cam | 6000 mAh Battery | 5nm Octa-Core Processor | 12GB RAM with RAM Plus | Android 13 | Without Charger",
        "qas": [
            {
                "question": "What is the model name of the Samsung smartphone?",
                "id": "0",
            }
        ],
    }
]

answers, probabilities = model.predict(to_predict, n_best_size=None)
Running Prediction:   0%|          | 0/1 [00:00<?, ?it/s]
[{'id': '0', 'answer': ['Samsung Galaxy M14', 'Samsung Galaxy M14 5G', 'Android 13']}]

Wow, it is predicting the correct answer. We are getting several candidate answers because we passed n_best_size=None. To get only one answer, you can pass n_best_size=1 instead.
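If you want to keep all candidates but only use the top one, you can also pick it out of the returned list yourself. The helper below is a small sketch that works on the prediction output shown above. The name top_answers is hypothetical, and it assumes the candidates are ordered from most to least likely.

```python
def top_answers(answers):
    """Map each question id to its first (assumed best) candidate answer."""
    return {item["id"]: item["answer"][0] for item in answers if item["answer"]}


# Output copied from the prediction above
answers = [
    {"id": "0", "answer": ["Samsung Galaxy M14", "Samsung Galaxy M14 5G", "Android 13"]}
]
print(top_answers(answers))  # {'0': 'Samsung Galaxy M14'}
```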

Let’s try another one.

# Make predictions with the model
to_predict = [
    {
        "context": "Samsung Galaxy M14 5G (Smoky Teal, 6GB, 128GB Storage) | 50MP Triple Cam | 6000 mAh Battery | 5nm Octa-Core Processor | 12GB RAM with RAM Plus | Android 13 | Without Charger",
        "qas": [
            {
                "question": "What is the color option available for the Samsung Galaxy M14 5G?",
                "id": "0",
            }
        ],
    }
]

answers, probabilities = model.predict(to_predict, n_best_size=None)
Running Prediction:   0%|          | 0/1 [00:00<?, ?it/s]
[{'id': '0', 'answer': ['Smoky Teal', 'Teal', 'Smoky Tea']}]

It is working accurately. So we have successfully fine-tuned a BERT model on our custom question-answering dataset.


While reading this tutorial, you may have had some questions that I did not cover in the article. Let me answer those in this frequently asked questions section.

Can we run BERT on CPU?

Yes, you can train and run BERT on both CPU and GPU. You just need to specify it when creating the model: use_cuda=False for CPU and use_cuda=True for GPU.
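Instead of hard-coding the flag, you could detect whether a GPU is available and set it automatically. The snippet below is a minimal sketch: cuda_available is a hypothetical helper name, and it falls back to CPU when PyTorch is not installed or no CUDA device is visible.

```python
def cuda_available():
    """Return True only if PyTorch is installed and can see a CUDA GPU."""
    try:
        import torch
    except ImportError:
        return False
    return torch.cuda.is_available()


use_cuda = cuda_available()
print("Training on", "GPU" if use_cuda else "CPU")

# The flag can then be passed on when creating the model, for example:
# model = QuestionAnsweringModel(model_type, model_name, args=train_args, use_cuda=use_cuda)
```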

Does fine-tuning improve accuracy?

It depends on what kind of data you are using. But in general, you should get better accuracy on your own data than with a generic question answering model such as bert-large-uncased-whole-word-masking-finetuned-squad.

Let me show you the output of both the bert-large-uncased-whole-word-masking-finetuned-squad model and our custom fine-tuned model on our Amazon product description data.

Output of the bert-large-uncased-whole-word-masking-finetuned-squad model for the question: What is the color option available for the Samsung Galaxy M14 5G?


Note: I am not including the code for the SQuAD model here. You can find it in this tutorial: Build Question Answering System with BERT model.

Output of our custom fine-tuned model on our Amazon product description data:

Running Prediction:   0%|          | 0/1 [00:00<?, ?it/s]
[{'id': '0', 'answer': ['Smoky Teal', 'Teal', 'Smoky Tea']}]

How much GPU do I need to fine-tune BERT?

It depends on the amount of data you use for fine-tuning. I tested this dataset on my 4 GB graphics card, and it worked without any issues.

What is the Disadvantage of fine-tuning?

Fine-tuning a model is almost always a good idea. But it requires a lot of data that you need to label manually, which is a big challenge for an individual.

Along with that, fine-tuning BERT models can be computationally expensive, especially for larger models like bert-large-uncased. The process typically requires powerful hardware and longer training times.

Final Thought

In this tutorial, I showed you the simplest way to fine-tune a BERT model for question answering. Not only BERT, but you can also train various other transformer-based models like RoBERTa, XLNet, MPNet, and ELECTRA with this technique.

If you want to train on CPU, no worries: this technique works on CPU as well. Not everyone has a GPU in their system.

I used Amazon product descriptions as my training dataset for question answering. But you can use your own domain data or even business data in your project.

That is it for this tutorial. If you have any questions or suggestions regarding this tutorial, let me know in the comment section below.
