Train BERT from Scratch on Custom Domain Data


Normally, we download a pre-trained model from Hugging Face or TensorFlow Hub and use it in our project. In my previous tutorial, I showed you how to fine-tune a pre-trained BERT model. But can you train your own BERT model from scratch? The answer is yes.

Why Train BERT from Scratch?

In my last tutorial, I showed how to fine-tune BERT models. We used a pre-trained BERT model named “bert-base-cased”.

The bert-base-cased model is trained on English Wikipedia and BookCorpus, so it has a good understanding of general English. But these default BERT models may not work well on your domain-specific dataset.

This is because default BERT models do not have any information about your domain. For example, the original BERT models were released in 2018, and the COVID-19 pandemic happened in 2020.

This means that the model was not trained to understand or work with information about the coronavirus. Its vocabulary does not even contain important words related to the virus, like “coronavirus,” “COVID-19,” “COVID,” and so on.

If you want to use BERT for this health domain (COVID-19), it may not give you good results.

This is why we need to pre-train a BERT model from scratch on our custom domain dataset. For this tutorial, I will use the Amazon cell phone review dataset to train BERT from scratch.

How to Train BERT on Your Own Dataset

To implement BERT from scratch, we need to follow some steps. Let me explain those steps one by one.

Step1: Data Collection

For demo purposes, I am going to use the Amazon cell phone review dataset. But you can use your own dataset to train a BERT model; the process will be the same.

This 9 MB dataset has around 68k customer reviews of various products. Download it from Kaggle and use the following Python code to read the CSV file.

import pandas as pd

df = pd.read_csv('data/20191226-reviews.csv')


Step2: Pre-Processing

This review dataset has several columns, but to pre-train the BERT model we only use the title and body columns.

mlm_df = df[['title', 'body']].copy()

Now we need to convert that data frame into a text file. The code below writes the data frame values (the title and body columns) into the review_data.txt file.

# Write the sub dataframe into txt file
with open('data/review_data.txt', 'w', encoding='utf-8') as f:
    for title, body in zip(mlm_df.title.values, mlm_df.body.values):
        f.write(title + '\n')
        f.write(body + '\n')

The text file should follow this pattern: the first line contains the title of the first review, and the second line contains its body. The third line has the next title, the fourth line has its body, and so on for the remaining rows.
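To make the expected layout concrete, here is a minimal sketch that writes title/body pairs in this alternating format. The helper name `write_title_body_lines` and the sample rows are hypothetical. Skipping missing values and collapsing embedded newlines are my own assumptions, added because a line-by-line reader treats every newline as the start of a new example.

```python
import os
import tempfile


def write_title_body_lines(rows, path):
    """Write each (title, body) pair as two consecutive lines of text."""
    written = 0
    with open(path, "w", encoding="utf-8") as f:
        for title, body in rows:
            for text in (title, body):
                # Assumption: skip missing or non-string values (e.g. NaN from pandas).
                if not isinstance(text, str) or not text.strip():
                    continue
                # Collapse internal whitespace so one field stays on one line.
                f.write(" ".join(text.split()) + "\n")
                written += 1
    return written


# Hypothetical sample rows standing in for the title and body columns.
sample_rows = [
    ("Great phone", "Battery lasts\nall day."),
    ("Stopped working", float("nan")),
]
sample_path = os.path.join(tempfile.mkdtemp(), "review_data_sample.txt")
line_count = write_title_body_lines(sample_rows, sample_path)

with open(sample_path, encoding="utf-8") as f:
    sample_lines = f.read().splitlines()
print(sample_lines)
```

With real data you could pass `zip(mlm_df.title.values, mlm_df.body.values)` as `rows`. Note that skipping a missing field breaks the strict title/body alternation for that review, which is harmless when every line is treated as an independent training example.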


Step3: Train Custom BERT Tokenizer

Now we will use that text file to train our domain-specific custom BERT tokenizer. The Python code below trains the tokenizer and saves the vocabulary file to a local directory.

from tokenizers import BertWordPieceTokenizer
tokenizer = BertWordPieceTokenizer()
tokenizer.train(files="data/review_data.txt", vocab_size=30522)
tokenizer.save_model('D:/Question_Answering', 'phone_review')

The vocabulary file starts with the default BERT special tokens [PAD], [UNK], [CLS], [SEP], and [MASK], followed by numbers and special characters. After that, it contains the tokens learned from our input text documents.
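If you want to inspect the vocabulary file yourself, the sketch below loads a one-token-per-line file into a dictionary, where each token's id is simply its line index. The miniature vocabulary and the `load_vocab` helper are hypothetical stand-ins, not the real phone_review-vocab.txt.

```python
import os
import tempfile


def load_vocab(path):
    """Map each token to its id, where the id is the token's line index."""
    with open(path, encoding="utf-8") as f:
        return {line.rstrip("\n"): index for index, line in enumerate(f)}


# Hypothetical miniature vocabulary in the same one-token-per-line layout.
toy_tokens = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]", "!", "0", "phone", "##s"]
toy_path = os.path.join(tempfile.mkdtemp(), "toy-vocab.txt")
with open(toy_path, "w", encoding="utf-8") as f:
    f.write("\n".join(toy_tokens) + "\n")

vocab = load_vocab(toy_path)
special_tokens = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]
print([vocab[token] for token in special_tokens])
print(len(vocab))
```

Pointing `load_vocab` at your own vocabulary file lets you check that the special tokens sit at the top and that `len(vocab)` matches the vocab_size you asked for.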

Step4: Load Custom Tokenizer

Now let’s load our custom tokenizer to see how it is performing.

# Load the custom tokenizer
from transformers import BertTokenizer, LineByLineTextDataset

# Read the vocabulary file
vocab_file_dir = 'phone_review-vocab.txt' 

custom_tokenizer = BertTokenizer.from_pretrained(vocab_file_dir)

sentence = 'Motorola V860 is a good phone'

encoded_input = custom_tokenizer.tokenize(sentence)
print(encoded_input)

As you can see, our custom tokenizer tokenizes our input review text correctly. Now let’s see how the default BERT tokenizer performs on the same text.

# Load BERT default tokenizer
bert_default_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

sentence = 'Motorola V860 is a good phone'

encoded_input = bert_default_tokenizer.tokenize(sentence)
print(encoded_input)

Hope you can see the difference. Our custom tokenizer keeps the phone model as a single token, 'v860', but the default BERT tokenizer splits it into 'v8' and '##60'.

This is why we should train a custom tokenizer before training the BERT model: the default BERT vocabulary may not cover your domain data well.
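The split happens because WordPiece tokenizers match greedily, always taking the longest piece found in the vocabulary. Here is a simplified sketch of that matching for a single lowercase word. The two toy vocabularies are hypothetical and only illustrate the effect; they are not the real vocabulary files.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one lowercase word into word pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # marks a continuation piece
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece matched, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabularies, not the real vocabulary files.
general_vocab = {"v", "v8", "##60", "##6", "##0", "phone"}
domain_vocab = general_vocab | {"v860"}

general_pieces = wordpiece_tokenize("v860", general_vocab)
domain_pieces = wordpiece_tokenize("v860", domain_vocab)
print(general_pieces)  # ['v8', '##60']
print(domain_pieces)   # ['v860']
```

Once the domain vocabulary contains the whole word, the model gets one embedding for the phone model instead of having to piece its meaning together from fragments.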

Step5: Convert Raw text data to Tokens

We have trained our custom tokenizer, so now we can convert our input text data (review_data.txt) to tokens. We can do that using the LineByLineTextDataset class of the transformers library.

# Convert input text data to tokens for custom bert model

dataset = LineByLineTextDataset(
    tokenizer=custom_tokenizer,
    file_path='data/review_data.txt',
    block_size=128,
)

print('No. of lines: ', len(dataset))
No. of lines:  54409
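Conceptually, a line-by-line dataset is simple. As I understand it, blank lines are dropped, every remaining line becomes one training example, and each example is truncated to block_size tokens. The sketch below imitates that behavior with a hypothetical `toy_encode` function in place of a real tokenizer, so it is an illustration rather than the library's implementation.

```python
def build_line_by_line_examples(lines, encode, block_size):
    """Turn non-empty lines into token-id lists truncated to block_size."""
    examples = []
    for line in lines:
        if not line.strip():
            continue  # blank lines produce no training example
        examples.append(encode(line)[:block_size])
    return examples


def toy_encode(text):
    """Hypothetical stand-in for a tokenizer: one fake id per whitespace token."""
    return [len(token) for token in text.split()]


toy_lines = ["Great phone", "", "Battery lasts all day and charges fast", "   "]
examples = build_line_by_line_examples(toy_lines, toy_encode, block_size=4)
print(len(examples))  # 2
print(examples)       # [[5, 5], [7, 5, 3, 3]]
```

This also explains why the number of examples can be smaller than the number of lines written to the text file: empty lines do not count.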

Step6: Configure Model Parameters

In this step, we need to define some important parameters of our custom BERT model.

# Define model parameters to train BERT model from scratch
from transformers import BertConfig, BertForMaskedLM, DataCollatorForLanguageModeling

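config = BertConfig(
    vocab_size=30522,  # must match the custom tokenizer from Step 3
    # Assumption: the other arguments were cut off in the original listing;
    # 6 layers with default sizes reproduces the parameter count below.
    num_hidden_layers=6,
)
model = BertForMaskedLM(config)
print('No of parameters: ', model.num_parameters())

data_collator = DataCollatorForLanguageModeling(
    tokenizer=custom_tokenizer, mlm=True, mlm_probability=0.15
)
No of parameters:  66987066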

In this code, the vocab_size argument (line 5) is very important. The vocab size must be the same as that of the custom BERT tokenizer: while training the tokenizer in Step 3, we set vocab_size=30522.

Also, note that our custom BERT model will have around 67 million parameters.
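To see where that number comes from, here is a back-of-the-envelope parameter count for a BERT masked language model. The layer sizes are my assumption (6 layers, hidden size 768, feed-forward size 3072, 512 positions, 2 token types), since the configuration values are not shown in this post. The sketch also assumes the output layer shares its weights with the token embeddings and that there is no pooling layer. With those assumptions the arithmetic reproduces the reported count.

```python
def bert_mlm_param_count(vocab_size, hidden=768, layers=6, intermediate=3072,
                         max_positions=512, type_vocab=2):
    """Count trainable parameters of a BERT masked language model."""
    layer_norm = 2 * hidden  # one weight vector plus one bias vector
    # Token, position, and token-type embeddings, followed by a LayerNorm.
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    # Query, key, value, and output projections, followed by a LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Two dense layers of the feed-forward block, followed by a LayerNorm.
    feed_forward = ((hidden * intermediate + intermediate)
                    + (intermediate * hidden + hidden) + layer_norm)
    encoder = layers * (attention + feed_forward)
    # MLM head: dense transform, LayerNorm, and an output bias per vocabulary
    # entry (the output weights are assumed tied to the token embeddings).
    mlm_head = (hidden * hidden + hidden) + layer_norm + vocab_size
    return embeddings + encoder + mlm_head


total_params = bert_mlm_param_count(vocab_size=30522)
print(total_params)  # 66987066
```

Roughly a third of these parameters (30522 x 768, about 23.4 million) sit in the token embedding matrix, which is why the vocabulary size has such a direct effect on model size.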

Step7: Define Training Arguments

Let’s now specify some important training arguments to train our custom BERT model from scratch.

# Specify model arguments to train bert from scratch

from transformers import Trainer, TrainingArguments

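training_args = TrainingArguments(
    output_dir='custom_bert_output/', num_train_epochs=30, per_device_train_batch_size=16
)

trainer = Trainer(
    model=model, args=training_args, data_collator=data_collator, train_dataset=dataset
)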

Here, I am using 30 epochs and saving the trained BERT model inside the custom_bert_output/ folder. I am keeping the training batch size at 16; you can change it to fit your RAM or GPU memory.

Note: I have tested this code on both CPU and GPU, and both work correctly. If you have a GPU, use the following code to check whether PyTorch in your notebook can use it.
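These settings determine how many optimizer steps the training run will take. The quick calculation below uses the line count reported in Step 5 and assumes a single device, no gradient accumulation, and that the last partial batch of each epoch is kept.

```python
import math

num_examples = 54409  # "No. of lines" reported in Step 5
batch_size = 16
num_epochs = 30

# The last batch of an epoch may be smaller, so round up.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)  # 3401 102030
```

Doubling the batch size halves the number of steps per epoch, but each step then needs more memory.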

import torch
print(torch.cuda.is_available())  # True when PyTorch can use a GPU

Step8: Train the BERT Model

Now, finally, you can run the Python code below to train our custom BERT model from scratch. Once the model is trained, we save it inside the custom_bert_output folder.

trainer.train()
trainer.save_model('custom_bert_output/')

After successful training, your custom_bert_output folder should contain three files:

  1. config.json
  2. pytorch_model.bin
  3. training_args.bin

Step9: Test our Custom Bert Model

Here is the interesting part. Let’s see whether our custom BERT model learned anything from the Amazon product review data.

To do that, we first need to load our trained model and then try to predict masked words. To check the model, we will mask a certain word in a sentence and ask the BERT model to predict it.
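As background, this is the same task the model practiced during pre-training, where the data collator from Step 6 hid tokens at random. The sketch below shows the usual BERT masking recipe on a list of tokens: each token is selected with probability mlm_probability, and a selected token is replaced by [MASK] 80% of the time, by a random token 10% of the time, and left unchanged 10% of the time. It is a simplified illustration with a hypothetical `mask_tokens` helper and toy vocabulary, not the library's own code.

```python
import random


def mask_tokens(tokens, vocab, mlm_probability=0.15, mask_token="[MASK]", rng=None):
    """Return (masked_tokens, labels); labels are None where no prediction is needed."""
    rng = rng or random.Random()
    masked, labels = [], []
    for token in tokens:
        if rng.random() >= mlm_probability:
            masked.append(token)
            labels.append(None)  # position not selected for prediction
            continue
        labels.append(token)  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            masked.append(mask_token)         # 80%: replace with [MASK]
        elif roll < 0.9:
            masked.append(rng.choice(vocab))  # 10%: replace with a random token
        else:
            masked.append(token)              # 10%: keep the token unchanged
    return masked, labels


# Hypothetical toy vocabulary used only for the random replacements.
toy_vocab = ["phone", "battery", "screen", "good", "bad"]
tokens = "the battery life is bad".split()
masked, labels = mask_tokens(tokens, toy_vocab, rng=random.Random(0))
print(masked)
print(labels)
```

In the test that follows we do the masking by hand instead, choosing one specific word to hide.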

# Load custom trained BERT model
from transformers import pipeline

model = BertForMaskedLM.from_pretrained('custom_bert_output')

fill_mask = pipeline(
    'fill-mask',
    model=model,
    tokenizer=custom_tokenizer
)

# Actual Text: the battery life is bad
fill_mask('the battery [MASK] is bad')
[{'score': 0.08861254155635834,
  'token': 2166,
  'token_str': 'life',
  'sequence': 'the battery life is bad'},
 {'score': 0.041087497025728226,
  'token': 1996,
  'token_str': 'the',
  'sequence': 'the battery the is bad'},
 {'score': 0.03337299823760986,
  'token': 2204,
  'token_str': 'good',
  'sequence': 'the battery good is bad'},
 {'score': 0.030434928834438324,
  'token': 2307,
  'token_str': 'great',
  'sequence': 'the battery great is bad'},
 {'score': 0.02850954607129097,
  'token': 2025,
  'token_str': 'not',

See, it is correctly predicting the masked word as “life”. Let’s try one more:

# Best phone ever. Love samsung .
fill_mask('Best [MASK] ever. Love samsung .')
[{'score': 0.8440187573432922,
  'token': 3042,
  'token_str': 'phone',
  'sequence': 'best phone ever. love samsung.'},
 {'score': 0.013265354558825493,
  'token': 4031,
  'token_str': 'product',
  'sequence': 'best product ever. love samsung.'},
 {'score': 0.011673685163259506,
  'token': 2009,
  'token_str': 'it',
  'sequence': 'best it ever. love samsung.'},
 {'score': 0.0071221012622118,
  'token': 1045,
  'token_str': 'i',
  'sequence': 'best i ever. love samsung.'},
 {'score': 0.003735979786142707,
  'token': 1012,
  'token_str': '.',
  'sequence': 'best. ever. love samsung.'}]

It is also predicting the masked word correctly as “phone”. So we can assume that our custom pre-trained BERT model has learned, at least to some extent, how people write product reviews on Amazon.
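If you are wondering where the score values come from: the model produces one raw score (logit) per vocabulary token at the masked position, a softmax turns those into probabilities, and the most probable tokens are returned. Here is a minimal sketch of that last step. The logits are invented for illustration and the `top_k_predictions` helper is hypothetical.

```python
import math


def top_k_predictions(logits_by_token, k=5):
    """Softmax over candidate logits, then return the k most probable tokens."""
    max_logit = max(logits_by_token.values())
    # Subtract the maximum before exponentiating for numerical stability.
    exps = {tok: math.exp(logit - max_logit) for tok, logit in logits_by_token.items()}
    total = sum(exps.values())
    ranked = sorted(exps.items(), key=lambda item: item[1], reverse=True)
    return [{"token_str": tok, "score": value / total} for tok, value in ranked[:k]]


# Hypothetical logits for the masked position, invented for illustration.
toy_logits = {"phone": 6.0, "product": 2.0, "it": 1.5, "case": 0.5}
predictions = top_k_predictions(toy_logits, k=3)
for prediction in predictions:
    print(prediction)
```

Because the probabilities are spread over the whole vocabulary, a top score of 0.84 (as in the second example) signals a much more confident prediction than a top score of 0.09 (as in the first).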

Fine-tune Custom BERT Model

Okay, so we trained our custom tokenizer and custom BERT model from scratch. Let’s now fine-tune this custom BERT model for a downstream question answering task, step by step.

Note: In my last tutorial, I showed how easily you can fine-tune a BERT model, but that was with a pre-trained model. I will use the same method here, with a small twist so that we can fine-tune our custom BERT model.

Step1: Prepare & Read Dataset

I am going to use the same dataset that I used in the fine-tuning BERT tutorial. If you want to know how I prepared the dataset, read this tutorial: Training Data Preparation for Custom BERT Model.

import json

# Read training data to finetune custom BERT model
with open(r"data/amazon_data_train.json", "r") as read_file:
    train = json.load(read_file)

# Read test data to evaluate finetuned model
with open(r"data/amazon_data_test.json", "r") as read_file:
    test = json.load(read_file)

Step4: Save Custom Tokenizer

In this step, we need to save our custom tokenizer (created in Step 3 of training BERT from scratch). When you download a pre-trained model, you may notice that the model folder also contains the tokenizer files.

In this step, we are converting our custom tokenizer txt file to the default BERT format and saving it into the trained model folder (Step 8 of Training BERT from scratch).

import logging
from simpletransformers.question_answering import QuestionAnsweringModel, QuestionAnsweringArgs

from transformers import BertTokenizer

# Load your custom tokenizer
tokenizer = BertTokenizer.from_pretrained('phone_review-vocab.txt')

# Save the tokenizer to the model directory
tokenizer.save_pretrained('custom_bert_output/')

After running this code, our custom tokenizer txt file is converted into four files, which are saved inside our custom model directory (custom_bert_output/):

  1. tokenizer_config.json
  2. special_tokens_map.json
  3. vocab.txt
  4. added_tokens.json

So now our custom model folder structure should look like below:

custom_bert_output
├── config.json
├── pytorch_model.bin
├── special_tokens_map.json
├── tokenizer_config.json
├── training_args.bin
└── vocab.txt
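Before moving on, it can be worth checking that the folder really contains everything. The sketch below reports which of the files from the tree above are missing. The `missing_model_files` helper is hypothetical, and the file list is taken from this tutorial, so other library versions may produce different file names.

```python
from pathlib import Path

# File names taken from the folder structure shown above.
EXPECTED_FILES = [
    "config.json",
    "pytorch_model.bin",
    "special_tokens_map.json",
    "tokenizer_config.json",
    "training_args.bin",
    "vocab.txt",
]


def missing_model_files(model_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in model_dir."""
    folder = Path(model_dir)
    return [name for name in expected if not (folder / name).is_file()]


print(missing_model_files("custom_bert_output"))
```

An empty list means the folder matches the structure above. Any names printed are files you still need to create by re-running the corresponding step.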

Step5: Load Custom BERT Model and Tokenizer

In this step, we just need to provide the path of our custom BERT model folder and load the model with our custom training arguments and tokenizer.

import logging
from simpletransformers.question_answering import QuestionAnsweringModel, QuestionAnsweringArgs
from transformers import BertTokenizer, BertForMaskedLM

# Define model type and custom bert model path
model_type = 'bert'
model_name = 'custom_bert_output/'

# Create a output folder to save fine tuned custom bert
import os
output_dir = 'finetune_bert_outputs'
os.makedirs(output_dir, exist_ok=True)

# Set up training arguments
train_args = {
    "reprocess_input_data": True,
    "overwrite_output_dir": True,
    "use_cached_eval_features": True,
    "output_dir": f"outputs/{model_type}",
    "best_model_dir": f"{output_dir}/{model_type}/best_model",
    "evaluate_during_training": True,
    "max_seq_length": 128,
    "num_train_epochs": 30,
    "evaluate_during_training_steps": 1000,
    "save_model_every_epoch": False,
    "save_eval_checkpoints": False,
    "train_batch_size": 16,
    "eval_batch_size": 16
}

# Load custom model with tokenizer
model = QuestionAnsweringModel(model_type, model_name, args=train_args, use_cuda=False)

If you have any difficulty understanding this code, I highly recommend reading this post: Fine-tune BERT for Question Answering.

Step8: Train Model

Finally, you just need to execute the code below to fine-tune our custom BERT model on the Amazon product description dataset.

# Train the model
model.train_model(train, eval_data=test)

Step9: Evaluate Model

Let’s evaluate our fine-tuned BERT model on the test dataset to check its accuracy.

# Evaluate the model
result, texts = model.eval_model(test)
{'correct': 1, 'similar': 5, 'incorrect': 3, 'eval_loss': -5.978515625}
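The raw counts are easier to read as shares of the test set. The small helper below (hypothetical, not part of simpletransformers) converts the result dictionary shown above into fractions.

```python
def summarize_eval(result):
    """Compute the share of each answer category in an evaluation result."""
    categories = ("correct", "similar", "incorrect")
    total = sum(result[name] for name in categories)
    return {name: result[name] / total for name in categories}, total


# The result dictionary printed by the evaluation step above.
result = {"correct": 1, "similar": 5, "incorrect": 3, "eval_loss": -5.978515625}
shares, total = summarize_eval(result)
print(total)  # 9
print({name: round(value, 3) for name, value in shares.items()})
# {'correct': 0.111, 'similar': 0.556, 'incorrect': 0.333}
```

So on this small test set of 9 questions, only about 11% of the answers match exactly, while more than half are counted as similar. With so few test examples, these percentages are only a rough indication.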

Step10: Model Inference

Let’s now use our custom pre-trained and fine-tuned model on real data and see how it performs.

# Make predictions with the model
to_predict = [
    {
        "context": "Samsung Galaxy M14 5G (Smoky Teal, 6GB, 128GB Storage) | 50MP Triple Cam | 6000 mAh Battery | 5nm Octa-Core Processor | 12GB RAM with RAM Plus | Android 13 | Without Charger",
        "qas": [
            {
                "question": "What is the model name of the Samsung smartphone?",
                "id": "0",
            }
        ],
    }
]

answers, probabilities = model.predict(to_predict, n_best_size=None)
[{'id': '0', 'answer': ['Samsung Galaxy M14', 'Samsung Galaxy M14 5G (Smoky Teal, 6GB, 128GB Storage) | 50MP Triple Cam | 6000 mAh Battery | 5nm Octa-Core Processor | 12GB RAM with RAM Plus | Android', 'Samsung Galaxy']}]

Great! It is performing pretty well, right? Let’s test another one.

# Make predictions with the model
to_predict = [
    {
        "context": "Samsung Galaxy M14 5G (Smoky Teal, 6GB, 128GB Storage) | 50MP Triple Cam | 6000 mAh Battery | 5nm Octa-Core Processor | 12GB RAM with RAM Plus | Android 13 | Without Charger",
        "qas": [
            {
                "question": "Does the Samsung Galaxy M14 5G come with a charger?",
                "id": "0",
            }
        ],
    }
]

answers, probabilities = model.predict(to_predict, n_best_size=None)
[{'id': '0', 'answer': ['Android 13 | Without Charger', 'r', 'Without Charger']}]
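Writing this nested structure by hand gets tedious when you have several questions. As a convenience, here is a small hypothetical helper, `make_qa_input`, that builds the same list-of-dictionaries format from one context and a list of questions. The context string is shortened for readability.

```python
def make_qa_input(context, questions):
    """Wrap one context and its questions in the nested prediction format."""
    return [
        {
            "context": context,
            "qas": [
                {"question": question, "id": str(index)}
                for index, question in enumerate(questions)
            ],
        }
    ]


to_predict = make_qa_input(
    "Samsung Galaxy M14 5G | 6000 mAh Battery | Without Charger",  # shortened context
    [
        "What is the model name of the Samsung smartphone?",
        "Does the Samsung Galaxy M14 5G come with a charger?",
    ],
)
print(to_predict[0]["qas"][1]["id"])  # 1
```

The resulting list can be passed to model.predict in the same way as the hand-written examples above, and each answer comes back tagged with the matching id.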


Let me answer some commonly asked questions in this FAQ section.

Can you train your own BERT model?

Yes, we can train our own BERT model instead of using a pre-trained BERT model like bert-base-cased. In this tutorial, I explained how you can train a BERT model from scratch with your own dataset (Amazon product review data) and further fine-tune it for a question answering task.

How long to train BERT from scratch?

It completely depends on your data and device type. For this tutorial, I tried pre-training BERT on a CPU with the Amazon product review dataset. It took me around 1 hour for a single epoch with batch size 16, so you can estimate how long 30 epochs may take.
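As a rough estimate, assuming every epoch takes about as long as the first one (which ignores caching, thermal throttling, and other real-world effects):

```python
def estimate_training_hours(hours_per_epoch, num_epochs):
    """Linear estimate: every epoch is assumed to take the same time."""
    return hours_per_epoch * num_epochs


cpu_hours = estimate_training_hours(hours_per_epoch=1, num_epochs=30)
print(cpu_hours, "hours, about", round(cpu_hours / 24, 2), "days")
# 30 hours, about 1.25 days
```

That is more than a full day on a CPU for this small dataset, which is why a GPU is worth using if you have one.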

Can I train BERT with CPU?

Definitely. This is the reason I wrote this tutorial. All the code in this tutorial has been tested on both CPU and GPU.

What is BERT model trained on?

The original pre-trained BERT models are trained on plain, unlabeled English text from Wikipedia articles and BookCorpus. This is why they may not work well for your domain, such as banking or telecom, and why we need to pre-train our own BERT model from scratch on our domain data.

Take Away

In this tutorial, I explained how to implement BERT from scratch on your own dataset. We started by training our custom tokenizer. Then, using that tokenizer, we tokenized our input data (the Amazon product review dataset).

Next, we used that tokenized data to train our custom BERT MLM model. Finally, we fine-tuned our custom BERT model for question answering.

That is it for this tutorial. If you have any questions or suggestions, please let me know in the comment section below.
