Finetune LLM (LLAMA, GPT-Neo, GPT-J, Pythia) in Google Colab


After the release of ChatGPT, large language models (LLMs) have become one of the hottest topics in the market, and every organization wants to build its own chatbot. But fine-tuning a large language model (LLM) is not a trivial task. In this tutorial, I will show you how you can easily fine-tune LLMs like Pythia, LLaMA, GPT-J, GPT-Neo, Falcon, etc. using H2O LLM Studio on Google Colab.

What is an LLM?

A large language model (LLM) is a deep natural language model trained on huge amounts of text: books, articles, and content from the internet. This kind of model stores what it learns in its weights, much like a memory card stores data.

This is how such a model learns how a language works. So, when you ask it a question or tell it to write something, it generates text that sounds like a human wrote it.

If you fine-tune an LLM with your own custom domain dataset, the model will grasp the details of that domain and behave like a domain expert: you can then ask it anything from that domain and it will answer like a real human expert. How much a model can remember largely depends on the number of parameters, i.e. the size of the model.

What is LLM fine-tuning?

Fine-tuning an LLM means taking a pre-trained LLM and making it even smarter by teaching it about a specific topic or job. It’s like giving it extra training so it can do a particular task or understand a particular subject really well. This helps the model become an expert in that area and do a better job when you use it for that task.

For example, the T5 model is pre-trained on the C4 corpus, which mostly covers general topics. Now, if I take that pre-trained T5 model and ask it a domain-specific question (for example, about legal documents), it will not be able to answer well.

This is why we need to fine-tune that T5 model on our domain-specific dataset. If you are still unsure what LLM fine-tuning is, I recommend reading a dedicated introduction to the topic first.
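
To make this concrete, here is a minimal sketch of querying a general-purpose pre-trained T5 model with the Hugging Face transformers library (the t5-small checkpoint and the legal question are my own illustrative choices, not part of H2O LLM Studio). On a narrow domain question it will typically produce a generic or unhelpful answer, which is exactly the gap fine-tuning closes.

# A minimal sketch: asking a general-purpose pre-trained T5 model a domain question.
# "t5-small" is used purely for illustration.
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

prompt = "question: What is an indemnification clause in a contract?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# A model's "memory" roughly scales with its parameter count.
print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")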

Why should you fine-tune LLMs?

Nowadays, many organizations expose their LLMs through an API. For example, OpenAI provides the ChatGPT API, which you can use for almost any downstream NLP task: summarizing a document, question answering, etc.

But the problem with those APIs is that they are not free; you are charged based on your usage. For example, OpenAI charges $0.002 per 1K tokens for ChatGPT API calls.
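
For a rough sense of scale, here is a tiny back-of-the-envelope estimate of what that pricing means (the monthly token volume below is a made-up assumption, not a figure from OpenAI):

# Back-of-the-envelope API cost estimate (illustrative numbers only).
price_per_1k_tokens = 0.002        # USD, the ChatGPT API price quoted above
monthly_tokens = 5_000_000         # assumed monthly usage for this example
monthly_cost = monthly_tokens / 1_000 * price_per_1k_tokens
print(f"Estimated monthly cost: ${monthly_cost:.2f}")   # -> $10.00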

Another problem with those LLM APIs is that you cannot be sure whether the provider will store and use your business data. They could sell your private data, or even train their models on your private domain data.

So if you want the same flavor of ChatGPT without paying for API calls (and without sending your data to a third party), you need to take an open-source LLM and fine-tune it on your domain data.

About H2O LLM Studio

H2O LLM Studio is a framework developed by h2o.ai. The main focus of this framework is to make it easy to train or fine-tune LLMs. There are two main ways of using this tool: without code (using the GUI) and with code (CLI).

To use the GUI method (without code) you need an Ubuntu operating system and 24 GB of GPU memory. With this method, you don’t need to write a single line of code; you just select the model and the parameters, and that’s all.

If, like me, you are on a Windows computer with little or no GPU in your system, you can still use H2O LLM Studio via the CLI method. In this method, you need to write or modify some code in Google Colab.

Train an LLM model with a custom dataset

In this tutorial, I am going to use the CLI method, so we need to write or modify some code to train or fine-tune our desired LLM. Let me break the entire process into steps.

Note: You can also train or fine-tune LLMs with H2O LLM Studio locally if you have more than 30 GB of GPU memory. The code and commands will be the same.

Setup Google Colab

First, open a Google Colab notebook and go to Runtime -> Change runtime type, then select a GPU. In this case, I am selecting a T4 GPU. You will get about 14 GB of free GPU memory, which is enough to train a small LLM.

[Screenshot: selecting a T4 GPU runtime in Google Colab]

You can also check which GPU your runtime has by running the command below in a Colab cell.

!nvidia-smi -L
This is my output
GPU 0: Tesla T4 (UUID: GPU-d53c72da-5f47-2236-ed7d-46659c33010d)
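
If you also want to check the GPU from Python (an optional sanity check; PyTorch comes pre-installed in Colab), you can run:

import torch

# Optional check: name and total memory of the first CUDA device.
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(props.name, f"{props.total_memory / 1024**3:.1f} GB")
else:
    print("No GPU detected - change the runtime type to GPU first.")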

Clone the H2O GitHub repository

Now we need to clone the H2O LLM Studio GitHub repository. You can do this with the following commands.

!git clone https://github.com/h2oai/h2o-llmstudio.git
!cd h2o-llmstudio && git checkout ce10af57ff118a2bbb81b5b3eae12273e290299a -q
!cp -r h2o-llmstudio/. ./
!rm -r h2o-llmstudio
Cloning into 'h2o-llmstudio'...
remote: Enumerating objects: 6622, done.
remote: Counting objects: 100% (930/930), done.
remote: Compressing objects: 100% (395/395), done.
remote: Total 6622 (delta 612), reused 799 (delta 518), pack-reused 5692
Receiving objects: 100% (6622/6622), 20.88 MiB | 34.82 MiB/s, done.
Resolving deltas: 100% (4627/4627), done.

Install Python 3.10

One important point is that H2O LLM Studio requires Python 3.10 or higher. Colab’s default Python version is now 3.10, but to keep everything clean we should create a fresh virtual environment in Google Colab. Let’s do that with the commands below.

# Install Python 3.10 that will be used within pipenv
!sudo add-apt-repository ppa:deadsnakes/ppa -y > /dev/null
!sudo apt install python3.10 python3.10-distutils psmisc -y > /dev/null
!curl -sS https://bootstrap.pypa.io/get-pip.py | python3.10 > /dev/null

# install requirements
!make setup > /dev/null
Creating a virtualenv for this project...
Pipfile: /content/Pipfile
Using /usr/local/bin/python (3.10.12) to create virtualenv...
⠸ Creating virtual environment...created virtual environment CPython3.10.12.final.0-64 in 890ms
  creator Venv(dest=/root/.local/share/virtualenvs/content-cQIIIOO2, clear=False, no_vcs_ignore=False, global=False, describe=CPython3Posix)
  seeder FromAppData(download=False, pip=bundle, setuptools=bundle, wheel=bundle, via=copy, app_data_dir=/root/.local/share/virtualenv)
    added seed packages: pip==23.2.1, setuptools==68.0.0, wheel==0.41.1
  activators BashActivator,CShellActivator,FishActivator,NushellActivator,PowerShellActivator,PythonActivator

✔ Successfully created virtual environment! 
Virtualenv location: /root/.local/share/virtualenvs/content-cQIIIOO2

Download Dataset

In this tutorial, I am going to fine-tune an LLM to build a chatbot, and for that we need training data. For this demo, I am going to use the OpenAssistant dataset.

You can find this dataset on Kaggle. It is a human-generated, assistant-style conversation dataset. You can download the data manually and place it inside your Google Drive to use it.
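
If you go the manual route, a minimal sketch of reading the file from Google Drive could look like this (the folder and file name below are assumptions; adjust them to wherever you saved the Kaggle download):

from google.colab import drive
import pandas as pd

# Mount Google Drive into the Colab file system.
drive.mount("/content/drive")

# Hypothetical path - point it at your own copy of the dataset.
df = pd.read_csv("/content/drive/MyDrive/LLM/oasst1.csv")
print(df.shape)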

But for this tutorial, I am going to use the datasets library to download and read the OpenAssistant (oasst1) dataset in Python. Below are the commands to install that library and create the data folders.

!python -m pip install datasets > /dev/null
!mkdir data
!mkdir data/oasst-data

The commands above also create a data/oasst-data folder inside our working directory. Now let’s download the OpenAssistant dataset by running the Python code below.

import numpy as np
import pandas as pd
pd.options.mode.chained_assignment = None
from datasets import load_dataset

ds = load_dataset("OpenAssistant/oasst1")
Downloading readme: 100%
10.2k/10.2k [00:00<00:00, 487kB/s]
Downloading data files: 100%
2/2 [00:02<00:00, 1.10s/it]
Downloading data: 100%
39.5M/39.5M [00:02<00:00, 21.8MB/s]
Downloading data: 100%
2.08M/2.08M [00:00<00:00, 6.97MB/s]
Extracting data files: 100%
2/2 [00:00<00:00, 78.80it/s]
Generating train split: 100%
84437/84437 [00:00<00:00, 91327.91 examples/s]
Generating validation split: 100%
4401/4401 [00:00<00:00, 34319.46 examples/s]

Split Dataset

Like any other machine learning dataset, this one comes with separate train and validation splits. Below is the code to load them into pandas DataFrames.

# Split dataset
train = ds['train']
val = ds['validation']

# Convert to data frame
train = pd.DataFrame(train)
val = pd.DataFrame(val)

train.head()
[Output: first rows of the OpenAssistant training DataFrame]

Data Pre-Processing

In the OpenAssistant data, we will mainly use two columns: “text” and “role“. The “text” column holds the message text, and the “role” column tells us whether the message comes from the prompter (who starts the conversation) or the assistant (who replies).

To train a chatbot, we need to organize the data into a question-and-answer format. This means setting up two columns: one for the instruction (what the prompter asks) and one for the output (what the assistant replies). Below is the code that turns our input data into such an instruction/output dataset.

def prep_data(df):
    # Keep only top-ranked assistant replies (rank 0) and index prompter messages by id
    df_assistant = df[(df.role=="assistant") & (df["rank"]==0.0)]
    df_prompter = df[(df.role=="prompter")]
    df_prompter = df_prompter.set_index("message_id")

    # The assistant reply becomes the "output" column
    df_assistant["output"] = df_assistant["text"].values

    # Look up each reply's parent prompter message and use its text as the instruction
    inputs = []
    for idx, row in df_assistant.iterrows():
        input = df_prompter.loc[row.parent_id]
        inputs.append(input.text)

    df_assistant["instruction"] = inputs

    # Keep English conversations only
    df_assistant = df_assistant[df_assistant.lang=="en"]

    # Final instruction/output pairs
    df_assistant = df_assistant[["instruction", "output"]]

    return df_assistant

df_train = prep_data(train)
df_val = prep_data(val)
# Print some data
df_train.head()
[Output: pre-processed data in instruction/output (question-answering) format]
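
Before moving on, it is worth a quick sanity check on the processed frames (an optional snippet; nothing here is required for training):

# Quick sanity check on the prepared instruction/output pairs.
print("train pairs:", len(df_train), "| validation pairs:", len(df_val))
print(df_train.sample(1, random_state=0).to_dict("records")[0])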

Once the data is ready, we can concatenate the train and validation data into a single CSV file and save it inside the newly created folder in our working directory.

# Saving csv file inside newly created folder inside our working directory
pd.concat([df_train, df_val]).reset_index(drop=True).to_csv("data/oasst-data/train_full.csv", index=False)
[Screenshot: the training CSV saved for fine-tuning the Pythia LLM]

Configure LLM Parameters

In this step, we need to define the parameters for fine-tuning our LLM-based chatbot. The H2O framework gives us a lot of flexibility here: we just need to write a cfg_notebook.py configuration file. Run the Python cell below to create that file with all the required parameters.

%%writefile cfg_notebook.py

import os
from dataclasses import dataclass

from llm_studio.python_configs.text_causal_language_modeling_config import ConfigProblemBase, ConfigNLPCausalLMDataset, \
    ConfigNLPCausalLMTokenizer, ConfigNLPAugmentation, ConfigNLPCausalLMArchitecture, ConfigNLPCausalLMTraining, \
    ConfigNLPCausalLMPrediction, ConfigNLPCausalLMEnvironment, ConfigNLPCausalLMLogging


ROOT_DIR = "./data/oasst-data/"

@dataclass
class Config(ConfigProblemBase):
    # Define output directory
    output_directory: str = "/content/drive/MyDrive/LLM/output/demo_oasst-data/"
    experiment_name: str = "demo_experiment"
    # Define model name
    llm_backbone: str = "EleutherAI/pythia-1.4b-deduped"

    dataset: ConfigNLPCausalLMDataset = ConfigNLPCausalLMDataset(
        train_dataframe=os.path.join(ROOT_DIR, "train_full.csv"),

        validation_strategy="automatic",
        validation_dataframe="",
        validation_size=0.01,

        prompt_column=("instruction",),
        answer_column="output",
        text_prompt_start="",
        text_answer_separator="",

        add_eos_token_to_prompt=True,
        add_eos_token_to_answer=True,
        mask_prompt_labels=False,

    )
    tokenizer: ConfigNLPCausalLMTokenizer = ConfigNLPCausalLMTokenizer(
        max_length_prompt=128,
        max_length_answer=128,
        max_length=256,
        padding_quantile=1.0
    )
    augmentation: ConfigNLPAugmentation = ConfigNLPAugmentation(token_mask_probability=0.0)
    architecture: ConfigNLPCausalLMArchitecture = ConfigNLPCausalLMArchitecture(
        backbone_dtype="float16",
        gradient_checkpointing=False,
        force_embedding_gradients=False,
        intermediate_dropout=0
    )
    training: ConfigNLPCausalLMTraining = ConfigNLPCausalLMTraining(
        loss_function="CrossEntropy",
        optimizer="AdamW",

        learning_rate=0.00015,

        batch_size=4,
        drop_last_batch=True,
        epochs=1,
        schedule="Cosine",
        warmup_epochs=0.0,

        weight_decay=0.0,
        gradient_clip=0.0,
        grad_accumulation=1,

        lora=True,
        lora_r=4,
        lora_alpha=16,
        lora_dropout=0.05,
        lora_target_modules="",

        save_best_checkpoint=False,
        evaluation_epochs=1.0,
        evaluate_before_training=False,
    )
    prediction: ConfigNLPCausalLMPrediction = ConfigNLPCausalLMPrediction(
        metric="BLEU",

        min_length_inference=2,
        max_length_inference=256,
        batch_size_inference=0,

        do_sample=False,
        num_beams=2,
        temperature=0.3,
        repetition_penalty=1.2,
    )
    environment: ConfigNLPCausalLMEnvironment = ConfigNLPCausalLMEnvironment(
        mixed_precision=True,
        number_of_workers=4,
        seed=1
    )
Writing cfg_notebook.py

In the above configuration file, I only changed the following parameters:

  • ROOT_DIR: the path of the training dataset.
    • ROOT_DIR = "./data/oasst-data/"
  • output_directory: where the trained model is saved. I am saving it inside Google Drive (see the Drive-mounting note just after this list).
    • output_directory: str = "/content/drive/MyDrive/LLM/output/demo_oasst-data/"
  • llm_backbone: the name of the LLM model you want to fine-tune.
    • llm_backbone: str = "EleutherAI/pythia-1.4b-deduped"
  • train_dataframe: the training data file name.
    • train_dataframe=os.path.join(ROOT_DIR, "train_full.csv")
  • epochs: the number of training epochs. I am using only 1 epoch, just to show you the process.
    • epochs=1
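
Because output_directory above points to Google Drive, Drive needs to be mounted in this Colab session before training starts (skip this if you have already mounted it, or point output_directory to a local path such as ./output/ instead). A minimal sketch:

from google.colab import drive

# Mount Google Drive so that checkpoints can be written to
# /content/drive/MyDrive/LLM/output/demo_oasst-data/
drive.mount("/content/drive")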

Here I am fine-tuning pythia-1.4b-deduped, but you can also try your favourite LLM. Here is the list of LLM models supported by H2O LLM Studio:

h2oai/h2ogpt-oig-oasst1-512-6.9b,
h2oai/h2ogpt-oasst1-512-20b,
h2oai/h2ogpt-gm-oasst1-en-1024-12b,
EleutherAI/gpt-neo-1.3B,
EleutherAI/gpt-j-6B,
EleutherAI/gpt-neox-20b,
facebook/opt-125m,
facebook/opt-2.7b,
facebook/opt-6.7b,
facebook/opt-13b,
EleutherAI/pythia-1b-deduped,
EleutherAI/pythia-2.8b-deduped,
EleutherAI/pythia-6.9b-deduped,
EleutherAI/pythia-12b-deduped,
cerebras/Cerebras-GPT-13B,
stabilityai/stablelm-base-alpha-7b,
togethercomputer/GPT-NeoXT-Chat-Base-20B,
h2oai/h2ogpt-4096-llama2-70b,
h2oai/h2ogpt-4096-llama2-70b-chat,
h2oai/h2ogpt-4096-llama2-13b,
h2oai/h2ogpt-4096-llama2-13b-chat,
h2oai/h2ogpt-4096-llama2-7b,
h2oai/h2ogpt-4096-llama2-7b-chat,
tiiuae/falcon-40b,
tiiuae/falcon-7b,
openlm-research/open_llama_13b,
openlm-research/open_llama_7b,
openlm-research/open_llama_3b,

This list of supported models may change over time, so it is a good idea to check the up-to-date list at this link: https://github.com/h2oai/h2o-llmstudio/blob/main/llm_studio/python_configs/text_causal_language_modeling_config.py.

After writing the configuration file, run the cell below to create a small run.sh script that launches training with this configuration inside the pipenv environment.

%%writefile run.sh

pipenv run python train.py -C cfg_notebook.py &

wait
echo "all done"
Writing run.sh

Train / Fine-tune the LLM model

Our training dataset is ready and the configuration file is defined, so now we can run the final command to fine-tune our LLM. Below is the command.

# Train LLM Model
!sh run.sh

It took me around 20 minutes to fine-tune the pythia-1.4b-deduped LLM for just one epoch. Below is the training log.

2023-08-21 06:01:56,859 - INFO: Number of observations in train dataset: 8191
2023-08-21 06:01:56,860 - INFO: Number of observations in validation dataset: 83
Downloading (…)lve/main/config.json: 100% 570/570 [00:00<00:00, 393kB/s]
2023-08-21 06:01:57,253 - INFO: dtype: torch.float16
Downloading pytorch_model.bin: 100% 2.93G/2.93G [00:32<00:00, 90.6MB/s]
trainable params: 786432 || all params: 1415434240 || trainable%: 0.055561182411413196
2023-08-21 06:02:39,652 - INFO: Training Epoch: 1 / 1
2023-08-21 06:02:39,653 - INFO: train loss:   0%|          | 0/2047 [00:00<?, ?it/s]
Using pad_token, but it is not set yet.
Using cls_token, but it is not set yet.
Using sep_token, but it is not set yet.
2023-08-21 06:03:14,636 - INFO: train loss: 2.20:   5%|4         | 102/2047 [00:34<11:07,  2.92it/s]
2023-08-21 06:03:24,832 - INFO: train loss: 2.20:   5%|4         | 102/2047 [00:45<11:07,  2.92it/s]
2023-08-21 06:03:47,838 - INFO: train loss: 2.21:  10%|9         | 204/2047 [01:08<10:13,  3.01it/s]
2023-08-21 06:03:59,655 - INFO: train loss: 2.21:  10%|9         | 204/2047 [01:20<10:13,  3.01it/s]
2023-08-21 06:04:19,826 - INFO: train loss: 2.17:  15%|#4        | 306/2047 [01:40<09:24,  3.09it/s]
2023-08-21 06:04:35,008 - INFO: train loss: 2.17:  15%|#4        | 306/2047 [01:55<09:24,  3.09it/s]
2023-08-21 06:04:55,056 - INFO: train loss: 1.91:  20%|#9        | 408/2047 [02:15<09:04,  3.01it/s]
2023-08-21 06:05:09,700 - INFO: train loss: 1.91:  20%|#9        | 408/2047 [02:30<09:04,  3.01it/s]
2023-08-21 06:05:28,683 - INFO: train loss: 2.29:  25%|##4       | 510/2047 [02:49<08:29,  3.02it/s]
2023-08-21 06:05:39,701 - INFO: train loss: 2.29:  25%|##4       | 510/2047 [03:00<08:29,  3.02it/s]
2023-08-21 06:06:03,569 - INFO: train loss: 2.05:  30%|##9       | 612/2047 [03:23<08:00,  2.98it/s]
2023-08-21 06:06:15,059 - INFO: train loss: 2.05:  30%|##9       | 612/2047 [03:35<08:00,  2.98it/s]
2023-08-21 06:06:37,301 - INFO: train loss: 2.22:  35%|###4      | 714/2047 [03:57<07:24,  3.00it/s]
2023-08-21 06:06:49,732 - INFO: train loss: 2.22:  35%|###4      | 714/2047 [04:10<07:24,  3.00it/s]
2023-08-21 06:07:12,009 - INFO: train loss: 2.03:  40%|###9      | 816/2047 [04:32<06:53,  2.98it/s]
2023-08-21 06:07:25,074 - INFO: train loss: 2.03:  40%|###9      | 816/2047 [04:45<06:53,  2.98it/s]
2023-08-21 06:07:46,934 - INFO: train loss: 1.97:  45%|####4     | 918/2047 [05:07<06:21,  2.96it/s]
2023-08-21 06:07:59,813 - INFO: train loss: 1.97:  45%|####4     | 918/2047 [05:20<06:21,  2.96it/s]
2023-08-21 06:08:21,278 - INFO: train loss: 2.28:  50%|####9     | 1020/2047 [05:41<05:46,  2.96it/s]
2023-08-21 06:08:35,231 - INFO: train loss: 2.28:  50%|####9     | 1020/2047 [05:55<05:46,  2.96it/s]
2023-08-21 06:08:55,324 - INFO: train loss: 2.12:  55%|#####4    | 1122/2047 [06:15<05:11,  2.97it/s]
2023-08-21 06:09:09,877 - INFO: train loss: 2.12:  55%|#####4    | 1122/2047 [06:30<05:11,  2.97it/s]
2023-08-21 06:09:30,681 - INFO: train loss: 2.11:  60%|#####9    | 1224/2047 [06:51<04:39,  2.95it/s]
2023-08-21 06:09:45,313 - INFO: train loss: 2.11:  60%|#####9    | 1224/2047 [07:05<04:39,  2.95it/s]
2023-08-21 06:10:04,402 - INFO: train loss: 2.01:  65%|######4   | 1326/2047 [07:24<04:02,  2.97it/s]
2023-08-21 06:10:15,318 - INFO: train loss: 2.01:  65%|######4   | 1326/2047 [07:35<04:02,  2.97it/s]
2023-08-21 06:10:38,909 - INFO: train loss: 2.03:  70%|######9   | 1428/2047 [07:59<03:28,  2.97it/s]
2023-08-21 06:10:49,968 - INFO: train loss: 2.03:  70%|######9   | 1428/2047 [08:10<03:28,  2.97it/s]
2023-08-21 06:11:14,051 - INFO: train loss: 2.09:  75%|#######4  | 1530/2047 [08:34<02:55,  2.95it/s]
2023-08-21 06:11:25,434 - INFO: train loss: 2.09:  75%|#######4  | 1530/2047 [08:45<02:55,  2.95it/s]
2023-08-21 06:11:47,459 - INFO: train loss: 2.01:  80%|#######9  | 1632/2047 [09:07<02:19,  2.98it/s]
2023-08-21 06:12:00,030 - INFO: train loss: 2.01:  80%|#######9  | 1632/2047 [09:20<02:19,  2.98it/s]
2023-08-21 06:12:21,772 - INFO: train loss: 2.11:  85%|########4 | 1734/2047 [09:42<01:45,  2.98it/s]
2023-08-21 06:12:35,446 - INFO: train loss: 2.11:  85%|########4 | 1734/2047 [09:55<01:45,  2.98it/s]
2023-08-21 06:12:55,404 - INFO: train loss: 2.09:  90%|########9 | 1836/2047 [10:15<01:10,  2.99it/s]
2023-08-21 06:13:05,526 - INFO: train loss: 2.09:  90%|########9 | 1836/2047 [10:25<01:10,  2.99it/s]
2023-08-21 06:13:29,571 - INFO: train loss: 2.05:  95%|#########4| 1938/2047 [10:49<00:36,  2.99it/s]
2023-08-21 06:13:40,173 - INFO: train loss: 2.05:  95%|#########4| 1938/2047 [11:00<00:36,  2.99it/s]
2023-08-21 06:14:03,521 - INFO: train loss: 1.99: 100%|#########9| 2040/2047 [11:23<00:02,  2.99it/s]
2023-08-21 06:14:05,808 - INFO: train loss: 1.94: 100%|##########| 2047/2047 [11:26<00:00,  2.98it/s]
2023-08-21 06:14:05,809 - INFO: Starting validation inference
2023-08-21 06:14:05,810 - INFO: validation progress:   0%|          | 0/21 [00:00<?, ?it/s]
2023-08-21 06:14:11,639 - INFO: validation progress:   5%|4         | 1/21 [00:05<01:56,  5.83s/it]
2023-08-21 06:14:19,167 - INFO: validation progress:  10%|9         | 2/21 [00:13<02:09,  6.83s/it]
2023-08-21 06:14:27,179 - INFO: validation progress:  14%|#4        | 3/21 [00:21<02:12,  7.37s/it]
2023-08-21 06:14:32,695 - INFO: validation progress:  19%|#9        | 4/21 [00:26<01:52,  6.64s/it]
2023-08-21 06:14:38,564 - INFO: validation progress:  24%|##3       | 5/21 [00:32<01:41,  6.36s/it]
2023-08-21 06:14:47,744 - INFO: validation progress:  29%|##8       | 6/21 [00:41<01:49,  7.32s/it]
2023-08-21 06:14:55,704 - INFO: validation progress:  33%|###3      | 7/21 [00:49<01:45,  7.53s/it]
2023-08-21 06:15:04,064 - INFO: validation progress:  38%|###8      | 8/21 [00:58<01:41,  7.79s/it]
2023-08-21 06:15:12,358 - INFO: validation progress:  43%|####2     | 9/21 [01:06<01:35,  7.95s/it]
2023-08-21 06:15:18,900 - INFO: validation progress:  48%|####7     | 10/21 [01:13<01:22,  7.52s/it]
2023-08-21 06:15:25,075 - INFO: validation progress:  52%|#####2    | 11/21 [01:19<01:11,  7.11s/it]
2023-08-21 06:15:33,283 - INFO: validation progress:  57%|#####7    | 12/21 [01:27<01:06,  7.44s/it]
2023-08-21 06:15:40,228 - INFO: validation progress:  62%|######1   | 13/21 [01:34<00:58,  7.29s/it]
2023-08-21 06:15:55,445 - INFO: validation progress:  67%|######6   | 14/21 [01:49<01:07,  9.68s/it]
2023-08-21 06:16:02,107 - INFO: validation progress:  71%|#######1  | 15/21 [01:56<00:52,  8.77s/it]
2023-08-21 06:16:07,791 - INFO: validation progress:  76%|#######6  | 16/21 [02:01<00:39,  7.84s/it]
2023-08-21 06:16:16,006 - INFO: validation progress:  81%|########  | 17/21 [02:10<00:31,  7.96s/it]
2023-08-21 06:16:23,426 - INFO: validation progress:  86%|########5 | 18/21 [02:17<00:23,  7.79s/it]
2023-08-21 06:16:28,721 - INFO: validation progress:  90%|######### | 19/21 [02:22<00:14,  7.04s/it]
2023-08-21 06:16:35,909 - INFO: validation progress:  95%|#########5| 20/21 [02:30<00:07,  7.09s/it]
2023-08-21 06:16:43,904 - INFO: validation progress: 100%|##########| 21/21 [02:38<00:00,  7.36s/it]
2023-08-21 06:16:43,908 - INFO: validation progress: 100%|##########| 21/21 [02:38<00:00,  7.53s/it]
2023-08-21 06:16:44,226 - INFO: Mean validation loss: 2.12888
2023-08-21 06:16:44,305 - INFO: Validation BLEU: 3.14715
2023-08-21 06:16:44,570 - INFO: Saving last model checkpoint: val_loss 2.1289, val_BLEU 3.1471 to /content/drive/MyDrive/LLM/output/demo_oasst-data/
all done

Inference and prompting

This is the fun part after any model training. Let’s test our fine-tuned Pythia model and see how it performs on different prompts. In the command below, I am loading the trained model from my Google Drive (the output directory).

!pipenv run python prompt.py --e /content/drive/MyDrive/LLM/output/demo_oasst-data/

Let me show some of the prompts I tried, along with the answers given by our Pythia LLM trained for one epoch.

CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /root/.local/share/virtualenvs/content-cQIIIOO2/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda118.so...
Using pad_token, but it is not set yet.
Using cls_token, but it is not set yet.
Using sep_token, but it is not set yet.
Loading model weights...
trainable params: 786432 || all params: 1415434240 || trainable%: 0.055561182411413196

=============
You can change inference parameters on the fly by typing --param value, such as --num_beams 4. You can also chain them such as --num_beams 4 --top_k 30.

Please enter some prompt (type 'exit' to stop): What is your name?

What is your name?<|endoftext|>
Hello, my name is [your name]. How can I help you today?

Please enter some prompt (type 'exit' to stop): Who are you?

Who are you?<|endoftext|>
I am a machine learning model trained on a large dataset of images.

Please enter some prompt (type 'exit' to stop): What is the age of Gandhi?

What is the age of Gandhi?<|endoftext|>
Gandhi was born in 1869.

Please enter some prompt (type 'exit' to stop): What is machine learning?

What is machine learning?<|endoftext|>
Machine learning is a field of computer science that uses algorithms and data to make decisions and solve problems. It is a broad field that encompasses a wide range of techniques, including artificial neural networks, genetic algorithms, reinforcement learning, and deep learning. Machine learning can be applied to a wide variety of problems, including image recognition, speech recognition, natural language processing, and computer vision.

Machine learning can be used in a number of different ways, including:

1. Training a model to perform a specific task, such as image classification or speech recognition.
2. Using a model to make predictions or recommendations about new data.

Please enter some prompt (type 'exit' to stop): write a python code to print hello world

write a python code to print hello world<|endoftext|>
Hello World!

```python
print("Hello World!")
```

Please enter some prompt (type 'exit' to stop): what kind of data oasst is?

what kind of data oasst is?<|endoftext|>
The OASST is a set of standardized measurements that are used to assess the quality of fish and shellfish. The OASST was developed by the International Organization for Standardization (ISO) and is based on the International Organization for Standardization (ISO) standards for fish and shellfish. The OASST is a set of standardized measurements that are used to assess the quality of fish and shellfish. The OASST was developed by the International Organization for Standardization (ISO) and is based on the International Organization for Standardization (ISO) standards for fish and shellfish. The OASST is a set of standardized

Please enter some prompt (type 'exit' to stop): do you know bangla language?

do you know bangla language?<|endoftext|>
Yes, I do know Bangla language.

Please enter some prompt (type 'exit' to stop): say something in bangla

say something in bangla<|endoftext|>
I'm sorry, but I don't speak Bangla.

Please enter some prompt (type 'exit' to stop): but you said you know bangla language

but you said you know bangla language<|endoftext|>
Yes, I do know Bangla language.

Please enter some prompt (type 'exit' to stop): then why you are saying you don't speak bengali?

then why you are saying you don't speak bengali?<|endoftext|>
I'm sorry, I don't speak Bengali.

Please enter some prompt (type 'exit' to stop): In which part of the world bangla language is used?

In which part of the world bangla language is used?<|endoftext|>
Bangla is a South Asian language that is spoken in Bangladesh, India, and Pakistan. It is also known as Bengali in India and West Bengal in Bangladesh. Bangla is an Indo-Aryan language that is closely related to Hindi and Urdu. Bangla is also used as a lingua franca in many parts of the world, such as Bangladesh, India, and Pakistan.

Please enter some prompt (type 'exit' to stop): what is the capital of bangladesh?

what is the capital of bangladesh?<|endoftext|>
The capital of Bangladesh is Dhaka.

Please enter some prompt (type 'exit' to stop): capital of west bengal

capital of west bengal<|endoftext|>
Capital of West Bengal is Kolkata.

Please enter some prompt (type 'exit' to stop): distance between kolkata and dhaka

distance between kolkata and dhaka<|endoftext|>
Kolkata and Dhaka are both cities in Bangladesh. The distance between the two cities is approximately 1,000 kilometers (620 miles).

Please enter some prompt (type 'exit' to stop): but kolkata is not part of bangladesh

but kolkata is not part of bangladesh<|endoftext|>
Yes, Kolkata is not part of Bangladesh, but it is part of West Bengal.

Please enter some prompt (type 'exit' to stop): then why you said it is part of bangladesh first?

then why you said it is part of bangladesh first?<|endoftext|>
Bangladesh is a country in South Asia. It has a population of over 200 million people and is located in the southern part of the Indian subcontinent. It is bordered by India to the north, Pakistan to the east, and Myanmar to the south.

The capital city of Bangladesh is Dhaka, which is located in the eastern part of the country. It is the largest city in the country and is home to the largest population of Bangladeshis.

Bangladesh is a predominantly Muslim country, with more than 90% of the population being Muslim. The country is also home to a

Please enter some prompt (type 'exit' to stop): do you know todays news?

do you know todays news?<|endoftext|>
I'm sorry, but I don't know any news today.

Please enter some prompt (type 'exit' to stop): till which date you have knowledge?

till which date you have knowledge?<|endoftext|>
Yes, I have knowledge until the date of my death.

Please enter some prompt (type 'exit' to stop): then when did you die?

then when did you die?<|endoftext|>
I died on December 31, 2016.

Did you see how an LLM fine-tuned for only one epoch is performing? For many general questions it already gives surprisingly ChatGPT-like answers, although, as you can see above, it still gets confused on factual details and follow-up questions.

I know its answers are not always accurate, but other than that I would say this model has retained a surprising amount of knowledge for answering general questions.

FAQs

Let me answer some frequently asked questions that I have not explained in the tutorial above.

How much data do you need to fine-tune LLM?

The amount of data you need depends on how much you want the model to learn. More data generally helps the model get better, but too much noisy or low-quality data can confuse it. It’s like finding the right balance between studying enough for an exam and not overloading your brain.

For example, ChatGPT was reportedly trained on a dataset of about 570 GB of text, containing roughly 300 billion words, while Google Bard was reportedly trained on about 750 GB of text, containing roughly 1.56 trillion words.

Both use a large language model architecture based on the Transformer, but ChatGPT uses GPT-3 with 175 billion parameters, while Google Bard is reported to use PaLM 2 with around 137 billion parameters.

Difference between LLM fine-tuning and embeddings

LLM fine-tuning makes a general language model specialize in a particular area, while embeddings are numeric representations (like cheat sheets) that help computers capture the meanings of and relationships between words and sentences.
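
As a small illustration of the embedding side (a sketch only: the sentence-transformers library and the all-MiniLM-L6-v2 model are my own choices here, need a separate pip install, and are not part of H2O LLM Studio), you can turn sentences into vectors and compare their similarity:

from sentence_transformers import SentenceTransformer, util

# Any small sentence-embedding model works for this illustration.
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = ["How do I reset my password?",
             "I forgot my login credentials.",
             "What is the capital of Bangladesh?"]
embeddings = model.encode(sentences)

# Cosine similarity: the first two sentences score much closer
# to each other than either does to the third.
print(util.cos_sim(embeddings, embeddings))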

Final Thought

In this tutorial, I tried to show you how you can train an LLM on your own custom dataset. I used the OpenAssistant dataset to fine-tune the pythia-1.4b-deduped model.

Instead of the OpenAssistant (oasst1) dataset, you can use your own custom domain dataset, and you can also try your favourite LLM instead of pythia-1.4b-deduped. You can also experiment with the number of epochs.

My intention with this post is to show you how easily you can fine-tune or train an LLM with H2O LLM Studio. It does not support every LLM, but it does support most of the popular models.

That is it for this tutorial. If you have any questions or suggestions regarding this post, please let me know in the comment section below.
