Convert Chat Conversation to Question Answering Data


In this article, I will share a data preparation task I ran into recently: I had natural chat conversation data that I needed to convert into a question-answering data format.

Problem Statement

Before going to the code, let me first show what I want to do in this tutorial. In a typical chat conversation, a question from the customer or a reply from an agent is often not on a single line but split across several rows.


It is normal for a customer or an agent to ask or reply in several separate rows (that is, hitting the Enter key between parts of one message). In this post, our objective is to convert that messy chat conversation into a clean question-answer dataframe. Later you can use this dataset to fine-tune your chatbot or for other use cases.

Note: In this tutorial, I am going to use advanced pandas coding. If you want to master pandas, I recommend this Udemy course: Data Analysis with Pandas and Python.

Create Dummy Data

I did not find any sample dataset for this kind of use case, so I created dummy chat conversation data using pandas. This dummy data roughly resembles real chat conversation data.

In this dataframe, the “Message” column contains the actual reply from a customer or agent. The “Is_Agent” column stores whether that reply is from the agent (1) or the customer (0). Below is the Python code to prepare this dummy chat conversation data.

import pandas as pd

message = ['hi', 'hello', 'how', 'are', 'you', 'doing', 'today', 'fine', 'thank', 'you']
is_agent = [1, 0, 1, 0, 1, 1, 0, 0, 0, 0]

# Create a dataframe from the two lists in pandas
input_df = pd.DataFrame({
    'Message': message,
    'Is_Agent': is_agent
})


Group Elements with consecutive occurrence

You may observe in the above data that a customer or agent can send several replies in a row. For example: ‘today’ -> ‘fine’ -> ‘thank’ -> ‘you’. To prepare clean question-answering data, we must join those consecutive replies from the same sender. Let’s do that using the Python code below.

# Converting column values to list
lst1 = list(input_df['Message'])
lst2 = list(input_df['Is_Agent'])

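# Group list elements with consecutive occurrence
output = []
output1 = []

# x is the previous label, y the current label, z the current message
for x, y, z in zip([float('nan')] + lst2, lst2, lst1):
    if x != y:
        # The sender changed, so start a new group
        output.append([])
        output1.append([])
    output[-1].append(z)
    output1[-1].append([y, z])
print('output: ', '\n', output, '\n\n', 'output1: ', '\n', output1)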
 [['hi'], ['hello'], ['how'], ['are'], ['you', 'doing'], ['today', 'fine', 'thank', 'you']] 

 [[[1, 'hi']], [[0, 'hello']], [[1, 'how']], [[0, 'are']], [[1, 'you'], [1, 'doing']], [[0, 'today'], [0, 'fine'], [0, 'thank'], [0, 'you']]]

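In the above code, we first convert the dataframe columns into lists. Prepending float('nan') to lst2 pairs every label with the label of the previous row, and because NaN never compares equal to anything, the first row always starts a new group. In the output list, we group consecutive replies from the same sender (agent or customer). In the output1 list, we store each reply together with its Is_Agent label (1 for agent, 0 for customer).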
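The same grouping can also be sketched with itertools.groupby from the standard library, which starts a new group every time the key changes. This is only an alternative illustration, not the approach used in the rest of this tutorial; the variable name grouped is my own.

```python
from itertools import groupby

message = ['hi', 'hello', 'how', 'are', 'you', 'doing', 'today', 'fine', 'thank', 'you']
is_agent = [1, 0, 1, 0, 1, 1, 0, 0, 0, 0]

# groupby starts a new group each time the key (the Is_Agent label) changes
grouped = [
    (label, [msg for _, msg in run])
    for label, run in groupby(zip(is_agent, message), key=lambda pair: pair[0])
]

print(grouped)
# [(1, ['hi']), (0, ['hello']), (1, ['how']), (0, ['are']),
#  (1, ['you', 'doing']), (0, ['today', 'fine', 'thank', 'you'])]
```

Each tuple carries the sender label and the messages of one uninterrupted run, so it holds the information of output and output1 in a single structure.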


Join list elements

In the output list above, each group of replies is still a list of separate words or messages. We need to join the elements of each inner list to make a single reply, whether it is from an agent or a customer.

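# Join list elements
joined_list = []
for list_val in output:
    if len(list_val) > 1:
        # Several consecutive replies: merge them into one string
        joined_list.append([' '.join(list_val)])
    elif len(list_val) == 1:
        # A single reply is kept as it is
        joined_list.append(list_val)
print(joined_list)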
[['hi'], ['hello'], ['how'], ['are'], ['you doing'], ['today fine thank you']]
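As a side note, ' '.join also works on a list with a single element, so the two branches can be collapsed into one list comprehension. This is a minimal sketch that assumes every inner list has at least one element, which holds for groups built from consecutive replies; an empty inner list would produce [''] here instead of being skipped.

```python
output = [['hi'], ['hello'], ['how'], ['are'], ['you', 'doing'], ['today', 'fine', 'thank', 'you']]

# ' '.join returns the element unchanged when the list has only one item
joined_list = [[' '.join(list_val)] for list_val in output]

print(joined_list)
# [['hi'], ['hello'], ['how'], ['are'], ['you doing'], ['today fine thank you']]
```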

Keep track of message sender

In the above code, we joined the replies but lost the information about who sent each message. So in the Python script below, we track that information using the output1 list.

# Adding agent label to the joined list
with_label = []
for a, b in zip(joined_list, output1):
    with_label.append([b[0][0], a])
[[1, ['hi']],
 [0, ['hello']],
 [1, ['how']],
 [0, ['are']],
 [1, ['you doing']],
 [0, ['today fine thank you']]]

Create Question Answer List

Now, using the above output, we build the question and answer lists. A reply is appended to the question list if its Is_Agent label is 1 (assuming the agent's replies are the questions) and to the answer list if its Is_Agent label is 0. You can change the label filter as per your project requirements.

# Create question answer list
question = []
answer = []

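for label_value in with_label:
    if label_value[0] == 1:
        # Agent reply: treat it as a question
        question.append(label_value[1])
    elif label_value[0] == 0:
        # Customer reply: treat it as an answer
        answer.append(label_value[1])
print(question, '\n', answer)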
[['hi'], ['how'], ['you doing']] 
 [['hello'], ['are'], ['today fine thank you']]
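The lists line up here because the dummy conversation starts with the agent and ends with the customer. In a real chat log the customer may speak first or the agent may speak last, which leaves the two lists with different lengths, and pandas raises an error when a dataframe is built from lists of unequal length. One possible guard, sketched below on hypothetical data that I made up for illustration, is to keep only the adjacent turns where an agent reply is directly followed by a customer reply.

```python
# Hypothetical labelled turns, not the tutorial's dummy data
with_label = [
    [0, ['hey there']],      # customer speaks first: no question before it
    [1, ['hi']],
    [0, ['hello']],
    [1, ['how are you']],
    [0, ['fine thank you']],
    [1, ['anything else']],  # agent speaks last: no answer after it
]

question = []
answer = []

# Walk over adjacent turns and keep only (agent, customer) pairs
for current, following in zip(with_label, with_label[1:]):
    if current[0] == 1 and following[0] == 0:
        question.append(current[1])
        answer.append(following[1])

print(question, '\n', answer)
```

The unmatched first and last turns are dropped, so question and answer always have the same length.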

Convert list of list to Flat list

We are almost ready with our clean question-answering dataset. In the above output, each reply is wrapped in its own list (technically, a list of lists). We just need to convert that list of lists to a flat list, and then our data is ready.

# Convert list of list to simple or flat list
answer_flat = [item for sublist in answer for item in sublist]
question_flat = [item for sublist in question for item in sublist]

print(question_flat, '\n', answer_flat)
['hi', 'how', 'you doing'] 
 ['hello', 'are', 'today fine thank you']

Make final Dataframe

Our question-answering data is now complete, but it is still in list format. We just need to convert those lists into a dataframe to make our final question-answering dataset.

# Make final question answer dataframe
output_df = pd.DataFrame({
    'Question': question_flat,
    'Answer': answer_flat
})



In this tutorial, I showed you the logic to convert a real-life chat conversation into a question-answering dataset. This dataset can then be used to fine-tune your custom chatbot or for any other purpose.


I know the data I used in this tutorial does not look real, but my intention was to show you the logic. You can take this logic and convert any chat conversation to a question-answering dataset.
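If you prefer to stay inside pandas, the whole pipeline can also be sketched with shift, cumsum and groupby. This is a compact alternative under the same assumptions as the tutorial (label 1 is the agent, label 0 is the customer, and the conversation alternates so that questions and answers pair up); the names turn_id and turns are my own.

```python
import pandas as pd

input_df = pd.DataFrame({
    'Message': ['hi', 'hello', 'how', 'are', 'you', 'doing', 'today', 'fine', 'thank', 'you'],
    'Is_Agent': [1, 0, 1, 0, 1, 1, 0, 0, 0, 0]
})

# A new turn starts whenever Is_Agent differs from the previous row;
# the cumulative sum of those change flags gives every turn its own id
turn_id = (input_df['Is_Agent'] != input_df['Is_Agent'].shift()).cumsum().rename('turn_id')

# Join the messages of each turn and keep the sender label of the turn
turns = (
    input_df.groupby(turn_id)
    .agg(Is_Agent=('Is_Agent', 'first'), Message=('Message', lambda s: ' '.join(s)))
    .reset_index(drop=True)
)

# Agent turns become questions, customer turns become answers
questions = turns.loc[turns['Is_Agent'] == 1, 'Message'].tolist()
answers = turns.loc[turns['Is_Agent'] == 0, 'Message'].tolist()

output_df = pd.DataFrame({'Question': questions, 'Answer': answers})
print(output_df)
```

On the dummy data this gives the same three question-answer rows as the list-based steps above.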

This is it for this tutorial. If you have any questions or suggestions regarding this tutorial, please let me know in the comment section below.
