Top 23 Datasets for Chatbot Training


Chatbots are becoming more popular and useful in various domains, such as customer service, e-commerce, education, entertainment, etc. However, building a chatbot that can understand and respond to natural language is not an easy task. It requires a lot of data for training the machine-learning models behind a chatbot and making them more intelligent and conversational.

Over the last few weeks, I have been exploring question-answering models and building chatbots. This article is a result of that exploration. In it, I will share the top datasets you can use to train a custom chatbot for a specific domain.

Course for You: How to Build Chatbot with Python & Rasa Open Source

I divided the datasets into groups based on the type of chatbot you are going to train:

  • Question-Answer Datasets
  • Customer Support Datasets
  • Dialogue Datasets
  • Multilingual chatbot Datasets

Question-Answer Datasets for Chatbot Training

Question-answer datasets are useful for training chatbots that can answer factual questions based on a given text, context, or knowledge base. These datasets contain pairs of questions and answers, along with the source of the information (context).

In my last few tutorials, I explored this kind of dataset and model in detail. If you are interested, you can read the articles below:

Some examples of question-answer datasets are:

The WikiQA Corpus

This dataset contains sets of question and sentence pairs collected from Bing query logs and Wikipedia pages. You can use it to train chatbots that answer questions based on Wikipedia articles.

You can download this WikiQA corpus dataset by going to this link.

Question-Answer Database

This dataset contains Wikipedia articles along with manually generated factoid questions and manually generated answers to those questions. You can use it to train a domain- or topic-specific chatbot.

There is a separate file named question_answer_pairs, which you can use as training data for your chatbot. To download this data, go to this link.

Yahoo Language Data

This dataset contains manually curated QA data from the Yahoo Answers platform. It covers various topics, such as health, education, travel, and entertainment. You can also use this dataset to train a chatbot for a specific domain you are working on.

Visit this link to download Yahoo language dataset. You just need to create a Yahoo account if you do not have one.

TREC QA Collection

This collection includes questions and their answers from the Text REtrieval Conference (TREC) QA tracks. The questions are of different types, and answering them requires finding small bits of information in texts. You can try this dataset to train chatbots that answer questions based on web documents.

In this dataset, you will find two separate files: one for the questions and one for the answers to each question. You can download different versions of this TREC QA dataset from this website.

SQuAD Dataset

This dataset contains over 100,000 question-answer pairs based on Wikipedia articles. You can use it to train chatbots that answer factual questions based on a given text. You can download this SQuAD dataset in JSON format from this link. Below is a sample datapoint from the SQuAD dataset.

{
  "data": [
    {
      "title": "Everest",
      "paragraphs": [
        {
          "context": "Mount Everest, known in Nepal as Sagarmatha and in Tibet as Chomolungma, is Earth's highest mountain above sea level. It is located in the Himalayas on the border between Nepal and the Tibet Autonomous Region of China. Mount Everest's peak rises to an elevation of 8,848.86 meters (29,031.7 feet) above sea level. It is a popular destination for mountaineers and has been the ultimate challenge for climbers from around the world. The first successful ascent to the summit was made by Sir Edmund Hillary of New Zealand and Tenzing Norgay, a Sherpa of Nepal, in 1953.",
          "qas": [
            {
              "question": "What is the height of Mount Everest?",
              "id": "2",
              "answers": [
                {
                  "text": "Mount Everest's peak rises to an elevation of 8,848.86 meters (29,031.7 feet) above sea level.",
                  "answer_start": 63
                }
              ],
              "is_impossible": false
            },
            {
              "question": "Who made the first successful ascent to the summit of Mount Everest?",
              "id": "3",
              "answers": [],
              "is_impossible": true
            }
          ]
        }
      ]
    }
  ]
}
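To get a feel for how this format is consumed, below is a minimal sketch (plain Python, no external libraries) that walks a SQuAD v2-style record and yields question-answer-context triples. The record is a trimmed-down version of the sample above, not the real dataset.

```python
import json

# A trimmed-down SQuAD v2-style record in the same shape as the sample above.
squad_json = """
{
  "data": [
    {
      "title": "Everest",
      "paragraphs": [
        {
          "context": "Mount Everest's peak rises to an elevation of 8,848.86 meters above sea level.",
          "qas": [
            {
              "question": "What is the height of Mount Everest?",
              "id": "2",
              "answers": [{"text": "8,848.86 meters", "answer_start": 46}],
              "is_impossible": false
            },
            {
              "question": "Who made the first ascent of Mount Everest?",
              "id": "3",
              "answers": [],
              "is_impossible": true
            }
          ]
        }
      ]
    }
  ]
}
"""

def iter_squad_pairs(squad):
    """Yield (question, answer_or_None, context) triples from a SQuAD-style dict."""
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                # Unanswerable (is_impossible) questions have no answer span.
                answer = None if qa.get("is_impossible") else qa["answers"][0]["text"]
                yield qa["question"], answer, context

pairs = list(iter_squad_pairs(json.loads(squad_json)))
print(pairs[0][1])  # -> 8,848.86 meters
```

Note that `answer_start` is a character offset into `context`, which is what extractive QA models are trained to predict.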

In one of my NLP tutorials, I used a question-answering model trained on top of this SQuAD dataset. You can also read that tutorial: Build Question Answering System with BERT model.

CoQA Dataset

This dataset contains over 8,000 conversations, each consisting of a series of questions and answers. You can use it to train chatbots that answer conversational questions based on a given text. The format is somewhat similar to the SQuAD data. Below is a sample datapoint from the CoQA dataset.

 "data": [
    {
      "source": "mctest",
      "id": "3dr23u6we5exclen4th8uq9rb42tel",
      "filename": "mc160.test.41",
      "story": "Once upon a time, in a barn near a farm house, there lived a little white kitten named Cotton. Cotton lived high up in a nice warm place above the barn where all of the farmer's horses slept. But Cotton wasn't alone in her little home above the barn, oh no.",
      "questions": [
        {
          "input_text": "What color was Cotton?",
          "turn_id": 1
        },
        {
          "input_text": "Where did she live?",
          "turn_id": 2
        },
        {
          "input_text": "Did she live alone?",
          "turn_id": 3
        },
        
        
      ],
      "answers": [
        {
          "span_start": 59,
          "span_end": 93,
          "span_text": "a little white kitten named Cotton",
          "input_text": "white",
          "turn_id": 1
        },
        {
          "span_start": 18,
          "span_end": 80,
          "span_text": "in a barn near a farm house, there lived a little white kitten",
          "input_text": "in a barn",
          "turn_id": 2
        },
        {
          "span_start": 196,
          "span_end": 215,
          "span_text": "Cotton wasn't alone",
          "input_text": "no",
          "turn_id": 3
        },        
      ]

You can download the CoQA dataset in JSON format from the Stanford website via this link.
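Because CoQA stores questions and answers in two parallel lists keyed by turn_id, a common first step is to zip them back into (question, answer) turns. Here is a minimal sketch using a trimmed-down sample in the shape shown above:

```python
# Trimmed-down CoQA-style record: questions and answers live in parallel
# lists and are matched by turn_id.
coqa_item = {
    "story": "Once upon a time, in a barn near a farm house, there lived "
             "a little white kitten named Cotton.",
    "questions": [
        {"input_text": "What color was Cotton?", "turn_id": 1},
        {"input_text": "Where did she live?", "turn_id": 2},
    ],
    "answers": [
        {"input_text": "white", "turn_id": 1},
        {"input_text": "in a barn", "turn_id": 2},
    ],
}

def coqa_turns(item):
    """Return the conversation as (question, answer) pairs, ordered by turn."""
    answers = {a["turn_id"]: a["input_text"] for a in item["answers"]}
    return [(q["input_text"], answers[q["turn_id"]])
            for q in sorted(item["questions"], key=lambda q: q["turn_id"])]

print(coqa_turns(coqa_item))
# [('What color was Cotton?', 'white'), ('Where did she live?', 'in a barn')]
```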

QuAC Chatbot Dataset

This dataset contains over 14,000 dialogues that involve asking and answering questions about Wikipedia articles. You can also use this dataset to train chatbots to answer informational questions based on a given text.

{
  "data": [
    {
      "paragraphs": [
        {
          "context": "In May 1983, she married Nikos Karvelas, a composer, with whom she collaborated in 1975 and in November she gave birth to her daughter Sofia. After their marriage, she started a close collaboration with Karvelas. Since 1975, all her releases have become gold or platinum and have included songs by Karvelas. In 1986, she participated at the Cypriot National Final for Eurovision Song Contest with the song Thelo Na Gino Star (\"I Want To Be A Star\"), taking second place.The lead single Pseftika (\"Fake\") became a big hit and the album reached platinum status, selling 180.000 copies and becoming the second best selling record of 1990. She performed at \"Diogenis Palace\" in that same year, Athens's biggest nightclub/music hall at the time. CANNOTANSWER",
          "qas": [
            {
              "followup": "y",
              "yesno": "x",
              "question": "what happened in 1983?",
              "answers": [
                {
                  "text": "In May 1983, she married Nikos Karvelas,",
                  "answer_start": 0
                },
                {
                  "text": "In May 1983, she married Nikos Karvelas, a composer,",
                  "answer_start": 0
                },
                {
                  "text": "In May 1983, she married Nikos Karvelas,",
                  "answer_start": 0
                },
                {
                  "text": "In May 1983, she married Nikos Karvelas, a composer,",
                  "answer_start": 0
                }
              ],
              "id": "C_5ab583f64dbb47b995cf59328ea0af43_1_q#0",
              "orig_answer": {
                "text": "In May 1983, she married Nikos Karvelas, a composer,",
                "answer_start": 0
              }
            },
            {
              "followup": "m",
              "yesno": "y",
              "question": "did they have any children?",
              "answers": [
                {
                  "text": "she gave birth to her daughter Sofia.",
                  "answer_start": 104
                }
              ]
            }
          ]
        }
      ]
    }
  ]
}

The data above is just a sample of the QuAC dataset. As you can observe, the format is different for each dataset. You can download the QuAC dataset from this link.

Natural Questions

This dataset contains over 300,000 question-answer pairs based on Google search queries and web pages. You can use this unique dataset to train chatbots that can answer natural language questions based on a given web page.

You can download this research dataset from the Google AI website via this link. Note that this is a huge dataset, so make sure you use GPUs with enough memory to train your chatbot on it. If you want to learn how to train a chatbot using a CPU, then this tutorial is for you: Fine-tune T5 to Make Custom ChatBot.

MS MARCO

This dataset contains over one million question-answer pairs based on Bing search queries and web documents. It is similar in spirit to the Natural Questions dataset. You can also use it to train chatbots that answer real-world questions based on a given web document.

You can download the MS MARCO dataset from this Microsoft website link.

Customer Support Datasets for Chatbot Training

Customer support datasets are useful for training chatbots that can handle customer queries and issues in different domains, such as travel, telecommunications, or e-commerce. These datasets contain dialogues or transcripts between customers and service representatives or agents. Some examples of customer support datasets are:


Relational Strategies in Customer Service Dataset

This dataset contains human-computer dialogues from three live customer service representatives working in the travel and telecommunications domains. It also contains data from airline, train, and telecom forums collected from TripAdvisor.com.

You can use this dataset to train chatbots that can adopt different relational strategies in customer service interactions. You can download this Relational Strategies in Customer Service (RSiCS) dataset from this link.

Ubuntu Dialogue Corpus

This dataset contains almost one million two-person conversations collected from the Ubuntu chat logs. The conversations are about technical issues related to the Ubuntu operating system.

It is a unique dataset for training chatbots that need a flavor of technical support or troubleshooting. Go to this Kaggle link to download the Ubuntu Dialogue Corpus.

If you want to convert chat conversations into a question-answering dataset, then this tutorial is definitely for you: Convert Chat Conversation to Question Answering Data.

Customer Support on Twitter

This dataset contains over three million tweets pertaining to the largest brands on Twitter. The tweets are related to customer service issues or inquiries. You can also use this dataset to train chatbots that can interact with customers on social media platforms.

You can also find this Customer Support on Twitter dataset on Kaggle. This is the link to download it.
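When working with this dataset, you will usually want to pair each customer tweet with the agent reply that answers it. Below is a minimal plain-Python sketch of that join; the field names (tweet_id, inbound, in_response_to_tweet_id) are assumptions based on the Kaggle dataset description, so verify them against the CSV you actually download.

```python
# Toy rows mimicking the Customer Support on Twitter columns
# (field names assumed from the Kaggle dataset page).
tweets = [
    {"tweet_id": 1, "inbound": True,  "text": "My order never arrived!",
     "in_response_to_tweet_id": None},
    {"tweet_id": 2, "inbound": False, "text": "Sorry! DM us your order number.",
     "in_response_to_tweet_id": 1},
]

def pair_inbound_with_replies(rows):
    """Match each inbound customer tweet to the agent reply that references it."""
    # Index agent (non-inbound) tweets by the tweet they respond to.
    by_parent = {r["in_response_to_tweet_id"]: r for r in rows if not r["inbound"]}
    return [(r["text"], by_parent[r["tweet_id"]]["text"])
            for r in rows if r["inbound"] and r["tweet_id"] in by_parent]

print(pair_inbound_with_replies(tweets))
# [('My order never arrived!', 'Sorry! DM us your order number.')]
```

The resulting (customer message, agent reply) pairs are a natural training format for a support chatbot.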

If you are working on training your own custom chatbot, the tutorials below can give you a starting point for your exploration:

Dialogue Datasets for Chatbot Training

Dialogue datasets are useful for training chatbots that can engage in natural and fluent conversations with humans on various topics or scenarios. These datasets contain dialogues or utterances between two or more speakers or participants. Some examples of dialogue datasets are:

Santa Barbara Corpus of Spoken American English

This dataset contains approximately 249,000 words from spoken conversations in American English. The conversations cover a wide range of topics and situations, such as family, sports, politics, education, entertainment, etc. You can use it to train chatbots that can converse in informal and casual language.

This is the Kaggle link to download Santa Barbara Corpus of Spoken American English dataset.

Semantic Web Interest Group IRC Chat Logs

This dataset contains automatically archived IRC chat logs from the Semantic Web Interest Group (SWIG). The chats are about topics related to the Semantic Web, such as RDF, OWL, SPARQL, and Linked Data. You can also use this dataset to train chatbots that converse in technical and domain-specific language.

This is the place where you can find Semantic Web Interest Group IRC Chat log dataset.

Cornell Movie Dialogs Corpus

This dataset contains over 220,000 conversational exchanges between 10,292 pairs of movie characters from 617 movies. The conversations cover a variety of genres and topics, such as romance, comedy, action, drama, and horror. You can use this dataset to give your chatbot creative and diverse language in conversation.

To download the Cornell Movie Dialogs Corpus, visit this Kaggle link.

DailyDialog Chatbot dataset

This dataset contains over 13,000 dialogues that cover various topics and scenarios from daily life. You can use it to make your chatbot engage in natural and fluent conversations with humans.

You can download the DailyDialog chat dataset from this Huggingface link.
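In the raw DailyDialog distribution, each line of the text files holds one dialogue, with utterances separated by a `__eou__` (end-of-utterance) marker. That separator convention is an assumption based on the original release, so verify it against the files you download. A minimal parser sketch:

```python
# One raw DailyDialog-style line: utterances delimited by "__eou__".
raw_line = ("Say, Jim, how about going for a few beers? __eou__ "
            "You know that is tempting. __eou__")

def parse_daily_dialog_line(line):
    """Split one DailyDialog line into a list of clean utterances."""
    return [u.strip() for u in line.split("__eou__") if u.strip()]

print(parse_daily_dialog_line(raw_line))
# ['Say, Jim, how about going for a few beers?', 'You know that is tempting.']
```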

Persona-Chat

This chatbot dataset contains over 10,000 dialogues based on personas. Each persona consists of four sentences that describe some aspects of a fictional character. It is one of the best datasets for training a chatbot that can converse with humans based on a given persona.

This Facebook AI Persona-Chat dataset is available on Kaggle. Go to this link to download it.

EmpatheticDialogues

This dataset contains over 25,000 dialogues that involve emotional situations. Each dialogue consists of a context, a situation, and a conversation. This is the best dataset if you want your chatbot to understand the emotions of the human speaking with it and respond accordingly.


You can download this Facebook research Empathetic Dialogue corpus from this GitHub link.

Multilingual Datasets for Chatbot Training

Multilingual datasets are useful for training chatbots that communicate in different languages or across languages. These datasets contain data in multiple languages, or translations of the same data in different languages. Some examples of multilingual datasets are:

MultiWOZ 2.0

This MultiWOZ dataset contains over 10,000 dialogues in the domain of travel and tourism. The dialogues are in English, but they also have translations in six other languages: German, French, Spanish, Italian, Chinese, and Japanese. You can use it to train chatbots that can provide travel information or booking services in multiple languages.

This MultiWOZ dataset is available on both Huggingface and Github; you can download it freely from either place.

MLQA Chat dataset

MLQA dataset contains question-answer pairs in seven languages: English, Arabic, German, Spanish, Hindi, Vietnamese, and Simplified Chinese. The questions and answers are based on Wikipedia articles in the same languages. It can be used to train chatbots that can answer questions in multiple languages.

The MLQA data by the Facebook research team is also available on both Huggingface and Github. You can download it from whichever place you prefer.

OPUS

The OPUS dataset contains a large collection of parallel corpora from various sources and domains. It covers over 100 languages and millions of sentence pairs. You can use this dataset to train chatbots that translate between different languages or generate multilingual content.

You can download the OPUS multilingual dataset from this Huggingface link.

Multi-Domain Wizard-of-Oz Dataset

This dataset contains over 8,000 dialogues in four languages: English, French, German, and Italian. The dialogues are in the domain of hotel booking and restaurant reservation. It can be used to train chatbots that can provide information or booking services in multiple languages.

You can download the Multi-Domain Wizard-of-Oz dataset from both Huggingface and Github.

XNLI Multi-language Chat Data

This dataset contains over 500,000 sentence pairs in 15 languages. The sentence pairs are labeled with three categories: entailment, contradiction, or neutral. It can be used to train chatbots that can understand natural language inference in multiple languages.

XNLI is also a dataset from the Facebook research team. You can download this multilingual chat data from Huggingface or Github.

More Datasets

In this article, I discussed some of the best datasets for chatbot training that are available online. These datasets cover different types of data: question-answer data, customer support data, dialogue data, and multilingual data.

There are many more datasets for chatbot training that are not covered in this article. You can find more on websites such as Kaggle, Data.world, or Awesome Public Datasets. You can also create your own dataset by collecting data from your own sources or using data annotation tools, and then converting that conversation data into a chatbot training dataset.
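As a starting point for that kind of conversion, one simple approach is to treat each utterance in a two-speaker transcript as the response to the utterance before it. This is only a sketch of the idea, not a full pipeline:

```python
def to_training_pairs(turns):
    """Convert an alternating-turn conversation into (prompt, response) pairs."""
    return [(turns[i], turns[i + 1]) for i in range(len(turns) - 1)]

# Toy transcript with speakers alternating turn by turn.
chat = ["Hi, I need help with my bill.",
        "Sure, can you share your account number?",
        "It's 12345."]

print(to_training_pairs(chat))
```

A real pipeline would also merge consecutive messages from the same speaker and filter out very short or noisy turns before training.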

That is it for this article. If you want to build a professional chatbot for your product, then this Udemy course will help you do that: How to Build Chatbot with Python & Rasa Open Source.

If you have any questions or suggestions regarding this article, please let me know in the comment section below.
