Abstractive Text Summarization in 12 lines with T5


In my last post, I showed you how you can use the T5 model to build your own chatbot. In this tutorial, I will show you how you can use T5 for text summarization.

As discussed in my earlier post, T5 is an encoder-decoder-based transformer model. This model expects its input as a text string and generates its output as a text string.

You can use this T5 model for various downstream NLP tasks like Named Entity Recognition, chatbots, Question Answering, Text Summarization, etc. You just need to play with different prefixes.

For example, if you want to use T5 for translation, you need to give it a specific instruction or command. To translate something from English to German, you would start your input with “translate English to German: ” and then write the text you want to be translated after the colon.

On the other hand, if you want to use T5 for summarization, you would provide a different instruction. In this case, you would start your input with “summarize: ” and then write the text that you want to be summarized after the colon.
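As a quick illustration (a minimal sketch; the example sentences are made up), the two tasks differ only in the prefix you prepend to the input:

# Only the task prefix changes between tasks (example inputs)
translation_input = "translate English to German: The house is wonderful."
summarization_input = "summarize: " + "Some long article text goes here..."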

What is Text Summarization?

In simple words, text summarization is a way to create a short and clear summary of a long document. You can implement text summarization using algorithms or manual techniques.

Abstractive vs Extractive Text Summarization

There are mainly two methods: Extractive Summarization and Abstractive Summarization.

Extractive Summarization

As I said, you can also achieve text summarization using manual techniques. Before around 2016, this was the only technique we used to summarize text.

In this technique, you apply various classical NLP techniques like POS tagging, Named Entity Recognition, Dependency Parsing, etc. to extract important words, sentences, or phrases from your input text. You then combine those important phrases to form the final summary, as in the sketch below.
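To make the idea concrete, here is a minimal extractive sketch that scores sentences by word frequency and keeps the top ones. This is just an illustration of the extractive approach (plain Python, no NLP library), not production code:

# Minimal extractive summarization sketch: score sentences by word frequency
import re
from collections import Counter

def extractive_summary(text, num_sentences=2):
    # Split the text into sentences on ., ! or ?
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    # Count word frequencies across the whole document
    freq = Counter(re.findall(r'\w+', text.lower()))
    # Score each sentence by the total frequency of its words
    scores = [sum(freq[w] for w in re.findall(r'\w+', s.lower())) for s in sentences]
    # Keep the top-scoring sentences, preserving their original order
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:num_sentences]
    return ' '.join(sentences[i] for i in sorted(top))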

Abstractive Summarization

After 2016, when deep learning based encoder-decoder models gained popularity, we started using this kind of model to generate summaries.

This kind of AI model can understand the meaning of your input text deeply and use natural language generation techniques to produce a summary of it.


Since the model generates the output summary based on its own understanding, some words or phrases in the summary may not be present in your input text at all. Depending on your use case, you can call this an advantage or a disadvantage of abstractive text summarization.

So, is T5 abstractive or extractive summarization?

Since T5 is an encoder-decoder-based generative model, it generates output based on its understanding of the input. So the answer is: yes, T5 is an abstractive summarization technique.


T5 Text Summarization Example

To keep things simple, I am dividing the entire process of text summarization with T5 into a few steps.

Step 1: Import Libraries

Importing libraries is the first step for almost all Python projects, right? So let’s import the required libraries to work with the T5 pre-trained summarizer.

# Import libraries to use t5 for text summarization
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration
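
If these libraries are not installed yet, you can install them from your terminal first (note that the T5 tokenizer also depends on the sentencepiece package):

# Install the required packages (run once in your environment)
pip install torch transformers sentencepiece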

Step 2: Download/Load T5 Model Files

There are various T5 models available. These models differ in the number of parameters and size. Below is the code to download or load the t5-base pretrained model.

The model will be downloaded and saved inside your cache directory. For me, it was: C:\Users\Anindya\.cache\huggingface\transformers. If you want to save this model to your local directory or directly download this model inside your working folder, read this post: Download and Save Huggingface Model to Custom Path.

# Download and Load T5 model files
t5_model = T5ForConditionalGeneration.from_pretrained('t5-base')
t5_tokenizer = T5Tokenizer.from_pretrained('t5-base')
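As a quick sketch of that idea (the folder path here is just an example), you can save the downloaded files to a custom folder and reload them from there:

# Save the downloaded model and tokenizer to a local folder (example path)
t5_model.save_pretrained('./t5-base-local')
t5_tokenizer.save_pretrained('./t5-base-local')

# Later, load them back from that folder instead of the cache
t5_model = T5ForConditionalGeneration.from_pretrained('./t5-base-local')
t5_tokenizer = T5Tokenizer.from_pretrained('./t5-base-local')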

For this example tutorial, I am using the t5-base model for text summarization, but you can also try other models from the T5 family like mt5, flan-t5, t5-small, etc.
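For instance, swapping in a smaller checkpoint only changes the model name (a sketch; 't5-small' is a valid checkpoint on the Hugging Face Hub):

# Load a smaller T5 checkpoint instead: same API, different name
small_model = T5ForConditionalGeneration.from_pretrained('t5-small')
small_tokenizer = T5Tokenizer.from_pretrained('t5-small')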

Step 3: Find Input Text to Summarize

The input text can be any text you are working with in your project. Just for this demo, I am going to use the Wikipedia description of Natural Language Processing. I am sharing the wiki link just for your reference.


To display the text better in a Jupyter Notebook, we can use triple quotes as in the Python script below.

input_text = '''

Natural language processing (NLP) is an interdisciplinary subfield of linguistics, 
computer science, and artificial intelligence concerned with the interactions 
between computers and human language, in particular how to program computers 
to process and analyze large amounts of natural language data. The goal is a 
computer capable of "understanding" the contents of documents, including the 
contextual nuances of the language within them. The technology can then 
accurately extract information and insights contained in the documents as 
well as categorize and organize the documents themselves.

'''

Step 4: Pre-process Input Text

While working on a serious NLP project, there can be any number of pre-processing steps like stop word removal, stemming, lemmatization, etc. But for this T5 summarization demo with Hugging Face, I am going to apply only one text cleaning step: removing the newline symbol (\n).

Next, we need to add the prefix “summarize: ” to our pre-processed text. This is required because T5 expects a prefix for each specific task. You can find the list of prefixes for the T5 model in its config file.

Note that those are the default prefixes. These are the prefixes you need to use if you are using the pre-trained T5 model. But while fine-tuning, you can use your own prefix. While fine-tuning T5 for a chatbot, I used a different prefix and it worked.
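If you want to inspect those default prefixes yourself, they live under task_specific_params in the model config (a quick check, assuming the model loaded in Step 2):

# Inspect the default task prefixes baked into the t5-base config
print(t5_model.config.task_specific_params)
# Expect keys like 'summarization' and 'translation_en_to_de',
# each with its own 'prefix' entry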


The Python code below pre-processes the input text to convert it into the T5 summarizer format.

# Text pre-processing for summarization
# Remove next line symbol
preprocess_text = input_text.strip().replace("\n", "")

# Add T5 prefix for text summarization
t5_ready_text = "summarize: " + preprocess_text

# Let's display the input text for the T5 abstractive text summarizer
print("Input text for T5 summarizer: \n", t5_ready_text)
Input text for T5 summarizer: 
 summarize: Natural language processing (NLP) is an interdisciplinary subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data. The goal is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves.

Step 5: Tokenize Input Text

As you know, no model can understand raw text. So we need to convert our input text into numbers; tokenization does that conversion. In the Python code below, we use the T5 tokenizer to convert the text into tokens.

Note: For this tutorial, I am using the CPU, but you can also specify ‘cuda’ to use your NVIDIA GPU.

# Set device as CPU
device = torch.device('cpu')

# Tokenize input text
tokenized_text = t5_tokenizer.encode(t5_ready_text, return_tensors = "pt").to(device)
tokenized_text
tensor([[21603,    10,  6869,  1612,  3026,    41,   567,  6892,    61,    19,
            46,     3,    23, 25503,   769,  1846,    13,     3, 24703,     7,
             6,  1218,  2056,     6,    11,  7353,  6123,  4376,    28,     8,
          9944,   344,  7827,    11,   936,  1612,     6,    16,  1090,   149,
            12,   478,  7827,    12,   433,    11,  8341,   508,  6201,    13,
           793,  1612,   331,     5,    37,  1288,    19,     3,     9,  1218,
          3919,    13,    96,  7248, 11018,   121,     8, 10223,    13,  2691,
             6,   379,     8, 28131,     3, 26432,    13,     8,  1612,   441,
           135,     5,    37,   748,    54,   258, 12700,  5819,   251,    11,
          7639,  6966,    16,     8,  2691,    38,   168,    38,  9624, 11498,
           776,    11,  7958,     8,  2691,  1452,     5,     1]])
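
Side note: if you want the script to pick the GPU automatically when one is available (a small optional sketch, not required for this tutorial), you can select the device like this:

# Pick the GPU if available, otherwise fall back to the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
t5_model = t5_model.to(device)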

Step 6: Summarize Text Using T5

Now everything is set to generate a summary of our input text using the T5 model. To do that, we are going to use the t5_model.generate function.

# Text summarization using T5
summary_tokens = t5_model.generate(tokenized_text,
                                  num_beams = 4,
                                  no_repeat_ngram_size = 2,
                                  min_length = 30,
                                  max_length = 100,
                                  early_stopping = True)

summary_tokens
tensor([[    0,   793,  1612,  3026,    41,   567,  6892,    61,    19,    46,
             3,    23, 25503,   769,  1846,    13,     3, 24703,     7,     6,
          1218,  2056,     6,    11,  7353,  6123,     3,     5,     8,  1288,
            19,     3,     9,  1218,  3919,    13,    96,  7248, 11018,   121,
             8, 10223,    13,  2691,     6,   379,     8, 28131,     3, 26432,
            13,     8,  1612,   441,   135,     5,     1]])

I am using some specific parameters here: num_beams sets the beam search width, no_repeat_ngram_size blocks repeated n-grams in the output, min_length and max_length bound the summary length in tokens, and early_stopping ends beam search once enough complete candidates are found. You can also try additional parameters; you can find the full list of generation parameters in the Hugging Face documentation.
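For example (a sketch with illustrative values), you could add a length_penalty to favor longer summaries:

# Alternative generation settings (example values only)
summary_tokens = t5_model.generate(tokenized_text,
                                   num_beams = 4,
                                   length_penalty = 2.0,     # > 1.0 favors longer outputs
                                   no_repeat_ngram_size = 3,
                                   max_length = 120,
                                   early_stopping = True)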


Step 7: Decode Summarized Text

The output of the previous step (Step 6) is a tensor of token IDs. Now we can convert those token IDs back into actual text. After this conversion, you will get the final summary generated by the T5 model for your input text.

# Decode T5 summary text from tokens
t5_summary_output_text = t5_tokenizer.decode(summary_tokens[0], skip_special_tokens = True)

# Display the final output of the T5 abstractive summarizer
print("Summarized text: \n", t5_summary_output_text)
Summarized text: 
 natural language processing (NLP) is an interdisciplinary subfield of linguistics, computer science, and artificial intelligence. the goal is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them.
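
Putting all the steps together, here is the whole pipeline in one place; this is essentially the same code from the steps above, condensed into roughly the 12 lines the title promises:

# T5 abstractive text summarization, end to end
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

t5_model = T5ForConditionalGeneration.from_pretrained('t5-base')
t5_tokenizer = T5Tokenizer.from_pretrained('t5-base')

input_text = '''Your long document goes here...'''
t5_ready_text = "summarize: " + input_text.strip().replace("\n", "")

device = torch.device('cpu')
tokenized_text = t5_tokenizer.encode(t5_ready_text, return_tensors = "pt").to(device)
summary_tokens = t5_model.generate(tokenized_text, num_beams = 4, no_repeat_ngram_size = 2,
                                   min_length = 30, max_length = 100, early_stopping = True)
print(t5_tokenizer.decode(summary_tokens[0], skip_special_tokens = True))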

Text Summarization Applications

You can implement text summarization in various fields, for example letting users upload a document and summarizing it for them. Text summarization is also well suited to news articles, research paper abstracts, etc.

Conclusion

So you learned what text summarization is, how it works, and how you can implement it in your project.

In this tutorial, I used the T5 model for text summarization. T5 is an encoder-decoder-based generative model that can perform various downstream NLP tasks like Named Entity Recognition, Question Answering, chatbots, Text Summarization, etc.

There are mainly two approaches to text summarization. The first one is extractive, where you use various manual NLP techniques to extract the main components from the text.

But in the abstractive summarization method, a model like T5 reads and understands your input text and, based on that understanding, generates a summary.


This is it for this tutorial. If you have any questions or suggestions regarding this tutorial, please let me know in the comment section below.
