Convert Text to Speech with Deep Learning in Python


A few years ago, natural-sounding text-to-speech software was out of reach for most developers. With deep learning and artificial neural networks on the rise, it is now quite possible to convert text to speech on your own computer. In this guide, you’ll learn how to perform text-to-speech conversion with your own Python script and how to make the synthesized speech sound more natural.

Applications of Text to Speech

Text to speech (TTS) has many applications, for example:

  • E-reader books: This kind of application can read a book or paper for you
  • Voice-enabled mobile apps: A good example of this kind of app is Google Maps driving navigation
  • Siri: This Apple product uses TTS in the background
  • Amazon Alexa
  • Google Assistant
  • TikTok TTS
  • YouTube videos: With deep learning, TTS can now generate human-like audio that can be used as a voiceover for YouTube videos

Text to Speech Python Implementation

There are many text-to-speech libraries in Python. I found two of them very promising for generating natural, human-like audio:

  • Google Text to Speech (gTTS)
  • Tacotron 2

1. Text to Speech with Google TTS

gTTS (Google Text-to-Speech) is a Python library for interacting with Google Translate’s text-to-speech API. It supports many languages, including Indian languages such as Hindi, Tamil, Bengali (Bangla), and Kannada. The complete list of languages supported by gTTS, with their language codes, is given below:

gTTS languages list
Language name    Language code (lang)
Afrikaans af
Arabic ar
Bulgarian bg
Bengali bn
Bosnian bs
Catalan ca
Czech cs
Welsh cy
Danish da
German de
Greek el
English en
Esperanto eo
Spanish es
Estonian et
Finnish fi
French fr
Gujarati gu
Hindi hi
Croatian hr
Hungarian hu
Armenian hy
Indonesian id
Icelandic is
Italian it
Hebrew iw
Japanese ja
Javanese jw
Khmer km
Kannada kn
Korean ko
Latin la
Latvian lv
Macedonian mk
Malay ms
Malayalam ml
Marathi mr
Myanmar (Burmese) my
Nepali ne
Dutch nl
Norwegian no
Polish pl
Portuguese pt
Romanian ro
Russian ru
Sinhala si
Slovak sk
Albanian sq
Serbian sr
Sundanese su
Swedish sv
Swahili sw
Tamil ta
Telugu te
Thai th
Filipino tl
Turkish tr
Ukrainian uk
Urdu ur
Vietnamese vi
Chinese zh-CN
Chinese (Mandarin/Taiwan) zh-TW
Chinese (Mandarin) zh

The speech can be generated at one of two speeds: normal (the default) or slow, selected with the slow parameter. In the most recent version, however, there is no option to change the voice of the generated audio.
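The speed setting corresponds to the slow argument of the gTTS constructor. The sketch below is one way to render the same sentence at both speeds for comparison. The helper names speed_variants and render_both_speeds and the output file names are made up for this illustration, and the actual synthesis needs gTTS installed (see the Installation section that follows) plus an internet connection.

```python
def speed_variants(stem):
    """Map each output file name to the value of gTTS's `slow` flag."""
    return {f"{stem}_normal.mp3": False, f"{stem}_slow.mp3": True}


def render_both_speeds(text, stem="speed_demo", lang="en"):
    """Save `text` once at normal speed and once at slow speed."""
    # Imported lazily because synthesis talks to Google's servers.
    from gtts import gTTS

    for filename, slow in speed_variants(stem).items():
        gTTS(text, lang=lang, slow=slow).save(filename)
    return list(speed_variants(stem))
```

Calling render_both_speeds('Natural language processing is really awesome!') should leave speed_demo_normal.mp3 and speed_demo_slow.mp3 in the current folder.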

Installation

To install gTTS, open a terminal or command prompt and type:

pip install gTTS

This works on any platform.

We’re now ready to write sample code that converts text to speech.

English Language
# Import text to speech conversion library
from gtts import gTTS
# Text2Speech generation
tts = gTTS('Natural language processing is really awesome!', lang='en')
# Save converted audio as mp3 format
tts.save('hello.mp3')

If you play hello.mp3, you will notice that gTTS does not sound fully human; the voice is somewhat robotic. But since this library is free, open source, and easy to use, it is a good fit for fun projects.

Let’s try some Indian languages.

Bengali Language
# Text2Speech generation
tts = gTTS('আমি বাংলায় কথা বলতে পারি', lang='bn')
# Save converted audio as mp3 format
tts.save('bangla.mp3')


Hindi Language
# Text2Speech generation
tts = gTTS('मैं हिन्दी में बोल सकता हूँ', lang='hi')
# Save converted audio as mp3 format
tts.save('hindi.mp3')
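
If you need several languages at once, the snippets above can be folded into a loop. This is only a sketch: SAMPLES reuses the sentences from this section, while output_path, synthesize_samples, and the sample_<lang>.mp3 naming are hypothetical. It assumes your gTTS version exports gTTSError, the exception gTTS raises when the request to Google fails.

```python
# Sentences taken from the examples above, keyed by gTTS language code.
SAMPLES = {
    "en": "Natural language processing is really awesome!",
    "bn": "আমি বাংলায় কথা বলতে পারি",
    "hi": "मैं हिन्दी में बोल सकता हूँ",
}


def output_path(lang, stem="sample"):
    """Build the mp3 file name used for a given language code."""
    return f"{stem}_{lang}.mp3"


def synthesize_samples(samples=SAMPLES):
    """Save one mp3 per language and return the files that were written."""
    # Imported lazily because synthesis needs network access.
    from gtts import gTTS, gTTSError

    saved = []
    for lang, text in samples.items():
        try:
            gTTS(text, lang=lang).save(output_path(lang))
            saved.append(output_path(lang))
        except gTTSError as err:
            # Keep going so one failed request does not stop the batch.
            print(f"Skipping {lang}: {err}")
    return saved
```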


2. Text to Speech with Tacotron 2

If you want to use text to speech in an advanced project where you need natural-sounding audio, you should use a deep-learning model.

In this tutorial, I will show you one of the best-known deep-learning models for synthesizing audio from text: Tacotron 2.

Tacotron 2 is a deep-learning TTS architecture introduced by Google researchers, and NVIDIA maintains an open-source PyTorch implementation of it. You can easily download NVIDIA’s pre-trained models and use them on your local system for your project.

Installation

Before we install the required modules for Tacotron 2, I highly recommend that you create a virtual environment and install Tacotron 2 inside it.

To create the virtual environment, run the commands below:

conda create -n tts_python python=3.7
conda activate tts_python

Here tts_python is the name of the virtual environment.

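Note: Don’t change the Python version. TensorFlow 1.15, which this setup relies on, is only available for Python 3.7 and earlier; with TensorFlow 2.x you may get the error below:

AttributeError: module 'tensorflow' has no attribute 'contrib'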
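
Because the whole setup depends on the interpreter version, it can help to check it before installing anything. The snippet below is a small sketch; supports_tensorflow_1x and MAX_TF1_PYTHON are names invented here, and the 3.7 limit reflects the fact that TensorFlow 1.15 wheels are published only for Python 3.7 and earlier.

```python
import sys

# Newest Python release for which TensorFlow 1.15 wheels exist.
MAX_TF1_PYTHON = (3, 7)


def supports_tensorflow_1x(version=None):
    """Return True if the given (major, minor) version can install TF 1.15."""
    major_minor = tuple(version or sys.version_info[:2])[:2]
    return major_minor <= MAX_TF1_PYTHON


if supports_tensorflow_1x():
    print("This interpreter can install tensorflow==1.15.0")
else:
    print("Python is too new for TensorFlow 1.15; recreate the environment with python=3.7")
```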

Inside the virtual environment, execute the commands in the following sections one by one.

Set up Tacotron 2

To set up Tacotron 2, execute the following commands:

git clone https://github.com/NVIDIA/tacotron2.git
cd tacotron2
git submodule init
git submodule update
pip install -r requirements.txt
Install Additional Packages

You need to install some additional packages. To install them, run the commands below:

pip install tensorflow==1.15.0
pip install inflect==0.2.5
pip install librosa -U
pip install Unidecode==1.0.22
pip install pillow
pip install matplotlib

Note: If you want to install tensorflow for GPU you can follow this tutorial: Install TensorFlow GPU with Jupiter notebook for Windows

Install PyTorch

To work with Tacotron 2, you have to install PyTorch with CUDA support. To install it, run the command below:

conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch
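
The notebook code later in this tutorial moves both models to the GPU with .cuda(), so it is worth confirming that PyTorch can actually see a CUDA device. A minimal check, assuming nothing beyond PyTorch itself (cuda_status is a hypothetical helper name):

```python
def cuda_status():
    """Describe whether PyTorch is installed and can see a CUDA device."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed in this environment"
    if torch.cuda.is_available():
        return f"CUDA is available: {torch.cuda.get_device_name(0)}"
    return "PyTorch is installed, but no CUDA device is visible"


print(cuda_status())
```

If the last message is printed, the usual suspects are the GPU driver or a CPU-only PyTorch build.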
Set up Jupyter Notebook

To configure Jupyter Notebook for your virtual environment, execute the commands below inside the environment:

conda install jupyter
conda install nb_conda
conda install ipykernel
Fix Common Errors

You may face the errors below while configuring Tacotron 2:

AttributeError: 'CacheManager' object has no attribute 'cachedir'
AttributeError: 'Memory' object has no attribute 'location'

To solve these errors, uninstall librosa and joblib, then install librosa again by running the commands below:

pip uninstall librosa
pip uninstall joblib
pip install librosa -U
Download Pre-trained Models

NVIDIA has published pre-trained models that you can use freely in your Python TTS project. You need to download two models:

  • Tacotron 2 model (tacotron2_statedict.pt)
  • WaveGlow model (waveglow_256channels_universal_v5.pt)

Once downloaded, place them inside the tacotron2 folder (the git-cloned folder).

These models were trained on the LJ Speech dataset, which contains roughly 24 hours of audio clips read by a single speaker, so a good deal of training data stands behind them.
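
A missing or misplaced checkpoint only shows up later as a file-not-found error in the notebook, so a quick check at this point can save time. The sketch below uses the two file names that the loading code in this tutorial expects; missing_checkpoints is a hypothetical helper, and it assumes you run it from inside the tacotron2 folder.

```python
from pathlib import Path

# File names used by the model-loading code later in this tutorial.
CHECKPOINTS = (
    "tacotron2_statedict.pt",
    "waveglow_256channels_universal_v5.pt",
)


def missing_checkpoints(folder="."):
    """Return the checkpoint files that are not present in `folder`."""
    root = Path(folder)
    return [name for name in CHECKPOINTS if not (root / name).is_file()]


missing = missing_checkpoints()
if missing:
    print("Missing model files:", ", ".join(missing))
else:
    print("Both pre-trained models are in place")
```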

Generate Audio using Tacotron 2

Now we are all set to generate realistic, human-like audio from text using the deep-learning model Tacotron 2.

First, create a Jupyter notebook inside the tacotron2 folder (the git-cloned folder), then execute the code below there.

Import Required libraries
# Import library for TTS model
import matplotlib
%matplotlib inline
import matplotlib.pylab as plt

import IPython.display as ipd

import sys
sys.path.append('waveglow/')
import numpy as np
import torch

from hparams import create_hparams
from model import Tacotron2
from layers import TacotronSTFT, STFT
from audio_processing import griffin_lim
from train import load_model
from text import text_to_sequence
from denoiser import Denoiser
Define some parameters
# Define parameter
hparams = create_hparams()
hparams.sampling_rate = 22050
Load pre-trained TTS models

Now let’s load the two models that we downloaded earlier:

# Load two pre-trained models for TTS
checkpoint_path = "tacotron2_statedict.pt"
model = load_model(hparams)
model.load_state_dict(torch.load(checkpoint_path)['state_dict'])
_ = model.cuda().eval()


waveglow_path = 'waveglow_256channels_universal_v5.pt'
waveglow = torch.load(waveglow_path)['model']
waveglow.cuda().eval()
for k in waveglow.convinv:
    k.float()
denoiser = Denoiser(waveglow)
Convert Text to Speech
# Generate audio from text
text = "natural language processing is really awesome!"
sequence = np.array(text_to_sequence(text, ['english_cleaners']))[None, :]
sequence = torch.autograd.Variable(
    torch.from_numpy(sequence)).cuda().long()

mel_outputs, mel_outputs_postnet, _, alignments = model.inference(sequence)

with torch.no_grad():
    audio = waveglow.infer(mel_outputs_postnet, sigma=0.666)
ipd.Audio(audio[0].data.cpu().numpy(), rate=hparams.sampling_rate)
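
ipd.Audio only plays the clip inside the notebook. To keep the result as a file, one option is to write it out with Python's standard wave module. The helper below is a sketch, not part of the NVIDIA repository: save_wav is a hypothetical name, and it assumes the WaveGlow output is a mono signal with samples roughly in the range [-1.0, 1.0].

```python
import struct
import wave


def save_wav(samples, path, rate=22050):
    """Write mono float samples in [-1.0, 1.0] to a 16-bit PCM WAV file."""
    frames = bytearray()
    for sample in samples:
        # Clip to the valid range, then scale to a signed 16-bit integer.
        clipped = max(-1.0, min(1.0, float(sample)))
        frames += struct.pack("<h", int(clipped * 32767))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes = 16 bits per sample
        wav_file.setframerate(rate)
        wav_file.writeframes(bytes(frames))
    return len(frames) // 2  # number of samples written
```

In the notebook, it could be called as save_wav(audio[0].data.cpu().numpy(), 'output.wav', rate=hparams.sampling_rate).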

If you listen to the generated audio, you will notice that its quality is close to a natural human voice. This is clearly better than gTTS.


Although Tacotron 2 generates natural, human-like speech, its main drawback is that the pre-trained model supports only English. In this respect, gTTS is one step ahead.

Conclusion

In this tutorial, I showed you two TTS frameworks for generating speech from text in Python, which you can use in your own projects.

One disadvantage shared by these two tools (gTTS and Tacotron 2) is that they can only produce the default female voice. There is no option to produce a male voice.

That’s all for this tutorial. If you have any questions or suggestions regarding this tutorial, feel free to mention those in the comment section below.


3 thoughts on “Convert Text to Speech with Deep Learning in Python”

  1. # Define parameter
    hparams = create_hparams()
    hparams.sampling_rate = 22050

    This chunk of code isn’t working for me. I am working in Google Colab and it gives me the following error:

    AttributeError: module 'tensorflow' has no attribute 'contrib'

    I followed your tutorial step by step.
    Kindly answer me ASAP.
    Thank you.
