
A few years ago, natural-sounding text-to-speech on a personal computer would have seemed out of reach. However, with deep learning and artificial neural networks on the rise, it’s now quite possible to convert text to speech on your own computer. In this guide, you’ll learn how to perform text-to-speech conversion with your own Python script and how to make the synthesized speech sound more natural.
Application of Text to Speech
Text to speech (TTS) has several applications, such as:
- E-readers: these applications can read a book or paper aloud for you
- Voice-enabled mobile apps: a good example is driving navigation in Google Maps
- Siri: this Apple product uses TTS in the background
- Amazon Alexa
- Google Assistant
- TikTok TTS
- YouTube videos: with deep learning, TTS can now generate human-like audio that can be used as a voiceover for YouTube videos
Text to Speech Python implementation
There are many text-to-speech libraries in Python. I found two of them very promising, so let me share them:
- Google Text to Speech (gTTS)
- Tacotron 2
1. Text to Speech with Google TTS
gTTS (Google Text-to-Speech) is a Python library for interacting with Google Translate’s text-to-speech API. It supports several languages, including Indian languages such as Hindi, Tamil, Bengali (Bangla), Kannada and many more. You can find the complete list of languages supported by gTTS, with their language codes, below:
gtts languages list
The speech can be delivered at either of two speeds: normal or slow. However, in the most recent version, changing the voice of the produced audio is no longer available.
Installation
To install gTTS, open a terminal or command prompt and type:
pip install gTTS
This works on any platform.
We’re now ready to write some sample code that converts text to speech.
English Language
# Import text to speech conversion library
from gtts import gTTS
# Text2Speech generation
tts = gTTS('Natural language processing is really awesome!', lang='en')
# Save converted audio as mp3 format
tts.save('hello.mp3')
Output
After hearing the output, you can tell that gTTS does not sound like a human (it sounds robotic). But since this library is free, open source and easy to use, it is a good fit for fun projects.
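The language codes and the slow speed option described earlier can be combined in a small helper. This is a minimal sketch, assuming gTTS is installed and the machine has internet access. The names `resolve_lang` and `speak_to_file` are hypothetical; `gTTS(..., slow=...)` and `gtts.lang.tts_langs()` come from the library.

```python
def resolve_lang(code, supported):
    """Return `code` if it is a key of `supported`, otherwise raise ValueError."""
    if code not in supported:
        raise ValueError(f"Unsupported language code: {code!r}")
    return code


def speak_to_file(text, path, lang="en", slow=False):
    """Synthesize `text` with gTTS and save it as an MP3 (needs network access)."""
    # Imported lazily so resolve_lang() stays usable without gTTS installed.
    from gtts import gTTS
    from gtts.lang import tts_langs

    lang = resolve_lang(lang, tts_langs())  # tts_langs() maps code -> language name
    gTTS(text, lang=lang, slow=slow).save(path)
    return path
```

For example, `speak_to_file('Natural language processing is really awesome!', 'hello_slow.mp3', slow=True)` should produce a slower reading of the same sentence.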
Let’s try some Indian languages.
Bengali Language
# Text2Speech generation
tts = gTTS('আমি বাংলায় কথা বলতে পারি', lang='bn')
# Save converted audio as mp3 format
tts.save('bangla.mp3')
Output
Hindi Language
# Text2Speech generation
tts = gTTS('मैं हिन्दी में बोल सकता हूँ', lang='hi')
# Save converted audio as mp3 format
tts.save('hindi.mp3')
Output
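The three snippets above differ only in the text and the language code, so they can be folded into one loop. The sketch below is one possible way to do that; `SAMPLES`, `output_name`, `synthesize_all` and `gtts_synthesize` are hypothetical names, and the file naming scheme is an assumption rather than anything gTTS requires.

```python
# Language code -> sample sentence (taken from the examples above).
SAMPLES = {
    "en": "Natural language processing is really awesome!",
    "bn": "আমি বাংলায় কথা বলতে পারি",
    "hi": "मैं हिन्दी में बोल सकता हूँ",
}


def output_name(lang):
    """Build an output file name such as 'sample_bn.mp3'."""
    return f"sample_{lang}.mp3"


def synthesize_all(samples, synthesize):
    """Call synthesize(text, lang, path) for each entry and return the paths."""
    paths = []
    for lang, text in samples.items():
        path = output_name(lang)
        synthesize(text, lang, path)
        paths.append(path)
    return paths


def gtts_synthesize(text, lang, path):
    """Synthesizer backed by gTTS (needs gTTS installed and network access)."""
    from gtts import gTTS
    gTTS(text, lang=lang).save(path)
```

Calling `synthesize_all(SAMPLES, gtts_synthesize)` should write one MP3 per language. Passing the synthesizer as an argument keeps the loop easy to try out without contacting the Google service.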
2. Text to Speech with Tacotron 2
If you want to use text to speech in an advanced project that needs natural-sounding audio, a deep-learning model is the better choice.
In this tutorial, I will show you one of the best-known deep-learning models for synthesizing audio from text: Tacotron 2.
Tacotron 2 is a deep-learning TTS model originally introduced by Google researchers; NVIDIA maintains an open-source PyTorch implementation of it. You can easily download their pre-trained models and use them on your local system for your project.
Installation
Before installing the required modules for Tacotron 2, I highly recommend creating a virtual environment and installing Tacotron 2 inside it.
To create the virtual environment, run the commands below:
conda create -n tts_python python=3.7
conda activate tts_python
Here tts_python is the virtual environment name.
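Note: Don’t change the Python version. The code depends on TensorFlow 1.x, which has no releases for Python 3.8 or newer; with TensorFlow 2.x, which removed tf.contrib, you get the error below:
Module 'tensorflow' has no attribute 'contrib'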
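A quick check inside the environment can catch this problem before any model code runs. This is only a sketch: `supports_tf1` and `has_tf_contrib` are hypothetical helper names, and the 3.5 to 3.7 range is my assumption about which Python 3 versions TensorFlow 1.15 was released for.

```python
import sys


def supports_tf1(version_info):
    """True if the Python version is in the range assumed to work with TensorFlow 1.15."""
    return (3, 5) <= tuple(version_info[:2]) <= (3, 7)


def has_tf_contrib():
    """True if TensorFlow can be imported and still exposes tf.contrib (1.x only)."""
    try:
        import tensorflow as tf
    except ImportError:
        return False
    return hasattr(tf, "contrib")


print("Python version suitable for TensorFlow 1.15:", supports_tf1(sys.version_info))
```

Once TensorFlow is installed, `has_tf_contrib()` returning False suggests that a 2.x release is active, which is the situation that triggers the error above.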
Inside the virtual environment, execute the commands below one by one.
Setup Tacotron 2
To set up Tacotron 2, execute the following commands:
git clone https://github.com/NVIDIA/tacotron2.git
cd tacotron2
git submodule init
git submodule update
pip install -r requirements.txt
Install Additional Packages
You need to install some additional packages. To install them, run the commands below:
pip install tensorflow==1.15.0
pip install inflect==0.2.5
pip install librosa -U
pip install Unidecode==1.0.22
pip install pillow
pip install matplotlib
Note: If you want to install TensorFlow for GPU, you can follow this tutorial: Install TensorFlow GPU with Jupyter Notebook for Windows
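After running the pip commands, it can help to confirm that every package is importable from the environment. The sketch below uses only the standard library; `REQUIRED_MODULES` and `find_missing` are hypothetical names, and the list contains import names, which differ from the pip names for some packages (for example, pillow is imported as PIL).

```python
import importlib.util

# Import names for the packages installed above.
REQUIRED_MODULES = ["tensorflow", "inflect", "librosa", "unidecode", "PIL", "matplotlib"]


def find_missing(modules):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


print("Missing modules:", find_missing(REQUIRED_MODULES))
```

An empty list means all of the packages were found. The check does not import the packages, so it will not surface version conflicts.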
Install PyTorch
To work with Tacotron 2, you have to install PyTorch with CUDA support, because the code in this tutorial moves the models to the GPU with .cuda(). To install PyTorch with CUDA, run the command below:
conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch
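Before going further, you may want to confirm that PyTorch can actually see a GPU. A minimal sketch follows; `cuda_status` is a hypothetical helper name, while `torch.cuda.is_available()` is the standard PyTorch call.

```python
def cuda_status():
    """Return 'torch-missing', 'cpu-only' or 'cuda' for the current environment."""
    try:
        import torch
    except ImportError:
        return "torch-missing"
    return "cuda" if torch.cuda.is_available() else "cpu-only"


print("PyTorch status:", cuda_status())
```

If this prints anything other than cuda, the later calls to .cuda() in this tutorial will fail, so it is worth fixing the installation first.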
Setup Jupyter Notebook
To configure Jupyter Notebook for your virtual environment, execute the commands below inside your environment:
conda install jupyter
conda install nb_conda
conda install ipykernel
Solve some errors
You may face the errors below while configuring Tacotron 2:
AttributeError: 'CacheManager' object has no attribute 'cachedir'
AttributeError: 'Memory' object has no attribute 'location'
To solve these errors, uninstall librosa and joblib, then install librosa again by running the commands below:
pip uninstall librosa
pip uninstall joblib
pip install librosa -U
Download Pre-trained model
NVIDIA published their pre-trained models, which you can use freely in your Python TTS project. You need to download two models:
- Tacotron 2: Download link
- WaveGlow: Download link
Once downloaded, place them inside the tacotron2 folder (the git-cloned folder).
These models were trained on the LJ Speech dataset, which contains about 24 hours of audio clips, so they have learned from a substantial amount of recorded speech.
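A missing or misplaced checkpoint file is an easy mistake to make at this step. The sketch below checks for the two file names used later in this tutorial; `CHECKPOINTS` and `missing_checkpoints` are hypothetical names, and it assumes you kept the downloaded files' names unchanged.

```python
from pathlib import Path

# File names as they are referenced in the loading code of this tutorial.
CHECKPOINTS = ["tacotron2_statedict.pt", "waveglow_256channels_universal_v5.pt"]


def missing_checkpoints(folder, names=CHECKPOINTS):
    """Return the checkpoint file names that are not present in `folder`."""
    folder = Path(folder)
    return [name for name in names if not (folder / name).is_file()]


# Run from inside the tacotron2 folder; an empty list means both files are in place.
print("Missing checkpoints:", missing_checkpoints("."))
```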
Generate Audio using Tacotron 2
Now we are all set to generate realistic, human-like audio from text using the deep-learning model Tacotron 2.
First, create a Jupyter notebook inside the tacotron2 folder (the git-cloned folder), then execute the code below there.
Import Required libraries
# Import library for TTS model
import matplotlib
%matplotlib inline
import matplotlib.pylab as plt
import IPython.display as ipd
import sys
sys.path.append('waveglow/')
import numpy as np
import torch
from hparams import create_hparams
from model import Tacotron2
from layers import TacotronSTFT, STFT
from audio_processing import griffin_lim
from train import load_model
from text import text_to_sequence
from denoiser import Denoiser
Define some parameters
# Define parameter
hparams = create_hparams()
hparams.sampling_rate = 22050
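The sampling rate is the number of audio samples per second, so it links the length of the generated waveform to its playing time. As a small illustration (the helper names are hypothetical, and 22050 is simply the value set above):

```python
SAMPLING_RATE = 22050  # samples per second, as set in hparams above


def duration_seconds(num_samples, sampling_rate=SAMPLING_RATE):
    """Playing time in seconds of a waveform with `num_samples` samples."""
    return num_samples / sampling_rate


def samples_needed(seconds, sampling_rate=SAMPLING_RATE):
    """Number of samples needed to hold `seconds` of audio."""
    return int(round(seconds * sampling_rate))


print(duration_seconds(22050))  # one second of audio
print(samples_needed(2.0))      # samples in two seconds of audio
```

Playing the audio back with a different rate than the one it was generated for changes its speed and pitch, which is why the same value is passed to the audio player later on.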
Load pre-trained TTS models
Now let’s load the two models that we have already downloaded.
# Load two pre-trained models for TTS
checkpoint_path = "tacotron2_statedict.pt"
model = load_model(hparams)
model.load_state_dict(torch.load(checkpoint_path)['state_dict'])
_ = model.cuda().eval()
waveglow_path = 'waveglow_256channels_universal_v5.pt'
waveglow = torch.load(waveglow_path)['model']
waveglow.cuda().eval()
for k in waveglow.convinv:
    k.float()
denoiser = Denoiser(waveglow)
Convert Text to Speech
# Generate audio from text
text = "natural language processing is really awesome!"
sequence = np.array(text_to_sequence(text, ['english_cleaners']))[None, :]
sequence = torch.autograd.Variable(
    torch.from_numpy(sequence)).cuda().long()
mel_outputs, mel_outputs_postnet, _, alignments = model.inference(sequence)
with torch.no_grad():
    audio = waveglow.infer(mel_outputs_postnet, sigma=0.666)
ipd.Audio(audio[0].data.cpu().numpy(), rate=hparams.sampling_rate)
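The notebook player only keeps the audio in memory. If you want a file on disk, one option is to write the samples as a 16-bit WAV with Python's standard library. This is a sketch under the assumption that the samples are floats in the range -1 to 1; `write_wav` is a hypothetical helper, not part of the Tacotron 2 code.

```python
import struct
import wave


def write_wav(path, samples, sampling_rate=22050):
    """Write mono float samples in [-1, 1] as a 16-bit PCM WAV file.

    Returns the number of samples written.
    """
    frames = bytearray()
    for sample in samples:
        sample = max(-1.0, min(1.0, float(sample)))  # clip to avoid overflow
        frames += struct.pack("<h", int(round(sample * 32767)))
    with wave.open(str(path), "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes = 16 bits per sample
        wav_file.setframerate(sampling_rate)
        wav_file.writeframes(bytes(frames))
    return len(frames) // 2
```

In the notebook, it could be called as `write_wav('tacotron_output.wav', audio[0].data.cpu().numpy(), hparams.sampling_rate)`. The Python loop is slow for long clips, so treat it as a convenience rather than a production approach.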
Output
After hearing this voice, you can tell that the audio quality is close to a natural human voice. It is clearly better than gTTS.
Although Tacotron 2 generates natural-sounding speech, its main drawback is that the pre-trained model supports only English. In this respect, gTTS is one step ahead.
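This trade-off suggests a simple rule for projects that need several languages: use Tacotron 2 for English and fall back to gTTS for everything else. A sketch of that rule, with the hypothetical name `choose_engine`:

```python
def choose_engine(lang):
    """Return 'tacotron2' for English language codes and 'gtts' otherwise."""
    base = lang.lower().split("-")[0]  # 'en-US' -> 'en'
    return "tacotron2" if base == "en" else "gtts"


print(choose_engine("en"))  # English goes to Tacotron 2
print(choose_engine("bn"))  # Bengali falls back to gTTS
```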
Conclusion
In this tutorial, I showed you two tools for text-to-speech conversion in Python that you can use in your project.
A shared disadvantage of these two tools (gTTS & Tacotron 2) is that they can only produce a default female voice. There is no option to produce a male voice.
That’s all for this tutorial. If you have any questions or suggestions regarding this tutorial, feel free to mention those in the comment section below.

Hi there, I’m Anindya Naskar, Data Science Engineer. I created this website to show you what I believe is the best possible way to get your start in the field of Data Science.
Thank you, exactly what I’ve been looking for!
# Define parameter
hparams = create_hparams()
hparams.sampling_rate = 22050
This chunk of code isn’t working for me. I am working in Google Colab and it gives me the following error:
AttributeError: module 'tensorflow' has no attribute 'contrib'
I followed your tutorial step by step.
Kindly answer me ASAP.
Thank you.
This issue is already discussed in the article. Please check which Python and TensorFlow versions you are using.