A few years ago, natural-sounding text-to-speech software was out of reach for most developers. With deep learning and artificial neural networks on the rise, however, you can now convert text to speech on your own computer. In this guide, you’ll learn how to convert text to speech with your own Python script and how to make the synthesized speech sound more natural.
Applications of Text to Speech
Text to speech (TTS) has several applications, such as:
- E-reader books: This kind of application can read a book or paper aloud for you
- Voice-enabled mobile apps: A good example of this kind of app is Google Maps driving navigation
- Siri: This Apple product uses TTS in the background
- Amazon Alexa
- Google Assistant
- TikTok TTS
- YouTube videos: With deep learning, TTS can now generate human-like audio that can be used as a voiceover for YouTube videos.
Text to Speech Python implementation
There are many text-to-speech libraries for Python. I found two of them very promising, so let me share them:
- Google Text to Speech (gTTS)
- Tacotron 2
1. Text to Speech with Google TTS
gTTS (Google Text-to-Speech) is a Python library for interacting with Google Translate’s text-to-speech API. It supports several languages, including Indian languages such as Hindi, Tamil, Bengali (Bangla), Kannada, and many more. You can find the complete list of languages supported by gTTS, with their language codes, below:
gtts languages list
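If you prefer to look up language codes from Python, the sketch below may help. It assumes the installed gTTS version provides gtts.lang.tts_langs(), which returns a dictionary mapping language codes to language names. The helper names find_language_codes and supported_languages are hypothetical and not part of gTTS.

```python
def find_language_codes(languages, wanted_names):
    """Return {name: code} for each wanted language name found in `languages`.

    `languages` maps language codes to names, e.g. {"hi": "Hindi"}.
    Matching is case-insensitive; names that are not found are skipped.
    """
    by_name = {name.lower(): code for code, name in languages.items()}
    return {
        name: by_name[name.lower()]
        for name in wanted_names
        if name.lower() in by_name
    }


def supported_languages():
    """Fetch the code-to-name mapping reported by the installed gTTS version."""
    # Imported here so that defining the helpers does not require gTTS.
    from gtts.lang import tts_langs  # assumes `pip install gTTS` was run

    return tts_langs()
```

With gTTS installed, calling find_language_codes(supported_languages(), ["Hindi", "Tamil"]) should return the codes to pass as the lang argument.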
The speech can be delivered at either of two speeds: normal (the default) or slow. In recent versions, however, you can no longer change the voice of the generated audio.
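The slower speed is selected through the slow argument of the gTTS constructor. The following is a minimal sketch, assuming gTTS is installed and the machine is online; save_speech is a hypothetical helper name, not a function of the library.

```python
def save_speech(text, out_path, lang="en", slow=False):
    """Synthesize `text` with gTTS and save the result as an MP3 file.

    save_speech is a hypothetical helper for this guide, not a gTTS function.
    """
    # Imported inside the function so that merely defining the helper
    # does not require gTTS to be installed.
    from gtts import gTTS

    tts = gTTS(text, lang=lang, slow=slow)  # slow=True selects the slower reading speed
    tts.save(out_path)
    return out_path
```

For example, save_speech("Natural language processing is really awesome!", "hello_slow.mp3", slow=True) would write a slowly spoken version of the sentence.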
To install gTTS, open a terminal or command prompt and type:
pip install gTTS
This works on any platform.
We’re now ready to write sample code that converts text to speech.
# Import text to speech conversion library
from gtts import gTTS

# Text2Speech generation
tts = gTTS('Natural language processing is really awesome!', lang='en')

# Save converted audio as mp3 format
tts.save('hello.mp3')
After listening to the output, you will notice that gTTS does not sound fully human; the voice is somewhat robotic. But since the library is free, open source, and easy to use, it is a good fit for hobby projects.
Let’s try some Indian languages.
# Text2Speech generation
tts = gTTS('আমি বাংলায় কথা বলতে পারি', lang='bn')
# Save converted audio as mp3 format
tts.save('bangla.mp3')

# Text2Speech generation
tts = gTTS('मैं हिन्दी में बोल सकता हूँ', lang='hi')
# Save converted audio as mp3 format
tts.save('hindi.mp3')
2. Text to Speech with Tacotron 2
If you want to use text to speech in a more advanced project that needs natural-sounding audio, a deep learning model is the better choice.
In this tutorial, I will show you one of the best-known deep-learning models for synthesizing audio from text: Tacotron 2.
Tacotron 2 is a deep-learning TTS model originally proposed by Google; NVIDIA publishes an open-source PyTorch implementation of it. You can easily download NVIDIA’s pre-trained models and use them on your local system for your project.
Before we install the required modules for Tacotron 2, I highly recommend that you create a virtual environment and install Tacotron 2 inside it.
To create and activate a virtual environment, run the commands below:
conda create -n tts_python python=3.7
conda activate tts_python
Here, tts_python is the name of the virtual environment.
Note: Don’t change the Python version, or you may get the TensorFlow error below:
Module 'tensorflow' has no attribute 'contrib'
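This error typically appears when TensorFlow 2.x ends up installed: the tf.contrib module was removed in TensorFlow 2.0, and TensorFlow 1.15 has no pip packages for Python versions newer than 3.7. A quick sanity check under those assumptions could look like the sketch below; both function names are hypothetical.

```python
import sys


def python_ok_for_tf1(version_info=sys.version_info):
    """True if the interpreter is Python 3.7 or older.

    Assumption: TensorFlow 1.15 ships no pip packages for newer interpreters.
    """
    return tuple(version_info[:2]) <= (3, 7)


def tensorflow_has_contrib(tf_version):
    """True for TensorFlow 1.x version strings; tf.contrib was removed in 2.0."""
    return int(tf_version.split(".")[0]) < 2


# Report on the interpreter that is running this snippet.
print("Python suitable for TensorFlow 1.15:", python_ok_for_tf1())
```

Once TensorFlow is installed, passing tensorflow.__version__ to tensorflow_has_contrib tells you whether the installed version still includes tf.contrib.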
Inside the virtual environment, run the commands below one by one.
Set up Tacotron 2
To set up Tacotron 2, run the following commands:
git clone https://github.com/NVIDIA/tacotron2.git
cd tacotron2
git submodule init
git submodule update
pip install -r requirements.txt
Install Additional Packages
You need to install some additional packages. To install them, run the commands below:
pip install tensorflow==1.15.0
pip install inflect==0.2.5
pip install librosa -U
pip install Unidecode==1.0.22
pip install pillow
pip install matplotlib
Note: If you want to install TensorFlow for GPU, you can follow this tutorial: Install TensorFlow GPU with Jupyter notebook for Windows
To work with Tacotron 2, you have to install PyTorch with CUDA support. To install it, run the command below:
conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch
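Because the notebook code later in this guide moves the models to the GPU with .cuda(), it is worth confirming that PyTorch can actually see a CUDA device before continuing. A minimal check, where cuda_status is a hypothetical helper name:

```python
def cuda_status():
    """Return 'torch-missing', 'cpu-only', or 'cuda-ok' for this environment."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in the active environment.
        return "torch-missing"
    # torch.cuda.is_available() is True only when a usable CUDA device is found.
    return "cuda-ok" if torch.cuda.is_available() else "cpu-only"


print(cuda_status())
```

If this prints anything other than cuda-ok, the model-loading steps below are likely to fail when they call .cuda().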
Setup Jupyter Notebook
To configure Jupyter Notebook for your virtual environment, run the commands below inside the environment:
conda install jupyter
conda install nb_conda
conda install ipykernel
Fix common errors
You may run into the errors below while configuring Tacotron 2:
AttributeError: 'CacheManager' object has no attribute 'cachedir'
AttributeError: 'Memory' object has no attribute 'location'
To fix them, uninstall librosa and joblib, then install librosa again (which reinstalls joblib as a dependency) by running the commands below:
pip uninstall librosa
pip uninstall joblib
pip install librosa -U
Download Pre-trained Models
NVIDIA has published pre-trained models that you can use freely in your Python TTS project. You need to download two of them:
- Tacotron 2: Download link
- WaveGlow: Download link
Once downloaded, place them inside the tacotron2 folder (the git-cloned folder).
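A missing or misplaced checkpoint only shows up later as a file-not-found error during loading, so a quick check can save time. The sketch below assumes the downloaded files keep the names used by the loading code in this guide; missing_checkpoints is a hypothetical helper name.

```python
from pathlib import Path

# File names as used by the model-loading code in this guide (an assumption
# about how the downloads are named on disk).
EXPECTED_CHECKPOINTS = (
    "tacotron2_statedict.pt",
    "waveglow_256channels_universal_v5.pt",
)


def missing_checkpoints(repo_dir="."):
    """Return the expected checkpoint file names that are absent from repo_dir."""
    repo = Path(repo_dir)
    return [name for name in EXPECTED_CHECKPOINTS if not (repo / name).is_file()]


# Run from inside the tacotron2 folder; an empty list means both files are in place.
print(missing_checkpoints("."))
```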
Both models were trained on the LJ Speech dataset, which contains about 24 hours of audio clips from a single English speaker. That gives you an idea of how much data stands behind these models.
Generate Audio using Tacotron 2
Now we are all set to generate realistic, human-like audio from text using the deep learning model Tacotron 2.
First, create a Jupyter notebook inside the tacotron2 folder (the git-cloned folder), then run the code below in it.
Import Required libraries
# Import library for TTS model
import matplotlib
%matplotlib inline
import matplotlib.pylab as plt
import IPython.display as ipd

import sys
sys.path.append('waveglow/')
import numpy as np
import torch

from hparams import create_hparams
from model import Tacotron2
from layers import TacotronSTFT, STFT
from audio_processing import griffin_lim
from train import load_model
from text import text_to_sequence
from denoiser import Denoiser
Define some parameters
# Define parameter
hparams = create_hparams()
hparams.sampling_rate = 22050
Load pre-trained TTS models
Now let’s load the two models we have already downloaded:
# Load two pre-trained models for TTS
checkpoint_path = "tacotron2_statedict.pt"
model = load_model(hparams)
model.load_state_dict(torch.load(checkpoint_path)['state_dict'])
_ = model.cuda().eval()

waveglow_path = 'waveglow_256channels_universal_v5.pt'
waveglow = torch.load(waveglow_path)['model']
waveglow.cuda().eval()
for k in waveglow.convinv:
    k.float()
denoiser = Denoiser(waveglow)
Convert Text to Speech
# Generate audio from text
text = "natural language processing is really awesome!"
sequence = np.array(text_to_sequence(text, ['english_cleaners']))[None, :]
sequence = torch.autograd.Variable(
    torch.from_numpy(sequence)).cuda().long()

mel_outputs, mel_outputs_postnet, _, alignments = model.inference(sequence)
with torch.no_grad():
    audio = waveglow.infer(mel_outputs_postnet, sigma=0.666)
ipd.Audio(audio.data.cpu().numpy(), rate=hparams.sampling_rate)
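The notebook cell above only plays the audio inline. To keep the result as a file, one option is to write it as a 16-bit WAV using Python's standard library. This is a sketch that assumes the samples are floats roughly in the range -1.0 to 1.0; write_wav is a hypothetical helper name, not part of the Tacotron 2 or WaveGlow code.

```python
import struct
import wave


def write_wav(path, samples, sample_rate=22050):
    """Write mono float samples (expected range -1.0..1.0) as 16-bit PCM WAV.

    Returns the number of audio frames written.
    """
    frames = bytearray()
    for sample in samples:
        clipped = max(-1.0, min(1.0, float(sample)))  # clip to avoid int16 overflow
        frames += struct.pack("<h", int(clipped * 32767))
    with wave.open(str(path), "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes per sample = 16-bit
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(bytes(frames))
    return len(frames) // 2
```

In the notebook, a call such as write_wav("speech.wav", audio[0].data.cpu().numpy().tolist(), hparams.sampling_rate) should save the generated speech next to the notebook.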
After listening to this voice, you will notice that the audio quality is close to a natural human voice and clearly better than gTTS.
Although Tacotron 2 sounds much more like a real human, its main drawback is that the pre-trained model supports only English. Here gTTS is one step ahead.
In this tutorial, I showed you two TTS frameworks for generating speech from text in Python that you can use in your own projects.
Another limitation of both tools (gTTS and Tacotron 2 with the pre-trained model) is that they only produce the default female voice. There is no option to generate a male voice.
That’s all for this tutorial. If you have any questions or suggestions regarding this tutorial, feel free to mention them in the comment section below.
Hi there, I’m Anindya Naskar, Data Science Engineer. I created this website to show you what I believe is the best possible way to get your start in the field of Data Science.