1. Overview
The Text-to-Speech API enables developers to generate human-like speech. The API converts text into audio formats such as WAV, MP3, or Ogg Opus. It also supports Speech Synthesis Markup Language (SSML) inputs to specify pauses, numbers, date and time formatting, and other pronunciation instructions.
In this tutorial, you will focus on using the Text-to-Speech API with Python.
What you'll learn
- How to use Cloud Shell
- How to enable the Text-to-Speech API
- How to authenticate API requests
- How to install the client library for Python
- How to list supported languages
- How to list available voices
- How to synthesize audio from text
What you'll need
2. Setup and requirements
Self-paced environment setup
- Sign in to the Google Cloud Console and create a new project or reuse an existing one. If you don't already have a Gmail or Google Workspace account, you must create one.
- The Project name is the display name for this project's participants. It is a character string not used by Google APIs. You can update it at any time.
- The Project ID is unique across all Google Cloud projects and is immutable (it cannot be changed after it has been set). The Cloud Console auto-generates a unique string; usually you don't care what it is. In most codelabs, you'll need to reference the Project ID (typically identified as PROJECT_ID). If you don't like the generated ID, you can generate another random one. Alternatively, you can try your own and see if it's available. It cannot be changed after this step and remains for the duration of the project.
- For your information, there is a third value, a Project Number, which some APIs use. Learn more about all three of these values in the documentation.
- Next, you'll need to enable billing in the Cloud Console to use Cloud resources/APIs. Running through this codelab shouldn't cost much, if anything at all. To shut down resources so you don't incur billing beyond this tutorial, you can delete the resources you created or delete the whole project. New users of Google Cloud are eligible for the $300 USD Free Trial program.
Start Cloud Shell
While Google Cloud can be operated remotely from your laptop, in this tutorial you will be using Cloud Shell, a command line environment running in the Cloud.
Activate Cloud Shell
- From the Cloud Console, click Activate Cloud Shell.
If you've never started Cloud Shell before, you're presented with an intermediate screen describing what it is. If that's the case, click Continue; the screen is shown only once.
It should only take a few moments to provision and connect to Cloud Shell.
This virtual machine is loaded with all the development tools you need. It offers a persistent 5GB home directory and runs in Google Cloud, greatly enhancing network performance and authentication. Much, if not all, of your work in this codelab can be done with simply a browser or your Chromebook.
Once connected to Cloud Shell, you should see that you are already authenticated and that the project is already set to your project ID.
- Run the following command in Cloud Shell to confirm that you are authenticated:
gcloud auth list
Command output
Credentialed Accounts
ACTIVE  ACCOUNT
*       <my_account>@<my_domain.com>

To set the active account, run:
    $ gcloud config set account `ACCOUNT`
- Run the following command in Cloud Shell to confirm that the gcloud command knows about your project:
gcloud config list project
Command output
[core]
project = <PROJECT_ID>
If it is not, you can set it with this command:
gcloud config set project <PROJECT_ID>
Command output
Updated property [core/project].
3. Enable the API
Before you can begin using the Text-to-Speech API, you must enable it. Using Cloud Shell, you can enable the API with the following command:
gcloud services enable texttospeech.googleapis.com
4. Authenticate API requests
To make requests to the Text-to-Speech API, you need to use a Service Account. A Service Account belongs to your project and it is used by the Python client library to make Text-to-Speech API requests. Like any other user account, a service account is represented by an email address. In this section, you will use the Cloud SDK to create a service account and then create credentials you will need to authenticate as the service account.
First, set a PROJECT_ID environment variable:
export PROJECT_ID=$(gcloud config get-value core/project)
Next, create a new service account to access the Text-to-Speech API by using:
gcloud iam service-accounts create my-tts-sa \
  --display-name "my tts service account"
Grant the service account the permission to use the service:
gcloud projects add-iam-policy-binding ${PROJECT_ID} \
  --member serviceAccount:my-tts-sa@${PROJECT_ID}.iam.gserviceaccount.com \
  --role roles/serviceusage.serviceUsageConsumer
Create credentials that your Python code will use to log in as your new service account. Create and save these credentials as a ~/key.json JSON file by using the following command:
gcloud iam service-accounts keys create ~/key.json \
  --iam-account my-tts-sa@${PROJECT_ID}.iam.gserviceaccount.com
Finally, set the GOOGLE_APPLICATION_CREDENTIALS environment variable, which the Text-to-Speech client library (covered in the next step) uses to find your credentials. Set the variable to the full path of the credentials JSON file you created:
export GOOGLE_APPLICATION_CREDENTIALS=~/key.json
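Before moving on, it can help to confirm that the variable points at a readable key file. The following is a minimal sketch rather than part of any Google library: check_credentials_file is a hypothetical helper name, and the fields it looks for (type, client_email, private_key) are assumed to be present in a service account key file.

```python
import json
import os


def check_credentials_file(path=None):
    """Return the service account email from a key file, or raise ValueError.

    Hypothetical helper for illustration; not part of any Google library.
    """
    # Fall back to the environment variable set in this step.
    path = path or os.environ.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path:
        raise ValueError("GOOGLE_APPLICATION_CREDENTIALS is not set")

    # Expand "~" in case the path was stored unexpanded.
    path = os.path.expanduser(path)
    if not os.path.isfile(path):
        raise ValueError(f"No such file: {path}")

    with open(path, encoding="utf-8") as f:
        try:
            info = json.load(f)
        except json.JSONDecodeError as exc:
            raise ValueError(f"{path} is not valid JSON") from exc

    if not isinstance(info, dict):
        raise ValueError(f"{path} does not contain a JSON object")

    # Assumption: a service account key file carries these fields.
    missing = [k for k in ("type", "client_email", "private_key") if k not in info]
    if missing:
        raise ValueError(f"Missing fields in {path}: {', '.join(missing)}")
    if info["type"] != "service_account":
        raise ValueError(f"Unexpected credential type: {info['type']}")
    return info["client_email"]
```

Calling check_credentials_file() with no argument reads the path from the environment variable, so you could run it once in your Python session before making the first API request.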
5. Install the client library
Install the client library:
pip3 install --user --upgrade google-cloud-texttospeech
You should see something like this:
...
Installing collected packages: google-cloud-texttospeech
Successfully installed google-cloud-texttospeech-2.13.0
Now, you're ready to use the Text-to-Speech API!
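If you want to double-check the installation from Python rather than from the pip output, a small sketch like this one reads the installed version with the standard library. The helper name installed_version is hypothetical.

```python
from importlib import metadata


def installed_version(package_name="google-cloud-texttospeech"):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Prints a version string, or None if the package is not installed.
print(installed_version())
```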
6. Start Interactive Python
In this tutorial, you'll use an interactive Python interpreter called IPython. Start a session by running ipython in Cloud Shell. This command runs the Python interpreter in an interactive session.
ipython
You should see something like this:
Python 3.9.2 (default, Feb 28 2021, 17:03:44)
Type 'copyright', 'credits' or 'license' for more information
IPython 8.7.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]:
7. List supported languages
In this section, you will get the list of all supported languages.
Copy the following code into your IPython session:
import google.cloud.texttospeech as tts

def unique_languages_from_voices(voices):
    language_set = set()
    for voice in voices:
        for language_code in voice.language_codes:
            language_set.add(language_code)
    return language_set

def list_languages():
    client = tts.TextToSpeechClient()
    response = client.list_voices()
    languages = unique_languages_from_voices(response.voices)

    print(f" Languages: {len(languages)} ".center(60, "-"))
    for i, language in enumerate(sorted(languages)):
        print(f"{language:>10}", end="\n" if i % 5 == 4 else "")
Take a moment to study the code and see how it uses the list_voices client library method to build the list of supported languages.
Call the function:
list_languages()
You should get this (or a larger) list:
---------------------- Languages: 53 -----------------------
     af-ZA     ar-XA     bg-BG     bn-IN     ca-ES
    cmn-CN    cmn-TW     cs-CZ     da-DK     de-DE
     el-GR     en-AU     en-GB     en-IN     en-US
     es-ES     es-US     fi-FI    fil-PH     fr-CA
     fr-FR     gu-IN     hi-IN     hu-HU     id-ID
     is-IS     it-IT     ja-JP     kn-IN     ko-KR
     lv-LV     ml-IN     mr-IN     ms-MY     nb-NO
     nl-BE     nl-NL     pa-IN     pl-PL     pt-BR
     pt-PT     ro-RO     ru-RU     sk-SK     sr-RS
     sv-SE     ta-IN     te-IN     th-TH     tr-TR
     uk-UA     vi-VN    yue-HK
The list shows 53 languages and variants such as:
- Chinese and Taiwanese Mandarin,
- Australian, British, Indian, and American English,
- French from Canada and France,
- Portuguese from Brazil and Portugal.
This list is not fixed and will grow as new voices become available.
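To see the variants of each language at a glance, you could group the codes by their base language. This is a minimal sketch that works on plain strings, so it doesn't call the API. The helper name group_by_language is hypothetical, and it assumes every code has the form <language>-<region>, as in the list above.

```python
from collections import defaultdict


def group_by_language(language_codes):
    """Group codes such as "en-AU" under their base language ("en")."""
    groups = defaultdict(list)
    for code in sorted(language_codes):
        # The base language is everything before the first hyphen.
        base = code.split("-")[0]
        groups[base].append(code)
    return dict(groups)


sample = ["en-AU", "en-GB", "en-IN", "en-US", "fr-CA", "fr-FR", "cmn-CN", "cmn-TW"]
for base, variants in group_by_language(sample).items():
    print(f"{base:>4}: {', '.join(variants)}")
```

In your session, you could pass it the set returned by unique_languages_from_voices(response.voices) instead of the sample list.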
Summary
In this step, you were able to list the supported languages.
8. List available voices
In this section, you will get the list of voices available in different languages.
Copy the following code into your IPython session:
import google.cloud.texttospeech as tts

def list_voices(language_code=None):
    client = tts.TextToSpeechClient()
    response = client.list_voices(language_code=language_code)
    voices = sorted(response.voices, key=lambda voice: voice.name)

    print(f" Voices: {len(voices)} ".center(60, "-"))
    for voice in voices:
        languages = ", ".join(voice.language_codes)
        name = voice.name
        gender = tts.SsmlVoiceGender(voice.ssml_gender).name
        rate = voice.natural_sample_rate_hertz
        print(f"{languages:<8} | {name:<24} | {gender:<8} | {rate:,} Hz")
Take a moment to study the code and see how it uses the client library method list_voices(language_code) to list voices available for a given language.
Now, get the list of available German voices:
list_voices("de")
You should see something like this:
------------------------ Voices: 20 ------------------------
de-DE    | de-DE-Neural2-D          | MALE     | 24,000 Hz
de-DE    | de-DE-Neural2-F          | FEMALE   | 24,000 Hz
de-DE    | de-DE-Standard-A         | FEMALE   | 24,000 Hz
de-DE    | de-DE-Standard-B         | MALE     | 24,000 Hz
...
de-DE    | de-DE-Wavenet-E          | MALE     | 24,000 Hz
de-DE    | de-DE-Wavenet-F          | FEMALE   | 24,000 Hz
Multiple female and male voices are available, as well as standard, WaveNet, and Neural2 voices:
- Standard voices are generated by signal processing algorithms.
- WaveNet and Neural2 voices are higher-quality voices that are synthesized by machine learning models and sound more natural.
Now, get the list of available English voices:
list_voices("en")
You should get something like this:
------------------------ Voices: 101 -----------------------
en-AU    | en-AU-Neural2-A          | FEMALE   | 24,000 Hz
...
en-AU    | en-AU-Standard-A         | FEMALE   | 24,000 Hz
...
en-AU    | en-AU-Wavenet-D          | MALE     | 24,000 Hz
en-GB    | en-GB-Neural2-A          | FEMALE   | 24,000 Hz
...
en-GB    | en-GB-Standard-A         | FEMALE   | 24,000 Hz
...
en-GB    | en-GB-Wavenet-F          | FEMALE   | 24,000 Hz
en-IN    | en-IN-Standard-A         | FEMALE   | 24,000 Hz
...
en-IN    | en-IN-Wavenet-D          | FEMALE   | 24,000 Hz
en-US    | en-US-Neural2-A          | MALE     | 24,000 Hz
en-US    | en-US-Standard-A         | MALE     | 24,000 Hz
...
en-US    | en-US-Wavenet-J          | MALE     | 24,000 Hz
In addition to a selection of multiple voices in different genders and qualities, multiple accents are available: Australian, British, Indian, and American English.
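The voice names in the sample output appear to encode the language, region, and voice type. As a rough sketch, you could split names on that pattern to count how many voices of each type are available. The helper names parse_voice_name and count_voice_types are hypothetical, and the <language>-<region>-<type>-<variant> pattern is an assumption based only on the names shown above; names that follow a different scheme are rejected rather than guessed at.

```python
from collections import Counter


def parse_voice_name(voice_name):
    """Split a name like "de-DE-Neural2-D" into (language_code, voice_type, variant).

    Assumption: names follow <language>-<region>-<type>-<variant>.
    """
    parts = voice_name.split("-")
    if len(parts) < 4:
        raise ValueError(f"Unexpected voice name: {voice_name!r}")
    language_code = "-".join(parts[:2])
    voice_type = parts[2]
    variant = "-".join(parts[3:])
    return language_code, voice_type, variant


def count_voice_types(voice_names):
    """Count how many voices of each type appear in a list of voice names."""
    return Counter(parse_voice_name(name)[1] for name in voice_names)


names = ["de-DE-Neural2-D", "de-DE-Standard-A", "de-DE-Standard-B", "de-DE-Wavenet-E"]
print(count_voice_types(names))
```

With real data, you could pass in [voice.name for voice in response.voices] from a list_voices response.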
Take a moment to list the voices available for your preferred languages and variants (or even all of them):
list_voices("fr")
list_voices("pt")
list_voices()
Summary
In this step, you were able to list available voices. You can also find the complete list of voices available on the Supported voices and languages page.
9. Synthesize audio from text
You can use the Text-to-Speech API to convert a string into audio data. You can configure the output of speech synthesis in a variety of ways, including selecting a unique voice or modulating the output in pitch, volume, speaking rate, and sample rate.
Copy the following code into your IPython session:
import google.cloud.texttospeech as tts

def text_to_wav(voice_name: str, text: str):
    language_code = "-".join(voice_name.split("-")[:2])
    text_input = tts.SynthesisInput(text=text)
    voice_params = tts.VoiceSelectionParams(
        language_code=language_code, name=voice_name
    )
    audio_config = tts.AudioConfig(audio_encoding=tts.AudioEncoding.LINEAR16)

    client = tts.TextToSpeechClient()
    response = client.synthesize_speech(
        input=text_input, voice=voice_params, audio_config=audio_config
    )

    filename = f"{language_code}.wav"
    with open(filename, "wb") as out:
        out.write(response.audio_content)
        print(f'Generated speech saved to "{filename}"')
Take a moment to study the code and see how it uses the synthesize_speech client library method to generate the audio data and save it as a WAV file.
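The text_to_wav function only sets the audio encoding. If you also want to adjust pitch, volume, or speaking rate, a sketch like the following could validate the values before you pass them on. The helper name audio_options and the LIMITS table are hypothetical, and the numeric ranges are assumptions; check them against the current AudioConfig reference before relying on them.

```python
# Assumed limits for illustration; verify against the AudioConfig reference.
LIMITS = {
    "speaking_rate": (0.25, 4.0),     # 1.0 is the normal speed
    "pitch": (-20.0, 20.0),           # semitones; 0.0 leaves pitch unchanged
    "volume_gain_db": (-96.0, 16.0),  # decibels; 0.0 leaves volume unchanged
}


def audio_options(**options):
    """Validate tuning options and return them as keyword arguments."""
    for name, value in options.items():
        if name not in LIMITS:
            raise ValueError(f"Unknown option: {name}")
        low, high = LIMITS[name]
        if not low <= value <= high:
            raise ValueError(f"{name}={value} is outside [{low}, {high}]")
    return dict(options)


print(audio_options(speaking_rate=0.9, pitch=-2.0))
```

Inside text_to_wav, you could then build the configuration with tts.AudioConfig(audio_encoding=tts.AudioEncoding.LINEAR16, **audio_options(speaking_rate=0.9, pitch=-2.0)).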
Now, generate sentences in a few different accents:
text_to_wav("en-AU-Wavenet-A", "What is the temperature in Sydney?")
text_to_wav("en-GB-Wavenet-B", "What is the temperature in London?")
text_to_wav("en-IN-Wavenet-C", "What is the temperature in Delhi?")
text_to_wav("en-US-Wavenet-F", "What is the temperature in New York?")
You should see something like this:
Generated speech saved to "en-AU.wav"
Generated speech saved to "en-GB.wav"
Generated speech saved to "en-IN.wav"
Generated speech saved to "en-US.wav"
To download all generated files at once, you can use this Cloud Shell command from your Python environment:
import os
os.system("cloudshell download en-*.wav")
Confirm the download prompt, and your browser will download the files.
Open the files and listen to the results.
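If you'd rather check the files without leaving Python, the standard library wave module can read basic properties of a WAV file. This is a minimal sketch that assumes the saved files carry a regular WAV header that the wave module can parse; describe_wav is a hypothetical helper name.

```python
import os
import wave


def describe_wav(filename):
    """Return (channels, sample_rate_hz, duration_seconds) for a WAV file."""
    with wave.open(filename, "rb") as wav:
        channels = wav.getnchannels()
        sample_rate = wav.getframerate()
        # Duration is the number of frames divided by frames per second.
        duration = wav.getnframes() / sample_rate
    return channels, sample_rate, duration


# Only inspect the file if it was generated in the previous step.
if os.path.exists("en-AU.wav"):
    print(describe_wav("en-AU.wav"))
```

The reported sample rate would typically match the natural sample rate shown for the voice in the voice list.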
Summary
In this step, you were able to use the Text-to-Speech API to convert sentences into WAV audio files. Read more about creating voice audio files.
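This step synthesized plain text, but the overview also mentions SSML input for controlling pauses and pronunciation. As a minimal sketch, the following builds an SSML string with a pause between sentences. The helper name sentences_to_ssml and the default pause length are hypothetical choices for illustration.

```python
from xml.sax.saxutils import escape


def sentences_to_ssml(sentences, pause_ms=500):
    """Join sentences into one SSML document with a pause between them."""
    pause = f'<break time="{int(pause_ms)}ms"/>'
    # Escape &, <, and > so the text can't break the markup.
    body = pause.join(f"<s>{escape(sentence)}</s>" for sentence in sentences)
    return f"<speak>{body}</speak>"


ssml = sentences_to_ssml(["Rock & roll.", "What is the temperature in Sydney?"])
print(ssml)
```

To synthesize it, you could pass the string as tts.SynthesisInput(ssml=ssml) in place of tts.SynthesisInput(text=text) in text_to_wav.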
10. Congratulations!
You learned how to use the Text-to-Speech API using Python to generate human-like speech!
Clean up
To avoid incurring charges to your Google Cloud account for the resources used in this tutorial:
- In the Cloud Console, go to the Manage resources page.
- In the project list, select your project then click Delete.
- In the dialog, type the project ID and then click Shut down to delete the project.
Learn more
- Test the demo in your browser: https://cloud.google.com/text-to-speech
- Text-to-Speech documentation: https://cloud.google.com/text-to-speech/docs
- Python on Google Cloud: https://cloud.google.com/python
- Cloud Client Libraries for Python: https://googlecloudplatform.github.io/google-cloud-python
License
This work is licensed under a Creative Commons Attribution 2.0 Generic License.