The Text-to-Speech API enables developers to generate human-like speech. The API converts text into audio formats such as WAV, MP3, or Ogg Opus. It also supports Speech Synthesis Markup Language (SSML) inputs to specify pauses, numbers, date and time formatting, and other pronunciation instructions.

In this tutorial, you will focus on using the Text-to-Speech API with Python.

What you'll learn

What you'll need

Codelab-at-a-conference setup

If you're using a kiosk at Google I/O, a test project has already been created for you. You can access it by going to https://console.cloud.google.com/.

These temporary accounts have existing projects that are set up with billing, so there are no costs for you associated with running this codelab.

Note that all these accounts will be disabled soon after the codelab is over.

Use these credentials to log into the machine or to open a new Google Cloud Console window at https://console.cloud.google.com/. Accept the Terms of Service for the new account, along with any updates to the Terms of Service.

When presented with the console landing page, select the only project available. Alternatively, from the console home page, click "Select a Project":

Start Cloud Shell

While Google Cloud can be operated remotely from your laptop, in this tutorial you will be using Cloud Shell, a command line environment running in the Cloud.

Activate Google Cloud Shell

From the GCP Console click the Cloud Shell icon on the top right toolbar:

If you've never started Cloud Shell before, you'll be presented with an intermediate screen (below the fold) describing what it is. If that's the case, click "Continue" (and you won't ever see it again). Here's what that one-time screen looks like:

It should only take a few moments to provision and connect to the shell environment:

This virtual machine is loaded with all the development tools you'll need. It offers a persistent 5GB home directory and runs on Google Cloud, greatly enhancing network performance and authentication. Much, if not all, of your work in this codelab can be done with just a browser or your Chromebook.

Once connected to Cloud Shell, you should see that you are already authenticated and that the project is already set to your PROJECT_ID.

Run the following command in Cloud Shell to confirm that you are authenticated:

gcloud auth list

Command output

Credentialed accounts:
 - <myaccount>@<mydomain>.com (active)

Run the following command in Cloud Shell to confirm that the gcloud command knows about your project:

gcloud config list project

Command output

[core]
project = <PROJECT_ID>

If the project is not set, you can set it with this command:

gcloud config set project <PROJECT_ID>

Command output

Updated property [core/project].

Before you can begin using the Text-to-Speech API, you must enable it. In Cloud Shell, you can enable the API with the following command:

gcloud services enable texttospeech.googleapis.com

In order to make requests to the Text-to-Speech API, you need to use a service account. A service account belongs to your project, and it is used by the Python client library to make Text-to-Speech API requests. Like any other user account, a service account is represented by an email address. In this section, you will use the Cloud SDK to create a service account and then create the credentials you need to authenticate as the service account.

First, set a PROJECT_ID environment variable:

export PROJECT_ID=$(gcloud config get-value core/project)

Next, create a new service account to access the Text-to-Speech API by running:

gcloud iam service-accounts create my-tts-sa \
  --display-name "my tts service account"

Next, create credentials that your Python code will use to log in as your new service account. Create and save these credentials as a JSON file named ~/key.json by running the following command:

gcloud iam service-accounts keys create ~/key.json \
  --iam-account my-tts-sa@${PROJECT_ID}.iam.gserviceaccount.com

Finally, set the GOOGLE_APPLICATION_CREDENTIALS environment variable, which is used by the Text-to-Speech client library, covered in the next step, to find your credentials. The environment variable should be set to the full path of the credentials JSON file you created:

export GOOGLE_APPLICATION_CREDENTIALS=~/key.json
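
For reference, once the client library is installed in the next step, your Python code could also load the key file explicitly instead of relying on this environment variable. Here is a hedged sketch using google-auth (the key path is illustrative; this codelab sticks with the environment variable approach):

# Hedged sketch: load the service account key explicitly instead of relying
# on GOOGLE_APPLICATION_CREDENTIALS (the path is illustrative).
from google.oauth2 import service_account
from google.cloud import texttospeech

credentials = service_account.Credentials.from_service_account_file('key.json')
client = texttospeech.TextToSpeechClient(credentials=credentials)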

Install the client library:

pip3 install --user --upgrade google-cloud-texttospeech

You should see something like this:

...
Installing collected packages: google-cloud-texttospeech
Successfully installed google-cloud-texttospeech-1.0.1

Now, you're ready to use the Text-to-Speech API!

In this tutorial, you'll use an interactive Python interpreter called IPython. Start a session by running ipython in Cloud Shell. This command runs the Python interpreter in an interactive session.

ipython

You should see something like this:

Python 3.7.3 (default, Mar 31 2020, 14:50:17)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.13.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]:

In this section, you will get the list of all supported languages.

Copy the following code into your IPython session:

from google.cloud import texttospeech


def list_languages():
    client = texttospeech.TextToSpeechClient()
    voices = client.list_voices().voices
    languages = unique_languages_from_voices(voices)

    print(f' Languages: {len(languages)} '.center(60, '-'))
    for i, language in enumerate(sorted(languages)):
        print(f'{language:>10}', end='' if i % 5 < 4 else '\n')


def unique_languages_from_voices(voices):
    language_set = set()
    for voice in voices:
        for language_code in voice.language_codes:
            language_set.add(language_code)
    return language_set 

Take a moment to study the code and see how it uses the list_voices client library method to build the list of supported languages.

Call the function:

list_languages()

You should get this (or a larger) list:

---------------------- Languages: 33 -----------------------
     ar-XA    cmn-CN    cmn-TW     cs-CZ     da-DK
     de-DE     el-GR     en-AU     en-GB     en-IN
     en-US     es-ES     fi-FI    fil-PH     fr-CA
     fr-FR     hi-IN     hu-HU     id-ID     it-IT
     ja-JP     ko-KR     nb-NO     nl-NL     pl-PL
     pt-BR     pt-PT     ru-RU     sk-SK     sv-SE
     tr-TR     uk-UA     vi-VN

The list shows over 30 languages and variants, such as English in Australia, the United Kingdom, India, and the United States (en-AU, en-GB, en-IN, en-US), Chinese in China and Taiwan (cmn-CN, cmn-TW), and Portuguese in Brazil and Portugal (pt-BR, pt-PT).

This list is not fixed and will grow as new voices become available.

Summary

In this step, you were able to list supported languages.

In this section, you will get the list of voices available in different languages.

Copy the following code into your IPython session:

from google.cloud import texttospeech
from google.cloud.texttospeech import enums


def list_voices(language_code=None):
    client = texttospeech.TextToSpeechClient()
    response = client.list_voices(language_code)
    voices = sorted(response.voices, key=lambda voice: voice.name)

    print(f' Voices: {len(voices)} '.center(60, '-'))
    for voice in voices:
        languages = ', '.join(voice.language_codes)
        name = voice.name
        gender = enums.SsmlVoiceGender(voice.ssml_gender).name
        rate = voice.natural_sample_rate_hertz
        print(f'{languages:<8}',
              f'{name:<24}',
              f'{gender:<8}',
              f'{rate:,} Hz',
              sep=' | ')

Take a moment to study the code and see how it uses the client library method list_voices(language_code) to list voices available for a given language.

Now, get the list of available German voices:

list_voices('de')

You should see something like this:

------------------------ Voices: 10 ------------------------
de-DE    | de-DE-Standard-A         | FEMALE   | 24,000 Hz
de-DE    | de-DE-Standard-B         | MALE     | 24,000 Hz
de-DE    | de-DE-Standard-E         | MALE     | 24,000 Hz
de-DE    | de-DE-Standard-F         | FEMALE   | 22,050 Hz
de-DE    | de-DE-Wavenet-A          | FEMALE   | 24,000 Hz
de-DE    | de-DE-Wavenet-B          | MALE     | 24,000 Hz
de-DE    | de-DE-Wavenet-C          | FEMALE   | 24,000 Hz
de-DE    | de-DE-Wavenet-D          | MALE     | 24,000 Hz
de-DE    | de-DE-Wavenet-E          | MALE     | 24,000 Hz
de-DE    | de-DE-Wavenet-F          | FEMALE   | 24,000 Hz

Multiple female and male voices are available, in both standard and WaveNet quality. WaveNet voices are generated by a machine learning model and generally sound more natural than standard voices.
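
If you want to select only the WaveNet voices programmatically, one option is to filter on the voice name, relying on the naming convention visible in the output above. A hedged sketch:

# Hedged sketch: keep only WaveNet voices, based on the voice naming convention.
from google.cloud import texttospeech

client = texttospeech.TextToSpeechClient()
wavenet_voices = [voice for voice in client.list_voices('de').voices
                  if 'Wavenet' in voice.name]
print(f'WaveNet voices for "de": {len(wavenet_voices)}')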

Now, get the list of available English voices:

list_voices('en')

You should get something like this:

------------------------ Voices: 34 ------------------------
en-AU    | en-AU-Standard-A         | FEMALE   | 24,000 Hz
...
en-AU    | en-AU-Wavenet-D          | MALE     | 24,000 Hz
en-GB    | en-GB-Standard-A         | FEMALE   | 24,000 Hz
...
en-GB    | en-GB-Wavenet-D          | MALE     | 24,000 Hz
en-IN    | en-IN-Standard-A         | FEMALE   | 24,000 Hz
...
en-IN    | en-IN-Wavenet-D          | FEMALE   | 24,000 Hz
en-US    | en-US-Standard-B         | MALE     | 24,000 Hz
...
en-US    | en-US-Wavenet-F          | FEMALE   | 24,000 Hz

In addition to a selection of multiple voices in different genders and qualities, multiple accents are available: Australian, British, Indian, and American English.

Take a moment to list the voices available for your preferred languages (or even all of them):

list_voices('fr')
list_voices('pt')
...
list_voices()

Summary

In this step, you were able to list available voices. You can also find the complete list of voices available on the Supported Voices page.

You can use the Text-to-Speech API to convert a string into audio data. You can configure the output of speech synthesis in a variety of ways, including selecting a unique voice or modulating the output in pitch, volume, speaking rate, and sample rate.

Copy the following code into your IPython session:

from google.cloud import texttospeech
from google.cloud.texttospeech import enums, types


def text_to_wav(voice_name, text):
    language_code = '-'.join(voice_name.split('-')[:2])
    text_input = types.SynthesisInput(text=text)
    voice_params = types.VoiceSelectionParams(
        language_code=language_code,
        name=voice_name)
    audio_config = types.AudioConfig(
        audio_encoding=enums.AudioEncoding.LINEAR16)

    client = texttospeech.TextToSpeechClient()
    response = client.synthesize_speech(text_input, voice_params, audio_config)

    filename = f'{language_code}.wav'
    with open(filename, 'wb') as out:
        out.write(response.audio_content)
        print(f'Audio content written to "{filename}"')

Take a moment to study the code and see how it uses the synthesize_speech client library method to generate the audio data and save it as a wav file.
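
The audio configuration above only sets the encoding, but AudioConfig also accepts optional tuning parameters such as speaking_rate, pitch, and volume_gain_db, as mentioned earlier. Here is a hedged sketch of an alternative configuration (the values are illustrative):

# Hedged sketch: optional AudioConfig tuning parameters (illustrative values).
from google.cloud.texttospeech import enums, types

audio_config = types.AudioConfig(
    audio_encoding=enums.AudioEncoding.LINEAR16,
    speaking_rate=0.9,    # 0.25 to 4.0; 1.0 is the voice's natural speed
    pitch=2.0,            # -20.0 to 20.0 semitones relative to the default
    volume_gain_db=0.0)   # gain in dB relative to the voice's natural volume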

Now, generate sentences in a few different accents:

text_to_wav('en-AU-Wavenet-A', 'What is the temperature in Sydney?')
text_to_wav('en-GB-Wavenet-B', 'What is the temperature in London?')
text_to_wav('en-IN-Wavenet-C', 'What is the temperature in Delhi?')
text_to_wav('en-US-Wavenet-F', 'What is the temperature in New York?')

You should see something like this:

Audio content written to "en-AU.wav"
Audio content written to "en-GB.wav"
Audio content written to "en-IN.wav"
Audio content written to "en-US.wav"

To download all generated files at once, you can use this Cloud Shell command from your Python environment:

import os
os.system('cloudshell download en-*.wav')

Confirm the download prompt and your browser will download the files.

Open the files and listen to the results.
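
The API also accepts SSML input, as mentioned at the start of this codelab. Here is a hedged sketch, reusing the same client calls as text_to_wav, that synthesizes an SSML string containing a pause and a date (the markup and voice are illustrative):

# Hedged sketch: synthesize SSML instead of plain text (illustrative markup).
from google.cloud import texttospeech
from google.cloud.texttospeech import enums, types

ssml = ('<speak>Hello!<break time="500ms"/>'
        'Today is <say-as interpret-as="date" format="yyyymmdd">20200101</say-as>.'
        '</speak>')
ssml_input = types.SynthesisInput(ssml=ssml)
voice_params = types.VoiceSelectionParams(
    language_code='en-US', name='en-US-Wavenet-F')
audio_config = types.AudioConfig(audio_encoding=enums.AudioEncoding.LINEAR16)

client = texttospeech.TextToSpeechClient()
response = client.synthesize_speech(ssml_input, voice_params, audio_config)
with open('en-US-ssml.wav', 'wb') as out:
    out.write(response.audio_content)
    print('Audio content written to "en-US-ssml.wav"')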

Summary

In this step, you were able to use the Text-to-Speech API to convert sentences into WAV audio files. Read more about creating voice audio files.

You learned how to use the Text-to-Speech API with Python to generate human-like speech!

Clean up

To avoid incurring charges to your Google Cloud account for the resources used in this tutorial:

Learn more

License

This work is licensed under a Creative Commons Attribution 2.0 Generic License.