The Speech-to-Text API enables developers to convert audio to text in over 120 languages and variants by applying powerful neural network models through an easy-to-use API.
In this tutorial, you will focus on using the Speech-to-Text API with Python.
If you are using a kiosk at Google I/O, a test project has already been created for you and can be accessed at https://console.cloud.google.com/.
These temporary accounts have existing projects that are set up with billing, so running this codelab costs you nothing.
Note that all these accounts will be disabled soon after the codelab is over.
Use these credentials to log in to the machine or to open a new Google Cloud Console window at https://console.cloud.google.com/. Accept the new account Terms of Service and any updates to the Terms of Service.
When presented with the console landing page, select the only project available. Alternatively, from the console home page, click "Select a Project":
While Google Cloud can be operated remotely from your laptop, in this tutorial you will be using Cloud Shell, a command line environment running in the Cloud.
From the GCP Console click the Cloud Shell icon on the top right toolbar:
If you've never started Cloud Shell before, you'll be presented with an intermediate screen (below the fold) describing what it is. If that's the case, click "Continue" (and you won't ever see it again). Here's what that one-time screen looks like:
It should only take a few moments to provision and connect to the shell environment:
This virtual machine is loaded with all the development tools you'll need. It offers a persistent 5GB home directory and runs on Google Cloud, which greatly enhances network performance and authentication. Much, if not all, of your work in this lab can be done with just a browser or your Google Chromebook.
Once connected to Cloud Shell, you should see that you are already authenticated and that the project is already set to your PROJECT_ID.
Run the following command in Cloud Shell to confirm that you are authenticated:
gcloud auth list
Command output
Credentialed accounts:
 - <myaccount>@<mydomain>.com (active)
Run the following command in Cloud Shell to confirm that gcloud knows about your project:
gcloud config list project
Command output
[core]
project = <PROJECT_ID>
If the project is not set, you can set it with this command:
gcloud config set project <PROJECT_ID>
Command output
Updated property [core/project].
Before you can begin using the Speech-to-Text API, you must enable the API. Using Cloud Shell, you can enable the API with the following command:
gcloud services enable speech.googleapis.com
In order to make requests to the Speech-to-Text API, you need to use a Service Account. A Service Account belongs to your project and it is used by the Python client library to make Speech-to-Text API requests. Like any other user account, a service account is represented by an email address. In this section, you will use the Cloud SDK to create a service account and then create credentials you will need to authenticate as the service account.
First, set a PROJECT_ID environment variable:
export PROJECT_ID=$(gcloud config get-value core/project)
Next, create a new service account to access the Speech-to-Text API by using:
gcloud iam service-accounts create my-stt-sa \
  --display-name "my stt service account"
Next, create credentials that your Python code will use to log in as your new service account. Create and save these credentials as a ~/key.json JSON file by using the following command:
gcloud iam service-accounts keys create ~/key.json \
  --iam-account my-stt-sa@${PROJECT_ID}.iam.gserviceaccount.com
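The --iam-account value in the command above follows the pattern for service account email addresses: the account name, then the project ID, then the iam.gserviceaccount.com domain. As a small illustration, the sketch below builds that address in Python. The helper name service_account_email and the project ID are hypothetical; nothing here calls a Google Cloud API.

```python
def service_account_email(account_name: str, project_id: str) -> str:
    """Build the email address that identifies a service account.

    Hypothetical helper for illustration: it only formats a string
    and does not contact Google Cloud.
    """
    return f"{account_name}@{project_id}.iam.gserviceaccount.com"


# Example with a made-up project ID.
print(service_account_email("my-stt-sa", "my-project-id"))
```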
Finally, set the GOOGLE_APPLICATION_CREDENTIALS environment variable, which is used by the Speech-to-Text client library, covered in the next step, to find your credentials. The environment variable should be set to the full path of the credentials JSON file you created:
export GOOGLE_APPLICATION_CREDENTIALS=~/key.json
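If you want to double-check this setup from Python before making any API call, something like the following sketch may help. It reads the path from GOOGLE_APPLICATION_CREDENTIALS, opens the JSON key file, and returns the service account email stored in it. The function name describe_credentials is hypothetical, and the sketch assumes the file is a standard service account key containing the type and client_email fields.

```python
import json
import os


def describe_credentials(env=None):
    """Return the service account email stored in the key file.

    Hypothetical helper: reads the path from GOOGLE_APPLICATION_CREDENTIALS
    and raises a descriptive error when the variable or the file is missing.
    """
    env = os.environ if env is None else env
    path = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path:
        raise RuntimeError("GOOGLE_APPLICATION_CREDENTIALS is not set")
    # Expand a leading ~ in case the shell did not already do so.
    path = os.path.expanduser(path)
    if not os.path.isfile(path):
        raise FileNotFoundError(f"No key file at {path}")
    with open(path) as f:
        key = json.load(f)
    # Assumption: service account keys carry type == "service_account".
    if key.get("type") != "service_account":
        raise ValueError("Key file is not a service account key")
    return key["client_email"]
```

Calling describe_credentials() with no arguments in Cloud Shell should return the my-stt-sa address if the previous steps succeeded.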
Install the client library:
pip3 install --user --upgrade google-cloud-speech
You should see something like this:
...
Installing collected packages: google-cloud-speech
Successfully installed google-cloud-speech-1.3.2
Now, you're ready to use the Speech-to-Text API!
In this tutorial, you'll use an interactive Python interpreter called IPython. Start a session by running ipython in Cloud Shell. This command runs the Python interpreter in an interactive session.
ipython
You should see something like this:
Python 3.7.3 (default, Mar 31 2020, 14:50:17)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.13.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]:
In this section, you will transcribe an English audio file.
Copy the following code into your IPython session:
from google.cloud import speech_v1 as speech


def speech_to_text(config, audio):
    client = speech.SpeechClient()
    # Keyword arguments make explicit which value is the config and which is the audio.
    response = client.recognize(config=config, audio=audio)
    print_sentences(response)


def print_sentences(response):
    for result in response.results:
        # The first alternative is the most probable transcription.
        best_alternative = result.alternatives[0]
        transcript = best_alternative.transcript
        confidence = best_alternative.confidence
        print('-' * 80)
        print(f'Transcript: {transcript}')
        print(f'Confidence: {confidence:.0%}')


config = {'language_code': 'en-US'}
audio = {'uri': 'gs://cloud-samples-data/speech/brooklyn_bridge.flac'}
Take a moment to study the code and see how it uses the recognize client library method to transcribe an audio file. The config parameter indicates how to process the request and the audio parameter specifies the audio data to be recognized.
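The audio parameter used here points at a file in Cloud Storage through its uri field. The API's audio message also has a content field for raw audio bytes, so a local file can be sent inline instead. The sketch below shows one way to pick between the two; make_audio is a hypothetical helper, not part of the client library. It assumes any local file is in a format such as FLAC or WAV whose header describes the encoding. Other formats would likely need encoding and sample rate settings in config.

```python
def make_audio(source: str) -> dict:
    """Build the audio argument from a gs:// URI or a local file path.

    Hypothetical helper for illustration only.
    """
    if source.startswith("gs://"):
        # Cloud Storage objects are passed by reference.
        return {"uri": source}
    # Local files are read and sent inline as raw bytes.
    with open(source, "rb") as f:
        return {"content": f.read()}


print(make_audio("gs://cloud-samples-data/speech/brooklyn_bridge.flac"))
```

The returned dict can be passed as the audio argument in place of the hand-written one.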
Call the function:
speech_to_text(config, audio)
You should see the following output:
--------------------------------------------------------------------------------
Transcript: how old is the Brooklyn Bridge
Confidence: 98%
Update the configuration to enable automatic punctuation and call the function again:
config.update({'enable_automatic_punctuation': True})
speech_to_text(config, audio)
You should see the following output:
--------------------------------------------------------------------------------
Transcript: How old is the Brooklyn Bridge?
Confidence: 98%
In this step, you were able to transcribe an audio file in English, using different parameters, and print out the result. You can read more about performing synchronous speech recognition.
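Printing is convenient in an interactive session, but a script usually needs the transcripts as data. As a rough sketch, the hypothetical collect_transcripts function below walks the same results and alternatives structure as print_sentences and returns (transcript, confidence) pairs, optionally dropping low-confidence results. The fake_response object is a stand-in built with SimpleNamespace so the example runs without calling the API; it is not a real API response.

```python
from types import SimpleNamespace


def collect_transcripts(response, min_confidence=0.0):
    """Return (transcript, confidence) pairs for the best alternatives.

    Hypothetical helper: works with any object shaped like the
    response that print_sentences consumes.
    """
    pairs = []
    for result in response.results:
        if not result.alternatives:
            continue
        best = result.alternatives[0]
        if best.confidence >= min_confidence:
            pairs.append((best.transcript, best.confidence))
    return pairs


# Stand-in object shaped like a recognize() response, for illustration only.
fake_response = SimpleNamespace(results=[
    SimpleNamespace(alternatives=[
        SimpleNamespace(transcript="How old is the Brooklyn Bridge?",
                        confidence=0.98),
    ]),
])
print(collect_transcripts(fake_response, min_confidence=0.9))
```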
Speech-to-Text can detect time offsets (timestamps) for the transcribed audio. Time offsets show the beginning and end of each spoken word in the supplied audio. A time offset value represents the amount of time that has elapsed from the beginning of the audio, in increments of 100ms.
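Time offsets are durations, and a protobuf duration stores whole seconds and nanoseconds as two separate fields. The sketch below shows the arithmetic for turning such a pair into milliseconds. The function to_milliseconds is a hypothetical stand-in written for illustration, and it assumes non-negative durations.

```python
def to_milliseconds(seconds: int, nanos: int) -> int:
    """Convert a non-negative (seconds, nanos) duration to whole milliseconds."""
    return seconds * 1000 + nanos // 1_000_000


# 1 second plus 400,000,000 nanoseconds is 1.4 seconds.
print(to_milliseconds(1, 400_000_000))  # 1400
```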
To transcribe an audio file with word timestamps, update your code by copying the following into your IPython session:
from google.cloud import speech_v1 as speech


def speech_to_text(config, audio):
    client = speech.SpeechClient()
    # Keyword arguments make explicit which value is the config and which is the audio.
    response = client.recognize(config=config, audio=audio)
    print_sentences(response)


def print_sentences(response):
    for result in response.results:
        # The first alternative is the most probable transcription.
        best_alternative = result.alternatives[0]
        transcript = best_alternative.transcript
        confidence = best_alternative.confidence
        print('-' * 80)
        print(f'Transcript: {transcript}')
        print(f'Confidence: {confidence:.0%}')
        print_word_offsets(best_alternative)


def print_word_offsets(alternative):
    for word in alternative.words:
        # start_time and end_time are durations measured from the start of the audio.
        start_ms = word.start_time.ToMilliseconds()
        end_ms = word.end_time.ToMilliseconds()
        text = word.word
        print(f'{start_ms/1000:>7.3f}',
              f'{end_ms/1000:>7.3f}',
              f'{text}',
              sep=' | ')


config = {
    'language_code': 'en-US',
    'enable_automatic_punctuation': True,
    'enable_word_time_offsets': True,
}
audio = {'uri': 'gs://cloud-samples-data/speech/brooklyn_bridge.flac'}
Take a moment to study the code and see how it transcribes an audio file with word timestamps. The enable_word_time_offsets parameter tells the API to return the time offsets for each word (see the doc for more details).
Call the function:
speech_to_text(config, audio)
You should see the following output:
--------------------------------------------------------------------------------
Transcript: How old is the Brooklyn Bridge?
Confidence: 98%
  0.000 |   0.300 | How
  0.300 |   0.600 | old
  0.600 |   0.800 | is
  0.800 |   0.900 | the
  0.900 |   1.100 | Brooklyn
  1.100 |   1.400 | Bridge?
In this step, you were able to transcribe an audio file in English with word timestamps and print out the result. Read more about getting word timestamps.
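One thing word timestamps make possible is looking up which word is being spoken at a given moment, for example to highlight text during playback. The sketch below does this with plain tuples copied from the output above, so it runs without calling the API. The names WORDS and word_at are hypothetical, and treating each word as covering the half-open interval from its start to its end is an assumption of this sketch.

```python
# (start_ms, end_ms, word) tuples copied from the output above.
WORDS = [
    (0, 300, "How"),
    (300, 600, "old"),
    (600, 800, "is"),
    (800, 900, "the"),
    (900, 1100, "Brooklyn"),
    (1100, 1400, "Bridge?"),
]


def word_at(words, time_ms):
    """Return the word being spoken at time_ms, or None if there is none."""
    for start_ms, end_ms, word in words:
        # Each word covers the half-open interval [start_ms, end_ms).
        if start_ms <= time_ms < end_ms:
            return word
    return None


print(word_at(WORDS, 950))  # Brooklyn
```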
The Speech-to-Text API recognizes more than 120 languages and variants! You can find a list of supported languages here.
In this section, you will transcribe a French audio file.
To transcribe the French audio file, update your code by copying the following into your IPython session:
config = {
    'language_code': 'fr-FR',
    'enable_automatic_punctuation': True,
    'enable_word_time_offsets': True,
}
audio = {'uri': 'gs://cloud-samples-data/speech/corbeau_renard.flac'}

speech_to_text(config, audio)
You should see the following output:
--------------------------------------------------------------------------------
Transcript: Maître corbeau sur un arbre perché tenait en son bec un fromage...
Confidence: 93%
  0.000 |   0.700 | Maître
  0.700 |   0.900 | corbeau
  0.900 |   1.300 | sur
  1.300 |   1.600 | un
  1.600 |   1.700 | arbre
  1.700 |   2.000 | perché
  2.000 |   2.800 | tenait
  2.800 |   3.100 | en
  3.100 |   3.200 | son
  3.200 |   3.500 | bec
  3.500 |   3.700 | un
  3.700 |   3.800 | fromage
...
 10.400 |  11.400 | Bonjour
 11.400 |  11.800 | Monsieur
 11.800 |  11.900 | du
 11.900 |  12.100 | corbeau.
This is the beginning of a popular French fable by Jean de La Fontaine.
In this step, you were able to transcribe a French audio file and print out the result. You can read more about supported languages.
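Since only the language code changes between the English and French examples, the configuration can be generated rather than retyped. The following is a minimal sketch; make_config is a hypothetical helper, and its defaults simply mirror the options enabled in this tutorial.

```python
def make_config(language_code, punctuation=True, word_offsets=True):
    """Build a recognition config dict for the given language code.

    Hypothetical helper: the keys match the config dicts used in this tutorial.
    """
    return {
        "language_code": language_code,
        "enable_automatic_punctuation": punctuation,
        "enable_word_time_offsets": word_offsets,
    }


print(make_config("fr-FR"))
```

The result can be passed to speech_to_text in place of the hand-written config.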
You learned how to use the Speech-to-Text API with Python to perform different kinds of transcription on audio files!
To avoid incurring charges to your Google Cloud account for the resources used in this tutorial:
This work is licensed under a Creative Commons Attribution 2.0 Generic License.