Using the Speech-to-Text API with Python

1. Overview

The Speech-to-Text API enables developers to convert audio to text in over 125 languages and variants, by applying powerful neural network models in an easy to use API.

In this tutorial, you will focus on using the Speech-to-Text API with Python.

What you'll learn

How to set up your environment
How to transcribe audio files in English
How to transcribe audio files with word timestamps
How to transcribe audio files in different languages

What you'll need

A Google Cloud project
A browser, such as Chrome or Firefox
Familiarity using Python

Survey

How will you use this tutorial?

Read it through only

Read it and complete the exercises

How would you rate your experience with Python?

Novice

Intermediate

Proficient

How would you rate your experience with Google Cloud services?

Novice

Intermediate

Proficient

2. Setup and requirements

Self-paced environment setup

Sign-in to the Google Cloud Console and create a new project or reuse an existing one. If you don't already have a Gmail or Google Workspace account, you must create one.

The Project name is the display name for this project's participants. It is a character string not used by Google APIs. You can always update it.
The Project ID is unique across all Google Cloud projects and is immutable (cannot be changed after it has been set). The Cloud Console auto-generates a unique string; usually you don't care what it is. In most codelabs, you'll need to reference your Project ID (typically identified as PROJECT_ID). If you don't like the generated ID, you might generate another random one. Alternatively, you can try your own, and see if it's available. It can't be changed after this step and remains for the duration of the project.
For your information, there is a third value, a Project Number, which some APIs use. Learn more about all three of these values in the documentation.

Next, you'll need to enable billing in the Cloud Console to use Cloud resources/APIs. Running through this codelab won't cost much, if anything at all. To shut down resources to avoid incurring billing beyond this tutorial, you can delete the resources you created or delete the project. New Google Cloud users are eligible for the $300 USD Free Trial program.

Start Cloud Shell

While Google Cloud can be operated remotely from your laptop, in this codelab you will be using Cloud Shell, a command line environment running in the Cloud.

Activate Cloud Shell

From the Cloud Console, click Activate Cloud Shell .

If this is your first time starting Cloud Shell, you're presented with an intermediate screen describing what it is. If you were presented with an intermediate screen, click Continue.

It should only take a few moments to provision and connect to Cloud Shell.

This virtual machine is loaded with all the development tools needed. It offers a persistent 5 GB home directory and runs in Google Cloud, greatly enhancing network performance and authentication. Much, if not all, of your work in this codelab can be done with a browser.

Once connected to Cloud Shell, you should see that you are authenticated and that the project is set to your project ID.

Run the following command in Cloud Shell to confirm that you are authenticated:

gcloud auth list

Command output

 Credentialed Accounts
ACTIVE  ACCOUNT
*       <my_account>@<my_domain.com>

To set the active account, run:
    $ gcloud config set account `ACCOUNT`

Run the following command in Cloud Shell to confirm that the gcloud command knows about your project:

gcloud config list project

Command output

[core]
project = <PROJECT_ID>

If it is not, you can set it with this command:

gcloud config set project <PROJECT_ID>

Command output

Updated property [core/project].

3. Environment setup

Before you can begin using the Speech-to-Text API, run the following command in Cloud Shell to enable the API:

gcloud services enable speech.googleapis.com

You should see something like this:

Operation "operations/..." finished successfully.

Now, you can use the Speech-to-Text API!

Navigate to your home directory:

cd ~

Create a Python virtual environment to isolate the dependencies:

virtualenv venv-speech

Activate the virtual environment:

source venv-speech/bin/activate

Install IPython and the Speech-to-Text API client library:

pip install ipython google-cloud-speech

You should see something like this:

...
Installing collected packages: ..., ipython, google-cloud-speech
Successfully installed ... google-cloud-speech-2.25.1 ...

Now, you're ready to use the Speech-to-Text API client library!

In the next steps, you'll use an interactive Python interpreter called IPython, which you installed in the previous step. Start a session by running ipython in Cloud Shell:

ipython

You should see something like this:

Python 3.9.2 (default, Feb 28 2021, 17:03:44)
Type 'copyright', 'credits' or 'license' for more information
IPython 8.18.1 -- An enhanced Interactive Python. Type '?' for help.

In [1]:

You're ready to make your first request...

4. Transcribe audio files

In this section, you will transcribe an English audio file.

Copy the following code into your IPython session:

from google.cloud import speech


def speech_to_text(
    config: speech.RecognitionConfig,
    audio: speech.RecognitionAudio,
) -> speech.RecognizeResponse:
    client = speech.SpeechClient()

    # Synchronous speech recognition request
    response = client.recognize(config=config, audio=audio)

    return response


def print_response(response: speech.RecognizeResponse):
    for result in response.results:
        print_result(result)


def print_result(result: speech.SpeechRecognitionResult):
    best_alternative = result.alternatives[0]
    print("-" * 80)
    print(f"language_code: {result.language_code}")
    print(f"transcript:    {best_alternative.transcript}")
    print(f"confidence:    {best_alternative.confidence:.0%}")

Take a moment to study the code and see how it uses the recognize client library method to transcribe an audio file*.* The config parameter indicates how to process the request and the audio parameter specifies the audio data to be recognized.

Send a request:

config = speech.RecognitionConfig(
    language_code="en",
)
audio = speech.RecognitionAudio(
    uri="gs://cloud-samples-data/speech/brooklyn_bridge.flac",
)

response = speech_to_text(config, audio)
print_response(response)

You should see the following output:

--------------------------------------------------------------------------------
language_code: en-us
transcript:    how old is the Brooklyn Bridge
confidence:    98%

Update the configuration to enable automatic punctuation and send a new request:

config.enable_automatic_punctuation = True

response = speech_to_text(config, audio)
print_response(response)

You should see the following output:

--------------------------------------------------------------------------------
language_code: en-us
transcript:    How old is the Brooklyn Bridge?
confidence:    98%

Summary

In this step, you were able to transcribe an audio file in English, using different parameters, and print out the result. You can read more about transcribing audio files.

5. Get word timestamps

Speech-to-Text can detect time offsets (timestamps) for the transcribed audio. Time offsets show the beginning and end of each spoken word in the supplied audio. A time offset value represents the amount of time that has elapsed from the beginning of the audio, in increments of 100ms.

To transcribe an audio file with word timestamps, update your code by copying the following into your IPython session:

def print_result(result: speech.SpeechRecognitionResult):
    best_alternative = result.alternatives[0]
    print("-" * 80)
    print(f"language_code: {result.language_code}")
    print(f"transcript:    {best_alternative.transcript}")
    print(f"confidence:    {best_alternative.confidence:.0%}")
    print("-" * 80)
    for word in best_alternative.words:
        start_s = word.start_time.total_seconds()
        end_s = word.end_time.total_seconds()
        print(f"{start_s:>7.3f} | {end_s:>7.3f} | {word.word}")

Take a moment to study the code and see how it transcribes an audio file with word timestamps*.* The enable_word_time_offsets parameter tells the API to return the time offsets for each word (see the doc for more details).

Send a request:

config = speech.RecognitionConfig(
    language_code="en",
    enable_automatic_punctuation=True,
    enable_word_time_offsets=True,
)
audio = speech.RecognitionAudio(
    uri="gs://cloud-samples-data/speech/brooklyn_bridge.flac",
)

response = speech_to_text(config, audio)
print_response(response)

You should see the following output:

--------------------------------------------------------------------------------
language_code: en-us
transcript:    How old is the Brooklyn Bridge?
confidence:    98%
--------------------------------------------------------------------------------
  0.000 |   0.300 | How
  0.300 |   0.600 | old
  0.600 |   0.800 | is
  0.800 |   0.900 | the
  0.900 |   1.100 | Brooklyn
  1.100 |   1.400 | Bridge?

Summary

In this step, you were able to transcribe an audio file in English with word timestamps and print the result. Read more about getting word timestamps.

6. Transcribe different languages

The Speech-to-Text API recognizes more than 125 languages and variants! You can find a list of supported languages here.

In this section, you will transcribe a French audio file.

To transcribe the French audio file, update your code by copying the following into your IPython session:

config = speech.RecognitionConfig(
    language_code="fr-FR",
    enable_automatic_punctuation=True,
    enable_word_time_offsets=True,
)
audio = speech.RecognitionAudio(
    uri="gs://cloud-samples-data/speech/corbeau_renard.flac",
)

response = speech_to_text(config, audio)
print_response(response)

You should see the following output:

--------------------------------------------------------------------------------
language_code: fr-fr
transcript:    Maître corbeau sur un arbre perché Tenait dans son bec un fromage maître Renard par l'odeur alléché lui tint à peu près ce langage et bonjour monsieur du corbeau.
confidence:    94%
--------------------------------------------------------------------------------
  0.000 |   0.700 | Maître
  0.700 |   1.100 | corbeau
  1.100 |   1.300 | sur
  1.300 |   1.600 | un
  1.600 |   1.700 | arbre
  1.700 |   2.000 | perché
  2.000 |   3.000 | Tenait
  3.000 |   3.000 | dans
  3.000 |   3.200 | son
  3.200 |   3.500 | bec
  3.500 |   3.700 | un
  3.700 |   3.800 | fromage
...
 10.800 |  11.800 | monsieur
 11.800 |  11.900 | du
 11.900 |  12.100 | corbeau.

Summary

In this step, you were able to transcribe a French audio file and print the result. You can read more about the supported languages.

7. Congratulations!

You learned how to use the Speech-to-Text API using Python to perform different kinds of transcription on audio files!

Clean up

To clean up your development environment, from Cloud Shell:

If you're still in your IPython session, go back to the shell: exit
Stop using the Python virtual environment: deactivate
Delete your virtual environment folder: cd ~ ; rm -rf ./venv-speech

To delete your Google Cloud project, from Cloud Shell:

Retrieve your current project ID: PROJECT_ID=$(gcloud config get-value core/project)
Make sure this is the project you want to delete: echo $PROJECT_ID
Delete the project: gcloud projects delete $PROJECT_ID

Learn more

Test the demo in your browser: https://cloud.google.com/speech-to-text
Speech-to-Text documentation: https://cloud.google.com/speech-to-text/docs
Python on Google Cloud: https://cloud.google.com/python
Cloud Client Libraries for Python: https://github.com/googleapis/google-cloud-python

License

This work is licensed under a Creative Commons Attribution 2.0 Generic License.