The Google Cloud Speech streaming API enables developers to turn spoken language into text in real time. Combined with JavaScript's Web Audio API and WebSockets, a Java servlet can accept streamed speech from a web page and return text transcripts of it, enabling any web page to use the spoken word as an additional user interface.

By the end of this codelab, you should have a web app that transcribes, in real time, audio spoken into a user's microphone.

Overview

This codelab is split into multiple steps, each of which introduces a component of the final web app, provides links to reference documentation and related sample code, and tasks you with implementing that component.

The webapp you'll create takes audio from the client's microphone and streams it to a Java servlet. The servlet passes the data to the Cloud Speech API, which streams transcriptions of any speech it detects back to the servlet. The servlet then forwards the transcription results to the client, which displays them on the page.

To accomplish this, you'll need to create several components:

A possible sample solution is provided in each step, which can be used in several ways:

Prerequisites

This codelab assumes familiarity with:

Self-paced environment setup

If you don't already have a Google Account (Gmail or Google Apps), you must create one. Sign in to the Google Cloud Platform Console (console.cloud.google.com) and create a new project:

Remember the project ID, a name that must be unique across all Google Cloud projects. It will be referred to later in this codelab as PROJECT_ID.

Next, you'll need to enable billing in the Cloud Console in order to use Google Cloud resources.

Running through this codelab shouldn't cost you more than a few dollars, but it could be more if you decide to use more resources or if you leave them running (see the "Cleanup" section at the end of this document). Google Compute Engine pricing is documented at https://cloud.google.com/compute/pricing.

New users of Google Cloud Platform are eligible for a $300 free trial.

The Cloud Speech API lets you perform speech-to-text transcription of audio files in over 80 languages. It also has a streaming endpoint, which lets you stream audio samples from, say, a live microphone, and receive text transcriptions in real time.

In order to use the API, you must first enable it in the Cloud Console.
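Alternatively, if you have the gcloud command-line tool installed and authenticated, you can enable the API with a single command instead of clicking through the Console:

gcloud services enable speech.googleapis.com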

Click on the menu icon in the top left of the screen.

Select API Manager from the drop-down menu.

Click on Enable API.

Then, search for "speech" in the search box. Click on Google Cloud Speech API:

Click Enable to enable the Cloud Speech API:

Wait a few moments for the API to be enabled.

You may see this once it's enabled:

Google Compute Engine is a service that allows you to start virtual machines (VMs) on Google's infrastructure. We will use a virtual machine in this codelab to run the Java servlet that hosts the website and uses the Cloud Speech API to provide dynamic transcriptions to the client. For the purposes of this codelab, it can also be used to create and develop your code, if your preferred development machine isn't available.

To create a new virtual machine, click the menu icon on the upper left of the Cloud Console, and find and select Compute Engine:

The virtual machine instance we create will host a servlet written in Java 8, so be sure to change the default boot disk image to one that supports Java 8; Ubuntu 16.04 LTS is a good choice. The following highlights some options you'll want to set:

Once you're done, click Create and, after a few moments, your virtual machine will be up and running! Connect to it by clicking the SSH button to its right:

This should open a new tab and connect you to your virtual machine, giving you a command prompt. The rest of this codelab will assume that you're using this virtual machine.
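If you'd rather create the VM from the command line, a roughly equivalent gcloud command is sketched below; the instance name, zone, and machine type are illustrative, so pick whatever suits you:

gcloud compute instances create speech-codelab \
    --zone=us-central1-a \
    --machine-type=n1-standard-1 \
    --image-family=ubuntu-1604-lts \
    --image-project=ubuntu-os-cloud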

Relevant documentation

You can read more about Compute Engine and its different capabilities at https://cloud.google.com/compute/docs/

To facilitate this codelab, we provide a working sample for each of the following steps. You can pull down the provided solution to your virtual machine with this command:

git clone https://github.com/googlecodelabs/speaking-with-a-webpage.git

This will create the directory speaking-with-a-webpage, which contains subdirectories for each of the steps. Each subdirectory builds on the one before it, incrementally adding new functionality:

The sample solutions use Java 8 and the Maven project management tool to compile and run their code. Install both on your virtual machine by running:

sudo apt-get update
sudo apt-get install -y maven openjdk-8-jdk

For development purposes, the sample solutions do not use the standard HTTPS port; instead, they use the non-privileged port 8443. In order to access this port from your web browser, you must open it in your VM's firewall. You can do this with the gcloud command:

gcloud compute firewall-rules create dev-ports \
    --allow=tcp:8443 \
    --source-ranges=0.0.0.0/0

The Java servlet is the backbone that supports this webapp: it serves the required client-side HTML, CSS, and JavaScript code, and connects to the Cloud Speech API to provide transcriptions.

When a web page accesses a user's microphone, browsers require that the page be served over a secure channel to prevent eavesdropping. Because of this, we must set up our servlet to serve web pages over HTTPS. Since configuring and serving secure web pages is a topic in itself, we recommend simply using the self-signed certificate and Jetty configuration files in the provided sample solution, which are sufficient for a development environment.
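The sample already ships with a keystore, so you don't need to create one yourself. For reference, a self-signed certificate for Jetty can be generated with the JDK's keytool command, roughly like this (the alias, keystore name, and passwords here are illustrative):

keytool -genkeypair -alias jetty -keyalg RSA -keysize 2048 \
    -validity 365 -dname "CN=localhost" \
    -keystore keystore.jks -storepass changeit -keypass changeit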

For this step, simply read through and run the provided Maven project in 01-hello-https. Take particular note of the files within the src/ directory, as those are the primary files we will be modifying in subsequent steps:

Running the sample solution

The 01-hello-https subdirectory of the provided speaking-with-a-webpage repository contains a Maven servlet project configured for HTTPS, using the Jetty servlet framework to serve both static files and a dynamic endpoint. It follows the blog post above to generate a self-signed certificate with the keytool command, and adds Jetty configuration to support HTTPS. You can start it by running:

cd ~/speaking-with-a-webpage
cd 01-hello-https
mvn clean jetty:run

Then point your web browser to:

https://<your-vm-external-ip>:8443/

To stop the server, press Control-C.

When you first access the webapp using the HTTPS URL, your browser will likely warn you that the connection is not private. This is because the sample app uses a self-signed SSL certificate for development. In a production environment, you would need an SSL certificate signed by a Certificate Authority, but for the purposes of this codelab, a self-signed SSL certificate suffices. Just be sure not to share any secrets with your web page 😁

Relevant documentation

Relevant samples

The Web Audio API allows a web page to capture audio data from a user's microphone, given their consent. The Cloud Speech API needs this raw data in a specific form, and needs to know the rate at which it's sampled. For this step, start with the code from 01-hello-https and modify it to request microphone access with getUserMedia, feed the resulting stream into an AudioContext, attach a ScriptProcessorNode to capture the raw audio samples, and record the context's sampleRate, as sketched below.
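Here's a minimal sketch of that flow; it is not the sample solution itself, the variable names are illustrative, and it assumes the conversion to 16-bit samples happens on the client:

navigator.mediaDevices.getUserMedia({ audio: true }).then(function(stream) {
  var audioContext = new AudioContext();
  var source = audioContext.createMediaStreamSource(stream);
  // The Cloud Speech API will eventually need to know the sample rate.
  var sampleRate = audioContext.sampleRate;

  // A ScriptProcessorNode hands us raw audio in fixed-size buffers.
  var scriptNode = audioContext.createScriptProcessor(4096, 1, 1);
  scriptNode.onaudioprocess = function(event) {
    var floats = event.inputBuffer.getChannelData(0); // Float32 in [-1, 1]
    // Convert to 16-bit signed integers (the LINEAR16 format used later).
    var int16 = new Int16Array(floats.length);
    for (var i = 0; i < floats.length; i++) {
      int16[i] = Math.max(-1, Math.min(1, floats[i])) * 32767;
    }
    // A later step will send int16.buffer to the server over a WebSocket.
  };
  source.connect(scriptNode);
  scriptNode.connect(audioContext.destination);
});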

Relevant documentation

Relevant sample code

Sample solution

The 02-webaudio subdirectory of the provided speaking-with-a-webpage repository builds on the 01-hello-https step and the sample code above by using Web Audio's getUserMedia function to connect the user's microphone to a visualization of the audio. It then adds a ScriptProcessorNode to the audio pipeline to retrieve the raw audio bytes, in preparation for sending them to the server. Since the Cloud Speech API will also eventually need the sampleRate, it retrieves that as well. You can start it by running:

cd ~/speaking-with-a-webpage
cd 02-webaudio
mvn clean jetty:run

To access your running webapp, look for the External IP address on your Cloud Console VM Instances page, and point your browser to:

https://<your-vm-external-ip>:8443/

A normal HTTP connection is not well suited to streaming audio to a server in real time while receiving transcriptions as they become available. For this step, you'll create a WebSocket connection from the client to the server, and use it to send both the audio metadata (i.e., the sample rate) and the audio data to the server, while listening for a response (i.e., the transcript of the data).

In your Java servlet, define a subclass of WebSocketAdapter with handlers to receive both text and binary data. For now, simply send the request data back to the client, to verify that it works. Then modify TranscribeServlet.java to instead be a subclass of WebSocketServlet, and register the WebSocketAdapter you've defined so it can respond to requests. Note that you'll also need to add the Jetty WebSocket library as a dependency in pom.xml.
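A minimal echo version might look roughly like the sketch below; the adapter's name is illustrative (the doc's own TranscribeServlet and the Jetty types are real):

import java.io.IOException;
import org.eclipse.jetty.websocket.api.WebSocketAdapter;
import org.eclipse.jetty.websocket.servlet.WebSocketServlet;
import org.eclipse.jetty.websocket.servlet.WebSocketServletFactory;

// Handles a single WebSocket connection.
class TranscribeSocket extends WebSocketAdapter {
  @Override
  public void onWebSocketText(String message) {
    try {
      // For now, echo text frames (e.g., the sample rate) back to the client.
      getRemote().sendString(message);
    } catch (IOException e) {
      e.printStackTrace();
    }
  }

  @Override
  public void onWebSocketBinary(byte[] payload, int offset, int length) {
    // Binary frames will carry raw audio bytes in the next step.
  }
}

// The servlet registers the adapter so Jetty can route connections to it.
public class TranscribeServlet extends WebSocketServlet {
  @Override
  public void configure(WebSocketServletFactory factory) {
    factory.register(TranscribeSocket.class);
  }
}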

In your JavaScript client, create a secure WebSocket connection to your server. Once it's connected, first send the server the sampleRate of the audio you retrieved in the previous step, so the server knows how to interpret the audio you'll send it. Once the server has acknowledged receipt, use the ScriptProcessorNode you created in the previous step to send the raw audio bytes through the WebSocket as it receives them.
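Here's a sketch of that handshake, assuming the sampleRate and scriptNode variables from the previous step; the /transcribe endpoint path is illustrative:

// Open a secure WebSocket back to the server that served the page.
var socket = new WebSocket('wss://' + location.host + '/transcribe');
socket.onopen = function() {
  // Send the sample rate first so the server knows how to interpret the audio.
  socket.send(sampleRate.toString());
};
socket.onmessage = function(event) {
  // The echo server above will send the sample rate back; treat that as the
  // acknowledgement and start streaming audio.
  console.log('Received from server: ' + event.data);
};
// Then, inside the onaudioprocess handler from the previous step:
//   socket.send(int16.buffer);  // WebSocket.send() accepts an ArrayBuffer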

For completeness, you should also handle gracefully closing and reopening the WebSocket when errors occur or when the audio stream is paused.

Relevant documentation

Sample solution

The provided sample solution changes the TranscribeServlet to extend WebSocketServlet in order to register a WebSocketAdapter. The WebSocketAdapter it defines simply takes each message it receives and sends it back to the client.

On the client, the sample replaces the scriptNode from the previous step with one that sends the data to a socket to be defined later. It then creates that secure WebSocket connection to the server. Once both the server and the microphone have connected, it starts listening for messages from the server; then it sends the server the sample rate. When the server echoes back the sample rate, the client replaces the listener with the more permanent transcription handler, and connects the scriptNode to begin streaming audio bytes to the server.

You can start it by running:

cd ~/speaking-with-a-webpage
cd 03-websockets
mvn clean jetty:run

To access your running webapp, look for the External IP address on your Cloud Console VM Instances page, and point your browser to:

https://<your-vm-external-ip>:8443/

The Google Cloud Speech streaming API allows you to send audio bytes to the API in real time and asynchronously receive transcriptions of any speech it detects. The API expects the bytes to be in a specific format, as determined by the configuration sent at the beginning of a request. For this webapp, we will send the API raw audio samples in the LINEAR16 format (that is, each sample is a 16-bit signed integer), sent at the sample rate obtained by the client.
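To give a sense of what that configuration looks like, here's a sketch using the google-cloud-speech Java client; the sample solution may use a different client version, so exact class names could differ, and sampleRate and audioBytes are assumed to come from your WebSocket handler:

import com.google.cloud.speech.v1.RecognitionConfig;
import com.google.cloud.speech.v1.RecognitionConfig.AudioEncoding;
import com.google.cloud.speech.v1.StreamingRecognitionConfig;
import com.google.cloud.speech.v1.StreamingRecognizeRequest;
import com.google.protobuf.ByteString;

// The first request on the stream carries only the configuration.
StreamingRecognitionConfig streamingConfig =
    StreamingRecognitionConfig.newBuilder()
        .setConfig(RecognitionConfig.newBuilder()
            .setEncoding(AudioEncoding.LINEAR16)  // 16-bit signed samples
            .setSampleRateHertz(sampleRate)       // as reported by the client
            .setLanguageCode("en-US")
            .build())
        .setInterimResults(true)  // stream partial transcripts as they form
        .build();

StreamingRecognizeRequest configRequest =
    StreamingRecognizeRequest.newBuilder()
        .setStreamingConfig(streamingConfig)
        .build();

// Every subsequent request carries only audio content.
StreamingRecognizeRequest audioRequest =
    StreamingRecognizeRequest.newBuilder()
        .setAudioContent(ByteString.copyFrom(audioBytes))
        .build();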

Update your code from the previous step to:

Relevant documentation

Relevant sample code

Sample solution

The 04-speech subdirectory of the provided speaking-with-a-webpage repository fills out the server code from the 03-websockets step. It incorporates the StreamingRecognizeClient sample code above to connect to the Cloud Speech API, pass audio bytes along to it, and receive transcripts from it. As it asynchronously receives transcripts, it uses its connection to the JavaScript client to pass them along, and the JavaScript client simply writes them to the web page.
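For orientation, pulling a transcript out of a streaming response and forwarding it looks roughly like the fragment below, run inside the response handler; the exact stream plumbing varies by client version, and error handling is elided:

// Each response may contain several results; each result's alternatives are
// ranked by confidence, so the first alternative is the best guess.
for (StreamingRecognitionResult result : response.getResultsList()) {
  if (result.getAlternativesCount() > 0) {
    String transcript = result.getAlternatives(0).getTranscript();
    // Forward the transcript to the browser over the open WebSocket.
    // (sendString throws IOException; handle or propagate as appropriate.)
    getRemote().sendString(transcript);
  }
}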

You can start the sample solution by running:

cd ~/speaking-with-a-webpage
cd 04-speech
mvn clean jetty:run

To access your running webapp, look for the External IP address on your Cloud Console VM Instances page, and point your browser to:

https://<your-vm-external-ip>:8443/

By this point, you should have a functional webapp that can take speech from the microphone of a web page, stream it to a Java servlet, and receive text transcriptions back. Depending on factors like the quality of your microphone and the ambient noise (especially in a conference environment), you may have found the web page missing or misinterpreting some words you spoke. For your use case, this could potentially be improved with hardware such as a directional microphone.

There are also software improvements you could make:

Cleanup

Once you're done with the codelab, make sure to delete your Compute Engine instance to avoid recurring charges. Navigate to the Compute Engine page in the Cloud Console, click the More Options (vertical ellipsis) icon to the right of your instance, and select Delete: