In this codelab, you will build an audio recognition network and use it to control a slider in the browser by making sounds. You will be using TensorFlow.js, a powerful and flexible machine learning library for JavaScript.

First, you will load and run a pre-trained model that can recognize 20 speech commands. Then using your microphone, you will build and train a simple neural network that recognizes your sounds and makes the slider go left or right.

This codelab will not go over the theory behind audio recognition models. If you are curious about that, check out this tutorial.

We have also created a glossary of machine learning terms that you'll find in this codelab.

What you'll learn

So let's get started.

To complete this codelab, you will need:

If you are using a computer without a code editor, you can use the free Google Cloud Console.

  1. Open https://console.cloud.google.com/
  2. Click the "Activate Cloud Shell" button
  3. Click "Launch code editor"
  4. Create a file index.html, a root HTML file for the main web page.
  5. Create a file index.js, where our JS source code to train and run the TensorFlow.js model will live
  6. In the console, start a simple HTTP server to serve the files: python3 -m http.server 8080 (or, on Python 2, python -m SimpleHTTPServer 8080)
  7. Click "Web Preview" to open up the development server which will serve your application.

Open index.html in an editor and add this content:

<html>
  <head>
    <script src="https://unpkg.com/@tensorflow/tfjs"></script>
    <script src="https://unpkg.com/@tensorflow-models/speech-commands"></script>
  </head>
  <body>
    <div id="console"></div>
    <script src="index.js"></script>
  </body>
</html>

The first <script> tag imports the TensorFlow.js library, and the second <script> imports the pre-trained Speech Commands model. The <div id="console"> tag will be used to display the output of the model.

Next, open/create the file index.js in a code editor, and include the following code:

let recognizer;

function predictWord() {
 // Array of words that the recognizer is trained to recognize.
 const words = recognizer.wordLabels();
 recognizer.listen(({scores}) => {
   // Turn scores into a list of (score,word) pairs.
   scores = Array.from(scores).map((s, i) => ({score: s, word: words[i]}));
   // Find the most probable word.
   scores.sort((s1, s2) => s2.score - s1.score);
   document.querySelector('#console').textContent = scores[0].word;
 }, {probabilityThreshold: 0.75});
}

async function app() {
 recognizer = speechCommands.create('BROWSER_FFT');
 await recognizer.ensureModelLoaded();
 predictWord();
}

app();

Make sure your device has a microphone. It's worth noting that this will also work on a mobile phone! To run the webpage, open index.html in a browser. If you are using the Cloud Console, refresh the preview page.

It may take a bit of time to download the model, so please be patient. As soon as the model loads, you should see a word at the top of the page. The model was trained to recognize the numbers 0 through 9 and additional commands such as "left", "right", "yes", and "no".

Speak one of those words. Does it recognize your word correctly? Play with probabilityThreshold, which controls how often the model fires -- 0.75 means that the model will fire when it is more than 75% confident that it hears a given word.
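To get a feel for what the threshold does, here is a minimal plain-JavaScript sketch of the same idea (the helper name firesAbove is ours for illustration, not part of the Speech Commands API):

```javascript
// Hypothetical helper: only report a prediction when the top score
// clears the threshold; otherwise stay silent (return null).
function firesAbove(scores, threshold) {
  const maxScore = Math.max(...scores);
  return maxScore > threshold ? scores.indexOf(maxScore) : null;
}

// With a 0.75 threshold, a confident prediction fires...
console.log(firesAbove([0.05, 0.85, 0.10], 0.75)); // 1
// ...but an uncertain one is suppressed.
console.log(firesAbove([0.40, 0.35, 0.25], 0.75)); // null
```

A higher threshold makes the model fire less often but with fewer false positives.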

To learn more about the Speech Commands model and its API, see the README.md on Github.

To make it fun, let's use short sounds instead of whole words to control the slider!

You are going to train a model to recognize 3 different commands: "Left" and "Right", which will make the slider move left or right, and "Noise". Recognizing "Noise" (no action needed) is critical in speech detection since we want the slider to react only when we produce the right sound, and not when we are generally speaking and moving around.

1. First we need to collect data. Add a simple UI to the app by adding this inside the <body> tag:

<button id="left" onmousedown="collect(0)" onmouseup="collect(null)">Left</button>
<button id="right" onmousedown="collect(1)" onmouseup="collect(null)">Right</button>
<button id="noise" onmousedown="collect(2)" onmouseup="collect(null)">Noise</button>
<!-- The bottom 2 lines should be there already -->
<div id="console"></div>
<script src="index.js"></script>

2. Add this to index.js:

// One frame is ~23ms of audio.
const NUM_FRAMES = 3;
let examples = [];

function collect(label) {
 if (label == null) {
   return recognizer.stopListening();
 }
 recognizer.listen(async ({spectrogram: {frameSize, data}}) => {
   let vals = normalize(data.subarray(-frameSize * NUM_FRAMES));
   examples.push({vals, label});
   document.querySelector('#console').textContent =
       `${examples.length} examples collected`;
 }, {
   overlapFactor: 0.999,
   includeSpectrogram: true,
   invokeCallbackOnNoiseAndUnknown: true
 });
}

function normalize(x) {
 const mean = -100;
 const std = 10;
 return x.map(x => (x - mean) / std);
}

3. Remove predictWord() from app():

async function app() {
 recognizer = speechCommands.create('BROWSER_FFT');
 await recognizer.ensureModelLoaded();
 // predictWord() no longer called.
}

Breaking it down

This code might be overwhelming at first, so let's break it down.

We've added three buttons to our UI labeled "Left", "Right", and "Noise", corresponding to the three commands we want our model to recognize. Pressing these buttons calls our newly added collect() function, which creates training examples for our model.

collect() associates a label with the output of recognizer.listen(). Since includeSpectrogram is true, recognizer.listen() gives the raw spectrogram (frequency data) for 1 sec of audio, divided into 43 frames, so each frame is ~23ms of audio:

recognizer.listen(async ({spectrogram: {frameSize, data}}) => {
...
}, {includeSpectrogram: true});

Since we want to use short sounds instead of words to control the slider, we are taking into consideration only the last 3 frames (~70ms):

let vals = normalize(data.subarray(-frameSize * NUM_FRAMES));
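The arithmetic behind these numbers is straightforward (a quick sketch, assuming the model's fixed 1-second, 43-frame spectrogram window):

```javascript
// The spectrogram covers 1 second of audio split into 43 frames.
const FRAME_MS = 1000 / 43;      // ≈ 23 ms per frame
const WINDOW_MS = 3 * FRAME_MS;  // last 3 frames ≈ 70 ms

console.log(Math.round(FRAME_MS));  // 23
console.log(Math.round(WINDOW_MS)); // 70
```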

And to avoid numerical issues, we normalize the data to have an average of 0 and a standard deviation of 1. In this case, the spectrogram values are usually large negative numbers around -100 with a standard deviation of about 10:

const mean = -100;
const std = 10;
return x.map(x => (x - mean) / std);

Finally, each training example will have 2 fields:

- vals: the normalized spectrogram values for the last NUM_FRAMES frames
- label: 0, 1, or 2, corresponding to "Left", "Right", and "Noise"

and we store all the data in the examples variable:

examples.push({vals, label});

Open index.html in a browser, and you should see 3 buttons corresponding to the 3 commands. If you are working from a local file, to access the microphone you will have to start a webserver and use http://localhost:port/.

To start a simple webserver on port 8000:

python3 -m http.server

(On Python 2, use python -m SimpleHTTPServer instead.)

To collect examples for each command, make a consistent sound repeatedly (or continuously) while pressing and holding each button for 3-4 seconds. You should collect ~150 examples for each label. For example, we can snap fingers for "Left", whistle for "Right", and alternate between silence and talk for "Noise".

As you collect more examples, the counter shown on the page should go up. Feel free to also inspect the data by logging the examples variable in the console. At this point the goal is to test the data collection process. Later you will re-collect data when you are testing the whole app.
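As a quick sanity check while inspecting, you could tally how many examples each label has (a sketch; countByLabel is our own helper, not part of the codelab's code):

```javascript
// Tally collected examples per label (0 = Left, 1 = Right, 2 = Noise).
function countByLabel(examples) {
  const counts = [0, 0, 0];
  examples.forEach(e => counts[e.label]++);
  return counts;
}

// Paste into the browser console after collecting, e.g.:
// countByLabel(examples);  // → something like [150, 148, 152]
```

Roughly balanced counts across the three labels help the model learn all classes equally well.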

1. Add a "Train" button right after the "Noise" button in the body in index.html:

<br/><br/>
<button id="train" onclick="train()">Train</button>

2. Add the following to the existing code in index.js:

const INPUT_SHAPE = [NUM_FRAMES, 232, 1];
let model;

async function train() {
 toggleButtons(false);
 const ys = tf.oneHot(examples.map(e => e.label), 3);
 const xsShape = [examples.length, ...INPUT_SHAPE];
 const xs = tf.tensor(flatten(examples.map(e => e.vals)), xsShape);

 await model.fit(xs, ys, {
   batchSize: 16,
   epochs: 10,
   callbacks: {
     onEpochEnd: (epoch, logs) => {
       document.querySelector('#console').textContent =
           `Accuracy: ${(logs.acc * 100).toFixed(1)}% Epoch: ${epoch + 1}`;
     }
   }
 });
 tf.dispose([xs, ys]);
 toggleButtons(true);
}

function buildModel() {
 model = tf.sequential();
 model.add(tf.layers.depthwiseConv2d({
   depthMultiplier: 8,
   kernelSize: [NUM_FRAMES, 3],
   activation: 'relu',
   inputShape: INPUT_SHAPE
 }));
 model.add(tf.layers.maxPooling2d({poolSize: [1, 2], strides: [2, 2]}));
 model.add(tf.layers.flatten());
 model.add(tf.layers.dense({units: 3, activation: 'softmax'}));
 const optimizer = tf.train.adam(0.01);
 model.compile({
   optimizer,
   loss: 'categoricalCrossentropy',
   metrics: ['accuracy']
 });
}

function toggleButtons(enable) {
 document.querySelectorAll('button').forEach(b => b.disabled = !enable);
}

function flatten(tensors) {
 const size = tensors[0].length;
 const result = new Float32Array(tensors.length * size);
 tensors.forEach((arr, i) => result.set(arr, i * size));
 return result;
}

3. Call buildModel() when the app loads:

async function app() {
 recognizer = speechCommands.create('BROWSER_FFT');
 await recognizer.ensureModelLoaded();
 // Add this line.
 buildModel();
}

At this point if you refresh the app you'll see a new "Train" button. You can test training by re-collecting data and clicking "Train", or you can wait until step 10 to test training along with prediction.

Breaking it down

At a high level we are doing two things: buildModel() defines the model architecture and train() trains the model using the collected data.

Model architecture

The model has 4 layers: a convolutional layer that processes the audio data (represented as a spectrogram), a max pool layer, a flatten layer, and a dense layer that maps to the 3 actions:

model = tf.sequential();
model.add(tf.layers.depthwiseConv2d({
  depthMultiplier: 8,
  kernelSize: [NUM_FRAMES, 3],
  activation: 'relu',
  inputShape: INPUT_SHAPE
}));
model.add(tf.layers.maxPooling2d({poolSize: [1, 2], strides: [2, 2]}));
model.add(tf.layers.flatten());
model.add(tf.layers.dense({units: 3, activation: 'softmax'}));

The input shape of the model is [NUM_FRAMES, 232, 1], where each frame is ~23ms of audio containing 232 numbers that correspond to different frequencies (232 was chosen because it is the number of frequency buckets needed to capture the human voice). In this codelab, we are using samples that are 3 frames long (~70ms) since we are making sounds instead of speaking whole words to control the slider.
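As a sanity check on the shapes, each example flattens to NUM_FRAMES × 232 × 1 numbers (a quick sketch; the 450-example figure is an assumption of ~150 per label):

```javascript
const NUM_FRAMES = 3;
const INPUT_SHAPE = [NUM_FRAMES, 232, 1];

// Number of values in one flattened training example.
const valsPerExample = INPUT_SHAPE.reduce((a, b) => a * b, 1);
console.log(valsPerExample); // 696

// So for, say, 450 collected examples, the flat xs buffer holds:
console.log(450 * valsPerExample); // 313200
```

This is exactly the length of the Float32Array that flatten() builds before tf.tensor() reshapes it to [examples.length, ...INPUT_SHAPE].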

We compile our model to get it ready for training:

const optimizer = tf.train.adam(0.01);
model.compile({
  optimizer,
  loss: 'categoricalCrossentropy',
  metrics: ['accuracy']
});

We use the Adam optimizer and categoricalCrossentropy for the loss, which is the standard loss function used for classification. In short, it measures how far the predicted probabilities (one probability per class) are from having 100% probability for the true class and 0% probability for all the other classes. We also provide accuracy as a metric to monitor, which gives us the percentage of examples the model classifies correctly after each epoch of training.
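To make the loss concrete, here is a small worked example of categorical cross-entropy for a single prediction (plain JavaScript, for illustration only, not part of the codelab's code):

```javascript
// Cross-entropy for one example reduces to -log of the probability
// assigned to the true class: the other terms vanish because the
// one-hot target is 0 everywhere else.
function crossEntropy(probs, trueClass) {
  return -Math.log(probs[trueClass]);
}

// A confident, correct prediction has a low loss...
console.log(crossEntropy([0.8, 0.1, 0.1], 0).toFixed(3)); // "0.223"
// ...while an unconfident one is penalized much more.
console.log(crossEntropy([0.2, 0.4, 0.4], 0).toFixed(3)); // "1.609"
```

Training nudges the weights so that the probability of the true class rises, driving this loss toward 0.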

Training

The training goes 10 times (epochs) over the data using a batch size of 16 (processing 16 examples at a time) and shows the current accuracy in the UI:

await model.fit(xs, ys, {
  batchSize: 16,
  epochs: 10,
  callbacks: {
    onEpochEnd: (epoch, logs) => {
      document.querySelector('#console').textContent =
          `Accuracy: ${(logs.acc * 100).toFixed(1)}% Epoch: ${epoch + 1}`;
    }
  }
});
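With a batch size of 16, the number of weight updates per epoch depends on how many examples you collected (a sketch; the 450 figure assumes ~150 examples per label):

```javascript
const numExamples = 450; // assumed: ~150 per label
const batchSize = 16;

// model.fit() processes the data in ceil(n / batchSize) batches per epoch.
const batchesPerEpoch = Math.ceil(numExamples / batchSize);
console.log(batchesPerEpoch); // 29
```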

Now that we can train our model, let's add code to make predictions in real-time and move the slider. Add this right after the "Train" button in index.html:

<br/><br/>
<button id="listen" onclick="listen()">Listen</button>
<input type="range" id="output" min="0" max="10" step="0.1">

And the following in index.js:

async function moveSlider(labelTensor) {
 const label = (await labelTensor.data())[0];
 document.getElementById('console').textContent = label;
 if (label == 2) {
   return;
 }
 let delta = 0.1;
 const prevValue = +document.getElementById('output').value;
 document.getElementById('output').value =
     prevValue + (label === 0 ? -delta : delta);
}

function listen() {
 if (recognizer.isListening()) {
   recognizer.stopListening();
   toggleButtons(true);
   document.getElementById('listen').textContent = 'Listen';
   return;
 }
 toggleButtons(false);
 document.getElementById('listen').textContent = 'Stop';
 document.getElementById('listen').disabled = false;

 recognizer.listen(async ({spectrogram: {frameSize, data}}) => {
   const vals = normalize(data.subarray(-frameSize * NUM_FRAMES));
   const input = tf.tensor(vals, [1, ...INPUT_SHAPE]);
   const probs = model.predict(input);
   const predLabel = probs.argMax(1);
   await moveSlider(predLabel);
   tf.dispose([input, probs, predLabel]);
 }, {
   overlapFactor: 0.999,
   includeSpectrogram: true,
   invokeCallbackOnNoiseAndUnknown: true
 });
}

Breaking it down

Real-time prediction

listen() listens to the microphone and makes real-time predictions. The code is very similar to the collect() method: it normalizes the raw spectrogram and drops all but the last NUM_FRAMES frames. The only difference is that we also call the trained model to get a prediction:

const probs = model.predict(input);
const predLabel = probs.argMax(1);
await moveSlider(predLabel);

The output of model.predict(input) is a Tensor of shape [1, numClasses] representing a probability distribution over the classes. More simply, this is just a set of confidences for each of the possible output classes which sum to 1. The Tensor has an outer dimension of 1 because that is the size of the batch (a single example).

To convert the probability distribution to a single integer representing the most likely class, we call probs.argMax(1), which returns the class index with the highest probability. We pass "1" as the axis parameter because we want to compute the argMax over the last dimension, numClasses.
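The same operation on a plain array looks like this (a sketch; tf.Tensor.argMax does the equivalent on the tensor backend):

```javascript
// Plain-JavaScript equivalent of argMax over a probability vector:
// return the index of the largest value.
function argMax(arr) {
  return arr.reduce((best, v, i) => (v > arr[best] ? i : best), 0);
}

console.log(argMax([0.1, 0.7, 0.2])); // 1 ("Right" in our label scheme)
```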

Updating the slider

moveSlider() decreases the value of the slider if the label is 0 ("Left"), increases it if the label is 1 ("Right"), and ignores it if the label is 2 ("Noise").
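That update rule can be expressed as a small pure function (our own refactoring for illustration; the codelab writes the logic inline in moveSlider()):

```javascript
// Map a predicted label to a slider delta:
// 0 ("Left") moves left, 1 ("Right") moves right, 2 ("Noise") does nothing.
function sliderDelta(label, step = 0.1) {
  if (label === 2) return 0;
  return label === 0 ? -step : step;
}

console.log(sliderDelta(0)); // -0.1
console.log(sliderDelta(1)); // 0.1
console.log(sliderDelta(2)); // 0
```

Note that the range input itself clamps its value between its min and max attributes, so the delta never pushes the slider out of bounds.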

Disposing tensors

To clean up GPU memory it's important for us to manually call tf.dispose() on output Tensors. The alternative to manual tf.dispose() is wrapping function calls in a tf.tidy(), but this cannot be used with async functions.

   tf.dispose([input, probs, predLabel]);

Open index.html in your browser and collect data as you did in the previous section with the 3 buttons corresponding to the 3 commands. Remember to press and hold each button for 3-4 seconds while collecting data.

Once you've collected examples, press the "Train" button. This will start training the model and you should see the accuracy of the model go above 90%. If you don't achieve good model performance, try collecting more data.

Once the training is done, press the "Listen" button to make predictions from the microphone and control the slider!