TensorFlow.js: Make a smart webcam in JavaScript with a pre-trained Machine Learning model

Machine Learning is quite the buzzword these days, touching almost every industry, but how can you take your first steps as a web engineer or designer? If you work in the web or creative industry, be it front end or back end, and are familiar with JavaScript, this is the codelab for you!

What you'll build

In this codelab you will create a webpage that uses machine learning directly in the web browser via TensorFlow.js to detect and classify common objects (yes, more than one at a time) from a live webcam stream in real time, supercharging your regular webcam with superpowers in the browser! Even better, not only will you know that the image contains an object, you will also get the coordinates of the bounding box for each object it finds, which allows you to highlight the found object in the image as shown in the example below.

8f9bad6e49e646b.png

Imagine being able to detect whether a person is in a video, so you could count how many people are present at any given time to estimate how busy an area is over the day, or send yourself an alert when your dog is detected in a room it probably should not be in whilst you are away. If you could do that, you would be well on your way to making your own version of a Google Nest cam that alerts you when it sees an intruder (of any type) using your own custom hardware! Pretty neat. Is it hard to do? Nope. Let's get hacking...

What you'll learn

  • How to load a pre-trained TensorFlow.js model.
  • How to grab data from a live webcam stream and draw it to canvas.
  • How to classify an image frame to find the bounding box(es) of any object(s) the model has been trained to recognize.
  • How to use the data passed back from the model to highlight found objects.

This codelab focuses on how to get started with TensorFlow.js pre-trained models. Concepts that are not relevant, such as CSS styling and other non-ML code blocks, are glossed over and provided for you to simply copy and paste.

Share what you make with us

You can easily extend what we make today for other creative use cases too and we encourage you to think outside the box and keep hacking after you are finished.

Tag us on social media using the #MadeWithTFJS hashtag for a chance for your project to be featured on our TensorFlow blog or even future events.


TensorFlow.js is an open source machine learning library that can run anywhere JavaScript can. It's based upon the original TensorFlow library written in Python and aims to re-create this developer experience and set of APIs for the JavaScript ecosystem.
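
To give you a feel for the API, here is a minimal sketch (not part of the codelab code) showing the kind of tensor operation the library exposes, much like its Python counterpart:

const a = tf.tensor2d([[1, 2], [3, 4]]); // Create a 2x2 tensor.
const b = a.square();                    // Square each element.
b.print();                               // Logs [[1, 4], [9, 16]] to the console.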

Where can it be used?

Given the portability of JavaScript, we can now write in one language and perform machine learning across all of the following platforms with ease:

  • Client side in the web browser using vanilla JavaScript
  • Server side and even IoT devices like Raspberry Pi using Node.js
  • Desktop apps using Electron
  • Native mobile apps using React Native

TensorFlow.js also supports multiple backends within each of these environments - the actual hardware-based environments it can execute within, such as the CPU or WebGL. (A "backend" in this context does not mean a server side environment; the backend for execution could be client side, in WebGL for example.) Multiple backends ensure compatibility and keep things running fast. Currently TensorFlow.js supports:

  • WebGL execution on the device's graphics card (GPU) - this is the fastest way to execute larger models (over 3MB in size) with GPU acceleration.
  • Web Assembly (WASM) execution on CPU - to improve CPU performance across devices including older generation mobile phones for example. This is better suited to smaller models (less than 3MB in size) which can actually execute faster on CPU with WASM than with WebGL due to the overhead of uploading content to a graphics processor.
  • CPU execution - the fallback should none of the other environments be available. This is the slowest of the three but is always there for you.

Note: You can choose to force one of these backends if you know what device you will be executing on, or you can simply let TensorFlow.js decide for you if you do not specify this.
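
For example, here is a minimal sketch of how you might force a specific backend (this assumes you have also loaded the @tensorflow/tfjs-backend-wasm package; if you skip this step entirely, TensorFlow.js picks the best backend available):

// Ask TensorFlow.js to use the WASM backend, then confirm which backend is active.
tf.setBackend('wasm').then(function () {
  console.log('Using TensorFlow.js backend: ' + tf.getBackend());
});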

Client side super powers

Running TensorFlow.js in the web browser on the client machine can lead to several benefits that are worth considering.

Privacy

You can both train and classify data on the client machine without ever sending data to a 3rd party web server. There may be times when this is a requirement to comply with local laws such as GDPR, or when processing any data that the user may want to keep on their machine and not send to a 3rd party.

Speed

As we do not have to send data to a remote server, inference (the act of classifying the data) can be faster. Even better, you have direct access to the device's sensors such as the camera, microphone, GPS, accelerometer and more, should the user grant you access.

Reach and scale

Anyone in the world can click a link you send them, open the web page in their browser, and use what you have made. No need for a complex server side Linux setup with CUDA drivers and much more just to use the machine learning system.

Cost

No servers means the only thing you need to pay for is a CDN to host your HTML, CSS, JS, and model files. The cost of a CDN is much cheaper than keeping a server (potentially with a graphics card attached) running 24/7.

Server side features

Leveraging the Node.js implementation of TensorFlow.js enables the following features.

Full CUDA support

On the server side, graphics card acceleration requires you to install the NVIDIA CUDA drivers so that TensorFlow can work with the graphics card (unlike in the browser, which uses WebGL - no install needed). With full CUDA support you can fully leverage the graphics card's lower level abilities, which can lead to faster training and inference times. Performance is on par with the Python TensorFlow implementation as they both share the same C++ backend.
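
As a rough sketch (assuming you have installed the @tensorflow/tfjs-node-gpu package and the NVIDIA CUDA drivers it depends on), a Node.js script pulls in the GPU accelerated binding like this:

// Load the GPU accelerated Node.js binding (use @tensorflow/tfjs-node for CPU only).
const tf = require('@tensorflow/tfjs-node-gpu');

// The binding registers a native backend backed by the TensorFlow C++ library.
console.log('Backend in use: ' + tf.getBackend());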

Model Size

For cutting edge research you may be working with very large models, perhaps gigabytes in size. These models cannot currently be run in the web browser due to the limitations of memory usage per browser tab. To run these larger models, you can use Node.js on your own server with the hardware specifications required to run such a model efficiently.

IoT

Node.js is supported on popular single board computers like the Raspberry Pi, which in turn means you can execute TensorFlow.js models on such devices too.

Speed

Node.js executes JavaScript, which benefits from just-in-time compilation. This means you may often see performance gains when using Node.js as the code is optimized at runtime, especially for any preprocessing you may be doing. A great example can be seen in this case study, which shows how Hugging Face used Node.js to get a 2x performance boost for their natural language processing model.

Now that you understand the basics of TensorFlow.js, where it can run, and some of its benefits, let's start doing useful things with it!

The TensorFlow.js team has created a number of machine learning (ML) models that have already been trained by our team and wrapped in an easy to use class. These are often based on the latest research published here at Google and beyond. Using them is a great way to become productive with TensorFlow.js whilst also taking your first steps with machine learning. We offer a growing list of pre-trained models that you can use in the same manner as the one you will learn about today, so this knowledge can be reused with those too, should you wish to try them once you have finished this codelab.

Why would I want to use a pre-trained model?

There are a number of benefits to starting with a popular pre-trained model if it fits your desired use case:

  1. No need to gather training data yourself - gathering data, getting it into the correct format, and labelling it so that a machine learning system can learn from it can be very time consuming and costly.
  2. Rapidly prototype an idea with less cost and time - often a pre-trained model is good enough to do what you need, so there is no point reinventing the wheel; you can concentrate on using the knowledge provided by the model to implement that creative idea you had.
  3. Use state of the art research - these models are often based on popular research, which gives you exposure to them and an understanding of how they perform in the real world.
  4. Typically easier to use and well documented due to the popularity of such models.
  5. Some pre-trained models offer transfer learning capabilities. This is essentially the practice of transferring information learnt from one machine learning task to another similar one. For example, a model that was originally trained to recognize cats could be retrained to recognize dogs if you gave it new training data. Training is often faster because you are not starting with a blank canvas; the model can use what it has already learnt to recognize cats to recognize the new thing - dogs have eyes and ears too after all, so if it already knows how to find those features, we are halfway there. In short, you can retrain the model on your own data in a much faster way.

What is COCO-SSD?

COCO-SSD is the name of the pre-trained object detection ML model that we will be using today. It aims to localize and identify multiple objects in a single image - in other words, it gives you the bounding box of each object it has been trained to find, so you know the location of that object in any image you present to it. Check the example image below to see this in action, showing us where the dog is in the image.

760e5f87c335dd9e.png

If there were more than one dog in the image above, we would be given the coordinates of a bounding box describing the location of each. COCO-SSD has been pre-trained to recognize 90 common everyday objects such as a person, car, cat, etc.

Where did the name come from?

The name may sound strange, but it originates from two things:

  1. COCO: Refers to the fact that it was trained on the COCO (Common Objects in Context) dataset, which is freely available for anyone to download and use when training their own models. The dataset contains over 200,000 labeled images to learn from.
  2. SSD (Single Shot MultiBox Detector): Refers to the part of the model architecture used in the model's implementation. You do not need to understand this for the codelab, but if you are curious you can learn more about SSD here.

What you'll need

  • A modern web browser.
  • Basic knowledge of HTML, CSS, JavaScript, and Chrome DevTools (viewing the console output).

Let's get coding.

We have created a boilerplate template on Glitch.com and Codepen.io to start from, which you can clone as your base state for this codelab in just one click.

On Glitch, simply click the "remix this" button to fork it and make a new set of files you can edit. On Codepen, click "fork" in the bottom right of the screen.

This very simple skeleton provides us with the following files:

  1. HTML page (index.html)
  2. Stylesheet (style.css)
  3. File to write our JavaScript code (script.js)

For your convenience, we have also added an import for the TensorFlow.js library to the HTML file, which looks like this:

index.html

<!-- Import TensorFlow.js library -->
<script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs/dist/tf.min.js" type="text/javascript"></script>

Alternative: Use your preferred web editor or work locally

If you want to download the code and work locally, or on a different online editor, simply create the three files named above in the same directory and copy and paste the code from our Glitch boilerplate into each of them.

What's our starting point?

All prototypes require some basic HTML scaffolding upon which to render our findings. Let's set that up now.

We are going to add:

  • A title for the page
  • Some descriptive text
  • A button to enable the webcam
  • A video tag to render the webcam stream to

Let's go back to index.html and paste over the existing code with the following to set up the above features:

index.html

<!DOCTYPE html>
<html lang="en">
  <head>
    <title>Multiple object detection using pre trained model in TensorFlow.js</title>
    <meta charset="utf-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1">
    <!-- Import the webpage's stylesheet -->
    <link rel="stylesheet" href="style.css">
  </head>  
  <body>
    <h1>Multiple object detection using pre trained model in TensorFlow.js</h1>

    <p>Wait for the model to load before clicking the button to enable the webcam - at which point it will become visible to use.</p>
    
    <section id="demos" class="invisible">

      <p>Hold some objects up close to your webcam to get a real-time classification! When ready click "enable webcam" below and accept access to the webcam when the browser asks (check the top left of your window)</p>
      
      <div id="liveView" class="camView">
        <button id="webcamButton">Enable Webcam</button>
        <video id="webcam" autoplay width="640" height="480"></video>
      </div>
    </section>

    <!-- Import TensorFlow.js library -->
    <script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs/dist/tf.min.js" type="text/javascript"></script>
    <!-- Load the coco-ssd model to use to recognize things in images -->
    <script src="https://cdn.jsdelivr.net/npm/@tensorflow-models/coco-ssd"></script>
    
    <!-- Import the page's JavaScript to do some stuff -->
    <script src="script.js" defer></script>
  </body>
</html>

Breaking it down

Let's break some of the above HTML code down to highlight some key things we added.

As you can see, we have simply added an <h1> tag and some <p> tags for the header and some information about how to use the page. Nothing special here.

We also have a section tag that represents our demo space:

index.html

    <section id="demos" class="invisible">

      <p>Hold some objects up close to your webcam to get a real-time classification! When ready click "enable webcam" below and accept access to the webcam when the browser asks (check the top left of your window)</p>
      
      <div id="liveView" class="webcam">
        <button id="webcamButton">Enable Webcam</button>
        <video id="webcam" autoplay width="640" height="480"></video>
      </div>
    </section>

We initially give this section a class of "invisible" so we can visually illustrate to the user when the model is ready to be used with the webcam and it is safe to click the "enable webcam" button. We will style this shortly in our CSS.

Also in this section you can see we added a button which we will use to enable the webcam, and a video tag to which we shall stream our webcam input. We will set this up shortly in our JavaScript code.

If you preview the output right now it should look something like this:

b1bfb8c3de68845c.png

Element defaults

First, let's add styles for the HTML elements we just added to ensure they render correctly.

style.css

body {
  font-family: helvetica, arial, sans-serif;
  margin: 2em;
  color: #3D3D3D;
}

h1 {
  font-style: italic;
  color: #FF6F00;
}

video {
  display: block;
}

section {
  opacity: 1;
  transition: opacity 500ms ease-in-out;
}

Next, let's add some useful CSS classes to help us with the various states of our user interface, such as when we want to hide the button, or make the demo area greyed out if the model is not ready yet.

style.css

.removed {
  display: none;
}

.invisible {
  opacity: 0.2;
}

.camView {
  position: relative;
  float: left;
  width: calc(100% - 20px);
  margin: 10px;
  cursor: pointer;
}

.camView p {
  position: absolute;
  padding: 5px;
  background-color: rgba(255, 111, 0, 0.85);
  color: #FFF;
  border: 1px dashed rgba(255, 255, 255, 0.7);
  z-index: 2;
  font-size: 12px;
}

.highlighter {
  background: rgba(0, 255, 0, 0.25);
  border: 1px dashed #fff;
  z-index: 1;
  position: absolute;
}

Great! That's all we need. If you successfully overwrote your styles with the two pieces of code above, your live preview should now look like this:

336899a78cf80fcb.png

Note how the demo area text and the button are all greyed out as the HTML by default has the class "invisible" applied. We will use JavaScript to remove this class once the model is ready to be used.

Referencing key DOM elements

First, let's ensure we can access key parts of the page we will need to manipulate or access later on in our code:

script.js

const video = document.getElementById('webcam');
const liveView = document.getElementById('liveView');
const demosSection = document.getElementById('demos');
const enableWebcamButton = document.getElementById('webcamButton');

Checking for webcam support

We can now add some helper code to check if the browser we are using supports accessing the webcam stream via getUserMedia:

script.js

// Check if webcam access is supported.
function getUserMediaSupported() {
  return !!(navigator.mediaDevices &&
    navigator.mediaDevices.getUserMedia);
}

// If webcam supported, add event listener to button for when user
// wants to activate it to call enableCam function which we will 
// define in the next step.
if (getUserMediaSupported()) {
  enableWebcamButton.addEventListener('click', enableCam);
} else {
  console.warn('getUserMedia() is not supported by your browser');
}

// Placeholder function for next step. Paste over this in the next step.
function enableCam(event) {
}

Fetching the webcam stream

Next we can fill out the code for the previously empty enableCam function we defined above.

script.js

// Enable the live webcam view and start classification.
function enableCam(event) {
  // Only continue if the COCO-SSD model has finished loading.
  if (!model) {
    return;
  }
  
  // Hide the button once clicked.
  event.target.classList.add('removed');  
  
  // getUserMedia parameters to force video but not audio.
  const constraints = {
    video: true
  };

  // Activate the webcam stream.
  navigator.mediaDevices.getUserMedia(constraints).then(function(stream) {
    video.srcObject = stream;
    video.addEventListener('loadeddata', predictWebcam);
  });
}

Finally, let's add some temporary code so we can test that the webcam is working. The code below pretends our model has loaded and enables the camera button so we can click it. We will replace all of this code in the next step, so be prepared to delete it again in a moment:

script.js

// Placeholder function for next step.
function predictWebcam() {
}

// Pretend model has loaded so we can try out the webcam code.
var model = true;
demosSection.classList.remove('invisible');

Great! If you ran the code and clicked the button as it currently stands you should see something like this:

bbc80efbd7654c06.jpeg

Loading the model

We are now ready to load our COCO-SSD model. When it has finished initialising, we enable the demo area and button on our web page (paste this code over the temporary code we added at the end of the last step).

script.js

// Store the resulting model in the global scope of our app.
var model = undefined;

// Before we can use the COCO-SSD class we must wait for it to finish
// loading. Machine Learning models can be large and take a moment 
// to get everything needed to run.
// Note: cocoSsd is an external object loaded from our index.html
// script tag import so ignore any warning in Glitch.
cocoSsd.load().then(function (loadedModel) {
  model = loadedModel;
  // Show demo section now model is ready to use.
  demosSection.classList.remove('invisible');
});

Classifying a frame from the webcam

The code below allows us to continuously grab a frame from the webcam stream when the browser is ready and pass it to our model to be classified. We then parse the results, and if a prediction is over a certain level of confidence, we draw a <p> tag at the coordinates that come back and set its text to the name of what we found.

script.js

var children = [];

function predictWebcam() {
  // Now let's start classifying a frame in the stream.
  model.detect(video).then(function (predictions) {
    // Remove any highlighting we did in the previous frame.
    for (let i = 0; i < children.length; i++) {
      liveView.removeChild(children[i]);
    }
    children.splice(0);
    
    // Now let's loop through predictions and draw them to the live view if
    // they have a high confidence score.
    for (let n = 0; n < predictions.length; n++) {
      // If we are over 66% sure we classified it right, draw it!
      if (predictions[n].score > 0.66) {
        const p = document.createElement('p');
        p.innerText = predictions[n].class  + ' - with ' 
            + Math.round(parseFloat(predictions[n].score) * 100) 
            + '% confidence.';
        p.style = 'margin-left: ' + predictions[n].bbox[0] + 'px; margin-top: '
            + (predictions[n].bbox[1] - 10) + 'px; width: ' 
            + (predictions[n].bbox[2] - 10) + 'px; top: 0; left: 0;';

        const highlighter = document.createElement('div');
        highlighter.setAttribute('class', 'highlighter');
        highlighter.style = 'left: ' + predictions[n].bbox[0] + 'px; top: '
            + predictions[n].bbox[1] + 'px; width: ' 
            + predictions[n].bbox[2] + 'px; height: '
            + predictions[n].bbox[3] + 'px;';

        liveView.appendChild(highlighter);
        liveView.appendChild(p);
        children.push(highlighter);
        children.push(p);
      }
    }
    
    // Call this function again to keep predicting when the browser is ready.
    window.requestAnimationFrame(predictWebcam);
  });
}

The really important call in this new code is model.detect(). All of our pre-made models for TensorFlow.js have a function like this (the name may change from model to model, so check the docs for details) that actually performs the machine learning inference.

Inference is simply the act of taking some input and running it through the machine learning model (essentially a lot of mathematical operations), and then providing some results. Our pre-made models return predictions in the form of a JSON object, so they are easy to use.

You can find full details of this detect function in our GitHub documentation for the COCO-SSD model here. One thing to note is that this function does a lot of heavy lifting for you behind the scenes: it can accept any "image like" object as its parameter, such as an image, a video, a canvas, and so on. This is just one of the reasons using our pre-made models can save you a lot of time and effort, as you do not need to write this code yourself and things just work out of the box. In other codelabs you can learn how to go lower level to gain more control and pass data directly to a model yourself, but that is outside the scope of this codelab.
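
As a hedged sketch, assuming a hypothetical <img> element with the id "somePhoto" on the page, the same detect call works on a still image and resolves with an array of predictions shaped like the one in the comment below:

const img = document.getElementById('somePhoto'); // Hypothetical image element.
model.detect(img).then(function (predictions) {
  // Each prediction looks something like:
  // { class: 'dog', score: 0.92, bbox: [x, y, width, height] }
  console.log(predictions);
});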

Running this code should now give you an image that looks something like this:

8f9bad6e49e646b.png

Finally, here is an example of the code running and detecting multiple objects at the same time:

a2c73a72cf976b22.jpeg

Woohoo! You can now imagine how simple it would be to use something like this to create a Nest Cam style device from an old phone that alerts you when it sees a person enter a room, or your dog at the back door. Now it is over to you to take these humble beginnings and turn them into something creative. What will you make?

Congratulations, you have taken your first steps in using TensorFlow.js and machine learning in the web browser!

Recap

In this codelab we:

  1. Learnt the benefits of using TensorFlow.js over other forms of TensorFlow.
  2. Learnt the situations in which you may want to start with a pre-trained machine learning model.
  3. Created a fully working web page that can classify objects in real time using your webcam.

This third step also included:

  1. Creating an HTML skeleton for content
  2. Defining styles for our HTML elements and classes
  3. Setting up our JavaScript scaffolding to interact with the HTML and detect the presence of a webcam
  4. Loading a pre-trained TensorFlow.js model
  5. Using the loaded model to make continuous classifications of the webcam stream and drawing a bounding box around the items we care about.

What's next?

Now that you have a working base to start from, what creative ideas can you come up with to extend this machine learning model boilerplate?

Check out all the objects this model can recognize and think about how you could use that knowledge to perform an action.

Maybe you could add a simple server side layer that uses websockets to deliver a notification to another device when it sees a certain object of your choice. This would be a great way to upcycle an old smartphone and give it a new purpose. The possibilities are endless - happy hacking!
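
As a purely illustrative sketch (the endpoint and message format here are hypothetical, not part of this codelab), the client side of such a notification layer could look something like this:

// Open a WebSocket connection to a hypothetical notification server.
const socket = new WebSocket('wss://example.com/alerts');

// Send a message whenever a person is detected with high confidence.
function notifyIfPersonFound(predictions) {
  const person = predictions.find(function (p) {
    return p.class === 'person' && p.score > 0.66;
  });
  if (person && socket.readyState === WebSocket.OPEN) {
    socket.send(JSON.stringify({ type: 'person-detected', score: person.score }));
  }
}

// You could then call notifyIfPersonFound(predictions) inside predictWebcam().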

Remember to tag us in anything you create using #MadeWithTFJS for a chance to be featured on social media or even showcased at future TensorFlow events! We would love to see what you make.

More TensorFlow.js codelabs to go deeper

Websites to check out