How to Run Transformers.js on Cloud Run GPUs

1. Introduction

Overview

Cloud Run recently added GPU support. It's available as a waitlisted public preview. If you're interested in trying out the feature, fill out this form to join the waitlist. Cloud Run is a container platform on Google Cloud that makes it straightforward to run your code in a container, without requiring you to manage a cluster.

Today, the GPUs we make available are NVIDIA L4 GPUs with 24 GB of VRAM. There's one GPU per Cloud Run instance, and Cloud Run autoscaling still applies. That includes scaling out up to 5 instances (with a quota increase available), as well as scaling down to zero instances when there are no requests.

Transformers.js is designed to be functionally equivalent to Hugging Face's transformers Python library, meaning you can run the same pretrained models using a very similar API. You can learn more on the Transformers.js website.
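For example, the pipeline API mirrors its Python counterpart. Here's a minimal sketch of what that looks like (illustrative only; it uses the library's default model for the task, which is downloaded on first use):

import { pipeline } from '@huggingface/transformers';

// Create a sentiment-analysis pipeline using the task's default model.
const classifier = await pipeline('sentiment-analysis');

// Run inference with the same call pattern as the Python library.
const result = await classifier('Cloud Run now supports GPUs!');
console.log(result); // e.g. [{ label: 'POSITIVE', score: 0.99 }]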

In this codelab, you will create and deploy an app to Cloud Run that uses Transformers.js and GPUs.

What you'll learn

  • How to run an app using Transformers.js on Cloud Run using GPUs

2. Enable APIs and Set Environment Variables

Before you start this codelab, you need to enable several APIs: Cloud Run, Cloud Storage, Cloud Build, and Artifact Registry (used in step 4 to store the container image). You can enable them by running the following command:

gcloud services enable run.googleapis.com \
    storage.googleapis.com \
    cloudbuild.googleapis.com \
    artifactregistry.googleapis.com

Then you can set environment variables that will be used throughout this codelab.

PROJECT_ID=<YOUR_PROJECT_ID>

AR_REPO_NAME=repo
REGION=us-central1
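If you've already set a default project in your gcloud configuration, you can populate the variable from there instead of typing it by hand:

PROJECT_ID=$(gcloud config get-value project)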

3. Create the Transformers.js app

First, create a directory for the source code and cd into that directory.

mkdir transformers-js-codelab && cd $_

Create a package.json file. The app uses the v3 alpha of Transformers.js (@huggingface/transformers), which adds GPU support via the device option.

{
  "name": "huggingface",
  "version": "1.0.0",
  "main": "index.js",
  "type": "module",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "keywords": [],
  "author": "",
  "license": "ISC",
  "description": "",
  "dependencies": {
    "@huggingface/transformers": "^3.0.0-alpha.8",
    "express": "^4.17.1"
  }
}

Create a file called index.js.

import { pipeline } from '@huggingface/transformers';

import express from 'express';

// make sure the text-generation pipeline is created first
// before anyone can access the routes
const generator = await pipeline('text-generation', 'Xenova/llama2.c-stories15M', {
    device: 'cuda',
    dtype: 'fp32',
});

// now create the app and routes
const app = express();

app.get('/', async (req, res) => {
  const text = 'A long time ago in a galaxy far far away,';
  const output = await generator(text, { max_new_tokens: 50 });
  res.send(output);
});

const port = parseInt(process.env.PORT) || 8080;
app.listen(port, () => {
  console.log(`transformers-js app: listening on port ${port}`);
});

Create a Dockerfile. The Dockerfile installs the CUDA toolkit and cuDNN libraries that Transformers.js needs in order to use the GPU.

FROM node:20
WORKDIR /usr/src/app

# Install the CUDA toolkit and cuDNN from NVIDIA's package repository.
RUN apt-get update && \
 apt-get install software-properties-common -y && \
 wget https://developer.download.nvidia.com/compute/cuda/repos/debian12/x86_64/cuda-keyring_1.1-1_all.deb && \
 dpkg -i cuda-keyring_1.1-1_all.deb && \
 add-apt-repository contrib && \
 apt-get update && \
 apt-get -y install cuda-toolkit-12-6 && \
 apt-get -y install cudnn-cuda-12

EXPOSE 8080
# Copy package.json and install dependencies first so this layer is cached.
COPY package.json .

RUN npm install

COPY index.js .
ENTRYPOINT ["node", "index.js"]
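Optionally, you can verify that the image builds before submitting it to Cloud Build. Note that the CUDA toolkit layers make this a large image, and running the container locally with device: 'cuda' only works on a machine with a compatible NVIDIA GPU:

docker build -t gpu-transformers-js .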

4. Build and deploy the Cloud Run service

Create a repository in Artifact Registry.

gcloud artifacts repositories create $AR_REPO_NAME \
  --repository-format docker \
  --location $REGION

Submit your code to Cloud Build.

IMAGE=$REGION-docker.pkg.dev/$PROJECT_ID/$AR_REPO_NAME/gpu-transformers-js
gcloud builds submit --tag $IMAGE
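Once the build finishes, you can confirm the image landed in Artifact Registry:

gcloud artifacts docker images list $REGION-docker.pkg.dev/$PROJECT_ID/$AR_REPO_NAME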

Next, deploy the service to Cloud Run:

gcloud beta run deploy transformers-js-codelab \
  --image=$IMAGE \
  --cpu=8 --memory=32Gi \
  --gpu=1 --no-cpu-throttling --gpu-type=nvidia-l4 \
  --allow-unauthenticated \
  --region=$REGION \
  --project=$PROJECT_ID \
  --max-instances=1
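The first instance start includes downloading the model, so deployment and the first cold start can take a while. If you want to watch the startup, you can read the service logs (available in recent versions of gcloud):

gcloud run services logs read transformers-js-codelab --region $REGION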

5. Test the service

You can test the service by running the following:

SERVICE_URL=$(gcloud run services describe transformers-js-codelab --region $REGION --format 'value(status.url)')

curl $SERVICE_URL

and you will see something similar to the following:

[{"generated_text":"A long time ago in a galaxy far far away, there was a beautiful garden. Every day, the little girl would go to the garden and look at the flowers. She loved the garden so much that she would come back every day to visit it.\nOne day, the little girl was walking through"}]

6. Congratulations!

Congratulations on completing the codelab!

We recommend reviewing the documentation on Cloud Run GPUs.

What we've covered

  • How to run an app using Transformers.js on Cloud Run using GPUs

7. Clean up

To avoid inadvertent charges (for example, if the Cloud Run service is invoked more times than your monthly Cloud Run invocation allocation in the free tier), you can either delete the Cloud Run service or delete the project you used for this codelab.

To delete the Cloud Run service, go to the Cloud Run page in the Cloud Console at https://console.cloud.google.com/run and delete the transformers-js-codelab service.
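Alternatively, you can delete the service from the command line:

gcloud run services delete transformers-js-codelab --region $REGION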

If you choose to delete the entire project, you can go to https://console.cloud.google.com/cloud-resource-manager, select the project you used for this codelab, and choose Delete. If you delete the project, you'll need to change projects in your Cloud SDK. You can view the list of all available projects by running gcloud projects list.
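You can also delete the project from the command line:

gcloud projects delete $PROJECT_ID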