Getting started with Cloud Run jobs

1. Introduction

96d07289bb51daa7.png

Overview

Although Cloud Run services are a good fit for containers that run indefinitely listening for HTTP requests, Cloud Run jobs may be a better fit for containers that run to completion and don't serve requests. For example, processing records from a database, processing a list of files from a Cloud Storage bucket, or a long-running operation, such as calculating Pi, would work well if implemented as a Cloud Run job.

Jobs can't serve requests or listen on a port. This means that, unlike Cloud Run services, jobs should not bundle a web server. Instead, job containers should exit when their work is done.

In Cloud Run jobs, you can run multiple copies of your container in parallel by specifying a number of tasks. Each task represents one running copy of the container. Using multiple tasks is useful if each task can independently process a subset of your data. For example, processing 10,000 records from Cloud SQL or 10,000 files from Cloud Storage could be done faster with 10 tasks, each processing 1,000 records or files in parallel.
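The record-partitioning scheme just described can be sketched in shell. This is only an illustration: CLOUD_RUN_TASK_INDEX and CLOUD_RUN_TASK_COUNT are the environment variables Cloud Run jobs provides to each task, defaulted below so the sketch also runs locally.

```shell
# Sketch only: how each task might compute its slice of 10,000 records.
# CLOUD_RUN_TASK_INDEX and CLOUD_RUN_TASK_COUNT are set by Cloud Run jobs;
# the defaults below are for running the sketch outside Cloud Run.
TASK_INDEX="${CLOUD_RUN_TASK_INDEX:-0}"
TASK_COUNT="${CLOUD_RUN_TASK_COUNT:-10}"
TOTAL_RECORDS=10000

# Each task takes an equal, contiguous chunk of the records.
CHUNK=$(( TOTAL_RECORDS / TASK_COUNT ))
START=$(( TASK_INDEX * CHUNK ))
END=$(( START + CHUNK - 1 ))

echo "Task ${TASK_INDEX} processes records ${START}-${END}"
```

With 10 tasks, task 0 would process records 0-999, task 1 records 1000-1999, and so on; your own code decides how to map an index to a data subset.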

Jobs Workflow

Cloud Run jobs are simple to use, with only two steps involved:

  1. Create a job. This encapsulates all the configuration needed to run the job, such as the container image, region, and environment variables.
  2. Run the job. This creates a new execution of the job. Optionally, set up your job to run on a schedule using Cloud Scheduler.

Preview Limitations

During preview, Cloud Run jobs have the following constraints:

  • A maximum of 50 executions (from the same or different jobs) can run concurrently per project per region.
  • You can view your existing jobs, start executions, and monitor execution status from the Cloud Run Jobs page in the Cloud Console. Use gcloud to create new jobs, as the Cloud Console doesn't currently support creating them.
  • Don't use Cloud Run jobs for production workloads. There is no reliability or performance guarantee. Cloud Run jobs may change in backward-incompatible ways with little notice before GA.

In this codelab, you first explore a Node.js application to take screenshots of web pages and store them to Cloud Storage. You then build a container image for the application, run it as a job on Cloud Run, update the job to process more web pages, and run the job on a schedule with Cloud Scheduler.

What you'll learn

  • How to use an app to take screenshots of web pages.
  • How to build a container image for the application.
  • How to create a Cloud Run job for the application.
  • How to run the application as a Cloud Run job.
  • How to update the job.
  • How to schedule the job with Cloud Scheduler.

2. Setup and Requirements

Self-paced environment setup

  1. Sign in to the Google Cloud Console and create a new project or reuse an existing one. If you don't already have a Gmail or Google Workspace account, you must create one.

b35bf95b8bf3d5d8.png

a99b7ace416376c4.png

bd84a6d3004737c5.png

  • The Project name is the display name for this project's participants. It is a character string not used by Google APIs, and you can update it at any time.
  • The Project ID must be unique across all Google Cloud projects and is immutable (it cannot be changed after it has been set). The Cloud Console auto-generates a unique string; usually you don't need to care what it is. In most codelabs, you'll need to reference the Project ID (typically identified as PROJECT_ID). If you don't like the generated one, you can generate another random one, or you can try your own and see if it's available. Once the project is created, the Project ID is frozen.
  • There is a third value, a Project Number which some APIs use. Learn more about all three of these values in the documentation.
  2. Next, you'll need to enable billing in the Cloud Console in order to use Cloud resources/APIs. Running through this codelab shouldn't cost much, if anything at all. To shut down resources so you don't incur billing beyond this tutorial, follow any "clean-up" instructions found at the end of the codelab. New users of Google Cloud are eligible for the $300 USD Free Trial program.

Start Cloud Shell

While Google Cloud can be operated remotely from your laptop, in this codelab you will be using Google Cloud Shell, a command line environment running in the Cloud.

From the Google Cloud Console, click the Cloud Shell icon on the top right toolbar:

55efc1aaa7a4d3ad.png

It should only take a few moments to provision and connect to the environment. When it is finished, you should see something like this:

7ffe5cbb04455448.png

This virtual machine is loaded with all the development tools you'll need. It offers a persistent 5 GB home directory and runs on Google Cloud, greatly enhancing network performance and authentication. All of your work in this lab can be done with just a browser.

Set up gcloud

In Cloud Shell, set your project ID and the region you want to deploy the Cloud Run job to. Save them as PROJECT_ID and REGION variables. In the future, you will be able to pick a region from one of the Cloud Run locations. For preview, only europe-west9 is supported.

PROJECT_ID=[YOUR-PROJECT-ID]
REGION=europe-west9
gcloud config set core/project $PROJECT_ID
gcloud config set run/region $REGION

Enable APIs

Enable all necessary services:

gcloud services enable \
  artifactregistry.googleapis.com \
  cloudbuild.googleapis.com \
  run.googleapis.com

3. Get the code

You first explore a Node.js application to take screenshots of web pages and store them to Cloud Storage. Later, you build a container image for the application and run it as a job on Cloud Run.

From the Cloud Shell, run the following command to clone the application code from this repo:

git clone https://github.com/GoogleCloudPlatform/jobs-demos.git

Go to the directory containing the application:

cd jobs-demos/screenshot

You should see this file layout:

screenshot
├── Dockerfile
├── README.md
├── screenshot.js
└── package.json

Here's a brief description of each file:

  • screenshot.js contains the Node.js code for the application.
  • package.json defines the library dependencies.
  • Dockerfile defines the container image.

4. Explore the code

To explore the code, use the built-in text editor by clicking the Open Editor button at the top of the Cloud Shell window.

f78880c00c0af1ef.png

Here's a brief explanation of each file.

screenshot.js

screenshot.js first adds Puppeteer and Cloud Storage as dependencies. Puppeteer is a Node.js library you use to take screenshots of web pages:

const puppeteer = require('puppeteer');
const {Storage} = require('@google-cloud/storage');

There is an initBrowser function to initialize Puppeteer and a takeScreenshot function to take a screenshot of a given URL:

async function initBrowser() {
  console.log('Initializing browser');
  return await puppeteer.launch();
}

async function takeScreenshot(browser, url) {
  const page = await browser.newPage();

  console.log(`Navigating to ${url}`);
  await page.goto(url);

  console.log(`Taking a screenshot of ${url}`);
  return await page.screenshot({
    fullPage: true
  });
}

Next, there is a function to get or create a Cloud Storage bucket and another one to upload the screenshot of a webpage to a bucket:

async function createStorageBucketIfMissing(storage, bucketName) {
  console.log(`Checking for Cloud Storage bucket '${bucketName}' and creating if not found`);
  const bucket = storage.bucket(bucketName);
  const [exists] = await bucket.exists();
  if (exists) {
    // Bucket exists, nothing to do here
    return bucket;
  }

  // Create bucket
  const [createdBucket] = await storage.createBucket(bucketName);
  console.log(`Created Cloud Storage bucket '${createdBucket.name}'`);
  return createdBucket;
}

async function uploadImage(bucket, taskIndex, imageBuffer) {
  // Create filename using the current time and task index
  const date = new Date();
  date.setMinutes(date.getMinutes() - date.getTimezoneOffset());
  const filename = `${date.toISOString()}-task${taskIndex}.png`;

  console.log(`Uploading screenshot as '${filename}'`);
  await bucket.file(filename).save(imageBuffer);
}

Finally, the main function is the entry point:

async function main(urls) {
  console.log(`Passed in urls: ${urls}`);

  const taskIndex = process.env.CLOUD_RUN_TASK_INDEX || 0;
  const url = urls[taskIndex];
  if (!url) {
    throw new Error(`No url found for task ${taskIndex}. Ensure at least ${parseInt(taskIndex, 10) + 1} url(s) have been specified as command args.`);
  }
  const bucketName = process.env.BUCKET_NAME;
  if (!bucketName) {
    throw new Error('No bucket name specified. Set the BUCKET_NAME env var to specify which Cloud Storage bucket the screenshot will be uploaded to.');
  }

  const browser = await initBrowser();
  const imageBuffer = await takeScreenshot(browser, url).catch(async err => {
    // Make sure to close the browser if we hit an error.
    await browser.close();
    throw err;
  });
  await browser.close();

  console.log('Initializing Cloud Storage client')
  const storage = new Storage();
  const bucket = await createStorageBucketIfMissing(storage, bucketName);
  await uploadImage(bucket, taskIndex, imageBuffer);

  console.log('Upload complete!');
}

main(process.argv.slice(2)).catch(err => {
  console.error(JSON.stringify({severity: 'ERROR', message: err.message}));
  process.exit(1);
});

Notice the following about the main function:

  • URLs are passed as arguments.
  • The bucket name is passed in as the user-defined BUCKET_NAME environment variable. Bucket names must be globally unique across Google Cloud.
  • A CLOUD_RUN_TASK_INDEX environment variable is passed by Cloud Run jobs. Cloud Run jobs can run multiple copies of the application as unique tasks. CLOUD_RUN_TASK_INDEX represents the index of the running task. It defaults to zero when the code is run outside of Cloud Run jobs. When the application is run as multiple tasks, each task/container picks up the URL it's responsible for, takes a screenshot, and saves the image to the bucket.
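The task-to-URL mapping described above can be sketched in shell. This is only an illustration of what main() does, not part of the application; the URLs match the ones used later in this codelab, and the index is defaulted to 0 as it would be outside Cloud Run jobs.

```shell
# Sketch only: shell equivalent of how main() maps CLOUD_RUN_TASK_INDEX
# to a URL. Outside Cloud Run jobs the index defaults to 0.
TASK_INDEX="${CLOUD_RUN_TASK_INDEX:-0}"

# The job's argument list, one URL per task.
set -- "https://example.com" "https://cloud.google.com"
shift "$TASK_INDEX"   # drop the URLs belonging to earlier tasks
URL="$1"              # this task's URL

echo "Task ${TASK_INDEX} takes a screenshot of ${URL}"
```

When Cloud Run runs this as two tasks, task 0 handles example.com and task 1 handles cloud.google.com.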

package.json

The package.json file defines the application and specifies the dependencies for Cloud Storage and Puppeteer:

{
  "name": "screenshot",
  "version": "1.0.0",
  "description": "Create a job to capture screenshots",
  "main": "screenshot.js",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "author": "Google LLC",
  "license": "Apache-2.0",
  "dependencies": {
    "@google-cloud/storage": "^5.18.2",
    "puppeteer": "^13.5.1"
  }
}

Dockerfile

The Dockerfile defines the container image for the application with all the required libraries and dependencies:

FROM node:17-alpine

# Installs latest Chromium (92) package.
RUN apk add --no-cache \
      chromium \
      nss \
      freetype \
      harfbuzz \
      ca-certificates \
      ttf-freefont \
      nodejs \
      npm

# Tell Puppeteer to skip installing Chrome. We'll be using the installed package.
ENV PUPPETEER_SKIP_CHROMIUM_DOWNLOAD=true \
    PUPPETEER_EXECUTABLE_PATH=/usr/bin/chromium-browser

# Add user so we don't need --no-sandbox.
RUN addgroup -S pptruser && adduser -S -g pptruser pptruser \
    && mkdir -p /home/pptruser/Downloads /app \
    && chown -R pptruser:pptruser /home/pptruser \
    && chown -R pptruser:pptruser /app

# Install dependencies
COPY package*.json ./
RUN npm install

# Copy all files
COPY . .

# Run everything after as a non-privileged user.
USER pptruser

ENTRYPOINT ["node", "screenshot.js"]

5. Build and publish the container image

Artifact Registry is the container image storage and management service on Google Cloud. See Working with container images for more information. Artifact Registry can store Docker and OCI container images in a Docker repository.

Create a new Artifact Registry repository called containers:

gcloud artifacts repositories create containers --repository-format=docker --location=$REGION

Build and publish the container image:

gcloud builds submit -t $REGION-docker.pkg.dev/$PROJECT_ID/containers/screenshot:v1

After a couple of minutes, you should see the container image built and hosted on Artifact Registry.

62e50ebe805f9a9c.png

6. Create a job

Before creating a job, you need to create a service account that you will use to run the job.

gcloud iam service-accounts create screenshot-sa --display-name="Screenshot app service account"

Grant the storage.admin role to the service account so it can create buckets and objects:

gcloud projects add-iam-policy-binding $PROJECT_ID \
  --role roles/storage.admin \
  --member serviceAccount:screenshot-sa@$PROJECT_ID.iam.gserviceaccount.com

You're now ready to create a Cloud Run job that includes the configuration needed to run the job.

gcloud beta run jobs create screenshot \
  --image=$REGION-docker.pkg.dev/$PROJECT_ID/containers/screenshot:v1 \
  --args="https://example.com" \
  --args="https://cloud.google.com" \
  --tasks=2 \
  --task-timeout=5m \
  --set-env-vars=BUCKET_NAME=screenshot-$PROJECT_ID \
  --service-account=screenshot-sa@$PROJECT_ID.iam.gserviceaccount.com

This creates a Cloud Run job without running it.

Notice how the web pages are passed in as arguments. The bucket name to save the screenshots is passed in as an environment variable.

You can run multiple copies of your container in parallel by specifying a number of tasks with the --tasks flag. Each task represents one running copy of the container. Using multiple tasks is useful if each task can independently process a subset of your data. To facilitate this, each task is aware of its index, which is stored in the CLOUD_RUN_TASK_INDEX environment variable. Your code is responsible for determining which task handles which subset of the data. Notice --tasks=2 in this sample: it ensures that two containers run, one for each of the two URLs we want to process.

Each task can run for up to 1 hour. You can decrease this timeout using the --task-timeout flag, as we did in this example. All tasks need to succeed in order for the job to successfully complete. By default, failed tasks are not retried. You can configure tasks to be retried when they fail. If any task exceeds its number of retries, the whole job fails.

By default, your job will run with as many tasks in parallel as possible, up to the number of tasks for your job and a maximum of 100. You may want to set parallelism lower for jobs that access a backend with limited scalability, such as a database that supports a limited number of active connections. You can lower parallelism with the --parallelism flag.
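As a sketch, retries and parallelism could be configured together when updating the job; the specific values below are illustrative, not recommendations:

```
# Illustrative values only: retry each failed task up to 3 times and
# run at most 1 task at a time against a constrained backend.
gcloud beta run jobs update screenshot \
  --max-retries=3 \
  --parallelism=1
```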

7. Run a job

Before running the job, list the job to see that it has been created:

gcloud beta run jobs list

✔
JOB: screenshot
REGION: $REGION
LAST RUN AT:
CREATED: 2022-02-22 12:20:50 UTC

Run the job with the following command:

gcloud beta run jobs execute screenshot

This executes the job. You can list current and past executions:

gcloud beta run jobs executions list --job screenshot

...
JOB: screenshot
EXECUTION: screenshot-znkmm
REGION: $REGION
RUNNING: 1
COMPLETE: 1 / 2
CREATED: 2022-02-22 12:40:42 UTC

Describe the execution. You should see a green checkmark and a message saying the tasks completed successfully:

gcloud beta run jobs executions describe screenshot-znkmm
✔ Execution screenshot-znkmm in region $REGION
2 tasks completed successfully


Image:           $REGION-docker.pkg.dev/$PROJECT_ID/containers/screenshot at 311b20d9...
Tasks:           2
Args:            https://example.com https://cloud.google.com
Memory:          1Gi
CPU:             1000m
Task Timeout:    3600s
Parallelism:     2
Service account: 11111111-compute@developer.gserviceaccount.com
Env vars:
  BUCKET_NAME    screenshot-$PROJECT_ID

You can also check Cloud Run jobs page of Cloud Console to see the status:

e59ed4e532b974b1.png

If you check the Cloud Storage bucket, you should see the two screenshot files created:

f2f86e60b94ba47c.png

Sometimes you may need to stop an execution before it completes - perhaps because you realized you need to run the job with different parameters or there's an error in the code, and you don't want to use unnecessary compute time.

To stop an execution of your job, you need to delete the execution:

gcloud beta run jobs executions delete screenshot-znkmm

8. Update a job

New versions of your container image are not automatically picked up by the next execution of a Cloud Run job. If you change the code for your job, you need to rebuild the container and update your job. Using tagged images helps you identify which version of the image is currently in use.

Similarly, you also need to update the job if you want to update some of the configuration variables. Subsequent executions of the job will use the new container and configuration settings.
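For example, after a code change you might rebuild the image under a new tag and point the job at it. The v2 tag here is illustrative:

```
# Illustrative: rebuild the image as v2, then update the job to use it.
gcloud builds submit -t $REGION-docker.pkg.dev/$PROJECT_ID/containers/screenshot:v2

gcloud beta run jobs update screenshot \
  --image=$REGION-docker.pkg.dev/$PROJECT_ID/containers/screenshot:v2
```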

Update the job and change the pages that the app takes screenshots of in the --args flag. Also update the --tasks flag to reflect the number of pages.

gcloud beta run jobs update screenshot \
  --args="https://www.pinterest.com" \
  --args="https://www.apartmenttherapy.com" \
  --args="https://www.google.com" \
  --tasks=3

Run the job again. This time pass in the --wait flag to wait for executions to finish:

gcloud beta run jobs execute screenshot --wait

After a few seconds, you should see 3 more screenshots added to the bucket:

ce91c96dcfd271bb.png

9. Schedule a job

So far, this codelab has shown running jobs manually. In a real-world scenario, you probably want to run jobs in response to an event or on a schedule. You can do so via the Cloud Run REST API. Let's see how to run the screenshot job on a schedule using Cloud Scheduler.

First, make sure the Cloud Scheduler API is enabled:

gcloud services enable cloudscheduler.googleapis.com

Create a Cloud Scheduler job to run the Cloud Run job every day at 9:00:

PROJECT_NUMBER="$(gcloud projects describe $(gcloud config get-value project) --format='value(projectNumber)')"

gcloud scheduler jobs create http screenshot-scheduled --schedule "0 9 * * *" \
   --http-method=POST \
   --uri=https://$REGION-run.googleapis.com/apis/run.googleapis.com/v1/namespaces/$PROJECT_ID/jobs/screenshot:run \
   --oauth-service-account-email=$PROJECT_NUMBER-compute@developer.gserviceaccount.com \
   --location $REGION

Verify that the Cloud Scheduler job is created and ready to call the Cloud Run job:

gcloud scheduler jobs list

ID: screenshot-scheduled
LOCATION: $REGION
SCHEDULE (TZ): 0 9 * * * (Etc/UTC)
TARGET_TYPE: HTTP
STATE: ENABLED

To test it, manually trigger the Cloud Scheduler job:

gcloud scheduler jobs run screenshot-scheduled

In a few seconds, you should see 3 more screenshots added by the call from Cloud Scheduler:

971ea598020cf9ba.png

10. Congratulations

Congratulations, you finished the codelab!

What we've covered

  • How to use an app to take screenshots of web pages.
  • How to build a container image for the application.
  • How to create a Cloud Run job for the application.
  • How to run the application as a Cloud Run job.
  • How to update the job.
  • How to schedule the job with Cloud Scheduler.