1. Overview
Confidential Space (CS) provides a secure, attested, and encrypted environment to process sensitive data. Relying on standalone VM instances creates operational overhead, as manual orchestration lacks the scalability required for mission-critical services. Without automated orchestration, performing synchronized rolling updates or deploying new OS images across a fleet becomes technically difficult and prone to downtime.
In this codelab, you will learn how to deploy a Confidential Space workload on a Managed Instance Group (MIG). You will also learn how to enable autohealing using health checks, autoscaling based on CPU utilization, and rolling updates for both OS images and workloads.
The processes shown in this codelab should aid you in setting up your own production-ready and secure Confidential Space for mission-critical, long-running deployments.
What you'll learn
- How to create a specialized instance template for Confidential Space.
- How to use Google Compute Engine to configure MIGs and instance groups.
- How to create a firewall rule and Health Check for autohealing.
- How to configure a zonal MIG with the template and health check.
- How to set up autoscaling for the MIG.
- How to set up scripted, one-click image updates on MIGs, covering both workload images and new Confidential Space OS image releases.
What you'll need
- A Google Cloud project with billing enabled.
- Familiarity with text editors, Docker deployments, and Bash scripting.
- The gcloud command-line tool installed and authenticated.
- Basic understanding of Compute Engine, Confidential Space, IAM, Confidential VMs, container technology, remote repositories, service accounts, Cloud Run, and Cloud Scheduler.
- A Confidential Space workload container image already built and pushed to Artifact Registry.
2. How Confidential Space works with MIGs
Using a Managed Instance Group (MIG) to deploy a Confidential Space workload makes a secure application more robust, scalable and easier to operate.
The security and operational needs of a production service are logically divided between the two components. Confidential Space provides the necessary security by running the workload in a highly isolated, encrypted and attested environment called a Trusted Execution Environment (TEE). In contrast, MIGs provide the essential operational capabilities required for running the secure application at scale, similar to that of Kubernetes. MIGs eliminate the risks inherent in running a mission-critical workload on a single VM, which can be potentially slow or susceptible to failure. This combination ensures both data protection and system reliability. This solution ensures High Availability and Autohealing because the workload runs across multiple VMs in a pool. If one VM crashes, the service remains fully functional due to load balancing and the presence of the remaining instances.
Furthermore, MIGs utilize configurable health checks to constantly monitor the operational status of the VMs. If an instance is found to be unhealthy, the MIG automatically replaces it with a new, healthy VM, thereby guaranteeing continuous operation.
MIGs also deliver effective scalability for users with its autoscaling feature. This capability provides an automatic way to manage capacity without manual intervention, fulfilling the need to flexibly add or remove capacity based on utilization.
Finally, MIGs enable Zero-Downtime Updates through Rolling Updates. A key benefit is the "one-click upgrade" ability for the underlying Confidential Space OS image or the application's container image (or both), all without causing any service downtime. The MIG manages this change by gradually replacing the older instances with newer ones running the updated image, ensuring constant availability throughout the deployment. Note that your application may need to be backwards compatible in order to support this type of gradual upgrade.
3. Setting Up Cloud Resources
Before you begin
- Set up a Google Cloud project. For more information on creating a Google Cloud project, please refer to the "Set up and navigate your first Google project" codelab. You can refer to creating and managing projects to get details about how to retrieve the project ID and how it is different from project name and project number.
- Enable Billing for your projects.
- In your Google Cloud project's Cloud Shell, set the required project environment variable as shown below.
export CURRENT_PROJECT_ID=<Google Cloud project id of current project>
- Enable Confidential Computing API and the following APIs for your project.
gcloud config set project $CURRENT_PROJECT_ID
gcloud services enable \
cloudapis.googleapis.com \
container.googleapis.com \
artifactregistry.googleapis.com \
confidentialcomputing.googleapis.com \
compute.googleapis.com \
logging.googleapis.com \
run.googleapis.com \
cloudscheduler.googleapis.com
- In your Google Cloud project's Cloud Shell, clone the Confidential Space Codelab GitHub repository using the command below to get the scripts needed to complete this codelab.
git clone https://github.com/GoogleCloudPlatform/confidential-space.git
- Change to the scripts directory for the instance group codelab.
cd confidential-space/codelabs/mig_cs_codelab/scripts
- Update the project ID line in config_env.sh to reflect that of your chosen project.
- Optionally, override resource names by setting variables that point at pre-existing cloud resources. If one of the following variables is set, the corresponding existing cloud resource in the project will be used; if it is not set, the resource name will come from the config_env.sh script.
- Run the config_env.sh script to set the remaining variables for this project, deriving resource names from the project ID.
source config_env.sh
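As an illustration of the override behavior described above, the pattern can be sketched like this (the variable defaults and exact names here are hypothetical; check the actual config_env.sh in the repository for the real ones):

```shell
#!/bin/bash
# Hypothetical sketch of the override pattern in config_env.sh.
# A variable you export beforehand wins; otherwise a default name is
# derived from the project ID.
CURRENT_PROJECT_ID="${CURRENT_PROJECT_ID:-my-project}"  # normally set earlier in Cloud Shell

export CURRENT_PROJECT_ZONE="${CURRENT_PROJECT_ZONE:-us-central1-a}"
export CURRENT_MIG_NAME="${CURRENT_MIG_NAME:-${CURRENT_PROJECT_ID}-mig}"
export TEMPLATE_NAME="${TEMPLATE_NAME:-${CURRENT_PROJECT_ID}-template}"
export HEALTH_CHECK_NAME="${HEALTH_CHECK_NAME:-${CURRENT_PROJECT_ID}-health-check}"

echo "MIG name: ${CURRENT_MIG_NAME}"
```

Exporting, say, `CURRENT_MIG_NAME` before sourcing the script would make every later command operate on your existing MIG instead of creating a new one.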
- Grant IAM roles for the project. Roles can be granted by following the details on the grant an IAM role webpage.
You will need the following roles for this project:
- Artifact Registry Writer
- Cloud Scheduler Admin
- Compute Service Agent
- Confidential Computing Workload User
- Log Writer
- Cloud Run Developer
- Cloud Run Invoker
gcloud config set project $CURRENT_PROJECT_ID
# Add Artifact Registry Writer role
gcloud projects add-iam-policy-binding $CURRENT_PROJECT_ID --member="serviceAccount:${CURRENT_WORKLOAD_SERVICE_ACCOUNT}" --role='roles/artifactregistry.writer'
# Add Confidential Space Workload User
gcloud projects add-iam-policy-binding $CURRENT_PROJECT_ID --member="serviceAccount:${CURRENT_WORKLOAD_SERVICE_ACCOUNT}" --role='roles/confidentialcomputing.workloadUser'
# Add Logging Log Writer
gcloud projects add-iam-policy-binding $CURRENT_PROJECT_ID --member="serviceAccount:${CURRENT_WORKLOAD_SERVICE_ACCOUNT}" --role='roles/logging.logWriter'
# Add Cloud Run Developer
gcloud projects add-iam-policy-binding $CURRENT_PROJECT_ID --member="serviceAccount:${CURRENT_WORKLOAD_SERVICE_ACCOUNT}" --role='roles/run.developer'
# Add Cloud Run Invoker
gcloud projects add-iam-policy-binding $CURRENT_PROJECT_ID --member="serviceAccount:${CURRENT_WORKLOAD_SERVICE_ACCOUNT}" --role='roles/run.invoker'
# Add Cloud Scheduler Admin
gcloud projects add-iam-policy-binding $CURRENT_PROJECT_ID --member="serviceAccount:${CURRENT_WORKLOAD_SERVICE_ACCOUNT}" --role='roles/cloudscheduler.admin'
- Look at test_workload.py.
- Verify the expected output by reviewing the source code; it simply prints the current version of the workload.
- When we first push our workload to Confidential Space and check the output, we should see "version A" printed out.
4. Setting up Workload
You will first need to create a Docker image for the workload used in this codelab. The workload is a simple script that prints out the version of the workload you are currently running. It will print that the workload is starting, then print the version of the workload, sleep for 5 seconds and then print that the workload is finished.
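For reference, the workload's behavior can be sketched as below. This is an illustrative sketch, not the actual test_workload.py from the repository; the exact message strings are assumptions, with "Workload Version A" matching the version string used later in this codelab:

```python
import time

VERSION = "Workload Version A"  # changed to "Workload Version B" for the update demo


def run_workload() -> list[str]:
    """Print the workload lifecycle messages and return them for inspection."""
    lines = ["Workload starting", VERSION, "Workload finished"]
    print(lines[0])   # announce start
    print(lines[1])   # print the current workload version
    time.sleep(5)     # simulate a small amount of work
    print(lines[2])   # announce completion
    return lines


if __name__ == "__main__":
    run_workload()
```

When this runs inside Confidential Space, these print statements are what you will later look for in the VM's serial port output.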
Steps to create Workload
- Run create_workload.sh to create the workload. This script:
- Creates the Artifact Registry repository, owned by the project, where the workload will be published.
- Builds the code and packages it in a Docker image. See the associated Dockerfile configuration for more information.
- Publishes the Docker image to the Artifact Registry repository owned by the project.
- Grants the service account <your service account name> read permissions for the Artifact Registry repository <artifact registry repo name>.
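The script's main steps can be sketched roughly as follows. The repository, project, and image names here are hypothetical, and the commands are printed rather than executed so the sketch is safe to inspect; create_workload.sh in the repository is the authoritative version:

```shell
#!/bin/bash
# Hypothetical names for illustration; create_workload.sh derives the real
# ones from your config_env.sh variables.
REGION="us-central1"
PROJECT="my-project"
REPOSITORY="cs-codelab-repo"
IMAGE="${REGION}-docker.pkg.dev/${PROJECT}/${REPOSITORY}/workload:latest"

# Print each step instead of running it, so the sketch is safe to inspect.
run() { echo "+ $*"; }

run gcloud artifacts repositories create "${REPOSITORY}" \
    --repository-format=docker --location="${REGION}"
run docker build -t "${IMAGE}" .
run docker push "${IMAGE}"
```

The important detail is the `:latest` tag: pushing every new build to the same tag is what later lets a rolling replace pick up new code without changing the instance template.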
5. Setting up an Instance Template and MIG
Steps to Create the Instance Template
You must first create an Instance Template. This template is the required blueprint that the Managed Instance Group (MIG) will use to provision and run your workloads within Confidential Space.
The Instance Template is essential because it defines all the specialized parameters:
- Machine Type: In this example we use a Confidential VM machine type (e.g., n2d-standard-2) that supports AMD SEV confidential computing technology (--confidential-compute-type=SEV).
- VM OS Image: We use the confidential-space-images project and the confidential-space-debug image family to pull the latest Confidential Space operating system image.
- Note: We use the debug image in this guide to facilitate easier troubleshooting. Unlike the production image, the debug version keeps the VM running after your workload finishes and allows SSH access for testing. For production deployments using real-world sensitive data, you must switch to the production image family.
- Workload Reference: The required tee-image-reference line in the metadata contains the specific container image (your application workload) that the Confidential Space VM will launch.
This setup ensures every VM created by the MIG is a properly configured Confidential Space VM, ready to execute your workload.
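Put together, the template creation command looks roughly like the sketch below. The names are hypothetical and the command is printed rather than executed; the codelab's script is the authoritative version:

```shell
#!/bin/bash
# Hypothetical names; the codelab scripts derive the real values.
TEMPLATE_NAME="cs-mig-template"
WORKLOAD_IMAGE="us-central1-docker.pkg.dev/my-project/cs-codelab-repo/workload:latest"

# Print the assembled gcloud command for review instead of running it.
build_template_cmd() {
  echo gcloud compute instance-templates create "${TEMPLATE_NAME}" \
    --machine-type=n2d-standard-2 \
    --confidential-compute-type=SEV \
    --image-project=confidential-space-images \
    --image-family=confidential-space-debug \
    --metadata="tee-image-reference=${WORKLOAD_IMAGE}"
}
build_template_cmd
```

Note how all three specialized parameters from the list above (machine type, OS image family, and workload reference) come together in a single template definition.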
Steps to Create the Managed Instance Group
The next step is to create the Managed Instance Group (MIG) using the template you just defined. The MIG is essential because it automates the deployment, management, and scaling of multiple identical VMs.
The script create_launch_mig.sh accomplishes three main objectives:
1. Create the MIG
- Command: gcloud compute instance-groups managed create ${CURRENT_MIG_NAME}
- Purpose: This command creates the group that will manage your VMs.
- --size 3: Specifies that the MIG should initially create and maintain 3 instances of your workload.
- --template ${TEMPLATE_NAME}: Crucially, it references the Instance Template created earlier, ensuring all 3 instances are configured as Confidential Space VMs running your specific tee-image-reference workload container.
- --zone ${CURRENT_PROJECT_ZONE}: Specifies the deployment location for the instances.
2. Fetch Project Number
- Command: PROJECT_NUMBER=$(gcloud projects describe ${PROJECT_ID} --format="value(projectNumber)")
- Purpose: The script fetches the numerical ID of your project. This number is often required for creating service account roles and permissions, especially for Google-managed service agents.
3. Grant IAM Permissions
- Command: gcloud projects add-iam-policy-binding --role="roles/compute.serviceAgent"
- Purpose: This step grants the Compute Engine Service Agent role to your workload's Service Account (${SERVICE_ACCOUNT}). This permission is important because it allows the Service Account to act on behalf of the project's Compute Engine service, which is often necessary for the MIG's automated features like managing instances, setting up networking, and interacting with other Google Cloud services.
Run create_launch_mig.sh to create the managed instance group.
6. Steps to enable Autohealing and Autoscaling
Setting up Autohealing
To ensure high availability, we verify that the workload VM is responsive. If the application freezes, the MIG should replace the VM. The firewall rule below allows traffic from Google's health check source IP ranges (130.211.0.0/22 and 35.191.0.0/16) so the probes can reach your instances.
# 1. Create Health Check (TCP Port 22)
gcloud compute health-checks create tcp ${HEALTH_CHECK_NAME} \
--port 22 \
--check-interval 30s \
--healthy-threshold 1 \
--timeout 10s \
--unhealthy-threshold 3 \
--global
# 2. Allow Health Check Traffic (Firewall)
gcloud compute firewall-rules create allow-health-check \
--allow tcp:22 \
--source-ranges 130.211.0.0/22,35.191.0.0/16 \
--network default \
--project="${CURRENT_PROJECT_ID}"
# 3. Apply to MIG
gcloud compute instance-groups managed update ${CURRENT_MIG_NAME} \
--health-check ${HEALTH_CHECK_NAME} \
--initial-delay 60 \
--zone ${CURRENT_PROJECT_ZONE}
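As a rough back-of-envelope check of these settings (a simplification that ignores per-check timing jitter), the health check parameters translate into a detection window like this:

```shell
#!/bin/bash
# Approximate worst-case detection window implied by the flags above.
CHECK_INTERVAL=30      # --check-interval 30s
UNHEALTHY_THRESHOLD=3  # --unhealthy-threshold 3
INITIAL_DELAY=60       # --initial-delay 60

# Three consecutive failed checks, 30s apart, before the VM is marked unhealthy.
DETECTION_WINDOW=$((CHECK_INTERVAL * UNHEALTHY_THRESHOLD))
echo "A frozen VM is recreated roughly ${DETECTION_WINDOW}s after it stops responding"
echo "Failures during the first ${INITIAL_DELAY}s after boot are ignored"
```

Tightening `--check-interval` or `--unhealthy-threshold` shrinks this window at the cost of more probe traffic and a higher risk of replacing VMs during transient slowdowns.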
Setting up Autoscaling
We will configure the group to automatically scale between 1 and 5 instances to handle traffic spikes.
gcloud compute instance-groups managed set-autoscaling ${CURRENT_MIG_NAME} \
--min-num-replicas 1 \
--max-num-replicas 5 \
--target-cpu-utilization 0.80 \
--cool-down-period 90 \
--zone ${CURRENT_PROJECT_ZONE}
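To build intuition for target-cpu-utilization, here is a toy model of the scale-out decision. This is a simplification, not the exact Compute Engine autoscaler algorithm; the bounds mirror the 1-to-5 range above:

```python
import math


def recommended_size(current_size: int, avg_cpu: float,
                     target: float = 0.80,
                     min_replicas: int = 1, max_replicas: int = 5) -> int:
    """Toy model: resize the group so average CPU would land at the target."""
    want = math.ceil(current_size * avg_cpu / target)
    # Clamp to the configured autoscaling bounds.
    return max(min_replicas, min(max_replicas, want))


# Load above the 80% target -> scale out; light load -> scale in toward the minimum.
print(recommended_size(3, 0.95))  # 4
print(recommended_size(3, 0.20))  # 1
```

The real autoscaler also applies the cool-down period and stabilization logic before shrinking the group, so scale-in is slower than this model suggests.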
7. Verifying workload and Setting up image updates
Verify Workload
Once the Managed Instance Group (MIG) has launched the VMs, we need to verify that your Confidential Space workload is running correctly.
You can do this via the Google Cloud Console or the Command Line.
gcloud compute instance-groups managed list-instances ${CURRENT_MIG_NAME} \
--zone ${CURRENT_PROJECT_ZONE}
You can also check the serial port output for that specific instance to see your workload's logs:
# Replace <INSTANCE_NAME> with one of the names from the previous command
gcloud compute instances get-serial-port-output <INSTANCE_NAME> \
--zone ${CURRENT_PROJECT_ZONE} \
--port 1
Setting up Image Updates
In a production environment, your Managed Instance Group (MIG) must be updated regularly to address two distinct scenarios:
- Workload Updates: Releasing a new version of your application code (e.g., updating test_workload.py from v1 to v2).
- Infrastructure Updates: Google releasing a security patch or update for the underlying Confidential Space OS. Note that it is a best practice to pick up the latest CS image on at least a monthly basis.
Because we configured our Instance Template with a Dynamic Image Link (.../images/family/...) and a Dynamic Container Tag (:latest), we can handle both of these scenarios with a single "Rolling Replace" operation. This ensures your fleet of VMs is always running the latest stack without any downtime and without requiring you to create a new Instance Template for every minor change.
The Rolling Replace Script
In the update_images directory, open update_images_script.sh. This script triggers a Rolling Replace, which gradually destroys and recreates every VM in the group.
#!/bin/bash
# Initialize the template
gcloud compute instance-groups managed set-instance-template "${CURRENT_MIG_NAME}" \
--template=projects/"${PROJECT_ID}"/global/instanceTemplates/"${TEMPLATE_NAME}" \
--zone="${CURRENT_PROJECT_ZONE}" \
--project="${PROJECT_ID}"
# Trigger the rolling replace
gcloud compute instance-groups managed rolling-action replace "${CURRENT_MIG_NAME}" \
--version=template="${TEMPLATE_NAME}" \
--project="${PROJECT_ID}" \
--zone="${CURRENT_PROJECT_ZONE}" \
--max-surge=1 \
--max-unavailable=0
# Wait for the update to complete
gcloud compute instance-groups managed wait-until --version-target-reached "${CURRENT_MIG_NAME}" \
--zone="${CURRENT_PROJECT_ZONE}" \
--project="${PROJECT_ID}"
This script uses replace rather than restart:
- A Restart simply reboots the machine. It preserves the existing OS disk, meaning it will not pick up new OS patches.
- A Replace deletes the VM and creates a fresh one from the template. This forces the system to look up the latest Confidential Space OS image from the family as well as pull the "Latest" container image from the registry.
- --max-surge=1: This allows the MIG to temporarily create 1 extra VM above your target size. It spins up a new (updated) VM and waits for it to be healthy before it deletes an old (outdated) VM.
- --max-unavailable=0: This ensures zero downtime. It tells the MIG that it is not allowed to take any machine offline unless it has already successfully surged a replacement.
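The surge behavior can be illustrated with a toy simulation. This is a simplification of what the MIG actually does, but it shows why serving capacity never dips below the target size:

```python
def rolling_replace_capacity(target_size: int) -> list[int]:
    """Toy model of max-surge=1 / max-unavailable=0: replace one VM at a time.

    Returns the number of healthy VMs after each replacement step.
    """
    old, new = target_size, 0
    capacity = []
    while old > 0:
        new += 1          # surge: boot one extra, updated VM above the target size
        # ...the MIG waits for the new VM to pass health checks...
        old -= 1          # only then is one outdated VM deleted
        capacity.append(old + new)
    return capacity


print(rolling_replace_capacity(3))  # [3, 3, 3] -- never below the target size
```

Raising max-surge speeds up the rollout at the cost of temporarily running more VMs (and paying for them) during the update.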
The Rolling Restart Script
The update_images directory also contains another script, update_workload_image_script.sh. This script triggers a Rolling Restart, a faster method used strictly to refresh the workload. Because Confidential Space pulls the container image from the registry at every boot, a restart is sufficient to update your application to the :latest version without altering the underlying host.
#!/bin/bash
# Reboots the existing VMs to refresh the container
gcloud compute instance-groups managed rolling-action restart "${CURRENT_MIG_NAME}" \
--project="${PROJECT_ID}" \
--zone="${CURRENT_PROJECT_ZONE}" \
--max-surge=1 \
--max-unavailable=0
# Wait for the update to complete
gcloud compute instance-groups managed wait-until --stable "${CURRENT_MIG_NAME}" \
--zone="${CURRENT_PROJECT_ZONE}" \
--project="${CURRENT_PROJECT_ID}"
Verify Updated Workload
We can test the "One-Click Upgrade" by simulating a real-world application release. We will modify the workload code, push it to the Artifact Registry, update the MIG, and verify that the new version is running with no downtime.
Step 1: Deploy a New Workload Version
First, we need to create a "new" version of your application.
- Open your local test_workload.py file.
- Change the version print statement from print("Workload Version A") to print("Workload Version B").
- Rebuild and push the container image to Artifact Registry by running create_workload.sh. Note that we are pushing to the same tag (:latest).
Step 2: Execute the Rolling Update
Run the update script we created in the previous section. This will force the MIG to replace every VM, pulling the new container hash associated with :latest.
# Run your update script
./update_images/update_images_script.sh
Wait for the script to complete
Step 3: Verify the Update via Serial Port
Once the update is complete, we verify that the new VMs are running the updated code.
Get the name of a new instance:
gcloud compute instance-groups managed list-instances ${CURRENT_MIG_NAME} --zone ${CURRENT_PROJECT_ZONE}
Check the logs:
# Replace <NEW_INSTANCE_NAME> with one of the names of the running VMs
gcloud compute instances get-serial-port-output <NEW_INSTANCE_NAME> \
--zone ${CURRENT_PROJECT_ZONE} \
--port 1
Expected Output: You should see the updated log message, confirming the deployment was successful:
... Workload Version B ...
Step 4: Verify Infrastructure Configuration (Optional)
You can also verify that your Instance Template is correctly configured to pull dynamic updates for both the OS and the Workload by inspecting its metadata.
Run the following command to see the dynamic container reference:
gcloud compute instance-templates describe ${TEMPLATE_NAME} \
| grep -A 1 tee-image-reference
Result: You should see your container image ending in :latest.
- Implication: Because the template points to the tag and not a specific hash, every rolling-action replace successfully pulls the newest code you pushed in Step 1.
(Optional) Automated Updates
While manual updates are useful for major version releases, you often want your fleet to automatically pick up the latest security patches or regular deployment builds without human intervention.
We can automate the "Rolling Replace" process by packaging our update script into a Cloud Run Job. For this codelab, we'll trigger it every 15 minutes. In a production environment, it should run much less frequently; depending on your needs, you might configure it on a weekly or monthly basis.
Step 1: Containerize the Updater Script
First, we need to package our update_images_script.sh (which contains the gcloud ... rolling-action replace logic) into a Docker container so it can run in the cloud.
We have prepared a helper script that builds this container and pushes it to your Artifact Registry.
Run the following command:
# Build and Push the "Updater" Container
# This packages your update logic into a docker image
./update_images/deploy_docker_script_image.sh
What this does:
- It takes the update_images_script.sh from the update_images/ directory.
- It creates a Docker image containing the Google Cloud SDK and your script.
- It pushes the image to ${CURRENT_PROJECT_REGION}-docker.pkg.dev/${PROJECT_ID}/${REPOSITORY}/update-script:latest.
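The updater image itself can be sketched as a small Dockerfile like the one below. This is a hypothetical sketch, not the repository's actual Dockerfile; the base image tag and paths may differ. The key points are a base image that ships the gcloud CLI and the update script as the entrypoint:

```dockerfile
# Hypothetical sketch of the updater image.
FROM gcr.io/google.com/cloudsdktool/google-cloud-cli:slim

COPY update_images_script.sh /update_images_script.sh
RUN chmod +x /update_images_script.sh

# The Cloud Run Job executes this script, which performs the rolling replace.
ENTRYPOINT ["/update_images_script.sh"]
```

The Cloud Run Job's service account supplies the credentials, so the script's gcloud commands run without any baked-in keys.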
Step 2: Deploy and Schedule the Job
Now we need to tell Google Cloud to run this container periodically. We use Cloud Run Jobs to execute the container and Cloud Scheduler to trigger it.
Run the scheduling configuration script:
# Create the Cloud Run Job and the Scheduler Trigger
./create_configs/create_schedule_job.sh
Inside the Script: This script performs two critical actions:
- Creates a Cloud Run Job: It defines a job named mig-updater-job that executes the container we just pushed.
- Creates a Scheduler Trigger: It sets up a Cloud Scheduler job to hit the Cloud Run Job API every 15 minutes.
# (Snippet from create_schedule_job.sh for reference)
# The schedule is set to run every 15 minutes for testing purposes
gcloud scheduler jobs create http ${SCHEDULER_NAME} \
--schedule "*/15 * * * *" \
--uri "https://${CURRENT_PROJECT_REGION}-run.googleapis.com/apis/run.googleapis.com/v1/namespaces/${PROJECT_ID}/jobs/${JOB_NAME}:run" \
--http-method POST \
--oauth-service-account-email ${SERVICE_ACCOUNT}
Step 3: Verify Automation
You don't have to wait 15 minutes to test it. You can force the scheduler to run immediately to verify the pipeline.
- Force Run the Job:
gcloud scheduler jobs run ${SCHEDULER_NAME} --location ${CURRENT_PROJECT_REGION}
- Check Execution: Go to the Cloud Run console > Jobs. You should see a new execution starting.
- Check the MIG: Run gcloud compute instance-groups managed list-instances ${CURRENT_MIG_NAME} --zone ${CURRENT_PROJECT_ZONE}. You will see the instances entering the RECREATING state as the job triggers the rolling update.
Why 15 Minutes? We set the schedule to */15 * * * * for this codelab so you can see the results quickly. In a real production environment, you would likely change this to run daily (e.g., 0 3 * * * for 3 AM) or weekly.
8. Clean Up
The cleanup script cleanup.sh can be used to clean up the resources that we have created as part of this codelab. As part of this cleanup, the following resources will be deleted:
- The Managed Instance Group (${CURRENT_MIG_NAME}) and its underlying VMs.
- The Instance Template (${TEMPLATE_NAME}).
- The Health Check and Firewall Rules (${HEALTH_CHECK_NAME}).
- The Artifact Registry Repository (${REPOSITORY}).
- The Service Account (if you created a dedicated one for this lab).
If you are done exploring, please consider deleting your project by following these instructions: Shutting down (deleting) projects.
Congratulations
Congratulations, you've successfully completed the codelab!
You learned how to securely scale Confidential Space workloads using Managed Instance Groups (MIGs). You successfully configured Autohealing to recover from failures, Autoscaling to handle traffic spikes, and performed Zero-Downtime Updates for both your Confidential Space OS image and Workload container.
What's next?
Check out additional Confidential Space codelabs:
- Securing ML models and Intellectual Property using Confidential Space
- How to transact digital assets with multi-party computation and Confidential Space
- Analyze confidential data with Confidential Space
- Use Confidential Space with protected resources that aren't stored with a cloud provider