Deploy an AI model on GKE with NVIDIA NIM

About this codelab

Last updated Jul 16, 2025
Written by Jason Davenport (Google), Dimitri Maltezakis Vathypetrou (NVIDIA)

1. Introduction

This hands-on codelab guides you through deploying and managing a containerized AI model on Google Kubernetes Engine (GKE), using the power of NVIDIA NIM™ microservices.

This tutorial is designed for developers and data scientists who are looking to:

  • Simplify AI inference deployment: Learn how to use a prebuilt NIM for faster and easier deployment of AI models into production on GKE.
  • Optimize performance on NVIDIA GPUs: Gain hands-on experience with deploying NIMs that use NVIDIA TensorRT for optimized inference on GPUs within your GKE cluster.
  • Scale AI inference workloads: Explore how to scale your NIM deployment based on demand by using Kubernetes for autoscaling and managing compute resources.

2. What you will learn

By the end of this tutorial, you'll have experience with:

  1. Deploying NIM on GKE: Deploy a prebuilt NVIDIA NIM for various inference tasks onto your GKE cluster.
  2. Managing NIM deployments: Use kubectl commands to manage, monitor, and scale your deployed NIM.
  3. Scaling inference workloads: Utilize Kubernetes features for autoscaling your NIM deployments based on traffic demands.

3. Learn the components

GPUs in Google Kubernetes Engine (GKE)

GPUs let you accelerate specific workloads running on your nodes, such as machine learning and data processing. GKE provides a range of machine type options for node configuration, including machine types with NVIDIA H100, L4, and A100 GPUs.
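GPU availability varies by zone, so it's worth checking what is offered where you plan to deploy. A quick check (a minimal sketch; the zone below is just an example):

gcloud compute accelerator-types list --filter="zone:us-central1-a"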

NVIDIA NIM

NVIDIA NIM is a set of easy-to-use inference microservices for accelerating the deployment of foundation models on any cloud or data center while helping to keep your data secure.

NVIDIA AI Enterprise

NVIDIA AI Enterprise is an end-to-end, cloud-native software platform that accelerates data science pipelines and streamlines development and deployment of production-grade co-pilots and other generative AI applications. It is available through the Google Cloud Marketplace.

4. Prerequisites

  • Project: A Google Cloud project with billing enabled.
  • Permissions: Sufficient permissions to create GKE clusters and other related resources.
  • Helm: Helm is a package manager for Kubernetes.
  • NVIDIA GPU Operator: A Kubernetes add-on that automates the management of all NVIDIA software components needed to provision GPUs.
  • NVIDIA API key: Create an NVIDIA account and generate an API key by following NVIDIA's instructions; the API key is required to download the NIM container.
  • NVIDIA GPUs: One of the NVIDIA GPU types available on GKE, such as L4, A100, or H100. (Please note, you can request a quota increase if you don't have enough GPUs.)
  • Optional - GCloud SDK: If you're not using Cloud Shell in the Google Cloud console, please ensure you have the Google Cloud SDK installed and configured.
  • Optional - kubectl: If you're not using Cloud Shell in the Google Cloud console, please ensure you have the kubectl command-line tool installed and configured. (A quick sanity check for these tools follows this list.)
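Before moving on, here is a minimal sanity check (a sketch, assuming a bash shell) that confirms the CLI tools above are available:

# Report any of the required CLI tools that are missing from PATH.
for tool in gcloud kubectl helm; do
  command -v "$tool" >/dev/null || echo "Missing: $tool"
done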

5. Create a GKE cluster with GPUs

  1. Open Cloud Shell or your terminal.
  2. Specify the following parameters:
    export PROJECT_ID=<YOUR PROJECT ID>
    export REGION=<YOUR REGION>
    export ZONE=<YOUR ZONE>
    export CLUSTER_NAME=nim-demo
    export NODE_POOL_MACHINE_TYPE=g2-standard-16
    export CLUSTER_MACHINE_TYPE=e2-standard-4
    export GPU_TYPE=nvidia-l4
    export GPU_COUNT=1

Please note that you may have to change the values of NODE_POOL_MACHINE_TYPE, CLUSTER_MACHINE_TYPE, and GPU_TYPE based on the type of compute instance and GPUs you are using.
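If you're unsure whether you have GPU quota in your region, you can inspect the regional quotas before creating the node pool. A minimal sketch (the grep pattern is illustrative; exact metric names such as NVIDIA_L4_GPUS may vary by project):

# Show GPU-related quota entries (limit, metric, usage) for the region.
gcloud compute regions describe ${REGION} --format=json | grep -i -B 1 -A 2 nvidia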

  3. Create the GKE cluster:
    gcloud container clusters create ${CLUSTER_NAME} \
        --project=${PROJECT_ID} \
        --location=${ZONE} \
        --release-channel=rapid \
        --machine-type=${CLUSTER_MACHINE_TYPE} \
        --num-nodes=1
  4. Create the GPU node pool:
    gcloud container node-pools create gpupool \
        --accelerator type=${GPU_TYPE},count=${GPU_COUNT},gpu-driver-version=latest \
        --project=${PROJECT_ID} \
        --location=${ZONE} \
        --cluster=${CLUSTER_NAME} \
        --machine-type=${NODE_POOL_MACHINE_TYPE} \
        --num-nodes=1
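
The kubectl steps later in this codelab assume your shell is pointing at the new cluster. If Cloud Shell hasn't already fetched credentials for you, here is a quick sketch to wire up kubectl and confirm that the GPU node registered its accelerator:

gcloud container clusters get-credentials ${CLUSTER_NAME} \
    --location=${ZONE} \
    --project=${PROJECT_ID}

# GKE labels GPU nodes with cloud.google.com/gke-accelerator;
# the GPU node pool's node should appear here.
kubectl get nodes -l cloud.google.com/gke-accelerator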

6. Configure NVIDIA NGC API Key

The NGC API key allows you to pull NIM container images from NVIDIA NGC. To specify your key:

export NGC_CLI_API_KEY="<YOUR NGC API KEY>"

This is the key you generated as part of the Prerequisites.
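
Optionally, you can verify that the key authenticates against nvcr.io before deploying. A sketch, assuming Docker is available (as it is in Cloud Shell); the username is the literal string $oauthtoken, per NGC convention:

# Should print "Login Succeeded" if the key is valid.
echo "${NGC_CLI_API_KEY}" | docker login nvcr.io --username '$oauthtoken' --password-stdin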

7. Deploy and test NVIDIA NIM

  1. Fetch the NIM LLM Helm chart:
    helm fetch https://helm.ngc.nvidia.com/nim/charts/nim-llm-1.3.0.tgz --username='$oauthtoken' --password=$NGC_CLI_API_KEY
  2. Create a NIM Namespace:
    kubectl create namespace nim
  3. Configure secrets:
    kubectl create secret docker-registry registry-secret \
        --docker-server=nvcr.io \
        --docker-username='$oauthtoken' \
        --docker-password=$NGC_CLI_API_KEY \
        -n nim

    kubectl create secret generic ngc-api --from-literal=NGC_API_KEY=$NGC_CLI_API_KEY -n nim
  4. Set up the NIM configuration:
    cat <<EOF > nim_custom_value.yaml
    image:
      repository: "nvcr.io/nim/meta/llama3-8b-instruct" # container location
      tag: 1.0.0 # NIM version you want to deploy
    model:
      ngcAPISecret: ngc-api  # name of a secret in the cluster that includes a key named NGC_API_KEY holding an NGC API key
    persistence:
      enabled: true
    imagePullSecrets:
      - name: registry-secret # name of a secret used to pull nvcr.io images, see https://kubernetes.io/docs/tasks/configure-pod-container/pull-image-private-registry/
    EOF
  5. Launch the NIM deployment:
    helm install my-nim nim-llm-1.3.0.tgz -f nim_custom_value.yaml --namespace nim
    Verify that the NIM pod is running:
    kubectl get pods -n nim
  6. Test the NIM deployment:
    Once the NIM service is deployed, we can make inference requests to see what kind of responses the service returns. To do this, enable port forwarding on the service so that the NIM is reachable from localhost on port 8000:
    kubectl port-forward service/my-nim-nim-llm 8000:8000 -n nim
    Next, open another terminal or Cloud Shell tab and try the following request:
    curl -X 'POST' \
      'http://localhost:8000/v1/chat/completions' \
      -H 'accept: application/json' \
      -H 'Content-Type: application/json' \
      -d '{
      "messages": [
        {
          "content": "You are a polite and respectful chatbot helping people plan a vacation.",
          "role": "system"
        },
        {
          "content": "What should I do for a 4 day vacation in Spain?",
          "role": "user"
        }
      ],
      "model": "meta/llama3-8b-instruct",
      "max_tokens": 128,
      "top_p": 1,
      "n": 1,
      "stream": false,
      "stop": "\n",
      "frequency_penalty": 0.0
    }'

    If you get a chat completion from the NIM service, that means the service is working as expected!
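
With the service responding, you can also try the scaling mentioned in the introduction. A minimal sketch: the deployment name my-nim-nim-llm is inferred from the service name above, and each extra replica needs its own GPU, so you may first have to add nodes to the GPU node pool.

# Scale out manually:
kubectl scale deployment/my-nim-nim-llm -n nim --replicas=2

# Or create a simple CPU-based Horizontal Pod Autoscaler as a starting point;
# production setups typically autoscale on GPU or latency metrics instead.
kubectl autoscale deployment/my-nim-nim-llm -n nim --min=1 --max=3 --cpu-percent=80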

8. Cleanup

Delete the GKE cluster:

gcloud container clusters delete ${CLUSTER_NAME} --location=${ZONE}
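
Deleting the cluster removes everything running in it. If you'd rather keep the cluster and remove only the NIM deployment, a sketch:

# Remove the Helm release and its namespace (including the secrets created earlier).
helm uninstall my-nim --namespace nim
kubectl delete namespace nim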

9. What's next

Check out the NVIDIA NIM documentation and the GKE documentation on GPUs for more information.