LiteRT CLI 101: Streamline your Edge AI Workflows

1. 🏁 Introduction & CLI overview

Welcome to the LiteRT CLI 101 hands-on codelab! This guide takes you step by step from initial environment setup to deploying optimized models on edge devices.

🌟 Background

Edge AI requires bringing complex neural models directly onto mobile phones, wearables, and embedded hardware.

  • LiteRT (formerly TensorFlow Lite / TFLite) is Google's on-device framework for high-performance ML & GenAI deployment on edge platforms, covering efficient conversion, runtime, and optimization.
  • LiteRT CLI wraps the Google AI Edge stack in a single shell command (litert) that streamlines LiteRT development workflows: converting, quantizing, compiling, running, benchmarking, and visualizing LiteRT (TFLite) models on various hardware (CPU / GPU / NPU) across platforms (desktop, mobile, or cloud).

2. 🔄 Basic workflows: conversion, quantize & run

In this section, let's execute a complete Edge AI modeling lifecycle with LiteRT CLI: PyTorch model wrapper ➔ tracing conversion ➔ model quantization ➔ desktop inference ➔ performance benchmarks.

LiteRT CLI 101 Workflow

📝 Stage 1: Prepare a PyTorch model wrapper script

Create resnet18.py in your current directory. It defines two functions: get_model returns the model in inference mode, and get_args returns a sample input tuple whose shape is used to trace the graph:

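import torch
import torchvision

def get_model(batch_size: int = 1) -> torch.nn.Module:
  # Load ResNet-18 with pretrained ImageNet weights.
  model = torchvision.models.resnet18(
      weights=torchvision.models.ResNet18_Weights.IMAGENET1K_V1
  )
  # Switch to inference mode so batch norm uses its running statistics.
  model.eval()
  return model

def get_args(batch_size: int = 1) -> tuple[torch.Tensor, ...]:
  # Sample input used for tracing: a batch of 3x224x224 images.
  return (torch.randn(batch_size, 3, 224, 224),)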
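Before converting, it can help to confirm that the wrapper really exposes both functions. The sketch below imports a script by path and checks the two-function contract. It is an assumption about how such a wrapper could be consumed, not the LiteRT CLI's actual loader, and load_wrapper is a hypothetical helper name.

```python
import importlib.util
from pathlib import Path


def load_wrapper(script_path: str):
    """Import a wrapper script by path and return (get_model, get_args).

    Hypothetical helper: it only illustrates the two-function contract.
    """
    path = Path(script_path)
    spec = importlib.util.spec_from_file_location(path.stem, path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot load {script_path}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    # Fail early if either expected function is missing.
    for name in ("get_model", "get_args"):
        if not callable(getattr(module, name, None)):
            raise AttributeError(f"{script_path} must define {name}()")
    return module.get_model, module.get_args
```

With resnet18.py in place, calling load_wrapper("resnet18.py") and then get_model()(*get_args()) runs one forward pass on the sample input, which is a quick way to catch shape or import errors before conversion.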

🔄 Stage 2: Convert a PyTorch model to LiteRT using LiteRT Torch

Invoke the LiteRT Torch converter to build a standard Float32 .tflite target model:

# Convert PyTorch source to LiteRT
litert convert resnet18.py --output resnet18

# Verify target was exported
ls -lh resnet18/resnet18.tflite

📉 Stage 3: Quantize weights to INT8

Apply the dynamic-range and weight-only recipes. Storing weights as INT8 instead of Float32 shrinks the weight data by roughly 4x:

# 1. Dynamic Range Quantization (Dynamic activations + static INT8 weights)
litert quantize resnet18/resnet18.tflite \
  --recipe dynamic_wi8_afp32 \
  --output resnet18/resnet18_int8_dynamic.tflite

# 2. Weight-Only Quantization (Float32 activations + static INT8 weights)
litert quantize resnet18/resnet18.tflite \
  --recipe weight_only_wi8_afp32 \
  --output resnet18/resnet18_int8_weight_only.tflite
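To see where the roughly 4x saving comes from, here is a minimal sketch of symmetric per-tensor INT8 weight quantization in plain Python. It illustrates the general technique under that assumption and is not necessarily the exact scheme behind the recipes above; the function names are made up for this example.

```python
def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    # One shared scale per tensor; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize_int8(quantized: list[int], scale: float) -> list[float]:
    """Recover approximate float weights from the INT8 values and the scale."""
    return [q * scale for q in quantized]


weights = [0.5, -1.0, 0.25, 0.0]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)

# Each weight is stored in 1 byte instead of 4, plus one shared scale.
float32_bytes = 4 * len(weights)
int8_bytes = 1 * len(weights)
```

The round trip is lossy: each restored weight can differ from the original by up to about half of scale, which is the accuracy cost paid for the smaller file.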

🚀 Stage 4: Run Inference locally

Run each model on dummy inputs to confirm that it loads and executes end to end:

# Run original Float32 model inference
litert run resnet18/resnet18.tflite --desktop --cpu

# Run optimized Dynamic INT8 model inference
litert run resnet18/resnet18_int8_dynamic.tflite --cpu --iterations 1

📊 Stage 5: Benchmark model performance

Measure initialization time, average inference latency, and memory footprint for each model:

# Benchmark original Float32 model
litert benchmark resnet18/resnet18.tflite --desktop --cpu

# Benchmark optimized Dynamic range INT8 model
litert benchmark resnet18/resnet18_int8_dynamic.tflite --desktop --cpu

3. 🔌 Local environment setup & verification

Let's build an isolated, clean sandbox on your workstation (macOS or Linux).

🔌 Option A: Ultra-fast setup (uv)

uv creates an isolated virtual environment in seconds. The --clear flag removes any existing environment at the target path first, which avoids conflicts with stale packages.

# Create and activate an isolated virtual environment
uv venv --clear --python=3.13 --seed
source .venv/bin/activate

# Install the litert-cli-nightly package from PyPI
uv pip install litert-cli-nightly

🐍 Option B: Standard setup (pip)

If uv is not installed, use Python's built-in venv module and pip:

# Create and activate an isolated virtual environment
python3 -m venv .venv
source .venv/bin/activate

# Install the litert-cli-nightly package from PyPI
pip install --upgrade pip setuptools wheel
pip install litert-cli-nightly

🛠️ Option C: Editable local setup

If you are developing or testing inside a local clone of the repository:

# Create and activate an isolated virtual environment
uv venv --clear --python=3.13 --seed
source .venv/bin/activate

# Install from local directory root
uv pip install -e .

🔍 Setup verification

Verify that the litert command resolves on your PATH:

litert --help
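If the command is not found, the usual cause is a virtual environment that was never activated in the current shell. As a small sketch, the same check can be scripted with Python's standard library. find_command is a hypothetical helper name; shutil.which is the real standard-library call doing the lookup.

```python
import shutil
from typing import Optional


def find_command(name: str = "litert") -> Optional[str]:
    """Return the absolute path of an executable on PATH, or None if absent."""
    return shutil.which(name)


path = find_command()
if path is None:
    print("litert not found on PATH; is the virtual environment activated?")
else:
    print(f"litert resolves to {path}")
```

A path inside .venv/bin indicates that the sandbox copy is the one being used.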

💡 The central Model Reference catalog (model-ref)

To keep scripts and development robust, the LiteRT CLI implements a centralized Model Catalog:

Model Catalog Routing Diagram

  • Format: Assign a unique name (alias) to a downloaded or imported model, such as mobilenet.
  • Variants: Append a colon and a variant tag to refer to an optimized variation, such as mobilenet:int8 or mobilenet:gpu.
  • Simplicity: All CLI commands accept the alias in place of a file path and resolve it to the stored file automatically.
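The alias format described above can be sketched as a small parser plus a lookup table. This is an illustrative model of the name:variant idea only. The catalog contents, file paths, and function names are hypothetical and do not reflect how the LiteRT CLI stores or resolves models internally.

```python
from typing import NamedTuple, Optional


class ModelRef(NamedTuple):
    name: str
    variant: Optional[str]


def parse_model_ref(ref: str) -> ModelRef:
    """Split 'name:variant' into its parts; the variant is optional."""
    name, sep, variant = ref.partition(":")
    if not name or (sep and not variant):
        raise ValueError(f"invalid model reference: {ref!r}")
    return ModelRef(name, variant if sep else None)


# Hypothetical in-memory catalog; the paths are placeholders.
CATALOG = {
    ("mobilenet", None): "catalog/mobilenet/model.tflite",
    ("mobilenet", "int8"): "catalog/mobilenet/model_int8.tflite",
}


def resolve(ref: str) -> str:
    """Look up the file path registered for a model reference."""
    parsed = parse_model_ref(ref)
    try:
        return CATALOG[(parsed.name, parsed.variant)]
    except KeyError:
        raise KeyError(f"unknown model reference: {ref!r}") from None
```

Keeping the variant as a separate key means a script can switch from mobilenet to mobilenet:int8 without touching any file paths.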

4. 📲 Device deployment & profiling

Let's take optimization further: run detailed local benchmarks, deploy to real devices over USB, use hardware accelerators, compile Ahead-of-Time, and benchmark in the cloud.

Target Deployment and Acceleration Options

📊 Local profiling metrics

Download EfficientNet-B1 and profile performance statistics on your desktop host:

# Download and register model alias in central catalog
litert download litert-community/efficientnet_b1 --file "*.tflite" --output efficientnet
litert import efficientnet/efficientnet_b1.tflite --model-ref efficientnet_b1

# Run high-precision benchmark on desktop CPU
litert benchmark efficientnet_b1 --desktop --cpu

Observe the key performance headers in the benchmark log:

  • Model initialization: Time to load the model and prepare the runtime.
  • Warmup (avg): Average latency of the initial warmup runs, which absorb one-time setup and compilation costs.
  • Inference (avg): Average latency of a single steady-state inference.
  • Overall footprint: Peak RAM consumed during execution.
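To make the distinction between these timings concrete, here is a generic timing harness in plain Python. It is a sketch of how initialization, warmup, and steady-state averages are commonly separated, not a description of how litert benchmark measures them; the benchmark function and its parameters are hypothetical, and memory footprint is left out.

```python
import statistics
import time
from typing import Callable


def benchmark(load: Callable[[], Callable[[], object]],
              warmup_runs: int = 3, runs: int = 10) -> dict[str, float]:
    """Time initialization, warmup, and steady-state inference, in milliseconds."""
    start = time.perf_counter()
    infer = load()  # build or load the model; returns a callable for one inference
    init_ms = (time.perf_counter() - start) * 1000

    def timed_ms() -> float:
        t0 = time.perf_counter()
        infer()
        return (time.perf_counter() - t0) * 1000

    # Warmup runs are timed separately so one-time costs do not skew the average.
    warmup = [timed_ms() for _ in range(warmup_runs)]
    steady = [timed_ms() for _ in range(runs)]
    return {
        "init_ms": init_ms,
        "warmup_avg_ms": statistics.mean(warmup),
        "inference_avg_ms": statistics.mean(steady),
    }


# Stand-in workload so the sketch runs without a model.
result = benchmark(lambda: (lambda: sum(i * i for i in range(10_000))))
```

Reporting warmup and steady-state numbers separately is what lets you compare accelerators fairly, since first-run costs can differ widely between them.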

📲 Target A: Mobile CPU (USB connect)

Connect an Android device with USB Debugging enabled, and deploy:

# 1. Confirm device connection
adb devices

# 2. Push and execute model on mobile CPU
litert run efficientnet_b1 --android --cpu

The CLI automatically pushes the model to the target device, executes the inference loop, and pipes the outputs back!
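A device that shows up as unauthorized or offline in adb devices cannot run the model, so scripted workflows often check the state first. The sketch below parses the standard adb devices output (a header line followed by tab-separated serial and state). The function names and the sample serials are illustrative.

```python
def parse_adb_devices(output: str) -> dict[str, str]:
    """Map each device serial to its state from `adb devices` output."""
    devices = {}
    for line in output.splitlines():
        line = line.strip()
        # Skip blanks, the header line, and adb daemon status messages.
        if not line or line.startswith("List of devices") or line.startswith("*"):
            continue
        parts = line.split()
        if len(parts) >= 2:
            devices[parts[0]] = parts[1]
    return devices


def ready_devices(output: str) -> list[str]:
    """Return serials in the 'device' state, i.e. connected and authorized."""
    return [s for s, state in parse_adb_devices(output).items() if state == "device"]


# Sample output with made-up serials.
sample = "List of devices attached\nemulator-5554\tdevice\n0a388e93\tunauthorized\n\n"
print(ready_devices(sample))
```

In a real script, the input would come from subprocess.run(["adb", "devices"], capture_output=True, text=True).stdout.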

🎮 Target B: Mobile GPU

Offload heavy workloads dynamically to the GPU using OpenCL or WebGPU:

# Benchmark model performance on mobile GPU
litert benchmark efficientnet_b1 --android --gpu

# Run inference with GPU acceleration and CPU fallback
litert run efficientnet_b1 --android --accelerator gpu,cpu
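The gpu,cpu value reads as an ordered preference list: try the GPU first, then fall back to the CPU. A minimal sketch of that idea follows. It simplifies fallback to a whole-model choice, whereas a real runtime may fall back per operator, and pick_accelerator is a hypothetical name.

```python
def pick_accelerator(preference: str, available: set[str]) -> str:
    """Return the first accelerator in a comma-separated list that is available."""
    for name in (p.strip().lower() for p in preference.split(",")):
        if name and name in available:
            return name
    raise RuntimeError(f"none of {preference!r} is available")


# On a device without a usable GPU, the CPU is selected instead.
print(pick_accelerator("gpu,cpu", {"cpu"}))
```

Listing cpu last keeps the command usable on devices where the preferred accelerator is missing.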

⚙️ Target C: JIT Android NPU

Offload execution directly to the NPU on modern chipsets:

# Run with on-device JIT NPU acceleration
litert run efficientnet_b1 --android --accelerator npu,cpu

  • Warning: On-device JIT compilation builds the NPU graph at runtime, which adds significant initialization and warmup delay.

🚀 Target D: AOT compiled NPU

Pre-compile offline to bypass runtime JIT overheads and achieve maximum acceleration:

# 1. Offline compile for Qualcomm SM8750 NPU (Linux host)
litert compile efficientnet/efficientnet_b1.tflite --target sm8750

# 2. Execute compiled AOT binary with zero JIT warmup latency
litert run efficientnet_b1_Qualcomm_SM8750.tflite --android --npu

☁️ Cloud profiling (Google AI Edge Portal)

Run benchmarks on remote devices hosted in Google Cloud's device farms.

# Log in to your Google Cloud project
gcloud auth login

# Benchmark on a remote Pixel 7 CPU
litert benchmark efficientnet_b1 --gcp --device "pixel 7" --cpu --gcp-project "your-project-id"

# Benchmark on the GPUs of several remote devices
litert benchmark efficientnet_b1 --gcp --devices "pixel 7, sm-s931u1" --gpu --gcp-project "your-project-id"

5. 🧠 Advanced topics: LLMs & speech recognition

Let's explore advanced models by converting and running Large Language Models (LLMs) and Automatic Speech Recognition (ASR) models.

💬 Generative AI (LLMs)

Edge LLMs are memory-bound. We address this with weight-only INT4/INT8 quantization, which compresses the weights while keeping activations in Float32:

# 1. Automated download & conversion from Hugging Face
litert convert Qwen/Qwen1.5-0.5B-Chat --output models/qwen

# 2. Quantize model weights to INT4
litert convert Qwen/Qwen1.5-0.5B-Chat \
  --quantize-recipe weight_only_wi4_afp32 \
  --output models/qwen_w4

# 3. Generate a single response
litert lm run models/qwen/model.litertlm --prompt "Introduce San Francisco."

# 4. Interactive chat
litert lm run models/qwen/model.litertlm
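A quick back-of-the-envelope calculation shows why weight precision matters so much for memory-bound LLMs. The sketch below estimates weight storage only, assuming a nominal 500 million parameters as suggested by the "0.5B" in the model name; the exact parameter count differs, and quantization scales, activations, and the KV cache are ignored.

```python
def weight_bytes(num_params: int, bits_per_weight: int) -> int:
    """Bytes needed to store the weights alone, rounded up to whole bytes."""
    return (num_params * bits_per_weight + 7) // 8


NUM_PARAMS = 500_000_000  # nominal count implied by "0.5B"; an assumption

for label, bits in (("Float32", 32), ("INT8", 8), ("INT4", 4)):
    mib = weight_bytes(NUM_PARAMS, bits) / 2**20
    print(f"{label:>7}: {mib:,.0f} MiB")
```

Going from Float32 to INT4 cuts weight storage by a factor of eight, which is often the difference between fitting in a phone's memory budget and not.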

🎙️ Speech processing (ASR)

ASR models such as Whisper or Parakeet contain both an encoder and a decoder module. You can benchmark each module separately by selecting its signature key:

# 1. Download prepackaged Whisper-Tiny
litert download litert-community/whisper-tiny --file "whisper_tiny_30s_f32.tflite" --output "models/whisper_tiny"

# 2. Profile audio encoding
litert benchmark models/whisper_tiny/whisper_tiny_30s_f32.tflite --android --gpu --signature-key "encode"

# 3. Profile text token decoding loop
litert benchmark models/whisper_tiny/whisper_tiny_30s_f32.tflite --android --cpu --signature-key "decode"

6. 🤖 Use in agentic coding

LiteRT CLI is agent-friendly, and you can integrate it directly with coding agents.

To try it, add the LiteRT CLI skill (SKILL.md) to your coding agent, such as Google Antigravity, and prompt it to carry out the workflows covered in this codelab.

7. 🚀 Congratulations & next steps

🎉 Outstanding! You have completed the LiteRT CLI 101 codelab!

You now possess the skills to master edge ML development:

  • Convert PyTorch models to LiteRT models.
  • Quantize model weights to INT4/INT8.
  • Run models on mobile CPU/GPU/NPU.
  • Compile for NPU.
  • Benchmark on desktop, mobile and cloud.
  • Convert and run LLM models.
  • Use in agentic coding.

🔗 Resources

  • LiteRT CLI: https://github.com/google-ai-edge/LiteRT-CLI
  • LiteRT: https://ai.google.dev/edge/litert
  • Examples: https://github.com/google-ai-edge/LiteRT-CLI/tree/main/examples