1. 🏁 Introduction & CLI overview
Welcome to the LiteRT CLI 101 hands-on codelab! This guide takes you step by step from environment setup to deploying optimized models on edge devices.
🌟 Background
Edge AI requires bringing complex neural models directly onto mobile phones, wearables, and embedded hardware.
- LiteRT (formerly TensorFlow Lite / TFLite) is Google's on-device framework for high-performance ML and GenAI deployment on edge platforms, providing efficient conversion, runtime, and optimization.
- LiteRT CLI integrates the Google AI Edge stack into a standalone shell command (litert) that streamlines LiteRT development workflows: converting, quantizing, compiling, running, benchmarking, and visualizing LiteRT (TFLite) models on various hardware (CPU / GPU / NPU) across platforms (desktop, mobile, or cloud).
2. 🔄 Basic workflows: conversion, quantize & run
In this section, let's execute a complete Edge AI modeling lifecycle with LiteRT CLI: PyTorch model wrapper ➔ tracing conversion ➔ model quantization ➔ desktop inference ➔ performance benchmarks.
📝 Stage 1: Prepare a PyTorch model wrapper script
Create resnet18.py in your current directory. It defines get_model, which returns the model in eval mode, and get_args, which returns sample inputs that fix the input shape for tracing:
import torch
import torchvision


def get_model(batch_size: int = 1) -> torch.nn.Module:
    model = torchvision.models.resnet18(
        weights=torchvision.models.ResNet18_Weights.IMAGENET1K_V1
    )
    model.eval()
    return model


def get_args(batch_size: int = 1) -> tuple[torch.Tensor, ...]:
    return (torch.randn(batch_size, 3, 224, 224),)
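Before converting, it can help to sanity-check that a wrapper script has the shape shown above. The following is a minimal sketch that assumes the converter looks for exactly the two functions get_model and get_args (inferred from the script, not confirmed by this codelab). The check_wrapper helper and the stand-in module are hypothetical, and torch is left out so the snippet runs anywhere.

```python
import types


def check_wrapper(module) -> list[str]:
    """Return a list of problems found in a model wrapper module."""
    problems = []
    # Assumption: the wrapper contract is two callables with these names.
    for name in ("get_model", "get_args"):
        if not callable(getattr(module, name, None)):
            problems.append(f"missing callable: {name}")
    if not problems:
        # Sample inputs are expected as a tuple, one entry per model input.
        args = module.get_args()
        if not isinstance(args, tuple):
            problems.append("get_args() should return a tuple of inputs")
    return problems


# Stand-in module for illustration; in practice you would `import resnet18`.
demo = types.ModuleType("demo_wrapper")
demo.get_model = lambda batch_size=1: object()
demo.get_args = lambda batch_size=1: ([0.0] * 4,)

print(check_wrapper(demo))
```

An empty list means the wrapper exposes both functions and returns its sample inputs as a tuple.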
🔄 Stage 2: Convert a PyTorch model to LiteRT using LiteRT Torch
Invoke the LiteRT Torch converter to build a standard Float32 .tflite target model:
# Convert PyTorch source to LiteRT
litert convert resnet18.py --output resnet18
# Verify target was exported
ls -lh resnet18/resnet18.tflite
📉 Stage 3: Quantize weights to INT8
Apply the dynamic-range and weight-only recipes to store the weights as INT8 instead of Float32, which shrinks the model file by roughly 4x:
# 1. Dynamic Range Quantization (Dynamic activations + static INT8 weights)
litert quantize resnet18/resnet18.tflite \
--recipe dynamic_wi8_afp32 \
--output resnet18/resnet18_int8_dynamic.tflite
# 2. Weight-Only Quantization (Float32 activations + static INT8 weights)
litert quantize resnet18/resnet18.tflite \
--recipe weight_only_wi8_afp32 \
--output resnet18/resnet18_int8_weight_only.tflite
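The ~4x figure follows from storage width alone: 32 bits per weight versus 8. The sketch below works through that arithmetic for ResNet-18 (11,689,512 parameters in torchvision) and adds a small helper for comparing real files on disk. The helper names are illustrative and not part of the LiteRT CLI. Actual files also carry quantization scales, metadata, and possibly tensors left in Float32, so the measured ratio is usually a little under 4x.

```python
import os

RESNET18_PARAMS = 11_689_512  # torchvision ResNet-18 parameter count


def weight_bytes(num_params: int, bits_per_weight: int) -> int:
    """Storage needed for the weights alone, ignoring metadata."""
    return num_params * bits_per_weight // 8


def file_ratio(original: str, quantized: str) -> float:
    """Size ratio between two files on disk (original / quantized)."""
    return os.path.getsize(original) / os.path.getsize(quantized)


fp32 = weight_bytes(RESNET18_PARAMS, 32)
int8 = weight_bytes(RESNET18_PARAMS, 8)
print(f"FP32 weights: {fp32 / 1e6:.1f} MB")
print(f"INT8 weights: {int8 / 1e6:.1f} MB")
print(f"ratio: {fp32 / int8:.1f}x")
```

Once the quantize commands have run, calling file_ratio("resnet18/resnet18.tflite", "resnet18/resnet18_int8_dynamic.tflite") gives the measured ratio for comparison.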
🚀 Stage 4: Run Inference locally
Run inference with dummy inputs to confirm that each model loads and executes:
# Run original Float32 model inference
litert run resnet18/resnet18.tflite --desktop --cpu
# Run optimized Dynamic INT8 model inference
litert run resnet18/resnet18_int8_dynamic.tflite --cpu --iterations 1
📊 Stage 5: Benchmark model performance
Measure average latency, initialization cost, CPU throughput, and runtime memory footprint:
# Benchmark original Float32 model
litert benchmark resnet18/resnet18.tflite --desktop --cpu
# Benchmark optimized Dynamic range INT8 model
litert benchmark resnet18/resnet18_int8_dynamic.tflite --desktop --cpu
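To make these metrics concrete, here is a minimal timing harness in plain Python. It is a sketch of how initialization, warmup, and steady-state latency are commonly separated, not the CLI's actual benchmark implementation. The load_dummy_model workload is a stand-in, and the metric names are chosen for illustration.

```python
import time
from statistics import mean
from typing import Callable


def benchmark(load: Callable[[], Callable[[], object]],
              warmup: int = 3, runs: int = 20) -> dict[str, float]:
    """Time model loading, warmup runs, and steady-state runs separately."""
    t0 = time.perf_counter()
    infer = load()  # one-time setup cost
    init_ms = (time.perf_counter() - t0) * 1e3

    def timed() -> float:
        start = time.perf_counter()
        infer()
        return (time.perf_counter() - start) * 1e3

    warmup_ms = [timed() for _ in range(warmup)]  # excluded from the average
    run_ms = [timed() for _ in range(runs)]
    avg = mean(run_ms)
    return {
        "init_ms": init_ms,
        "warmup_avg_ms": mean(warmup_ms),
        "inference_avg_ms": avg,
        "throughput_per_s": 1e3 / avg if avg > 0 else float("inf"),
    }


def load_dummy_model() -> Callable[[], object]:
    """Stand-in workload; replace with a real model invocation."""
    data = list(range(10_000))
    return lambda: sum(x * x for x in data)


if __name__ == "__main__":
    for name, value in benchmark(load_dummy_model).items():
        print(f"{name}: {value:.3f}")
```

Keeping warmup runs out of the average matters because the first few invocations often include one-time costs such as cache population or kernel compilation.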
3. 🔌 Local environment setup & verification
Let's build an isolated, clean sandbox on your workstation (macOS or Linux).
🔌 Option A: Ultra-fast setup (uv)
uv creates an isolated virtual environment in seconds. The --clear flag removes any existing .venv first, and --seed installs pip into the new environment.
# Create and activate a virtual environment
uv venv --clear --python=3.13 --seed
source .venv/bin/activate
# Install the nightly litert-cli build from PyPI
uv pip install litert-cli-nightly
🐍 Option B: Standard setup (pip)
If uv is not installed, use a standard Python virtual environment:
# Create and activate a virtual environment
python3 -m venv .venv
source .venv/bin/activate
# Install the nightly litert-cli build from PyPI
pip install --upgrade pip setuptools wheel
pip install litert-cli-nightly
🛠️ Option C: Editable local setup
If you are developing or testing inside a local clone of the repository:
# Create and activate a virtual environment
uv venv --clear --python=3.13 --seed
source .venv/bin/activate
# Install in editable mode from the repository root
uv pip install -e .
🔍 Setup verification
Verify that the litert command resolves on your PATH:
litert --help
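If litert --help fails, the usual causes are an inactive virtual environment or a PATH that does not include it. The short check below reports both. It uses only the Python standard library; the helper names are illustrative and not part of the CLI.

```python
import shutil
import sys
from typing import Optional


def in_virtualenv() -> bool:
    """True when the interpreter runs inside a venv-style environment."""
    return sys.prefix != sys.base_prefix


def find_command(name: str) -> Optional[str]:
    """Return the full path of an executable on PATH, or None if absent."""
    return shutil.which(name)


if __name__ == "__main__":
    print("virtualenv active:", in_virtualenv())
    print("litert found at:", find_command("litert"))
```

A path inside .venv/bin indicates that the command comes from the sandbox you just created rather than from a system-wide install.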
💡 The central Model Reference catalog (model-ref)
To keep scripts and development robust, the LiteRT CLI implements a centralized Model Catalog:
- Format: Assign a unique name (alias) to a downloaded or imported model, such as mobilenet.
- Variants: Append a colon and a variant name to refer to an optimized variation, like mobilenet:int8 or mobilenet:gpu.
- Simplicity: All CLI commands accept this path-free alias directly and resolve it to the stored model file automatically.
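The alias scheme can be pictured as a lookup keyed by name and optional variant. The sketch below is purely illustrative: the CATALOG dictionary, the file paths, and both helper functions are hypothetical, and the CLI's real catalog storage and resolution rules are not described in this codelab.

```python
from typing import NamedTuple, Optional


class ModelRef(NamedTuple):
    name: str
    variant: Optional[str]


def parse_model_ref(ref: str) -> ModelRef:
    """Split 'name' or 'name:variant' into its parts."""
    name, sep, variant = ref.partition(":")
    if not name or (sep and not variant):
        raise ValueError(f"invalid model reference: {ref!r}")
    return ModelRef(name, variant or None)


# Hypothetical catalog contents, for illustration only.
CATALOG = {
    ("mobilenet", None): "models/mobilenet.tflite",
    ("mobilenet", "int8"): "models/mobilenet_int8.tflite",
}


def resolve(ref: str) -> str:
    """Map a model reference to a file path in the example catalog."""
    key = parse_model_ref(ref)
    if key not in CATALOG:
        raise KeyError(f"unknown model reference: {ref!r}")
    return CATALOG[key]


print(resolve("mobilenet:int8"))
```

Keeping the variant separate from the name lets scripts switch between builds of the same model by changing only the suffix.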
4. 📲 Device deployment & profiling
Let's go further: profile locally, deploy to USB-connected devices, use hardware accelerators, compile ahead of time (AOT), and benchmark in the cloud.
📊 Local profiling metrics
Download EfficientNet-B1 and profile performance statistics on your desktop host:
# Download and register model alias in central catalog
litert download litert-community/efficientnet_b1 --file "*.tflite" --output efficientnet
litert import efficientnet/efficientnet_b1.tflite --model-ref efficientnet_b1
# Run high-precision benchmark on desktop CPU
litert benchmark efficientnet_b1 --desktop --cpu
Observe the key performance headers in the benchmark log:
- Model initialization: Time to load the model and set up the runtime.
- Warmup (avg): Average time of the initial warmup runs, which include one-time compilation overhead.
- Inference (avg): Average latency of a steady-state inference run.
- Overall footprint: Peak RAM consumed during execution.
📲 Target A: Mobile CPU (USB connect)
Connect an Android device with USB Debugging enabled, and deploy:
# 1. Confirm device connection
adb devices
# 2. Push and execute model on mobile CPU
litert run efficientnet_b1 --android --cpu
The CLI automatically pushes the model to the target device, executes the inference loop, and pipes the outputs back.
🎮 Target B: Mobile GPU
Offload heavy workloads to the GPU using OpenCL or WebGPU:
# Benchmark model performance on mobile GPU
litert benchmark efficientnet_b1 --android --gpu
# Run inference with GPU acceleration and CPU fallback
litert run efficientnet_b1 --android --accelerator gpu,cpu
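An ordered accelerator list such as gpu,cpu is commonly read as a preference order: each operation goes to the first accelerator that supports it, and anything unsupported falls back to the next entry. The sketch below illustrates that idea only. How the LiteRT runtime actually partitions a graph is not described in this codelab, and the operation names and support sets are hypothetical.

```python
def assign_ops(ops: list[str], preference: list[str],
               supported: dict[str, set[str]]) -> dict[str, str]:
    """Place each op on the first accelerator in `preference` that supports it."""
    placement = {}
    for op in ops:
        for accel in preference:
            if op in supported.get(accel, set()):
                placement[op] = accel
                break
        else:
            # No accelerator in the list can run this op.
            raise ValueError(f"no accelerator supports op {op!r}")
    return placement


# Hypothetical support sets, for illustration only.
ops = ["CONV_2D", "DEPTHWISE_CONV_2D", "CUSTOM_OP"]
supported = {
    "gpu": {"CONV_2D", "DEPTHWISE_CONV_2D"},
    "cpu": {"CONV_2D", "DEPTHWISE_CONV_2D", "CUSTOM_OP"},
}

print(assign_ops(ops, ["gpu", "cpu"], supported))
```

Listing cpu last keeps the model runnable even when the preferred accelerator cannot execute every operation.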
⚙️ Target C: JIT Android NPU
Offload execution directly to the NPU on modern chipsets:
# Run with on-device JIT NPU acceleration
litert run efficientnet_b1 --android --accelerator npu,cpu
- Warning: On-device JIT compilation adds significant initialization and warmup latency.
🚀 Target D: AOT compiled NPU
Pre-compile offline to bypass runtime JIT overheads and achieve maximum acceleration:
# 1. Offline compile for Qualcomm SM8750 NPU (Linux host)
litert compile efficientnet/efficientnet_b1.tflite --target sm8750
# 2. Execute compiled AOT binary with zero JIT warmup latency
litert run efficientnet_b1_Qualcomm_SM8750.tflite --android --npu
☁️ Cloud profiling (Google AI Edge Portal)
Run benchmarks on remote devices in Google Cloud's device farms.
# Log in to your Google Cloud project
gcloud auth login
# Benchmark on a remote Pixel 7 CPU
litert benchmark efficientnet_b1 --gcp --device "pixel 7" --cpu --gcp-project "your-project-id"
# Benchmark on the GPUs of multiple remote devices (Pixel 7 and SM-S931U1)
litert benchmark efficientnet_b1 --gcp --devices "pixel 7, sm-s931u1" --gpu --gcp-project "your-project-id"
5. 🧠 Advanced topics: LLMs & speech recognition
Let's explore advanced models by converting and running Large Language Models (LLMs) and Automatic Speech Recognition (ASR) models.
💬 Generative AI (LLMs)
Edge LLMs are memory-bound. We address this with weight-only INT4/INT8 quantization, which compresses the weights while keeping activations in Float32:
# 1. Automated download & conversion from Hugging Face
litert convert Qwen/Qwen1.5-0.5B-Chat --output models/qwen
# 2. Convert again with INT4 weight-only quantization
litert convert Qwen/Qwen1.5-0.5B-Chat \
--quantize-recipe weight_only_wi4_afp32 \
--output models/qwen_w4
# 3. One-shot generation
litert lm run models/qwen/model.litertlm --prompt "Introduce San Francisco."
# 4. Interactive chat
litert lm run models/qwen/model.litertlm
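To show what weight-only quantization does to the numbers, here is a minimal sketch of symmetric per-tensor quantization in plain Python. It is a generic textbook scheme, not necessarily what the weight_only_wi4_afp32 recipe implements; production recipes often use per-channel or block-wise scales. The function names and sample weights are illustrative.

```python
def quantize_symmetric(weights: list[float], bits: int) -> tuple[list[int], float]:
    """Map floats to signed integers in [-qmax, qmax] with a single scale."""
    qmax = 2 ** (bits - 1) - 1  # 127 for INT8, 7 for INT4
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q: list[int], scale: float) -> list[float]:
    """Recover approximate Float32 weights for use with float activations."""
    return [v * scale for v in q]


weights = [0.42, -1.27, 0.05, 0.9]
for bits in (8, 4):
    q, scale = quantize_symmetric(weights, bits)
    restored = dequantize(q, scale)
    worst = max(abs(w - r) for w, r in zip(weights, restored))
    print(f"INT{bits}: q={q} scale={scale:.4f} max_error={worst:.4f}")
```

The rounding error per weight is at most half the scale, so INT4 trades a visibly larger error for half the storage of INT8.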
🎙️ Speech processing (ASR)
ASR models like Whisper or Parakeet contain both an encoder and a decoder module. You can benchmark each module separately by passing its signature key:
# 1. Download prepackaged Whisper-Tiny
litert download litert-community/whisper-tiny --file "whisper_tiny_30s_f32.tflite" --output "models/whisper_tiny"
# 2. Profile audio encoding
litert benchmark models/whisper_tiny/whisper_tiny_30s_f32.tflite --android --gpu --signature-key "encode"
# 3. Profile text token decoding loop
litert benchmark models/whisper_tiny/whisper_tiny_30s_f32.tflite --android --cpu --signature-key "decode"
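Before benchmarking a signature, it can be useful to confirm which keys a model exposes. The sketch below assumes an interpreter object with a get_signature_list() method, as provided by the TensorFlow Lite Python Interpreter. The FakeInterpreter stand-in and its tensor names are hypothetical, so the example runs without a model file.

```python
def require_signature(interpreter, key: str) -> dict:
    """Return the input/output names for `key`, or fail with the available keys."""
    signatures = interpreter.get_signature_list()
    if key not in signatures:
        available = ", ".join(sorted(signatures)) or "<none>"
        raise KeyError(f"signature {key!r} not found; available: {available}")
    return signatures[key]


class FakeInterpreter:
    """Stand-in that mimics get_signature_list(); tensor names are made up."""

    def get_signature_list(self) -> dict:
        return {
            "encode": {"inputs": ["input_features"], "outputs": ["encoder_output"]},
            "decode": {"inputs": ["encoder_output", "tokens"], "outputs": ["logits"]},
        }


# With a real model you would construct an interpreter instead, for example
# (assumption: the ai-edge-litert package is installed):
#   from ai_edge_litert.interpreter import Interpreter
#   interpreter = Interpreter(model_path="models/whisper_tiny/whisper_tiny_30s_f32.tflite")
interpreter = FakeInterpreter()
print(require_signature(interpreter, "encode"))
```

Failing early with the list of available keys avoids a long device round trip caused by a misspelled signature name.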
6. 🤖 Use in agentic coding
LiteRT CLI is agent-friendly, so you can integrate it directly with coding agents.
Add the LiteRT CLI skill (SKILL.md) to your coding agent, such as Google Antigravity, then ask it to carry out the workflows from this codelab, for example converting, quantizing, or benchmarking a model.
7. 🚀 Congratulations & next steps
🎉 Outstanding! You have completed the LiteRT CLI 101 codelab!
You now know how to:
- Convert PyTorch models to LiteRT models.
- Quantize model weights to INT4/INT8.
- Run models on mobile CPU/GPU/NPU.
- Compile for NPU.
- Benchmark on desktop, mobile and cloud.
- Convert and run LLMs.
- Use LiteRT CLI with coding agents.
🔗 Resources
- LiteRT CLI: https://github.com/google-ai-edge/LiteRT-CLI
- LiteRT: https://ai.google.dev/edge/litert
- Examples: https://github.com/google-ai-edge/LiteRT-CLI/tree/main/examples