How Power Lever Works
We combine a Claude-based routing agent with dynamic Modal GPU workers and an on-demand speculative decoding implementation.
Request Pipeline
Every prompt flows through 8 orchestration steps — from classification to right-sized GPU inference.
User Prompt
0 ms: User submits a prompt with power level (0-100)
Gateway
~2 ms: FastAPI gateway receives request, validates schema
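A minimal sketch of this gateway step, assuming a FastAPI app with a Pydantic request schema; the endpoint path and field names are illustrative, not the project's actual contract.

```python
# Minimal gateway sketch: accept a prompt plus a 0-100 power level and let
# Pydantic handle schema validation. Endpoint path and field names are
# illustrative assumptions.
from fastapi import FastAPI
from pydantic import BaseModel, Field

app = FastAPI()

class PromptRequest(BaseModel):
    prompt: str = Field(min_length=1)
    power_level: int = Field(ge=0, le=100)  # the 0-100 power lever setting

@app.post("/v1/generate")
async def generate(req: PromptRequest):
    # Classification, dispatch, and streaming happen downstream; this step
    # only validates and accepts the request.
    return {"status": "accepted", "power_level": req.power_level}
```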
Claude Agent Classifies
~150 ms: Tool-use agent analyzes prompt complexity and domain
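A hedged sketch of the classification step using the Anthropic Python SDK's tool-use API. The tool schema, field names, and model id are assumptions made for illustration; only the fact that a tool-use agent scores complexity and domain comes from the pipeline above.

```python
# Classification sketch: force Claude to answer through a structured tool
# call so the routing layer gets machine-readable output. Tool schema and
# model id are assumptions.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

CLASSIFY_TOOL = {
    "name": "classify_prompt",
    "description": "Classify a user prompt by complexity and domain.",
    "input_schema": {
        "type": "object",
        "properties": {
            "complexity": {"type": "integer", "minimum": 0, "maximum": 100},
            "domain": {"type": "string"},
        },
        "required": ["complexity", "domain"],
    },
}

def classify(prompt: str) -> dict:
    response = client.messages.create(
        model="claude-3-5-sonnet-latest",  # illustrative model id
        max_tokens=256,
        tools=[CLASSIFY_TOOL],
        tool_choice={"type": "tool", "name": "classify_prompt"},
        messages=[{"role": "user", "content": f"Classify this prompt:\n\n{prompt}"}],
    )
    # The forced tool call comes back as a structured tool_use block.
    block = next(b for b in response.content if b.type == "tool_use")
    return block.input  # e.g. {"complexity": 35, "domain": "coding"}
```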
Tier Selection
Maps complexity to optimal GPU tier (Eco / Balanced / Performance / Ultra)
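A minimal sketch of tier selection, assuming the classifier emits a 0-100 complexity score; the threshold values are illustrative assumptions, not the project's tuned cutoffs.

```python
# Map a 0-100 complexity score to one of the four tiers.
# Thresholds are placeholders for illustration.
def select_tier(complexity: int) -> str:
    if complexity < 25:
        return "eco"
    if complexity < 50:
        return "balanced"
    if complexity < 75:
        return "performance"
    return "ultra"
```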
GPU Dispatch
~50 ms: modal.Function.lookup() routes to right-sized GPU worker
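A sketch of the dispatch step, assuming one deployed Modal app per tier exposing a generator function named generate. The app and function names are placeholders; only the use of modal.Function.lookup() comes from the pipeline above.

```python
# Dispatch sketch: look up the deployed worker for the chosen tier and
# stream tokens from its remote generator. App/function names are assumed.
import modal

TIER_APPS = {
    "eco": "power-lever-eco",
    "balanced": "power-lever-balanced",
    "performance": "power-lever-performance",
    "ultra": "power-lever-ultra",
}

def dispatch(tier: str, prompt: str):
    worker = modal.Function.lookup(TIER_APPS[tier], "generate")
    # Iterate over the remote generator's yields as they arrive.
    yield from worker.remote_gen(prompt)
```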
vLLM Inference
~200 ms TTFT (time to first token): Speculative decoding generates tokens with draft + target models
SSE Stream
Tokens stream back in real time via Server-Sent Events
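A standalone sketch of the SSE step using FastAPI's StreamingResponse; the endpoint path, event format, and [DONE] sentinel are assumptions for illustration.

```python
# SSE sketch: wrap a token generator in a text/event-stream response so the
# client receives tokens as they are produced.
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

def token_source(prompt: str):
    # Stand-in for the dispatched Modal generator from the previous step.
    yield from ["Hello", ",", " world"]

@app.get("/v1/stream")
async def stream(prompt: str):
    def event_stream():
        for token in token_source(prompt):
            yield f"data: {token}\n\n"   # one SSE event per token
        yield "data: [DONE]\n\n"         # sentinel so the client can close
    return StreamingResponse(event_stream(), media_type="text/event-stream")
```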
Sustainability Metrics
Energy, water, CO2, and cost savings calculated and displayed
Traditional API vs Power Lever
Most APIs always use maximum compute. Power Lever matches GPU resources to actual task difficulty.
Traditional API
Frontier model, black box
H100 for every query
No prompt classification
Max compute every time
Power Lever
Intelligent GPU routing
Matched to task complexity
Real-time classification
4 tiers, spec decoding
Four Inference Tiers
Each tier pairs a GPU with optimally sized models and speculative decoding parameters.
Eco
Target model: Nemotron-Mini-4B
Draft model: TinyLlama-1.1B
Example workloads: Simple Q&A, factual lookups, basic arithmetic
Balanced
Target model: Nemotron-Nano-8B
Draft model: Nemotron-Mini-4B
Example workloads: Code generation, word problems, symptom assessment
Performance
Target model: Nemotron-70B
Draft model: Nemotron-Nano-8B
Example workloads: Complex debugging, calculus proofs, cardiac workups
Ultra
Target model: Nemotron-70B-FP8
Draft model: None (speculative decoding disabled)
Example workloads: Polytrauma triage, constrained optimization, distributed systems
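The tier table above can be expressed as a configuration mapping. Tier labels follow the Eco / Balanced / Performance / Ultra naming from the pipeline; the model strings mirror the table (real deployments would substitute full model repository ids), and the num_speculative_tokens values are illustrative assumptions.

```python
# Tier configuration sketch derived from the table above.
# Model strings are shorthand; num_speculative_tokens values are placeholders.
TIERS = {
    "eco": {
        "target_model": "Nemotron-Mini-4B",
        "draft_model": "TinyLlama-1.1B",
        "num_speculative_tokens": 5,
    },
    "balanced": {
        "target_model": "Nemotron-Nano-8B",
        "draft_model": "Nemotron-Mini-4B",
        "num_speculative_tokens": 5,
    },
    "performance": {
        "target_model": "Nemotron-70B",
        "draft_model": "Nemotron-Nano-8B",
        "num_speculative_tokens": 4,
    },
    "ultra": {
        "target_model": "Nemotron-70B-FP8",
        "draft_model": None,  # no speculative decoding on the top tier
        "num_speculative_tokens": 0,
    },
}
```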
Speculative Decoding
A small draft model proposes K tokens; the target model verifies them in a single forward pass — up to K× speedup with no quality loss.
Speculative decoding uses a small, fast draft model to generate K candidate tokens. The larger target model then verifies all K tokens in a single forward pass. Accepted tokens are kept; rejected ones are regenerated by the target. This yields up to a K× throughput gain while the output remains mathematically identical to running the target model alone.
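An illustrative vLLM setup for one draft + target pair following the pattern described above. The exact argument names depend on the vLLM release (older releases accept speculative_model and num_speculative_tokens directly; newer ones take a speculative_config dict), and the model ids here are placeholders.

```python
# Speculative decoding sketch with vLLM: a small draft model proposes K
# tokens, the target verifies them in one pass. Argument names assume an
# older vLLM API; model ids are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Nemotron-Nano-8B",               # target model (placeholder id)
    speculative_model="Nemotron-Mini-4B",   # draft model proposes K tokens
    num_speculative_tokens=5,               # K candidates verified per target pass
)

outputs = llm.generate(
    ["Explain speculative decoding in one paragraph."],
    SamplingParams(temperature=0.7, max_tokens=256),
)
print(outputs[0].outputs[0].text)
```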
Sustainability Impact
Every right-sized query saves energy, water, cost, and carbon. Small savings per query compound at scale.
At Scale: 10,000 Queries / Day
Projected annual impact if Power Lever handles 10,000 queries per day.
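A back-of-the-envelope sketch of how such a projection can be computed. The per-query savings figures below are placeholders, not Power Lever's measured values; they would be replaced by the metrics the dashboard reports.

```python
# Scale per-query savings up to an annual projection at 10,000 queries/day.
# All per-query values are placeholders, not measured results.
QUERIES_PER_DAY = 10_000
DAYS_PER_YEAR = 365

per_query_savings = {
    "energy_wh": 2.0,   # placeholder: Wh saved vs. always using the top tier
    "water_ml": 15.0,   # placeholder: cooling water saved per query
    "co2_g": 1.0,       # placeholder: grams of CO2 avoided per query
    "cost_usd": 0.002,  # placeholder: GPU cost saved per query
}

annual = {
    metric: value * QUERIES_PER_DAY * DAYS_PER_YEAR
    for metric, value in per_query_savings.items()
}
for metric, total in annual.items():
    print(f"{metric}: {total:,.0f} per year")
```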