How Power Lever Works

We combine a Claude-based routing agent with dynamic Modal GPU workers and an on-demand speculative decoding implementation.

Request Pipeline

Every prompt flows through 8 orchestration steps — from classification to right-sized GPU inference.

1. User Prompt (0ms): the user submits a prompt with a power level (0-100).
2. Gateway (~2ms): the FastAPI gateway receives the request and validates the schema.
3. Claude Agent Classifies (~150ms): a tool-use agent analyzes prompt complexity and domain.
4. Tier Selection: maps complexity to the optimal GPU tier (Eco / Balanced / Performance / Ultra).
5. GPU Dispatch (~50ms): modal.Function.lookup() routes the request to a right-sized GPU worker.
6. vLLM Inference (~200ms TTFT): speculative decoding generates tokens with draft and target models.
7. SSE Stream: tokens stream back in real time via Server-Sent Events.
8. Sustainability Metrics: energy, water, CO2, and cost savings are calculated and displayed.
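The tier-selection step (04) reduces to a band lookup over the classifier's complexity score. A minimal sketch, assuming the Claude agent returns an integer score in 0-100; the function name is illustrative, and the band boundaries mirror the tier ranges listed further down this page:

```python
def select_tier(complexity: int) -> str:
    """Map a 0-100 complexity score to a GPU tier.

    Bands match the tier table: Eco 0-25, Balanced 26-55,
    Performance 56-80, Ultra 81-100.
    """
    if not 0 <= complexity <= 100:
        raise ValueError("complexity must be in [0, 100]")
    if complexity <= 25:
        return "eco"          # L4, Nemotron-Mini-4B
    if complexity <= 55:
        return "balanced"     # A10G, Nemotron-Nano-8B
    if complexity <= 80:
        return "performance"  # A100, Nemotron-70B
    return "ultra"            # H100, Nemotron-70B-FP8
```

The returned tier name is what GPU Dispatch (05) would use to look up the corresponding Modal worker.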

Traditional API vs Power Lever

Most APIs always use maximum compute. Power Lever matches GPU resources to actual task difficulty.

Traditional API (frontier model, black box):
- Flow: Prompt → Always H100 → Response
- GPU Power: 350W always (H100 for every query)
- Analysis: none (no prompt classification)
- Approach: one-size-fits-all (max compute every time)
- Energy Usage: 100%

Power Lever (intelligent GPU routing):
- Flow: Prompt → Classify → Right-size GPU → Spec Decode → Response
- GPU Power: 55-280W (matched to task complexity)
- Analysis: Claude agent (real-time classification)
- Approach: right-sized (4 tiers, speculative decoding)
- Avg. Energy Usage: ~20%
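The average energy figure depends on how traffic distributes across tiers. A back-of-envelope sketch: the TDP values come from the tier table, but the traffic shares and the 30-second query duration are assumed values, not measurements:

```python
QUERY_SECONDS = 30  # assumed average query duration

def kwh(watts: float, seconds: float) -> float:
    """Convert a power draw over a duration to kilowatt-hours."""
    return watts * seconds / 3_600_000  # watt-seconds → kWh

# Baseline: an H100 (350W) serves every query.
baseline = kwh(350, QUERY_SECONDS)

# Routed: a hypothetical traffic mix across the four tiers.
tier_mix = {                # tier: (traffic share, TDP watts)
    "eco":         (0.40, 72),
    "balanced":    (0.35, 150),
    "performance": (0.20, 250),
    "ultra":       (0.05, 350),
}
routed = sum(share * kwh(w, QUERY_SECONDS) for share, w in tier_mix.values())

print(f"baseline: {baseline:.5f} kWh/query, routed: {routed:.5f} kWh/query")
```

Routing alone accounts for part of the saving; speculative decoding shortens generation time on the lower tiers, pushing the average down further.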

Four Inference Tiers

Each tier pairs a GPU with optimally sized models and speculative decoding parameters.

Eco (power 0-25): L4 · 72W TDP · $0.60/hr
- Target: Nemotron-Mini-4B
- Draft: TinyLlama-1.1B
- Speculative decoding: K=8
- Use cases: simple Q&A, factual lookups, basic arithmetic

Balanced (power 26-55): A10G · 150W TDP · $1.10/hr
- Target: Nemotron-Nano-8B
- Draft: Nemotron-Mini-4B
- Speculative decoding: K=6
- Use cases: code generation, word problems, symptom assessment

Performance (power 56-80): A100 · 250W TDP · $2.20/hr
- Target: Nemotron-70B
- Draft: Nemotron-Nano-8B
- Speculative decoding: K=4
- Use cases: complex debugging, calculus proofs, cardiac workups

Ultra (power 81-100): H100 · 350W TDP · $3.50/hr
- Target: Nemotron-70B-FP8
- Draft: none
- Speculative decoding: off
- Use cases: polytrauma triage, constrained optimization, distributed systems

Speculative Decoding

A small draft model proposes K tokens; the target model verifies them in a single forward pass — up to K× speedup with no quality loss.

Typical draft-token acceptance rates by tier:
- Eco: ~85% (K=8)
- Balanced: ~78% (K=6)
- Performance: ~65% (K=4)
- Ultra: N/A (speculative decoding off)

Speculative decoding uses a small, fast draft model to generate K candidate tokens. The larger target model then verifies all K candidates in a single forward pass. Candidates are accepted left to right; at the first mismatch, the remaining candidates are discarded and the target model supplies the correct token itself. This yields up to K× throughput with output mathematically identical to running the target model alone.
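The accept/reject loop can be illustrated with toy "models" over integer tokens (greedy variant). The draft and target functions here are deliberately trivial stand-ins, and a real engine like vLLM verifies all K candidates in one batched forward pass of the target model, not a Python loop:

```python
def target_model(prefix: list[int]) -> int:
    # Toy "large" model: greedily predicts last token + 1 (mod 10).
    return (prefix[-1] + 1) % 10

def draft_model(prefix: list[int], k: int) -> list[int]:
    # Toy "small" model: cheaper and less accurate — it wraps at 5,
    # so it agrees with the target only some of the time.
    last = prefix[-1]
    return [(last + i + 1) % 5 for i in range(k)]

def speculative_step(prefix: list[int], k: int) -> list[int]:
    """One decoding step: draft proposes k tokens, target verifies.

    Keep the longest accepted run; the target then contributes one
    more token (the correction at the first mismatch, or a bonus
    token if every candidate was accepted).
    """
    candidates = draft_model(prefix, k)
    accepted: list[int] = []
    for tok in candidates:
        if target_model(prefix + accepted) == tok:
            accepted.append(tok)
        else:
            break  # first mismatch: discard the rest
    accepted.append(target_model(prefix + accepted))
    return accepted

print(speculative_step([3], k=4))  # → [4, 5]
```

Here the draft's first candidate matches the target, the second does not, so one token is accepted and the target's correction makes two tokens from a single verification step. When every candidate matches, one step yields k+1 tokens.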

Sustainability Impact

Every right-sized query saves energy, water, cost, and carbon. Small savings per query compound at scale.

The dashboard tracks four per-query averages:
- Energy saved (kWh / query)
- Water saved (L / query)
- Cost saved ($ / query)
- CO2 avoided (kg / query)
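Each per-query figure derives from the delta between the baseline GPU and the routed one. An illustrative calculation for a single Eco-routed query: the TDPs and hourly rates come from the tier table, but the 30-second duration, water intensity, and grid carbon intensity are assumed placeholder coefficients, not the dashboard's actual factors:

```python
# One Eco-routed query vs the always-H100 baseline.
BASELINE_W, ROUTED_W = 350, 72            # TDPs from the tier table
BASELINE_USD_HR, ROUTED_USD_HR = 3.50, 0.60
SECONDS = 30                              # assumed query duration
WATER_L_PER_KWH = 1.8                     # assumed datacenter water intensity
CO2_KG_PER_KWH = 0.4                      # assumed grid carbon intensity

kwh_saved = (BASELINE_W - ROUTED_W) * SECONDS / 3_600_000
water_saved = kwh_saved * WATER_L_PER_KWH
cost_saved = (BASELINE_USD_HR - ROUTED_USD_HR) * SECONDS / 3600
co2_saved = kwh_saved * CO2_KG_PER_KWH

print(f"{kwh_saved:.5f} kWh, {water_saved:.5f} L, "
      f"${cost_saved:.4f}, {co2_saved:.6f} kg CO2")
```

Individually these are tiny numbers, which is why the at-scale projection below matters.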

At Scale: 10,000 Queries / Day

Projected annual impact if Power Lever handles 10,000 queries per day.

- Energy saved (kWh / year), expressed as equivalent US homes powered for a year
- Water saved (L / year), roughly 9 bathtubs of water
- Cost saved ($ / year): $6,570 annually
- CO2 avoided (kg / year), expressed as equivalent transatlantic flights offset
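The annual figure is internally consistent with a small per-query saving compounding at scale. A quick sanity check of the arithmetic:

```python
# $6,570/year at 10,000 queries/day implies the per-query saving below.
QUERIES_PER_DAY = 10_000
ANNUAL_SAVINGS_USD = 6_570

per_day = ANNUAL_SAVINGS_USD / 365        # $18.00 per day
per_query = per_day / QUERIES_PER_DAY     # dollars per query
print(f"${per_query:.4f} saved per query")  # → $0.0018 saved per query
```

A fraction of a cent per query, yet thousands of dollars a year at this volume.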