How Power Lever Works

We combine a Claude-based routing agent with dynamic Modal GPU workers and an on-demand speculative decoding implementation.

Request Pipeline

Every prompt flows through 8 orchestration steps — from classification to right-sized GPU inference.

1. User Prompt (0ms): the user submits a prompt with a power level (0-100).
2. Gateway (~2ms): the FastAPI gateway receives the request and validates the schema.
3. Claude Agent Classifies (~150ms): a tool-use agent analyzes prompt complexity and domain.
4. Tier Selection: maps complexity to the optimal GPU tier (Eco / Balanced / Performance / Ultra).
5. GPU Dispatch (~50ms): modal.Function.lookup() routes the request to a right-sized GPU worker.
6. vLLM Inference (~200ms TTFT): speculative decoding generates tokens with draft and target models.
7. SSE Stream: tokens stream back in real time via Server-Sent Events.
8. Sustainability Metrics: energy, water, CO2, and cost savings are calculated and displayed.
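The tier-selection step (04) reduces to a band lookup over the classifier's complexity score. A minimal sketch, assuming the Claude agent returns an integer score in 0-100; the function name is illustrative, and the band boundaries mirror the tier ranges listed further down this page:

```python
def select_tier(complexity: int) -> str:
    """Map a 0-100 complexity score to a GPU tier.

    Bands match the tier table: Eco 0-25, Balanced 26-55,
    Performance 56-80, Ultra 81-100.
    """
    if not 0 <= complexity <= 100:
        raise ValueError("complexity must be in [0, 100]")
    if complexity <= 25:
        return "eco"          # L4, Nemotron-Mini-4B
    if complexity <= 55:
        return "balanced"     # A10G, Nemotron-Nano-8B
    if complexity <= 80:
        return "performance"  # A100, Nemotron-70B
    return "ultra"            # H100, Nemotron-70B-FP8
```

The returned tier name is what GPU Dispatch (05) would use to look up the corresponding Modal worker.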

Traditional API vs Power Lever

Most APIs always use maximum compute. Power Lever matches GPU resources to actual task difficulty.

Traditional API (frontier model, black box):
- Flow: Prompt → Always H100 → Response
- GPU Power: 350W always (H100 for every query)
- Analysis: none (no prompt classification)
- Approach: one-size-fits-all (max compute every time)
- Energy Usage: 100%

Power Lever (intelligent GPU routing):
- Flow: Prompt → Classify → Right-size GPU → Spec Decode → Response
- GPU Power: 55-280W (matched to task complexity)
- Analysis: Claude agent (real-time classification)
- Approach: right-sized (4 tiers, speculative decoding)
- Avg. Energy Usage: ~20%
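The average energy figure depends on how traffic distributes across tiers. A back-of-envelope sketch: the TDP values come from the tier table, but the traffic shares and the 30-second query duration are assumed values, not measurements:

```python
QUERY_SECONDS = 30  # assumed average query duration

def kwh(watts: float, seconds: float) -> float:
    """Convert a power draw over a duration to kilowatt-hours."""
    return watts * seconds / 3_600_000  # watt-seconds → kWh

# Baseline: an H100 (350W) serves every query.
baseline = kwh(350, QUERY_SECONDS)

# Routed: a hypothetical traffic mix across the four tiers.
tier_mix = {                # tier: (traffic share, TDP watts)
    "eco":         (0.40, 72),
    "balanced":    (0.35, 150),
    "performance": (0.20, 250),
    "ultra":       (0.05, 350),
}
routed = sum(share * kwh(w, QUERY_SECONDS) for share, w in tier_mix.values())

print(f"baseline: {baseline:.5f} kWh/query, routed: {routed:.5f} kWh/query")
```

Routing alone accounts for part of the saving; speculative decoding shortens generation time on the lower tiers, pushing the average down further.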

Four Inference Tiers

Each tier pairs a GPU with optimally sized models and speculative decoding parameters.

Eco (power 0-25): L4 · 72W TDP · $0.60/hr
- Target: Nemotron-Mini-4B
- Draft: TinyLlama-1.1B
- Speculative decoding: K=8
- Use cases: simple Q&A, factual lookups, basic arithmetic

Balanced (power 26-55): A10G · 150W TDP · $1.10/hr
- Target: Nemotron-Nano-8B
- Draft: Nemotron-Mini-4B
- Speculative decoding: K=6
- Use cases: code generation, word problems, symptom assessment

Performance (power 56-80): A100 · 250W TDP · $2.20/hr
- Target: Nemotron-70B
- Draft: Nemotron-Nano-8B
- Speculative decoding: K=4
- Use cases: complex debugging, calculus proofs, cardiac workups

Ultra (power 81-100): H100 · 350W TDP · $3.50/hr
- Target: Nemotron-70B-FP8
- Draft: none
- Speculative decoding: off
- Use cases: polytrauma triage, constrained optimization, distributed systems

Speculative Decoding

A small draft model proposes K tokens; the target model verifies them in a single forward pass — up to K× speedup with no quality loss.

Typical draft-token acceptance rates by tier:
- Eco: ~85% (K=8)
- Balanced: ~78% (K=6)
- Performance: ~65% (K=4)
- Ultra: N/A (speculative decoding off)

Speculative decoding uses a small, fast draft model to generate K candidate tokens. The larger target model then verifies all K candidates in a single forward pass. Candidates are accepted left to right; at the first mismatch, the remaining candidates are discarded and the target model supplies the correct token itself. This yields up to K× throughput with output mathematically identical to running the target model alone.
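The accept/reject loop can be illustrated with toy "models" over integer tokens (greedy variant). The draft and target functions here are deliberately trivial stand-ins, and a real engine like vLLM verifies all K candidates in one batched forward pass of the target model, not a Python loop:

```python
def target_model(prefix: list[int]) -> int:
    # Toy "large" model: greedily predicts last token + 1 (mod 10).
    return (prefix[-1] + 1) % 10

def draft_model(prefix: list[int], k: int) -> list[int]:
    # Toy "small" model: cheaper and less accurate — it wraps at 5,
    # so it agrees with the target only some of the time.
    last = prefix[-1]
    return [(last + i + 1) % 5 for i in range(k)]

def speculative_step(prefix: list[int], k: int) -> list[int]:
    """One decoding step: draft proposes k tokens, target verifies.

    Keep the longest accepted run; the target then contributes one
    more token (the correction at the first mismatch, or a bonus
    token if every candidate was accepted).
    """
    candidates = draft_model(prefix, k)
    accepted: list[int] = []
    for tok in candidates:
        if target_model(prefix + accepted) == tok:
            accepted.append(tok)
        else:
            break  # first mismatch: discard the rest
    accepted.append(target_model(prefix + accepted))
    return accepted

print(speculative_step([3], k=4))  # → [4, 5]
```

Here the draft's first candidate matches the target, the second does not, so one token is accepted and the target's correction makes two tokens from a single verification step. When every candidate matches, one step yields k+1 tokens.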

Sustainability Impact

Every right-sized query saves energy, water, cost, and carbon. Small savings per query compound at scale.

The dashboard tracks four per-query averages:
- Energy saved (kWh / query)
- Water saved (L / query)
- Cost saved ($ / query)
- CO2 avoided (kg / query)
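Each per-query figure derives from the delta between the baseline GPU and the routed one. An illustrative calculation for a single Eco-routed query: the TDPs and hourly rates come from the tier table, but the 30-second duration, water intensity, and grid carbon intensity are assumed placeholder coefficients, not the dashboard's actual factors:

```python
# One Eco-routed query vs the always-H100 baseline.
BASELINE_W, ROUTED_W = 350, 72            # TDPs from the tier table
BASELINE_USD_HR, ROUTED_USD_HR = 3.50, 0.60
SECONDS = 30                              # assumed query duration
WATER_L_PER_KWH = 1.8                     # assumed datacenter water intensity
CO2_KG_PER_KWH = 0.4                      # assumed grid carbon intensity

kwh_saved = (BASELINE_W - ROUTED_W) * SECONDS / 3_600_000
water_saved = kwh_saved * WATER_L_PER_KWH
cost_saved = (BASELINE_USD_HR - ROUTED_USD_HR) * SECONDS / 3600
co2_saved = kwh_saved * CO2_KG_PER_KWH

print(f"{kwh_saved:.5f} kWh, {water_saved:.5f} L, "
      f"${cost_saved:.4f}, {co2_saved:.6f} kg CO2")
```

Individually these are tiny numbers, which is why the at-scale projection below matters.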

At Scale: 10,000 Queries / Day

Projected annual impact if Power Lever handles 10,000 queries per day.

- Energy saved (kWh / year), expressed as equivalent US homes powered for a year
- Water saved (L / year), roughly 9 bathtubs of water
- Cost saved ($ / year): $6,570 annually
- CO2 avoided (kg / year), expressed as equivalent transatlantic flights offset
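The annual figure is internally consistent with a small per-query saving compounding at scale. A quick sanity check of the arithmetic:

```python
# $6,570/year at 10,000 queries/day implies the per-query saving below.
QUERIES_PER_DAY = 10_000
ANNUAL_SAVINGS_USD = 6_570

per_day = ANNUAL_SAVINGS_USD / 365        # $18.00 per day
per_query = per_day / QUERIES_PER_DAY     # dollars per query
print(f"${per_query:.4f} saved per query")  # → $0.0018 saved per query
```

A fraction of a cent per query, yet thousands of dollars a year at this volume.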