RAF · the self-optimizing layer for production AI

Control your LLM spend before it controls you.

RAF sits in front of every model call and actively makes it cheaper by replaying your real traffic against smaller models, grading the answers, and routing each request to the cheapest one that still clears your quality bar. The savings compound every week. Your quality doesn't move.

Frontier-model quality at a fraction of the cost.
Route one workflow in < 1 hour Quality-gated downgrades No app rewrite
raf · optimization enginelive
Quality held99.2%
Cost down64%
Auto-routed71%
14:32:07support-summarize · opushaikuROUTED ↓≳0.98 · −$0.041
14:32:07doc-extract · sonnethaikuROUTED ↓≳0.99 · −$0.007
14:32:06legal-review · opusopusHELDfloor not met
14:32:06faq-answer · near-duplicateCACHE ~0.97 · −$0.012
14:32:05copy-gen · gpt-4ohaikuROUTED ↓≳0.96 · −$0.018
14:32:05ticket-triage · exact matchCACHE HITn/a · −$0.004
14:32:04summarize-call · sonnethaikuROUTED ↓≳0.98 · −$0.009
14:32:04research-synth · opussonnetROUTED ↓≳0.97 · −$0.024
every downgrade graded before it ships · quality floor 0.95adaptive · ML
opushaiku on 71% of routes quality held 99.2% every downgrade graded first near-duplicate cache live cost-per-task down 64% prompt-cost regressions caught in minutes value compounds as prices fall opushaiku on 71% of routes quality held 99.2% every downgrade graded first near-duplicate cache live cost-per-task down 64% prompt-cost regressions caught in minutes value compounds as prices fall
What RAF is

The economic safety layer for production AI.

RAF is a self-hosted LLM FinOps gateway. It sits in front of your existing model calls with a single base-URL change and no app rewrite, then routes every request, across OpenAI, Anthropic, Google, Bedrock, and open-source models, through one policy layer: budgets enforced before the call, identical prompts served from cache, failures retried, and sensitive data redacted before it ever reaches a provider. Every call, even the ones it blocks, lands in one normalized ledger.

That ledger is the foundation. Most tools stop at "what did we spend?" RAF goes to "what should this have cost?": what one completed task runs, which customers or features are negative-margin, how much of the provider bill you're even seeing, and which routes can move to a cheaper model without anyone noticing. Then it acts on the answer.

01 · Drop-in gateway

Route, don't rewrite

A drop-in proxy with an OpenAI-compatible API, plus an optional Python decorator, pointing at any model you use: OpenAI, Anthropic, Google, Bedrock, or your own open-source deployments. Your apps and providers stay exactly as they are; traffic just flows through RAF.

02 · Control & safety

On by default

Budgets, exact cache, retries, PII and secret redaction, egress allow-lists, and a runaway-spend circuit breaker, centralized at the gateway instead of re-built in every service.

03 · Continuous optimization

Cheaper every week, safely

Quality-gated model downgrades, semantic cache, and prompt slimming run on your traffic history, with a live view and realtime anomaly alerts the moment something regresses.

Quality-gated optimization

Opus answers. Haiku economics.

The biggest line on your bill is the model you reach for by default. RAF finds the cheapest model that produces an equally good answer for each route, and it doesn't guess. It proves it on your own traffic before a single user is affected.

1

Replay your real traffic

RAF re-runs recent requests against smaller candidate models in the background, in shadow mode, never touching production responses.

2

Grade every answer

Built-in graders score each candidate (exact-match, JSON-schema validity, numeric tolerance, policy checks) into a hard quality delta.

3

Promote only what clears the bar

Routes that hold quality get downgraded automatically. The rest hold. RAF tracks before/after, so a regression rolls itself back.

No graders yet? RAF won't promote a downgrade on quality you can't measure; it asks for a signal first. That's the difference between optimization and hand-waving.

Cost × quality, per routeRAF operating point
Answer quality →
Cost per call →
your quality floor · 0.95
haiku
sonnet
gpt-4o
opus
RAF ↓
RAF lands you top-left: frontier-grade answers at a fraction of the cost, by picking the smallest model that still clears your floor, route by route.
The efficiency engine

It doesn't just report savings. It keeps finding new ones.

Six mechanisms run continuously on the one thing only RAF has: the full history of your traffic. Each makes the underlying calls structurally cheaper, and each gets better the longer RAF runs.

adaptive routingcompounds

Machine-learned routing

RAF learns which routes tolerate a smaller model and sends each request to the cheapest one that clears the quality bar, re-checking as your prompts and traffic shift.

71% of routes auto-optimized
cachecompounds

Semantic cache

Beyond exact matches, RAF recognizes near-identical prompts and serves them instantly, quality-gated and tenant-scoped, so reworded duplicates stop costing you twice.

38% served without a provider call
promptscompounds

Prompt optimization

RAF spots bloat (“this route is 38% boilerplate”) and surfaces shorter prompts that hold quality. Recommended, never silently mutated.

avg 31% fewer input tokens
eval

Span replay

Every recorded call becomes a test case. Replay a route against a new model, prompt, or temperature and get a cost / quality / latency diff, and most runs never hit a provider.

replay corpus from live traffic
pre-flight

Shadow mode

Run RAF alongside production for 72 hours, enforcing nothing. It returns an assessment: what it would save, cache, and block, before you change a thing.

zero production risk
unit economics

Cost per outcome

Roll calls up to what one completed task, tenant, or customer actually costs, and catch the account on a $99 plan quietly burning $500 of inference.

margin visible per customer
Value compounds.

When token prices fall, most savings tools lose their pitch. RAF does the opposite: it keeps mining your history for shorter prompts, new cache hits, and cheaper-model candidates. The cheaper tokens get, the more efficiency RAF finds.

Realtime intelligence

Reporting that doesn't wait for Monday.

A prompt change ships at 2pm and triples your cost-per-call. Most teams find out on next month's invoice. RAF watches the ledger as calls land and flags and immediately alerts on the regression minutes after deploy, with the route, the multiplier, and the likely cause already attached.

RAF surfaces spend anomalies, runaway sessions, quality drift, provider error spikes, and budget burn the moment they happen, where your team already works.

And it speaks to finance, too. One click flips the same live data into the CFO summary: gross vs. net, what RAF saved and protected, and the bottom-line number leadership actually forwards.

Want it pushed? RAF still drops that summary in Slack and inbox on a schedule. But it's a render of data that was already live the whole time, not the only time you get to see the truth.
live
Cost / call jumped3.1×
detected4 min after deploy v412
cost / call$0.011 → $0.034
likely causesystem prompt +1,840 tokens
blast radius~$2,600 / wk if unchanged
→ open the diff · pin to budget alert · roll back v412
Five questions, answered the same day

Every company buys tokens. Almost no one can answer these questions.

RAF turns each one from a quarterly fire-drill into a number you can watch move in realtime.

Q1

Where is our LLM spend actually going?

ledger by team, feature, tenant
Q2

What spend can we safely eliminate?

cache + downgrade candidates
Q3

How do we stop surprise bills?

budgets enforced pre-call
Q4

Are we leaking sensitive data into calls?

PII redaction + egress log
Q5

Is answer quality holding as we cut cost?

graders + replay eval
The problem

Buying tokens is one line of code. Governing them is nobody's job.

LLM spend just crossed from an experiment into a material budget line, the exact moment cost control starts getting bought. But teams begin with direct SDK calls scattered across product squads, and that works right up until it doesn't. Then every company hits the same wall at once: the invoice is one number no one can break down, no budget is enforced before a call fires, and nobody can say what a single completed task costs, or that one customer on a $99 plan is quietly burning $500 of inference a month.

01

Spend opacity

No attribution by product, tenant, feature, model, provider, or prompt. The invoice is one number.

unattributed
02

No hard guardrails

Budgets are tracked after the fact, not enforced before the call ever fires.

no cap
03

Repeated calls waste money

Identical prompts get re-sent thousands of times with no shared cache between teams.

duplicate spend
04

Runaway sessions

One bugged agent or retry loop can burn $20K over a weekend. Daily caps don't catch it.

$20K weekend
05

Silent data exposure

Prompts carry PII, secrets, and customer data out to providers with no consistent control.

unaudited egress
06

Provider incidents leak

Retries, rate limits, and failover are hand-rolled differently by every team.

inconsistent
07

Silent cost regressions

A prompt change triples cost-per-call. Nobody notices until the invoice arrives.

3× silent
08

No unit economics

You know the provider bill, but not what one completed task costs, or that a customer on a $99 plan is burning $500 of inference a month.

negative margin
How it works

One base-URL change. Keep your code. Reach every model.

RAF speaks the OpenAI-compatible API, so one base-URL change points your client at it. From there it routes to any frontier or open-source model, and every call inherits budgets, cache, retries, redaction, and a durable cost ledger. No app rewrite.

1

Connect traffic

Swap OPENAI_BASE_URL to your RAF endpoint. Streaming and non-streaming both pass through untouched.

2

See spend

Every call lands in a durable ledger: tokens, cost, latency, cache state, provider, retry, all by tenant and feature.

3

Control spend

Turn on budgets, exact cache, retry/admission, PII redaction and egress allow-lists. Policies apply at the gateway.

4

Optimize continuously

RAF surfaces savings, safety events, and what to optimize next in realtime, with a scheduled summary pushed to inbox and Slack.

bash · your-app/.env
# Before: calling the provider directly
OPENAI_BASE_URL=https://api.openai.com/v1
 
# After: route through RAF. That's the change.
OPENAI_BASE_URL=https://raf.yourco.internal/v1
OPENAI_API_KEY=raf_sk_live_••••••••
 
$ raf doctor
provider keys openai, anthropic, bedrock
cache backend redis · shared
ledger postgres · durable
proxy compatible /v1/chat/completions
ready in 00:11:42 · send a test request
The control plane

Nine controls. One gateway. On by default.

Every proxied call passes through the same policy layer, so control is centralized instead of re-implemented in every service.

pre-call

Budget guardrails

Caps by tenant, route, feature, model, and environment, enforced with atomic reserve-and-settle before the provider call.

dailymonthly50/80/90/100%
free

Exact cache

Identical prompts return instantly, with avoided provider cost attributed to the savings report. Tenant-scoped by default.

redisshared-replica
resilient

Retry & admission

Jittered backoff on retryable errors and per-provider token-bucket admission, so no thundering herds and no surprise 429s.

jittertoken-bucket
fail-closed

PII & secret redaction

Default pack catches email, SSN, cards, IPs, AWS keys, JWTs, tokens, PEM material; redact or block at egress.

9 categoriesstrict-block
audited

Egress allow-list

Prompts can only reach approved destinations. Every block is recorded with reason for security review.

allow-listpolicy-log
queryable

Durable cost ledger

Every call, even blocked ones, recorded with tokens, cost, latency, cache, retry, and error. Export CSV, JSONL, OTLP.

per-tenantOpenTelemetry
breaker

Runaway circuit breaker

Session-velocity limits stop retry storms and near-duplicate loops before they become a $20K weekend invoice.

session-caploop-guard
unit-economics

Cost-per-task attribution

Roll calls up into what one completed task, tenant, or outcome actually costs, not just what you spent.

per-taskper-tenant
recommends

Optimization advisor

Flags prompt bloat, duplicate calls, cheaper-model candidates, and provider reliability, ranked by est. savings.

cache-opsdowngrade
The dashboard

Go look any time. The whole picture is one screen.

One normalized ledger, every view. Filter by feature, route, tenant, model, provider, or environment and every panel updates live across spend, savings, quality, latency, and safety.

RAF
ExecutiveEngineeringCostSafety
live · 9 routes
Spend this week
$41.9K
▲ 6.1% vol
Net after RAF
$29.4K
▼ 29.8%
Cache hit rate
38.4%
▲ 4.2pp
p99 latency
412ms
▼ 18%
Gross vs. net spend · 8 weeksgrossnetsaved
W31W32W33W34W35W36W37W38
Spend by model
gpt-4o · 46% sonnet · 28% haiku · 26%
Top features by spend
support-summarize$14.2K
doc-classify$9.8K
copy-gen$7.1K
enrich-webhook$4.5K
Before & after

Same apps. Same providers. One layer of control in between.

Nothing about your stack changes except what now flows through the middle, and what you finally get to see.

Before: direct callsblind
Your apps & agents
OpenAI ?
Anthropic ?
Bedrock ?
Azure ?
No caps · no cache · no attribution · PII unaudited · invoice arrives a month later.
route
through
After: through RAF−29.8%
Your apps & agents unchanged
RAF gateway
routebudgetcachegraderedactledger
OpenAI
Anthropic
Bedrock
Azure
Every call capped, cached, graded, attributed, redacted, and optimized in realtime.
Time to value

Installed before lunch. Proving its keep by the next morning.

RAF is built for one outcome: a buyer learns where their LLM money goes, and what they can safely save, within a day, not a quarter.

<1hr
to route your first production workflow through RAF, base-URL only.
24h
to your first quality-gated savings: cheaper models, proven on your own traffic.
100%
of calls carry cost, token, latency, cache, retry & status metadata.
Start the pilot

Route one workflow. Watch it get cheaper by tomorrow.

Run RAF in shadow mode for 72 hours and we'll send back an assessment of what it would save, cache, and safely downgrade on your real traffic. Or book a 30-minute design-partner session to scope it together.

Self-hosted · runs in your VPC · no raw prompts stored.