RAF sits in front of every model call and actively makes it cheaper by replaying your real traffic against smaller models, grading the answers, and routing each request to the cheapest one that still clears your quality bar. The savings compound every week. Your quality doesn't move.
RAF is a self-hosted LLM FinOps gateway. It sits in front of your existing model calls with a single base-URL change and no app rewrite, then routes every request, across OpenAI, Anthropic, Google, Bedrock, and open-source models, through one policy layer: budgets enforced before the call, identical prompts served from cache, failures retried, and sensitive data redacted before it ever reaches a provider. Every call, even the ones it blocks, lands in one normalized ledger.
That ledger is the foundation. Most tools stop at "what did we spend?" RAF goes to "what should this have cost?": what one completed task runs, which customers or features are negative-margin, how much of the provider bill you're even seeing, and which routes can move to a cheaper model without anyone noticing. Then it acts on the answer.
A drop-in proxy with an OpenAI-compatible API, plus an optional Python decorator, pointing at any model you use: OpenAI, Anthropic, Google, Bedrock, or your own open-source deployments. Your apps and providers stay exactly as they are; traffic just flows through RAF.
Budgets, exact cache, retries, PII and secret redaction, egress allow-lists, and a runaway-spend circuit breaker, centralized at the gateway instead of re-built in every service.
Quality-gated model downgrades, semantic cache, and prompt slimming run on your traffic history, with a live view and realtime anomaly alerts the moment something regresses.
The biggest line on your bill is the model you reach for by default. RAF finds the cheapest model that produces an equally good answer for each route, and it doesn't guess. It proves it on your own traffic before a single user is affected.
RAF re-runs recent requests against smaller candidate models in the background, in shadow mode, never touching production responses.
Built-in graders score each candidate (exact-match, JSON-schema validity, numeric tolerance, policy checks) into a hard quality delta.
Routes that hold quality get downgraded automatically. The rest hold. RAF tracks before/after, so a regression rolls itself back.
No graders yet? RAF won't promote a downgrade on quality you can't measure; it asks for a signal first. That's the difference between optimization and hand-waving.
Six mechanisms run continuously on the one thing only RAF has: the full history of your traffic. Each makes the underlying calls structurally cheaper, and each gets better the longer RAF runs.
RAF learns which routes tolerate a smaller model and sends each request to the cheapest one that clears the quality bar, re-checking as your prompts and traffic shift.
Beyond exact matches, RAF recognizes near-identical prompts and serves them instantly, quality-gated and tenant-scoped, so reworded duplicates stop costing you twice.
RAF spots bloat (“this route is 38% boilerplate”) and surfaces shorter prompts that hold quality. Recommended, never silently mutated.
Every recorded call becomes a test case. Replay a route against a new model, prompt, or temperature and get a cost / quality / latency diff, and most runs never hit a provider.
Run RAF alongside production for 72 hours, enforcing nothing. It returns an assessment: what it would save, cache, and block, before you change a thing.
Roll calls up to what one completed task, tenant, or customer actually costs, and catch the account on a $99 plan quietly burning $500 of inference.
When token prices fall, most savings tools lose their pitch. RAF does the opposite: it keeps mining your history for shorter prompts, new cache hits, and cheaper-model candidates. The cheaper tokens get, the more efficiency RAF finds.
A prompt change ships at 2pm and triples your cost-per-call. Most teams find out on next month's invoice. RAF watches the ledger as calls land and flags and immediately alerts on the regression minutes after deploy, with the route, the multiplier, and the likely cause already attached.
RAF surfaces spend anomalies, runaway sessions, quality drift, provider error spikes, and budget burn the moment they happen, where your team already works.
And it speaks to finance, too. One click flips the same live data into the CFO summary: gross vs. net, what RAF saved and protected, and the bottom-line number leadership actually forwards.
RAF turns each one from a quarterly fire-drill into a number you can watch move in realtime.
Where is our LLM spend actually going?
What spend can we safely eliminate?
How do we stop surprise bills?
Are we leaking sensitive data into calls?
Is answer quality holding as we cut cost?
LLM spend just crossed from an experiment into a material budget line, the exact moment cost control starts getting bought. But teams begin with direct SDK calls scattered across product squads, and that works right up until it doesn't. Then every company hits the same wall at once: the invoice is one number no one can break down, no budget is enforced before a call fires, and nobody can say what a single completed task costs, or that one customer on a $99 plan is quietly burning $500 of inference a month.
No attribution by product, tenant, feature, model, provider, or prompt. The invoice is one number.
unattributedBudgets are tracked after the fact, not enforced before the call ever fires.
no capIdentical prompts get re-sent thousands of times with no shared cache between teams.
duplicate spendOne bugged agent or retry loop can burn $20K over a weekend. Daily caps don't catch it.
$20K weekendPrompts carry PII, secrets, and customer data out to providers with no consistent control.
unaudited egressRetries, rate limits, and failover are hand-rolled differently by every team.
inconsistentA prompt change triples cost-per-call. Nobody notices until the invoice arrives.
3× silentYou know the provider bill, but not what one completed task costs, or that a customer on a $99 plan is burning $500 of inference a month.
negative marginRAF speaks the OpenAI-compatible API, so one base-URL change points your client at it. From there it routes to any frontier or open-source model, and every call inherits budgets, cache, retries, redaction, and a durable cost ledger. No app rewrite.
Swap OPENAI_BASE_URL to your RAF endpoint. Streaming and non-streaming both pass through untouched.
Every call lands in a durable ledger: tokens, cost, latency, cache state, provider, retry, all by tenant and feature.
Turn on budgets, exact cache, retry/admission, PII redaction and egress allow-lists. Policies apply at the gateway.
RAF surfaces savings, safety events, and what to optimize next in realtime, with a scheduled summary pushed to inbox and Slack.
Every proxied call passes through the same policy layer, so control is centralized instead of re-implemented in every service.
Caps by tenant, route, feature, model, and environment, enforced with atomic reserve-and-settle before the provider call.
dailymonthly50/80/90/100%Identical prompts return instantly, with avoided provider cost attributed to the savings report. Tenant-scoped by default.
redisshared-replicaJittered backoff on retryable errors and per-provider token-bucket admission, so no thundering herds and no surprise 429s.
jittertoken-bucketDefault pack catches email, SSN, cards, IPs, AWS keys, JWTs, tokens, PEM material; redact or block at egress.
9 categoriesstrict-blockPrompts can only reach approved destinations. Every block is recorded with reason for security review.
allow-listpolicy-logEvery call, even blocked ones, recorded with tokens, cost, latency, cache, retry, and error. Export CSV, JSONL, OTLP.
per-tenantOpenTelemetrySession-velocity limits stop retry storms and near-duplicate loops before they become a $20K weekend invoice.
session-caploop-guardRoll calls up into what one completed task, tenant, or outcome actually costs, not just what you spent.
per-taskper-tenantFlags prompt bloat, duplicate calls, cheaper-model candidates, and provider reliability, ranked by est. savings.
cache-opsdowngradeOne normalized ledger, every view. Filter by feature, route, tenant, model, provider, or environment and every panel updates live across spend, savings, quality, latency, and safety.
Nothing about your stack changes except what now flows through the middle, and what you finally get to see.
routebudgetcachegraderedactledgerRAF is built for one outcome: a buyer learns where their LLM money goes, and what they can safely save, within a day, not a quarter.
Run RAF in shadow mode for 72 hours and we'll send back an assessment of what it would save, cache, and safely downgrade on your real traffic. Or book a 30-minute design-partner session to scope it together.