Lineum · Control your LLM spend before it controls you

What RAF is

The economic safety layer for production AI.

RAF is a self-hosted LLM FinOps gateway. It sits in front of your existing model calls with a single base-URL change and no app rewrite, then routes every request, across OpenAI, Anthropic, Google, Bedrock, and open-source models, through one policy layer: budgets enforced before the call, identical prompts served from cache, failures retried, and sensitive data redacted before it ever reaches a provider. Every call, even the ones it blocks, lands in one normalized ledger.

That ledger is the foundation. Most tools stop at "what did we spend?" RAF goes to "what should this have cost?": what one completed task runs, which customers or features are negative-margin, how much of the provider bill you're even seeing, and which routes can move to a cheaper model without anyone noticing. Then it acts on the answer.

01 · Drop-in gateway

Route, don't rewrite

A drop-in proxy with an OpenAI-compatible API, plus an optional Python decorator, pointing at any model you use: OpenAI, Anthropic, Google, Bedrock, or your own open-source deployments. Your apps and providers stay exactly as they are; traffic just flows through RAF.

02 · Control & safety

On by default

Budgets, exact cache, retries, PII and secret redaction, egress allow-lists, and a runaway-spend circuit breaker, centralized at the gateway instead of re-built in every service.

03 · Continuous optimization

Cheaper every week, safely

Quality-gated model downgrades, semantic cache, and prompt slimming run on your traffic history, with a live view and realtime anomaly alerts the moment something regresses.

Quality-gated optimization

Opus answers. Haiku economics.

The biggest line on your bill is the model you reach for by default. RAF finds the cheapest model that produces an equally good answer for each route, and it doesn't guess. It proves it on your own traffic before a single user is affected.

1

Replay your real traffic

RAF re-runs recent requests against smaller candidate models in the background, in shadow mode, never touching production responses.

2

Grade every answer

Built-in graders score each candidate (exact-match, JSON-schema validity, numeric tolerance, policy checks) into a hard quality delta.

3

Promote only what clears the bar

Routes that hold quality get downgraded automatically. The rest hold. RAF tracks before/after, so a regression rolls itself back.

No graders yet? RAF won't promote a downgrade on quality you can't measure; it asks for a signal first. That's the difference between optimization and hand-waving.

Cost × quality, per routeRAF operating point

Answer quality →

Cost per call →

your quality floor · 0.95

haiku

sonnet

gpt-4o

opus

RAF ↓

RAF lands you top-left: frontier-grade answers at a fraction of the cost, by picking the smallest model that still clears your floor, route by route.

The efficiency engine

It doesn't just report savings. It keeps finding new ones.

Six mechanisms run continuously on the one thing only RAF has: the full history of your traffic. Each makes the underlying calls structurally cheaper, and each gets better the longer RAF runs.

adaptive routingcompounds

Machine-learned routing

RAF learns which routes tolerate a smaller model and sends each request to the cheapest one that clears the quality bar, re-checking as your prompts and traffic shift.

71% of routes auto-optimized

cachecompounds

Semantic cache

Beyond exact matches, RAF recognizes near-identical prompts and serves them instantly, quality-gated and tenant-scoped, so reworded duplicates stop costing you twice.

38% served without a provider call

promptscompounds

Prompt optimization

RAF spots bloat (“this route is 38% boilerplate”) and surfaces shorter prompts that hold quality. Recommended, never silently mutated.

avg 31% fewer input tokens

eval

Span replay

Every recorded call becomes a test case. Replay a route against a new model, prompt, or temperature and get a cost / quality / latency diff, and most runs never hit a provider.

replay corpus from live traffic

pre-flight

Shadow mode

Run RAF alongside production for 72 hours, enforcing nothing. It returns an assessment: what it would save, cache, and block, before you change a thing.

zero production risk

unit economics

Cost per outcome

Roll calls up to what one completed task, tenant, or customer actually costs, and catch the account on a $99 plan quietly burning $500 of inference.

margin visible per customer

Value compounds.

When token prices fall, most savings tools lose their pitch. RAF does the opposite: it keeps mining your history for shorter prompts, new cache hits, and cheaper-model candidates. The cheaper tokens get, the more efficiency RAF finds.

Realtime intelligence

Reporting that doesn't wait for Monday.

A prompt change ships at 2pm and triples your cost-per-call. Most teams find out on next month's invoice. RAF watches the ledger as calls land and flags and immediately alerts on the regression minutes after deploy, with the route, the multiplier, and the likely cause already attached.

RAF surfaces spend anomalies, runaway sessions, quality drift, provider error spikes, and budget burn the moment they happen, where your team already works.

And it speaks to finance, too. One click flips the same live data into the CFO summary: gross vs. net, what RAF saved and protected, and the bottom-line number leadership actually forwards.

Want it pushed? RAF still drops that summary in Slack and inbox on a schedule. But it's a render of data that was already live the whole time, not the only time you get to see the truth.

live

Cost / call jumped3.1×

detected4 min after deploy v412

cost / call$0.011 → $0.034

likely causesystem prompt +1,840 tokens

blast radius~$2,600 / wk if unchanged

→ open the diff · pin to budget alert · roll back v412

Five questions, answered the same day

Every company buys tokens. Almost no one can answer these questions.

RAF turns each one from a quarterly fire-drill into a number you can watch move in realtime.

Q1

Where is our LLM spend actually going?

ledger by team, feature, tenant

Q2

What spend can we safely eliminate?

cache + downgrade candidates

Q3

How do we stop surprise bills?

budgets enforced pre-call

Q4

Are we leaking sensitive data into calls?

PII redaction + egress log

Q5

Is answer quality holding as we cut cost?

graders + replay eval

The problem

Buying tokens is one line of code. Governing them is nobody's job.

LLM spend just crossed from an experiment into a material budget line, the exact moment cost control starts getting bought. But teams begin with direct SDK calls scattered across product squads, and that works right up until it doesn't. Then every company hits the same wall at once: the invoice is one number no one can break down, no budget is enforced before a call fires, and nobody can say what a single completed task costs, or that one customer on a $99 plan is quietly burning $500 of inference a month.

01

Spend opacity

No attribution by product, tenant, feature, model, provider, or prompt. The invoice is one number.

unattributed

02

No hard guardrails

Budgets are tracked after the fact, not enforced before the call ever fires.

no cap

03

Repeated calls waste money

Identical prompts get re-sent thousands of times with no shared cache between teams.

duplicate spend

04

Runaway sessions

One bugged agent or retry loop can burn $20K over a weekend. Daily caps don't catch it.

$20K weekend

05

Silent data exposure

Prompts carry PII, secrets, and customer data out to providers with no consistent control.

unaudited egress

06

Provider incidents leak

Retries, rate limits, and failover are hand-rolled differently by every team.

inconsistent

07

Silent cost regressions

A prompt change triples cost-per-call. Nobody notices until the invoice arrives.

3× silent

08

No unit economics

You know the provider bill, but not what one completed task costs, or that a customer on a $99 plan is burning $500 of inference a month.

negative margin

How it works

One base-URL change. Keep your code. Reach every model.

RAF speaks the OpenAI-compatible API, so one base-URL change points your client at it. From there it routes to any frontier or open-source model, and every call inherits budgets, cache, retries, redaction, and a durable cost ledger. No app rewrite.

1

Connect traffic

Swap OPENAI_BASE_URL to your RAF endpoint. Streaming and non-streaming both pass through untouched.

2

See spend

Every call lands in a durable ledger: tokens, cost, latency, cache state, provider, retry, all by tenant and feature.

3

Control spend

Turn on budgets, exact cache, retry/admission, PII redaction and egress allow-lists. Policies apply at the gateway.

4

Optimize continuously

RAF surfaces savings, safety events, and what to optimize next in realtime, with a scheduled summary pushed to inbox and Slack.

bash · your-app/.env

# Before: calling the provider directly
OPENAI_BASE_URL=https://api.openai.com/v1
 
# After: route through RAF. That's the change.
OPENAI_BASE_URL=https://raf.yourco.internal/v1
OPENAI_API_KEY=raf_sk_live_••••••••
 
$ raf doctor
✓ provider keys      openai, anthropic, bedrock
✓ cache backend      redis · shared
✓ ledger             postgres · durable
✓ proxy compatible   /v1/chat/completions
✓ ready in 00:11:42 · send a test request

The control plane

Nine controls. One gateway. On by default.

Every proxied call passes through the same policy layer, so control is centralized instead of re-implemented in every service.

pre-call

Budget guardrails

Caps by tenant, route, feature, model, and environment, enforced with atomic reserve-and-settle before the provider call.

dailymonthly50/80/90/100%

free

Exact cache

Identical prompts return instantly, with avoided provider cost attributed to the savings report. Tenant-scoped by default.

redisshared-replica

resilient

Retry & admission

Jittered backoff on retryable errors and per-provider token-bucket admission, so no thundering herds and no surprise 429s.

jittertoken-bucket

fail-closed

PII & secret redaction

Default pack catches email, SSN, cards, IPs, AWS keys, JWTs, tokens, PEM material; redact or block at egress.

9 categoriesstrict-block

audited

Egress allow-list

Prompts can only reach approved destinations. Every block is recorded with reason for security review.

allow-listpolicy-log

queryable

Durable cost ledger

Every call, even blocked ones, recorded with tokens, cost, latency, cache, retry, and error. Export CSV, JSONL, OTLP.

per-tenantOpenTelemetry

breaker

Runaway circuit breaker

Session-velocity limits stop retry storms and near-duplicate loops before they become a $20K weekend invoice.

session-caploop-guard

unit-economics

Cost-per-task attribution

Roll calls up into what one completed task, tenant, or outcome actually costs, not just what you spent.

per-taskper-tenant

recommends

Optimization advisor

Flags prompt bloat, duplicate calls, cheaper-model candidates, and provider reliability, ranked by est. savings.

cache-opsdowngrade

The dashboard

Go look any time. The whole picture is one screen.

One normalized ledger, every view. Filter by feature, route, tenant, model, provider, or environment and every panel updates live across spend, savings, quality, latency, and safety.

RAF

ExecutiveEngineeringCostSafety

live · 9 routes

Spend this week

$41.9K

▲ 6.1% vol

Net after RAF

$29.4K

▼ 29.8%

Cache hit rate

38.4%

▲ 4.2pp

p99 latency

412ms

▼ 18%

Gross vs. net spend · 8 weeksgrossnetsaved

W31W32W33W34W35W36W37W38

Spend by model

gpt-4o · 46% sonnet · 28% haiku · 26%

Top features by spend

support-summarize$14.2K

doc-classify$9.8K

copy-gen$7.1K

enrich-webhook$4.5K

Before & after

Same apps. Same providers. One layer of control in between.

Nothing about your stack changes except what now flows through the middle, and what you finally get to see.

Before: direct callsblind

Your apps & agents

↓

OpenAI ?

Anthropic ?

Bedrock ?

Azure ?

No caps · no cache · no attribution · PII unaudited · invoice arrives a month later.

route
through→

After: through RAF−29.8%

Your apps & agents unchanged

↓

RAF gateway

routebudgetcachegraderedactledger

↓

OpenAI ✓

Anthropic ✓

Bedrock ✓

Azure ✓

Every call capped, cached, graded, attributed, redacted, and optimized in realtime.

Time to value

Installed before lunch. Proving its keep by the next morning.

RAF is built for one outcome: a buyer learns where their LLM money goes, and what they can safely save, within a day, not a quarter.

<1hr

to route your first production workflow through RAF, base-URL only.

24h

to your first quality-gated savings: cheaper models, proven on your own traffic.

100%

of calls carry cost, token, latency, cache, retry & status metadata.

Start the pilot

Route one workflow. Watch it get cheaper by tomorrow.

Run RAF in shadow mode for 72 hours and we'll send back an assessment of what it would save, cache, and safely downgrade on your real traffic. Or book a 30-minute design-partner session to scope it together.

Book a pilot Self-host quickstart

Self-hosted · runs in your VPC · no raw prompts stored.