AI Cost Calculator BETA 0 tools · 0 agents · 0 RAG

Total / mo (all-in)

Project profile name, team, audience

What this is. Identity for the deployment — project name, owning team or agency, one-paragraph description, and the hosting strategy (managed API, BYOK, self-hosted GPU, on-prem). Drives the report title block and gates which downstream sections appear.

Why it matters. The hosting choice is the single largest structural decision in the report: it switches the entire cost model between per-token-API (linear with traffic), reserved-capacity (committed spend), and capex-amortized (fixed monthly regardless of traffic). Getting it wrong by mis-clicking can move the bill by 5–10× and surface the wrong reservation / self-host fields downstream.

How to interpret the results. Pick the hosting that matches your actual procurement vehicle — not the one you wish you had. Project name and description appear verbatim at the top of the report, so write them as you'd want a reviewer to read them. Sections downstream auto-show / auto-hide based on this choice; if you don't see Reservations, you're on a hosting model that doesn't use them.

Hosting strategy

Pick where the LLM runs. Downstream sections (Reservations, Self-host capacity, On-prem amortization) auto-show only when relevant.

Global parameters MAU, sessions, turns, cache, retries — workload-wide knobs that drive the headline cost

Workload-wide knobs — shared across all agents (per-agent specifics live in the Agent fleet tab). Two sub-panels below: Your audience sizes the traffic (MAU × sessions × questions), Workload parameters shapes the per-query bill (cache hit rate, retry, growth, plus advanced calibration knobs). Open each panel for detailed guidance on its knobs.

Value colors: sizing · saving lever · measured · cost driver (hover for details)
Your audience single

What this is. Audience segmentation — split MAU into groups that use the app differently. Each segment has its own MAU, sessions/day per user, questions/session, and authentication status. Anonymous public segments get a bot-factor multiplier (typically 1.5–2.5×) to account for crawlers, scrapers, and abuse traffic.

Why it matters. Authenticated internal staff might run 3 deep sessions/day with 15 questions each; anonymous public visitors might run 0.05 sessions/day with 2 questions each. Modeling these together as one average user obscures cost behavior — and bot-factor on the public segment routinely accounts for 30–60% of the bill on consumer-facing deployments. Without this breakout, you can't defend the MAU number to procurement.

How to interpret the results. Pull each segment's numbers from analytics (login records for authenticated, web analytics for anonymous). Set bot-factor to 1.0× for internal-only, 1.5× for typical public, 2.5–3× for SEO-targeted public pages. The aggregate "total monthly questions" used in the headline = Σ (MAU × sessions × questions × 30 × bot-factor) across all segments.
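The aggregate formula above can be sketched in a few lines. This is a minimal illustration, not the calculator's engine code: the `Segment` class, its field names, and the MAU figures are assumptions made for the example, while the per-user rates reuse the staff and anonymous-visitor numbers quoted in this panel.

```python
from dataclasses import dataclass

DAYS_PER_MONTH = 30  # the formula above uses a flat 30-day month


@dataclass
class Segment:
    """Hypothetical audience segment; field names are illustrative."""
    name: str
    mau: int
    sessions_per_day: float
    questions_per_session: float
    bot_factor: float = 1.0  # 1.0 = no bot inflation (internal-only)


def monthly_questions(seg: Segment) -> float:
    """MAU x sessions x questions x 30 x bot-factor for one segment."""
    return (seg.mau * seg.sessions_per_day * seg.questions_per_session
            * DAYS_PER_MONTH * seg.bot_factor)


def total_monthly_questions(segments: list[Segment]) -> float:
    """Sum the per-segment volumes into the headline query count."""
    return sum(monthly_questions(s) for s in segments)


segments = [
    Segment("internal staff", mau=200, sessions_per_day=3,
            questions_per_session=15),
    Segment("anonymous public", mau=50_000, sessions_per_day=0.05,
            questions_per_session=2, bot_factor=1.5),
]
total = total_monthly_questions(segments)
for seg in segments:
    share = monthly_questions(seg) / total
    print(f"{seg.name}: {monthly_questions(seg):,.0f} questions/mo ({share:.0%})")
```

With these example inputs, 200 staff users generate more traffic than 50,000 anonymous visitors, which is why the segments are worth modeling separately.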

Single vs Multi. One audience uses the slider trio below. Click + Split into audience types to break MAU into distinct groups (auth power users vs. anonymous public, paid vs. free tier, etc.). Both views drive the same engine math — workload.segments[] is the only source of truth.

Monthly active users: 50
Turns / session: 8
Daily return rate per user: 0.3
Workload parameters

What this is. The per-query shape knobs that sit on top of your audience volume. Cache hit rate = share of input tokens billed at the vendor's discounted cached rate, earned when you re-serve the same system prompt (every +10pp ≈ 8% off the input bill). Retry rate = share of calls that fail and re-run (measured from logs, not set in code). Growth / month = compounding traffic growth used for the 12- and 36-month projection. The advanced knobs (Bot factor, Cache write share, Batch async %, Context compression %, Peak/avg ratio, Language multiplier) cover specialized calibration and only appear in Advanced mode.

Why it matters. These are the second-largest set of cost levers after audience volume. Cache hit rate is THE biggest single lever — moving from 45% → 85% on a $30K/mo deployment saves ~$10K/mo. Retry rate above 5% means real production friction that costs real money. Bot factor on a public-facing segment routinely accounts for 30–60% of the bill (and silently — the multiplier compounds with the audience volume). Growth doesn't change this month's headline but determines what the 36-month TCO looks like in the budget memo.
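The 45% → 85% figure can be checked against the rule of thumb quoted in this panel (every +10pp of cache hit rate ≈ 8% off the input bill). The sketch below is a back-of-envelope check only; the function name is hypothetical, and it assumes the whole $30K/mo is input-token spend, which overstates the saving for deployments with a large output bill.

```python
def cache_saving(monthly_input_bill: float, hit_from: float, hit_to: float,
                 saving_per_10pp: float = 0.08) -> float:
    """Linear rule of thumb: each +10 percentage points of cache hit rate
    takes about `saving_per_10pp` off the input bill."""
    delta_pp = (hit_to - hit_from) * 100
    return monthly_input_bill * saving_per_10pp * delta_pp / 10


# Assumption: the full $30K/mo is treated as input spend.
saving = cache_saving(30_000, hit_from=0.45, hit_to=0.85)
print(f"Estimated saving: ${saving:,.0f}/mo")
```

Under that assumption the rule gives roughly $9,600/mo, consistent with the ~$10K/mo quoted above.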

How to interpret the results. Aim for cache hit ≥70% on stable system prompts (workflow: deduplicate sysprompts, lock them in a versioned artifact, never inline-edit). Pull retry rate from production logs rather than guessing. Set bot factor to 1.0× for internal SSO-only, 1.5× for typical public, 2.5–3× for SEO-targeted public pages. Set growth conservatively (5–15% for known internal tools, 20–50% for public launches). If your post-launch bill diverges from this estimate, the answer is almost always in this panel: recalibrate cache, retry, and bot factor first.

Basic vs Advanced. Basic mode shows only the Growth slider — the other knobs keep their preset defaults (cache 70%, retry 3%, bot 1.5×) so the headline cost is identical to Advanced. Switch to Advanced to tune all 9 knobs and see the calibration sliders.

Bot factor: 1.5×
1.0 = internal · 1.5× = typical public · 2–3× = under crawler load
Cache hit rate: 70%
📊 measured · ~90% discount on hits · OpenAI requires ≥1024-tok prefix
Cache write share: 10%
API hosting only · of cached tokens: % first-write (premium); rest are reads. ~10% steady state, ~50% cold start.
Batch async %: 0%
50% discount
Context compression %: 0%
Net LLM-bill saving from history summarization (after summarization overhead). 30%=Claude-Code-like.
Retry rate: 3%
Growth / month: 20%
Curves the cost-over-time projection — this month's headline isn't affected.
Peak / avg ratio
Self-host sizing only · 1×=use workload default (~4×) · >1× overrides
Language multiplier: 1.0×
EN=1.0, code=1.3, CJK=1.8-2.2
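The Growth / month knob above describes compounding growth that leaves the current month untouched. A minimal sketch of that projection follows, assuming simple month-over-month compounding; the function name and the $10K base bill are illustrative, not taken from the calculator.

```python
def project_monthly_cost(base_cost: float, monthly_growth: float,
                         months: int = 12) -> list[float]:
    """Return the bill for each month; index 0 is this month's headline."""
    return [base_cost * (1 + monthly_growth) ** m for m in range(months)]


# Hypothetical $10K/mo headline at the default 20% monthly growth.
projection = project_monthly_cost(10_000, 0.20, months=12)
print(f"Month 1:  ${projection[0]:,.0f}")
print(f"Month 12: ${projection[-1]:,.0f}")
print(f"12-month total: ${sum(projection):,.0f}")
```

At 20% per month the twelfth month is more than seven times the first, so the growth setting dominates multi-year totals even though it never changes the headline.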
Workload mix — query types ×1.0 out
Default mix across all queries — multipliers: classify ≈ 0.3×, summary ≈ 0.65×, RAG ≈ 0.85×, code 2.8×, longform 3.6×, agentic 4.3×. Same MAU + same model can swing the bill ±20–75% depending on this mix (verified on bundled presets).
↳ Per-agent override: agents with a Task bias set (Section C card → Task bias dropdown) bill against a 60/8/8/8/8/8 mix of their biased type, multiplied into their own output_tokens. Useful for mixed fleets — e.g. a Triage agent (bias=classify, cheap output) + a Responder agent (bias=longform, expensive output) in the same workload.
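The mix arithmetic can be sketched as a weighted average of the multipliers listed above. This is one reading of the text, not the engine's code: it assumes the "60/8/8/8/8/8" override means 60% weight on the biased type and 8% on each of the other five, and the function names are hypothetical.

```python
# Output multipliers quoted above for each query type.
MULTIPLIERS = {
    "classify": 0.3,
    "summary": 0.65,
    "rag": 0.85,
    "code": 2.8,
    "longform": 3.6,
    "agentic": 4.3,
}


def mix_multiplier(weights: dict[str, float]) -> float:
    """Weighted-average output multiplier for a mix whose weights sum to 1."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("mix weights must sum to 1.0")
    return sum(MULTIPLIERS[kind] * w for kind, w in weights.items())


def biased_mix(bias: str) -> dict[str, float]:
    """Assumed 60/8/8/8/8/8 split: 60% on the biased type, 8% on each other."""
    if bias not in MULTIPLIERS:
        raise ValueError(f"unknown task bias: {bias}")
    return {kind: (0.60 if kind == bias else 0.08) for kind in MULTIPLIERS}


for bias in ("classify", "longform"):
    print(f"bias={bias}: x{mix_multiplier(biased_mix(bias)):.3f} on output tokens")
```

Under this reading a classify-biased Triage agent lands near ×1.16 and a longform-biased Responder near ×2.87, so the same fleet can carry very different output costs per agent.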
Tools registry & rates workload-wide tool catalog · return-shape default · provider rate hints

What this is. The catalog of tools your agents can call — web search, vector search, code execution, database lookups, image generators, STT/TTS, custom MCP servers. Each entry defines the tool's per-call fee, the typical schema size injected into the prompt, the result-token shape it returns, and which provider bills it (managed-API, BYOK, or self-hosted).

Why it matters. Tool fees are billed independently of LLM tokens. A vector-search agent at 5K queries/mo using Exa at $0.005/call adds $25/mo flat; an image-gen tool at $0.04/image on 50K calls/mo is $2K — bigger than the LLM bill on some workloads. Tools also pump tokens back into the agent's input (the result), which the engine bills at full input-token rate.
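The two cost channels described above (a flat per-call fee plus result tokens billed as input) can be separated in a short sketch. The function name, the 800-token result size, and the $3 per million input-token rate are assumptions for illustration; the call volumes and fees are the ones quoted above.

```python
def tool_monthly_cost(calls_per_month: int, fee_per_call: float,
                      result_tokens: int = 0,
                      input_rate_per_mtok: float = 0.0) -> tuple[float, float]:
    """Return (tool fees, token cost of results fed back as LLM input)."""
    fees = calls_per_month * fee_per_call
    token_cost = calls_per_month * result_tokens * input_rate_per_mtok / 1_000_000
    return fees, token_cost


# Search tool: 5K calls/mo at $0.005/call; result size and rate are hypothetical.
search_fees, search_tokens = tool_monthly_cost(
    5_000, 0.005, result_tokens=800, input_rate_per_mtok=3.0)

# Image generation: 50K calls/mo at $0.04/image, no tokens returned.
image_fees, image_tokens = tool_monthly_cost(50_000, 0.04)

print(f"search: ${search_fees:,.2f} fees + ${search_tokens:,.2f} result tokens")
print(f"image:  ${image_fees:,.2f} fees + ${image_tokens:,.2f} result tokens")
```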

How to interpret the results. Add only the tools your fleet actually calls; each one is wired up in Section C per-agent (via enabled_tools). The "fees + tokens" badge in the cost ledger (Section E) breaks out tool-fee impact separately from LLM-token impact, so you can see which line is hurting. If you're seeing surprise costs, this is the second place to check after Section A.

Tool return shape (default — per-tool RETURN SHAPE in the Tools registry overrides this)
Provider rate card
Provider tool fees/sess: $0 · Per-agent web search / file search / container counts live inside each agent card; rates are looked up by the agent's provider+family (Anthropic web $10/1k · OpenAI web $10/1k + fs $2.50/1k + container $0.03/sess · Bedrock/Azure $0 pass-through).
Embeddings model: text-embedding-3-small @ $0.02/1M input tokens.
Tools registry MCP-STYLE
Workload-level catalog of available tools. Each agent card in Section C declares which tools it enables and at what frequency (per-(agent,tool) trigger_rate). Add custom MCP tools below or by editing this workload's JSON.
Agent fleet per-agent role, model, prompt, RAG, tools, guardrails — 0 tools · 0 agents · 0 RAG

What this is. The per-agent editor. Each card configures one agent in the fleet — its model, system prompt size, RAG settings, which tools it enables, guardrail tokens, task bias, and activation rate. Topology (Single / Fleet / Workflow) sits above; it decides whether agents fan-out in parallel, chain sequentially, or run alone. The reference-topology diagram (collapsible) shows what your choices look like.

Why it matters. Most real cost-engineering happens here. Doubling agents roughly doubles per-query cost (each runs full sysprompt + RAG + reasoning), so 1→3 on a $30K/mo deployment ≈ +$60K/mo. The model choice per agent compounds — putting Opus-4.7 on a Triage agent that only needs gpt-5.4-mini wastes ~5× per call. activation_rate shaves cost on conditional agents (e.g. an Image-Enhancer that only fires on 30% of requests = 0.3× monthly contribution).
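The effect of activation_rate on a fleet's per-query cost can be illustrated as follows. This is a simplified sketch: the `Agent` class and the per-call dollar figures are hypothetical, and only the 30% Image-Enhancer activation rate comes from the text above.

```python
from dataclasses import dataclass


@dataclass
class Agent:
    """Hypothetical agent record; cost_per_call is an illustrative figure."""
    name: str
    cost_per_call: float          # dollars per invocation
    activation_rate: float = 1.0  # fraction of queries on which the agent fires


def per_query_cost(agents: list[Agent]) -> float:
    """Expected cost of one query: each agent weighted by how often it runs."""
    return sum(a.cost_per_call * a.activation_rate for a in agents)


fleet = [
    Agent("Triage", cost_per_call=0.002),
    Agent("Drafter", cost_per_call=0.030),
    Agent("Image-Enhancer", cost_per_call=0.020, activation_rate=0.3),
]
print(f"Expected cost per query: ${per_query_cost(fleet):.4f}")
```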

How to interpret the results. Each agent card shows its monthly contribution ($/mo · % of fleet) in the header, with a "Compare models" expander that runs the engine against every model so you see the swap delta before committing. Use the topology cards to enforce structure; use Task bias to match each agent to its workload character (Triage→classify, Drafter→longform, etc.).

Topology Multi-Agent Fleet
Agents in fleet: 3
Parallel reasoning agents — count syncs with the cards below
Comm pattern: orch
orch=0 · sup=+150·(N−1) · peer=+300·(N−1) · pipe=Σ prior outputs · only matters when N≥2
Reference topology — fleet coordination shape ▸ show diagram
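The communication-pattern hint above reads as a small lookup of coordination overhead. The sketch below assumes the figures are extra tokens per query and that "pipe" forwards the output tokens of every earlier stage; neither unit is stated on the page, and the function name is hypothetical.

```python
from typing import Sequence


def comm_overhead_tokens(pattern: str, n_agents: int,
                         prior_outputs: Sequence[int] = ()) -> int:
    """Extra coordination tokens (assumed unit) for a fleet of n_agents."""
    if n_agents < 2:
        return 0  # the overhead only matters when N >= 2
    if pattern == "orch":
        return 0
    if pattern == "sup":
        return 150 * (n_agents - 1)
    if pattern == "peer":
        return 300 * (n_agents - 1)
    if pattern == "pipe":
        # Assumed reading: sum of output tokens produced by earlier stages.
        return sum(prior_outputs)
    raise ValueError(f"unknown comm pattern: {pattern}")


for pattern in ("orch", "sup", "peer"):
    print(pattern, comm_overhead_tokens(pattern, n_agents=3))
print("pipe", comm_overhead_tokens("pipe", 3, prior_outputs=[400, 250]))
```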
Fleet actions
Click + Add agent to grow the fleet, the rename icon on an agent header to rename it, and the × to remove it. The global Agents slider stays in sync with the count. Use ↧ Apply to all agents inside any expanded card to broadcast that agent's TOOLS / RAG / Reasoning / Guardrails settings to the fleet.
What changes here
Fact-checking pipeline verifier preset · NLI hosting · atomizer/reviser token budgets — pick a preset or stay at the simulator default

What this is. The post-response verification pipeline that checks generated claims against a knowledge source before returning. Configures the verifier preset (MiniCheck, FactScore, AlignScore, FR2, etc.), NLI model hosting (self-hosted vs. managed), per-claim atomizer/reviser token budgets, and a cascade rule for escalating low-confidence claims to a second more-expensive verifier.

Why it matters. Skipping fact-checking is the right call for low-stakes chatbots and a compliance violation for legal/medical/financial. When you do need it, the verifier choice can swing fact-check cost by 20×: a self-hosted MiniCheck NLI run on every claim is ~$0.0001/claim; an API-hosted FR2 with reasoning is ~$0.002/claim. Cascade verification keeps the average cheap (90% of claims handled by the fast verifier) while still catching the hard cases.
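The cascade economics can be checked with the per-claim figures above. The sketch assumes every claim is first scored by the fast verifier and that 10% are escalated and pay for the expensive verifier as well; the escalation accounting and the function name are assumptions, not the calculator's documented behavior.

```python
def cascade_cost_per_claim(fast_cost: float, slow_cost: float,
                           escalation_rate: float) -> float:
    """Average cost when all claims hit the fast verifier and a fraction
    are escalated to the slow one on top of that."""
    if not 0.0 <= escalation_rate <= 1.0:
        raise ValueError("escalation_rate must be between 0 and 1")
    return fast_cost + escalation_rate * slow_cost


FAST = 0.0001  # self-hosted MiniCheck, $/claim (figure quoted above)
SLOW = 0.002   # API-hosted FR2 with reasoning, $/claim (figure quoted above)

blended = cascade_cost_per_claim(FAST, SLOW, escalation_rate=0.10)
print(f"cascade: ${blended:.4f}/claim vs ${SLOW:.4f}/claim for FR2 on everything")
```

Under these assumptions the cascade averages about $0.0003 per claim, roughly 15% of running the expensive verifier on every claim.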

How to interpret the results. Start by picking a preset that matches your risk profile — MiniCheck for general grounding, FR2 for high-stakes, FactScore if you publish citation rates. Set coverage to the fraction of responses that need verification (1.0 = all, 0.2 = 20%). Cascade settings show up as the ⇉ icon in the Section C architecture diagram; the cost flows into the headline as a separate "verify" line.

Adds an automated factuality check on a sample of answers — catches hallucinations at the cost of extra tokens per verified query. Skip this section if you don't run a verifier.
Advanced — token budgets per call
Cost breakdown per-turn token ledger · cumulative composition · cost-per-session distribution

What this is. Three views on the headline number, each from a different angle. Per-turn token ledger — line-by-line accounting of where one canonical query's tokens go (sysprompt + history + RAG + tool results + output + reasoning). Cumulative token breakdown — same components rolled across the whole session as a stacked bar so the dominant cost driver pops visually. Lognormal CI distribution — the cost-per-session probability curve (p50 / p90 / p99) showing how wide your variance is on real traffic, not just the central estimate.

Why it matters. Headline cost is an opinion until you can see what's driving it. The ledger answers "which knob to tune": if RAG chunks consume 60% of input, the lever is retrieval depth (Section A → RAG chunks); if reasoning tokens are 40% of output, the lever is thinking mode (per-agent in Section C). The CI distribution answers "how confident is the number": a wide p50→p99 spread means a single bad day can blow past the central estimate, so you size budget envelopes against the tail, not the median.

How to interpret the results. Ledger: look for the biggest bar — that's your priority lever. Compare "uncached input" vs "cached input" — if uncached is >50%, prompt-cache isn't earning its keep (system prompt instability, dynamic preambles). Cumulative chart: confirms the ledger findings at session scale. CI distribution: bring the p90 to the procurement memo as the budget defense — "central estimate $X, p90 worst-case $Y, $/session standard deviation Z%."
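The p50 / p90 / p99 readout can be reproduced for any lognormal cost model with the standard library. The sketch assumes the distribution is parameterized by its median and a log-space standard deviation `sigma`; the page does not say how the calculator parameterizes it, and the example values are hypothetical.

```python
from math import exp
from statistics import NormalDist


def lognormal_percentiles(median: float, sigma: float,
                          quantiles: tuple[float, ...] = (0.50, 0.90, 0.99)
                          ) -> dict[float, float]:
    """Percentiles of a lognormal with the given median and log-space sigma."""
    standard_normal = NormalDist()
    return {q: median * exp(sigma * standard_normal.inv_cdf(q)) for q in quantiles}


# Hypothetical session: $0.12 median cost, sigma = 0.6 in log space.
pct = lognormal_percentiles(median=0.12, sigma=0.6)
for q, cost in pct.items():
    print(f"p{round(q * 100)}: ${cost:.3f}/session")
print(f"p99 / p50 spread: {pct[0.99] / pct[0.50]:.1f}x")
```

A larger `sigma` widens the p50 to p99 spread, which is the quantity to watch when sizing a budget envelope against the tail.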

Per-Turn Token Ledger turn 0
Token composition (horizontal stack)
Cumulative Token Breakdown
Token breakdown chart.
Sys prompt: 0 (amortised)
History ctx: 0 (conversation)
RAG chunks: 0 (retrieval)
Thinking: 0 (extended CoT)
Guardrails: 0 (safety scan)
Tools: 0 (schema+results)
Lognormal CI Distribution — Cost Per Session
CI distribution chart.
Comparisons model-by-model cost table and 12-month projection

What this is. A side-by-side cost table running your current workload against every model in the rate card — Claude Opus/Sonnet/Haiku, GPT-5.x family, Gemini 3, Llama 3.3, etc. Each row shows per-query cost, monthly bill, and percentage delta vs. your selected model. The 12-month projection chart shows the same models curved over time at your configured growth rate.

Why it matters. Model choice is usually the second-biggest decision after agent count. The same workload run on Opus-4.7 vs Haiku-4.5 can differ by 8×; gpt-5.4 vs gpt-5.4-mini by 5×. But naïve "always pick the cheapest" ignores quality differences — this table lets you see the cost gap so you can decide if the quality lift is worth it.

How to interpret the results. The "whole-fleet uniform" badge is a reminder: this table assumes every agent runs the same model. Real fleets mix (Triage on Mini, Drafter on Opus) — use Section C's per-agent "Compare models" expander for the realistic per-agent swap math. Treat this table as a "ceiling vs floor" reference, then mix and match in Section C.

Model Comparison whole-fleet uniform
Each row assumes every agent in the fleet uses that one model — a uniform-fleet sanity check, useful for procurement docs. For realistic mixed-fleet swaps where you change one agent at a time and see the headline update, use the Compare models for this agent expander inside each agent card (section C · Agent fleet). Values are per-session ($/sess); multiply by sessions/month for the run-rate.
Model · API total · RAG share · Reason share · Guard+Tool tok · Tool fees+ · Cache saved (info) · Retry+ · p50 · p90 · p99
Monthly Projection — 12 months
Projection chart.
Per-Component Cost Bars
Full Itemised Breakdown
Per-Agent Cost Contribution heterogeneous
Monthly projection exceeds budget. Tune cache rate, enable batch processing, or reduce RAG chunks.
Stress test ±50% sensitivity, tornado chart, what-if scenarios

What this is. A robustness check on your headline number. The sensitivity table flips each input ±50% in isolation and reports the resulting % change in monthly cost; the tornado chart shows the same ranked by impact. The What-If cards flip one knob (or a small bundle) to a named scenario value and report the delta — "what if cache hit rate hit 80%", "what if RAG chunks doubled", etc.

Why it matters. A budget that's right at the central estimate but blows up at +20% on the most sensitive input is not a budget — it's an optimistic guess. Sensitivity tells procurement reviewers which inputs your estimate is fragile to, so they know where to push back on assumptions. The What-If cards give a quick "lever pull" view for the most common cost-reduction or cost-increase moves.

How to interpret the results. Top bars in the tornado = the inputs that matter most — these are the ones you should validate hardest before signing. What-If cards reading 0.0% means that knob doesn't apply to your current preset (RAG off, guardrails already at zero, etc.). Use this section to write the "Sensitivity" paragraph in your procurement memo.
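The one-at-a-time method behind the sensitivity table and tornado ranking can be sketched generically. The cost function here is a deliberately tiny toy model, not the calculator's engine, and all names and input values are illustrative.

```python
from typing import Callable

Params = dict[str, float]


def one_at_a_time(cost_fn: Callable[[Params], float], params: Params,
                  swing: float = 0.5) -> list[tuple[str, float, float]]:
    """Flip each input by +/- swing in isolation; return
    (name, % change at low, % change at high) ranked by largest impact."""
    base = cost_fn(params)
    rows = []
    for name, value in params.items():
        low = cost_fn({**params, name: value * (1 - swing)})
        high = cost_fn({**params, name: value * (1 + swing)})
        rows.append((name, 100 * (low - base) / base, 100 * (high - base) / base))
    rows.sort(key=lambda r: max(abs(r[1]), abs(r[2])), reverse=True)
    return rows


def toy_monthly_cost(p: Params) -> float:
    """Toy model: volume x unit cost, inflated by retried calls."""
    return p["queries"] * p["cost_per_query"] * (1 + p["retry_rate"])


rows = one_at_a_time(
    toy_monthly_cost,
    {"queries": 100_000, "cost_per_query": 0.01, "retry_rate": 0.03},
)
for name, low, high in rows:
    print(f"{name:>15}: {low:+.1f}% / {high:+.1f}%")
```

The sorted order is the tornado chart from top to bottom: inputs that scale the bill linearly sit at the top, while a small retry rate barely moves it.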

Sensitivity — ±50% Parameter Impact
Tornado Chart
Tornado chart.
What-If Scenarios

Each card flips one knob (or a small bundle) and shows how much your $/session changes vs. the current setup. 0.0% means the knob doesn't apply — e.g. "Double RAG chunks" reads 0% when RAG is off in your preset, "Add full guardrails" reads 0% when guardrails are already at the proposed values. Hover any card for what it tests and why it matters.

Federal compliance & hosting FedRAMP, ATO, egress, audit retention

What this is. The federal compliance overhead layer — FedRAMP/GovCloud premium multiplier, multi-region redundancy, plus additive line items for ATO certification, egress fees, audit log retention, and vector-DB hosting. Set to none / single / 0 if you're modeling a commercial deployment.

Why it matters. Federal AI typically costs 30–80% more than the commercial equivalent at the same traffic — GovCloud carries a 15–25% compute premium, FedRAMP-Moderate adds an annual ATO line ($150K–$400K amortized monthly), and audit retention can add $200–$800/mo. Ignoring these makes the proposal look unrealistic to procurement reviewers; including them lets you set defensible budget envelopes.

How to interpret the results. Pick the FedRAMP tier that matches your authorizing memo (Low / Moderate / High). Multi-region only if you have a hot DR requirement. The ATO amortization assumes a 36-month cycle by default — adjust if your re-cert window differs. Costs from this section flow into the headline as additive monthly lines, broken out separately in the Cost breakdown so reviewers can audit them.

Hosting multipliers

Multipliers stack — e.g. FedRAMP High + active-active = ×2.60 on LLM and GPU costs.

Additive monthly costs

Not modeled: ATO assessment labor (use Agent engineering or Personnel for that), sole-source procurement overhead, GSA Schedule discounts, model availability constraints (some Anthropic/OpenAI models aren't in GovCloud). Add fixed-cost approximations as line items in Infrastructure below.

Your typical question token sizes for one canonical query

What this is. Token sizes for one canonical query in your deployment — system prompt, user input, retrieved context (RAG chunks + tool results), and the assistant's reply. A "token" is roughly ¾ of an English word (1K tokens ≈ 750 words). Don't know your numbers? Use the plain-English wizard below — it maps "I write 2-paragraph answers with 5 search results attached" to actual token counts.

Why it matters. Per-query token size is one of the two multipliers in your monthly bill (the other is query volume in Your users). A 4K-token query at 1M queries/mo costs ~4× a 1K-token query at the same volume on the same model. Mis-sizing this is the most common reason for a budget overrun in the first three months of production.

How to interpret the results. Measure these from a sample of real traffic when you can — token counts from tiktoken or the provider's own tokenizer are authoritative; guesses are not. The "Your typical question" totals roll up into the per-turn ledger in Cost breakdown (Section E). If your post-launch bill diverges from this estimate, this section is the first place to recalibrate.

Estimate tokens in 30 seconds → No engineering required

Question types full-pipeline, lookup-only, refusal, …

What this is. Question-type shapes — multipliers on top of the canonical query above. Full = baseline pipeline (1.0×); RAG = retrieval-only (~0.3×); refusal = out-of-scope reject (~0.05×); heavy = long-context multi-turn (~1.2×). The next section (Traffic mix presets) decides how queries split across these shapes.

Why it matters. Real fleets don't run one shape — a chatbot serves a mix of substantive questions, RAG lookups, polite refusals, and the occasional heavy follow-up. Modeling everything as the most expensive shape over-quotes by 2–4×; modeling everything as cheap lookups under-quotes by the same factor. The shape weights are how you reconcile.

How to interpret the results. Set factors to scale the canonical query — 0.5 means that shape uses half the tokens, 2.0 means double. If your fleet doesn't have refusals at all, leave the weight at 0 in the next section. The mix weights × shape factors × per-query cost = your effective per-query bill in the headline.
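The product described above (mix weights × shape factors × per-query cost) can be written out directly. The shape factors are the ones quoted in this section; the mix weights, the $0.02 canonical cost, and the function name are hypothetical.

```python
# Shape factors quoted above, relative to the canonical query.
SHAPE_FACTORS = {"full": 1.0, "rag": 0.3, "refusal": 0.05, "heavy": 1.2}


def effective_cost_per_query(canonical_cost: float,
                             mix: dict[str, float]) -> float:
    """Blend the canonical query cost across question shapes."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("mix weights must sum to 1.0")
    return canonical_cost * sum(w * SHAPE_FACTORS[shape] for shape, w in mix.items())


CANONICAL = 0.02  # hypothetical $ per full-pipeline query

worst_case = {"full": 1.0}
mixed = {"full": 0.5, "rag": 0.3, "refusal": 0.1, "heavy": 0.1}  # illustrative weights

print(f"worst case: ${effective_cost_per_query(CANONICAL, worst_case):.4f}/query")
print(f"mixed:      ${effective_cost_per_query(CANONICAL, mixed):.4f}/query")
```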

+ Add another question type

Multi-agent pipeline optional — if your system uses more than one LLM call per query

What this is. The procurement-side multi-agent pipeline editor — each agent specifies its own model, calls-per-query, input/output tokens, sysprompt size, RAG flag, and task bias. The simulator-side per-agent editor (Section C above) is the richer counterpart with topology cards, tool-enablement, and the live architecture diagram; this section is the JSON shape that flows into the procurement report.

Why it matters. Real production fleets are almost never single-call — Planner → Retriever → Drafter → Reviewer chains, parallel specialist fans, ReAct loops. Modeling these as "one big call" understates cost by 2–6× because each agent has its own sysprompt amortization, cache behavior, and output budget. When this section is populated, the engine sums per-agent cost and ignores "Your typical question" above.

How to interpret the results. Use the Common pipeline templates below to start (single-call, planner+executor, RAG, multi-agent verifier, etc.) then tune per-agent. The "Multi-agent mode is active" banner appears once you add an agent — confirming the engine has switched to agent-sum billing. If your deployment is actually single-call, leave this section empty.

+ Add an agent
Common pipeline templates

Traffic mix presets how questions split across types

What this is. A preset selector for how your traffic splits across the question shapes defined above. "Worst case" treats every query as full-pipeline (the expensive shape); "mixed" is a realistic production blend; "lookup-heavy" weights toward RAG; "refusal-heavy" toward the cheap reject path. Weights inside each preset sum to 1.0.

Why it matters. The mix preset is the bridge between the per-shape token sizes (above) and the headline bill. Same fleet on "worst case" can quote 2–4× higher than on "mixed" — picking the wrong preset is one of the easiest ways to wildly over- or under-quote. Procurement reviewers will (and should) ask which preset you used.

How to interpret the results. If you have real traffic data, write a custom preset that matches the empirical distribution; if you don't yet, "mixed" is the right starting point with "worst case" as a defensive ceiling. The selected preset weights flow into the per-query cost calculation in Cost breakdown.

+ Add another mix preset

API reservations / committed-spend discount or PTU vs on-demand

What this is. Committed-spend reservations on managed-API hosting — Azure PTU (Provisioned Throughput Units), AWS Bedrock Provisioned, OpenAI Enterprise commit, Anthropic on-demand commit. You pay a fixed monthly fee in exchange for a discount vs. pure on-demand billing. The engine computes effective $/query under the reservation and compares against on-demand.

Why it matters. Reservations typically yield 30–50% savings once on-demand spend exceeds ~$10K/mo, breaking even somewhere between $5K–$8K/mo depending on the provider. Below that threshold the commitment locks you into paying for unused capacity; above, it's free money. Procurement decisions on year-2 reservations swing the 3-year TCO by 20–30%.

How to interpret the results. Only appears when hosting = managed API. Pick the provider that matches your contract vehicle; set units to your committed quantity. The engine shows the break-even point — if your projected monthly spend is below that, on-demand wins; above, the reservation. Mix in the Migration timeline section to model "on-demand year 1, reserved year 2-3" patterns.
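The break-even logic can be illustrated with a simplified commitment model. It assumes a fixed monthly commit that buys usage at a flat discount, with any overflow billed on-demand; real provider terms (PTU sizing, minimum terms, overage rates) differ, and the $6K commit, 40% discount, and function names are hypothetical.

```python
def reserved_monthly_cost(on_demand_spend: float, commit: float,
                          discount: float) -> float:
    """Bill under a reservation: the commit is always paid, and usage beyond
    what the commit covers is charged at on-demand rates."""
    covered_usage = commit / (1 - discount)  # on-demand-equivalent usage covered
    overflow = max(0.0, on_demand_spend - covered_usage)
    return commit + overflow


def cheaper_option(on_demand_spend: float, commit: float, discount: float) -> str:
    reserved = reserved_monthly_cost(on_demand_spend, commit, discount)
    return "reserved" if reserved < on_demand_spend else "on-demand"


COMMIT, DISCOUNT = 6_000, 0.40  # hypothetical contract terms
for spend in (4_000, 6_000, 10_000, 12_000):
    reserved = reserved_monthly_cost(spend, COMMIT, DISCOUNT)
    print(f"on-demand ${spend:>6,} vs reserved ${reserved:>8,.0f} "
          f"-> {cheaper_option(spend, COMMIT, DISCOUNT)}")
```

In this model the break-even sits exactly at the commit amount: below it you pay for unused capacity, above it the discount starts to show.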

Edit the rates / discounts in the Prices tab → API reservations table. Effective monthly cost shows in the report's "Reservation savings" row.

Embeddings (RAG) ingest cost + per-query embedding

What this is. The dedicated embedding-token billing for RAG systems. Two cost lines: ingest (embedding the entire corpus into vectors, amortized over the re-embed cycle) and query-time (embedding each user question to run vector search). Independent of the LLM token bill — uses the chosen embedding model's separate rate card (text-embedding-3-small, voyage-3-large, etc.).

Why it matters. Often a forgotten line. For a 100M-token corpus + 1M queries/mo on text-embedding-3-small, this runs $50–$500/mo depending on re-embed frequency. Easy to miss until the first invoice. For very large corpora (500M+ tokens) on premium embedding models, the bill can rival the LLM line.

How to interpret the results. Skip entirely if you're not using RAG. Otherwise, set the corpus token count, the re-embed cadence (quarterly = 4×/yr, monthly = 12×/yr), and the embedding model. The cost shows up as a flat monthly line — predictable, but rises with corpus growth.

Migration timeline phased deployment over 36 months

What this is. Phased deployment plan over the procurement window — typically 36 months for federal, 24 for commercial. Each phase has its own hosting strategy, reservation, traffic level, and infrastructure cost. The engine computes a per-phase monthly bill plus the total multi-year TCO, with a cost-over-time chart showing where phases transition.

Why it matters. Real procurement isn't "pick one hosting, run it for 3 years" — it's "API year 1 to prove value, reserved year 2 to lock in discount, self-host year 3 if scale justifies the capex". Modeling the full curve lets you defend the multi-year ask in a single chart and identify break-even points where strategy shifts pay back.

How to interpret the results. Default 1 phase (matches the current single-strategy headline). Add phases when modeling a transition; each phase's overrides only affect that phase's monthly bill. The cumulative 3-year cost (bottom of chart) is the number that ends up in the procurement memo's bottom line.

+ Add a phase

Personnel / staffing FTE allocations × loaded annual salary

What this is. Ongoing post-launch staffing — the people who keep the deployment running after go-live. Each role: FTE allocation × annual base × total-comp multiplier (1.30 = +30% benefits/overhead) ÷ 12. Set FTE to 0 to skip a role. Salaries are editable in the Prices tab → Personnel. Upfront design effort (pre-launch) lives in Agent engineering below.

Why it matters. Federal RFPs require fully-loaded labor in the cost basis — leaving it out makes the proposal look amateur or non-compliant. Personnel is often 40–70% of a federal AI deployment's TCO; ignoring it under-quotes by 2–3×. Procurement reviewers will compare your loaded rates against the GSA schedule, so use realistic numbers.

How to interpret the results. Start with the roles that always exist (Product Owner, MLOps Engineer, on-call SRE), add specialists as the deployment grows (Prompt Engineer for high-volume tuning, Compliance Officer for federal). FTE = 0.25 means quarter-time. The monthly total flows into the headline as a separate "Personnel" line, broken out in the report.
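The per-role formula in this section (FTE allocation × annual base × total-comp multiplier ÷ 12) is simple enough to check by hand. In the sketch below the role list and salary figures are hypothetical; only the formula and the 1.30 multiplier come from the text.

```python
def monthly_role_cost(fte: float, annual_base: float,
                      comp_multiplier: float = 1.30) -> float:
    """Loaded monthly cost of one role."""
    return fte * annual_base * comp_multiplier / 12


# Hypothetical staffing plan: (role, FTE, annual base salary).
roles = [
    ("Product Owner", 0.25, 120_000),
    ("MLOps Engineer", 0.50, 160_000),
    ("Compliance Officer", 0.00, 140_000),  # FTE 0 skips the role
]

personnel_total = sum(monthly_role_cost(fte, base) for _, fte, base in roles)
for name, fte, base in roles:
    print(f"{name}: ${monthly_role_cost(fte, base):,.2f}/mo")
print(f"Personnel line: ${personnel_total:,.2f}/mo")
```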

+ Add a role

Agent engineering upfront design effort + maintenance — amortized monthly

What this is. The upfront engineering effort to design, build, and ship the agent system, plus ongoing maintenance. Roles + FTE during the design phase (SME interviews, spec writing, eval criteria, prototype, calibration runs), amortized over the operational lifespan, plus a maintenance cadence for periodic re-specification.

Why it matters. A pure-token cost estimate underestimates real procurement spend by 30–50% because it ignores the team building the thing. Procurement reviewers expect this line; leaving it out makes the proposal look amateur. Federal RFPs in particular require pre-deployment design effort as a separately-itemized cost basis line.

How to interpret the results. Enable the section, add roles (Agent Design Lead, MLOps Engineer, Prompt Engineer, etc.), set FTE allocation during the design phase. The calculator amortizes the upfront cost over the project's useful life (default 36 months) and adds recurring maintenance hours. Output: a monthly $ line that flows into the headline alongside Personnel and Operations.

Roles × FTE during design phase

Each row uses the same loaded-salary math as Personnel. Edit role salaries in Prices → Personnel.

+ Add a design-phase role

Maintenance — re-engineering cadence

Agent engineering summary


Traffic safeguards bot rate limiting

What this is. Bot rate-limiting on anonymous traffic — sets requests-per-IP-per-minute thresholds that protect against runaway cost from crawlers, scrapers, and abuse. Independent of authentication (logged-in users typically aren't rate-limited; anonymous public visitors are).

Why it matters. Without rate-limiting, a single misconfigured search bot can generate 10K+ queries/hour and rack up thousands in a day. Public AI deployments have hit five-figure surprise bills from this exact pattern. Rate-limiting is cheap insurance; the right threshold is "above any legitimate user pattern, below crawler aggression".

How to interpret the results. Set the per-IP-per-minute cap to ~5–10× a power-user's burst rate (so legit users never hit it). The bot-factor multiplier in the Your-users section is the counterpart: it's how you size the bill before rate-limiting clips the abuse tail. Rate-limiting changes the worst-case-cost ceiling; bot-factor sizes the expected cost.

Bot rate limiting

Public endpoints get hammered by bots. Rate-limiting strategy ranges from cheap edge filters to full WAF + CAPTCHA. The bot ceiling caps the "bot factor" that scales anonymous traffic (set in Project profile).

Self-host capacity only matters if you self-host

What this is. The GPU capacity-planning section for self-hosted deployments. Catalog of GPU specs (H100, A100, MI300X, etc.), per-card throughput, replica count, utilization assumption, and cloud-vs-bare-metal pricing. The currently-selected GPU is marked with a ● dot. Only appears when hosting = self-host.

Why it matters. Self-host TCO is a different shape from API: high fixed monthly (GPU rent), low marginal-per-query. Break-even vs. API typically sits at $20K–$50K/mo of equivalent on-demand spend, depending on model size. Mis-sizing GPU count by 20% changes the bill by 20%, since unused capacity still bills. Utilization assumption is the biggest unknown; most production fleets average around 30%.

How to interpret the results. Pick the GPU that matches your model: H100 for 70B+ class, A100 for 13B–34B, L40S for under 13B. Set replicas to peak QPS ÷ per-GPU throughput, plus a 30–50% safety margin. The headline "$/query effective" tells you when self-host beats API; cross-reference with the Migration timeline to model "API now, self-host at scale" transitions.
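The replica sizing rule above (peak QPS ÷ per-GPU throughput, plus a safety margin) and the resulting effective $/query can be sketched as follows. The throughput, hourly rate, query volume, 730-hour month, and function names are illustrative assumptions.

```python
import math

HOURS_PER_MONTH = 730  # assumed average month for always-on capacity


def replicas_needed(peak_qps: float, per_gpu_qps: float,
                    safety_margin: float = 0.30) -> int:
    """Peak QPS over per-GPU throughput, padded by the margin, rounded up."""
    return math.ceil(peak_qps / per_gpu_qps * (1 + safety_margin))


def monthly_gpu_cost(replicas: int, hourly_rate: float,
                     hours: float = HOURS_PER_MONTH) -> float:
    """Reserved capacity bills for every hour, used or not."""
    return replicas * hourly_rate * hours


def self_host_cost_per_query(monthly_cost: float, monthly_queries: float) -> float:
    return monthly_cost / monthly_queries


# Hypothetical workload: 12 QPS peak, 2.5 QPS per GPU, $3.00 per GPU-hour.
replicas = replicas_needed(peak_qps=12, per_gpu_qps=2.5, safety_margin=0.30)
gpu_bill = monthly_gpu_cost(replicas, hourly_rate=3.00)
per_query = self_host_cost_per_query(gpu_bill, monthly_queries=3_000_000)
print(f"{replicas} replicas, ${gpu_bill:,.0f}/mo, ${per_query:.5f}/query effective")
```

Because the GPU bill is fixed, the effective $/query falls as volume rises, which is what drives the break-even against per-token API pricing.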

+ Add another GPU instance

Critical for bursty traffic. A NOAA storm explainer might run only ~10% of the month; cutting GPU hours 10× changes the API-vs-self-host tradeoff dramatically.

Budget solver given a budget, find the affordable scale

What this is. Inverse mode of the calculator — instead of "given my inputs, what's the bill?", asks "given my budget ceiling, what's the maximum scale I can support?". Enter a target monthly budget; the engine solves for the maximum MAU the deployment fits under, plus enumerates which tradeoffs (cache hit, model swap, batch tier) would unlock more headroom without exceeding budget.

Why it matters. Procurement conversations usually start with a budget envelope ("you've got $50K/mo to work with"), not a traffic estimate. Forward-quoting from inputs leaves you guessing whether you'll fit; the solver tells you directly. Also tells you whether cost-optimization moves (cheaper model, batch async) buy you 10% or 60% more capacity — which informs whether they're worth the engineering effort.

How to interpret the results. Enter the monthly $ ceiling. The solver returns the max MAU at current per-query cost, plus a ranked list of moves that would extend headroom (e.g., "+18% MAU if cache hits 80%", "+45% MAU if you swap Drafter from Opus → Sonnet"). Use this to negotiate the budget envelope with stakeholders before locking in.
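A rough sketch of the inverse solve, assuming cost is a fixed monthly component plus a per-query cost that scales linearly with MAU. The function names, the linear cost shape, and the lever inputs (each lever expressed as an alternative per-query cost) are illustrative assumptions, not the engine's actual model.

```python
import math


def max_affordable_mau(budget: float, fixed_monthly: float,
                       queries_per_user: float, cost_per_query: float) -> int:
    """Largest MAU whose variable cost fits in the budget left after fixed costs."""
    headroom = budget - fixed_monthly
    per_user = queries_per_user * cost_per_query
    if headroom <= 0 or per_user <= 0:
        return 0
    return math.floor(headroom / per_user)


def rank_levers(budget: float, fixed_monthly: float, queries_per_user: float,
                base_cost_per_query: float, levers: dict) -> list:
    """Return (lever, max MAU, fractional MAU gain) sorted by gain, largest first.

    `levers` maps a lever name to the per-query cost if that lever were applied.
    """
    base = max_affordable_mau(budget, fixed_monthly, queries_per_user, base_cost_per_query)
    rows = []
    for name, new_cost in levers.items():
        mau = max_affordable_mau(budget, fixed_monthly, queries_per_user, new_cost)
        gain = (mau - base) / base if base else 0.0
        rows.append((name, mau, gain))
    return sorted(rows, key=lambda r: r[2], reverse=True)
```

Under this linear assumption, a lever that cuts per-query cost by a given fraction raises the affordable MAU by more than that fraction, since fixed costs do not shrink with it.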

When the projected monthly cost is at or under your ceiling, this panel stays empty.

Model cost comparison switch models to see headline savings

What this is. Procurement-grade model comparison table. Re-runs the full deployment (same workload, hosting, infrastructure, federal multipliers, personnel) against every model in the rate card. Each row shows the model, monthly bill, and Δ annual difference vs. your current pick. Sorted cheapest to most expensive.

Why it matters. The model row in an RFP cost basis usually requires multiple-vendor pricing for fairness. This section generates the table directly so you don't have to re-run the calculator manually for each candidate. The annual Δ column is the number stakeholders care about when comparing "what if we used vendor X instead".

How to interpret the results. Cheapest is rarely the right answer for production — quality differences between, e.g., Haiku and Opus on the same workload can be substantial. Use this as a procurement reference, not a selection tool; pair with quality evals before committing. For per-agent model swaps (mixed fleets), Section C's "Compare models" expander is more precise than this whole-fleet uniform view.

Columns: Model, $/query, $/month, $/year, Δ vs current (annual).

Cheaper ≠ better. Smaller models (gpt-5-nano, haiku-4.5) cost a fraction of flagship models but lose accuracy on multi-step reasoning, RAG faithfulness, and tool use. Validate any switch against your own evaluation set before committing — savings here are only meaningful if the cheaper model still meets your quality bar.

Sensitivity how robust is the headline to input drift?

What this is. Procurement-side sensitivity analysis. Each row perturbs one input around its baseline and shows the resulting monthly cost. Sorted by impact (biggest driver on top). The black tick on each bar marks the baseline; red extends to the low case, green to the high case.

Why it matters. Procurement reviewers and finance committees require an explicit sensitivity section in any major-spend proposal. "Here's the central estimate" without "and here's what happens if MAU misses by 20%" is incomplete — it doesn't let stakeholders gauge whether the budget envelope has enough margin for realistic uncertainty.

How to interpret the results. The top bars are the inputs you should validate most carefully before signing — they're the ones that move the bill most. If any single-input ±20% perturbation blows past your budget ceiling, the proposal needs either a larger envelope or stronger optimization levers (cache, model swap, batch) before it's defensible. Copy the top 3 rows into your procurement memo's "Sensitivity" paragraph.

Columns: Lever, Low case, Headline range, High case.

Perturbations: MAU ±20%, Cache hit rate ±10pp, Provider rates ±15%, Bot factor ±20%, Turns/session ±20%. Cache and rate perturbations approximate the deltas analytically — re-run with your own scenario knobs for exact figures.
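The one-at-a-time perturbation listed above can be illustrated with a toy cost model. Everything in this sketch is an assumption made for illustration: the multiplicative cost formula, the parameter names, the baseline values, and the cache discount are not taken from the calculator's engine.

```python
def monthly_cost(p: dict) -> float:
    """Toy cost model: query volume times a cache-discounted per-query rate."""
    queries = p["mau"] * p["bot_factor"] * p["sessions"] * p["turns"]
    per_query = p["rate"] * (1.0 - p["cache_hit"] * p["cache_discount"])
    return queries * per_query


def sensitivity(base: dict, perturbations: dict) -> list:
    """Perturb one input at a time; return (name, low cost, high cost) by span, widest first."""
    rows = []
    for name, (kind, delta) in perturbations.items():
        lo, hi = dict(base), dict(base)
        if kind == "rel":            # relative change, e.g. +/-20%
            lo[name] = base[name] * (1 - delta)
            hi[name] = base[name] * (1 + delta)
        else:                        # absolute change, e.g. +/-10 percentage points
            lo[name] = base[name] - delta
            hi[name] = base[name] + delta
        costs = (monthly_cost(lo), monthly_cost(hi))
        rows.append((name, min(costs), max(costs)))
    return sorted(rows, key=lambda r: r[2] - r[1], reverse=True)


# Hypothetical baseline and the perturbation set from the note above.
BASE = {"mau": 10_000, "bot_factor": 1.5, "sessions": 4, "turns": 5,
        "rate": 0.01, "cache_hit": 0.5, "cache_discount": 0.9}
PERTURBATIONS = {"mau": ("rel", 0.20), "cache_hit": ("abs", 0.10), "rate": ("rel", 0.15),
                 "bot_factor": ("rel", 0.20), "turns": ("rel", 0.20)}
```

Calling `sensitivity(BASE, PERTURBATIONS)` yields rows in the same order the panel uses, biggest driver first. Note that raising the cache hit rate lowers cost, so its "low case" comes from the upward perturbation.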

Cost over time monthly + cumulative projection given growth%

What this is. Time-series cost projection. Takes the current monthly headline and projects it forward using the Growth/month slider in Section A (default 20%/mo). Shows two curves: monthly cost climbing month-over-month, and running 36-month cumulative TCO.

Why it matters. Procurement envelopes are sized in years, not months. A $20K/mo deployment with 20% MoM growth hits roughly $60K/mo by month 6 and about $180K/mo by month 12 (20K × 1.2⁶ ≈ 59.7K; 20K × 1.2¹² ≈ 178K). Sizing the year-1 budget envelope from the current monthly bill is the most common way procurements run out of money mid-cycle. The cumulative curve is what you bring to the finance review.

How to interpret the results. Set growth/month conservatively: 5–15% for known-audience internal tools, 20–50% for public launches, >50% only for a genuinely viral profile. The 12-month cumulative is your year-1 budget ask; the 36-month is your multi-year envelope. If the curve looks too aggressive, revisit the growth assumption or build cost-optimization milestones into the Migration timeline.

Compounds the current simulator growth rate monthly. Caps at 36 months since further projections are unreliable. Headcount, seat-license, and contract reservations don't grow proportionally — adjust manually for those.
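The compounding described here can be sketched as follows. The function name is hypothetical, and the indexing convention (today's bill is month 0, and month m costs the current bill times (1 + growth)^m) is an assumption; the calculator may count the first month differently.

```python
def project_costs(monthly_now: float, growth_per_month: float, months: int = 36):
    """Return (monthly, cumulative) cost lists for months 1..N, capped at 36 months."""
    months = min(months, 36)
    monthly, cumulative, running = [], [], 0.0
    for m in range(1, months + 1):
        cost = monthly_now * (1.0 + growth_per_month) ** m
        running += cost
        monthly.append(cost)
        cumulative.append(running)
    return monthly, cumulative


if __name__ == "__main__":
    monthly, cumulative = project_costs(10_000, 0.10)
    print(f"month 12: {monthly[11]:,.0f}  year-1 total: {cumulative[11]:,.0f}")
```

The whole bill compounds in this sketch. As noted above, headcount, seat-license, and reservation lines would need to be held flat or adjusted separately.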

Side-by-side compare two scenarios diffed line-by-line

What this is. A diff tool. Picks two scenarios — your live config plus a saved or bundled comparison — and runs both through the same engine. Side-by-side table of inputs and outputs, with the % difference highlighted on every row that differs.

Why it matters. "Why does scenario A cost $30K and scenario B cost $50K?" is a question every procurement reviewer asks. Hunting through two configs manually is tedious and error-prone. A diff table answers it in one screen — which inputs differ, which outputs differ, and which difference explains the bulk of the gap.

How to interpret the results. Use this to defend a configuration change ("we cut the bill 40% by reducing RAG chunks and swapping the Drafter model"), or to compare your proposal against an alternative architecture. Rows with the largest % delta are the most important to explain in the procurement memo.

AS-IS vs proposed compare against your current contract

What this is. Incumbent-vs-proposed comparison. Enter what you're paying today (or what an incumbent vendor has quoted) and the calculator surfaces the delta against the proposed deployment — monthly savings or overrun, plus the payback window if you've entered a one-time migration cost.

Why it matters. Procurement justifications almost always require an AS-IS baseline: "we're saving $X/mo vs. the current contract" is a much stronger pitch than "the new system costs $Y/mo". Payback windows convert a one-time migration spend (engineering effort, data migration, recertifications) into a months-to-recoup number that finance committees actually use.

How to interpret the results. Enter the incumbent's monthly bill from the actual contract or invoice (not a guess). One-time migration costs should include engineering hours × loaded rate, plus any data-migration / re-certification fees. Payback <6 months is usually an easy approval; >18 months requires defending why the move is strategic beyond pure cost.

Today's annual spend on whatever this deployment replaces. Vendor invoice, internal cost-allocation, or incumbent quote.

Switching cost — data migration, training, parallel running, contract exit fees. Use 0 if greenfield.
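The payback window is a one-line division. The sketch below assumes both bills are expressed per month (an annual incumbent figure would first be divided by 12); the function name and the `None` return for the no-savings case are illustrative choices.

```python
from typing import Optional


def payback_months(incumbent_monthly: float, proposed_monthly: float,
                   migration_cost: float) -> Optional[float]:
    """Months needed for monthly savings to recoup the one-time switching cost.

    Returns None when the proposal is not cheaper, since it never pays back.
    """
    savings = incumbent_monthly - proposed_monthly
    if savings <= 0:
        return None
    return migration_cost / savings


if __name__ == "__main__":
    print(payback_months(50_000, 30_000, 120_000))
```

A greenfield deployment with a switching cost of 0 returns 0.0 months whenever the proposal is cheaper.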

Infrastructure database, storage, networking, monitoring — fixed monthly costs

What this is. Fixed cloud-infrastructure costs that show up regardless of LLM hosting choice — RDS, S3, CloudWatch, ALB, NAT Gateway, Route 53, secrets manager, observability stack. Each line accepts flat $/mo, $ per query, or $ per GB scaling so you can model both fixed and traffic-scaled lines.

Why it matters. These lines are usually the smaller part of the total bill (5–15% of TCO for managed-API deployments; higher for self-host), but they add up: a federal-tier deployment can easily run $2K–$5K/mo just on observability + audit logging + redundant networking. They are easy to forget when modeling and embarrassing to discover in the first invoice.

How to interpret the results. Pull line items from your existing AWS/Azure/GCP bill if you have one — that's the most accurate source. For greenfield estimates, the bundled defaults are a reasonable starting point for a single-region production deployment. Flows into the headline as a separate "Infrastructure" line, broken out in the report so reviewers see it isn't bundled into the model cost.


📊 Published cost benchmarks

Real cost numbers cited in vendor pricing pages, earnings calls, GAO reports, and academic studies. Use these to sanity-check your calculator output and to defend procurement budgets with citations. Click any source link to verify the number.

Visual comparison: pick a metric; your scenario plots alongside every cited benchmark. Legend: Your scenario, Commercial, Federal, Industry / academic.

💲 Price book

Single source of truth for every price the calculator uses — LLM rates, API reservations, GPU instances, embeddings, vector DBs, AWS infra, personnel salaries, ATO costs. Edits here override the defaults for this workload only and persist in the URL hash. Defaults are validated against vendor pricing pages; last full refresh: . Future plan: a scraper periodically fetches each source_url and bumps last_verified.

Total monthly cost

View:
Annual
3-year TCO
$ / user / month
Queries / month

Cost composition

API vs self-host comparison

Per-segment breakdown

Infrastructure breakdown

Derivation of your numbers

Full math trace for this workload — copy-paste into ChatGPT/Claude/Gemini and ask "verify this math". For source citations and formula details, see Methodology in the sidebar.

Methodology, sources & disclosures planning only — refresh prices before procurement

Pricing sources, token-counting heuristics, confidence-interval math, and what this calculator does not model. For the math applied to your numbers, see Derivation of your numbers above.

PRICING SOURCES
Static bootstrap prices were patched on 2026-05-04. Your scraper should refresh these before procurement use. Prices are subject to change by providers. For procurement, verify against vendor pricing pages on the day of submission. Sources:
  • Anthropic API: Claude Opus 4.7, Sonnet 4.6, Haiku 4.5; includes cache write/read pricing
  • OpenAI API: GPT-5.5, GPT-5.5 Pro, GPT-5.4 family, embeddings, web/file search, containers, and short/long context price tiers
  • Google Gemini API: Gemini 3.1 Pro Preview, Gemini 3 Flash Preview, Gemini 3.1 Flash-Lite Preview, Gemini 2.5 Flash, Gemini 2.5 Flash-Lite
  • Together AI: Llama 3.3 70B Turbo provider-specific bootstrap rate
TOKEN COUNTING
The browser token counter is a lightweight planning heuristic, not a true BPE tokenizer. It is useful for relative scenario exploration, but production estimates should be validated with provider token counters or logged API usage. Cross-provider tokenizer differences can be material, especially for long context, code, CJK text, and model-family upgrades.
CONFIDENCE INTERVALS — LOGNORMAL MODEL
Output token lengths follow a lognormal distribution (right-skewed). The model uses:
  • p90 = exp(μ + 1.282σ)
  • p99 = exp(μ + 2.326σ)
  • σ = √(ln(1 + CV²)), where CV = coefficient of variation weighted by task mix
CV values per task type are approximations based on observations of LLM output variance — not measured against published benchmarks. They should be treated as planning estimates with ±30% uncertainty bands themselves. For high-stakes budgeting, derive CV empirically from your own production logs.
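A minimal sketch of the percentile formulas above. The text does not define μ, so this assumes the standard lognormal relation μ = ln(mean) − σ²/2, where the mean is the arithmetic mean output length. The function name is hypothetical.

```python
import math


def lognormal_percentiles(mean_tokens: float, cv: float) -> dict:
    """p50/p90/p99 output lengths for a lognormal with the given mean and CV."""
    if mean_tokens <= 0 or cv < 0:
        raise ValueError("mean_tokens must be positive and cv non-negative")
    sigma = math.sqrt(math.log(1.0 + cv ** 2))
    # Assumption: mu is chosen so the distribution's arithmetic mean equals mean_tokens.
    mu = math.log(mean_tokens) - sigma ** 2 / 2.0
    return {
        "p50": math.exp(mu),
        "p90": math.exp(mu + 1.282 * sigma),
        "p99": math.exp(mu + 2.326 * sigma),
    }


if __name__ == "__main__":
    print(lognormal_percentiles(mean_tokens=400, cv=1.0))
```

With CV = 0 the distribution collapses and every percentile equals the mean. As CV grows, the median drops below the mean while p99 climbs well above it, which is the right-skew the section describes.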
TASK-TYPE OUTPUT MULTIPLIERS
Output multipliers (classify 0.30×, summary 0.65×, RAG 0.85×, code 2.80×, longform 3.60×, agentic 4.30×) are heuristic estimates derived from informal observations of typical task output lengths. They are not formally derived from HELM, LMSYS, or other published benchmarks. Real values for your specific use case may vary ±50%.
AGENT ENGINEERING COST MODEL
Upfront design effort is amortized over the deployment lifetime, then summed into the monthly headline alongside Operations and Personnel. The cost shape mirrors self_host.setup_amortized: roles × FTE × duration ÷ 12 → upfront total → ÷ amortization months → amortized monthly. Maintenance accounts for periodic re-specification as the domain drifts (design-lead loaded hourly × hours per session ÷ months between sessions).

Methodology-agnostic. The same shape models any structured agent-design methodology — DSPy, structured-prompt extraction, design-pattern playbooks, etc. Adjust role FTEs and duration to match your engagement.

Example — CARE (Collaborative Agent Reasoning Engineering): a three-party stage-gated approach pairing subject-matter experts, developers, and helper agents to produce structured agent specifications (interaction requirements, reasoning policies, evaluation criteria). The defaults shipped here (4-month design phase, 0.5/1.0/1.0/0.25 FTE for SME / design lead / developer / eval engineer, ~$400/mo helper-agent budget, quarterly re-spec cadence) approximate a CARE engagement. Ramachandran, Jha & Ramasubramanian, 2026 (arXiv:2604.28043).
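The amortization shape described in this section can be sketched as below. The salaries and hourly rate are hypothetical placeholders, and adding the helper-agent budget to the upfront total is an assumption about where that line lands. Only the FTE split, the 4-month duration, the ~$400/mo helper budget, and the quarterly cadence come from the defaults quoted above.

```python
def upfront_design_cost(roles: dict, duration_months: float,
                        helper_budget_monthly: float = 0.0) -> float:
    """Sum of annual loaded salary x FTE x duration / 12 per role, plus helper-agent spend."""
    staff = sum(annual * fte * duration_months / 12 for annual, fte in roles.values())
    return staff + helper_budget_monthly * duration_months


def amortized_monthly(upfront_total: float, amortization_months: float) -> float:
    """Spread the upfront total evenly over the deployment lifetime."""
    return upfront_total / amortization_months


def maintenance_monthly(hourly_rate: float, hours_per_session: float,
                        months_between_sessions: float) -> float:
    """Periodic re-specification cost expressed per month."""
    return hourly_rate * hours_per_session / months_between_sessions


# role -> (hypothetical annual loaded salary, FTE from the shipped defaults)
ROLES = {"sme": (180_000, 0.5), "design_lead": (200_000, 1.0),
         "developer": (170_000, 1.0), "eval_engineer": (160_000, 0.25)}

if __name__ == "__main__":
    upfront = upfront_design_cost(ROLES, duration_months=4, helper_budget_monthly=400)
    print(amortized_monthly(upfront, amortization_months=36))
    print(maintenance_monthly(hourly_rate=100, hours_per_session=30, months_between_sessions=3))
```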
CACHE DISCOUNT MODEL — LIMITATIONS
Anthropic prompt caching has a 5-minute TTL and only applies to specific prompt prefixes (system prompts cache best). The cache slider now splits eligible cached input into cache writes and cache reads where provider pricing exposes both. Anthropic 5-minute writes are modeled at the published write rate and reads at the published read rate. OpenAI/Gemini cached input is modeled as the published cached/context-caching input rate without a separate write surcharge. The cost engine now applies OpenAI/Gemini long-context tiers when a per-turn input exceeds the provider threshold encoded on the model. Actual cache eligibility still depends on stable prompt prefixes, TTL, provider cache thresholds, storage time, and request construction.
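The write/read split can be illustrated with a small function. This is a sketch under stated assumptions: `cacheable_share` and `hit_rate` are hypothetical parameter names, every cache miss on cacheable tokens is billed as a write, and all rates are supplied by the caller in $ per million tokens rather than taken from any rate card.

```python
def input_cost_with_cache(tokens: float, cacheable_share: float, hit_rate: float,
                          base_rate: float, write_rate: float, read_rate: float) -> float:
    """Input cost in dollars when part of the prompt is served from cache.

    Rates are $ per million tokens. For providers with no separate write
    surcharge, pass write_rate equal to base_rate.
    """
    cacheable = tokens * cacheable_share
    reads = cacheable * hit_rate       # cache hits, billed at the read rate
    writes = cacheable - reads         # cache misses, billed at the write rate
    uncached = tokens - cacheable      # never cache-eligible, billed at the base rate
    return (uncached * base_rate + writes * write_rate + reads * read_rate) / 1_000_000
```

At a zero hit rate with a write surcharge, caching costs more than not caching at all. This is why the eligibility caveats above (stable prefixes, TTL) matter before crediting the slider's savings.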
MULTIMODAL TOKEN ESTIMATES
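  • Images: ~1,568 tokens per image near Anthropic's size cap (Anthropic's documented estimate is roughly width × height ÷ 750 tokens, with larger images downscaled before tokenizing); variable for OpenAI low/high detail mode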
  • Audio (STT): ~25 tokens/sec for English speech (Whisper baseline)
  • PDF pages: ~1500 tokens/page average (varies by content density)
  • Code interpreter: stdout/stderr counted as input tokens on next turn
WHAT THIS TOOL DOES NOT MODEL
  • Fine-tuning training cost amortisation (separate calculator recommended)
  • Infrastructure/hosting costs for self-hosted deployments (separate calculator recommended)
  • Human-in-the-loop reviewer time costs (separate calculator recommended)
  • Volume discount tiers (negotiate directly with vendors above 100M tok/mo)
  • Network egress, storage, vector DB operational costs beyond the optional file-search/container placeholders
  • Compliance overhead (FedRAMP, HIPAA, SOC2 audit costs)
  • Latency SLA penalty costs
DISCLAIMER
This tool produces planning estimates, not contractual cost commitments. For procurement decisions, federal budget submissions, or vendor negotiations: (a) verify pricing on the day of submission, (b) validate token estimates against your own production telemetry, (c) treat all p99 figures as soft upper bounds with their own ±20% uncertainty, (d) consult your finance team for compliance-specific cost adders. The author makes no warranty of accuracy and accepts no liability for procurement decisions based on these estimates.
PROVIDER COST MULTIPLIERS
  • Managed API: 1.00× (direct vendor list price)
  • BYOK: 1.00× (bring-your-own-key, no aggregator markup)
  • AWS Bedrock: 1.05× (typical 5% AWS markup, GovCloud available)
  • Azure OpenAI: 1.00× (parity with OpenAI direct, ent compliance)
  • OpenRouter: 1.05× (typical aggregator markup, varies)
  • Self-Hosted: 0× per-token + $5,000/mo fixed cost (default; configurable; covers H100/A100 instance + light ops)
These are representative averages; actual contracts vary. Negotiated enterprise agreements may be 10–20% below list price.
REALISTIC RETRY MODEL
v9 models retries as retry_rate × 1.5 × base_cost — accounting for partial output already generated before failure plus the full retry call. Previous versions used retry_rate × 1.0 × base_cost, which underestimates real retry waste by ~33%.
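As a small illustration of the formula above (the function name and default are assumptions; the 1.5 factor is the one stated in the text):

```python
def retry_overhead(base_cost: float, retry_rate: float, waste_factor: float = 1.5) -> float:
    """Extra spend from retries: retry_rate x waste_factor x base_cost.

    waste_factor 1.5 covers the partial output lost before the failure
    plus the full retry call; 1.0 reproduces the older model.
    """
    return retry_rate * waste_factor * base_cost


if __name__ == "__main__":
    new = retry_overhead(10_000, 0.04)
    old = retry_overhead(10_000, 0.04, waste_factor=1.0)
    print(new, old, 1 - old / new)  # the last value is the older model's shortfall
```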
CONVERSATION SUMMARISATION OVERHEAD
When per-agent context fills above 70% of model context window, the agent must summarise older turns to continue. v9 models this as an additional API call costing approximately 30% of context tokens at input rate plus 30% × 30% at output rate. This is a real production cost ignored by simpler calculators.
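One possible reading of this rule, sketched below: once context exceeds 70% of the window, 30% of the context tokens are sent as input to a summarisation call, and the summary it produces is 30% of that amount, billed at the output rate. This interpretation of "30% × 30%", the parameter names, and the per-million-token rate units are all assumptions.

```python
def summarisation_cost(context_tokens: float, window_tokens: float,
                       input_rate: float, output_rate: float,
                       trigger: float = 0.70, summarised_share: float = 0.30,
                       summary_ratio: float = 0.30) -> float:
    """Dollar cost of one summarisation call; 0.0 while context is under the trigger.

    Rates are $ per million tokens.
    """
    if context_tokens <= trigger * window_tokens:
        return 0.0
    summarised_in = context_tokens * summarised_share   # older turns fed to the summariser
    summary_out = summarised_in * summary_ratio         # condensed summary it emits
    return (summarised_in * input_rate + summary_out * output_rate) / 1_000_000
```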
BURSTINESS / PEAK PROVISIONING
Peak-to-average ratio above 2× incurs a 5% surcharge per ratio level above 2 (representing rate-limit overage charges, queue overflow handling, or premium-tier capacity). E.g. a 5× peak adds 5% × 3 = 15% to base cost.
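The surcharge rule can be written as a short function. Treating fractional ratio levels linearly (so a 2.5× peak adds 2.5%) is an assumption, as the text only gives a whole-number example. The function name is hypothetical.

```python
def burst_surcharge(peak_to_avg: float, per_level: float = 0.05,
                    threshold: float = 2.0) -> float:
    """Fractional surcharge on base cost: 5% per ratio level above 2x, else zero."""
    return max(0.0, peak_to_avg - threshold) * per_level


def cost_with_burst(base_cost: float, peak_to_avg: float) -> float:
    """Base cost with the burstiness surcharge applied."""
    return base_cost * (1.0 + burst_surcharge(peak_to_avg))
```

For the 5× peak in the example, `burst_surcharge(5)` evaluates 3 levels × 5%, matching the 15% figure above.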
LANGUAGE MULTIPLIER
Tokenizers compress different languages and model families differently. This multiplier is a coarse planning adjustment, not a provider-tokenizer substitute. Approximate multipliers: English 1.0×, Code 1.3×, Spanish/French 1.2×, Chinese/Japanese/Korean 1.8–2.2×, Arabic/Hebrew 1.4×. Validate with provider token counters or production usage logs before budget submission.