▶Project profile name, team, audience
What this is. Identity for the deployment — project name, owning team or agency, one-paragraph description, and the hosting strategy (managed API, BYOK, self-hosted GPU, on-prem). Drives the report title block and gates which downstream sections appear.
Why it matters. The hosting choice is the single largest structural decision in the report: it switches the entire cost model between per-token-API (linear with traffic), reserved-capacity (committed spend), and capex-amortized (fixed monthly regardless of traffic). Getting it wrong by mis-clicking can move the bill by 5–10× and surface the wrong reservation / self-host fields downstream.
How to interpret the results. Pick the hosting that matches your actual procurement vehicle — not the one you wish you had. Project name and description appear verbatim at the top of the report, so write them as you'd want a reviewer to read them. Sections downstream auto-show / auto-hide based on this choice; if you don't see Reservations, you're on a hosting model that doesn't use them.
Pick where the LLM runs. Downstream sections (Reservations, Self-host capacity, On-prem amortization) auto-show only when relevant.
Workload-wide knobs — shared across all agents (per-agent specifics live in the Agent fleet tab). Two sub-panels below: Your audience sizes the traffic (MAU × sessions × questions), Workload parameters shapes the per-query bill (cache hit rate, retry, growth, plus advanced calibration knobs). Open each panel for detailed guidance on its knobs.
What this is. Audience segmentation — split MAU into groups that use the app differently. Each segment has its own MAU, sessions/day per user, questions/session, and authentication status. Anonymous public segments get a bot-factor multiplier (typically 1.5–2.5×) to account for crawlers, scrapers, and abuse traffic.
Why it matters. Authenticated internal staff might run 3 deep sessions/day with 15 questions each; anonymous public visitors might run 0.05 sessions/day with 2 questions each. Modeling these together as one average user obscures cost behavior — and bot-factor on the public segment routinely accounts for 30–60% of the bill on consumer-facing deployments. Without this breakout, you can't defend the MAU number to procurement.
How to interpret the results. Pull each segment's numbers from analytics (login records for authenticated, web analytics for anonymous). Set bot-factor to 1.0× for internal-only, 1.5× for typical public, 2.5–3× for SEO-targeted public pages. The aggregate "total monthly questions" used in the headline = Σ (MAU × sessions × questions × 30 × bot-factor) across all segments.
Single vs Multi. One audience uses the slider trio below. Click + Split into audience types to break MAU into distinct groups (auth power users vs. anonymous public, paid vs. free tier, etc.). Both views drive the same engine math — workload.segments[] is the only source of truth.
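The aggregation described above can be sketched as follows. This is a minimal illustration, not the engine's code: the `Segment` class and its field names are hypothetical stand-ins for whatever `workload.segments[]` actually stores, and the MAU values are invented for the example (the sessions and questions figures are the staff and anonymous-visitor numbers quoted above).

```python
from dataclasses import dataclass


@dataclass
class Segment:
    # Hypothetical field names; the real workload.segments[] schema may differ.
    name: str
    mau: int
    sessions_per_day: float
    questions_per_session: float
    bot_factor: float = 1.0  # 1.0 = internal-only, no bot inflation


def monthly_questions(seg: Segment, days: int = 30) -> float:
    """MAU x sessions/day x questions/session x days x bot-factor."""
    return (seg.mau * seg.sessions_per_day * seg.questions_per_session
            * days * seg.bot_factor)


def total_monthly_questions(segments) -> float:
    """Sum the per-segment totals, as in the headline formula."""
    return sum(monthly_questions(s) for s in segments)


segments = [
    Segment("internal staff", mau=500, sessions_per_day=3, questions_per_session=15),
    Segment("anonymous public", mau=200_000, sessions_per_day=0.05,
            questions_per_session=2, bot_factor=1.5),
]
total = total_monthly_questions(segments)

# Share of the public segment's traffic attributable to bots: 1 - 1/bot_factor.
bot_share = 1 - 1 / segments[1].bot_factor
```

With a 1.5× bot-factor, roughly a third of the public segment's questions are bot traffic; at 2.5× the share is 60%, which is where the 30–60% range comes from.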
What this is. The per-query shape knobs that sit on top of your audience volume. Cache hit rate = vendor discount when you re-serve the same system prompt (every +10pp ≈ 8% off the input bill). Retry rate = share of calls that fail and are re-run (measured from logs, not set in code). Growth / month = compounding traffic growth used for the 12- and 36-month projection. The advanced knobs (Bot factor, Cache write share, Batch async %, Context compression %, Peak/avg ratio, Language multiplier) cover specialized calibration and only appear in Advanced mode.
Why it matters. These are the second-largest set of cost levers after audience volume. Cache hit rate is THE biggest single lever — moving from 45% → 85% on a $30K/mo deployment saves ~$10K/mo. Retry rate above 5% means real production friction that costs real money. Bot factor on a public-facing segment routinely accounts for 30–60% of the bill (and silently — the multiplier compounds with the audience volume). Growth doesn't change this month's headline but determines what the 36-month TCO looks like in the budget memo.
How to interpret the results. Aim for cache hit ≥70% on stable system prompts (workflow: deduplicate sysprompts, lock them in a versioned artifact, never inline-edit). Pull retry rate from production logs rather than guessing. Set bot factor to 1.0× for internal SSO-only, 1.5× for typical public, 2.5–3× for SEO-targeted public pages. Set growth conservatively (5–15% for known internal tools, 20–50% for public launches). If your post-launch bill diverges from this estimate, the answer is almost always in this panel: recalibrate cache, retry, and bot factor first.
Basic vs Advanced. Basic mode shows only the Growth slider — the other knobs keep their preset defaults (cache 70%, retry 3%, bot 1.5×) so the headline cost is identical to Advanced. Switch to Advanced to tune all 9 knobs and see the calibration sliders.
↳ Per-agent override: agents with a Task bias set (Section C card → Task bias dropdown) bill against a 60/8/8/8/8/8 mix of their biased type, multiplied into their own output_tokens. Useful for mixed fleets — e.g. a Triage agent (bias=classify, cheap output) + a Responder agent (bias=longform, expensive output) in the same workload.
What this is. The catalog of tools your agents can call — web search, vector search, code execution, database lookups, image generators, STT/TTS, custom MCP servers. Each entry defines the tool's per-call fee, the typical schema size injected into the prompt, the result-token shape it returns, and which provider bills it (managed-API, BYOK, or self-hosted).
Why it matters. Tool fees are billed independently of LLM tokens. A vector-search agent at 5K queries/mo using Exa at $0.005/call adds $25/mo flat; an image-gen tool at $0.04/image on 50K calls/mo is $2K — bigger than the LLM bill on some workloads. Tools also pump tokens back into the agent's input (the result), which the engine bills at full input-token rate.
How to interpret the results. Add only the tools your fleet actually calls; each one is wired up in Section C per-agent (via enabled_tools). The "fees + tokens" badge in the cost ledger (Section E) breaks out tool-fee impact separately from LLM-token impact, so you can see which line is hurting. If you're seeing surprise costs, this is the second place to check after Section A.
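The two cost channels of a tool (the flat per-call fee and the result tokens fed back into the agent's input) can be separated with a few lines of arithmetic. The sketch below is illustrative only: the function name is hypothetical, and the 1,500 result tokens per call and $3.00 per 1M input tokens are assumed values, not rate-card figures.

```python
def tool_monthly_cost(calls_per_month, fee_per_call,
                      result_tokens_per_call=0, input_rate_per_mtok=0.0):
    """Return (fee_usd, token_usd) for one tool over a month.

    fee_usd   -- flat per-call fees billed by the tool provider
    token_usd -- tool results re-entering the prompt, billed at the input rate
    """
    fee_usd = calls_per_month * fee_per_call
    token_usd = (calls_per_month * result_tokens_per_call
                 * input_rate_per_mtok / 1_000_000)
    return fee_usd, token_usd


# Vector search: 5K calls/mo at $0.005/call (figures from the text above),
# plus ASSUMED 1,500 result tokens/call at an ASSUMED $3.00 per 1M input tokens.
search_fee, search_tok = tool_monthly_cost(
    5_000, 0.005, result_tokens_per_call=1_500, input_rate_per_mtok=3.00)

# Image generation: 50K calls/mo at $0.04/image, no tokens returned.
image_fee, image_tok = tool_monthly_cost(50_000, 0.04)
```

Under these assumptions the search tool's result tokens cost about as much as its fees, which is why the ledger reports the two lines separately.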
Embeddings model: text-embedding-3-small @ $0.02/1M input tokens.
What this is. The per-agent editor. Each card configures one agent in the fleet — its model, system prompt size, RAG settings, which tools it enables, guardrail tokens, task bias, and activation rate. Topology (Single / Fleet / Workflow) sits above; it decides whether agents fan-out in parallel, chain sequentially, or run alone. The reference-topology diagram (collapsible) shows what your choices look like.
Why it matters. Most real cost-engineering happens here. Doubling agents roughly doubles per-query cost (each runs full sysprompt + RAG + reasoning), so 1→3 on a $30K/mo deployment ≈ +$60K/mo. The model choice per agent compounds — putting Opus-4.7 on a Triage agent that only needs gpt-5.4-mini wastes ~5× per call. activation_rate shaves cost on conditional agents (e.g. an Image-Enhancer that only fires on 30% of requests = 0.3× monthly contribution).
How to interpret the results. Each agent card shows its monthly contribution ($/mo · % of fleet) in the header, with a "Compare models" expander that runs the engine against every model so you see the swap delta before committing. Use the topology cards to enforce structure; use Task bias to match each agent to its workload character (Triage→classify, Drafter→longform, etc.).
What this is. The post-response verification pipeline that checks generated claims against a knowledge source before returning. Configures the verifier preset (MiniCheck, FactScore, AlignScore, FR2, etc.), NLI model hosting (self-hosted vs. managed), per-claim atomizer/reviser token budgets, and a cascade rule for escalating low-confidence claims to a second more-expensive verifier.
Why it matters. Skipping fact-checking is the right call for low-stakes chatbots and a compliance violation for legal/medical/financial. When you do need it, the verifier choice can swing fact-check cost by 20×: a self-hosted MiniCheck NLI run on every claim is ~$0.0001/claim; an API-hosted FR2 with reasoning is ~$0.002/claim. Cascade verification keeps the average cheap (90% of claims handled by the fast verifier) while still catching the hard cases.
How to interpret the results. Start by picking a preset that matches your risk profile — MiniCheck for general grounding, FR2 for high-stakes, FactScore if you publish citation rates. Set coverage to the fraction of responses that need verification (1.0 = all, 0.2 = 20%). Cascade settings show up as the ⇉ icon in the Section C architecture diagram; the cost flows into the headline as a separate "verify" line.
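To see why a cascade keeps the average cheap, here is a small sketch of the blended per-claim cost. It assumes every claim is first checked by the fast verifier and only an escalated fraction is re-checked by the expensive one, so escalated claims pay for both passes. That billing model, the function names, and the five claims per response are assumptions for illustration; the per-claim prices are the approximate figures quoted above.

```python
def cascade_cost_per_claim(fast_cost, slow_cost, escalation_rate):
    """Average cost per claim when a fraction escalates to the slow verifier."""
    if not 0.0 <= escalation_rate <= 1.0:
        raise ValueError("escalation_rate must be between 0 and 1")
    return fast_cost + escalation_rate * slow_cost


def monthly_verify_cost(responses, coverage, claims_per_response, cost_per_claim):
    """Verification line: only `coverage` of responses are fact-checked."""
    return responses * coverage * claims_per_response * cost_per_claim


# 90% of claims stop at the fast verifier, 10% escalate.
avg = cascade_cost_per_claim(0.0001, 0.002, escalation_rate=0.10)

# ASSUMED workload: 1M responses/mo, 20% coverage, 5 claims per response.
monthly = monthly_verify_cost(1_000_000, coverage=0.2,
                              claims_per_response=5, cost_per_claim=avg)
```

Under these assumptions the cascade averages about $0.0003 per claim, versus $0.002 if every claim went straight to the expensive verifier.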
Advanced — token budgets per call
What this is. Three views on the headline number, each from a different angle. Per-turn token ledger — line-by-line accounting of where one canonical query's tokens go (sysprompt + history + RAG + tool results + output + reasoning). Cumulative token breakdown — same components rolled across the whole session as a stacked bar so the dominant cost driver pops visually. Lognormal CI distribution — the cost-per-session probability curve (p50 / p90 / p99) showing how wide your variance is on real traffic, not just the central estimate.
Why it matters. Headline cost is an opinion until you can see what's driving it. The ledger answers "which knob to tune": if RAG chunks consume 60% of input, the lever is retrieval depth (Section A → RAG chunks); if reasoning tokens are 40% of output, the lever is thinking mode (per-agent in Section C). The CI distribution answers "how confident is the number": a wide p50→p99 spread means a single bad day can blow past the central estimate, so you size budget envelopes against the tail, not the median.
How to interpret the results. Ledger: look for the biggest bar — that's your priority lever. Compare "uncached input" vs "cached input" — if uncached is >50%, prompt-cache isn't earning its keep (system prompt instability, dynamic preambles). Cumulative chart: confirms the ledger findings at session scale. CI distribution: bring the p90 to the procurement memo as the budget defense — "central estimate $X, p90 worst-case $Y, $/session standard deviation Z%."
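For readers who want to reproduce the p50 / p90 / p99 figures by hand, percentiles of a lognormal follow directly from its median and log-scale spread. The sketch below assumes the distribution is parametrized by the median cost per session and a `sigma` on the log scale; how the calculator actually fits these parameters is not stated here, so treat both inputs as hypothetical.

```python
import math
from statistics import NormalDist


def lognormal_percentile(median, sigma, p):
    """Percentile p (0 < p < 1) of a lognormal with the given median.

    If ln(X) ~ Normal(mu, sigma), then median = exp(mu) and the p-th
    percentile is median * exp(sigma * z_p), where z_p is the standard
    normal quantile.
    """
    z = NormalDist().inv_cdf(p)
    return median * math.exp(sigma * z)


# ASSUMED inputs: $1.00 median cost per session, sigma = 0.5.
p50 = lognormal_percentile(1.00, 0.5, 0.50)
p90 = lognormal_percentile(1.00, 0.5, 0.90)
p99 = lognormal_percentile(1.00, 0.5, 0.99)
```

A larger `sigma` widens the gap between p50 and p99, which is the "wide spread" case where the budget should be sized against the tail.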
What this is. A side-by-side cost table running your current workload against every model in the rate card — Claude Opus/Sonnet/Haiku, GPT-5.x family, Gemini 3, Llama 3.3, etc. Each row shows per-query cost, monthly bill, and percentage delta vs. your selected model. The 12-month projection chart shows the same models curved over time at your configured growth rate.
Why it matters. Model choice is usually the second-biggest decision after agent count. The same workload run on Opus-4.7 vs Haiku-4.5 can differ by 8×; gpt-5.4 vs gpt-5.4-mini by 5×. But naïve "always pick the cheapest" ignores quality differences — this table lets you see the cost gap so you can decide if the quality lift is worth it.
How to interpret the results. The "whole-fleet uniform" badge is a reminder: this table assumes every agent runs the same model. Real fleets mix (Triage on Mini, Drafter on Opus) — use Section C's per-agent "Compare models" expander for the realistic per-agent swap math. Treat this table as a "ceiling vs floor" reference, then mix and match in Section C.
| Model | API total | RAG share | Reason share | Guard+ | Tool tok | Tool fees+ | Cache saved (info) | Retry+ | p50 | p90 | p99 |
|---|---|---|---|---|---|---|---|---|---|---|---|
What this is. A robustness check on your headline number. The sensitivity table perturbs each input by ±50% in isolation and reports the resulting % change in monthly cost; the tornado chart shows the same results ranked by impact. The What-If cards flip one knob (or a small bundle) to a named scenario value and report the delta, e.g. "what if cache hit rate hit 80%" or "what if RAG chunks doubled".
Why it matters. A budget that's right at the central estimate but blows up at +20% on the most sensitive input is not a budget — it's an optimistic guess. Sensitivity tells procurement reviewers which inputs your estimate is fragile to, so they know where to push back on assumptions. The What-If cards give a quick "lever pull" view for the most common cost-reduction or cost-increase moves.
How to interpret the results. Top bars in the tornado = the inputs that matter most — these are the ones you should validate hardest before signing. What-If cards reading 0.0% means that knob doesn't apply to your current preset (RAG off, guardrails already at zero, etc.). Use this section to write the "Sensitivity" paragraph in your procurement memo.
Each card flips one knob (or a small bundle) and shows how much your $/session changes vs. the current setup. 0.0% means the knob doesn't apply — e.g. "Double RAG chunks" reads 0% when RAG is off in your preset, "Add full guardrails" reads 0% when guardrails are already at the proposed values. Hover any card for what it tests and why it matters.
▶Federal compliance & hosting FedRAMP, ATO, egress, audit retention
What this is. The federal compliance overhead layer — FedRAMP/GovCloud premium multiplier, multi-region redundancy, plus additive line items for ATO certification, egress fees, audit log retention, and vector-DB hosting. Set to none / single / 0 if you're modeling a commercial deployment.
Why it matters. Federal AI typically costs 30–80% more than the commercial equivalent at the same traffic — GovCloud carries a 15–25% compute premium, FedRAMP-Moderate adds an annual ATO line ($150K–$400K amortized monthly), and audit retention can add $200–$800/mo. Ignoring these makes the proposal look unrealistic to procurement reviewers; including them lets you set defensible budget envelopes.
How to interpret the results. Pick the FedRAMP tier that matches your authorizing memo (Low / Moderate / High). Multi-region only if you have a hot DR requirement. The ATO amortization assumes a 36-month cycle by default — adjust if your re-cert window differs. Costs from this section flow into the headline as additive monthly lines, broken out separately in the Cost breakdown so reviewers can audit them.
Hosting multipliers
Multipliers stack — e.g. FedRAMP High + active-active = ×2.60 on LLM and GPU costs.
Additive monthly costs
Not modeled: ATO assessment labor (use Agent engineering or Personnel for that), sole-source procurement overhead, GSA Schedule discounts, model availability constraints (some Anthropic/OpenAI models aren't in GovCloud). Add fixed-cost approximations as line items in Infrastructure below.
▶Your typical question token sizes for one canonical query
What this is. Token sizes for one canonical query in your deployment — system prompt, user input, retrieved context (RAG chunks + tool results), and the assistant's reply. A "token" is roughly ¾ of an English word (1K tokens ≈ 750 words). Don't know your numbers? Use the plain-English wizard below — it maps "I write 2-paragraph answers with 5 search results attached" to actual token counts.
Why it matters. Per-query token size is one of the two multipliers in your monthly bill (the other is query volume in Your users). A 4K-token query at 1M queries/mo costs ~4× a 1K-token query at the same volume on the same model. Mis-sizing this is the most common reason for a budget overrun in the first three months of production.
How to interpret the results. Measure these from a sample of real traffic when you can — token counts from tiktoken or the provider's own tokenizer are authoritative; guesses are not. The "Your typical question" totals roll up into the per-turn ledger in Cost breakdown (Section E). If your post-launch bill diverges from this estimate, this section is the first place to recalibrate.
▶Question types full-pipeline, lookup-only, refusal, …
What this is. Question-type shapes — multipliers on top of the canonical query above. Full = baseline pipeline (1.0×); RAG = retrieval-only (~0.3×); refusal = out-of-scope reject (~0.05×); heavy = long-context multi-turn (~1.2×). The next section (Traffic mix presets) decides how queries split across these shapes.
Why it matters. Real fleets don't run one shape — a chatbot serves a mix of substantive questions, RAG lookups, polite refusals, and the occasional heavy follow-up. Modeling everything as the most expensive shape over-quotes by 2–4×; modeling everything as cheap lookups under-quotes by the same factor. The shape weights are how you reconcile.
How to interpret the results. Set factors to scale the canonical query: 0.5 means that shape uses half the tokens, 2.0 means double. If your fleet doesn't have refusals at all, leave the weight at 0 in the next section. Your effective per-query bill in the headline = Σ (mix weight × shape factor) across shapes × the canonical per-query cost.
▶Multi-agent pipeline optional — if your system uses more than one LLM call per query
What this is. The procurement-side multi-agent pipeline editor — each agent specifies its own model, calls-per-query, input/output tokens, sysprompt size, RAG flag, and task bias. The simulator-side per-agent editor (Section C above) is the richer counterpart with topology cards, tool-enablement, and the live architecture diagram; this section is the JSON shape that flows into the procurement report.
Why it matters. Real production fleets are almost never single-call — Planner → Retriever → Drafter → Reviewer chains, parallel specialist fans, ReAct loops. Modeling these as "one big call" understates cost by 2–6× because each agent has its own sysprompt amortization, cache behavior, and output budget. When this section is populated, the engine sums per-agent cost and ignores "Your typical question" above.
How to interpret the results. Use the Common pipeline templates below to start (single-call, planner+executor, RAG, multi-agent verifier, etc.) then tune per-agent. The "Multi-agent mode is active" banner appears once you add an agent — confirming the engine has switched to agent-sum billing. If your deployment is actually single-call, leave this section empty.
Common pipeline templates
▶Traffic mix presets how questions split across types
What this is. A preset selector for how your traffic splits across the question shapes defined above. "Worst case" treats every query as full-pipeline (the expensive shape); "mixed" is a realistic production blend; "lookup-heavy" weights toward RAG; "refusal-heavy" toward the cheap reject path. Weights inside each preset sum to 1.0.
Why it matters. The mix preset is the bridge between the per-shape token sizes (above) and the headline bill. Same fleet on "worst case" can quote 2–4× higher than on "mixed" — picking the wrong preset is one of the easiest ways to wildly over- or under-quote. Procurement reviewers will (and should) ask which preset you used.
How to interpret the results. If you have real traffic data, write a custom preset that matches the empirical distribution; if you don't yet, "mixed" is the right starting point with "worst case" as a defensive ceiling. The selected preset weights flow into the per-query cost calculation in Cost breakdown.
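How a preset's weights combine with the shape factors can be sketched as a weighted sum. The shape factors below are the approximate values listed under Question types; the "mixed" weights and the $0.02 canonical per-query cost are assumed for illustration and are not the calculator's actual preset values.

```python
# Approximate shape factors from the Question types section.
SHAPE_FACTORS = {"full": 1.0, "rag": 0.3, "refusal": 0.05, "heavy": 1.2}


def effective_cost_per_query(base_cost, mix_weights, shape_factors=SHAPE_FACTORS):
    """Blend the canonical per-query cost across question shapes."""
    if abs(sum(mix_weights.values()) - 1.0) > 1e-9:
        raise ValueError("mix weights must sum to 1.0")
    blended = sum(w * shape_factors[shape] for shape, w in mix_weights.items())
    return base_cost * blended


worst_case = {"full": 1.0}
mixed = {"full": 0.5, "rag": 0.3, "refusal": 0.1, "heavy": 0.1}  # ASSUMED weights

base = 0.02  # ASSUMED canonical cost per query, in dollars
cost_worst = effective_cost_per_query(base, worst_case)
cost_mixed = effective_cost_per_query(base, mixed)
```

With these illustrative weights the blended factor is 0.715, so "worst case" quotes about 1.4× the "mixed" figure; a preset that leans harder on lookups and refusals widens that gap.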
▶API reservations / committed-spend discount or PTU vs on-demand
What this is. Committed-spend reservations on managed-API hosting — Azure PTU (Provisioned Throughput Units), AWS Bedrock Provisioned, OpenAI Enterprise commit, Anthropic on-demand commit. You pay a fixed monthly fee in exchange for a discount vs. pure on-demand billing. The engine computes effective $/query under the reservation and compares against on-demand.
Why it matters. Reservations typically yield 30–50% savings once on-demand spend exceeds ~$10K/mo, breaking even somewhere between $5K–$8K/mo depending on the provider. Below that threshold the commitment locks you into paying for unused capacity; above, it's free money. Procurement decisions on year-2 reservations swing the 3-year TCO by 20–30%.
How to interpret the results. Only appears when hosting = managed API. Pick the provider that matches your contract vehicle; set units to your committed quantity. The engine shows the break-even point — if your projected monthly spend is below that, on-demand wins; above, the reservation. Mix in the Migration timeline section to model "on-demand year 1, reserved year 2-3" patterns.
Edit the rates / discounts in the Prices tab → API reservations table. Effective monthly cost shows in the report's "Reservation savings" row.
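The break-even logic can be illustrated with a deliberately simplified billing model: the commitment acts as a monthly floor, and usage is metered at a discounted rate. Real PTU-style reservations are sold as capacity rather than spend, so this is only a sketch; the function names, the $6K commitment, and the 40% discount are assumptions.

```python
def reserved_monthly(on_demand_spend, commit, discount):
    """Bill under a reservation: commitment is a floor, usage is discounted."""
    return max(commit, on_demand_spend * (1 - discount))


def reservation_savings(on_demand_spend, commit, discount):
    """Positive when the reservation beats on-demand, negative when it loses."""
    return on_demand_spend - reserved_monthly(on_demand_spend, commit, discount)


COMMIT, DISCOUNT = 6_000, 0.40  # ASSUMED contract terms

below = reservation_savings(4_000, COMMIT, DISCOUNT)      # paying for unused commitment
break_even = reservation_savings(6_000, COMMIT, DISCOUNT)
above = reservation_savings(10_000, COMMIT, DISCOUNT)     # full discount realized
```

In this model the break-even point is simply the commitment itself, and the full discount is only realized once on-demand spend reaches commit ÷ (1 − discount).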
▶Embeddings (RAG) ingest cost + per-query embedding
What this is. The dedicated embedding-token billing for RAG systems. Two cost lines: ingest (embedding the entire corpus into vectors, amortized over the re-embed cycle) and query-time (embedding each user question to run vector search). Independent of the LLM token bill — uses the chosen embedding model's separate rate card (text-embedding-3-small, voyage-3-large, etc.).
Why it matters. Often a forgotten line. For a 100M-token corpus + 1M queries/mo on text-embedding-3-small, this runs $50–$500/mo depending on re-embed frequency. Easy to miss until the first invoice. For very large corpora (500M+ tokens) on premium embedding models, the bill can rival the LLM line.
How to interpret the results. Skip entirely if you're not using RAG. Otherwise, set the corpus token count, the re-embed cadence (quarterly = 4×/yr, monthly = 12×/yr), and the embedding model. The cost shows up as a flat monthly line — predictable, but rises with corpus growth.
▶Migration timeline phased deployment over 36 months
What this is. Phased deployment plan over the procurement window — typically 36 months for federal, 24 for commercial. Each phase has its own hosting strategy, reservation, traffic level, and infrastructure cost. The engine computes a per-phase monthly bill plus the total multi-year TCO, with a cost-over-time chart showing where phases transition.
Why it matters. Real procurement isn't "pick one hosting, run it for 3 years" — it's "API year 1 to prove value, reserved year 2 to lock in discount, self-host year 3 if scale justifies the capex". Modeling the full curve lets you defend the multi-year ask in a single chart and identify break-even points where strategy shifts pay back.
How to interpret the results. Default 1 phase (matches the current single-strategy headline). Add phases when modeling a transition; each phase's overrides only affect that phase's monthly bill. The cumulative 3-year cost (bottom of chart) is the number that ends up in the procurement memo's bottom line.
▶Personnel / staffing FTE allocations × loaded annual salary
What this is. Ongoing post-launch staffing — the people who keep the deployment running after go-live. Each role: FTE allocation × annual base × total-comp multiplier (1.30 = +30% benefits/overhead) ÷ 12. Set FTE to 0 to skip a role. Salaries are editable in the Prices tab → Personnel. Upfront design effort (pre-launch) lives in Agent engineering below.
Why it matters. Federal RFPs require fully-loaded labor in the cost basis — leaving it out makes the proposal look amateur or non-compliant. Personnel is often 40–70% of a federal AI deployment's TCO; ignoring it under-quotes by 2–3×. Procurement reviewers will compare your loaded rates against the GSA schedule, so use realistic numbers.
How to interpret the results. Start with the roles that always exist (Product Owner, MLOps Engineer, on-call SRE), add specialists as the deployment grows (Prompt Engineer for high-volume tuning, Compliance Officer for federal). FTE = 0.25 means quarter-time. The monthly total flows into the headline as a separate "Personnel" line, broken out in the report.
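The per-role formula (FTE × annual base × total-comp multiplier ÷ 12) is straightforward to check by hand. In the sketch below the role names come from the text above, while the FTE allocations and base salaries are invented placeholders, not values from the Prices tab.

```python
def monthly_personnel(roles, comp_multiplier=1.30):
    """Sum of FTE x annual base x total-comp multiplier / 12 over all roles.

    `roles` maps a role name to (fte, annual_base_usd). Roles with FTE 0
    contribute nothing, matching "set FTE to 0 to skip a role".
    """
    return sum(fte * annual_base * comp_multiplier / 12
               for fte, annual_base in roles.values())


# ASSUMED allocations and salaries, for illustration only.
roles = {
    "Product Owner": (0.25, 160_000),
    "MLOps Engineer": (1.0, 180_000),
    "On-call SRE": (0.5, 170_000),
    "Compliance Officer": (0.0, 150_000),  # skipped role
}
personnel_line = monthly_personnel(roles)
```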
▶Agent engineering upfront design effort + maintenance — amortized monthly
What this is. The upfront engineering effort to design, build, and ship the agent system, plus ongoing maintenance. Roles + FTE during the design phase (SME interviews, spec writing, eval criteria, prototype, calibration runs), amortized over the operational lifespan, plus a maintenance cadence for periodic re-specification.
Why it matters. A pure-token cost estimate underestimates real procurement spend by 30–50% because it ignores the team building the thing. Procurement reviewers expect this line; leaving it out makes the proposal look amateur. Federal RFPs in particular require pre-deployment design effort as a separately-itemized cost basis line.
How to interpret the results. Enable the section, add roles (Agent Design Lead, MLOps Engineer, Prompt Engineer, etc.), set FTE allocation during the design phase. The calculator amortizes the upfront cost over the project's useful life (default 36 months) and adds recurring maintenance hours. Output: a monthly $ line that flows into the headline alongside Personnel and Operations.
Roles × FTE during design phase
Each row uses the same loaded-salary math as Personnel. Edit role salaries in Prices → Personnel.
Maintenance — re-engineering cadence
▶Traffic safeguards bot rate limiting
What this is. Bot rate-limiting on anonymous traffic — sets requests-per-IP-per-minute thresholds that protect against runaway cost from crawlers, scrapers, and abuse. Independent of authentication (logged-in users typically aren't rate-limited; anonymous public visitors are).
Why it matters. Without rate-limiting, a single misconfigured search bot can generate 10K+ queries/hour and rack up thousands in a day. Public AI deployments have hit five-figure surprise bills from this exact pattern. Rate-limiting is cheap insurance; the right threshold is "above any legitimate user pattern, below crawler aggression".
How to interpret the results. Set the per-IP-per-minute cap to ~5–10× a power-user's burst rate (so legit users never hit it). The bot-factor multiplier in the Your-users section is the counterpart: it's how you size the bill before rate-limiting clips the abuse tail. Rate-limiting sets the worst-case cost ceiling; bot-factor sizes the expected cost.
Public endpoints get hammered by bots. Rate-limiting strategy ranges from cheap edge filters to full WAF + CAPTCHA. The bot ceiling caps the "bot factor" that scales anonymous traffic (set in Project profile).
▶Self-host capacity only matters if you self-host
What this is. The GPU capacity-planning section for self-hosted deployments. Catalog of GPU specs (H100, A100, MI300X, etc.), per-card throughput, replica count, utilization assumption, and cloud-vs-bare-metal pricing. The currently-selected GPU is marked with a ● dot. Only appears when hosting = self-host.
Why it matters. Self-host TCO is a different shape from API: high fixed monthly (GPU rent), low marginal-per-query. Break-even vs. API typically sits at $20K–$50K/mo of equivalent on-demand spend, depending on model size. Mis-sizing GPU count by 20% changes the bill by 20%, since unused capacity still bills. Utilization assumption is the biggest unknown: most production fleets average around 30%.
How to interpret the results. Pick the GPU that matches your model: H100 for 70B+ class, A100 for 13B–34B, L40S for under 13B. Set replicas to peak QPS ÷ per-GPU throughput, with a 30–50% safety margin. The headline "$/query effective" tells you when self-host beats API; cross-reference with the Migration timeline to model "API now, self-host at scale" transitions.
Critical for bursty traffic. A NOAA storm explainer might run only ~10% of the month; cutting GPU hours 10× changes the API-vs-self-host tradeoff dramatically.
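The replica rule and the effect of a short duty cycle can be sketched together. The function names, the 40 QPS peak, 6 QPS per GPU, and the $3.00/hour GPU rate below are all assumed for illustration; 730 is used as an approximate number of hours in a month.

```python
import math


def replicas_needed(peak_qps, per_gpu_qps, safety_margin=0.3):
    """Peak QPS / per-GPU throughput, padded by the safety margin, rounded up."""
    return math.ceil(peak_qps / per_gpu_qps * (1 + safety_margin))


def self_host_monthly(replicas, gpu_hourly_usd, duty_cycle=1.0, hours_per_month=730):
    """GPU rent for the month; duty_cycle is the fraction of hours GPUs are up."""
    return replicas * gpu_hourly_usd * hours_per_month * duty_cycle


# ASSUMED workload and pricing.
replicas = replicas_needed(peak_qps=40, per_gpu_qps=6, safety_margin=0.3)
always_on = self_host_monthly(replicas, gpu_hourly_usd=3.00)
bursty = self_host_monthly(replicas, gpu_hourly_usd=3.00, duty_cycle=0.10)
```

Running the same fleet for only 10% of the month cuts GPU rent tenfold, which is the bursty-traffic effect described above. It only applies when capacity can actually be released between bursts.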
▶Budget solver given a budget, find the affordable scale
What this is. Inverse mode of the calculator — instead of "given my inputs, what's the bill?", asks "given my budget ceiling, what's the maximum scale I can support?". Enter a target monthly budget; the engine solves for the maximum MAU the deployment fits under, plus enumerates which tradeoffs (cache hit, model swap, batch tier) would unlock more headroom without exceeding budget.
Why it matters. Procurement conversations usually start with a budget envelope ("you've got $50K/mo to work with"), not a traffic estimate. Forward-quoting from inputs leaves you guessing whether you'll fit; the solver tells you directly. Also tells you whether cost-optimization moves (cheaper model, batch async) buy you 10% or 60% more capacity — which informs whether they're worth the engineering effort.
How to interpret the results. Enter the monthly $ ceiling. The solver returns the max MAU at current per-query cost, plus a ranked list of moves that would extend headroom (e.g., "+18% MAU if cache hits 80%", "+45% MAU if you swap Drafter from Opus → Sonnet"). Use this to negotiate the budget envelope with stakeholders before locking in.
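In its simplest form, the inverse calculation divides the variable part of the budget by the cost of serving one monthly active user. The sketch below assumes a linear cost model with a fixed monthly overhead; the function names and all numbers are hypothetical, and the real solver may account for effects this ignores.

```python
import math


def max_mau(budget, fixed_monthly, cost_per_query, queries_per_mau):
    """Largest MAU whose variable cost fits under the budget ceiling."""
    variable_budget = budget - fixed_monthly
    if variable_budget <= 0:
        return 0
    return math.floor(variable_budget / (cost_per_query * queries_per_mau))


def headroom_gain(cost_reduction):
    """Extra MAU (as a fraction) unlocked by cutting per-query cost.

    A reduction of r shrinks cost to (1 - r), so capacity grows by 1/(1 - r) - 1.
    """
    return 1 / (1 - cost_reduction) - 1


# ASSUMED: $50K/mo ceiling, $10K/mo fixed lines, $0.025/query, 40 queries per MAU.
affordable = max_mau(50_000, 10_000, 0.025, 40)

# A move that cuts per-query cost by 20% buys 25% more MAU, not 20%.
gain = headroom_gain(0.20)
```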
▶Model cost comparison switch models to see headline savings
What this is. Procurement-grade model comparison table. Re-runs the full deployment (same workload, hosting, infrastructure, federal multipliers, personnel) against every model in the rate card. Each row shows the model, per-query cost, monthly and annual bill, and the annual Δ vs. your current pick. Sorted cheapest to most expensive.
Why it matters. The model row in an RFP cost basis usually requires multiple-vendor pricing for fairness. This section generates the table directly so you don't have to re-run the calculator manually for each candidate. The annual Δ column is the number stakeholders care about when comparing "what if we used vendor X instead".
How to interpret the results. Cheapest is rarely the right answer for production — quality differences between, e.g., Haiku and Opus on the same workload can be substantial. Use this as a procurement reference, not a selection tool; pair with quality evals before committing. For per-agent model swaps (mixed fleets), Section C's "Compare models" expander is more precise than this whole-fleet uniform view.
| Model | $/query | $/month | $/year | Δ vs current (annual) |
|---|---|---|---|---|
⚠ Cheaper ≠ better. Smaller models (gpt-5-nano, haiku-4.5) cost a fraction of flagship models but lose accuracy on multi-step reasoning, RAG faithfulness, and tool use. Validate any switch against your own evaluation set before committing — savings here are only meaningful if the cheaper model still meets your quality bar.
▶Sensitivity how robust is the headline to input drift?
What this is. Procurement-side sensitivity analysis. Each row perturbs one input around its baseline and shows the resulting monthly cost. Sorted by impact (biggest driver on top). The black tick on each bar marks the baseline; red extends to the low case, green to the high case.
Why it matters. Procurement reviewers and finance committees require an explicit sensitivity section in any major-spend proposal. "Here's the central estimate" without "and here's what happens if MAU misses by 20%" is incomplete — it doesn't let stakeholders gauge whether the budget envelope has enough margin for realistic uncertainty.
How to interpret the results. The top bars are the inputs you should validate most carefully before signing — they're the ones that move the bill most. If any single-input ±20% perturbation blows past your budget ceiling, the proposal needs either a larger envelope or stronger optimization levers (cache, model swap, batch) before it's defensible. Copy the top 3 rows into your procurement memo's "Sensitivity" paragraph.
| Lever | Low case | Headline range | High case |
|---|---|---|---|
Perturbations: MAU ±20%, Cache hit rate ±10pp, Provider rates ±15%, Bot factor ±20%, Turns/session ±20%. Cache and rate perturbations approximate the deltas analytically — re-run with your own scenario knobs for exact figures.
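A one-at-a-time perturbation table of this kind can be reproduced with a short loop. The sketch below uses a toy multiplicative cost function and invented baseline values as stand-ins for the real engine; only the perturbation sizes (MAU ±20%, provider rates ±15%, bot factor ±20%) are taken from the list above.

```python
def toy_cost(mau, questions_per_mau, cost_per_query, bot_factor):
    """Stand-in for the engine: a purely multiplicative monthly bill."""
    return mau * questions_per_mau * cost_per_query * bot_factor


def one_at_a_time(cost_fn, baseline, perturbations):
    """Perturb each input in isolation; return rows sorted by swing, largest first."""
    base = cost_fn(**baseline)
    rows = []
    for name, pct in perturbations.items():
        low = cost_fn(**{**baseline, name: baseline[name] * (1 - pct)})
        high = cost_fn(**{**baseline, name: baseline[name] * (1 + pct)})
        rows.append((name, low, base, high))
    rows.sort(key=lambda r: r[3] - r[1], reverse=True)
    return rows


# ASSUMED baseline inputs.
baseline = {"mau": 10_000, "questions_per_mau": 40,
            "cost_per_query": 0.02, "bot_factor": 1.5}
perturbations = {"mau": 0.20, "cost_per_query": 0.15, "bot_factor": 0.20}

rows = one_at_a_time(toy_cost, baseline, perturbations)
```

In a purely multiplicative model, equal percentage perturbations produce equal swings, so MAU and bot factor tie here; the real engine's non-linear terms (cache, fixed lines) are what separate the bars in practice.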
▶Cost over time monthly + cumulative projection given growth%
What this is. Time-series cost projection. Takes the current monthly headline and projects it forward using the Growth/month slider in Section A (default 20%/mo). Shows two curves: monthly cost climbing month-over-month, and running 36-month cumulative TCO.
Why it matters. Procurement envelopes are sized in years, not months. A $20K/mo deployment with 20% MoM growth hits roughly $60K/mo by month 6 and roughly $180K/mo by month 12. Sizing the year-1 budget envelope from the current monthly bill is the most common way procurements run out of money mid-cycle. The cumulative curve is what you bring to the finance review.
How to interpret the results. Set growth/month conservatively: 5–15% for known-audience internal tools, 20–50% for public launches, and >50% only for a genuinely viral profile. The 12-month cumulative is your year-1 budget ask; the 36-month is your multi-year envelope. If the curve looks too aggressive, revisit the growth assumption or build cost-optimization milestones into the Migration timeline.
Compounds the current simulator growth rate monthly. Caps at 36 months since further projections are unreliable. Headcount, seat-license, and contract reservations don't grow proportionally — adjust manually for those.
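The compounding can be sketched as follows. This is an illustration under one assumed convention, namely that entry k of the output is the bill after k compounding steps from today's headline; the calculator may index months differently. The function name `project_costs` is hypothetical.

```python
def project_costs(monthly_now: float, growth_per_month: float, months: int = 36):
    """Return (monthly, cumulative) lists of projected cost in the input currency.

    Assumed convention: monthly[k - 1] is the bill after k compounding steps.
    """
    months = min(months, 36)  # projections are capped at 36 months
    monthly, cumulative = [], []
    running_total = 0.0
    for k in range(1, months + 1):
        cost = monthly_now * (1.0 + growth_per_month) ** k
        running_total += cost
        monthly.append(cost)
        cumulative.append(running_total)
    return monthly, cumulative
```

Read the year-1 ask from `cumulative[11]` and the multi-year envelope from `cumulative[35]`. Fixed lines such as headcount, seat licenses, and contract reservations should be added on top rather than passed through the growth rate.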
▶Side-by-side compare two scenarios diffed line-by-line
What this is. A diff tool. Picks two scenarios — your live config plus a saved or bundled comparison — and runs both through the same engine. Side-by-side table of inputs and outputs, with the % difference highlighted on every row that differs.
Why it matters. "Why does scenario A cost $30K and scenario B cost $50K?" is a question every procurement reviewer asks. Hunting through two configs manually is tedious and error-prone. A diff table answers it in one screen — which inputs differ, which outputs differ, and which difference explains the bulk of the gap.
How to interpret the results. Use this to defend a configuration change ("we cut the bill 40% by reducing RAG chunks and swapping the Drafter model"), or to compare your proposal against an alternative architecture. Rows with the largest % delta are the most important to explain in the procurement memo.
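As a rough illustration of the diff logic, the sketch below compares two flat dictionaries of inputs and outputs. The key names in the usage example are hypothetical, and the real tool runs both scenarios through the cost engine first; this only shows the row-by-row comparison and the sort by % delta.

```python
def diff_scenarios(a: dict, b: dict):
    """Return (key, a_value, b_value, pct_delta) for every key whose value differs.

    pct_delta is relative to scenario A and is None when it cannot be computed
    (non-numeric values, a key missing from one side, or a zero baseline).
    """
    rows = []
    for key in sorted(set(a) | set(b)):
        value_a, value_b = a.get(key), b.get(key)
        if value_a == value_b:
            continue
        numeric = isinstance(value_a, (int, float)) and isinstance(value_b, (int, float))
        pct = (value_b - value_a) / value_a * 100.0 if numeric and value_a != 0 else None
        rows.append((key, value_a, value_b, pct))
    # Largest absolute % change first; rows without a % delta go last.
    return sorted(rows, key=lambda r: (r[3] is None, -abs(r[3] or 0.0)))
```

For example, `diff_scenarios({"rag_chunks": 8, "monthly_cost": 50_000}, {"rag_chunks": 4, "monthly_cost": 30_000})` lists `rag_chunks` first (a 50% cut) and `monthly_cost` second (a 40% cut).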
▶AS-IS vs proposed compare against your current contract
What this is. Incumbent-vs-proposed comparison. Enter what you're paying today (or what an incumbent vendor has quoted) and the calculator surfaces the delta against the proposed deployment — monthly savings or overrun, plus the payback window if you've entered a one-time migration cost.
Why it matters. Procurement justifications almost always require an AS-IS baseline: "we're saving $X/mo vs. the current contract" is a much stronger pitch than "the new system costs $Y/mo". Payback windows convert a one-time migration spend (engineering effort, data migration, re-certification) into a months-to-recoup number that finance committees actually use.
How to interpret the results. Enter the incumbent's monthly bill from the actual contract or invoice (not a guess). One-time migration costs should include engineering hours × loaded rate, plus any data-migration / re-certification fees. Payback <6 months is usually an easy approval; >18 months requires defending why the move is strategic beyond pure cost.
Today's annual spend on whatever this deployment replaces. Vendor invoice, internal cost-allocation, or incumbent quote.
Switching cost — data migration, training, parallel running, contract exit fees. Use 0 if greenfield.
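The payback window reduces to one division. A minimal sketch, assuming both bills are expressed per month (divide an annual figure by 12 first); the function name and the handling of the edge cases are illustrative choices, not the calculator's documented behavior.

```python
import math


def payback_months(incumbent_monthly: float, proposed_monthly: float,
                   migration_cost: float) -> float:
    """Months needed for monthly savings to recoup a one-time migration cost.

    Returns 0.0 when there is nothing to recoup (greenfield) and math.inf
    when the proposal does not save money.
    """
    if migration_cost <= 0:
        return 0.0
    savings = incumbent_monthly - proposed_monthly
    if savings <= 0:
        return math.inf
    return migration_cost / savings
```

With a $50K/mo incumbent, a $30K/mo proposal, and $100K of switching cost, the payback is 5 months, inside the "easy approval" range described above.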
▶Infrastructure database, storage, networking, monitoring — fixed monthly costs
What this is. Fixed cloud-infrastructure costs that show up regardless of LLM hosting choice — RDS, S3, CloudWatch, ALB, NAT Gateway, Route 53, secrets manager, observability stack. Each line accepts flat $/mo, $ per query, or $ per GB scaling so you can model both fixed and traffic-scaled lines.
Why it matters. Infrastructure is usually the smaller part of the total bill (5–15% of TCO for managed-API deployments; higher for self-host), but the lines add up: a federal-tier deployment can easily run $2K–$5K/mo on observability, audit logging, and redundant networking alone. These costs are easy to forget when modeling and embarrassing to discover in the first invoice.
How to interpret the results. Pull line items from your existing AWS/Azure/GCP bill if you have one; that's the most accurate source. For greenfield estimates, the bundled defaults are a reasonable starting point for a single-region production deployment. The total flows into the headline as a separate "Infrastructure" line, broken out in the report so reviewers can see it isn't bundled into the model cost.
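A sketch of how the three scaling modes could combine into one monthly figure. The `InfraLine` type, the unit labels, and the rates in the usage note are hypothetical placeholders rather than the calculator's schema or defaults.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InfraLine:
    name: str
    unit: str    # "flat" ($/mo), "per_query" ($/query), or "per_gb" ($/GB)
    rate: float  # USD in the chosen unit


def infra_monthly(lines, queries_per_month: float, gb_per_month: float) -> float:
    """Sum fixed and traffic-scaled infrastructure lines into USD per month."""
    multiplier = {
        "flat": 1.0,
        "per_query": queries_per_month,
        "per_gb": gb_per_month,
    }
    return sum(line.rate * multiplier[line.unit] for line in lines)
```

For instance, a $400/mo flat database line, a $0.0002/query logging line at 1M queries, and a $0.023/GB storage line at 500 GB add up to about $611.50/mo.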
📊 Published cost benchmarks
Real cost numbers cited in vendor pricing pages, earnings calls, GAO reports, and academic studies. Use these to sanity-check your calculator output and to defend procurement budgets with citations. Click any source link to verify the number.
💲 Price book
Single source of truth for every price the calculator uses — LLM rates, API reservations, GPU instances, embeddings, vector DBs, AWS infra, personnel salaries, ATO costs. Edits here override the defaults for this workload only and persist in the URL hash. Defaults are validated against vendor pricing pages; last full refresh: —. Future plan: a scraper periodically fetches each source_url and bumps last_verified.
Charts: Cost composition, API vs self-host comparison, Per-segment breakdown, Infrastructure breakdown.
Derivation of your numbers
Full math trace for this workload — copy-paste into ChatGPT/Claude/Gemini and ask "verify this math". For source citations and formula details, see Methodology in the sidebar.
Methodology, sources & disclosures planning only — refresh prices before procurement
Pricing sources, token-counting heuristics, confidence-interval math, and what this calculator does not model. For the math applied to your numbers, see Derivation of your numbers above.
Pricing sources:
- Anthropic API: Claude Opus 4.7, Sonnet 4.6, Haiku 4.5; includes cache write/read pricing
- OpenAI API: GPT-5.5, GPT-5.5 Pro, GPT-5.4 family, embeddings, web/file search, containers, and short/long context price tiers
- Google Gemini API: Gemini 3.1 Pro Preview, Gemini 3 Flash Preview, Gemini 3.1 Flash-Lite Preview, Gemini 2.5 Flash, Gemini 2.5 Flash-Lite
- Together AI: Llama 3.3 70B Turbo provider-specific bootstrap rate
Confidence-interval math:
- p90 = exp(μ + 1.282σ)
- p99 = exp(μ + 2.326σ)
- σ = √(ln(1 + CV²)), where CV = coefficient of variation weighted by task mix
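The three formulas above describe percentiles of a lognormal distribution. The sketch below applies them to a mean cost and a coefficient of variation. The source does not say how μ is obtained, so the code assumes μ = ln(mean) − σ²/2, which is the standard choice that makes the lognormal's mean equal the given mean; the function name is hypothetical.

```python
import math


def lognormal_percentiles(mean: float, cv: float) -> dict:
    """p90 and p99 of a lognormal with the given mean and coefficient of variation."""
    sigma = math.sqrt(math.log(1.0 + cv ** 2))
    # Assumption (not stated in the source): mu is chosen so the mean is preserved.
    mu = math.log(mean) - sigma ** 2 / 2.0
    return {
        "p90": math.exp(mu + 1.282 * sigma),
        "p99": math.exp(mu + 2.326 * sigma),
    }
```

With CV = 0 both percentiles collapse to the mean; as CV grows, p99 pulls away from p90 much faster than p90 pulls away from the mean, which is why heavy-tailed task mixes widen the budget envelope.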
self_host.setup_amortized: roles × FTE × duration ÷ 12 → upfront total → ÷ amortization months → amortized monthly. Maintenance accounts for periodic re-specification as the domain drifts (design-lead loaded hourly × hours per session ÷ months between sessions).
Methodology-agnostic. The same shape models any structured agent-design methodology — DSPy, structured-prompt extraction, design-pattern playbooks, etc. Adjust role FTEs and duration to match your engagement.
Example — CARE (Collaborative Agent Reasoning Engineering): a three-party stage-gated approach pairing subject-matter experts, developers, and helper agents to produce structured agent specifications (interaction requirements, reasoning policies, evaluation criteria). The defaults shipped here (4-month design phase, 0.5/1.0/1.0/0.25 FTE for SME / design lead / developer / eval engineer, ~$400/mo helper-agent budget, quarterly re-spec cadence) approximate a CARE engagement. Ramachandran, Jha & Ramasubramanian, 2026 (arXiv:2604.28043).
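The amortization chain can be sketched as below. The source formula reads "roles × FTE × duration ÷ 12"; this sketch assumes "roles" means each role's loaded annual salary, which is an interpretation rather than something the text confirms. The salary figures in the usage note are invented, and the helper-agent budget is left out.

```python
def setup_amortized_monthly(roles, duration_months: float,
                            amortization_months: float) -> float:
    """Amortized monthly cost of the design phase.

    roles: iterable of (loaded_annual_salary_usd, fte) pairs (assumed meaning).
    """
    upfront_total = sum(
        salary * fte * duration_months / 12.0 for salary, fte in roles
    )
    return upfront_total / amortization_months


def maintenance_monthly(design_lead_hourly: float, hours_per_session: float,
                        months_between_sessions: float) -> float:
    """Monthly cost of periodic re-specification sessions."""
    return design_lead_hourly * hours_per_session / months_between_sessions
```

Using the shipped FTE split (0.5 / 1.0 / 1.0 / 0.25) over a 4-month design phase with hypothetical loaded salaries of $180K, $200K, $170K, and $160K, the upfront total is about $166.7K; amortized over 24 months that is roughly $6.9K/mo. A quarterly 16-hour re-spec session at a hypothetical $120/hour adds $640/mo.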
Token-counting heuristics:
- Images: 1568 tokens per 1568×1568 image (Anthropic), variable for OpenAI low/high detail mode
- Audio (STT): ~25 tokens/sec for English speech (Whisper baseline)
- PDF pages: ~1500 tokens/page average (varies by content density)
- Code interpreter: stdout/stderr counted as input tokens on next turn
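A small sketch that applies the image, audio, and PDF heuristics above to one multimodal request. The constants are copied from the list; the function name is hypothetical, and the image figure only applies to the Anthropic 1568×1568 case.

```python
TOKENS_PER_IMAGE = 1568          # per 1568x1568 image (Anthropic heuristic)
TOKENS_PER_AUDIO_SECOND = 25     # English speech, Whisper baseline
TOKENS_PER_PDF_PAGE = 1500       # average; varies with content density


def estimate_input_tokens(images: int = 0, audio_seconds: float = 0.0,
                          pdf_pages: int = 0) -> int:
    """Rough input-token estimate for non-text attachments."""
    tokens = (
        images * TOKENS_PER_IMAGE
        + audio_seconds * TOKENS_PER_AUDIO_SECOND
        + pdf_pages * TOKENS_PER_PDF_PAGE
    )
    return round(tokens)
```

Two images, one minute of audio, and a ten-page PDF come to 19,636 input tokens under these heuristics.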
What this calculator does not model:
- Fine-tuning training cost amortisation (separate calculator recommended)
- Infrastructure/hosting costs for self-hosted deployments (separate calculator recommended)
- Human-in-the-loop reviewer time costs (separate calculator recommended)
- Volume discount tiers (negotiate directly with vendors above 100M tok/mo)
- Network egress, storage, vector DB operational costs beyond the optional file-search/container placeholders
- Compliance overhead (FedRAMP, HIPAA, SOC2 audit costs)
- Latency SLA penalty costs
Hosting multipliers (applied to vendor list price):
- Managed API: 1.00× (direct vendor list price)
- BYOK: 1.00× (bring-your-own-key, no aggregator markup)
- AWS Bedrock: 1.05× (typical 5% AWS markup, GovCloud available)
- Azure OpenAI: 1.00× (parity with OpenAI direct, enterprise compliance)
- OpenRouter: 1.05× (typical aggregator markup, varies)
- Self-Hosted: 0× per-token + $5,000/mo fixed cost (default; configurable; covers H100/A100 instance + light ops)
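One way to read the list above is as a per-token multiplier plus a fixed monthly cost for each hosting mode. The sketch below encodes the listed defaults that way; the dictionary keys and function name are hypothetical, and the real self-host model has more terms than a single fixed figure.

```python
# (per-token multiplier on list price, fixed USD per month), defaults from the list above
HOSTING = {
    "managed_api": (1.00, 0.0),
    "byok": (1.00, 0.0),
    "aws_bedrock": (1.05, 0.0),
    "azure_openai": (1.00, 0.0),
    "openrouter": (1.05, 0.0),
    "self_hosted": (0.00, 5000.0),
}


def hosted_monthly_cost(list_price_token_cost: float, hosting: str) -> float:
    """Monthly cost given the token spend at vendor list price and a hosting mode."""
    multiplier, fixed_monthly = HOSTING[hosting]
    return list_price_token_cost * multiplier + fixed_monthly
```

Under these defaults alone, a workload that would cost $10,000/mo at list price comes to about $10,500/mo through a 1.05× channel and $5,000/mo self-hosted, so the crossover against a 1.00× channel sits at $5,000/mo of token spend.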
Retry overhead: retry_rate × 1.5 × base_cost. The 1.5 factor accounts for the partial output already generated before the failure plus the full retry call. Previous versions used retry_rate × 1.0 × base_cost, which underestimates real retry waste by ~33%.
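The retry formula and the ~33% figure can be checked with a few lines. This is only a restatement of the formula above in code; the function and parameter names are illustrative.

```python
def retry_overhead(base_cost: float, retry_rate: float,
                   waste_factor: float = 1.5) -> float:
    """Extra monthly cost caused by retries.

    waste_factor = 1.5 counts the partial output generated before the failure
    (0.5) plus the full retry call (1.0).
    """
    return retry_rate * waste_factor * base_cost


current = retry_overhead(10_000, 0.05)          # 1.5x model
previous = retry_overhead(10_000, 0.05, 1.0)    # older 1.0x model
underestimate = 1.0 - previous / current        # share of retry waste the old model missed
```

Here `underestimate` is one third regardless of the base cost or retry rate, since it depends only on the ratio 1.0 / 1.5.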