▶Project profile name, team, audience
Identify the deployment. These labels appear at the top of the report. The Public-facing? toggle drives the bot-factor multiplier and informs sensible daily-cap recommendations downstream.
Pick where the LLM runs. Downstream sections (Reservations, Self-host capacity, On-prem amortization) auto-show only when relevant.
Advanced — token budgets per call
| Model | API total | RAG share | Reason share | Guard+ | Tool tok | Tool fees+ | Cache saved | Retry+ | p50 | p90 | p99 |
|---|---|---|---|---|---|---|---|---|---|---|---|
▶Federal compliance & hosting FedRAMP, ATO, egress, audit retention
Federal AI deployments cost more than commercial. Two kinds of overhead apply: multipliers (FedRAMP/GovCloud premium and multi-region redundancy multiply your LLM compute cost), and additive line items (ATO, egress, log retention, vector DB hosting). Set to "none / single / 0" if you're modeling a commercial deployment.
Hosting multipliers
Multipliers stack — e.g. FedRAMP High + active-active = ×2.60 on LLM and GPU costs.
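A minimal sketch of the stacking rule, using illustrative factor values (×1.30 FedRAMP High, ×2.00 active-active) chosen only to reproduce the ×2.60 example; they are assumptions, not shipped defaults:

```ts
// Illustrative sketch; the factor values are assumptions, not the calculator's defaults.
const hostingMultipliers = [1.3, 2.0]; // assumed FedRAMP High premium, active-active redundancy

// Multipliers stack multiplicatively and apply to LLM + GPU compute cost only.
const stacked = hostingMultipliers.reduce((acc, m) => acc * m, 1.0); // x2.60
const adjustedCompute = (llmAndGpuMonthly: number): number => llmAndGpuMonthly * stacked;
```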
Additive monthly costs
Not modeled: ATO assessment labor (use Agent engineering or Personnel for that), sole-source procurement overhead, GSA Schedule discounts, model availability constraints (some Anthropic/OpenAI models aren't in GovCloud). Add fixed-cost approximations as line items in Infrastructure below.
▶Your typical question token sizes for one canonical query
A "token" is roughly ¾ of a word (so 1,000 tokens ≈ 750 English words). The cost depends on how big each question and answer is. Don't know your numbers? Use the wizard below — it asks plain-English questions and figures out the token math for you.
▶Question types full-pipeline, lookup-only, refusal, …
Different questions have different token costs. A simple lookup uses fewer tokens than a full multi-tool pipeline. Each shape's factor scales the typical query above (1.0 = same size; 0.5 = half).
▶Multi-agent pipeline optional — if your system uses more than one LLM call per query
Many real deployments make several LLM calls per user question — e.g., a Planner decides which tools to call, a Retriever ranks documents, a Tool agent formulates a database query, a Summarizer writes the answer. Each agent has its own token budget. If you fill this in, the engine uses agent-sum instead of "Your typical question" above. Leave empty if your deployment is single-call.
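A sketch of the agent-sum: per-agent token budgets summed into one per-query total. The stage names and budgets below are illustrative assumptions, not the calculator's actual schema:

```ts
// Illustrative pipeline; stage names and token budgets are assumptions.
interface AgentBudget { name: string; inTokens: number; outTokens: number }

const pipeline: AgentBudget[] = [
  { name: "planner",    inTokens: 1_200, outTokens: 200 },
  { name: "retriever",  inTokens: 3_000, outTokens: 150 },
  { name: "summarizer", inTokens: 2_500, outTokens: 600 },
];

// Agent-sum: this total replaces the single "typical question" budget above.
const perQuery = pipeline.reduce(
  (acc, a) => ({ in: acc.in + a.inTokens, out: acc.out + a.outTokens }),
  { in: 0, out: 0 },
); // { in: 6700, out: 950 }
```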
Common pipeline templates
▶Traffic mix presets how questions split across types
Each preset is a different traffic distribution — e.g., "worst case" treats every question as full-pipeline; "mixed" reflects realistic production traffic. Weights must sum to 1.
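A sketch of how a preset's weights blend the per-type factors into one effective scale on the typical query. The type names, factors, and weights are illustrative:

```ts
// Illustrative preset; shape factors and weights are assumptions.
const mix = [
  { type: "full-pipeline", factor: 1.0, weight: 0.3 },
  { type: "lookup-only",   factor: 0.5, weight: 0.6 },
  { type: "refusal",       factor: 0.1, weight: 0.1 },
];

const weightSum = mix.reduce((s, m) => s + m.weight, 0);
if (Math.abs(weightSum - 1) > 1e-9) throw new Error("weights must sum to 1");

// Effective multiplier applied to the typical-query token budget: 0.61 here.
const blendedFactor = mix.reduce((s, m) => s + m.weight * m.factor, 0);
```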
▶Your users monthly active users, sessions, questions
Break your audience into segments — e.g., "internal staff" vs "public visitors". Each has a different usage pattern. Anonymous public segments get a "bot factor" multiplier to account for crawlers and abuse.
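A sketch of per-segment monthly volume under assumed field names; the bot factor (>1) applies only to anonymous public segments:

```ts
// Field names are assumptions; the bot factor inflates anonymous traffic for crawlers/abuse.
interface Segment {
  mau: number;
  sessionsPerUser: number;
  questionsPerSession: number;
  botFactor: number; // 1.0 for authenticated internal staff
}

const monthlyQueries = (s: Segment): number =>
  s.mau * s.sessionsPerUser * s.questionsPerSession * s.botFactor;

monthlyQueries({ mau: 50_000, sessionsPerUser: 2, questionsPerSession: 3, botFactor: 1.4 });
// = 420,000 queries/mo
```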
▶API reservations / committed-spend discount or PTU vs on-demand
Reserved API capacity (Azure PTU, AWS Bedrock provisioned, OpenAI Enterprise commit) buys a discount in exchange for commitment. Often 30–50% effective savings at >$10K/mo on-demand spend. Choose the plan + size in units (PTU-style) — the engine swaps in the right cost model. Requires API hosting.
Edit the rates / discounts in the Prices tab → API reservations table. Effective monthly cost shows in the report's "Reservation savings" row.
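A sketch of the swap the engine makes, with illustrative numbers; the unit sizing and committed rate are assumptions, not vendor quotes:

```ts
// Illustrative only; edit real rates in the Prices tab.
const onDemandMonthly = 15_000;  // $/mo at list price for the same workload
const units = 4;                 // PTU-style reserved units purchased
const unitMonthly = 2_400;       // assumed $/unit/mo committed rate

const reservedMonthly = units * unitMonthly;        // $9,600/mo
const savings = onDemandMonthly - reservedMonthly;  // $5,400/mo
const savingsPct = savings / onDemandMonthly;       // 36%, inside the 30-50% band
```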
▶Embeddings (RAG) ingest cost + per-query embedding
RAG systems pay for embeddings twice: once at ingest (chunking the corpus into vectors, amortized over the re-embed cycle) and once at query time (embedding the user question to do vector search). For a 100M-token corpus + 1M queries/mo this often runs $50–$500/mo. Skip if not using RAG.
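A sketch of the two charges. The embedding rate, question length, and re-embed cycle below are placeholder assumptions; real rates live in the Prices tab:

```ts
// All numbers are illustrative assumptions.
const embedRatePerMTok = 0.10;   // assumed $ per 1M embedding tokens
const corpusTokens = 100e6;      // 100M-token corpus
const reembedCycleMonths = 6;    // corpus fully re-embedded twice a year

// Ingest: one-time embed cost amortized over the re-embed cycle.
const ingestMonthly = (corpusTokens / 1e6) * embedRatePerMTok / reembedCycleMonths;

// Query time: embedding each user question for vector search.
const queriesPerMonth = 1e6;
const tokensPerQuestion = 40;    // assumed average question length
const queryMonthly = (queriesPerMonth * tokensPerQuestion / 1e6) * embedRatePerMTok;
```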
▶Migration timeline phased deployment over 36 months
Real procurement decisions span phases — "API year 1, hybrid year 2, self-host year 3". Each phase has its own hosting + reservation. The engine computes monthly cost per phase and total 3-year TCO. The chart in the report visualizes cost-over-time so you can see where transitions break even.
▶Personnel / staffing FTE allocations × loaded annual salary
Federal RFPs require fully-loaded labor in the cost basis. Each role: FTE allocation × annual base × total-comp multiplier (1.30 = +30% benefits/overhead) ÷ 12. Set FTE to 0 to skip a role. Salaries are editable in the Prices tab → Personnel. This is ongoing post-launch staffing; upfront design effort is in Agent engineering below.
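The stated formula as a one-liner, with an assumed salary for illustration:

```ts
// FTE allocation x annual base x total-comp multiplier / 12, per the formula above.
const monthlyLaborCost = (fte: number, annualBase: number, compMult = 1.30): number =>
  (fte * annualBase * compMult) / 12;

monthlyLaborCost(0.5, 120_000); // 0.5 FTE x $120k x 1.30 / 12 = $6,500/mo
```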
▶Agent engineering upfront design effort + maintenance — amortized monthly
Effort to design, build, and validate the agent before launch — SME interviews, requirements/spec, evaluation criteria, prototype, calibration runs. Roles × FTE × duration. Methodology-agnostic; defaults approximate a CARE-style engagement. Amortized over deployment lifetime so it shows up as a monthly line item alongside Operations and Personnel. Ongoing post-launch staffing lives in Personnel above.
Roles × FTE during design phase
Each row uses the same loaded-salary math as Personnel. Edit role salaries in Prices → Personnel.
Maintenance — re-engineering cadence
▶Traffic safeguards bot rate limiting
Bot rate-limiting protects against runaway cost from anonymous traffic. (The old $/day spending cap was retired — it distorted the cost math by clipping queries at an artificial threshold. To answer "what scale fits my budget?", use the Budget solver section above.)
Public endpoints get hammered by bots. Rate-limiting strategy ranges from cheap edge filters to full WAF + CAPTCHA. The bot ceiling caps the "bot factor" that scales anonymous traffic (set in Project profile).
▶Self-host capacity only matters if you self-host
A catalog of GPU specs the engine can use. The currently-selected GPU is marked with a ●. Hosting strategy is set in Deployment posture at the top.
Critical for bursty traffic. A NOAA storm explainer might run only ~10% of the month; cutting GPU hours 10× changes the API-vs-self-host tradeoff dramatically.
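A sketch of why duty cycle dominates; the hourly rate is an assumed H100-class figure, not a quote:

```ts
// Assumed instance price; real GPU specs and rates come from the catalog above.
const gpuHourly = 8.0;   // assumed $/hr, H100-class
const hoursPerMonth = 730;

const alwaysOn  = gpuHourly * hoursPerMonth;  // $5,840/mo
const burst10pc = alwaysOn * 0.10;            // $584/mo, the 10x swing noted above
```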
▶Budget solver given a budget, find the affordable scale
Inverse of the calculator: enter a target monthly budget and the engine solves for the maximum MAU the deployment can support at the current cost-per-query, plus which tradeoffs (cache, model swap, batch tier) would unlock more headroom.
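A sketch of the inverse solve, assuming variable cost scales linearly with MAU and fixed costs are carved out first; all names and inputs are illustrative:

```ts
// Assumes linear scaling in MAU; all inputs are illustrative.
const maxAffordableMau = (
  budgetMonthly: number,
  fixedMonthly: number,          // infra, personnel, amortized engineering
  costPerQuery: number,
  queriesPerMauPerMonth: number,
): number =>
  Math.floor((budgetMonthly - fixedMonthly) / (costPerQuery * queriesPerMauPerMonth));

maxAffordableMau(10_000, 2_000, 0.004, 6); // = 333,333 MAU
```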
▶Model cost comparison switch models to see headline savings
Re-runs the full calculation against every available model — same workload, same hosting, same infra, same federal multipliers. Sorted cheapest to most expensive. Δ shows annual difference vs your current pick.
| Model | $/query | $/month | $/year | Δ vs current (annual) |
|---|---|---|---|---|
⚠ Cheaper ≠ better. Smaller models (gpt-5-nano, haiku-4.5) cost a fraction of flagship models but lose accuracy on multi-step reasoning, RAG faithfulness, and tool use. Validate any switch against your own evaluation set before committing — savings here are only meaningful if the cheaper model still meets your quality bar.
▶Sensitivity how robust is the headline to input drift?
Each row perturbs one input around its baseline and shows the resulting monthly cost. Sorted by impact (biggest driver on top). The black tick on each bar marks the baseline; red extends to the low case, green to the high case. Useful for budget defense: if a 20% MAU miss blows past your budget ceiling, you need either a larger envelope or stronger optimization levers before signing.
| Lever | Low case | Headline range | High case |
|---|---|---|---|
Perturbations: MAU ±20%, Cache hit rate ±10pp, Provider rates ±15%, Bot factor ±20%, Turns/session ±20%. Cache and rate perturbations approximate the deltas analytically — re-run with your own scenario knobs for exact figures.
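A sketch of the one-at-a-time perturbation loop. The real engine perturbs the specific levers listed above; here `costModel` stands in for the full calculation and the ±20% default is illustrative:

```ts
// One-at-a-time sensitivity: perturb each input +/-pct, rank by impact.
type Inputs = Record<string, number>;

const sensitivity = (base: Inputs, costModel: (i: Inputs) => number, pct = 0.20) =>
  Object.keys(base)
    .map((lever) => ({
      lever,
      low:  costModel({ ...base, [lever]: base[lever] * (1 - pct) }),
      high: costModel({ ...base, [lever]: base[lever] * (1 + pct) }),
    }))
    .sort((a, b) => Math.abs(b.high - b.low) - Math.abs(a.high - a.low));
```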
▶Cost over time monthly + cumulative projection given growth%
Projects the monthly headline forward using the Growth/month slider in the simulator. Shows monthly cost climbing month-over-month plus the running 36-month cumulative. Useful for sizing 1-year and 3-year budget envelopes.
Compounds the current simulator growth rate monthly. Caps at 36 months since further projections are unreliable. Headcount, seat-license, and contract reservations don't grow proportionally — adjust manually for those.
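The compounding in code, assuming a flat monthly growth rate applied to the whole headline (which is why fixed items need manual adjustment):

```ts
// Compound growth, capped at 36 months per the note above.
const project = (baseMonthly: number, growthPerMonth: number, months = 36) => {
  let cumulative = 0;
  return Array.from({ length: months }, (_, m) => {
    const monthly = baseMonthly * Math.pow(1 + growthPerMonth, m);
    cumulative += monthly;
    return { month: m + 1, monthly, cumulative };
  });
};

project(5_000, 0.05).at(-1); // month 36: ~$27.6k/mo, ~$479k cumulative
```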
▶Side-by-side compare two scenarios diffed line-by-line
Pick two scenarios. The calculator runs each through the same engine and shows the inputs and outputs side by side, with the % difference highlighted. Current = your live scenario in this calculator.
▶AS-IS vs proposed compare against your current contract
Enter what you're paying today (or what an incumbent vendor quoted you). The calculator surfaces the delta vs the proposed deployment: savings or overrun, plus the payback window if there's a one-time migration cost.
Today's annual spend on whatever this deployment replaces. Vendor invoice, internal cost-allocation, or incumbent quote.
Switching cost — data migration, training, parallel running, contract exit fees. Use 0 if greenfield.
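The payback math as a sketch (names illustrative):

```ts
// Payback window in months; null when the proposal never pays back.
const paybackMonths = (
  asIsMonthly: number,
  proposedMonthly: number,
  migrationOneTime: number,
): number | null => {
  const savings = asIsMonthly - proposedMonthly;
  return savings > 0 ? migrationOneTime / savings : null;
};

paybackMonths(40_000, 25_000, 90_000); // = 6 months
```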
▶Infrastructure database, storage, networking, monitoring — fixed monthly costs
Fixed monthly costs that show up regardless of LLM hosting choice — RDS, S3, CloudWatch, ALB, NAT Gateway, Route 53, etc. Usually the smaller part of total bill but they add up. Each line accepts flat $/mo, $ per query, or $ per GB scaling.
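A sketch of the three scaling modes a line item can use; the type names are assumptions:

```ts
// Each infrastructure line is flat, per-query, or per-GB.
type InfraItem =
  | { kind: "flat";     usdPerMonth: number }
  | { kind: "perQuery"; usdPerQuery: number }
  | { kind: "perGB";    usdPerGB: number };

const itemMonthly = (item: InfraItem, queries: number, gbStored: number): number => {
  switch (item.kind) {
    case "flat":     return item.usdPerMonth;
    case "perQuery": return item.usdPerQuery * queries;
    case "perGB":    return item.usdPerGB * gbStored;
  }
};
```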
📊 Published cost benchmarks
Real cost numbers cited in vendor pricing pages, earnings calls, GAO reports, and academic studies. Use these to sanity-check your calculator output and to defend procurement budgets with citations. Click any source link to verify the number.
💲 Price book
Single source of truth for every price the calculator uses — LLM rates, API reservations, GPU instances, embeddings, vector DBs, AWS infra, personnel salaries, ATO costs. Edits here override the defaults for this workload only and persist in the URL hash. Defaults are validated against vendor pricing pages; last full refresh: —. Future plan: a scraper periodically fetches each source_url and bumps last_verified.
Cost composition
API vs self-host comparison
Per-segment breakdown
Infrastructure breakdown
Derivation of your numbers
Full math trace for this workload — copy-paste into ChatGPT/Claude/Gemini and ask "verify this math". For source citations and formula details, see Methodology in the sidebar.
Methodology, sources & disclosures planning only — refresh prices before procurement
Pricing sources, token-counting heuristics, confidence-interval math, and what this calculator does not model. For the math applied to your numbers, see Derivation of your numbers above.
- Anthropic API: Claude Opus 4.7, Sonnet 4.6, Haiku 4.5; includes cache write/read pricing
- OpenAI API: GPT-5.5, GPT-5.5 Pro, GPT-5.4 family, embeddings, web/file search, containers, and short/long context price tiers
- Google Gemini API: Gemini 3.1 Pro Preview, Gemini 3 Flash Preview, Gemini 3.1 Flash-Lite Preview, Gemini 2.5 Flash, Gemini 2.5 Flash-Lite
- Together AI: Llama 3.3 70B Turbo (provider-specific bootstrap rate)
- p90 = exp(μ + 1.282σ)
- p99 = exp(μ + 2.326σ)
- σ = √(ln(1 + CV²)), where CV = coefficient of variation weighted by task mix
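Putting the three formulas together, taking μ = ln(p50) (the standard lognormal identity: the median of a lognormal is e^μ):

```ts
// Lognormal tail percentiles from the median and the mix-weighted CV.
const tokenPercentiles = (p50: number, cv: number) => {
  const sigma = Math.sqrt(Math.log(1 + cv * cv)); // sigma = sqrt(ln(1 + CV^2))
  const mu = Math.log(p50);                       // median of a lognormal = e^mu
  return {
    p90: Math.exp(mu + 1.282 * sigma),
    p99: Math.exp(mu + 2.326 * sigma),
  };
};

tokenPercentiles(4_000, 0.6); // { p90: ~8,140, p99: ~14,530 } tokens per query
```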
self_host.setup_amortized: Σ over roles of (FTE × loaded annual salary) × design-phase months ÷ 12 → upfront total → ÷ amortization months → amortized monthly. Maintenance accounts for periodic re-specification as the domain drifts (design-lead loaded hourly × hours per session ÷ months between sessions).
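The chain in code, using the shipped FTE/duration defaults from the CARE example below; the loaded salaries and 36-month lifetime are assumptions:

```ts
// Salaries are illustrative; FTEs and the 4-month phase follow the shipped defaults.
const designRoles = [
  { fte: 0.5,  loadedAnnual: 180_000 }, // SME
  { fte: 1.0,  loadedAnnual: 200_000 }, // design lead
  { fte: 1.0,  loadedAnnual: 190_000 }, // developer
  { fte: 0.25, loadedAnnual: 170_000 }, // eval engineer
];
const designMonths = 4;
const amortizationMonths = 36; // assumed deployment lifetime

const upfrontTotal =
  designRoles.reduce((s, r) => s + r.fte * r.loadedAnnual, 0) * (designMonths / 12); // ~$174,167
const setupAmortizedMonthly = upfrontTotal / amortizationMonths;                     // ~$4,838/mo
```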
Methodology-agnostic. The same shape models any structured agent-design methodology — DSPy, structured-prompt extraction, design-pattern playbooks, etc. Adjust role FTEs and duration to match your engagement.
Example — CARE (Collaborative Agent Reasoning Engineering): a three-party stage-gated approach pairing subject-matter experts, developers, and helper agents to produce structured agent specifications (interaction requirements, reasoning policies, evaluation criteria). The defaults shipped here (4-month design phase, 0.5/1.0/1.0/0.25 FTE for SME / design lead / developer / eval engineer, ~$400/mo helper-agent budget, quarterly re-spec cadence) approximate a CARE engagement. Ramachandran, Jha & Ramasubramanian, 2026 (arXiv:2604.28043).
- Images: 1568 tokens per 1568×1568 image (Anthropic), variable for OpenAI low/high detail mode
- Audio (STT): ~25 tokens/sec for English speech (Whisper baseline)
- PDF pages: ~1500 tokens/page average (varies by content density)
- Code interpreter: stdout/stderr counted as input tokens on next turn
- Fine-tuning training cost amortisation (separate calculator recommended)
- Infrastructure/hosting costs for self-hosted deployments (separate calculator recommended)
- Human-in-the-loop reviewer time costs (separate calculator recommended)
- Volume discount tiers (negotiate directly with vendors above 100M tok/mo)
- Network egress, storage, vector DB operational costs beyond the optional file-search/container placeholders
- Compliance overhead (FedRAMP, HIPAA, SOC2 audit costs)
- Latency SLA penalty costs
- Long-form simulation messages. When in workflow mode, both user prompts and agent responses use realistic multi-paragraph templates that match what research-orchestration agents actually produce — structured outputs with sections like "What I observed", "What cannot be determined", "What I need from you", "Risks detected". Token counts shown in the live stream now reflect realistic session content.
- DAG topology controls. Three execution patterns: Sequential (linear pipeline, 1 agent at a time), Parallel (all stages concurrently — wall-clock = max stage time but hits rate limits), Hybrid (sequential trunk + parallel sub-branches at certain stages, realistic for AKD-style workflows). Cost is identical regardless of topology, but parallelism affects rate limit overage costs and concurrent quota utilisation.
- Concurrent quota model. Each provider has a max concurrent request quota (e.g. Anthropic Tier 2 = 50 concurrent). Exceeding it triggers queueing/throttling — modelled as 2% surcharge per overflow request. Rate limit overage on retries adds 1.5× cost on the failed fraction.
- Sequential chain handoff — each stage's output becomes part of the next stage's input. Slider controls the % of prior output passed forward (typical 70–90%).
- Bulk document ingestion — PDFs/session × pages × tokens/page added to the % of stages that read the corpus. Distinct from RAG retrieval.
- Partial rerun — % of stages re-executed when users review and reject output. Multiplies base cost.
- Fact-check sidecar — separate verification call per stage with its own (typically cheaper) model. Common in research workflows where every output is checked.
- Template amortization — workflow planning is one-time but template runs many times. Subtracts a small saving as runs/template grows.
- HITL pause storage — session state retention during user review pauses. Negligible per session, accumulates at scale.
- Per-provider tool fees. Web search, file search, and container-session fees are now keyed by the agent's provider+family. Anthropic web search is $10/1k; OpenAI Assistants $10/1k web + $2.50/1k file + $0.03/container session; Google Vertex Search Grounding $35/1k. Bedrock/Azure/OpenRouter pass through with no separate billing. Self-hosted has zero tool fees.
- Tunable cache write share. Cached tokens are split between cache reads (cheap, 90% discount) and cache writes (premium, ~25% over list price for the first hit on Anthropic's 5-minute TTL). The new slider lets you model "cold start" vs "steady state" deployments — the share defaults to 10% (steady state) but rises to 30–50% for low-volume or fresh deployments where the 5-minute TTL frequently expires. (Sketch after this list.)
- Multi-tier long context. Models can now declare a tiers[] array with multiple price levels (e.g. Gemini standard ≤200k, long 200k–1M, ultra-long >1M). The legacy longThreshold/longIn/longOut binary format remains supported. The engine picks the highest thresholdAt ≤ current input size (see the sketches below).
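Two sketches for the last two items above. First, the blended cache price implied by the write-share split, assuming Anthropic-style rates (read = 0.10× list, write = 1.25× list):

```ts
// Blended effective rate for cached input tokens; list rate in $/MTok.
const blendedCachedRate = (listInRate: number, writeShare = 0.10): number =>
  listInRate * (writeShare * 1.25 + (1 - writeShare) * 0.10);

blendedCachedRate(3.0);       // steady state (10% writes): ~$0.65/MTok
blendedCachedRate(3.0, 0.4);  // cold start (40% writes):    $1.68/MTok
```

Second, tier selection: the highest thresholdAt ≤ current input size, falling back to the lowest tier. Field names follow the bullet above; the example rates are assumptions:

```ts
interface PriceTier { thresholdAt: number; inPerMTok: number; outPerMTok: number }

const pickTier = (tiers: PriceTier[], inputTokens: number): PriceTier =>
  [...tiers]
    .sort((a, b) => a.thresholdAt - b.thresholdAt)
    .filter((t) => t.thresholdAt <= inputTokens)
    .at(-1) ?? tiers[0];

const geminiLike: PriceTier[] = [
  { thresholdAt: 0,       inPerMTok: 1.25, outPerMTok: 10 }, // assumed standard tier
  { thresholdAt: 200_000, inPerMTok: 2.50, outPerMTok: 15 }, // assumed long tier
];
pickTier(geminiLike, 350_000); // -> long tier
```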
- Managed API: 1.00× (direct vendor list price)
- BYOK: 1.00× (bring-your-own-key, no aggregator markup)
- AWS Bedrock: 1.05× (typical 5% AWS markup, GovCloud available)
- Azure OpenAI: 1.00× (parity with OpenAI direct, ent compliance)
- OpenRouter: 1.05× (typical aggregator markup, varies)
- Self-Hosted: 0× per-token + $5,000/mo fixed cost (default; configurable; covers H100/A100 instance + light ops)
retry_rate × 1.5 × base_cost — accounting for partial output already generated before failure plus the full retry call. Previous versions used retry_rate × 1.0 × base_cost, which underestimates real retry waste by ~33%.
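The stated model in code:

```ts
// Failed calls burn partial output before failing, then pay for a full retry: 1.5x weighting.
const retryOverhead = (baseMonthlyCost: number, retryRate: number): number =>
  retryRate * 1.5 * baseMonthlyCost;

retryOverhead(10_000, 0.02); // $300/mo extra on a $10k/mo base at a 2% retry rate
```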