AI Cost Calculator

Total / mo (all-in)

Project profile name, team, audience

Identify the deployment. These labels appear at the top of the report. The "Public-facing?" toggle drives the bot-factor multiplier and informs sensible daily-cap recommendations downstream.

Hosting strategy

Pick where the LLM runs. Downstream sections (Reservations, Self-host capacity, On-prem amortization) auto-show only when relevant.

A Traffic & efficiency sessions, turns, cache, retries — drives queries-per-month and the headline cost
View: Showing core knobs only — switch to Advanced for full control over reasoning, multimodal, prompt overhead, guardrails, etc.
Topology: Multi-Agent Fleet
Global Parameters
Agents: 3
Parallel reasoning agents
Monthly active users: 50
Turns / session: 8
Daily return rate per user: 0.3
Bot factor: 1.5×
1.0× = internal · 1.5× = typical public · 2–3× = under crawler load
Cache hit rate: 45%
~90% discount on hits
Cache write share: 10%
Of cached tokens: % first-write (premium); the rest are reads. ~10% steady state, ~50% cold start.
Batch async %: 0%
50% discount
Retry rate: 3%
Growth / month: 20%
Peak / avg ratio
Burstiness — >2× adds 5% per level
Language multiplier: 1.0×
EN = 1.0, code = 1.3, CJK = 1.8–2.2
Comm pattern: orch
orch = 0 · sup = +150·(N−1) · peer = +300·(N−1) · pipeline = Σ prior outputs (Eq. 7)
RAG / Fact Reasoning
RAG · REASON
RAG Pipeline Tokens
Chunks retrieved: 5
Top-K retrieval
Tokens/chunk: 512
Context tokens per chunk
Query embed tokens: 128
Per retrieval query
Retrieval calls/turn: 1
Multi-hop queries
RAG tokens/turn: 0  ·  Embed cost/sess: $0  ·  Retrieval augments input by 0%
Corpus size (tokens)
Re-embed every (months)
Monthly amortized: $0  ·  per-query embedding tokens are set on the slider above
Multimodal Inputs (advanced)
Images / turn: 0
~1568 tok per 1568×1568 px image
Audio sec / turn: 0
~25 tok/sec speech-to-text
PDF pages / turn: 0
~1500 tok/page average
Code interp output: 0
stdout/stderr fed back
Prompt Engineering Overhead (advanced)
Few-shot examples: 0
~250 tok per example
JSON schema overhead: 0
Structured output enforcement
Citations / grounding: 0
Output overhead per response
Memory / persistent ctx: 0
Long-term agent memory
Extended Thinking (advanced)
Thinking tokens: 0
Extended thinking budget
Reasoning % of turns: 0%
Turns using extended thinking
CoT chain length: 0
Steps × ~150 tok each
Reasoning tokens/turn: 0  ·  Added cost/sess: $0
Guardrails + Tools GUARD
Guardrail Token Overhead (advanced)
Input guardrail tokens: 0
Classifier prompt overhead
Output guardrail tokens: 0
Output scan overhead
PII scan tokens: 0
PII detection per msg
Content policy tokens: 0
Policy classifier
Guard block rate: 0%
Blocked = tokens wasted
Separate guard model: $0
$ per 1M tok for guard calls
Guard tokens/turn: 0  ·  Guard cost/sess: $0  ·  Block waste: $0
Tool Calls
Tool calls/turn: 2
Schema tokens: 320
Result tokens: 800
Fed back into context
IA msg overhead: 80
Provider tool fees/sess: $0 · Web search/file search/container line items are charged separately from tokens.
Web search / turn: 0
Default $10 / 1k calls
File search / turn: 0
Default $2.50 / 1k calls
Container sess.: 0
1 GB session placeholder
Embeddings model: $0.02
text-embedding-3-small per 1M input tokens
System Prompt
System prompt tokens: 512
Amortised over turns
Workload mix — query types · blended output ×1.0
Global mix across all queries (not per-agent). Drives output-token multipliers — classify ≈ 0.3×, summary ≈ 0.65×, RAG ≈ 0.85×, code ≈ 2.8×, longform ≈ 3.6×, agentic ≈ 4.3×.
Fact-checking pipeline VERIFY
Adds an automated factuality check on a sample of answers — catches hallucinations at the cost of extra tokens per verified query. Skip this section if you don't run a verifier.
Sampling coverage (0–1)
Verifier preset
Atoms (claims) per response
Where the fact-check model runs
Source for evidence retrieval
Advanced — token budgets per call
Atomizer · input tokens
Atomizer · output tokens
Reviser · input tokens
Reviser · output tokens
NLI · input tokens
NLI · output tokens
Service pod monthly ($)
B Agent fleet per-agent role, model, prompt, RAG, tools, guardrails — 0 agents
Fleet actions
Use the main Agents slider to choose fleet size. Agent edits are preserved when resizing where possible, and included in Share/JSON export.
What changes here
D Cost breakdown per-turn token ledger and cumulative composition
Per-Turn Token Ledger — turn 0
Token composition (horizontal stack)
Cumulative Token Breakdown
Token breakdown chart.
Sys prompt: 0 (amortised)
History ctx: 0 (conversation)
RAG chunks: 0 (retrieval)
Thinking: 0 (extended CoT)
Guardrails: 0 (safety scan)
Tools: 0 (schema + results)
Lognormal CI Distribution — Cost Per Session
CI distribution chart.
E Comparisons model-by-model cost table and 12-month projection
Model Comparison planning estimate
Model · API total · RAG share · Reason share · Guard+Tool tok · Tool fees+ · Cache saved · Retry+ · p50 · p90 · p99
Monthly Projection — 12 months
Projection chart.
Per-Component Cost Bars
Full Itemised Breakdown
Per-Agent Cost Contribution heterogeneous
Monthly projection exceeds budget. Tune cache rate, enable batch processing, or reduce RAG chunks.
F Multi-model routing traffic-split presets and weighted blended cost
Multi-Model Routing — Traffic Split
Production agentic systems route queries to different models based on complexity. Drag sliders to allocate traffic %. Cost is computed as weighted average across all routed models.
Routed cost / session
Weighted blended cost: $0.00000
vs single-model (Sonnet baseline):
Monthly p50 (routed): $0
Routing Pattern Presets
Cost comparison: routing vs single-model
G Stress test ±50% sensitivity, tornado chart, what-if scenarios
Sensitivity — ±50% Parameter Impact
Tornado Chart
Tornado chart.
What-If Scenarios

Federal compliance & hosting FedRAMP, ATO, egress, audit retention

Federal AI deployments cost more than commercial. Two kinds of overhead apply: multipliers (FedRAMP/GovCloud premium and multi-region redundancy multiply your LLM compute cost), and additive line items (ATO, egress, log retention, vector DB hosting). Set to "none / single / 0" if you're modeling a commercial deployment.

Hosting multipliers

Multipliers stack — e.g. FedRAMP High + active-active = ×2.60 on LLM and GPU costs.

Additive monthly costs

Not modeled: ATO assessment labor (use Agent engineering or Personnel for that), sole-source procurement overhead, GSA Schedule discounts, model availability constraints (some Anthropic/OpenAI models aren't in GovCloud). Add fixed-cost approximations as line items in Infrastructure below.

Your typical question token sizes for one canonical query

A "token" is roughly ¾ of a word (so 1,000 tokens ≈ 750 English words). The cost depends on how big each question and answer is. Don't know your numbers? Use the wizard below — it asks plain-English questions and figures out the token math for you.

Estimate tokens in 30 seconds → No engineering required

Question types full-pipeline, lookup-only, refusal, …

Different questions have different token cost. A simple lookup uses fewer tokens than a full multi-tool pipeline. Each shape's factor scales the typical query above (1.0 = same size; 0.5 = half).

+ Add another question type

Multi-agent pipeline optional — if your system uses more than one LLM call per query

Many real deployments make several LLM calls per user question — e.g., a Planner decides which tools to call, a Retriever ranks documents, a Tool agent formulates a database query, a Summarizer writes the answer. Each agent has its own token budget. If you fill this in, the engine uses agent-sum instead of "Your typical question" above. Leave empty if your deployment is single-call.

+ Add an agent
Common pipeline templates

Traffic mix presets how questions split across types

Each preset is a different traffic distribution — e.g., "worst case" treats every question as full-pipeline; "mixed" reflects realistic production traffic. Weights must sum to 1.

+ Add another mix preset

Your users monthly active users, sessions, questions

Break your audience into segments — e.g., "internal staff" vs "public visitors". Each has a different usage pattern. Anonymous public segments get a "bot factor" multiplier to account for crawlers and abuse.

+ Add another audience segment

API reservations / committed-spend discount or PTU vs on-demand

Reserved API capacity (Azure PTU, AWS Bedrock provisioned, OpenAI Enterprise commit) buys a discount in exchange for commitment. Often 30–50% effective savings at >$10K/mo on-demand spend. Choose the plan + size in units (PTU-style) — the engine swaps in the right cost model. Requires API hosting.

Edit the rates / discounts in the Prices tab → API reservations table. Effective monthly cost shows in the report's "Reservation savings" row.

Embeddings (RAG) ingest cost + per-query embedding

RAG systems pay for embeddings twice: once at ingest (chunking the corpus into vectors, amortized over the re-embed cycle) and once at query time (embedding the user question to do vector search). For a 100M-token corpus + 1M queries/mo this often runs $50–$500/mo. Skip if not using RAG.
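For planning, the two components reduce to a few lines. A minimal sketch, assuming a flat $0.02 per 1M tokens (the text-embedding-3-small list rate — your model and rate may differ):

  EMBED_RATE = 0.02 / 1_000_000   # $ per token ($0.02 per 1M input tokens)

  def embedding_monthly_cost(corpus_tokens, reembed_months,
                             queries_per_month, query_embed_tokens):
      # Ingest: embed the whole corpus once, amortized over the re-embed cycle.
      ingest = corpus_tokens * EMBED_RATE / reembed_months
      # Query time: embed each user question for vector search.
      per_query = queries_per_month * query_embed_tokens * EMBED_RATE
      return ingest + per_query

  # 100M-token corpus re-embedded quarterly, 1M queries/mo at 128 tokens each:
  print(f"${embedding_monthly_cost(100e6, 3, 1e6, 128):,.2f}/mo")

Larger embedding models, longer queries, frequent re-embeds, and re-ranking calls push the total toward the upper end of the range above.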

Migration timeline phased deployment over 36 months

Real procurement decisions span phases — "API year 1, hybrid year 2, self-host year 3". Each phase has its own hosting + reservation. Engine computes monthly cost per phase and total 3-year TCO. The chart in the report visualizes cost-over-time so you can see where transitions break even.

+ Add a phase

Personnel / staffing FTE allocations × loaded annual salary

Federal RFPs require fully-loaded labor in the cost basis. Each role: FTE allocation × annual base × total-comp multiplier (1.30 = +30% benefits/overhead) ÷ 12. Set FTE to 0 to skip a role. Salaries are editable in the Prices tab → Personnel. This is ongoing post-launch staffing; upfront design effort is in Agent engineering below.
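The per-role math is simple enough to sanity-check by hand. A sketch with a hypothetical salary (the editable defaults live in Prices → Personnel):

  def monthly_role_cost(fte, annual_base, comp_multiplier=1.30):
      # FTE allocation × annual base × total-comp multiplier ÷ 12
      return fte * annual_base * comp_multiplier / 12

  # e.g. a half-time engineer at a $160k base (illustrative figure):
  print(f"${monthly_role_cost(0.5, 160_000):,.0f}/mo")   # ≈ $8,667/mo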

+ Add a role

Agent engineering upfront design effort + maintenance — amortized monthly

Effort to design, build, and validate the agent before launch — SME interviews, requirements/spec, evaluation criteria, prototype, calibration runs. Roles × FTE × duration. Methodology-agnostic; defaults approximate a CARE-style engagement. Amortized over deployment lifetime so it shows up as a monthly line item alongside Operations and Personnel. Ongoing post-launch staffing lives in Personnel above.

Roles × FTE during design phase

Each row uses the same loaded-salary math as Personnel. Edit role salaries in Prices → Personnel.

+ Add a design-phase role

Maintenance — re-engineering cadence

Agent engineering summary


Traffic safeguards bot rate limiting

Bot rate-limiting protects against runaway cost from anonymous traffic. (The old $/day spending cap was retired — it distorted the cost math by clipping queries at an artificial threshold. To answer "what scale fits my budget?", use the Budget solver section above.)

Bot rate limiting

Public endpoints get hammered by bots. Rate-limiting strategy ranges from cheap edge filters to full WAF + CAPTCHA. The bot ceiling caps the "bot factor" that scales anonymous traffic (set in Project profile).

Self-host capacity only matters if you self-host

A catalog of GPU specs the engine can use; the currently selected GPU is marked in the list. Hosting strategy is set in Deployment posture at the top.

+ Add another GPU instance

Critical for bursty traffic. A NOAA storm explainer might run only ~10% of the month; cutting GPU hours 10× changes the API-vs-self-host tradeoff dramatically.

Budget solver given a budget, find the affordable scale

Inverse of the calculator: enter a target monthly budget and the engine solves for the maximum MAU the deployment can support at the current cost-per-query. Plus what tradeoffs (cache, model swap, batch tier) would unlock more headroom.

Enter a monthly budget above.
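When cost scales roughly linearly with MAU, the solve is simple algebra; the real engine iterates the full cost model instead. A sketch under that linearity assumption (all figures hypothetical):

  def max_mau(monthly_budget, fixed_monthly, cost_per_user_month):
      # headroom left after fixed infra, divided by marginal cost per user
      variable = monthly_budget - fixed_monthly
      return max(0, int(variable / cost_per_user_month))

  # $5,000/mo budget, $1,200/mo fixed infra, $0.85 per user per month:
  print(max_mau(5_000, 1_200, 0.85))   # 4470 users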

Model cost comparison switch models to see headline savings

Re-runs the full calculation against every available model — same workload, same hosting, same infra, same federal multipliers. Sorted cheapest to most expensive. Δ shows annual difference vs your current pick.

Model · $/query · $/month · $/year · Δ vs current (annual)

Cheaper ≠ better. Smaller models (gpt-5-nano, haiku-4.5) cost a fraction of flagship models but lose accuracy on multi-step reasoning, RAG faithfulness, and tool use. Validate any switch against your own evaluation set before committing — savings here are only meaningful if the cheaper model still meets your quality bar.

Sensitivity how robust is the headline to input drift?

Each row perturbs one input around its baseline and shows the resulting monthly cost. Sorted by impact (biggest driver on top). The black tick on each bar marks the baseline; red extends to the low case, green to the high case. Useful for budget defense: if a 20% MAU miss blows past your budget ceiling, you need either a larger envelope or stronger optimization levers before signing.

Lever · Low case · Headline range · High case

Perturbations: MAU ±20%, Cache hit rate ±10pp, Provider rates ±15%, Bot factor ±20%, Turns/session ±20%. Cache and rate perturbations approximate the deltas analytically — re-run with your own scenario knobs for exact figures.

Cost over time monthly + cumulative projection given growth%

Projects the monthly headline forward using the Growth/month slider in the simulator. Shows monthly cost climbing month-over-month plus the running 36-month cumulative. Useful for sizing 1-year and 3-year budget envelopes.

Compounds the current simulator growth rate monthly. Caps at 36 months since further projections are unreliable. Headcount, seat-license, and contract reservations don't grow proportionally — adjust manually for those.
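The projection itself is plain compounding. A sketch matching the description above:

  def project(monthly_cost, growth, months=36):
      # returns (month, monthly cost, running cumulative) for each month
      rows, cumulative = [], 0.0
      for m in range(months):
          cost = monthly_cost * (1 + growth) ** m
          cumulative += cost
          rows.append((m + 1, cost, cumulative))
      return rows

  # $1,000/mo headline at 20% growth/month:
  for month, cost, cum in project(1_000, 0.20)[:3]:
      print(month, round(cost), round(cum))   # 1 1000 1000 · 2 1200 2200 · 3 1440 3640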

Side-by-side compare two scenarios diffed line-by-line

Pick two scenarios. The calculator runs each through the same engine and shows the inputs and outputs side by side, with the % difference highlighted. Current = your live scenario in this calculator.

AS-IS vs proposed compare against your current contract

Enter what you're paying today (or what an incumbent vendor quoted you). The calculator surfaces the delta vs the proposed deployment: savings or overrun, plus the payback window if there's a one-time migration cost.

Today's annual spend on whatever this deployment replaces. Vendor invoice, internal cost-allocation, or incumbent quote.

Switching cost — data migration, training, parallel running, contract exit fees. Use 0 if greenfield.

Infrastructure database, storage, networking, monitoring — fixed monthly costs

Fixed monthly costs that show up regardless of LLM hosting choice — RDS, S3, CloudWatch, ALB, NAT Gateway, Route 53, etc. Usually the smaller part of total bill but they add up. Each line accepts flat $/mo, $ per query, or $ per GB scaling.

+ Add an infrastructure line

📊 Published cost benchmarks

Real cost numbers cited in vendor pricing pages, earnings calls, GAO reports, and academic studies. Use these to sanity-check your calculator output and to defend procurement budgets with citations. Click any source link to verify the number.

Your calc: · /user/mo · /query · /yr
Visual comparison — pick a metric; your scenario plots alongside every cited benchmark.
Your scenario · Commercial · Federal · Industry / academic

💲 Price book

Single source of truth for every price the calculator uses — LLM rates, API reservations, GPU instances, embeddings, vector DBs, AWS infra, personnel salaries, ATO costs. Edits here override the defaults for this workload only and persist in the URL hash. Defaults are validated against vendor pricing pages; last full refresh: . Future plan: a scraper periodically fetches each source_url and bumps last_verified.

Total monthly cost

View:
Annual
3-year TCO
$ / user / month
Queries / month

Cost composition

API vs self-host comparison

Per-segment breakdown

Infrastructure breakdown

Derivation of your numbers

Full math trace for this workload — copy-paste into ChatGPT/Claude/Gemini and ask "verify this math". For source citations and formula details, see Methodology in the sidebar.

Methodology, sources & disclosures planning only — refresh prices before procurement

Pricing sources, token-counting heuristics, confidence-interval math, and what this calculator does not model. For the math applied to your numbers, see Derivation of your numbers above.

PRICING SOURCES
Static bootstrap prices were last patched on 2026-05-04; refresh them before procurement use. Prices are subject to change by providers. For procurement, verify against vendor pricing pages on the day of submission. Sources:
  • Anthropic API: Claude Opus 4.7, Sonnet 4.6, Haiku 4.5; includes cache write/read pricing
  • OpenAI API: GPT-5.5, GPT-5.5 Pro, GPT-5.4 family, embeddings, web/file search, containers, and short/long context price tiers
  • Google Gemini API: Gemini 3.1 Pro Preview, Gemini 3 Flash Preview, Gemini 3.1 Flash-Lite Preview, Gemini 2.5 Flash, Gemini 2.5 Flash-Lite
  • Together AI: Llama 3.3 70B Turbo provider-specific bootstrap rate
TOKEN COUNTING
The browser token counter is a lightweight planning heuristic, not a true BPE tokenizer. It is useful for relative scenario exploration, but production estimates should be validated with provider token counters or logged API usage. Cross-provider tokenizer differences can be material, especially for long context, code, CJK text, and model-family upgrades.
CONFIDENCE INTERVALS — LOGNORMAL MODEL
Output token lengths follow a lognormal distribution (right-skewed). The model uses:
  • p90 = exp(μ + 1.282σ)
  • p99 = exp(μ + 2.326σ)
  • σ = √(ln(1 + CV²)), where CV = coefficient of variation weighted by task mix
CV values per task type are approximations based on observations of LLM output variance — not measured against published benchmarks. They should be treated as planning estimates with ±30% uncertainty bands themselves. For high-stakes budgeting, derive CV empirically from your own production logs.
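As a checkable sketch: μ here is backed out of the headline mean via the standard lognormal identity mean = exp(μ + σ²/2) — the calculator's own calibration may differ.

  import math

  def lognormal_percentiles(mean_cost, cv):
      sigma = math.sqrt(math.log(1 + cv**2))
      mu = math.log(mean_cost) - sigma**2 / 2   # from mean = exp(mu + sigma^2/2)
      return {
          "p50": math.exp(mu),                  # median of a lognormal
          "p90": math.exp(mu + 1.282 * sigma),
          "p99": math.exp(mu + 2.326 * sigma),
      }

  # $0.05 mean cost/session with a task-mix-weighted CV of 0.8:
  print(lognormal_percentiles(0.05, 0.8))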
TASK-TYPE OUTPUT MULTIPLIERS
Output multipliers (classify 0.30×, summary 0.65×, RAG 0.85×, code 2.80×, longform 3.60×, agentic 4.30×) are heuristic estimates derived from informal observations of typical task output lengths. They are not formally derived from HELM, LMSYS, or other published benchmarks. Real values for your specific use case may vary ±50%.
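Blending the mix into a single output multiplier is a weighted sum. A sketch, with a hypothetical 60/30/10 mix:

  MULTIPLIERS = {"classify": 0.30, "summary": 0.65, "rag": 0.85,
                 "code": 2.80, "longform": 3.60, "agentic": 4.30}

  def blended_output_multiplier(mix):
      assert abs(sum(mix.values()) - 1.0) < 1e-9, "mix weights must sum to 1"
      return sum(MULTIPLIERS[task] * weight for task, weight in mix.items())

  print(blended_output_multiplier({"rag": 0.6, "summary": 0.3, "agentic": 0.1}))
  # 0.85·0.6 + 0.65·0.3 + 4.30·0.1 = 1.135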
AGENT ENGINEERING COST MODEL
Upfront design effort is amortized over the deployment lifetime, then summed into the monthly headline alongside Operations and Personnel. The cost shape mirrors self_host.setup_amortized: roles × FTE × duration ÷ 12 → upfront total → ÷ amortization months → amortized monthly. Maintenance accounts for periodic re-specification as the domain drifts (design-lead loaded hourly × hours per session ÷ months between sessions).

Methodology-agnostic. The same shape models any structured agent-design methodology — DSPy, structured-prompt extraction, design-pattern playbooks, etc. Adjust role FTEs and duration to match your engagement.

Example — CARE (Collaborative Agent Reasoning Engineering): a three-party stage-gated approach pairing subject-matter experts, developers, and helper agents to produce structured agent specifications (interaction requirements, reasoning policies, evaluation criteria). The defaults shipped here (4-month design phase, 0.5/1.0/1.0/0.25 FTE for SME / design lead / developer / eval engineer, ~$400/mo helper-agent budget, quarterly re-spec cadence) approximate a CARE engagement. Ramachandran, Jha & Ramasubramanian, 2026 (arXiv:2604.28043).
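The amortization shape, as a sketch (loaded salaries here are placeholders; the shipped defaults are editable):

  def engineering_monthly(roles, duration_months, amortize_months,
                          maint_monthly=0.0):
      # roles: list of (FTE, loaded annual salary) during the design phase
      upfront = sum(fte * salary for fte, salary in roles) * duration_months / 12
      return upfront / amortize_months + maint_monthly

  # 4-month design phase at 0.5/1.0/1.0/0.25 FTE, amortized over 36 months:
  roles = [(0.5, 180_000), (1.0, 200_000), (1.0, 190_000), (0.25, 170_000)]
  print(f"${engineering_monthly(roles, 4, 36):,.0f}/mo")   # ≈ $4,838/mo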
CACHE DISCOUNT MODEL — LIMITATIONS
Anthropic prompt caching has a 5-minute TTL and only applies to specific prompt prefixes (system prompts cache best). The cache slider now splits eligible cached input into cache writes and cache reads where provider pricing exposes both. Anthropic 5-minute writes are modeled at the published write rate and reads at the published read rate. OpenAI/Gemini cached input is modeled as the published cached/context-caching input rate without a separate write surcharge. The cost engine now applies OpenAI/Gemini long-context tiers when a per-turn input exceeds the provider threshold encoded on the model. Actual cache eligibility still depends on stable prompt prefixes, TTL, provider cache thresholds, storage time, and request construction.
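The write/read split reduces to roughly the following (rates illustrative; real write/read rates come from the price book):

  def cached_input_cost(cached_tokens, list_rate_per_1m, write_share=0.10,
                        write_premium=1.25, read_rate_share=0.10):
      writes = cached_tokens * write_share           # first-write, at a premium
      reads = cached_tokens - writes                 # ~90%-discounted hits
      return (writes * write_premium + reads * read_rate_share) \
             * list_rate_per_1m / 1_000_000

  # 2M cached tokens at a $3/1M list rate, steady-state 10% write share:
  print(f"${cached_input_cost(2_000_000, 3.0):.2f}")
  # writes 0.2M × $3.75 + reads 1.8M × $0.30 = $0.75 + $0.54 = $1.29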
MULTIMODAL TOKEN ESTIMATES
  • Images: 1568 tokens per 1568×1568 image (Anthropic), variable for OpenAI low/high detail mode
  • Audio (STT): ~25 tokens/sec for English speech (Whisper baseline)
  • PDF pages: ~1500 tokens/page average (varies by content density)
  • Code interpreter: stdout/stderr counted as input tokens on next turn
WHAT THIS TOOL DOES NOT MODEL
  • Fine-tuning training cost amortisation (separate calculator recommended)
  • Infrastructure/hosting costs for self-hosted deployments (separate calculator recommended)
  • Human-in-the-loop reviewer time costs (separate calculator recommended)
  • Volume discount tiers (negotiate directly with vendors above 100M tok/mo)
  • Network egress, storage, vector DB operational costs beyond the optional file-search/container placeholders
  • Compliance overhead (FedRAMP, HIPAA, SOC2 audit costs)
  • Latency SLA penalty costs
DISCLAIMER
This tool produces planning estimates, not contractual cost commitments. For procurement decisions, federal budget submissions, or vendor negotiations: (a) verify pricing on the day of submission, (b) validate token estimates against your own production telemetry, (c) treat all p99 figures as soft upper bounds with their own ±20% uncertainty, (d) consult your finance team for compliance-specific cost adders. The author makes no warranty of accuracy and accepts no liability for procurement decisions based on these estimates.
v9.6 — DAG TOPOLOGY + REALISTIC LONG MESSAGES
Three refinements making the workflow simulation more representative of real systems like AKD Flow:
  • Long-form simulation messages. When in workflow mode, both user prompts and agent responses use realistic multi-paragraph templates that match what research-orchestration agents actually produce — structured outputs with sections like "What I observed", "What cannot be determined", "What I need from you", "Risks detected". Token counts shown in the live stream now reflect realistic session sizes.
  • DAG topology controls. Three execution patterns: Sequential (linear pipeline, 1 agent at a time), Parallel (all stages concurrently — wall-clock = max stage time but hits rate limits), Hybrid (sequential trunk + parallel sub-branches at certain stages, realistic for AKD-style workflows). Cost is identical regardless of topology, but parallelism affects rate limit overage costs and concurrent quota utilisation.
  • Concurrent quota model. Each provider has a max concurrent request quota (e.g. Anthropic Tier 2 = 50 concurrent). Exceeding it triggers queueing/throttling — modelled as 2% surcharge per overflow request. Rate limit overage on retries adds 1.5× cost on the failed fraction.
The AKD Flow preset now applies the hybrid topology by default with 3 parallel branches and 20 concurrent quota, matching what a typical research-workflow deployment looks like.
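In code, the quota surcharge reads as follows — a sketch that interprets "2% per overflow request" as 2% of base cost per request above quota, which is one plausible reading:

  def concurrency_overage(base_cost, peak_concurrent, quota, retry_rate=0.0):
      overflow = max(0, peak_concurrent - quota)
      queueing = base_cost * 0.02 * overflow   # 2% surcharge per overflow request
      retries = base_cost * retry_rate * 1.5   # 1.5× overage on the failed fraction
      return queueing + retries

  # 60 concurrent against a Tier-2-style quota of 50, 3% retries, $100 base:
  print(concurrency_overage(100.0, 60, 50, 0.03))   # 20.0 + 4.5 = 24.5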
v9.5 — WORKFLOW DAG MODE
Adds an explicit Workflow DAG mode for sequential pipeline systems (e.g. AKD Flow research orchestration). Toggle in topbar. Workflow mode adds six cost components on top of the base per-agent calculation:
  • Sequential chain handoff — each stage's output becomes part of the next stage's input. Slider controls the % of prior output passed forward (typical 70–90%).
  • Bulk document ingestion — PDFs/session × pages × tokens/page added to the % of stages that read the corpus. Distinct from RAG retrieval.
  • Partial rerun — % of stages re-executed when users review and reject output. Multiplies base cost.
  • Fact-check sidecar — separate verification call per stage with its own (typically cheaper) model. Common in research workflows where every output is checked.
  • Template amortization — workflow planning is one-time but template runs many times. Subtracts a small saving as runs/template grows.
  • HITL pause storage — session state retention during user review pauses. Negligible per session, accumulates at scale.
Use the AKD Flow preset button in the Routing tab for a calibrated research-workflow configuration (5 stages, 8 PDFs/session, fact-check on every stage, 4 HITL pauses).
v9.4 — TOOL FEES, CACHE WRITE SHARE, MULTI-TIER PRICING
Three structural improvements to the pricing engine:
  • Per-provider tool fees. Web search, file search, and container-session fees are now keyed by the agent's provider+family. Anthropic web search is $10/1k; OpenAI Assistants $10/1k web + $2.50/1k file + $0.03/container session; Google Vertex Search Grounding $35/1k. Bedrock/Azure/OpenRouter pass through with no separate billing. Self-hosted has zero tool fees.
  • Tunable cache write share. Cached tokens are split between cache reads (cheap, 90% discount) and cache writes (premium, ~25% over list price for the first hit on Anthropic's 5-minute TTL). The new slider lets you model "cold start" vs "steady state" deployments — the share defaults to 10% (steady state) but rises to 30–50% for low-volume or fresh deployments where 5-minute TTL frequently expires.
  • Multi-tier long context. Models can now declare a tiers[] array with multiple price levels (e.g. Gemini standard ≤200k, long 200k–1M, ultra-long >1M). The legacy longThreshold/longIn/longOut binary format remains supported. The engine picks the highest thresholdAt ≤ current input size.
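Tier selection in miniature (prices hypothetical; the shape matches the tiers[] description above):

  def pick_tier(tiers, input_tokens, base_rates):
      # tiers: [{"thresholdAt": tokens, "in": $, "out": $}, ...] in any order
      eligible = [t for t in tiers if t["thresholdAt"] <= input_tokens]
      if not eligible:
          return base_rates   # below the first threshold: standard pricing
      return max(eligible, key=lambda t: t["thresholdAt"])

  gemini_style = [{"thresholdAt": 200_000, "in": 2.50, "out": 15.00},
                  {"thresholdAt": 1_000_000, "in": 5.00, "out": 30.00}]
  print(pick_tier(gemini_style, 450_000, {"in": 1.25, "out": 10.00}))
  # → the 200k tier: highest thresholdAt ≤ 450k input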
v9 — PER-AGENT HETEROGENEOUS CONFIG
v9 exposes per-agent configuration as a first-class Agent Settings tab. Each agent in a multi-agent fleet can have its own model, provider, turns share, RAG/tools/reasoning/guardrail settings, cache rate, max output cap, and task-type bias. The cost engine walks each agent independently and sums per-session costs. This reveals that uniform-fleet pricing materially overstates real heterogeneous system costs for some configurations and understates them for others (notably when expensive specialised models like Opus are used for high-context analyst roles).
PROVIDER COST MULTIPLIERS
  • Managed API: 1.00× (direct vendor list price)
  • BYOK: 1.00× (bring-your-own-key, no aggregator markup)
  • AWS Bedrock: 1.05× (typical 5% AWS markup, GovCloud available)
  • Azure OpenAI: 1.00× (parity with OpenAI direct, enterprise compliance)
  • OpenRouter: 1.05× (typical aggregator markup, varies)
  • Self-Hosted: 0× per-token + $5,000/mo fixed cost (default; configurable; covers H100/A100 instance + light ops)
These are representative averages; actual contracts vary. Negotiated enterprise agreements may be 10–20% below list price.
REALISTIC RETRY MODEL
v9 models retries as retry_rate × 1.5 × base_cost — accounting for partial output already generated before failure plus the full retry call. Previous versions used retry_rate × 1.0 × base_cost, which underestimates real retry waste by ~33%.
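Worked out (base cost and rate illustrative):

  def retry_overhead(base_cost, retry_rate):
      return retry_rate * 1.5 * base_cost   # partial output + full retry call

  print(retry_overhead(100.0, 0.03))        # $4.50, vs $3.00 under the old 1.0× model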
CONVERSATION SUMMARISATION OVERHEAD
When per-agent context fills above 70% of the model context window, the agent must summarise older turns to continue. v9 models this as an additional API call costing approximately 30% of context tokens at the input rate plus 30% × 30% (≈9% of context) at the output rate. This is a real production cost ignored by simpler calculators.
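As a sketch (rates illustrative, per 1M tokens):

  def summarisation_cost(context_tokens, window, in_rate, out_rate):
      if context_tokens <= 0.70 * window:
          return 0.0                          # below the 70% trigger
      summarised_in = 0.30 * context_tokens   # older turns re-read as input
      summary_out = 0.30 * summarised_in      # ≈9% of context written back
      return (summarised_in * in_rate + summary_out * out_rate) / 1_000_000

  # 160k-token context in a 200k window at $3 in / $15 out per 1M:
  print(round(summarisation_cost(160_000, 200_000, 3.0, 15.0), 3))   # 0.36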
BURSTINESS / PEAK PROVISIONING
Peak-to-average ratio above 2× incurs a 5% surcharge per ratio level above 2 (representing rate-limit overage charges, queue overflow handling, or premium-tier capacity). E.g. a 5× peak adds 5% × 3 = 15% to base cost.
LANGUAGE MULTIPLIER
Tokenizers compress different languages and model families differently. This multiplier is a coarse planning adjustment, not a provider-tokenizer substitute. Approximate multipliers: English 1.0×, Code 1.3×, Spanish/French 1.2×, Chinese/Japanese/Korean 1.8–2.2×, Arabic/Hebrew 1.4×. Validate with provider token counters or production usage logs before budget submission.