▶Project profile name, team, audience
Identify the deployment. These labels appear at the top of the report. The Public-facing? toggle drives the bot-factor multiplier and informs sensible daily-cap recommendations downstream.
Pick where the LLM runs. Downstream sections (Reservations, Self-host capacity, On-prem amortization) auto-show only when relevant.
Advanced — token budgets per call
| Model | API total | RAG share | Reason share | Guard+ | Tool tok | Tool fees+ | Cache saved | Retry+ | p50 | p90 | p99 |
|---|---|---|---|---|---|---|---|---|---|---|---|
▶Federal compliance & hosting FedRAMP, ATO, egress, audit retention
Federal AI deployments cost more than commercial. Two kinds of overhead apply: multipliers (FedRAMP/GovCloud premium and multi-region redundancy multiply your LLM compute cost), and additive line items (ATO, egress, log retention, vector DB hosting). Set to "none / single / 0" if you're modeling a commercial deployment.
Hosting multipliers
Multipliers stack — e.g. FedRAMP High + active-active = ×2.60 on LLM and GPU costs.
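A minimal sketch of the stacking rule, using illustrative factor values (×1.30 FedRAMP High, ×2.00 active-active) chosen only to reproduce the ×2.60 example; they are assumptions, not shipped defaults:

```ts
// Illustrative sketch; the factor values are assumptions, not the calculator's defaults.
const hostingMultipliers = [1.3, 2.0]; // assumed FedRAMP High premium, active-active redundancy

// Multipliers stack multiplicatively and apply to LLM + GPU compute cost only.
const stacked = hostingMultipliers.reduce((acc, m) => acc * m, 1.0); // x2.60
const adjustedCompute = (llmAndGpuMonthly: number): number => llmAndGpuMonthly * stacked;
```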
Additive monthly costs
Not modeled: ATO assessment labor (use Agent engineering or Personnel for that), sole-source procurement overhead, GSA Schedule discounts, model availability constraints (some Anthropic/OpenAI models aren't in GovCloud). Add fixed-cost approximations as line items in Infrastructure below.
▶Your typical question token sizes for one canonical query
A "token" is roughly ¾ of a word (so 1,000 tokens ≈ 750 English words). The cost depends on how big each question and answer is. Don't know your numbers? Use the wizard below — it asks plain-English questions and figures out the token math for you.
▶Question types full-pipeline, lookup-only, refusal, …
Different questions have different token costs. A simple lookup uses fewer tokens than a full multi-tool pipeline. Each shape's factor scales the typical query above (1.0 = same size; 0.5 = half).
▶Multi-agent pipeline optional — if your system uses more than one LLM call per query
Many real deployments make several LLM calls per user question — e.g., a Planner decides which tools to call, a Retriever ranks documents, a Tool agent formulates a database query, a Summarizer writes the answer. Each agent has its own token budget. If you fill this in, the engine uses agent-sum instead of "Your typical question" above. Leave empty if your deployment is single-call.
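A sketch of the agent-sum: per-agent token budgets summed into one per-query total. The stage names and budgets below are illustrative assumptions, not the calculator's actual schema:

```ts
// Illustrative pipeline; stage names and token budgets are assumptions.
interface AgentBudget { name: string; inTokens: number; outTokens: number }

const pipeline: AgentBudget[] = [
  { name: "planner",    inTokens: 1_200, outTokens: 200 },
  { name: "retriever",  inTokens: 3_000, outTokens: 150 },
  { name: "summarizer", inTokens: 2_500, outTokens: 600 },
];

// Agent-sum: this total replaces the single "typical question" budget above.
const perQuery = pipeline.reduce(
  (acc, a) => ({ in: acc.in + a.inTokens, out: acc.out + a.outTokens }),
  { in: 0, out: 0 },
); // { in: 6700, out: 950 }
```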
Common pipeline templates
▶Traffic mix presets how questions split across types
Each preset is a different traffic distribution — e.g., "worst case" treats every question as full-pipeline; "mixed" reflects realistic production traffic. Weights must sum to 1.
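A sketch of how a preset's weights blend the per-type factors into one effective scale on the typical query. The type names, factors, and weights are illustrative:

```ts
// Illustrative preset; shape factors and weights are assumptions.
const mix = [
  { type: "full-pipeline", factor: 1.0, weight: 0.3 },
  { type: "lookup-only",   factor: 0.5, weight: 0.6 },
  { type: "refusal",       factor: 0.1, weight: 0.1 },
];

const weightSum = mix.reduce((s, m) => s + m.weight, 0);
if (Math.abs(weightSum - 1) > 1e-9) throw new Error("weights must sum to 1");

// Effective multiplier applied to the typical-query token budget: 0.61 here.
const blendedFactor = mix.reduce((s, m) => s + m.weight * m.factor, 0);
```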
▶Your users monthly active users, sessions, questions
Break your audience into segments — e.g., "internal staff" vs "public visitors". Each has a different usage pattern. Anonymous public segments get a "bot factor" multiplier to account for crawlers and abuse.
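A sketch of per-segment monthly volume under assumed field names; the bot factor (>1) applies only to anonymous public segments:

```ts
// Field names are assumptions; the bot factor inflates anonymous traffic for crawlers/abuse.
interface Segment {
  mau: number;
  sessionsPerUser: number;
  questionsPerSession: number;
  botFactor: number; // 1.0 for authenticated internal staff
}

const monthlyQueries = (s: Segment): number =>
  s.mau * s.sessionsPerUser * s.questionsPerSession * s.botFactor;

monthlyQueries({ mau: 50_000, sessionsPerUser: 2, questionsPerSession: 3, botFactor: 1.4 });
// = 420,000 queries/mo
```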
▶API reservations / committed-spend discount or PTU vs on-demand
Reserved API capacity (Azure PTU, AWS Bedrock provisioned, OpenAI Enterprise commit) buys a discount in exchange for commitment. Often 30–50% effective savings at >$10K/mo on-demand spend. Choose the plan + size in units (PTU-style) — the engine swaps in the right cost model. Requires API hosting.
Edit the rates / discounts in the Prices tab → API reservations table. Effective monthly cost shows in the report's "Reservation savings" row.
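A sketch of the swap the engine makes, with illustrative numbers; the unit sizing and committed rate are assumptions, not vendor quotes:

```ts
// Illustrative only; edit real rates in the Prices tab.
const onDemandMonthly = 15_000;  // $/mo at list price for the same workload
const units = 4;                 // PTU-style reserved units purchased
const unitMonthly = 2_400;       // assumed $/unit/mo committed rate

const reservedMonthly = units * unitMonthly;        // $9,600/mo
const savings = onDemandMonthly - reservedMonthly;  // $5,400/mo
const savingsPct = savings / onDemandMonthly;       // 36%, inside the 30-50% band
```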
▶Embeddings (RAG) ingest cost + per-query embedding
RAG systems pay for embeddings twice: once at ingest (chunking the corpus into vectors, amortized over the re-embed cycle) and once at query time (embedding the user question to do vector search). For a 100M-token corpus + 1M queries/mo this often runs $50–$500/mo. Skip if not using RAG.
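A sketch of the two charges. The embedding rate, question length, and re-embed cycle below are placeholder assumptions; real rates live in the Prices tab:

```ts
// All numbers are illustrative assumptions.
const embedRatePerMTok = 0.10;   // assumed $ per 1M embedding tokens
const corpusTokens = 100e6;      // 100M-token corpus
const reembedCycleMonths = 6;    // corpus fully re-embedded twice a year

// Ingest: one-time embed cost amortized over the re-embed cycle.
const ingestMonthly = (corpusTokens / 1e6) * embedRatePerMTok / reembedCycleMonths;

// Query time: embedding each user question for vector search.
const queriesPerMonth = 1e6;
const tokensPerQuestion = 40;    // assumed average question length
const queryMonthly = (queriesPerMonth * tokensPerQuestion / 1e6) * embedRatePerMTok;
```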
▶Migration timeline phased deployment over 36 months
Real procurement decisions span phases — "API year 1, hybrid year 2, self-host year 3". Each phase has its own hosting + reservation. The engine computes monthly cost per phase and total 3-year TCO. The chart in the report visualizes cost-over-time so you can see where transitions break even.
▶Personnel / staffing FTE allocations × loaded annual salary
Federal RFPs require fully-loaded labor in the cost basis. Each role: FTE allocation × annual base × total-comp multiplier (1.30 = +30% benefits/overhead) ÷ 12. Set FTE to 0 to skip a role. Salaries are editable in the Prices tab → Personnel. This is ongoing post-launch staffing; upfront design effort is in Agent engineering below.
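The stated formula as a one-liner, with an assumed salary for illustration:

```ts
// FTE allocation x annual base x total-comp multiplier / 12, per the formula above.
const monthlyLaborCost = (fte: number, annualBase: number, compMult = 1.30): number =>
  (fte * annualBase * compMult) / 12;

monthlyLaborCost(0.5, 120_000); // 0.5 FTE x $120k x 1.30 / 12 = $6,500/mo
```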
▶Agent engineering upfront design effort + maintenance — amortized monthly
Effort to design, build, and validate the agent before launch — SME interviews, requirements/spec, evaluation criteria, prototype, calibration runs. Roles × FTE × duration. Methodology-agnostic; defaults approximate a CARE-style engagement. Amortized over deployment lifetime so it shows up as a monthly line item alongside Operations and Personnel. Ongoing post-launch staffing lives in Personnel above.
Roles × FTE during design phase
Each row uses the same loaded-salary math as Personnel. Edit role salaries in Prices → Personnel.
Maintenance — re-engineering cadence
▶Traffic safeguards bot rate limiting
Bot rate-limiting protects against runaway cost from anonymous traffic. (The old $/day spending cap was retired — it distorted the cost math by clipping queries at an artificial threshold. To answer "what scale fits my budget?", use the Budget solver section above.)
Public endpoints get hammered by bots. Rate-limiting strategy ranges from cheap edge filters to full WAF + CAPTCHA. The bot ceiling caps the "bot factor" that scales anonymous traffic (set in Project profile).
▶Self-host capacity only matters if you self-host
A catalog of GPU specs the engine can use. The currently-selected GPU is marked with a ●. Hosting strategy is set in Deployment posture at the top.
Critical for bursty traffic. A NOAA storm explainer might run only ~10% of the month; cutting GPU hours 10× changes the API-vs-self-host tradeoff dramatically.
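A sketch of why duty cycle dominates; the hourly rate is an assumed H100-class figure, not a quote:

```ts
// Assumed instance price; real GPU specs and rates come from the catalog above.
const gpuHourly = 8.0;   // assumed $/hr, H100-class
const hoursPerMonth = 730;

const alwaysOn  = gpuHourly * hoursPerMonth;  // $5,840/mo
const burst10pc = alwaysOn * 0.10;            // $584/mo, the 10x swing noted above
```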
▶Budget solver given a budget, find the affordable scale
Inverse of the calculator: enter a target monthly budget and the engine solves for the maximum MAU the deployment can support at the current cost-per-query, plus which tradeoffs (cache, model swap, batch tier) would unlock more headroom.
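A sketch of the inverse solve, assuming variable cost scales linearly with MAU and fixed costs are carved out first; all names and inputs are illustrative:

```ts
// Assumes linear scaling in MAU; all inputs are illustrative.
const maxAffordableMau = (
  budgetMonthly: number,
  fixedMonthly: number,          // infra, personnel, amortized engineering
  costPerQuery: number,
  queriesPerMauPerMonth: number,
): number =>
  Math.floor((budgetMonthly - fixedMonthly) / (costPerQuery * queriesPerMauPerMonth));

maxAffordableMau(10_000, 2_000, 0.004, 6); // = 333,333 MAU
```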
▶Model cost comparison switch models to see headline savings
Re-runs the full calculation against every available model — same workload, same hosting, same infra, same federal multipliers. Sorted cheapest to most expensive. Δ shows annual difference vs your current pick.
| Model | $/query | $/month | $/year | Δ vs current (annual) |
|---|---|---|---|---|
⚠ Cheaper ≠ better. Smaller models (gpt-5-nano, haiku-4.5) cost a fraction of flagship models but lose accuracy on multi-step reasoning, RAG faithfulness, and tool use. Validate any switch against your own evaluation set before committing — savings here are only meaningful if the cheaper model still meets your quality bar.
▶Sensitivity how robust is the headline to input drift?
Each row perturbs one input around its baseline and shows the resulting monthly cost. Sorted by impact (biggest driver on top). The black tick on each bar marks the baseline; red extends to the low case, green to the high case. Useful for budget defense: if a 20% MAU miss blows past your budget ceiling, you need either a larger envelope or stronger optimization levers before signing.
| Lever | Low case | Headline range | High case |
|---|---|---|---|
Perturbations: MAU ±20%, Cache hit rate ±10pp, Provider rates ±15%, Bot factor ±20%, Turns/session ±20%. Cache and rate perturbations approximate the deltas analytically — re-run with your own scenario knobs for exact figures.
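A sketch of the one-at-a-time perturbation loop. The real engine perturbs the specific levers listed above; here `costModel` stands in for the full calculation and the ±20% default is illustrative:

```ts
// One-at-a-time sensitivity: perturb each input +/-pct, rank by impact.
type Inputs = Record<string, number>;

const sensitivity = (base: Inputs, costModel: (i: Inputs) => number, pct = 0.20) =>
  Object.keys(base)
    .map((lever) => ({
      lever,
      low:  costModel({ ...base, [lever]: base[lever] * (1 - pct) }),
      high: costModel({ ...base, [lever]: base[lever] * (1 + pct) }),
    }))
    .sort((a, b) => Math.abs(b.high - b.low) - Math.abs(a.high - a.low));
```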
▶Cost over time monthly + cumulative projection given growth%
Projects the monthly headline forward using the Growth/month slider in the simulator. Shows monthly cost climbing month-over-month plus the running 36-month cumulative. Useful for sizing 1-year and 3-year budget envelopes.
Compounds the current simulator growth rate monthly. Caps at 36 months since further projections are unreliable. Headcount, seat-license, and contract reservations don't grow proportionally — adjust manually for those.
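The compounding in code, assuming a flat monthly growth rate applied to the whole headline (which is why fixed items need manual adjustment):

```ts
// Compound growth, capped at 36 months per the note above.
const project = (baseMonthly: number, growthPerMonth: number, months = 36) => {
  let cumulative = 0;
  return Array.from({ length: months }, (_, m) => {
    const monthly = baseMonthly * Math.pow(1 + growthPerMonth, m);
    cumulative += monthly;
    return { month: m + 1, monthly, cumulative };
  });
};

project(5_000, 0.05).at(-1); // month 36: ~$27.6k/mo, ~$479k cumulative
```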
▶Side-by-side compare two scenarios diffed line-by-line
Pick two scenarios. The calculator runs each through the same engine and shows the inputs and outputs side by side, with the % difference highlighted. Current = your live scenario in this calculator.
▶AS-IS vs proposed compare against your current contract
Enter what you're paying today (or what an incumbent vendor quoted you). The calculator surfaces the delta vs the proposed deployment: savings or overrun, plus the payback window if there's a one-time migration cost.
Today's annual spend on whatever this deployment replaces. Vendor invoice, internal cost-allocation, or incumbent quote.
Switching cost — data migration, training, parallel running, contract exit fees. Use 0 if greenfield.
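The payback math as a sketch (names illustrative):

```ts
// Payback window in months; null when the proposal never pays back.
const paybackMonths = (
  asIsMonthly: number,
  proposedMonthly: number,
  migrationOneTime: number,
): number | null => {
  const savings = asIsMonthly - proposedMonthly;
  return savings > 0 ? migrationOneTime / savings : null;
};

paybackMonths(40_000, 25_000, 90_000); // = 6 months
```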
▶Infrastructure database, storage, networking, monitoring — fixed monthly costs
Fixed monthly costs that show up regardless of LLM hosting choice — RDS, S3, CloudWatch, ALB, NAT Gateway, Route 53, etc. Usually the smaller part of total bill but they add up. Each line accepts flat $/mo, $ per query, or $ per GB scaling.
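A sketch of the three scaling modes a line item can use; the type names are assumptions:

```ts
// Each infrastructure line is flat, per-query, or per-GB.
type InfraItem =
  | { kind: "flat";     usdPerMonth: number }
  | { kind: "perQuery"; usdPerQuery: number }
  | { kind: "perGB";    usdPerGB: number };

const itemMonthly = (item: InfraItem, queries: number, gbStored: number): number => {
  switch (item.kind) {
    case "flat":     return item.usdPerMonth;
    case "perQuery": return item.usdPerQuery * queries;
    case "perGB":    return item.usdPerGB * gbStored;
  }
};
```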
📊 Published cost benchmarks
Real cost numbers cited in vendor pricing pages, earnings calls, GAO reports, and academic studies. Use these to sanity-check your calculator output and to defend procurement budgets with citations. Click any source link to verify the number.
💲 Price book
Single source of truth for every price the calculator uses — LLM rates, API reservations, GPU instances, embeddings, vector DBs, AWS infra, personnel salaries, ATO costs. Edits here override the defaults for this workload only and persist in the URL hash. Defaults are validated against vendor pricing pages; last full refresh: —. Future plan: a scraper periodically fetches each source_url and bumps last_verified.
Cost composition
API vs self-host comparison
Per-segment breakdown
Infrastructure breakdown
Derivation of your numbers
Full math trace for this workload — copy-paste into ChatGPT/Claude/Gemini and ask "verify this math". For source citations and formula details, see Methodology in the sidebar.
Methodology, sources & disclosures planning only — refresh prices before procurement
Pricing sources, token-counting heuristics, confidence-interval math, and what this calculator does not model. For the math applied to your numbers, see Derivation of your numbers above.
- Anthropic API: Claude Opus 4.7, Sonnet 4.6, Haiku 4.5; includes cache write/read pricing
- OpenAI API: GPT-5.5, GPT-5.5 Pro, GPT-5.4 family, embeddings, web/file search, containers, and short/long context price tiers
- Google Gemini API: Gemini 3.1 Pro Preview, Gemini 3 Flash Preview, Gemini 3.1 Flash-Lite Preview, Gemini 2.5 Flash, Gemini 2.5 Flash-Lite
- Together AI: Llama 3.3 70B Turbo (provider-specific bootstrap rate)
- p90 = exp(μ + 1.282σ)
- p99 = exp(μ + 2.326σ)
- σ = √(ln(1 + CV²)), where CV = coefficient of variation weighted by task mix
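Putting the three formulas together, taking μ = ln(p50) (the standard lognormal identity: the median of a lognormal is e^μ):

```ts
// Lognormal tail percentiles from the median and the mix-weighted CV.
const tokenPercentiles = (p50: number, cv: number) => {
  const sigma = Math.sqrt(Math.log(1 + cv * cv)); // sigma = sqrt(ln(1 + CV^2))
  const mu = Math.log(p50);                       // median of a lognormal = e^mu
  return {
    p90: Math.exp(mu + 1.282 * sigma),
    p99: Math.exp(mu + 2.326 * sigma),
  };
};

tokenPercentiles(4_000, 0.6); // { p90: ~8,140, p99: ~14,530 } tokens per query
```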
self_host.setup_amortized: Σ over roles of (FTE × loaded annual salary) × design-phase months ÷ 12 → upfront total → ÷ amortization months → amortized monthly. Maintenance accounts for periodic re-specification as the domain drifts (design-lead loaded hourly × hours per session ÷ months between sessions).
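The chain in code, using the shipped FTE/duration defaults from the CARE example below; the loaded salaries and 36-month lifetime are assumptions:

```ts
// Salaries are illustrative; FTEs and the 4-month phase follow the shipped defaults.
const designRoles = [
  { fte: 0.5,  loadedAnnual: 180_000 }, // SME
  { fte: 1.0,  loadedAnnual: 200_000 }, // design lead
  { fte: 1.0,  loadedAnnual: 190_000 }, // developer
  { fte: 0.25, loadedAnnual: 170_000 }, // eval engineer
];
const designMonths = 4;
const amortizationMonths = 36; // assumed deployment lifetime

const upfrontTotal =
  designRoles.reduce((s, r) => s + r.fte * r.loadedAnnual, 0) * (designMonths / 12); // ~$174,167
const setupAmortizedMonthly = upfrontTotal / amortizationMonths;                     // ~$4,838/mo
```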
Methodology-agnostic. The same shape models any structured agent-design methodology — DSPy, structured-prompt extraction, design-pattern playbooks, etc. Adjust role FTEs and duration to match your engagement.
Example — CARE (Collaborative Agent Reasoning Engineering): a three-party stage-gated approach pairing subject-matter experts, developers, and helper agents to produce structured agent specifications (interaction requirements, reasoning policies, evaluation criteria). The defaults shipped here (4-month design phase, 0.5/1.0/1.0/0.25 FTE for SME / design lead / developer / eval engineer, ~$400/mo helper-agent budget, quarterly re-spec cadence) approximate a CARE engagement. Ramachandran, Jha & Ramasubramanian, 2026 (arXiv:2604.28043).
- Images: 1568 tokens per 1568×1568 image (Anthropic), variable for OpenAI low/high detail mode
- Audio (STT): ~25 tokens/sec for English speech (Whisper baseline)
- PDF pages: ~1500 tokens/page average (varies by content density)
- Code interpreter: stdout/stderr counted as input tokens on next turn
- Fine-tuning training cost amortisation (separate calculator recommended)
- Infrastructure/hosting costs for self-hosted deployments (separate calculator recommended)
- Human-in-the-loop reviewer time costs (separate calculator recommended)
- Volume discount tiers (negotiate directly with vendors above 100M tok/mo)
- Network egress, storage, vector DB operational costs beyond the optional file-search/container placeholders
- Compliance overhead (FedRAMP, HIPAA, SOC2 audit costs)
- Latency SLA penalty costs
- Long-form simulation messages. When in workflow mode, both user prompts and agent responses use realistic multi-paragraph templates that match what research-orchestration agents actually produce — structured outputs with sections like "What I observed", "What cannot be determined", "What I need from you", "Risks detected". Token counts shown in the live stream now reflect realistic session content.
- DAG topology controls. Three execution patterns: Sequential (linear pipeline, 1 agent at a time), Parallel (all stages concurrently — wall-clock = max stage time but hits rate limits), Hybrid (sequential trunk + parallel sub-branches at certain stages, realistic for AKD-style workflows). Cost is identical regardless of topology, but parallelism affects rate limit overage costs and concurrent quota utilisation.
- Concurrent quota model. Each provider has a max concurrent request quota (e.g. Anthropic Tier 2 = 50 concurrent). Exceeding it triggers queueing/throttling — modelled as 2% surcharge per overflow request. Rate limit overage on retries adds 1.5× cost on the failed fraction.
- Sequential chain handoff — each stage's output becomes part of the next stage's input. Slider controls the % of prior output passed forward (typical 70–90%).
- Bulk document ingestion — PDFs/session × pages × tokens/page added to the % of stages that read the corpus. Distinct from RAG retrieval.
- Partial rerun — % of stages re-executed when users review and reject output. Multiplies base cost.
- Fact-check sidecar — separate verification call per stage with its own (typically cheaper) model. Common in research workflows where every output is checked.
- Template amortization — workflow planning is one-time but template runs many times. Subtracts a small saving as runs/template grows.
- HITL pause storage — session state retention during user review pauses. Negligible per session, accumulates at scale.
- Per-provider tool fees. Web search, file search, and container-session fees are now keyed by the agent's provider+family. Anthropic web search is $10/1k; OpenAI Assistants $10/1k web + $2.50/1k file + $0.03/container session; Google Vertex Search Grounding $35/1k. Bedrock/Azure/OpenRouter pass through with no separate billing. Self-hosted has zero tool fees.
- Tunable cache write share. Cached tokens are split between cache reads (cheap, 90% discount) and cache writes (premium, ~25% over list price for the first hit on Anthropic's 5-minute TTL). The new slider lets you model "cold start" vs "steady state" deployments — the share defaults to 10% (steady state) but rises to 30–50% for low-volume or fresh deployments where the 5-minute TTL frequently expires. (Sketch after this list.)
- Multi-tier long context. Models can now declare a tiers[] array with multiple price levels (e.g. Gemini standard ≤200k, long 200k–1M, ultra-long >1M). The legacy longThreshold/longIn/longOut binary format remains supported. The engine picks the highest thresholdAt ≤ current input size (see the sketches below).
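Two sketches for the last two items above. First, the blended cache price implied by the write-share split, assuming Anthropic-style rates (read = 0.10× list, write = 1.25× list):

```ts
// Blended effective rate for cached input tokens; list rate in $/MTok.
const blendedCachedRate = (listInRate: number, writeShare = 0.10): number =>
  listInRate * (writeShare * 1.25 + (1 - writeShare) * 0.10);

blendedCachedRate(3.0);       // steady state (10% writes): ~$0.65/MTok
blendedCachedRate(3.0, 0.4);  // cold start (40% writes):    $1.68/MTok
```

Second, tier selection: the highest thresholdAt ≤ current input size, falling back to the lowest tier. Field names follow the bullet above; the example rates are assumptions:

```ts
interface PriceTier { thresholdAt: number; inPerMTok: number; outPerMTok: number }

const pickTier = (tiers: PriceTier[], inputTokens: number): PriceTier =>
  [...tiers]
    .sort((a, b) => a.thresholdAt - b.thresholdAt)
    .filter((t) => t.thresholdAt <= inputTokens)
    .at(-1) ?? tiers[0];

const geminiLike: PriceTier[] = [
  { thresholdAt: 0,       inPerMTok: 1.25, outPerMTok: 10 }, // assumed standard tier
  { thresholdAt: 200_000, inPerMTok: 2.50, outPerMTok: 15 }, // assumed long tier
];
pickTier(geminiLike, 350_000); // -> long tier
```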
- Managed API: 1.00× (direct vendor list price)
- BYOK: 1.00× (bring-your-own-key, no aggregator markup)
- AWS Bedrock: 1.05× (typical 5% AWS markup, GovCloud available)
- Azure OpenAI: 1.00× (parity with OpenAI direct, ent compliance)
- OpenRouter: 1.05× (typical aggregator markup, varies)
- Self-Hosted: 0× per-token + $5,000/mo fixed cost (default; configurable; covers H100/A100 instance + light ops)
retry_rate × 1.5 × base_cost — accounting for partial output already generated before failure plus the full retry call. Previous versions used retry_rate × 1.0 × base_cost, which underestimates real retry waste by ~33%.
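The stated model in code:

```ts
// Failed calls burn partial output before failing, then pay for a full retry: 1.5x weighting.
const retryOverhead = (baseMonthlyCost: number, retryRate: number): number =>
  retryRate * 1.5 * baseMonthlyCost;

retryOverhead(10_000, 0.02); // $300/mo extra on a $10k/mo base at a 2% retry rate
```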