The honest cost of running an AI chatbot

The mistake I almost made

Last week I was about to spin up a $317/month RunPod A10 instance for a chatbot I'm building. The reasoning seemed sound: "API costs scale with traffic, self-hosted is one fixed bill — so self-hosting wins at scale, right?"

Wrong.

I sat down with a spreadsheet for 30 minutes and realised I would have been burning idle compute for at least the first six months. The break-even point isn't where most builders think it is.

This post is the cost table I wish someone had handed me before I started planning the infrastructure.

What both options have in common

Whether you self-host the LLM or call an API, you need the same supporting infrastructure:

Component	Cost
App server (Next.js / FastAPI)	$5–20/mo (Vercel free tier or small VPS)
Vector DB (Supabase / pgvector)	$0–25/mo
Redis cache (Upstash)	$0–10/mo
Embeddings (local CPU model like bge-m3)	$0
Subtotal (both paths)	~$5–55/mo

The difference comes down to how you serve the language model. Everything else is identical.

Path A — self-hosted LLM on a rented GPU

You rent a GPU on RunPod, Lambda Cloud, Hetzner, Vast.ai, or AWS. You run vLLM, llama.cpp, or Ollama on it, serving an open-weight model like Qwen 3, DeepSeek-R1, or Llama 3.3.

Cost structure: fixed, per hour. You pay the same whether you process 1 query or 100,000.

GPU	Provider	$/hr	$/mo 24×7	Capacity (Qwen 7B + vLLM)
T4 16 GB	RunPod	$0.20	$144	~50K cached q/day
A10 24 GB	RunPod	$0.44	$317	~200K cached q/day
L40S 48 GB	RunPod	$0.99	$713	~500K cached q/day
H100 80 GB	RunPod	$2.99	$2,160	1M+ cached q/day
g5.xlarge (A10)	AWS	$1.10	$790	⚠ 2.5× pricier for same hardware

Skip AWS for GPU workloads unless you're already locked in for compliance reasons. At the prices in this post, specialty providers undercut it by roughly 1.5–2.5× on the same class of GPU.
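The monthly column in the GPU table is just the hourly rate run around the clock. Here's a minimal sketch of that arithmetic, assuming a 30-day month; the names `HOURLY_RATES` and `monthly_cost` are mine, and the rates are copied from the table.

```python
HOURS_PER_MONTH = 24 * 30  # assumes a 30-day month, running 24x7

# Hourly rates copied from the GPU table above (USD).
HOURLY_RATES = {
    "RunPod T4": 0.20,
    "RunPod A10": 0.44,
    "RunPod L40S": 0.99,
    "RunPod H100": 2.99,
    "AWS g5.xlarge": 1.10,
}


def monthly_cost(hourly_rate: float) -> float:
    """Cost of keeping one instance up around the clock for a month."""
    return hourly_rate * HOURS_PER_MONTH


for name, rate in HOURLY_RATES.items():
    print(f"{name:<14} ${monthly_cost(rate):,.0f}/mo")

# How much more AWS charges per hour than RunPod for the A10 row.
aws_premium = HOURLY_RATES["AWS g5.xlarge"] / HOURLY_RATES["RunPod A10"]
print(f"AWS premium over RunPod: {aws_premium:.1f}x")
```

The results can differ from the table by a few dollars because the table rounds.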

Path B — cloud LLM API + small instance

You build your app on Vercel or a VPS and call Anthropic, OpenAI, DeepSeek, or Mistral for each query. Pay per token.

Cost structure: variable, per query. Zero idle cost, scales linearly with traffic.

For a typical chatbot (~500 input + 300 output tokens per query):

API	Per query (¥, JPY)	Per query ($)
DeepSeek V3	¥0.08	$0.0005
GPT-4o-mini	¥0.20	$0.0013
Claude Haiku 4.5	¥0.30	$0.002
Claude Sonnet 4.6	¥0.90	$0.006
Claude Opus 4.7	¥4.50	$0.030
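Per-query figures like these come from multiplying token counts by per-token prices. Here's a minimal sketch of that calculation. The function name `cost_per_query` and the $1 / $5 per-million-token rates are illustrative assumptions, not quoted prices; they happen to land on the same $0.002 as the Claude Haiku row.

```python
def cost_per_query(
    input_tokens: int,
    output_tokens: int,
    usd_per_m_input: float,
    usd_per_m_output: float,
) -> float:
    """Dollar cost of one request, given per-million-token prices."""
    total = input_tokens * usd_per_m_input + output_tokens * usd_per_m_output
    return total / 1_000_000


# Assumed example rates: $1 per million input tokens, $5 per million output.
# Check the provider's current price list before relying on these.
example = cost_per_query(500, 300, usd_per_m_input=1.0, usd_per_m_output=5.0)
print(f"${example:.4f} per query")
```

Plug in your own token counts: a chatbot with long system prompts or retrieved context will have far more than 500 input tokens per query.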

The actual head-to-head

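Monthly LLM cost, assuming a 30-day month and a 60% cache hit rate (typical for FAQ-shaped chatbots), so effective LLM calls = 40% of total queries. Above 100K queries/day the self-host column steps up from the A10 to the larger GPUs in the Path A table: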

Daily queries	Self-host A10 ($317)	Claude Haiku	DeepSeek V3	Winner
1,000	$317	$24	$6	API by 13–53×
10,000	$317	$240	$64	API still wins
50,000	$317	$1,200	$320	Self-host ties DeepSeek
100,000	$317	$2,400	$640	Self-host wins
500,000	$317–700	$12,000	$3,200	Self-host wins decisively
1,000,000	$700–2,160	$24,000	$6,400	Self-host wins 3–12×

Read this carefully:

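At 1K queries/day, the API costs $6–24/month. Self-hosting costs $317/month. The API wins by 13× with Claude Haiku and by more than 50× with DeepSeek.
At 10K queries/day, the API still wins: narrowly with Claude Haiku ($240 vs $317), comfortably with DeepSeek ($64).
At 50K queries/day, self-hosting catches DeepSeek and beats Claude Haiku by almost 4×.
At 1M queries/day, self-hosting wins by roughly 11× over Claude Haiku, even on an H100 ($2,160 vs $24,000).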
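If you want to rerun the API columns with your own traffic or hit rate, the arithmetic is short. This is a sketch under the post's assumptions (30-day month, 60% cache hit rate, per-query prices from the API table); the names `monthly_api_cost` and `PRICE_PER_QUERY` are mine.

```python
DAYS_PER_MONTH = 30
CACHE_HIT_RATE = 0.60  # the post's assumption for FAQ-shaped chatbots

# Per-query prices from the API table (USD).
PRICE_PER_QUERY = {"Claude Haiku 4.5": 0.002, "DeepSeek V3": 0.0005}


def monthly_api_cost(
    daily_queries: int,
    price_per_query: float,
    cache_hit_rate: float = CACHE_HIT_RATE,
) -> float:
    """Monthly API bill when only cache misses reach the model."""
    llm_calls_per_day = daily_queries * (1 - cache_hit_rate)
    return llm_calls_per_day * DAYS_PER_MONTH * price_per_query


for daily in (1_000, 10_000, 50_000, 100_000):
    costs = {
        name: round(monthly_api_cost(daily, price))
        for name, price in PRICE_PER_QUERY.items()
    }
    print(f"{daily:>7,} queries/day -> {costs}")
```

The DeepSeek figures come out a few percent below the table ($60 rather than $64 at 10K/day), which suggests the table used an unrounded per-query price.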

The three break-even points

Where self-hosting on a RunPod A10 (~$317/mo) starts winning, by API choice. These are total queries/day, with the same 60% cache hit rate as the table above:

Claude Haiku 4.5 → ~13K queries/day
DeepSeek V3 → ~50K queries/day
Claude Sonnet 4.6 → ~4K queries/day

The cheaper the API you compare against, the later self-hosting pays off: DeepSeek pushes the break-even about 4× higher than Claude Haiku does, and about 12× higher than Claude Sonnet.
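The break-even is the traffic level where the monthly API bill equals the fixed GPU bill. A minimal sketch, assuming the $317 A10, a 30-day month, and a 60% cache hit rate; `break_even_daily_queries` is a name I made up for illustration.

```python
def break_even_daily_queries(
    fixed_monthly_cost: float,
    price_per_query: float,
    cache_hit_rate: float = 0.60,
    days_per_month: int = 30,
) -> float:
    """Total daily queries at which the API bill equals the fixed GPU bill."""
    # Only cache misses are billed, so each incoming query costs a fraction.
    cost_per_incoming_query = price_per_query * (1 - cache_hit_rate)
    return fixed_monthly_cost / (cost_per_incoming_query * days_per_month)


A10_MONTHLY = 317  # RunPod A10, from the GPU table

for name, price in [
    ("Claude Sonnet 4.6", 0.006),
    ("Claude Haiku 4.5", 0.002),
    ("DeepSeek V3", 0.0005),
]:
    daily = break_even_daily_queries(A10_MONTHLY, price)
    print(f"{name}: ~{daily:,.0f} queries/day")
```

Note that a higher cache hit rate moves the break-even further out, because it shrinks the API bill while the GPU bill stays the same.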

The trap: AWS GPU pricing

Same A10-class GPU (AWS's g5 instances use the A10G variant):

RunPod: $0.44/hour
Lambda Cloud: $0.75/hour
AWS g5.xlarge: $1.10/hour

AWS is 2.5× more expensive than RunPod for the same class of hardware. Its value lies in the broader ecosystem (RDS, S3, IAM, SOC2/HIPAA compliance). For pure GPU inference workloads, specialty providers undercut it dramatically.

If you're not enterprise-gated into AWS, don't pay the AWS GPU premium.

The lever everyone forgets: semantic caching

In a real chatbot, ~60–80% of queries are reformulations of the same handful of questions:

"What's the deadline for X?"
"How do I apply for Y?"
"Difference between A and B?"

A semantic cache — embed the question, find any cached entry with cosine similarity above 0.95, serve the cached response — dramatically changes the math. At 60% cache hit rate, your effective LLM call rate is 40% of queries, dropping monthly LLM cost by 60%.

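If you skip caching on the API path, you're paying 2.5–5× more than you need to at a 60–80% hit rate. On the self-hosted path the bill is fixed, so the same cache instead multiplies how many queries one GPU can absorb. Add Upstash Redis with semantic similarity matching on day one. The free tier handles 10,000 commands/day.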
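To make the caching idea concrete, here's a minimal in-memory sketch of the lookup described above: embed the question, compare against stored embeddings by cosine similarity, and only call the model on a miss. Everything named here (`SemanticCache`, `answer_with_cache`, the `embed` and `call_llm` callables) is a hypothetical placeholder, not the Upstash API or any library's interface.

```python
import math
from typing import Callable, Optional, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class SemanticCache:
    """In-memory sketch: linear scan over stored (embedding, answer) pairs."""

    def __init__(self, embed: Callable[[str], Vector], threshold: float = 0.95):
        self.embed = embed          # assumed: any function mapping text to a vector
        self.threshold = threshold  # similarity cutoff from the post
        self.entries = []           # list of (embedding, answer) pairs

    def get(self, question: str) -> Optional[str]:
        """Return the closest cached answer if it clears the threshold."""
        query_vec = self.embed(question)
        best_score, best_answer = 0.0, None
        for vec, answer in self.entries:
            score = cosine_similarity(query_vec, vec)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score > self.threshold else None

    def put(self, question: str, answer: str) -> None:
        self.entries.append((self.embed(question), answer))


def answer_with_cache(
    question: str, cache: SemanticCache, call_llm: Callable[[str], str]
) -> str:
    """Serve from cache on a hit; otherwise call the model and store the result."""
    cached = cache.get(question)
    if cached is not None:
        return cached
    answer = call_llm(question)
    cache.put(question, answer)
    return answer
```

In production you'd swap the linear scan for a vector index and `embed` for a real model such as the bge-m3 embeddings mentioned earlier. The 0.95 threshold is worth tuning: too low and users get answers to questions they didn't ask.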

The cleanest ramp

For a typical product launch with traffic growth from 0 to 100K+ queries/day:

Phase	Daily queries	Stack	Monthly LLM cost
Month 1–3 — validation	0–2K	DeepSeek V3 API, no caching needed	$0–30
Month 4–6 — early traction	2K–10K	DeepSeek V3 + Upstash Redis semantic cache	$30–100
Month 7–9 — scale signal	10K–50K	Migrate to RunPod A10 + vLLM + DeepSeek-R1-distill-7B	$317 fixed
Month 10+ — real scale	50K+	Multi-GPU vLLM cluster, multi-region, observability	$700–2,000+

Pre-building the month-7 stack while you're still in month 1 is the classic startup-killing mistake. Every dollar spent on infrastructure for traffic that doesn't exist yet is a dollar that can't be spent on building demand.
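One way to keep yourself honest is to write the ramp down as a lookup keyed on measured traffic, not on hoped-for traffic. A small sketch using the thresholds from the table above; the names `RAMP` and `recommended_stack` are mine, and treating each upper bound as exclusive is my assumption, since the table's ranges share their endpoints.

```python
# (upper bound on daily queries, stack) pairs, taken from the ramp table.
RAMP = [
    (2_000, "DeepSeek V3 API, no caching"),
    (10_000, "DeepSeek V3 API + semantic cache"),
    (50_000, "RunPod A10 + vLLM (self-hosted)"),
]
TOP_TIER = "Multi-GPU vLLM cluster"


def recommended_stack(daily_queries: int) -> str:
    """Return the ramp phase whose traffic range contains daily_queries."""
    for upper_bound, stack in RAMP:
        if daily_queries < upper_bound:
            return stack
    return TOP_TIER


for measured in (300, 6_000, 25_000, 120_000):
    print(f"{measured:>7,} queries/day -> {recommended_stack(measured)}")
```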

Hidden costs that don't show up in either column

Self-hosting:

Ops time: 2–8 hours/month. If your time is worth $50+/hour, that's $100–400/month not in the cost table.
Cold starts: if you scale-to-zero, expect 30–60 seconds when a new instance spins up. Most products can't tolerate this.
Model updates: new versions ship every few weeks. You decide when to migrate — which is both freedom and burden.

API:

Vendor lock-in: switching from Claude to GPT requires re-testing every prompt. Real cost when it happens.
Rate limits: real at scale, hit harder than expected.
Latency variance: typically 2–3× higher than self-hosted, with a longer tail.

For most builders, the hidden costs keep them on APIs even longer than the visible cost table suggests.
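To see how much the ops-time line alone shifts things, you can fold it into the fixed cost and recompute the break-even. A sketch under the post's numbers (A10 at $317/mo, 2–8 ops hours at $50/hour, Claude Haiku at $0.002 per query, 60% cache hit rate); `break_even_with_ops` is a name I made up for illustration.

```python
def break_even_with_ops(
    gpu_monthly: float,
    ops_hours: float,
    hourly_value: float,
    price_per_query: float,
    cache_hit_rate: float = 0.60,
    days_per_month: int = 30,
) -> float:
    """Daily queries where the API bill equals GPU rent plus ops time."""
    total_fixed = gpu_monthly + ops_hours * hourly_value
    # Monthly API cost contributed by one query per day, after caching.
    api_cost_per_daily_query = (
        price_per_query * (1 - cache_hit_rate) * days_per_month
    )
    return total_fixed / api_cost_per_daily_query


for hours in (0, 2, 8):
    daily = break_even_with_ops(317, hours, 50, price_per_query=0.002)
    print(f"{hours} ops hours/month -> break-even at ~{daily:,.0f} queries/day")
```

This only prices in ops time. Cold starts, lock-in, and rate limits are harder to put a number on, so treat the result as a floor on the real break-even.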

The bottom line

API to validate. GPU to scale.

Don't burn $300/month on idle GPU compute when you have 100 users.

Don't burn $24,000/month on API calls when you have 1M queries a day.

Match the cost shape to the traffic shape. The thing that kills startups isn't picking the wrong stack on day one — it's locking in infrastructure costs before validating that anyone wants the product.