The mistake I almost made
Last week I was about to spin up a $317/month RunPod A10 instance for a chatbot I'm building. The reasoning seemed sound: "API costs scale with traffic, self-hosted is one fixed bill — so self-hosting wins at scale, right?"
Wrong.
I sat down with a spreadsheet for 30 minutes and realised I would have been burning idle compute for at least the first six months. The break-even point isn't where most builders think it is.
This post is the cost table I wish someone had handed me before I started planning the infrastructure.
What both options have in common
Whether you self-host the LLM or call an API, you need the same supporting infrastructure:
| Component | Cost |
|---|---|
| App server (Next.js / FastAPI) | $5–20/mo (Vercel free tier or small VPS) |
| Vector DB (Supabase / pgvector) | $0–25/mo |
| Redis cache (Upstash) | $0–10/mo |
| Embeddings (local CPU model like bge-m3) | $0 |
| Subtotal (both paths) | ~$5–55/mo |
The difference comes down to how you serve the language model. Everything else is identical.
Path A — self-hosted LLM on a rented GPU
You rent a GPU on RunPod, Lambda Cloud, Hetzner, Vast.ai, or AWS. You run vLLM, llama.cpp, or Ollama on it, serving an open-weight model like Qwen 3, DeepSeek-R1, or Llama 3.3.
Cost structure: fixed, per hour. You pay the same whether you process 1 query or 100,000.
| GPU | Provider | $/hr | $/mo 24×7 | Capacity (Qwen 7B + vLLM) |
|---|---|---|---|---|
| T4 16 GB | RunPod | $0.20 | $144 | ~50K cached q/day |
| A10 24 GB | RunPod | $0.44 | $317 | ~200K cached q/day |
| L40S 48 GB | RunPod | $0.99 | $713 | ~500K cached q/day |
| H100 80 GB | RunPod | $2.99 | $2,160 | 1M+ cached q/day |
| g5.xlarge (A10) | AWS | $1.10 | $790 | ⚠ 2.5× pricier for same hardware |
Skip AWS for GPU workloads unless you're already locked in for compliance reasons. On the A10 prices quoted in this post, specialty providers undercut it by roughly 1.5–2.5×.
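The monthly column is just the hourly rate multiplied out. A minimal sketch, assuming a 720-hour (24 × 30) month, which is the convention the table appears to use; the function name is mine:

```python
HOURS_PER_MONTH = 24 * 30  # assumption: 720 hours, i.e. a 30-day month


def monthly_cost(hourly_rate, hours=HOURS_PER_MONTH):
    """Cost of keeping one instance running for `hours` hours."""
    return hourly_rate * hours


# $/hr, copied from the table above
rates = {
    "RunPod T4": 0.20,
    "RunPod A10": 0.44,
    "RunPod L40S": 0.99,
    "AWS g5.xlarge": 1.10,
}

for name, rate in rates.items():
    print(f"{name}: ${monthly_cost(rate):,.0f}/mo")

# How much more AWS charges for the same A10
aws_premium = rates["AWS g5.xlarge"] / rates["RunPod A10"]
print(f"AWS premium over RunPod: {aws_premium:.1f}x")
```

I left the H100 row out on purpose: $2.99/hr over 720 hours comes to about $2,153, slightly under the $2,160 listed, so that row was probably rounded from $3.00/hr.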
Path B — cloud LLM API + small instance
You build your app on Vercel or a VPS and call Anthropic, OpenAI, DeepSeek, or Mistral for each query. Pay per token.
Cost structure: variable, per query. Zero idle cost, scales linearly with traffic.
For a typical chatbot (~500 input + 300 output tokens per query):
| API | Per query (JPY ¥) | Per query (USD $) |
|---|---|---|
| DeepSeek V3 | ¥0.08 | $0.0005 |
| GPT-4o-mini | ¥0.20 | $0.0013 |
| Claude Haiku 4.5 | ¥0.30 | $0.002 |
| Claude Sonnet 4.6 | ¥0.90 | $0.006 |
| Claude Opus 4.7 | ¥4.50 | $0.030 |
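Per-query cost is token counts multiplied by per-token prices. A rough sketch of that arithmetic for the 500-in / 300-out query above. The per-million-token prices are placeholder assumptions for illustration, not quotes; check the provider's pricing page before relying on them:

```python
def cost_per_query(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Dollar cost of one request, given prices in $ per million tokens."""
    return (input_tokens * input_price_per_m + output_tokens * output_price_per_m) / 1_000_000


# Assumed prices in $ per million tokens (hypothetical, for illustration only)
ASSUMED_INPUT_PRICE = 1.00
ASSUMED_OUTPUT_PRICE = 5.00

per_query = cost_per_query(500, 300, ASSUMED_INPUT_PRICE, ASSUMED_OUTPUT_PRICE)
print(f"${per_query:.4f} per query")
```

With those assumed prices the result lands on the same $0.002 as the Claude Haiku 4.5 row. Output tokens usually dominate, so a chatbot that writes long answers will cost more than this table suggests.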
The actual head-to-head
Assuming a 60% cache hit rate (typical for FAQ-shaped chatbots), effective LLM calls are 40% of total queries. Monthly figures assume a 30-day month:
| Daily queries | Self-host A10 ($317) | Claude Haiku | DeepSeek V3 | Winner |
|---|---|---|---|---|
| 1,000 | $317 | $24 | $6 | API by 13× |
| 10,000 | $317 | $240 | $64 | API still wins |
| 50,000 | $317 | $1,200 | $320 | Self-host ties DeepSeek |
| 100,000 | $317 | $2,400 | $640 | Self-host wins |
| 500,000 | $317–700 | $12,000 | $3,200 | Self-host wins decisively |
| 1,000,000 | $700–2,160 | $24,000 | $6,400 | Self-host wins 3–12× |
Read this carefully:
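- At 1K queries/day, the API costs $6–24/month and self-hosting costs $317/month. The API wins by 13× with Claude Haiku and by roughly 50× with DeepSeek.
- At 10K queries/day, the API still wins: narrowly with Claude Haiku ($240), comfortably with DeepSeek ($64).
- At 50K queries/day, self-hosting ties DeepSeek and beats Claude Haiku by almost 4×.
- At 1M queries/day, self-hosting beats Claude Haiku by an order of magnitude, even on an H100.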
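If you want to plug in your own numbers, the table boils down to a few lines. This is a sketch under my assumptions: a 30-day month, and a DeepSeek rate of $0.00053 per call (¥0.08 at roughly ¥150 to the dollar, before rounding). I stop at 100K queries/day because beyond that a single A10 is near its capacity and the fixed cost is no longer $317.

```python
DAYS_PER_MONTH = 30      # assumption: 30-day month
CACHE_HIT_RATE = 0.60    # from the text above
SELF_HOST_A10 = 317.0    # $/mo, RunPod A10 running 24x7

# $ per LLM call. Claude Haiku is the table value; the DeepSeek figure is an
# assumption: the yen price converted at about 150 yen per dollar, unrounded.
PER_QUERY = {"Claude Haiku": 0.002, "DeepSeek V3": 0.00053}


def monthly_api_cost(daily_queries, per_query_cost, cache_hit_rate=CACHE_HIT_RATE):
    """Monthly API bill when only cache misses reach the LLM."""
    llm_calls = daily_queries * DAYS_PER_MONTH * (1 - cache_hit_rate)
    return llm_calls * per_query_cost


for daily in (1_000, 10_000, 50_000, 100_000):
    costs = {name: monthly_api_cost(daily, price) for name, price in PER_QUERY.items()}
    cheapest = min(costs, key=costs.get)
    verdict = "API" if costs[cheapest] < SELF_HOST_A10 else "self-host"
    summary = ", ".join(f"{name} ${cost:,.0f}" for name, cost in costs.items())
    print(f"{daily:>7,} q/day: {summary} -> {verdict} is cheaper")
```

At 50K queries/day the DeepSeek bill comes out within a dollar or two of the A10, so treat that row as a tie, not a win for either side.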
The three break-even points
Where self-hosting on a RunPod A10 (~$317/mo) starts winning, by API choice. Figures are total queries per day, with the 60% cache hit rate already applied:
- Claude Sonnet 4.6 → ~4K queries/day
- Claude Haiku 4.5 → ~13K queries/day
- DeepSeek V3 → ~50K queries/day
The cheaper the API you compare against, the later self-hosting pays off: DeepSeek pushes the break-even almost 4× higher than Claude Haiku does.
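The break-even is the fixed monthly bill divided by what one query per day costs on the API over a month. A minimal sketch, assuming a 30-day month and the 60% cache hit rate used above; the DeepSeek per-call price of $0.00053 is my unrounded conversion of the yen figure, not an official quote:

```python
def break_even_daily_queries(fixed_monthly_cost, per_query_cost,
                             cache_hit_rate=0.60, days_per_month=30):
    """Daily query volume at which a per-query API bill equals a fixed monthly bill."""
    if not 0 <= cache_hit_rate < 1:
        raise ValueError("cache_hit_rate must be in [0, 1)")
    # Monthly API cost contributed by one query per day, after cache hits
    api_cost_per_daily_query = per_query_cost * (1 - cache_hit_rate) * days_per_month
    return fixed_monthly_cost / api_cost_per_daily_query


A10_MONTHLY = 317.0  # $/mo, RunPod A10

# $ per LLM call; the DeepSeek value is an assumption (see lead-in)
per_query = {
    "Claude Sonnet 4.6": 0.006,
    "Claude Haiku 4.5": 0.002,
    "DeepSeek V3": 0.00053,
}

for api, price in per_query.items():
    q = break_even_daily_queries(A10_MONTHLY, price)
    print(f"{api}: ~{q / 1000:.0f}K queries/day")
```

Raising the cache hit rate moves every break-even upward, because fewer queries reach the paid API while the GPU bill stays flat.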
The trap: AWS GPU pricing
Same A10 GPU:
- RunPod: $0.44/hour
- Lambda Cloud: $0.75/hour
- AWS g5.xlarge: $1.10/hour
AWS charges 2.5× RunPod's rate for the same hardware. Its value lies in the broader ecosystem (RDS, S3, IAM, SOC2/HIPAA compliance). For pure GPU inference workloads, specialty providers are far cheaper.
If you're not enterprise-gated into AWS, don't pay the AWS GPU premium.
The lever everyone forgets: semantic caching
In a real chatbot, ~60–80% of queries are reformulations of the same handful of questions:
- "What's the deadline for X?"
- "How do I apply for Y?"
- "Difference between A and B?"
A semantic cache embeds the question, looks for a cached entry with cosine similarity above 0.95, and serves the cached response on a match. That changes the math dramatically. At a 60% cache hit rate, only 40% of queries reach the LLM, which cuts per-token API spend by 60%.
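Here is roughly what that looks like in code. It's a sketch, not production code: `SemanticCache`, `answer`, `fake_llm`, and the lookup-table "embedding model" are all hypothetical stand-ins I made up so the example runs on its own. A real setup would call an embedding model and replace the linear scan with a vector index.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class SemanticCache:
    """In-memory cache keyed by embedding similarity (linear scan)."""

    def __init__(self, embed: Callable[[str], Vector], threshold: float = 0.95):
        self.embed = embed
        self.threshold = threshold
        self.entries: List[Tuple[Vector, str]] = []

    def get(self, question: str) -> Optional[str]:
        """Return the closest cached answer, or None if nothing clears the threshold."""
        query_vec = self.embed(question)
        best_score, best_answer = 0.0, None
        for vec, cached_answer in self.entries:
            score = cosine_similarity(query_vec, vec)
            if score > best_score:
                best_score, best_answer = score, cached_answer
        return best_answer if best_score >= self.threshold else None

    def put(self, question: str, response: str) -> None:
        self.entries.append((self.embed(question), response))


def answer(question: str, cache: SemanticCache, call_llm: Callable[[str], str]) -> str:
    """Serve from cache when possible; otherwise call the LLM and cache the result."""
    cached = cache.get(question)
    if cached is not None:
        return cached
    response = call_llm(question)
    cache.put(question, response)
    return response


# Stand-ins so the sketch runs on its own: hand-written vectors instead of an
# embedding model, and a fake LLM that records how often it is called.
DEMO_VECTORS = {
    "What's the deadline for X?": [1.0, 0.0, 0.1],
    "When is the X deadline?": [0.98, 0.05, 0.12],
    "How do I apply for Y?": [0.0, 1.0, 0.2],
}
llm_calls = []


def fake_llm(question: str) -> str:
    llm_calls.append(question)
    return f"(LLM answer to: {question})"


cache = SemanticCache(embed=DEMO_VECTORS.__getitem__)
for q in DEMO_VECTORS:
    print(q, "->", answer(q, cache, fake_llm))
print("LLM calls made:", len(llm_calls))
```

The threshold is the knob to watch. Set it too low and users get answers to questions they didn't ask; set it too high and the hit rate collapses toward exact-match caching.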
The cleanest ramp
For a typical product launch with traffic growth from 0 to 100K+ queries/day:
| Phase | Daily queries | Stack | Monthly LLM cost |
|---|---|---|---|
| Month 1–3 — validation | 0–2K | DeepSeek V3 API, no caching needed | $0–30 |
| Month 4–6 — early traction | 2K–10K | DeepSeek V3 + Upstash Redis semantic cache | $30–100 |
| Month 7–9 — scale signal | 10K–50K | Migrate to RunPod A10 + vLLM + DeepSeek-R1-distill-7B | $317 fixed |
| Month 10+ — real scale | 50K+ | Multi-GPU vLLM cluster, multi-region, observability | $700–2,000+ |
Building the month 7–9 infrastructure while you're still in month 1 is the classic startup-killing mistake. Every dollar spent on infrastructure for traffic that doesn't exist yet is a dollar that can't be spent on building demand.
Hidden costs that don't show up in either column
Self-hosting:
- Ops time: 2–8 hours/month. If your time is worth $50+/hour, that's $100–400/month not in the cost table.
- Cold starts: if you scale to zero, expect a 30–60 second wait whenever a new instance spins up. Most products can't tolerate this.
- Model updates: new versions ship every few weeks. You decide when to migrate — which is both freedom and burden.
API:
- Vendor lock-in: switching from Claude to GPT requires re-testing every prompt. Real cost when it happens.
- Rate limits: they bite at scale, and usually sooner than you expect.
- Latency variance: typically 2–3× higher than self-hosted, with a longer tail.
For most builders, the hidden costs keep them on APIs even longer than the visible cost table suggests.
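To put a number on that, here is a rough sketch that folds the ops-time estimate from the self-hosting list into the A10's monthly bill and recomputes the break-even against Claude Haiku. The 30-day month is my assumption; everything else comes from the figures above.

```python
GPU_COST = 317.0          # $/mo, RunPod A10
HAIKU_PER_QUERY = 0.002   # $ per LLM call, from the API table
CACHE_HIT_RATE = 0.60
DAYS_PER_MONTH = 30       # assumption
HOURLY_VALUE = 50.0       # $/hr, the figure used in the ops-time bullet


def self_host_monthly_cost(ops_hours, gpu_cost=GPU_COST, hourly_value=HOURLY_VALUE):
    """GPU rental plus the value of the time spent operating it."""
    return gpu_cost + ops_hours * hourly_value


def break_even_daily_queries(fixed_monthly_cost, per_query_cost=HAIKU_PER_QUERY):
    """Daily volume at which the API bill matches the fixed monthly cost."""
    api_cost_per_daily_query = per_query_cost * (1 - CACHE_HIT_RATE) * DAYS_PER_MONTH
    return fixed_monthly_cost / api_cost_per_daily_query


for ops_hours in (0, 2, 8):
    total = self_host_monthly_cost(ops_hours)
    volume = break_even_daily_queries(total)
    print(f"{ops_hours} ops h/mo: ${total:,.0f}/mo, break-even ~{volume:,.0f} q/day")
```

At 8 hours of ops a month, the break-even against Claude Haiku moves from about 13K to about 30K queries/day. The API-side hidden costs are harder to price, so this only captures one half of the picture.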
The bottom line
API to validate. GPU to scale.
Don't burn $300/month on idle GPU compute when you have 100 users.
Don't burn $24,000/month on API calls when you have 1M queries a day.
Match the cost shape to the traffic shape. The thing that kills startups isn't picking the wrong stack on day one — it's locking in infrastructure costs before validating that anyone wants the product.