The mistake I almost made

Last week I was about to spin up a $317/month RunPod A10 instance for a chatbot I'm building. The reasoning seemed sound: "API costs scale with traffic, self-hosted is one fixed bill — so self-hosting wins at scale, right?"

Wrong.

I sat down with a spreadsheet for 30 minutes and realised I would have been burning idle compute for at least the first six months. The break-even point isn't where most builders think it is.

This post is the cost table I wish someone had handed me before I started planning the infrastructure.

What both options have in common

Whether you self-host the LLM or call an API, you need the same supporting infrastructure:

| Component | Cost |
| --- | --- |
| App server (Next.js / FastAPI) | $5–20/mo (Vercel free tier or small VPS) |
| Vector DB (Supabase / pgvector) | $0–25/mo |
| Redis cache (Upstash) | $0–10/mo |
| Embeddings (local CPU model like bge-m3) | $0 |
| Subtotal (both paths) | ~$10–55/mo |

The difference comes down to how you serve the language model. Everything else is identical.

Path A — self-hosted LLM on a rented GPU

You rent a GPU on RunPod, Lambda Cloud, Hetzner, Vast.ai, or AWS. You run vLLM, llama.cpp, or Ollama on it, serving an open-weight model like Qwen 3, DeepSeek-R1, or Llama 3.3.

Cost structure: fixed, per hour. You pay the same whether you process 1 query or 100,000.

| GPU | Provider | $/hr | $/mo 24×7 | Capacity (Qwen 7B + vLLM) |
| --- | --- | --- | --- | --- |
| T4 16 GB | RunPod | $0.20 | $144 | ~50K cached q/day |
| A10 24 GB | RunPod | $0.44 | $317 | ~200K cached q/day |
| L40S 48 GB | RunPod | $0.99 | $713 | ~500K cached q/day |
| H100 80 GB | RunPod | $2.99 | $2,160 | 1M+ cached q/day |
| g5.xlarge (A10) | AWS | $1.10 | $790 | ⚠ 2.5× pricier for same hardware |
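The monthly column is just the hourly rate multiplied out. A minimal sketch of that arithmetic, assuming a 30-day month of continuous uptime (the function and constant names are illustrative, and the table's own figures look rounded, so the H100 and AWS rows differ by a few dollars):

```python
HOURS_PER_MONTH = 24 * 30  # assumption: 30-day month, instance never stopped


def monthly_cost(hourly_rate: float) -> float:
    """Cost of keeping one instance up all month at a flat hourly rate."""
    return hourly_rate * HOURS_PER_MONTH


# Hourly rates copied from the table above.
HOURLY_RATES = {
    "T4 (RunPod)": 0.20,
    "A10 (RunPod)": 0.44,
    "L40S (RunPod)": 0.99,
    "H100 (RunPod)": 2.99,
    "g5.xlarge (AWS)": 1.10,
}

for name, rate in HOURLY_RATES.items():
    print(f"{name}: ${monthly_cost(rate):,.0f}/mo")
```

The same function makes it easy to re-run the comparison when a provider changes its hourly price.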

Skip AWS for GPU workloads unless you're already locked in for compliance reasons. On the A10 prices listed in this post, specialty providers undercut it by roughly 1.5–2.5×.

Path B — cloud LLM API + small instance

You build your app on Vercel or a VPS and call Anthropic, OpenAI, DeepSeek, or Mistral for each query. Pay per token.

Cost structure: variable, per query. Zero idle cost, scales linearly with traffic.

For a typical chatbot (~500 input + 300 output tokens per query):

| API | Per query (¥) | Per query ($) |
| --- | --- | --- |
| DeepSeek V3 | ¥0.08 | $0.0005 |
| GPT-4o-mini | ¥0.20 | $0.0013 |
| Claude Haiku 4.5 | ¥0.30 | $0.002 |
| Claude Sonnet 4.6 | ¥0.90 | $0.006 |
| Claude Opus 4.7 | ¥4.50 | $0.030 |
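A per-query figure like these comes from the token counts and the provider's per-million-token prices. Here is a rough sketch of that calculation. The prices in `ASSUMED_PRICES` are my assumption, not something stated in this post, so check the provider's pricing page before relying on them:

```python
def per_query_cost(input_tokens, output_tokens, usd_per_m_input, usd_per_m_output):
    """USD cost of one request, given prices per million tokens."""
    return (input_tokens * usd_per_m_input
            + output_tokens * usd_per_m_output) / 1_000_000


# ASSUMPTION: (input, output) list prices in USD per million tokens.
# These are illustrative values, not taken from this post.
ASSUMED_PRICES = {
    "Claude Haiku 4.5": (1.00, 5.00),
    "Claude Sonnet 4.6": (3.00, 15.00),
}

# Token profile of the "typical chatbot" query used in the table.
INPUT_TOKENS, OUTPUT_TOKENS = 500, 300

for model, (price_in, price_out) in ASSUMED_PRICES.items():
    cost = per_query_cost(INPUT_TOKENS, OUTPUT_TOKENS, price_in, price_out)
    print(f"{model}: ${cost:.4f} per query")
```

Output tokens dominate the bill here, so trimming response length moves the per-query cost more than trimming the prompt does.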

The actual head-to-head

Assuming a 60% cache hit rate (typical for FAQ-shaped chatbots), effective LLM calls are 40% of total queries. All figures are monthly costs over a 30-day month:

| Daily queries | Self-host A10 ($317) | Claude Haiku | DeepSeek V3 | Winner |
| --- | --- | --- | --- | --- |
| 1,000 | $317 | $24 | $6 | API by 13× |
| 10,000 | $317 | $240 | $64 | API still wins |
| 50,000 | $317 | $1,200 | $320 | Self-host ties DeepSeek |
| 100,000 | $317 | $2,400 | $640 | Self-host wins |
| 500,000 | $317–700 | $12,000 | $3,200 | Self-host wins decisively |
| 1,000,000 | $700–2,160 | $24,000 | $6,400 | Self-host wins 3–12× |

Read this carefully:

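  • At 1K queries/day, the API costs $6–24/month. Self-hosting costs $317/month. API wins by 13× with Claude Haiku and by roughly 50× with DeepSeek V3.
  • At 10K queries/day, the API still wins: narrowly with Claude Haiku ($240 vs $317), comfortably with DeepSeek V3 ($64).
  • At 50K queries/day, self-hosting roughly ties DeepSeek ($317 vs $320) and beats Claude Haiku by almost 4×.
  • At 1M queries/day, self-hosting wins by more than 10× over Claude Haiku ($24,000 vs at most $2,160).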
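The API columns can be reproduced in a few lines. This is a sketch under the post's stated assumptions (60% cache hit rate, 30-day month, per-query prices from the Path B table). The function name is mine, and the DeepSeek results land slightly below the table's, which presumably uses an unrounded per-query price:

```python
def monthly_api_cost(daily_queries, cost_per_query, cache_hit_rate=0.60, days=30):
    """Monthly API bill when only cache misses reach the LLM."""
    llm_calls = daily_queries * (1 - cache_hit_rate) * days
    return llm_calls * cost_per_query


A10_MONTHLY = 317            # fixed self-hosting bill from the GPU table
HAIKU_PER_QUERY = 0.002      # per-query prices from the Path B table
DEEPSEEK_PER_QUERY = 0.0005  # rounded, so results differ a little from the table

for daily in (1_000, 10_000, 50_000, 100_000):
    haiku = monthly_api_cost(daily, HAIKU_PER_QUERY)
    deepseek = monthly_api_cost(daily, DEEPSEEK_PER_QUERY)
    cheapest = min(A10_MONTHLY, haiku, deepseek)
    print(f"{daily:>7,}/day  self-host ${A10_MONTHLY}  Haiku ${haiku:,.0f}  "
          f"DeepSeek ${deepseek:,.0f}  cheapest ${cheapest:,.0f}")
```

Swapping in your own cache hit rate and per-query price is the quickest way to see which row of the table you actually live in.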

The three break-even points

Where self-hosting on a RunPod A10 (~$317/mo) starts winning, by API choice. "Cached queries" here means total daily queries, with the same 60% cache hit rate applied:

  • Claude Haiku 4.5 → ~13K cached queries/day
  • DeepSeek V3 → ~50K cached queries/day
  • Claude Sonnet 4.6 → ~4K cached queries/day

The cheaper the API you compare against, the higher the break-even: DeepSeek V3 pushes it roughly 4× higher than Claude Haiku does.
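Each break-even is the fixed GPU bill divided by what the API charges per day of traffic. A small sketch, assuming the same 60% hit rate and 30-day month as the head-to-head table (the function name is illustrative; DeepSeek comes out a bit above 50K here because the per-query price is the rounded $0.0005):

```python
def break_even_daily_queries(fixed_monthly, cost_per_query,
                             cache_hit_rate=0.60, days=30):
    """Daily query volume at which the API bill equals the fixed GPU bill."""
    api_cost_per_daily_query = cost_per_query * (1 - cache_hit_rate) * days
    return fixed_monthly / api_cost_per_daily_query


A10_MONTHLY = 317  # RunPod A10, from the GPU table

PER_QUERY_USD = {
    "Claude Sonnet 4.6": 0.006,
    "Claude Haiku 4.5": 0.002,
    "DeepSeek V3": 0.0005,
}

for api, price in PER_QUERY_USD.items():
    volume = break_even_daily_queries(A10_MONTHLY, price)
    print(f"{api}: ~{volume:,.0f} queries/day")
```

Below the printed volume the API is cheaper; above it, the fixed GPU bill wins.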

The trap: AWS GPU pricing

Same A10 GPU:

  • RunPod: $0.44/hour
  • Lambda Cloud: $0.75/hour
  • AWS g5.xlarge: $1.10/hour

AWS charges 2.5× the RunPod rate for the same hardware. Its value lies in the broader ecosystem (RDS, S3, IAM, SOC2/HIPAA compliance). For pure GPU inference workloads, specialty providers undercut it dramatically.

If you're not enterprise-gated into AWS, don't pay the AWS GPU premium.

The lever everyone forgets: semantic caching

In a real chatbot, ~60–80% of queries are reformulations of the same handful of questions:

  • "What's the deadline for X?"
  • "How do I apply for Y?"
  • "Difference between A and B?"

A semantic cache — embed the question, find any cached entry with cosine similarity above 0.95, serve the cached response — dramatically changes the math. At 60% cache hit rate, your effective LLM call rate is 40% of queries, dropping monthly LLM cost by 60%.

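If you skip caching, you're paying about 2.5× more than you need to on the API path at a 60% hit rate (5× at 80%), and on the self-hosted path you reach the GPU's capacity ceiling that much sooner. Add Upstash Redis with semantic similarity matching on day one. The free tier handles 10,000 commands/day.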
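The lookup described above fits in a short sketch. This is an in-memory illustration only, not Upstash's API: `embed` and `call_llm` are hypothetical callables you would supply (for example a bge-m3 wrapper and your LLM client), and the linear scan stands in for a real vector index such as pgvector.

```python
import math
from typing import Callable, Optional, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class SemanticCache:
    """Stores (embedding, answer) pairs and serves near-duplicate questions."""

    def __init__(self, embed: Callable[[str], Vector], threshold: float = 0.95):
        self.embed = embed          # hypothetical embedding function
        self.threshold = threshold  # similarity cutoff from the text above
        self.entries = []           # list of (vector, answer) tuples

    def get(self, question: str) -> Optional[str]:
        query_vec = self.embed(question)
        best_score, best_answer = 0.0, None
        for vec, answer in self.entries:  # linear scan; fine for a sketch
            score = cosine_similarity(query_vec, vec)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score > self.threshold else None

    def put(self, question: str, answer: str) -> None:
        self.entries.append((self.embed(question), answer))


def answer(question: str, cache: SemanticCache,
           call_llm: Callable[[str], str]) -> str:
    """Serve from cache when possible; otherwise call the LLM and store it."""
    cached = cache.get(question)
    if cached is not None:
        return cached
    response = call_llm(question)
    cache.put(question, response)
    return response
```

The threshold is the knob to watch: set it too low and users get answers to questions they did not ask, set it too high and the hit rate collapses.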

The cleanest ramp

For a typical product launch with traffic growth from 0 to 100K+ queries/day:

| Phase | Daily queries | Stack | Monthly LLM cost |
| --- | --- | --- | --- |
| Month 1–3: validation | 0–2K | DeepSeek V3 API, no caching needed | $0–30 |
| Month 4–6: early traction | 2K–10K | DeepSeek V3 + Upstash Redis semantic cache | $30–100 |
| Month 7–9: scale signal | 10K–50K | Migrate to RunPod A10 + vLLM + DeepSeek-R1-distill-7B | $317 fixed |
| Month 10+: real scale | 50K+ | Multi-GPU vLLM cluster, multi-region, observability | $700–2,000+ |

Pre-building the month-7 infrastructure while you're still in the validation phase is the classic startup-killing mistake. Every dollar spent on infrastructure for traffic that doesn't exist yet is a dollar that can't be spent on building demand.

Hidden costs that don't show up in either column

Self-hosting:

  • Ops time: 2–8 hours/month. If your time is worth $50+/hour, that's $100–400/month not in the cost table.
  • Cold starts: if you scale to zero, expect a 30–60 second delay whenever a new instance spins up. Most products can't tolerate this.
  • Model updates: new versions ship every few weeks. You decide when to migrate — which is both freedom and burden.

API:

  • Vendor lock-in: switching from Claude to GPT requires re-testing every prompt. Real cost when it happens.
  • Rate limits: real at scale, hit harder than expected.
  • Latency variance: typically 2–3× higher than self-hosted, with a longer tail.

For most builders, the hidden costs keep them on APIs even longer than the visible cost table suggests.
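To put a number on that, fold the ops-time estimate from the self-hosting list into the fixed bill and recompute the break-even against Claude Haiku. A rough sketch, assuming the post's figures (2–8 hours/month at $50/hour, 60% cache hit rate, 30-day month); the function name is illustrative:

```python
def break_even_with_ops(gpu_monthly, ops_hours, hourly_value, cost_per_query,
                        cache_hit_rate=0.60, days=30):
    """Break-even daily queries once ops time is counted as a fixed cost."""
    total_fixed = gpu_monthly + ops_hours * hourly_value
    return total_fixed / (cost_per_query * (1 - cache_hit_rate) * days)


A10_MONTHLY = 317        # RunPod A10
HAIKU_PER_QUERY = 0.002  # from the Path B table
HOURLY_VALUE = 50        # the post's "$50+/hour" lower bound

for ops_hours in (0, 2, 8):
    volume = break_even_with_ops(A10_MONTHLY, ops_hours, HOURLY_VALUE,
                                 HAIKU_PER_QUERY)
    print(f"{ops_hours} ops h/mo -> break-even ~{volume:,.0f} queries/day")
```

Under these assumptions the Haiku break-even moves from about 13K queries/day to somewhere between roughly 17K and 30K.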

The bottom line

API to validate. GPU to scale.

Don't burn $300/month on idle GPU compute when you have 100 users.

Don't burn $24,000/month on API calls when you have 1M queries a day.

Match the cost shape to the traffic shape. The thing that kills startups isn't picking the wrong stack on day one — it's locking in infrastructure costs before validating that anyone wants the product.