
Self-Hosted vs API: the Real TCO of an LLM (when local becomes rational)

An honest comparison (not pro-local advocacy) of an LLM's total cost of ownership: the hidden costs (GPU, electricity, maintenance, latency, availability, human time), when the API wins, and when self-hosting wins. The right criterion is not “cheaper.”

By Talki Academy · Updated on June 4, 2026

“The API costs us €X a month, whereas a local GPU is free.” That sentence triggers more bad infrastructure decisions than any other. This guide is not pro-local advocacy: it is an honest comparison of total cost of ownership (TCO), with the hidden costs on both sides, and a decision criterion that is not “which is cheaper.”

The seductive — and misleading — math

The classic reasoning: look at the monthly API bill, compare it to the price of a GPU, and local wins “because after purchase it is free.” The problem: API and self-hosting do not have the same cost structure.

  • API: fixed cost ≈ 0, marginal cost per request. You pay exactly what you consume.
  • Self-hosting: high fixed cost (hardware + operations), marginal cost ≈ 0. You pay for capacity, whether it is used or not.

Comparing a marginal cost to a fixed cost without accounting for volume is like comparing rent to buying a home by looking only at the monthly payment.
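To make the difference in cost structure concrete, here is a minimal Python sketch that computes the cost per request on each side. The prices are hypothetical placeholders (0.002 € per API request, 2,000 € a month of fixed self-hosting cost), chosen only to show the shape of the two curves.

```python
# Hypothetical figures for illustration only: replace them with your own.
API_PRICE_PER_REQUEST = 0.002          # € per request (assumed)
SELF_HOSTED_FIXED_PER_MONTH = 2_000.0  # €/month: amortization + power + ops (assumed)


def api_cost_per_request(requests_per_month: int) -> float:
    """Pay-as-you-go: the unit price is the same whatever the volume."""
    return API_PRICE_PER_REQUEST


def self_hosted_cost_per_request(requests_per_month: int) -> float:
    """Fixed capacity: the monthly fixed cost is spread over the requests actually served."""
    if requests_per_month <= 0:
        raise ValueError("requests_per_month must be positive")
    return SELF_HOSTED_FIXED_PER_MONTH / requests_per_month


for volume in (10_000, 1_000_000, 10_000_000):
    print(
        f"{volume:>12,} req/month | API {api_cost_per_request(volume):.4f} €/req"
        f" | self-hosted {self_hosted_cost_per_request(volume):.4f} €/req"
    )
```

With these assumed numbers, the self-hosted unit cost is 100 times the API's at 10,000 requests a month, equal at 1 million, and a tenth of it at 10 million. The API line stays flat; the self-hosted line only falls if the volume actually materializes.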

The hidden costs of self-hosting

Hardware and amortization

The GPU is a capex to amortize, not a zero cost. A serious inference node runs into the thousands of euros, spread over 2 to 4 years — and the resale value of an AI GPU drops fast.
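A minimal sketch of straight-line amortization, net of resale value. The purchase price, resale values, and duration are hypothetical; the point is that a faster-falling resale value raises the real monthly cost of the hardware.

```python
def monthly_amortization(purchase_price: float, resale_value: float, years: float) -> float:
    """Straight-line amortization: monthly cost of the hardware,
    net of what you expect to recover when you resell it."""
    if years <= 0:
        raise ValueError("years must be positive")
    if resale_value > purchase_price:
        raise ValueError("resale_value cannot exceed purchase_price")
    return (purchase_price - resale_value) / (years * 12)


# Hypothetical node: 8,000 € purchase, amortized over 3 years.
for resale in (3_000, 1_000):  # optimistic vs pessimistic resale value (assumed)
    print(f"resale {resale} € -> {monthly_amortization(8_000, resale, 3):.2f} €/month")
```

In this example, going from an optimistic to a pessimistic resale value raises the monthly cost from about 139 € to about 194 €.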

Electricity and cooling

An always-on node draws power 24/7, load or no load. Over a year, electricity (and cooling) becomes a real line item, especially where the kWh is expensive.
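Electricity is easy to estimate once you know the node's average draw. The sketch below multiplies average power by the hours in a year and by the kWh price, then scales the result by a PUE (power usage effectiveness) factor to account for cooling overhead. The draw, price, and PUE values are assumptions for illustration.

```python
HOURS_PER_YEAR = 24 * 365  # the node runs 24/7, load or no load


def yearly_power_cost(avg_draw_kw: float, price_per_kwh: float, pue: float = 1.0) -> float:
    """Yearly energy bill for an always-on node.

    pue (power usage effectiveness) scales IT power up to include cooling and
    other facility overhead: 1.0 means no overhead, 1.4 means +40 %.
    """
    return avg_draw_kw * HOURS_PER_YEAR * price_per_kwh * pue


# Hypothetical node: 0.7 kW average draw, 0.25 €/kWh, PUE of 1.4 (all assumed).
print(f"{yearly_power_cost(0.7, 0.25, 1.4):,.0f} €/year")
```

Use the average draw rather than the nameplate maximum: an idle node draws less than a loaded one, but it never draws zero.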

Maintenance and updates

Drivers, inference runtime, quantization formats, new models: the stack moves constantly. An update that breaks production is engineer time — and sometimes downtime.

Availability and latency

With self-hosting, there is no SLA: an outage is your problem, at 3 a.m. if needed. Latency and load handling are not guaranteed by a third party — getting them is on you.
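Without a third-party SLA, the availability target is yours to set and to meet. Converting a target into a yearly downtime budget makes it concrete; the targets below are common examples, not recommendations.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # non-leap year


def allowed_downtime_minutes_per_year(availability_pct: float) -> float:
    """Downtime that a given availability target allows over one year."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)


for target in (99.0, 99.9, 99.99):
    hours = allowed_downtime_minutes_per_year(target) / 60
    print(f"{target}% availability -> {hours:.1f} h of downtime per year")
```

At 99.9 %, that is under 9 hours a year, and with self-hosting every one of those hours is handled by your own team.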

Human time — the most underestimated cost

This is the line item that most often flips the calculation. Running an LLM reliably in production takes operational skill, and that skill is not free. A runtime that leaks memory with a particular quantization format can mean several days of debugging before you find the right setting. If that skill is not already in-house, the “free” of local becomes very expensive.
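Human time can be expressed in the same units as the hardware. A minimal sketch, assuming a routine maintenance load in hours per month, a number of incident days per year (such as the debugging session described above), and a loaded hourly rate; all three inputs are hypothetical.

```python
def yearly_human_cost(
    routine_hours_per_month: float,
    incident_days_per_year: float,
    loaded_hourly_rate: float,
    hours_per_day: float = 8.0,
) -> float:
    """Yearly cost of the engineer time needed to keep a self-hosted LLM running."""
    routine = routine_hours_per_month * 12 * loaded_hourly_rate  # updates, monitoring, upgrades
    incidents = incident_days_per_year * hours_per_day * loaded_hourly_rate  # debugging, outages
    return routine + incidents


# Hypothetical: 15 h/month of routine work, 10 incident days a year, 75 €/h loaded rate.
print(f"{yearly_human_cost(15, 10, 75):,.0f} €/year")
```

With these assumptions, human time alone comes to about 1,625 € a month.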

The hidden costs of the API

The marginal cost that runs away

The API’s advantage (pay-as-you-go) becomes a flaw at high volume: at several million requests a month, the per-token price can end up dominating every other consideration.
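API bills scale linearly with tokens. Many providers price input and output tokens separately, per million tokens; the sketch below assumes that pricing model, with hypothetical prices and token counts.

```python
def monthly_api_cost(
    requests_per_month: int,
    input_tokens_per_request: int,
    output_tokens_per_request: int,
    price_in_per_mtok: float,
    price_out_per_mtok: float,
) -> float:
    """Linear pay-as-you-go cost: every extra request adds the same amount."""
    per_request = (
        input_tokens_per_request * price_in_per_mtok
        + output_tokens_per_request * price_out_per_mtok
    ) / 1_000_000
    return requests_per_month * per_request


# Hypothetical: 1,500 input + 300 output tokens per request, 3 € / 15 € per million tokens.
for volume in (100_000, 1_000_000, 5_000_000):
    print(f"{volume:>10,} req/month -> {monthly_api_cost(volume, 1_500, 300, 3.0, 15.0):,.0f} €")
```

Nothing in this formula flattens out: ten times the volume is ten times the bill, which is what makes a fixed-cost alternative attractive at scale.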

Confidentiality and sovereignty

Every request sends your data to a third party, often outside your jurisdiction. For sensitive or regulated data, this is not a price question but a control question.

Lock-in and price changes

Your cost depends on a pricing grid you do not control, and switching vendors has a cost. You also inherit the vendor’s rate limits and quotas.

The tipping point: volume

It all comes down to the crossover between a fixed cost (self-hosting) and a usage-growing cost (API):

    API:           cost ≈ requests × price_per_request          (fixed ≈ 0)
    Self-hosting:  cost ≈ amortized hardware + power + ops     (marginal ≈ 0)

    Low volume        -> API wins (nothing to amortize)
    Growing volume    -> approaching the tipping point
    High AND stable   -> self-hosting amortizes its fixed cost

    CAUTION: tipping points are usually computed on machine cost alone (hardware + power). Add human time (ops) and the tipping point moves further out.
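The tipping point can be computed by setting the two cost lines equal and solving for volume. A minimal sketch with hypothetical figures: 380 € a month of machine cost (amortization plus power), 1,600 € a month of ops time, and an API price of 0.002 € per request.

```python
import math


def break_even_requests(
    fixed_monthly: float,
    api_price_per_request: float,
    self_hosted_marginal: float = 0.0,
) -> float:
    """Monthly volume above which self-hosting costs less than the API.

    Solves  volume * api_price = fixed + volume * marginal  for volume.
    Returns math.inf when the API is never the more expensive option.
    """
    margin = api_price_per_request - self_hosted_marginal
    if margin <= 0:
        return math.inf
    return fixed_monthly / margin


MACHINE_MONTHLY = 380.0  # € amortized hardware + power (assumed)
OPS_MONTHLY = 1_600.0    # € engineer time (assumed)
API_PRICE = 0.002        # € per request (assumed)

print(f"machine cost only: {break_even_requests(MACHINE_MONTHLY, API_PRICE):,.0f} req/month")
print(f"including ops:     {break_even_requests(MACHINE_MONTHLY + OPS_MONTHLY, API_PRICE):,.0f} req/month")
```

With these assumed numbers, counting ops time moves the tipping point from about 190,000 to about 990,000 requests a month, more than five times further out.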

Well-run migrations show dramatic drops on machine cost alone (see our migration case study and our LLM cost benchmark). But those numbers never include the engineer-hours — which are precisely what make the operation profitable… or not.
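One way to put the engineer-hours back into a migration decision is a payback period: the one-off migration effort divided by the monthly saving that remains once ongoing ops time is counted. The figures below are hypothetical.

```python
import math


def payback_months(
    migration_cost: float,
    api_monthly: float,
    machine_monthly: float,
    ops_monthly: float,
) -> float:
    """Months before a migration to self-hosting pays for itself.

    migration_cost: one-off engineer time spent on the migration (€)
    api_monthly: current API bill (€/month)
    machine_monthly: amortized hardware + power after migration (€/month)
    ops_monthly: ongoing engineer time after migration (€/month)
    Returns math.inf if the migration never pays back.
    """
    net_saving = api_monthly - machine_monthly - ops_monthly
    if net_saving <= 0:
        return math.inf
    return migration_cost / net_saving


# Hypothetical: 6,000 €/month API bill, 500 €/month machine cost, 20,000 € migration.
print(payback_months(20_000, 6_000, 500, 1_500))  # modest ops load
print(payback_months(20_000, 6_000, 500, 5_500))  # heavy ops load: never pays back
```

The machine-only saving (6,000 € down to 500 €) is identical in both cases; the ops line alone decides whether the migration pays back in five months or never.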

When the API wins

  • Low or erratic volume: nothing to amortize, you just pay for usage.
  • No ops team: you buy a third party’s reliability rather than building it.
  • Need for the latest frontier models without managing hardware.
  • Fast product iteration: test without provisioning infrastructure.
  • Compliance you would rather delegate to the vendor.

When self-hosting wins

  • High and stable volume: enough to amortize, steady enough not to pay for idle hardware.
  • Confidentiality / sovereignty: the data must not leave.
  • Control over latency and the end-to-end stack.
  • You already have the ops skill in-house.
  • Predictable load, no dependence on an external pricing grid.
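“Steady enough not to pay for idle hardware” can be quantified: self-hosted capacity has to be sized for the peak, so utilization is average load divided by provisioned capacity. A minimal sketch, assuming each node has a fixed throughput in requests per second (a simplification: real throughput depends on batching, prompt length, and model size).

```python
import math


def nodes_needed(peak_rps: float, node_capacity_rps: float) -> int:
    """Nodes required to absorb the peak load (capacity is sized for the peak)."""
    return math.ceil(peak_rps / node_capacity_rps)


def utilization(avg_rps: float, peak_rps: float, node_capacity_rps: float) -> float:
    """Share of the provisioned capacity actually used on average."""
    provisioned = nodes_needed(peak_rps, node_capacity_rps) * node_capacity_rps
    return avg_rps / provisioned


# Same average load (8 req/s), hypothetical node capacity of 10 req/s.
print(f"steady load:  {utilization(8, 10, 10):.0%}")  # peak 10 req/s -> 1 node
print(f"erratic load: {utilization(8, 40, 10):.0%}")  # peak 40 req/s -> 4 nodes
```

Same average traffic, four times the hardware: with erratic load, most of what you paid for sits idle, which is where pay-per-use wins.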

Our take

In practice, we run hybrid: self-hosting absorbs the repetitive volume (drafts, classification, internal tasks) via a local router, and the API/Claude steps in for supervision and rare, high-quality tasks. It is not “local or cloud,” it is “the right tool per workload type.” See also our AI cost optimization guide and local LLM in production.
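A routing rule of this kind can be very simple. The sketch below is hypothetical and is not our production router: the task categories and the two backend labels are placeholders, and it only decides where a request should go, without calling any model.

```python
from enum import Enum


class Backend(Enum):
    LOCAL = "local"  # self-hosted model
    API = "api"      # hosted frontier model


# Hypothetical workload taxonomy: repetitive tasks the local model handles well.
LOCAL_TASKS = {"draft", "classification", "internal"}


def route(task_type: str, needs_high_quality: bool = False) -> Backend:
    """Send repetitive volume to the local model, everything else to the API.

    Unknown task types go to the API: a deliberate choice that favors
    quality over cost until a task has been validated locally.
    """
    if needs_high_quality or task_type not in LOCAL_TASKS:
        return Backend.API
    return Backend.LOCAL


print(route("classification"))                  # Backend.LOCAL
print(route("supervision"))                     # Backend.API
print(route("draft", needs_high_quality=True))  # Backend.API
```

Defaulting unknown tasks to the local model would be the cost-first alternative; which default is right depends on how costly a poor local answer is for you.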

We will say it plainly: local is only rational if you have the operational skill to keep it running. Without that skill, choosing the API is not an admission of weakness; it is the rational choice.

Decision table

Your situation | Rational choice
Low or unpredictable volume | API
High, stable volume + ops team | Self-hosting
Sensitive data / sovereignty required | Self-hosting (or private cloud)
Need for recent frontier models | API
No in-house ops skill | API
Mixed volume (repetitive + spikes) | Hybrid
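The decision table can be encoded as a small function, for example to document the choice in an internal tool. This is one possible reading: the rows can overlap, so the order in which the checks run (sensitive data first, then missing ops skill, then mixed volume, then frontier models) is our assumption, not part of the table.

```python
from dataclasses import dataclass
from typing import Literal

Volume = Literal["low", "unpredictable", "high_stable", "mixed"]


@dataclass
class Situation:
    volume: Volume
    has_ops_skill: bool
    sensitive_data: bool
    needs_frontier_models: bool


def rational_choice(s: Situation) -> str:
    """One reading of the decision table; the precedence order is an assumption."""
    if s.sensitive_data:
        return "self-hosting (or private cloud)"
    if not s.has_ops_skill:
        return "API"
    if s.volume == "mixed":
        return "hybrid"  # local for repetitive volume, API for spikes and hard tasks
    if s.needs_frontier_models:
        return "API"
    if s.volume == "high_stable":
        return "self-hosting"
    return "API"  # low or unpredictable volume


print(rational_choice(Situation("high_stable", True, False, False)))   # self-hosting
print(rational_choice(Situation("high_stable", False, False, False)))  # API
```

Placing the ops-skill check before the volume checks reflects the position stated above: without the skill to run it, local is not the rational choice, whatever the volume.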

Conclusion

The right criterion is not “cheaper.” It is the combination of four factors: control (over the stack and the data), volume (high and stable enough to amortize), confidentiality (must the data stay with you?), and operational skill (can you actually run it?). If you have all four, self-hosting becomes rational. If one is missing, the API — or hybrid — is probably the better decision. Unit price is just one variable among several, and rarely the most important.
