Economics · Llama · Qwen · DeepSeek

Why open-source inference finally pencils out in 2026

A year ago, running Llama-class models in production was a vanity project. Today, with quantization-aware kernels and pooled GPU capacity, the per-million-token math beats closed APIs for most non-frontier workloads.

Connor Brown, CEO, Hoonify · April 22, 2026 · 8 min read

For most of 2024 and 2025, the cost story for open-source inference was a tease. The model weights were free. The serving stack was not. You either rented an entire H100 by the hour and ate the idle time, or you paid a hosted API roughly the same per-token price you would pay for GPT-4o, with worse tooling.

In April 2026 the math finally inverts. Three things changed at once: FP8 quantization landed in mainstream serving stacks without an accuracy regression on the workloads that actually matter, multi-tenant GPU sharing got safe enough to schedule across customers, and operators stopped pricing capacity like it was scarce. The result is that Llama-3.3-70B served on H100 SXM clusters now lands at roughly $0.18 per million input tokens at the wholesale layer — about 7x cheaper than the closed-API equivalent.
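To see what that gap means on a bill, here is a minimal sketch of the arithmetic. The $0.18 price and the roughly 7x multiple come from the paragraph above; the monthly token volume is a made-up placeholder, and the function names are ours for illustration only.

```python
# Figures taken from the article: wholesale price for Llama-3.3-70B input
# tokens, and the approximate multiple charged by the closed-API equivalent.
OPEN_USD_PER_MILLION = 0.18
CLOSED_MULTIPLE = 7.0


def monthly_input_cost(tokens_per_month: float, usd_per_million: float) -> float:
    """Dollars spent on input tokens in one month at a flat per-million price."""
    return tokens_per_month / 1_000_000 * usd_per_million


def compare(tokens_per_month: float) -> dict:
    """Open vs closed monthly input-token cost, rounded to cents."""
    open_cost = monthly_input_cost(tokens_per_month, OPEN_USD_PER_MILLION)
    closed_cost = monthly_input_cost(
        tokens_per_month, OPEN_USD_PER_MILLION * CLOSED_MULTIPLE
    )
    return {
        "open_usd": round(open_cost, 2),
        "closed_usd": round(closed_cost, 2),
        "savings_usd": round(closed_cost - open_cost, 2),
    }


# Hypothetical workload: 5 billion input tokens per month (an assumption).
result = compare(5_000_000_000)
print(result)
```

This only covers input tokens at the wholesale layer. Output-token pricing and any markup between wholesale and what you actually pay are not modeled here.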

What changed in the kernels

FP8 used to cost 1-2 points of MMLU. The current generation of attention and MLP kernels closes most of that gap. On Llama-3.3 the delta is inside the run-to-run noise. On code generation it is slightly worse on edge cases and slightly better on long-context recall, which nets out to neutral.

The bigger shift is throughput. With FP8 on H100 SXM you can run roughly 1.7x the tokens per second of the BF16 baseline at the same batch size, and the ceiling on batch size is higher because the smaller FP8 footprint leaves room for the KV cache. Operators who run their own scheduler are seeing 70-80% sustained utilization on inference fleets, up from 35-45% a year ago.
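Those two gains multiply, and a rough sketch shows by how much. The 1.7x throughput factor and the utilization ranges are from the paragraph above (we use the midpoints, 40% and 75%); the GPU hourly price and the baseline tokens-per-second figure are placeholders we made up, and they cancel out of the final ratio.

```python
def cost_per_million_tokens(
    gpu_hour_usd: float, tokens_per_sec: float, utilization: float
) -> float:
    """Dollars per million tokens for one GPU at a sustained utilization."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    # Only the utilized fraction of each hour produces tokens you can sell.
    tokens_per_hour = tokens_per_sec * 3600 * utilization
    return gpu_hour_usd / tokens_per_hour * 1_000_000


# Placeholder inputs (assumptions, not measured figures).
GPU_HOUR_USD = 2.50
BF16_TOKENS_PER_SEC = 1000.0

# A year ago: BF16 at the midpoint of 35-45% utilization.
bf16_then = cost_per_million_tokens(GPU_HOUR_USD, BF16_TOKENS_PER_SEC, 0.40)

# Now: FP8 at 1.7x throughput and the midpoint of 70-80% utilization.
fp8_now = cost_per_million_tokens(GPU_HOUR_USD, BF16_TOKENS_PER_SEC * 1.7, 0.75)

improvement = bf16_then / fp8_now
print(f"cost per million tokens fell by about {improvement:.2f}x")
```

Under these assumptions the serving cost per token drops by roughly 3.2x from kernels and scheduling alone, before any change in what operators charge for a GPU-hour.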

Where it still does not work

Three workloads still pencil out better on a closed API: anything that needs frontier-tier reasoning where the gap is real (advanced math, novel coding tasks, multi-step planning), anything that fits inside the free tier of a hyperscaler offering, and anything where a single point of vendor accountability is more valuable than the cost saving.

For the rest — RAG over enterprise corpora, classification at scale, structured extraction, multilingual customer service, document summarization — open-source on pooled capacity is now the default choice if you have someone competent reading the bill.

If you are paying frontier prices for a workload that does not need frontier intelligence, you are subsidizing someone else's R&D budget.

What we built around it

Hoonify exists because the supply side is fragmented. Eighteen NeoCloud operators each have spare H100 capacity, but customers do not want to integrate eighteen times. We pool capacity across operators behind one OpenAI-compatible endpoint, route by latency and price, and surface the per-pool economics so you can decide what to spend.
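As an illustration of what "route by latency and price" can mean, here is a toy selection rule: pick the cheapest healthy pool that meets a latency budget. This is a sketch under our own assumptions, not Hoonify's production router; the `Pool` fields, pool names, prices, and latencies are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Pool:
    """One operator's capacity pool (hypothetical shape, for illustration)."""
    name: str
    usd_per_million_tokens: float
    p50_latency_ms: float
    healthy: bool = True


def pick_pool(pools: Sequence[Pool], max_latency_ms: float) -> Optional[Pool]:
    """Return the cheapest healthy pool within the latency budget, or None."""
    eligible = [
        p for p in pools if p.healthy and p.p50_latency_ms <= max_latency_ms
    ]
    if not eligible:
        return None
    # Cheapest first; break price ties in favor of the lower latency.
    return min(eligible, key=lambda p: (p.usd_per_million_tokens, p.p50_latency_ms))


# Made-up pools and numbers.
pools = [
    Pool("pool-a", 0.18, 420.0),
    Pool("pool-b", 0.16, 900.0),
    Pool("pool-c", 0.21, 250.0),
    Pool("pool-d", 0.15, 300.0, healthy=False),
]

interactive = pick_pool(pools, max_latency_ms=500.0)  # tight budget
batch = pick_pool(pools, max_latency_ms=1000.0)       # loose budget
```

With these sample values a tight latency budget selects `pool-a` and a loose one selects the cheaper `pool-b`, which is the trade-off the per-pool economics are meant to expose.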

The pitch is simple: real capacity, real prices, real tokens. No proprietary model. No lock-in. Open weights, open routing, open pricing.

