BLOG // FIELD NOTES
Notes from the inference frontier.
What we are seeing across pools and operators — open-source models, GPU economics, the unglamorous engineering that keeps tokens flowing.
Real capacity, real prices, real tokens.
FEATURED · ECONOMICS
Apr 22, 2026
Why open-source inference finally pencils out in 2026
A year ago, running Llama-class models in production was a vanity project. Today, with quantization-aware kernels and pooled GPU capacity, the per-million-token math beats closed APIs for most non-frontier workloads.
ARCHIVE
More from the team
FP8 vs BF16 on Llama-3.3-70B: where the quality delta actually shows up
We ran 14 evaluation suites on FP8 and BF16 weights of the same checkpoint. The headline numbers are within noise. The interesting answers live in the tails.
Anatomy of a pooled GPU marketplace
What 'pool across operators' actually means under the hood: how listings get normalized, how the scheduler decides where your request lands, and why your H100 might be in three different metros depending on the hour.
H100 spot pricing recap: Q1 2026
The H100 spot market did something unusual this quarter: prices fell during the same window training demand spiked. Here is what we saw and what we think drove it.
The OpenAI-compatible API is the wrong default and the right default
Every inference provider speaks OpenAI's wire format. That is a gift to customers and a tax on innovation. Both things are true.
Structured output without grammars: when constrained decoding hurts more than it helps
Grammar-constrained decoding is sold as a way to guarantee valid JSON. It also degrades model quality in ways that do not show up until you are debugging a production incident.
What a NeoCloud actually is, and why it matters for inference pricing
The term gets used loosely. We use it specifically: an operator that sells GPUs as a primary product, not as a side effect of a broader cloud business. The distinction shapes everything about how prices form.
RAG on Llama-3.3 beats frontier models at half the cost. Here is why.
When the answer is in your documents, the model's job is to read carefully and synthesize. That is a job Llama-3.3 is good at, and where frontier intelligence is not the bottleneck.
Bare-metal GPUs vs managed Kubernetes: the case for both
We sell both. Customers ask which is right for them. The honest answer is that most workloads benefit from a clean line between training and serving, with each run on its own stack.
DeepSeek, Qwen, Llama: how we choose between them on a per-workload basis
There is no best open-weights model in 2026. There are three good ones, each strongest at a different shape of work. Here is how we route between them.
Power, water, and the limits on inference scaling in 2026
The constraint on inference is not GPU supply anymore. It is interconnect at the rack level and water at the regional level. Both are showing up in our pricing.
Why we ship every new account with $50 of credits and no credit card
An onboarding bet, said plainly: we believe more people should try open-source inference before they decide whether to use it. Removing the credit card is the cheapest way to make that happen.