BLOG // FIELD NOTES

Notes from the inference frontier.

What we are seeing across pools and operators: open-source models, GPU economics, and the unglamorous engineering that keeps tokens flowing.

Real capacity, real prices, real tokens.

FEATURED · ECONOMICS

Apr 22, 2026

Why open-source inference finally pencils out in 2026

A year ago, running Llama-class models in production was a vanity project. Today, with quantization-aware kernels and pooled GPU capacity, the per-million-token math beats closed APIs for most non-frontier workloads.

Connor Brown · CEO, Hoonify · 8 min read


More from the team

11 posts

Benchmarks · Apr 15, 2026

FP8 vs BF16 on Llama-3.3-70B: where the quality delta actually shows up

We ran 14 evaluation suites on FP8 and BF16 weights of the same checkpoint. The headline numbers are within noise. The interesting answers live in the tails.

Dmitri Toth · Performance Engineer · 11 min read

Engineering · Apr 8, 2026

Anatomy of a pooled GPU marketplace

What 'pool across operators' actually means under the hood: how listings get normalized, how the scheduler decides where your request lands, and why your H100 might be in three different metros depending on the hour.

Jules Maeder · Inference Lead · 9 min read

Economics · Apr 1, 2026

H100 spot pricing recap: Q1 2026

The H100 spot market did something unusual this quarter: prices fell during the same window training demand spiked. Here is what we saw and what we think drove it.

Riya Saito · Capacity & Pricing · 7 min read

Engineering · Mar 25, 2026

The OpenAI-compatible API is the wrong default and the right default

Every inference provider speaks OpenAI's wire format. That is a gift to customers and a tax on innovation. Both things are true.

Jules Maeder · Inference Lead · 6 min read

Engineering · Mar 18, 2026

Structured output without grammars: when constrained decoding hurts more than it helps

Grammar-constrained decoding is sold as a way to guarantee valid JSON. It also degrades model quality in ways that do not show up until you are debugging a production incident.

Sasha Wen · Reliability Engineer · 9 min read

Operators · Mar 11, 2026

What a NeoCloud actually is, and why it matters for inference pricing

The term gets used loosely. We use it specifically: an operator that sells GPUs as a primary product, not as a side effect of a broader cloud business. The distinction shapes everything about how prices form.

Elena Liang · Operator Partnerships · 7 min read

Inference · Mar 4, 2026

RAG on Llama-3.3 beats frontier models at half the cost. Here is why.

When the answer is in your documents, the model's job is to read carefully and synthesize. That is a job Llama-3.3 is good at, and where frontier intelligence is not the bottleneck.

Priya Nair · Solutions Engineer · 8 min read

GPUs · Feb 25, 2026

Bare-metal GPUs vs managed Kubernetes: the case for both

We sell both. Customers ask which is right for them. The honest answer is that most workloads benefit from a clean line between training and serving, with each run on its own stack.

Marcus Hale · Field CTO · 7 min read

Open Source · Feb 18, 2026

DeepSeek, Qwen, Llama: how we choose between them on a per-workload basis

There is no best open-weights model in 2026. There are three good ones, each strongest at a different shape of work. Here is how we route between them.

Priya Nair · Solutions Engineer · 9 min read

Policy · Feb 11, 2026

Power, water, and the limits on inference scaling in 2026

The constraint on inference is not GPU supply anymore. It is interconnect at the rack level and water at the regional level. Both are showing up in our pricing.

Connor Brown · CEO, Hoonify · 8 min read

Inference · Feb 4, 2026

Why we ship every new account with $50 of credits and no credit card

An onboarding bet, said plainly: we believe more people should try open-source inference before they decide whether to use it. Removing the credit card is the cheapest way to make that happen.

Connor Brown · CEO, Hoonify · 5 min read