
Anatomy of a pooled GPU marketplace

What 'pool across operators' actually means under the hood: how listings get normalized, how the scheduler decides where your request lands, and why your H100 might be in three different metros depending on the hour.

Jules Maeder, Inference Lead · April 8, 2026 · 9 min read

When customers ask how Hoonify works, the short answer is 'we route your request to whichever operator has the right GPU available right now.' The long answer is more interesting, and a few of the design choices behind it are non-obvious, so this post is the long answer.

Listings, not capacity

Operators do not give us their full inventory. They give us listings — slices of capacity they are willing to sell at a given price for a given window. A listing has a SKU (H100 SXM 80GB, MI300X, etc.), a region, a quantity, a price, and an expiry. Listings refresh on a 15-second cadence.

This is not the same as a futures market. A listing is a commitment from the operator that for the next refresh window, this capacity is reservable at this price. If you click buy, you get it. If you do not, the listing expires and the operator can re-list at a different price.
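To make the listing model concrete, here is a minimal sketch of a listing and its "reservable until it expires" rule. It is an illustration, not Hoonify's actual schema: the field names, the `operator` field, the per-hour price unit, and the `reserve` method are all assumptions, and the sample values are made up. Only the five attributes and the 15-second cadence come from the description above.

```python
from dataclasses import dataclass

REFRESH_SECONDS = 15  # refresh cadence stated in the post


@dataclass
class Listing:
    operator: str          # hypothetical field: who is selling the capacity
    sku: str               # e.g. "H100 SXM 80GB"
    region: str
    quantity: int          # GPUs still reservable in this listing
    price_per_hour: float  # assumed unit; the post only says "price"
    expires_at: float      # seconds, on whatever clock the caller uses

    def is_live(self, now: float) -> bool:
        """A listing is a commitment only until its expiry."""
        return now < self.expires_at and self.quantity > 0

    def reserve(self, count: int, now: float) -> bool:
        """Take `count` GPUs at the listed price if the listing is still live."""
        if count <= 0 or count > self.quantity or not self.is_live(now):
            return False
        self.quantity -= count
        return True


# Illustrative values only.
listing = Listing("op-a", "H100 SXM 80GB", "us-east", 8, 2.10,
                  expires_at=REFRESH_SECONDS)
inside_window = listing.reserve(2, now=5.0)    # succeeds: listing is live
after_expiry = listing.reserve(2, now=20.0)    # fails: the window has passed
```

Once the window passes, the reservation attempt fails rather than silently repricing, which mirrors the point that an expired listing has to be re-listed by the operator.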

The pool abstraction

We group regions into three pools: North America, Europe, and APAC. Pools matter for two reasons. First, latency — a customer in Frankfurt does not want their inference round-tripping through Virginia. Second, regulatory — some workloads have to stay inside a specific jurisdiction.

When you make a request, you specify a pool (or accept the default, which is geo-IP-based). The scheduler picks the cheapest available listing inside that pool that meets your latency target. If no listing meets the target, it widens to the next closest pool and surfaces a warning.
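The selection rule above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the production scheduler: the `NEXT_CLOSEST_POOL` map, the per-listing `latency_ms` estimate, and the `Placement` type are hypothetical, and the sketch assumes the same latency target is applied after widening.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical fallback map: the post says "next closest pool" without naming it.
NEXT_CLOSEST_POOL = {
    "north-america": "europe",
    "europe": "north-america",
    "apac": "north-america",
}


@dataclass(frozen=True)
class Listing:
    operator: str
    pool: str
    sku: str
    price: float
    latency_ms: float  # assumed: estimated latency from the caller to this listing


@dataclass(frozen=True)
class Placement:
    listing: Listing
    warning: Optional[str] = None


def cheapest_meeting_target(listings, pool, sku, target_ms):
    """Cheapest listing in `pool` for `sku` that meets the latency target."""
    eligible = [
        l for l in listings
        if l.pool == pool and l.sku == sku and l.latency_ms <= target_ms
    ]
    return min(eligible, key=lambda l: l.price, default=None)


def schedule(listings, pool, sku, target_ms):
    """Try the requested pool first, then widen once and attach a warning."""
    best = cheapest_meeting_target(listings, pool, sku, target_ms)
    if best is not None:
        return Placement(best)
    fallback = NEXT_CLOSEST_POOL[pool]
    best = cheapest_meeting_target(listings, fallback, sku, target_ms)
    if best is not None:
        return Placement(
            best,
            warning=f"no listing in {pool} met the target; widened to {fallback}",
        )
    return None  # nothing suitable in either pool
```

Note the order of the filters: latency is a hard constraint and price is only the tie-breaker among listings that satisfy it, so a cheaper but slower listing never wins.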

Why your GPU might be in three metros

If you reserve a single H100 for a long-running training job, it stays in one metro. Inference is different. We treat inference capacity as fungible inside a pool — your traffic at 9am in New York might land on Equinix DC2, at 9pm on Cologix CHI3, and overnight on a smaller operator in Dallas that has cheaper capacity available.

This sounds noisy but the customer experience is identical: same OpenAI-compatible endpoint, same latency target, same per-token price. The only thing that changes is which operator earns the revenue for that hour.

Inference capacity inside a pool is fungible. We arbitrage so you do not have to.

What this means for operators

If you are an operator listing capacity on Hoonify, you do not get to know which specific customer is using which specific GPU. You get aggregate utilization, aggregate revenue, and the ability to set price floors per SKU per pool. We do this so we can price aggressively without leaking competitive information between operators.
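As a rough sketch of what that operator view could look like, the example below keys price floors by SKU and pool and builds a report that contains only aggregates. The record fields, function names, and numbers are assumptions for illustration; the post does not describe the actual reporting format.

```python
from collections import defaultdict
from dataclasses import dataclass

# Floors keyed by (sku, pool), as described above. Values are made up.
PRICE_FLOORS = {
    ("H100 SXM 80GB", "north-america"): 1.80,
    ("MI300X", "europe"): 1.50,
}


@dataclass(frozen=True)
class UsageRecord:
    operator: str
    customer_id: str  # known to the marketplace, never shown to operators
    sku: str
    pool: str
    gpu_hours: float
    revenue: float


def apply_floor(sku, pool, proposed_price, floors=PRICE_FLOORS):
    """Never sell below the operator's floor for this SKU and pool."""
    return max(proposed_price, floors.get((sku, pool), 0.0))


def operator_report(records, operator, listed_gpu_hours):
    """Aggregate usage per (sku, pool); customer identities are dropped."""
    totals = defaultdict(lambda: {"gpu_hours": 0.0, "revenue": 0.0})
    for r in records:
        if r.operator == operator:
            totals[(r.sku, r.pool)]["gpu_hours"] += r.gpu_hours
            totals[(r.sku, r.pool)]["revenue"] += r.revenue
    report = {}
    for key, t in totals.items():
        listed = listed_gpu_hours.get(key, 0.0)
        report[key] = {
            "gpu_hours": t["gpu_hours"],
            "revenue": t["revenue"],
            # Utilization is sold hours over listed hours, when known.
            "utilization": t["gpu_hours"] / listed if listed else None,
        }
    return report
```

Because `customer_id` is read but never copied into the report, an operator sees how much of its capacity sold and what it earned, but not who bought it.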

Operators who try to game this — by withholding capacity to spike prices — get out-arbitraged by operators who do not. The system rewards consistent listing behavior.
