Pools

Hoonify groups every operator into one of three pools: North America, Europe, and APAC. From your perspective each pool is a single, fungible compute resource — Hoonify decides which underlying operator handles a given request.

Why pools?

Operators are anonymous to your customers. You bring traffic; we route it to a Hoonify-managed operator with the right SKU, latency, and reliability profile. That keeps your customer roster opaque to competitors and means you don't need to load-balance across operators yourself.

Routing

For each request, Hoonify picks an operator within the target pool by:

  1. Filtering to operators that have the requested SKU available right now.
  2. Among those, preferring lowest network latency to the request origin.
  3. Breaking ties by reliability score (rolling 30-day uptime).
  4. Avoiding operators that are draining or in maintenance windows.
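
The selection order above can be sketched in a few lines of Python. This is an illustrative model only, not Hoonify's actual router: the `Operator` fields, the `pick_operator` name, and the choice to apply the draining/maintenance check as an up-front filter are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Operator:
    # Hypothetical record; real operator details are not exposed by the API.
    name: str
    available_skus: frozenset
    latency_ms: float        # network latency to the request origin
    uptime_30d: float        # rolling 30-day uptime, 0.0 to 1.0
    draining: bool = False
    in_maintenance: bool = False


def pick_operator(operators: Iterable[Operator], sku: str) -> Optional[Operator]:
    """Return the preferred operator for `sku`, or None if nothing is eligible."""
    candidates = [
        op for op in operators
        if sku in op.available_skus      # step 1: SKU available right now
        and not op.draining              # step 4: skip draining operators
        and not op.in_maintenance        # step 4: skip maintenance windows
    ]
    if not candidates:
        return None
    # Step 2: lowest latency wins. Step 3: ties go to the higher uptime.
    return min(candidates, key=lambda op: (op.latency_ms, -op.uptime_30d))
```

A `None` result corresponds to the "no capacity in this pool" case.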

If no capacity exists in the requested pool, the request either fails with a 409 no_capacity error or falls back to the next-closest pool, depending on the X-Hoonify-Pool-Fallback header. The default is strict (no fallback), which keeps latency predictable.
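
One way a client might combine the two behaviours is to try the strict default first and opt in to fallback only after a 409. The sketch below assumes a caller-supplied `send` function (a hypothetical wrapper around your HTTP client) and takes the fallback header value as a parameter, because this page does not list the values the header accepts.

```python
from typing import Callable, Dict, Tuple

# send(headers) -> (status_code, body). Hypothetical signature for this sketch;
# wrap whichever HTTP client you already use.
Send = Callable[[Dict[str, str]], Tuple[int, str]]


def request_with_optional_fallback(
    send: Send,
    pool: str,
    fallback_value: str,
) -> Tuple[int, str, bool]:
    """Try the requested pool strictly; on 409, retry once with fallback enabled.

    Returns (status, body, used_fallback).
    """
    strict_headers = {"X-Hoonify-Pool": pool}
    status, body = send(strict_headers)
    if status != 409:
        return status, body, False

    # fallback_value must come from the API reference; it is not defined here.
    fallback_headers = {**strict_headers, "X-Hoonify-Pool-Fallback": fallback_value}
    status, body = send(fallback_headers)
    return status, body, True
```

Retrying only on 409 keeps the latency-predictable path as the common case and makes cross-pool routing an explicit, observable event in your own logs.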

Picking a pool

Pool   Typical latency from            Coverage
NA     US, Canada, parts of LATAM      9 SKUs, 9 inference models, fastest spot turnover.
EU     EU, UK, Nordics, Middle East    7 SKUs, 5 models. GDPR-resident by default.
APAC   JP, KR, SG, AU, IN              8 SKUs, 4 models. Strongest growth pool.

Pinning a pool

Set the X-Hoonify-Pool header on the request. Use this when:

  • You need data residency (e.g. EU customers must hit the EU pool).
  • You're benchmarking a specific region's tail latency.
  • Your inference is part of a multi-step pipeline and you want everything in one pool.

shell
curl https://api.hoonify.dev/v1/chat/completions \
  -H "Authorization: Bearer $HOONIFY_API_KEY" \
  -H "Content-Type: application/json" \
  -H "X-Hoonify-Pool: eu" \
  -d '{ "model": "llama-3.3-70b", "messages": [...] }'
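
For Python callers, a rough equivalent using only the standard library might look like the following. The helper name `build_pinned_request` is hypothetical, and the `Content-Type` header is an assumption that this page does not confirm. The function only builds the request, so it can be inspected or logged before being sent with `urllib.request.urlopen`.

```python
import json
import urllib.request

API_URL = "https://api.hoonify.dev/v1/chat/completions"


def build_pinned_request(
    pool: str, model: str, messages: list, api_key: str
) -> urllib.request.Request:
    """Build (but do not send) a chat completion request pinned to `pool`."""
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",  # assumed, not stated on this page
            "X-Hoonify-Pool": pool,              # pins the request to one pool
        },
    )
```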

Multi-pool replication

For latency-sensitive customer-facing workloads, configure your endpoint to replicate across pools — Hoonify will route each request to the closest replica:

json
POST /v1/endpoints
{
  "model": "llama-3.3-70b",
  "pools": ["na", "eu", "apac"],
  "concurrency": 64
}
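
If you generate this body programmatically, a small helper can catch typos in pool names before the request is sent. The sketch below is a client-side convenience under stated assumptions: the `endpoint_payload` name and its validation rules are illustrative, and the accepted pool identifiers are taken from the example above rather than from an authoritative list.

```python
import json

# Pool identifiers as they appear in the example request body.
VALID_POOLS = ("na", "eu", "apac")


def endpoint_payload(model: str, pools, concurrency: int) -> str:
    """Return a JSON body for POST /v1/endpoints, rejecting obvious mistakes."""
    pools = list(pools)
    unknown = [p for p in pools if p not in VALID_POOLS]
    if unknown:
        raise ValueError(f"unknown pool(s): {unknown}")
    if len(set(pools)) != len(pools):
        raise ValueError("duplicate pool in list")
    if concurrency < 1:
        raise ValueError("concurrency must be a positive integer")
    return json.dumps({"model": model, "pools": pools, "concurrency": concurrency})
```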

Replication doesn't multiply your bill: the per-token rate is unchanged. You pay for the tokens served across all pools, and Hoonify absorbs the fleet-management overhead.
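
As a quick sanity check on the billing model, the arithmetic can be written out as below. The rate and token counts are placeholders, not Hoonify pricing, and the sketch assumes a single per-token rate that applies in every pool.

```python
from decimal import Decimal


def replicated_cost(tokens_by_pool: dict, rate_per_million: Decimal) -> Decimal:
    """Total cost when one per-token rate applies to tokens served in any pool."""
    total_tokens = sum(tokens_by_pool.values())
    return rate_per_million * total_tokens / Decimal(1_000_000)


# Placeholder figures for illustration only.
usage = {"na": 6_000_000, "eu": 3_000_000, "apac": 1_000_000}
print(replicated_cost(usage, Decimal("0.50")))
```

The total depends only on the number of tokens served, so moving traffic between replicas changes where the tokens are counted but not what you pay.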

What you don't see

The specific operators inside a pool are not exposed via the API. The system_fingerprint on the response identifies the model + runtime configuration but not the operator. This is deliberate: it gives Hoonify the flexibility to migrate workloads between operators without breaking your responses, and it keeps the operator list private.

Related: Quickstart · Chat completions API