Chat completions

OpenAI-compatible endpoint for chat-style conversations. Use this for almost everything text-shaped: assistants, RAG, code generation, summarization, agents.

POST https://api.hoonify.dev/v1/chat/completions

Authentication

Send your API key as a bearer token. See the API keys page for creating and rotating keys.

shell
Authorization: Bearer hoon_sk_live_…

Request body

| Field | Type | Description |
| --- | --- | --- |
| model | string · required | Model identifier, e.g. llama-3.3-70b. See catalog. |
| messages | array · required | Conversation turns. Each item is {role, content}. Roles: system, user, assistant, tool. |
| temperature | number · 0–2 | Sampling temperature. Default 0.7. Lower = more deterministic. |
| max_tokens | integer | Hard cap on completion tokens. Defaults to model max if omitted. |
| top_p | number · 0–1 | Nucleus sampling. Default 1. Use this or temperature, not both. |
| top_k | integer | Hoonify extension. Sample from the top-k logits. Default 0 (off). |
| stream | boolean | If true, returns Server-Sent Events instead of one JSON body. |
| quantization | string | Hoonify extension. fp16 / fp8 / int4. Defaults to the cheapest variant the model exposes. |
| stop | string or array | One or more stop sequences. Up to 4 strings. |
| tools | array | Function-calling. Same shape as the OpenAI tools array. |
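
The ranges in the table can be checked client-side before a request goes out. The sketch below is one way to do that; validate_request is a hypothetical helper (not part of any Hoonify SDK), it only covers the constraints stated in the table, and the server remains the authority on what is valid.

```python
def validate_request(body: dict) -> list[str]:
    """Return a list of problems found in a chat-completions request body."""
    problems = []
    # model and messages are the only fields marked required.
    for field in ("model", "messages"):
        if field not in body:
            problems.append(f"missing required field: {field}")
    temperature = body.get("temperature")
    if temperature is not None and not 0 <= temperature <= 2:
        problems.append("temperature must be between 0 and 2")
    top_p = body.get("top_p")
    if top_p is not None and not 0 <= top_p <= 1:
        problems.append("top_p must be between 0 and 1")
    stop = body.get("stop")
    if isinstance(stop, list) and len(stop) > 4:
        problems.append("stop accepts at most 4 strings")
    quantization = body.get("quantization")
    if quantization is not None and quantization not in ("fp16", "fp8", "int4"):
        problems.append("quantization must be fp16, fp8, or int4")
    return problems
```

An empty list means none of the documented constraints were violated; it does not guarantee the model ID exists or that the request will be accepted.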

Headers

| Header | Value | Use |
| --- | --- | --- |
| X-Hoonify-Pool | na · eu · apac | Pin a specific pool. Required for data-residency workloads. |
| X-Hoonify-Idempotency-Key | uuid | Dedupes retries within a 5-minute window. Recommended for non-streaming requests. |

Example request

json
{
  "model": "llama-3.3-70b",
  "messages": [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user",   "content": "Summarize quantum tunneling in one paragraph."}
  ],
  "temperature": 0.7,
  "max_tokens": 1024,
  "top_p": 0.95,
  "stream": false,
  "quantization": "fp8",
  "stop": ["</done>"]
}
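
For reference, a request like this one could be assembled with Python's standard library as sketched here. The build_request helper and the YOUR_API_KEY placeholder are illustrative, and the Content-Type header is an assumption based on the body being JSON. The request is built but not sent.

```python
import json
import urllib.request

API_URL = "https://api.hoonify.dev/v1/chat/completions"

def build_request(api_key: str, body: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request for the chat completions endpoint."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",   # bearer token, as in Authentication
            "Content-Type": "application/json",     # assumed; the body is JSON
        },
        method="POST",
    )

body = {
    "model": "llama-3.3-70b",
    "messages": [
        {"role": "user", "content": "Summarize quantum tunneling in one paragraph."}
    ],
    "max_tokens": 1024,
}
request = build_request("YOUR_API_KEY", body)  # placeholder key, replace with your own
# Sending is a separate step: urllib.request.urlopen(request)
```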

Response

Non-streaming responses return a single JSON body. Streaming responses are an SSE stream of chat.completion.chunk objects terminated by [DONE].

json
{
  "id": "chatcmpl-FRcX2Fe1k4vR",
  "object": "chat.completion",
  "created": 1745784012,
  "model": "llama-3.3-70b",
  "pool": "na",
  "system_fingerprint": "hoonify-fp8-r12",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Quantum tunneling is the phenomenon where a particle…"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 28,
    "completion_tokens": 142,
    "total_tokens": 170
  }
}
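
When stream is true, the client has to reassemble text from the chunk events. The sketch below assumes the OpenAI streaming conventions: each event arrives as a "data: <json>" line and the text sits in choices[].delta.content. This page only states that chunks are chat.completion.chunk objects terminated by [DONE], so treat the field layout as an assumption. iter_stream_text is a hypothetical helper.

```python
import json
from typing import Iterable, Iterator

def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from SSE lines until the [DONE] sentinel."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and any non-data lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return  # end of stream
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            # Assumed chunk layout: incremental text under delta.content.
            text = choice.get("delta", {}).get("content")
            if text:
                yield text

# Hand-written sample events, not captured from a real response.
sample = [
    'data: {"object": "chat.completion.chunk", "choices": [{"index": 0, "delta": {"content": "Quantum "}}]}',
    "",
    'data: {"object": "chat.completion.chunk", "choices": [{"index": 0, "delta": {"content": "tunneling"}}]}',
    "",
    "data: [DONE]",
]
text = "".join(iter_stream_text(sample))
```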

Hoonify extensions on the response

| Field | Description |
| --- | --- |
| pool | The pool that served this request: na, eu, or apac. |
| system_fingerprint | Hash identifying the model + quantization + runtime that produced the response. Stable while the operator setup is unchanged. |
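
One possible use of these two fields is a post-hoc check that a residency-pinned request was served by the expected pool, while recording the fingerprint for reproducibility. check_residency is a hypothetical helper; the sample values are copied from the response shown earlier.

```python
import json

def check_residency(response: dict, expected_pool: str) -> str:
    """Return the system fingerprint, or raise if another pool served the request."""
    served_by = response.get("pool")
    if served_by != expected_pool:
        raise ValueError(f"expected pool {expected_pool!r}, got {served_by!r}")
    return response["system_fingerprint"]

response = json.loads('{"pool": "na", "system_fingerprint": "hoonify-fp8-r12"}')
fingerprint = check_residency(response, "na")
```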

Errors

Errors return the OpenAI-style error envelope: {"error": {"type": ..., "message": ...}}.

| Status | Type | Cause |
| --- | --- | --- |
| 400 | invalid_request | Malformed JSON, unsupported field, or invalid value. |
| 401 | unauthorized | Missing, malformed, or revoked API key. |
| 403 | forbidden_scope | Key lacks the required scope (e.g. inference:write). |
| 404 | model_not_found | Model ID does not exist or is not in your pool. |
| 409 | no_capacity | No replica available in the requested pool. |
| 429 | rate_limited | Per-key RPM exceeded. Back off and retry with jitter. |
| 503 | pool_degraded | Pool is temporarily routing-degraded. Retry with X-Hoonify-Pool fallback. |
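
The two retry hints in the table (jittered backoff for 429, pool fallback for 503) could be combined roughly as follows. This is a sketch under assumptions: next_action is a hypothetical helper, the fallback order and the backoff constants are arbitrary choices, and all other statuses are treated as non-retryable because the table gives no retry guidance for them. Workloads pinned to a pool for data residency should probably not fall back to another pool.

```python
import random

POOLS = ["na", "eu", "apac"]  # pool names from the Headers table; order is an assumption

def next_action(status: int, pool: str, attempt: int) -> dict:
    """Decide how to react to an error status on the given attempt (0-based)."""
    if status == 429:
        # Full-jitter exponential backoff: random delay up to min(30s, 0.5s * 2**attempt).
        delay = random.uniform(0, min(30.0, 0.5 * 2 ** attempt))
        return {"action": "retry", "pool": pool, "delay": delay}
    if status == 503:
        # Retry immediately against the next pool in the list.
        fallback = POOLS[(POOLS.index(pool) + 1) % len(POOLS)]
        return {"action": "retry", "pool": fallback, "delay": 0.0}
    # No retry guidance is documented for the remaining statuses.
    return {"action": "fail", "pool": pool, "delay": 0.0}
```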

Idempotency

For long completions, send X-Hoonify-Idempotency-Key with the original request and reuse the same value on every retry. Hoonify deduplicates within a 5-minute window, so you won't pay twice for a request that succeeded server-side but whose response was lost on the way back to your client.
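
A minimal sketch of that pattern: generate one UUID per logical request and reuse it across attempts. send_with_retries and flaky_send are hypothetical; the send callable stands in for whatever HTTP client you use, and a real implementation would also keep all retries inside the 5-minute window.

```python
import uuid
from typing import Callable

def send_with_retries(send: Callable[[dict], dict], headers: dict, attempts: int = 3) -> dict:
    """Call send(headers) up to `attempts` times, reusing one idempotency key."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    # One key per logical request; every retry carries the same value.
    headers = {**headers, "X-Hoonify-Idempotency-Key": str(uuid.uuid4())}
    last_error = None
    for _ in range(attempts):
        try:
            return send(headers)
        except ConnectionError as exc:  # e.g. response lost in flight
            last_error = exc
    raise last_error

# Stand-in transport that drops the first response, to show the key is reused.
seen_keys = []

def flaky_send(headers: dict) -> dict:
    seen_keys.append(headers["X-Hoonify-Idempotency-Key"])
    if len(seen_keys) < 2:
        raise ConnectionError("connection dropped")
    return {"id": "chatcmpl-example"}

result = send_with_retries(flaky_send, {"Authorization": "Bearer YOUR_API_KEY"})
```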

Tool calling

Pass tools with JSON-Schema function definitions. The model returns a tool_calls array on the assistant message; you execute the call and post the result back as a tool-role message in the next turn. The shape is the same as the OpenAI tools API, so it is drop-in compatible.
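
The round trip could look like the sketch below. Field names follow the OpenAI tools format, which this page says Hoonify mirrors; get_weather, LOCAL_FUNCTIONS, and run_tool_calls are hypothetical, and the assistant message is hand-written rather than a real model response.

```python
import json

def get_weather(city: str) -> str:
    return f"Sunny in {city}"  # stand-in for a real lookup

# Tool definition to send in the request's tools field.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]
LOCAL_FUNCTIONS = {"get_weather": get_weather}

def run_tool_calls(assistant_message: dict) -> list[dict]:
    """Execute each requested call and return the tool-role messages to append."""
    results = []
    for call in assistant_message.get("tool_calls", []):
        function = LOCAL_FUNCTIONS[call["function"]["name"]]
        arguments = json.loads(call["function"]["arguments"])  # arguments arrive as a JSON string
        results.append({
            "role": "tool",
            "tool_call_id": call["id"],  # ties the result to the call it answers
            "content": function(**arguments),
        })
    return results

assistant_message = {
    "role": "assistant",
    "content": None,
    "tool_calls": [{
        "id": "call_1",
        "type": "function",
        "function": {"name": "get_weather", "arguments": "{\"city\": \"Oslo\"}"},
    }],
}
tool_messages = run_tool_calls(assistant_message)
```

For the next turn, append the assistant message and then tool_messages to the messages array and send the request again.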

Rate limits

Defaults per API key: 1,200 RPM on Tier 2 (the default), 12,000 RPM on Tier 3. Token-per-minute caps scale the same way across tiers. To request a bump, email support@hoonify.dev with your use case.
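
To stay under the per-key RPM cap without relying on 429 responses, a client can pace itself. The sketch below spaces requests evenly (one every 60 / RPM seconds); Throttle is a hypothetical class, and even spacing is a deliberately conservative choice that gives up bursts a per-minute limit might otherwise allow. It does not account for token-per-minute caps.

```python
import time

class Throttle:
    """Space requests evenly so a client stays under a requests-per-minute cap."""

    def __init__(self, rpm: int, clock=time.monotonic, sleep=time.sleep):
        self.interval = 60.0 / rpm   # seconds between request slots
        self.clock = clock           # injectable for testing
        self.sleep = sleep
        self.next_slot = clock()

    def wait(self) -> float:
        """Block until the next slot is free; return the seconds slept."""
        now = self.clock()
        delay = max(0.0, self.next_slot - now)
        if delay:
            self.sleep(delay)
        # Reserve the following slot relative to whichever is later.
        self.next_slot = max(now, self.next_slot) + self.interval
        return delay

throttle = Throttle(rpm=1200)  # Tier 2 default: 1,200 RPM
```

Call throttle.wait() immediately before each request.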