CONCEPTS · QUANTIZATION

Quantization

Quantization compresses a model's weights (and sometimes activations) to a narrower numeric format. The result: more throughput per GPU, lower latency, and lower cost — at the price of a small, measurable quality hit.

Levels Hoonify supports

LevelBitsQuality vs fp16ThroughputNotes
fp1616Reference1.0×The baseline. What benchmark numbers were measured on. Use when you absolutely need parity with a published eval.
fp88≈99% of fp16 on most evals1.6–1.9×The default. Calibrated per-model with activation-aware tuning; quality drop is below per-run noise on Llama-3.x and Qwen-2.5.
int4494–97% of fp162.4–3.0×Aggressive. Useful for cost-sensitive bulk workloads or when you can validate against your own eval set. Not all models expose it.

How Hoonify picks the default

For each model Hoonify chooses the cheapest variant where calibrated evals on MMLU-redux, GSM8K, and HumanEval-Plus stay within 1% of the fp16 reference. In practice that's fp8 for almost every model in the catalog. The model object lists the available variants in the quantizations field.

Overriding

Pass the quantization field on chat completions:

json
{
  "model": "llama-3.3-70b",
  "quantization": "fp16",
  "messages": [...]
}

Asking for a quantization the model doesn't expose returns 400 invalid_request with the supported set echoed back.

When to override the default

  • Reproducing a benchmark: pin to fp16 if you're comparing against published numbers.
  • Tight latency budget on small models: int4 on Llama-3.1-8B drops first-token latency under ~25ms in na.
  • Strict structural output: fp16 is slightly more stable on long JSON or function-calling traces.

Pricing

Quantization changes throughput, not the per-token rate. The price you see in the rate card applies regardless of variant — the difference shows up indirectly through tokens served per second when you're benchmarking.

What about embeddings?

Embedding models always serve at fp16 or bf16. Quantization changes vector statistics in ways that hurt downstream retrieval, so Hoonify doesn't expose lower precision there. Use the dimensions field on Matryoshka models if you want smaller vectors instead.

Related: Models · Rate card