CONCEPTS · QUANTIZATION
Quantization
Quantization compresses a model's weights (and sometimes activations) to a narrower numeric format. The result: more throughput per GPU, lower latency, and lower cost — at the price of a small, measurable quality hit.
Levels Hoonify supports
| Level | Bits | Quality vs fp16 | Throughput | Notes |
|---|---|---|---|---|
| fp16 | 16 | Reference | 1.0× | The baseline. What benchmark numbers were measured on. Use when you absolutely need parity with a published eval. |
| fp8 | 8 | ≈99% of fp16 on most evals | 1.6–1.9× | The default. Calibrated per-model with activation-aware tuning; quality drop is below per-run noise on Llama-3.x and Qwen-2.5. |
| int4 | 4 | 94–97% of fp16 | 2.4–3.0× | Aggressive. Useful for cost-sensitive bulk workloads or when you can validate against your own eval set. Not all models expose it. |
How Hoonify picks the default
For each model Hoonify chooses the cheapest variant where calibrated evals on MMLU-redux, GSM8K, and HumanEval-Plus stay within 1% of the fp16 reference. In practice that's fp8 for almost every model in the catalog. The model object lists the available variants in the quantizations field.
Overriding
Pass the quantization field on chat completions:
{
"model": "llama-3.3-70b",
"quantization": "fp16",
"messages": [...]
}Asking for a quantization the model doesn't expose returns 400 invalid_request with the supported set echoed back.
When to override the default
- Reproducing a benchmark: pin to
fp16if you're comparing against published numbers. - Tight latency budget on small models:
int4on Llama-3.1-8B drops first-token latency under ~25ms inna. - Strict structural output:
fp16is slightly more stable on long JSON or function-calling traces.
Pricing
What about embeddings?
Embedding models always serve at fp16 or bf16. Quantization changes vector statistics in ways that hurt downstream retrieval, so Hoonify doesn't expose lower precision there. Use the dimensions field on Matryoshka models if you want smaller vectors instead.