FP8 vs BF16 on Llama-3.3-70B: where the quality delta actually shows up

We ran 14 evaluation suites on FP8 and BF16 weights of the same checkpoint. The headline numbers are within noise. The interesting answers live in the tails.

Dmitri Toth, Performance Engineer · April 15, 2026 · 11 min read

Every quantization announcement in the last eighteen months has shipped with the same talking point: 'no measurable degradation on standard benchmarks.' That is true and not very useful. Standard benchmarks compress real behavior into single numbers. The question we wanted to answer is more practical: which production workloads should we worry about when we move from BF16 to FP8 weights on Llama-3.3-70B?

We ran 14 evaluation suites covering reasoning, code generation, multilingual QA, structured extraction, and long-context retrieval. Same checkpoint, same prompts, same sampler settings, same hardware. Only the weight precision changed.

Headline numbers

On the aggregate scores you would expect from a model card, the deltas are inside run-to-run noise. MMLU shifted by 0.2 points in BF16's favor. HumanEval shifted by 0.4 points in FP8's favor. GSM8K was identical. None of these are statistically meaningful at our sample size.
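One way to check whether a delta of this size sits inside noise is a paired bootstrap over per-item correctness. The sketch below is an illustration under assumptions, not the procedure used for the numbers above: the function name, the resample count, and the 95% interval are all choices made for this example. It assumes you have one 0/1 correctness flag per prompt for each precision, in the same prompt order.

```python
import random


def paired_bootstrap_delta(correct_a, correct_b, n_resamples=2000, seed=0):
    """Return (observed_delta, ci_low, ci_high) in accuracy points for B minus A.

    correct_a and correct_b are equal-length lists of 0/1 flags, one per prompt,
    in the same prompt order (hypothetical input format for this sketch).
    """
    if len(correct_a) != len(correct_b) or not correct_a:
        raise ValueError("need two non-empty lists of equal length")
    n = len(correct_a)
    rng = random.Random(seed)
    # Work on per-prompt differences so each resample keeps the pairing intact.
    diffs = [b - a for a, b in zip(correct_a, correct_b)]
    observed = 100.0 * sum(diffs) / n
    deltas = []
    for _ in range(n_resamples):
        sample_sum = sum(diffs[rng.randrange(n)] for _ in range(n))
        deltas.append(100.0 * sample_sum / n)
    deltas.sort()
    ci_low = deltas[int(0.025 * n_resamples)]
    ci_high = deltas[int(0.975 * n_resamples) - 1]
    return observed, ci_low, ci_high
```

If the interval returned for a suite straddles zero, the shift is not distinguishable from resampling noise at that sample size.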

Where FP8 is measurably worse

Two places. First, structured extraction with rare schemas — when the model has to output a JSON object with field names it has not seen often during training, FP8 makes more single-token errors that break the parse. We measured a 1.8 point drop in valid-JSON rate on a custom schema suite.
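For readers who want to track the same metric on their own schemas, a valid-JSON rate can be computed with the standard library alone. This is a minimal sketch under assumptions: the function names and the "required fields" check are illustrative, and the exact scoring rule behind the 1.8 point figure is not spelled out here.

```python
import json


def is_valid_extraction(raw_output, required_fields):
    """True if raw_output parses as a JSON object containing every required field."""
    try:
        obj = json.loads(raw_output)
    except json.JSONDecodeError:
        # A single wrong token (missing brace, stray quote) lands here.
        return False
    return isinstance(obj, dict) and all(field in obj for field in required_fields)


def valid_json_rate(outputs, required_fields):
    """Percentage of model outputs that pass is_valid_extraction."""
    if not outputs:
        return 0.0
    ok = sum(is_valid_extraction(o, required_fields) for o in outputs)
    return 100.0 * ok / len(outputs)
```

Running the same prompts through both precisions and subtracting the two rates gives a per-schema delta that can be compared against the aggregate figure.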

Second, very long contexts. Past 80k tokens, FP8 retrieval accuracy on 'needle in a haystack' style probes drops faster than BF16. At 128k the gap is about 4 points. For most production workloads this does not matter; if your application lives at the long end, it does.
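A 'needle in a haystack' probe is simple to reproduce if you want to test your own context lengths. The sketch below shows one common construction, assuming a list of filler sentences, a single planted fact, and substring matching for scoring. The helper names, the prompt template, and the scoring rule are all assumptions for illustration and may differ from the probes used in these runs.

```python
def build_needle_prompt(filler_sentences, needle, depth_fraction, question):
    """Plant `needle` at a relative depth (0.0 = start, 1.0 = end) in the filler text."""
    if not 0.0 <= depth_fraction <= 1.0:
        raise ValueError("depth_fraction must be between 0 and 1")
    position = int(round(depth_fraction * len(filler_sentences)))
    sentences = filler_sentences[:position] + [needle] + filler_sentences[position:]
    return " ".join(sentences) + "\n\nQuestion: " + question + "\nAnswer:"


def retrieval_accuracy(answers, expected):
    """Percentage of answers containing the expected string (case-insensitive)."""
    if not answers:
        return 0.0
    hits = sum(expected.lower() in answer.lower() for answer in answers)
    return 100.0 * hits / len(answers)
```

Sweeping both the total filler length and `depth_fraction` for each precision shows where in the context window the gap opens up, not just how large it is at 128k.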

Structured extraction with rare schemas: BF16 wins by ~1.8 points
Long-context retrieval at 128k: BF16 wins by ~4 points
Multi-step reasoning chains over 30 steps: BF16 marginally better, inside noise

Where FP8 is measurably better

Mostly in latency-sensitive serving. FP8 weights take half the memory of BF16 weights, which leaves room for larger batches, and larger batches mean lower per-request latency at fixed concurrency. We routinely see a 40-60% throughput improvement at the same time to first token (TTFT). For chat applications, this translates directly into a better user experience on the same hardware.
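One plausible contributor to the batch-size headroom is plain weight memory: BF16 stores 2 bytes per parameter and FP8 stores 1. The back-of-the-envelope calculation below assumes a nominal 70e9 parameters and ignores KV cache, activations, and the small overhead of FP8 scaling factors, so treat it as a rough bound and not a measurement from these runs.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Approximate weight footprint in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


NUM_PARAMS = 70e9  # nominal parameter count, assumed for this sketch

bf16_gb = weight_memory_gb(NUM_PARAMS, 2)  # 140.0 GB
fp8_gb = weight_memory_gb(NUM_PARAMS, 1)   # 70.0 GB
freed_gb = bf16_gb - fp8_gb                # 70.0 GB freed for KV cache and batching

print(f"BF16: {bf16_gb:.0f} GB, FP8: {fp8_gb:.0f} GB, freed: {freed_gb:.0f} GB")
```

How much of the freed memory turns into extra batch capacity depends on the serving stack and the context lengths in play.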

There is also a small, consistent improvement on 'easy' prompts where the model would have over-thought in BF16. We do not have a clean explanation for this and we are not chasing it.

The recommendation

Default to FP8 for chat, RAG, classification, and code completion. Stay on BF16 for structured extraction with novel schemas, long-context analysis past 80k tokens, and any workload where a single bad output is unrecoverable. If you need both, run them on different endpoints — the price difference more than pays for the routing complexity.
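The routing rule above is small enough to express in a few lines. The sketch below is one hypothetical way to encode it: the endpoint URLs, the workload labels, and the `Request` shape are placeholders invented for this example, and only the 80k token threshold and the workload split come from the recommendation itself.

```python
from dataclasses import dataclass

# Hypothetical endpoints; substitute your own deployment URLs.
FP8_ENDPOINT = "https://fp8.example.internal/v1"
BF16_ENDPOINT = "https://bf16.example.internal/v1"

LONG_CONTEXT_TOKENS = 80_000  # threshold taken from the recommendation

# Workload labels are illustrative names, not a fixed taxonomy.
BF16_WORKLOADS = {"structured_extraction_novel_schema", "long_context_analysis"}


@dataclass
class Request:
    workload: str
    prompt_tokens: int
    unrecoverable_on_error: bool = False


def pick_endpoint(req: Request) -> str:
    """Send risky or long-context requests to BF16, everything else to FP8."""
    if req.unrecoverable_on_error:
        return BF16_ENDPOINT
    if req.workload in BF16_WORKLOADS:
        return BF16_ENDPOINT
    if req.prompt_tokens > LONG_CONTEXT_TOKENS:
        return BF16_ENDPOINT
    return FP8_ENDPOINT
```

For example, `pick_endpoint(Request("chat", 2_000))` returns the FP8 endpoint, while a RAG request with a 100,000 token prompt is sent to BF16 because of its length.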
