Structured output without grammars: when constrained decoding hurts more than it helps

Grammar-constrained decoding is sold as a way to guarantee valid JSON. It also degrades model quality in ways that do not show up until you are debugging a production incident.

Sasha Wen, Reliability Engineer · March 18, 2026 · 9 min read

Every inference provider ships some flavor of JSON mode or grammar-constrained decoding. The pitch is simple: you get back valid JSON, every time, no parse errors. The reality is more complicated.

Constrained decoding works by masking out tokens that would violate your grammar. At every step, the sampler can only emit tokens that keep the partial output parseable. This is an elegant trick. It is also a tax on the model's ability to reason.
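To make the masking step concrete, here is a minimal sketch with a toy vocabulary and greedy selection. The vocabulary, the `allowed_next` grammar and the logit values are all invented for illustration; a real implementation compiles the grammar into a state machine over the tokenizer's actual vocabulary.

```python
import math

# Toy vocabulary (hypothetical); a real tokenizer has tens of thousands of entries.
VOCAB = ["{", "}", '"answer"', ":", '"yes"', '"no"', "Hmm", "I am not sure"]

def allowed_next(prefix):
    """Hypothetical grammar for the single shape {"answer": "yes" | "no"}.

    Returns the set of tokens that keep the partial output parseable.
    """
    expected = [{"{"}, {'"answer"'}, {":"}, {'"yes"', '"no"'}, {"}"}]
    return expected[len(prefix)] if len(prefix) < len(expected) else set()

def constrained_greedy_step(logits, prefix):
    """Mask disallowed tokens to -inf, then pick the best remaining token."""
    allowed = allowed_next(prefix)
    masked = [
        score if token in allowed else -math.inf
        for token, score in zip(VOCAB, logits)
    ]
    best = max(range(len(VOCAB)), key=lambda i: masked[i])
    return VOCAB[best]

# Made-up logits: the model's top preference is to hedge,
# but the mask removes that option at this position.
logits = [0.1, 0.0, 0.2, 0.0, 1.5, 1.2, 2.0, 3.0]
prefix = ["{", '"answer"', ":"]

unconstrained = VOCAB[max(range(len(VOCAB)), key=lambda i: logits[i])]
constrained = constrained_greedy_step(logits, prefix)

print(unconstrained)  # I am not sure
print(constrained)    # "yes"
```

The mask never changes what the model prefers. It only hides the preference from the output, which is why the cost does not show up as a parse error.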

Where it backfires

When the model wants to say 'I do not have enough information to answer this' but your grammar requires a non-empty 'answer' field, the constrained sampler will produce a confident-sounding hallucination. The grammar gave it no escape hatch.
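If you do constrain, one way to give the model that escape hatch is to put it in the schema itself. The sketch below is an assumption on our part, not a provider feature: the `status` field, its values and the `check_response` helper are hypothetical names, and the pairing rule is enforced by hand after parsing.

```python
import json

# Hypothetical schema you might hand to a grammar or JSON-mode feature.
# "status" gives the model a legal way to decline instead of guessing.
ANSWER_SCHEMA = {
    "type": "object",
    "properties": {
        "status": {"enum": ["answered", "insufficient_information"]},
        "answer": {"type": ["string", "null"]},
    },
    "required": ["status", "answer"],
}

def check_response(raw):
    """Parse a response and enforce the status/answer pairing by hand."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    status = data.get("status")
    answer = data.get("answer")
    if status == "answered":
        if not isinstance(answer, str) or not answer.strip():
            raise ValueError("status 'answered' requires a non-empty answer")
    elif status == "insufficient_information":
        if answer is not None:
            raise ValueError("a declined response must set answer to null")
    else:
        raise ValueError(f"unknown status: {status!r}")
    return data

declined = check_response('{"status": "insufficient_information", "answer": null}')
answered = check_response('{"status": "answered", "answer": "42"}')
```

The grammar still guarantees the shape, but 'I do not know' is now a valid output rather than a violation.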

When the model wants to think out loud before committing to a structured answer, the grammar prevents the chain-of-thought tokens that would have produced a better final answer. You get a syntactically valid output that is semantically worse than the unconstrained version.

What we recommend instead

For low-stakes structured output (classification, extraction from clean text, simple yes/no decisions), grammar-constrained decoding is fine. The model would have produced valid JSON anyway and the guarantee is cheap insurance.

For anything where the model needs to reason (multi-step planning, ambiguous extraction, anything with a 'this is unclear, please clarify' branch), let it generate freely and parse on the receiving end. Add a retry with exponential backoff if the parse fails. We see roughly a 1.4% parse failure rate on Llama-3.3 with this setup, a far smaller cost than the 8 to 12% quality regression we see with constrained decoding on the same workload.
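A minimal sketch of the parse-and-retry approach, using only the standard library. The `generate` callable is a stand-in for whatever client you use, and the attempt count and delays are illustrative defaults, not tuned values. The raw output is logged on every attempt (here as text, since a stub has no token IDs).

```python
import json
import logging
import time

logger = logging.getLogger("structured_output")

def extract_json(text):
    """Return the first JSON object embedded in free-form model output."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            value, _ = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            pass  # this "{" did not start valid JSON; try the next one
        else:
            if isinstance(value, dict):
                return value
        start = text.find("{", start + 1)
    raise ValueError("no JSON object found in model output")

def generate_structured(generate, prompt, max_attempts=3, base_delay=0.5,
                        sleep=time.sleep):
    """Call generate(prompt) until the output parses, backing off between tries."""
    for attempt in range(max_attempts):
        raw = generate(prompt)
        # Log the raw output before parsing so failures can be debugged later.
        logger.info("attempt=%d raw_output=%r", attempt, raw)
        try:
            return extract_json(raw)
        except ValueError:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...

# Usage with a stubbed model call; swap in a real client.
_canned = iter(["Let me think about this", 'It is clear.\n{"answer": "yes"}'])
result = generate_structured(
    lambda prompt: next(_canned), "Is it clear?", sleep=lambda seconds: None
)
print(result)  # {'answer': 'yes'}
```

The model is free to reason in prose before the JSON object; `extract_json` skips over it and decodes the first object that parses.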

Use constrained decoding for: classification, extraction with stable schemas, structured tool calls
Skip constrained decoding for: reasoning tasks, ambiguous inputs, anything with a 'do not answer' branch
Always log the raw tokens, not just the parsed output — it is the only way to debug grammar regressions

The future is probably hybrid

The right answer is probably neither pure constrained decoding nor pure free generation. It is a sampler that switches mode based on what the model is currently doing — let it reason freely, then constrain only when it commits to a structured response. A few research stacks are doing this. None of them are production-ready yet. We expect this to be a 2026-into-2027 area of focus.
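As a rough illustration of the idea, and not a description of how any of those research stacks work, here is a toy two-phase decoder. It assumes a hypothetical sentinel token that the model emits when it commits to a structured answer, and it uses precomputed score dictionaries in place of real, autoregressive model logits.

```python
SENTINEL = "<json>"  # hypothetical marker the model emits when it commits

def pick(scores, allowed=None):
    """Greedy choice over a {token: score} dict, optionally restricted."""
    candidates = {
        token: score
        for token, score in scores.items()
        if allowed is None or token in allowed
    }
    return max(candidates, key=candidates.get)

def hybrid_decode(step_scores, grammar_allowed):
    """Decode freely until SENTINEL appears, then apply the grammar mask.

    step_scores: list of {token: score} dicts, one per decoding step
        (a stand-in for real model logits).
    grammar_allowed: function from the structured prefix to allowed tokens.
    """
    reasoning, structured, constrained = [], [], False
    for scores in step_scores:
        if not constrained:
            token = pick(scores)
            if token == SENTINEL:
                constrained = True  # switch modes; the sentinel is not emitted
            else:
                reasoning.append(token)
        else:
            structured.append(pick(scores, grammar_allowed(structured)))
    return reasoning, structured

def grammar_allowed(prefix):
    """Toy grammar for {"answer": "yes" | "no"}, invented for this example."""
    expected = [{"{"}, {'"answer"'}, {":"}, {'"yes"', '"no"'}, {"}"}]
    return expected[len(prefix)] if len(prefix) < len(expected) else set()

# Made-up scores: free-text tokens always score highest, so only the
# mask keeps the second phase on the grammar.
steps = [
    {"The": 2.0, "{": 1.0, SENTINEL: 0.1},
    {"evidence": 2.0, "{": 1.0, SENTINEL: 0.5},
    {"agrees": 1.0, SENTINEL: 3.0},
    {"so": 2.0, "{": 1.0},
    {"so": 2.0, '"answer"': 1.0},
    {"well": 2.0, ":": 0.5},
    {"maybe": 2.0, '"yes"': 1.0, '"no"': 0.2},
    {"done": 2.0, "}": 0.1},
]
reasoning, structured = hybrid_decode(steps, grammar_allowed)
print(reasoning)            # ['The', 'evidence']
print("".join(structured))  # {"answer":"yes"}
```

The hard part in practice is the switch itself: deciding reliably when the model has committed. The sentinel here sidesteps that question rather than answering it.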
