Structured output without grammars: when constrained decoding hurts more than it helps
Grammar-constrained decoding is sold as a way to guarantee valid JSON. It also degrades model quality in ways that do not show up until you are debugging a production incident.
Every inference provider ships some flavor of JSON mode or grammar-constrained decoding. The pitch is simple: you get back valid JSON, every time, no parse errors. The reality is more complicated.
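Constrained decoding works by masking out tokens that would violate your grammar. At every step, the sampler can only emit tokens that keep the partial output parseable, and the probability mass of the masked tokens is redistributed over whatever remains. This is an elegant trick. It is also a tax on the model's ability to reason: when the token the model preferred is masked, it is pushed onto a continuation it rated as less likely.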
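To make the masking step concrete, here is a minimal sketch of one decoding step. The token names, logit values, and the `masked_softmax` helper are all made up for illustration; real samplers work over the full vocabulary and derive the allowed set from a compiled grammar.

```python
import math

def masked_softmax(logits, allowed):
    """Softmax over `logits`, with every token outside `allowed` forced to probability 0."""
    masked = {tok: (score if tok in allowed else -math.inf)
              for tok, score in logits.items()}
    peak = max(masked.values())
    if peak == -math.inf:
        raise ValueError("the grammar allows no token in this vocabulary")
    # Subtract the peak for numerical stability; exp(-inf) evaluates to 0.0.
    exps = {tok: math.exp(score - peak) for tok, score in masked.items()}
    total = sum(exps.values())
    return {tok: value / total for tok, value in exps.items()}

# Hypothetical logits: the model would rather start a prose sentence ("I", "Sorry"),
# but the grammar only permits tokens that can open a JSON value.
logits = {"I": 3.0, "Sorry": 2.0, '"': 1.0, "{": 0.5}
allowed = {'"', "{"}

probs = masked_softmax(logits, allowed)
for token, p in probs.items():
    print(f"{token!r}: {p:.3f}")
```

In this toy step, the two tokens the model scored highest end up with probability zero, and the sampler chooses between options the model ranked third and fourth.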
Where it backfires
When the model wants to say 'I do not have enough information to answer this' but your grammar requires a non-empty 'answer' field, the constrained sampler will produce a confident-sounding hallucination. The grammar gave it no escape hatch.
When the model wants to think out loud before committing to a structured answer, the grammar prevents the chain-of-thought tokens that would have produced a better final answer. You get a syntactically valid output that is semantically worse than the unconstrained version.
What we recommend instead
For low-stakes structured output (classification, extraction from clean text, simple yes/no decisions), grammar-constrained decoding is fine. The model would have produced valid JSON anyway and the guarantee is cheap insurance.
For anything where the model needs to reason (multi-step planning, ambiguous extraction, anything with a 'this is unclear, please clarify' branch), let it generate freely and parse on the receiving end. Retry if the parse fails, with a cap on attempts; add exponential backoff if the retries risk hitting provider rate limits. We see roughly a 1.4% parse failure rate on Llama-3.3 with this setup. A parse failure is loud and retryable, which makes it a much better trade than the 8-12% quality regression we see with constrained decoding on the same workload, because that regression is silent.
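The generate-then-parse loop can be sketched roughly as follows. This is an illustration under assumptions, not our production code: `generate` stands in for whatever client call you use, and `extract_json`, `generate_structured`, `max_attempts`, and `base_delay` are hypothetical names. The sketch takes the last JSON object in the output, on the assumption that the model reasons first and commits to an answer at the end.

```python
import json
import time
from typing import Callable, List, Optional, Tuple

def extract_json(text: str) -> Optional[dict]:
    """Return the last top-level JSON object embedded in free-form text, or None."""
    decoder = json.JSONDecoder()
    found = None
    pos = text.find("{")
    while pos != -1:
        try:
            value, end = decoder.raw_decode(text, pos)
        except json.JSONDecodeError:
            pos = text.find("{", pos + 1)  # not an object start; try the next brace
            continue
        found = value
        pos = text.find("{", end)  # keep scanning after the object we just parsed
    return found

def generate_structured(
    generate: Callable[[str], str],  # assumption: your unconstrained model call
    prompt: str,
    max_attempts: int = 3,
    base_delay: float = 0.0,  # seconds; doubles after each failed attempt
) -> Tuple[Optional[dict], List[str]]:
    """Generate freely, parse afterwards, retry on parse failure. Returns (parsed, raw outputs)."""
    raw_outputs: List[str] = []
    for attempt in range(max_attempts):
        raw = generate(prompt)
        raw_outputs.append(raw)  # keep every raw output for debugging
        parsed = extract_json(raw)
        if parsed is not None:
            return parsed, raw_outputs
        if attempt < max_attempts - 1:
            time.sleep(base_delay * (2 ** attempt))
    return None, raw_outputs
```

Returning the raw outputs alongside the parsed result means a caller can see exactly what the model said on the attempts that failed to parse.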
- Use constrained decoding for: classification, extraction with stable schemas, structured tool calls
- Skip constrained decoding for: reasoning tasks, ambiguous inputs, anything with a 'do not answer' branch
- Always log the raw tokens, not just the parsed output; they are often the only way to debug grammar regressions
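The checklist above could be encoded as a small routing helper plus a log record that keeps the raw tokens. The task labels, `use_constrained_decoding`, and `log_generation` are hypothetical names chosen for this sketch; map them onto whatever task taxonomy you already have.

```python
import json
import logging
from typing import Optional, Sequence

logger = logging.getLogger("structured_output")

# Hypothetical task labels mirroring the two lists above.
CONSTRAINED_TASKS = {"classification", "stable_schema_extraction", "tool_call"}
FREE_TASKS = {"reasoning", "ambiguous_input", "may_decline"}

def use_constrained_decoding(task_kind: str) -> bool:
    """Decide whether a task should run with grammar-constrained decoding."""
    if task_kind in CONSTRAINED_TASKS:
        return True
    if task_kind in FREE_TASKS:
        return False
    # Refuse to guess: an unlisted task kind needs an explicit decision.
    raise ValueError(f"no decoding policy for task kind {task_kind!r}")

def log_generation(task_kind: str, raw_tokens: Sequence[str], parsed: Optional[dict]) -> dict:
    """Log the raw tokens next to the parsed output and return the record."""
    record = {
        "task_kind": task_kind,
        "constrained": use_constrained_decoding(task_kind),
        "raw_tokens": list(raw_tokens),  # what the model actually emitted
        "parsed": parsed,                # what the application consumed
    }
    logger.info(json.dumps(record))
    return record
```

Raising on an unknown task kind is a design choice of this sketch: it forces someone to decide which bucket a new workload belongs in, so it does not silently inherit a default.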
The future is probably hybrid
The right answer is probably neither pure constrained decoding nor pure free generation. It is a sampler that switches mode based on what the model is currently doing — let it reason freely, then constrain only when it commits to a structured response. A few research stacks are doing this. None of them are production-ready yet. We expect this to be a 2026-into-2027 area of focus.
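As a rough picture of what such a mode-switching sampler might look like, here is a toy sketch. Everything in it is an assumption made for illustration: the `<json>` sentinel, the `hybrid_decode` function, the five-step toy grammar, and the scripted stand-in for a model. It is not a description of any existing research stack.

```python
from typing import Callable, List, Sequence, Set, Tuple

SENTINEL = "<json>"  # hypothetical marker the model emits when it commits to structure

# Toy grammar: exactly {"answer": <yes|no|unclear>}, one permitted token set per step.
GRAMMAR_STEPS: List[Set[str]] = [
    {"{"}, {'"answer"'}, {":"}, {'"yes"', '"no"', '"unclear"'}, {"}"},
]

def allowed_next(prefix: Sequence[str]) -> Set[str]:
    """Tokens the toy grammar permits after `prefix`; empty once the object is complete."""
    return GRAMMAR_STEPS[len(prefix)] if len(prefix) < len(GRAMMAR_STEPS) else set()

def hybrid_decode(
    propose: Callable[[List[str]], List[str]],  # candidates for the next token, best first
    allowed: Callable[[Sequence[str]], Set[str]],
    max_steps: int = 50,
) -> Tuple[List[str], List[str]]:
    """Decode freely until SENTINEL appears, then mask candidates with the grammar."""
    history: List[str] = []
    structured: List[str] = []
    constrained = False
    for _ in range(max_steps):
        candidates = propose(history)
        if constrained:
            permitted = allowed(structured)
            if not permitted:
                break  # the grammar considers the structured output complete
            candidates = [tok for tok in candidates if tok in permitted]
        if not candidates:
            break
        token = candidates[0]  # greedy pick, to keep the sketch deterministic
        history.append(token)
        if constrained:
            structured.append(token)
        elif token == SENTINEL:
            constrained = True  # switch modes only after the model commits
    return history, structured

def scripted_propose(history: List[str]) -> List[str]:
    """Stand-in for a model: reasons in prose, emits the sentinel, then proposes JSON tokens."""
    script = [
        ["The"], ["input"], ["is"], ["ambiguous."], [SENTINEL],
        ["Sure,", "{"], ['"answer"'], [":"], ['"maybe"', '"unclear"', '"yes"'], ["}"],
    ]
    return script[len(history)] if len(history) < len(script) else []

history, structured = hybrid_decode(scripted_propose, allowed_next)
print(" ".join(history[:4]))  # the free-form reasoning
print("".join(structured))    # the grammar-constrained answer
```

The reasoning tokens are never masked; the grammar only filters candidates after the sentinel. The toy grammar also includes an "unclear" value, so the constrained phase still has an escape hatch.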