# JSON Output Reliability in Plain-Prompt Mode
*8 Models via OpenRouter — StreamFix Benchmark v2.2 — February 2026*
Disclosure: I'm the author of StreamFix. The benchmark code and raw JSONL are included for reproduction.
## Executive Summary
We tested 8 LLMs across 672 API calls to measure JSON output reliability using plain prompts only (no response_format or structured output features). The results reveal a significant gap between what models can produce and what developers actually receive:
| Metric | Raw Output | StreamFix |
|---|---|---|
| Content JSON (strict parse) | 33.3% | 98.4% |
| Tool call arguments | 100% | 99.3% |
| Streaming tool calls | 100% | 98.6% |
**Key finding:** 95% of JSON failures are trivial wrapper issues (markdown fences), not semantic errors. The underlying JSON is correct — it's just not machine-parseable as-is.
## The Problem
When you ask an LLM to "return JSON only," you often get:
````
Here's the JSON you requested:

```json
{"name": "Alice", "age": 30}
```

Hope this helps!
````
This is valid JSON wrapped in markdown — but `json.loads()` fails. Your application crashes. Users see errors. You add retry logic, regex hacks, or give up on smaller models.
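The failure is easy to reproduce in a few lines (the response string below is a constructed example of the wrapper pattern described above):

```python
import json

# A typical raw model response: valid JSON wrapped in prose and a markdown fence.
raw = 'Here is the JSON you requested:\n```json\n{"name": "Alice", "age": 30}\n```\nHope this helps!'

try:
    json.loads(raw)
except json.JSONDecodeError as err:
    # Strict parsing fails on the leading prose, even though the
    # embedded JSON object is perfectly valid.
    print(f"strict parse failed: {err.msg}")
```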
**In Protocol A plain-prompt mode, JSON failures are overwhelmingly formatting wrappers (95% markdown fences), not invalid JSON structure.**
---
## Methodology
### Test Design
- **Protocol A only**: Plain prompts asking for JSON output (no `response_format`, no structured output features, no function calling tricks)
- **Forced tool calls**: For tool tasks, `tool_choice` was set to force the specific tool (measuring argument reliability, not tool selection)
- **8 production models**: Mix of 2026 releases, established players, and open-source
- **7 task types**: 4 content tasks + 3 tool-calling tasks
- **Full matrix**: sync + streaming × raw + repaired × 3 trials = 672 total tests
- **Temperature 0**: Maximum reproducibility
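For the tool tasks, forcing a specific tool in an OpenAI-compatible chat API looks roughly like the payload below. This is an illustrative sketch, not the benchmark's actual request code; the model slug and prompt text are made up, and the `search_flights` schema mirrors the task list in the next section:

```python
# Illustrative request body for a forced tool call (OpenAI-compatible
# schema, as proxied by OpenRouter). Model slug and prompt are examples.
payload = {
    "model": "mistralai/ministral-8b",
    "temperature": 0,
    "messages": [
        {"role": "user", "content": "Find me a flight from SFO to JFK on 2026-03-01."}
    ],
    "tools": [{
        "type": "function",
        "function": {
            "name": "search_flights",
            "parameters": {
                "type": "object",
                "properties": {
                    "from": {"type": "string"},
                    "to": {"type": "string"},
                    "date": {"type": "string"},
                },
                "required": ["from", "to", "date"],
            },
        },
    }],
    # tool_choice pinned to the tool under test: this measures argument
    # reliability, not the model's willingness to pick the right tool.
    "tool_choice": {"type": "function", "function": {"name": "search_flights"}},
}
```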
### Models Tested
| Model | Provider | Release |
|:------|:---------|:--------|
| kimi-k2.5 | Moonshot AI | Jan 2026 |
| glm-4.7-flash | Z-AI | Jan 2026 |
| seed-1.6-flash | ByteDance | Dec 2025 |
| mistral-small-creative | Mistral | Dec 2025 |
| devstral-2512 | Mistral | Dec 2025 |
| ministral-8b-2512 | Mistral | Dec 2025 |
| gpt-4o-mini | OpenAI | Baseline |
| llama-3.3-70b-instruct | Meta | Open source |
### Task Categories
**Content Tasks** (JSON in `message.content`):
1. Simple object: `{name, age, active}`
2. Nested object: `{id, name, address: {street, city, zip}}`
3. Array of objects: `[{id, name, price}, ...]`
4. Special characters: quotes, newlines, unicode
**Tool Tasks** (JSON in `tool_calls[0].function.arguments`):
1. `search_flights(from, to, date)`
2. `create_invoice(amount, currency, items)`
3. `schedule_meeting(title, datetime, attendees)`
### Metrics
| Metric | Definition |
|:-------|:-----------|
| **Strict parse** | `json.loads(content)` succeeds |
| **Extractable** | A valid JSON substring exists that can be extracted without modifying any JSON characters, and `json.loads()` on that substring succeeds |
| **Schema valid** | Required keys present, correct types |
| **Tool emitted** | Model returned a tool_call (not just content) |
All percentages include Wilson 95% confidence intervals.
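The Wilson intervals reported throughout can be reproduced with a few lines (a sketch, not the benchmark's own statistics code):

```python
import math

def wilson_interval(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for k successes out of n trials (z=1.96 for 95%)."""
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# Raw-lane content strict parses: 64 of 192.
lo, hi = wilson_interval(64, 192)
print(f"[{lo:.1%}, {hi:.1%}]")   # → [27.0%, 40.3%]
```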
---
## Results
### HTTP Success Rate
| Lane | Success Rate |
|:-----|:-------------|
| Raw | 336/336 = 100.0% [98.9, 100.0] |
| StreamFix | 335/336 = 99.7% [98.3, 99.9] |
Both lanes show near-perfect HTTP reliability. The single StreamFix timeout was a transient network issue.
### Content JSON Parse Rate
| Metric | Raw | StreamFix |
|:-------|:----|:----------|
| **Strict parse** | 33.3% [27.0, 40.3] | **98.4%** [95.5, 99.5] |
| **Extractable** | 99.5% [97.1, 99.9] | 99.0% [96.3, 99.7] |
| **Schema valid (strict)** | 33.3% | 98.4% |
| **Schema valid (any)** | 99.5% | 99.0% |
**The 99.5% extractable rate means valid JSON exists in the response** — it's just wrapped in markdown fences, prose, or think tags. No JSON repair needed, only wrapper removal.
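One way to operationalize "extractable" is to scan for the first position where a valid JSON value parses, without touching any JSON characters. This is a sketch; the benchmark's actual extractor may differ:

```python
import json

def extract_json(text: str):
    """Return the first valid JSON value embedded in text, or None.

    Scans for a '{' or '[' and attempts a parse from that offset, so the
    JSON itself is never modified — only the wrapper is skipped.
    """
    decoder = json.JSONDecoder()
    for i, ch in enumerate(text):
        if ch in "{[":
            try:
                value, _end = decoder.raw_decode(text, i)
                return value
            except json.JSONDecodeError:
                continue
    return None

wrapped = 'Sure!\n```json\n{"name": "Alice", "age": 30}\n```'
print(extract_json(wrapped))   # → {'name': 'Alice', 'age': 30}
```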
### Tool Call Reliability
| Metric | Raw | StreamFix |
|:-------|:----|:----------|
| **Tool emitted** | 100.0% [97.4, 100.0] | 98.6% [95.0, 99.6] |
| **Args strict (given emitted)** | 100.0% [97.4, 100.0] | 99.3% [96.1, 99.9] |
| **Args schema valid** | 95.8% [91.2, 98.1] | 95.0% [90.1, 97.6] |
Tool calls show excellent reliability in both lanes. The ~5% schema failures are models returning slightly different field names or structures.
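A minimal version of the schema check — required keys present with expected types — illustrates how the renamed-field failures get caught. The field names below follow the `search_flights` task; the helper itself is a sketch, not the benchmark's validator:

```python
def schema_valid(args: dict, required: dict) -> bool:
    """True if every required key is present with the expected type."""
    return all(
        key in args and isinstance(args[key], typ)
        for key, typ in required.items()
    )

expected = {"from": str, "to": str, "date": str}
good = {"from": "SFO", "to": "JFK", "date": "2026-03-01"}
bad = {"origin": "SFO", "destination": "JFK", "date": "2026-03-01"}  # renamed fields

print(schema_valid(good, expected))  # → True
print(schema_valid(bad, expected))   # → False
```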
### Streaming Tool Calls
| Lane | Sync Mode | Streaming Mode |
|:-----|:----------|:---------------|
| Raw | 72/72 emitted | 72/72 emitted |
| StreamFix | 71/72 emitted | 70/71 emitted |
**Streaming tool calls work correctly in both lanes.** This was a critical validation point — earlier versions had a bug where StreamFix dropped streaming tool calls entirely.
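Streaming tool calls are fragile because argument JSON arrives as string fragments spread across deltas and only parses once the stream completes. A minimal accumulation sketch (the chunk dicts simulate OpenAI-compatible streaming deltas; no network call is made):

```python
import json

# Simulated streaming deltas: the arguments string is split across chunks.
chunks = [
    {"index": 0, "function": {"name": "search_flights", "arguments": '{"from": "S'}},
    {"index": 0, "function": {"arguments": 'FO", "to": "JFK", '}},
    {"index": 0, "function": {"arguments": '"date": "2026-03-01"}'}},
]

calls: dict[int, dict] = {}
for delta in chunks:
    call = calls.setdefault(delta["index"], {"name": None, "arguments": ""})
    fn = delta["function"]
    if fn.get("name"):
        call["name"] = fn["name"]
    call["arguments"] += fn.get("arguments", "")

# The concatenated arguments only become valid JSON at end of stream.
args = json.loads(calls[0]["arguments"])
print(args["date"])   # → 2026-03-01
```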
---
## Failure Analysis
### Failure Taxonomy (n=134 strict failures)
| Failure Type | Count | Percentage |
|:-------------|:------|:-----------|
| `markdown_fence` | 128 | 95.5% |
| `no_tool_call` | 2 | 1.5% |
| `empty_content` | 2 | 1.5% |
| `prose_wrapper` | 1 | 0.7% |
| `truncated` | 1 | 0.7% |
**95.5% of failures are markdown fences** — the model wraps correct JSON in ` ```json ... ``` `. This is the dominant failure mode and is trivially repairable.
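A repair for this dominant failure mode can be as small as one regular expression. This is a sketch of the idea, not StreamFix's actual repair logic:

```python
import re

# Match a fenced block, with or without a "json" language tag.
FENCE = re.compile(r"```(?:json)?\s*(.*?)\s*```", re.DOTALL)

def strip_fence(text: str) -> str:
    """Unwrap a markdown-fenced payload; pass unfenced text through."""
    m = FENCE.search(text)
    return m.group(1) if m else text.strip()

print(strip_fence('```json\n{"ok": true}\n```'))   # → {"ok": true}
```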
### Per-Model Breakdown (Content Tasks)
| Model | Raw Strict | StreamFix Strict |
|:------|:-----------|:-----------------|
| kimi-k2.5 | 75.0% | 95.8% |
| glm-4.7-flash | 41.7% | 91.7% |
| seed-1.6-flash | **100%** | **100%** |
| mistral-small-creative | 0% | **100%** |
| devstral-2512 | 0% | **100%** |
| ministral-8b-2512 | 0% | **100%** |
| gpt-4o-mini | 4.2% | **100%** |
| llama-3.3-70b-instruct | 29.2% | **100%** |
**Notable patterns:**
- **ByteDance seed-1.6-flash** is the only model that produces clean JSON natively (100% raw)
- **Mistral models** consistently wrap in markdown (0% raw → 100% repaired)
- **GPT-4o-mini** also wraps heavily (4.2% raw → 100% repaired)
---
## Implications
### For Developers
1. **Don't blame the model** — the JSON is usually correct, just wrapped
2. **Don't use regex hacks** — they break on edge cases
3. **Use a repair layer** — it's a solved problem at the protocol level
### For Model Providers
1. **The training signal is strong** — models produce valid JSON 99.5% of the time
2. **The formatting signal is wrong** — models add markdown because training data had it
3. **Structured output modes help** — but require provider-specific code paths
### For the Industry
The 33% → 98% improvement from a simple repair layer suggests:
- **Reliability is cheap** — no model retraining required
- **The fix is universal** — works across providers and models
- **The problem is solvable today** — not waiting for GPT-5
---
## Limitations
1. **Single gateway** — all calls were routed through OpenRouter to providers; direct provider APIs may behave differently
2. **Temperature 0** — higher temperatures may increase variability
3. **English prompts only** — non-English may behave differently
4. **Limited task complexity** — real-world JSON can be more complex
5. **Forced tool calls** — tool emission rate measures argument reliability, not tool selection accuracy
6. **Plain prompts only** — providers' native structured output features (when available) may show different results
---
## Reproducibility
Full benchmark code and raw results available at:
- Code: `benchmark/benchmark_v3.1.py`
- Results: `benchmark/results/benchmark_20260201_172225.jsonl`
To reproduce:
```bash
export OPENROUTER_API_KEY="your-key"
export STREAMFIX_API_KEY="your-key"
python benchmark/benchmark_v3.1.py
```

## Conclusion
LLMs are remarkably good at producing valid JSON — they just can't stop themselves from wrapping it in helpful formatting. This is a protocol-level problem with a protocol-level solution.
The path to reliable JSON isn't better models. It's better middleware.
## Appendix: Raw Data Summary
```
======================================================================
TOP-LINE METRICS (given HTTP 200)
======================================================================
RAW:
  Content strict:      64/192  = 33.3%  [27.0, 40.3]
  Content extractable: 191/192 = 99.5%  [97.1, 99.9]
  Tool call emitted:   144/144 = 100.0% [97.4, 100.0]
  Args strict:         144/144 = 100.0% [97.4, 100.0]

STREAMFIX:
  Content strict:      189/192 = 98.4%  [95.5, 99.5]
  Content extractable: 190/192 = 99.0%  [96.3, 99.7]
  Tool call emitted:   141/143 = 98.6%  [95.0, 99.6]
  Args strict:         140/141 = 99.3%  [96.1, 99.9]

FAILURE TAXONOMY:
  markdown_fence: 128 (95.5%)
  no_tool_call:     2 (1.5%)
  empty_content:    2 (1.5%)
  prose_wrapper:    1 (0.7%)
  truncated:        1 (0.7%)
```
Study conducted February 1, 2026. All API costs borne by the researcher.