
# JSON Output Reliability in Plain-Prompt Mode

*8 Models via OpenRouter — StreamFix Benchmark v2.2 — February 2026*

**Disclosure:** I'm the author of StreamFix. The benchmark code and raw JSONL are included for reproduction.

## Executive Summary

We tested 8 LLMs across 672 API calls to measure JSON output reliability using plain prompts only (no response_format or structured output features). The results reveal a significant gap between what models can produce and what developers actually receive:

| Metric | Raw Output | StreamFix |
|:-------|:-----------|:----------|
| Content JSON (strict parse) | 33.3% | 98.4% |
| Tool call arguments | 100% | 99.3% |
| Streaming tool calls | 100% | 98.6% |

**Key finding:** 95% of JSON failures are trivial wrapper issues (markdown fences), not semantic errors. The underlying JSON is correct — it's just not machine-parseable as-is.


---

## The Problem

When you ask an LLM to "return JSON only," you often get:

````
Here's the JSON you requested:

```json
{"name": "Alice", "age": 30}
```

Hope this helps!
````

This is valid JSON wrapped in markdown — but `json.loads()` fails. Your application crashes. Users see errors. You add retry logic, regex hacks, or give up on smaller models.
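The failure mode is easy to reproduce. A minimal sketch, using a string that mirrors the example response above:

```python
import json

# A typical model response: correct JSON wrapped in a markdown fence.
raw = 'Here\'s the JSON you requested:\n\n```json\n{"name": "Alice", "age": 30}\n```\n\nHope this helps!'

try:
    json.loads(raw)  # strict parse fails because of the wrapper
except json.JSONDecodeError:
    print("strict parse failed")

# Stripping the fence by hand recovers a perfectly valid payload:
inner = raw.split("```json\n", 1)[1].split("\n```", 1)[0]
print(json.loads(inner))  # → {'name': 'Alice', 'age': 30}
```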

**In Protocol A plain-prompt mode, JSON failures are overwhelmingly formatting wrappers (95% markdown fences), not invalid JSON structure.**

---

## Methodology

### Test Design

- **Protocol A only**: Plain prompts asking for JSON output (no `response_format`, no structured output features, no function calling tricks)
- **Forced tool calls**: For tool tasks, `tool_choice` was set to force the specific tool (measuring argument reliability, not tool selection)
- **8 production models**: Mix of 2026 releases, established players, and open-source
- **7 task types**: 4 content tasks + 3 tool-calling tasks
- **Full matrix**: 8 models × 7 tasks × (sync + streaming) × (raw + repaired) × 3 trials = 672 total tests
- **Temperature 0**: Maximum reproducibility
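For concreteness, here is roughly what one tool-task request body looks like under this design. This is a sketch using the OpenAI-compatible chat completions schema that OpenRouter accepts; the model slug, prompt, and tool definition are illustrative, not the benchmark's exact ones:

```python
# Illustrative tool definition for the search_flights task.
tool = {
    "type": "function",
    "function": {
        "name": "search_flights",
        "parameters": {
            "type": "object",
            "properties": {
                "from": {"type": "string"},
                "to": {"type": "string"},
                "date": {"type": "string"},
            },
            "required": ["from", "to", "date"],
        },
    },
}

payload = {
    "model": "mistralai/ministral-8b",  # illustrative model slug
    "messages": [{"role": "user",
                  "content": "Book a flight from SFO to JFK on 2026-03-01."}],
    "temperature": 0,  # maximum reproducibility
    "tools": [tool],
    # Force this specific tool so the metric isolates argument reliability,
    # not tool selection:
    "tool_choice": {"type": "function", "function": {"name": "search_flights"}},
}
```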

### Models Tested

| Model | Provider | Release |
|:------|:---------|:--------|
| kimi-k2.5 | Moonshot AI | Jan 2026 |
| glm-4.7-flash | Z-AI | Jan 2026 |
| seed-1.6-flash | ByteDance | Dec 2025 |
| mistral-small-creative | Mistral | Dec 2025 |
| devstral-2512 | Mistral | Dec 2025 |
| ministral-8b-2512 | Mistral | Dec 2025 |
| gpt-4o-mini | OpenAI | Baseline |
| llama-3.3-70b-instruct | Meta | Open source |

### Task Categories

**Content Tasks** (JSON in `message.content`):
1. Simple object: `{name, age, active}`
2. Nested object: `{id, name, address: {street, city, zip}}`
3. Array of objects: `[{id, name, price}, ...]`
4. Special characters: quotes, newlines, unicode

**Tool Tasks** (JSON in `tool_calls[0].function.arguments`):
1. `search_flights(from, to, date)`
2. `create_invoice(amount, currency, items)`
3. `schedule_meeting(title, datetime, attendees)`
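A tool-task trial checks the arguments string inside the returned tool call. Sketched here against a mocked message shaped like an OpenAI-compatible response (field layout assumed from that schema):

```python
import json

# Mocked assistant message in the OpenAI-compatible response shape.
message = {
    "role": "assistant",
    "content": None,
    "tool_calls": [{
        "id": "call_1",
        "type": "function",
        "function": {
            "name": "search_flights",
            # Arguments arrive as a JSON *string*, not a parsed object:
            "arguments": '{"from": "SFO", "to": "JFK", "date": "2026-03-01"}',
        },
    }],
}

emitted = bool(message.get("tool_calls"))  # the "tool emitted" metric
# The "args strict" metric: does the arguments string parse?
args = json.loads(message["tool_calls"][0]["function"]["arguments"])
print(emitted, args["date"])  # → True 2026-03-01
```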

### Metrics

| Metric | Definition |
|:-------|:-----------|
| **Strict parse** | `json.loads(content)` succeeds |
| **Extractable** | A valid JSON substring exists that can be extracted without modifying any JSON characters, and `json.loads()` on that substring succeeds |
| **Schema valid** | Required keys present, correct types |
| **Tool emitted** | Model returned a tool_call (not just content) |

All percentages include Wilson 95% confidence intervals.
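The intervals can be computed with the standard Wilson score formula (z = 1.96 for 95%). A minimal sketch, reproducing the raw strict-parse interval from the results below:

```python
from math import sqrt

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion, in percent."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    margin = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (100 * (center - margin), 100 * (center + margin))

lo, hi = wilson_ci(64, 192)     # raw content strict: 64/192 = 33.3%
print(f"[{lo:.1f}, {hi:.1f}]")  # → [27.0, 40.3]
```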

---

## Results

### HTTP Success Rate

| Lane | Success Rate |
|:-----|:-------------|
| Raw | 336/336 = 100.0% [98.9, 100.0] |
| StreamFix | 335/336 = 99.7% [98.3, 99.9] |

Both lanes show near-perfect HTTP reliability. The single StreamFix timeout was a transient network issue.

### Content JSON Parse Rate

| Metric | Raw | StreamFix |
|:-------|:----|:----------|
| **Strict parse** | 33.3% [27.0, 40.3] | **98.4%** [95.5, 99.5] |
| **Extractable** | 99.5% [97.1, 99.9] | 99.0% [96.3, 99.7] |
| **Schema valid (strict)** | 33.3% | 98.4% |
| **Schema valid (any)** | 99.5% | 99.0% |

**The 99.5% extractable rate means valid JSON exists in the response** — it's just wrapped in markdown fences, prose, or think tags. No JSON repair needed, only wrapper removal.
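One way to implement the extractable metric is to scan for the first `{` or `[` and attempt a decode from each candidate position. This is a sketch, not the benchmark's exact implementation:

```python
import json

def extract_json(text: str):
    """Return the first valid JSON value embedded in text, or None."""
    decoder = json.JSONDecoder()
    for i, ch in enumerate(text):
        if ch in "{[":
            try:
                value, _end = decoder.raw_decode(text, i)
                return value
            except json.JSONDecodeError:
                continue
    return None

wrapped = 'Sure!\n```json\n{"id": 7, "name": "Bo"}\n```\nDone.'
print(extract_json(wrapped))  # → {'id': 7, 'name': 'Bo'}
```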

### Tool Call Reliability

| Metric | Raw | StreamFix |
|:-------|:----|:----------|
| **Tool emitted** | 100.0% [97.4, 100.0] | 98.6% [95.0, 99.6] |
| **Args strict (given emitted)** | 100.0% [97.4, 100.0] | 99.3% [96.1, 99.9] |
| **Args schema valid** | 95.8% [91.2, 98.1] | 95.0% [90.1, 97.6] |

Tool calls show excellent reliability in both lanes. The ~5% schema failures are models returning slightly different field names or structures.
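A required-keys-and-types check of the kind the schema-valid metric implies can be sketched as follows. The field names come from the `create_invoice` task above; the exact expected types are an assumption:

```python
def schema_valid(args: dict, required: dict) -> bool:
    """Check that every required key exists with an expected Python type."""
    return all(
        key in args and isinstance(args[key], typ)
        for key, typ in required.items()
    )

invoice_schema = {"amount": (int, float), "currency": str, "items": list}

print(schema_valid({"amount": 99.5, "currency": "USD", "items": ["a"]},
                   invoice_schema))  # → True
# A model returning amount as a string fails the type check:
print(schema_valid({"amount": "99.5", "currency": "USD"},
                   invoice_schema))  # → False
```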

### Streaming Tool Calls

| Lane | Sync Mode | Streaming Mode |
|:-----|:----------|:---------------|
| Raw | 72/72 emitted | 72/72 emitted |
| StreamFix | 71/72 emitted | 70/71 emitted |

**Streaming tool calls work correctly in both lanes.** This was a critical validation point — earlier versions had a bug where StreamFix dropped streaming tool calls entirely.

---

## Failure Analysis

### Failure Taxonomy (n=134 strict failures)

| Failure Type | Count | Percentage |
|:-------------|:------|:-----------|
| `markdown_fence` | 128 | 95.5% |
| `no_tool_call` | 2 | 1.5% |
| `empty_content` | 2 | 1.5% |
| `prose_wrapper` | 1 | 0.7% |
| `truncated` | 1 | 0.7% |

**95.5% of failures are markdown fences** — the model wraps correct JSON in ` ```json ... ``` `. This is the dominant failure mode and is trivially repairable.
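Because the fence pattern is so uniform, the dominant failure can be repaired in a few lines. This is a minimal illustration of wrapper removal, not StreamFix's implementation, and simple regexes like this do miss edge cases:

```python
import re

# Matches a fenced block, with or without the "json" language tag.
FENCE = re.compile(r"```(?:json)?\s*\n(.*?)\n```", re.DOTALL)

def strip_fence(text: str) -> str:
    """Return the body of the first fenced block, or the text unchanged."""
    m = FENCE.search(text)
    return m.group(1) if m else text

raw = 'Here you go:\n```json\n{"name": "Alice", "age": 30}\n```\nHope this helps!'
print(strip_fence(raw))  # → {"name": "Alice", "age": 30}
```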

### Per-Model Breakdown (Content Tasks)

| Model | Raw Strict | StreamFix Strict |
|:------|:-----------|:-----------------|
| kimi-k2.5 | 75.0% | 95.8% |
| glm-4.7-flash | 41.7% | 91.7% |
| seed-1.6-flash | **100%** | **100%** |
| mistral-small-creative | 0% | **100%** |
| devstral-2512 | 0% | **100%** |
| ministral-8b-2512 | 0% | **100%** |
| gpt-4o-mini | 4.2% | **100%** |
| llama-3.3-70b-instruct | 29.2% | **100%** |

**Notable patterns:**
- **ByteDance seed-1.6-flash** is the only model that produces clean JSON natively (100% raw)
- **Mistral models** consistently wrap in markdown (0% raw → 100% repaired)
- **GPT-4o-mini** also wraps heavily (4.2% raw → 100% repaired)

---

## Implications

### For Developers

1. **Don't blame the model** — the JSON is usually correct, just wrapped
2. **Don't use regex hacks** — they break on edge cases
3. **Use a repair layer** — it's a solved problem at the protocol level

### For Model Providers

1. **The training signal is strong** — models produce valid JSON 99.5% of the time
2. **The formatting signal is wrong** — models add markdown because training data had it
3. **Structured output modes help** — but require provider-specific code paths

### For the Industry

The 33% → 98% improvement from a simple repair layer suggests:

- **Reliability is cheap** — no model retraining required
- **The fix is universal** — works across providers and models
- **The problem is solvable today** — not waiting for GPT-5

---

## Limitations

1. **OpenRouter routing** — all requests went through OpenRouter rather than directly to providers
2. **Temperature 0** — higher temperatures may increase variability
3. **English prompts only** — non-English may behave differently
4. **Limited task complexity** — real-world JSON can be more complex
5. **Forced tool calls** — tool emission rate measures argument reliability, not tool selection accuracy
6. **Plain prompts only** — providers' native structured output features (when available) may show different results

---

## Reproducibility

Full benchmark code and raw results available at:
- Code: `benchmark/benchmark_v3.1.py`
- Results: `benchmark/results/benchmark_20260201_172225.jsonl`

To reproduce:
```bash
export OPENROUTER_API_KEY="your-key"
export STREAMFIX_API_KEY="your-key"
python benchmark/benchmark_v3.1.py
```

---

## Conclusion

LLMs are remarkably good at producing valid JSON — they just can't stop themselves from wrapping it in helpful formatting. This is a protocol-level problem with a protocol-level solution.

The path to reliable JSON isn't better models. It's better middleware.


---

## Appendix: Raw Data Summary

```text
======================================================================
TOP-LINE METRICS (given HTTP 200)
======================================================================

RAW:
  Content strict:        64/192 = 33.3% [27.0, 40.3]
  Content extractable:   191/192 = 99.5% [97.1, 99.9]
  Tool call emitted:     144/144 = 100.0% [97.4, 100.0]
  Args strict:           144/144 = 100.0% [97.4, 100.0]

STREAMFIX:
  Content strict:        189/192 = 98.4% [95.5, 99.5]
  Content extractable:   190/192 = 99.0% [96.3, 99.7]
  Tool call emitted:     141/143 = 98.6% [95.0, 99.6]
  Args strict:           140/141 = 99.3% [96.1, 99.9]

FAILURE TAXONOMY:
  markdown_fence: 128 (95.5%)
  no_tool_call: 2 (1.5%)
  empty_content: 2 (1.5%)
  prose_wrapper: 1 (0.7%)
  truncated: 1 (0.7%)
```

Study conducted February 1, 2026. All API costs borne by the researcher.