The inciting incident was a production incident. We were building a product that extracted structured data from documents using LLMs. The pipeline worked fine in testing. In production, with real models and real prompts, json.loads() raised exceptions on roughly one in three responses.
The errors weren't random. They were always the same few patterns: markdown fences (```json ... ```), trailing commas, unquoted keys. The underlying data was always correct — the model understood exactly what we wanted. It just couldn't stop formatting the output for a human audience.
We wrote a repair function. Then a better one. Then we open-sourced a benchmark, announced on r/LocalLLaMA, measuring exactly how often each model fails and what those failures look like. That led to StreamFix.
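A first-pass repair for those three patterns is only a few regexes. Here is a minimal sketch with a hypothetical `loads_lenient` helper — not StreamFix's actual code, and deliberately naive (the key-quoting step would mangle braces inside string values):

```python
import json
import re

def loads_lenient(raw: str):
    """Parse LLM output as JSON, repairing the three common failure
    patterns: markdown fences, trailing commas, unquoted keys."""
    text = raw.strip()
    # 1. Strip a ```json ... ``` (or bare ```) markdown fence.
    fence = re.match(r"^```(?:json)?\s*(.*?)\s*```$", text, re.DOTALL)
    if fence:
        text = fence.group(1)
    # 2. Remove trailing commas before a closing bracket or brace.
    text = re.sub(r",\s*([\]}])", r"\1", text)
    # 3. Quote bare object keys, e.g. {name: "x"} -> {"name": "x"}.
    #    Naive: would also fire on brace-and-colon patterns inside strings.
    text = re.sub(r'([{,]\s*)([A-Za-z_][A-Za-z0-9_]*)\s*:', r'\1"\2":', text)
    return json.loads(text)
```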
## The architecture decision: proxy, not library
The obvious approach is a Python library: `import streamfix; data = streamfix.loads(llm_output)`. We built that first. It works. But it has a fatal problem: it only runs on the final assembled response, which means it can't help with streaming.
Streaming LLM responses are increasingly the default — users expect to see output token-by-token. But the streaming path is where JSON breaks most interestingly. A markdown fence arrives as ``` tokens before the JSON, then the JSON, then a closing ```. A library post-processing the final string works fine. But code trying to render a UI component from partial JSON mid-stream chokes immediately.
A proxy can intercept the stream at the token level, buffer intelligently, strip unwanted prefix tokens, and re-emit only the valid JSON portion — in real-time. That's not possible from a library sitting downstream of the SDK.
## The streaming repair FSM
The core of StreamFix's streaming repair is a finite state machine. Each incoming token moves the FSM through states: WAITING → IN_FENCE → IN_JSON → AFTER_JSON. States for think-tag detection, prose prefix skipping, and partial syntax repair run in parallel.
```python
# Simplified illustration — actual implementation in app/core/streaming_repair.py
from enum import Enum

class State(Enum):
    WAITING = "waiting"        # before JSON starts
    IN_THINK = "in_think"      # inside <think> block — discard
    IN_FENCE = "in_fence"      # inside ``` fence — discard wrapper
    IN_JSON = "in_json"        # emitting repaired JSON tokens
    AFTER_JSON = "after_json"  # JSON closed — discard trailing prose
```
The tricky part is transition detection. We can't split on characters when tokens are arbitrary substrings. The opening ```` ```json ```` might arrive as one token or as three (`` ``` ``, `json`, a newline). So every state transition is checked against a rolling buffer, not individual tokens.
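The rolling-buffer idea can be sketched in a few lines. This is a hypothetical `detect_fence` helper under the assumptions above, not the real FSM in app/core/streaming_repair.py: keep buffering while the buffer is still a prefix of the fence marker, and only commit to a state once the buffer disambiguates.

```python
def detect_fence(tokens):
    """Detect an opening ```json fence even when it is split across
    arbitrary token boundaries, by matching a rolling buffer."""
    buffer = ""
    for token in tokens:
        buffer += token
        # Buffer is still a prefix of the fence marker: can't decide yet.
        if "```json".startswith(buffer.lstrip()):
            continue
        if buffer.lstrip().startswith("```json"):
            return True   # fence confirmed; JSON payload follows it
        return False      # no fence; buffer is the start of raw JSON
    return False
```

The same prefix-matching trick generalizes to the `<think>` and prose-prefix transitions: a token never has to carry a whole marker, only extend the buffer toward one.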
## The SSE framing bug we shipped to production
Our first production deploy had a subtle bug. The upstream generator was stripping blank lines from SSE chunks to reduce log noise. This broke everything silently.
Server-Sent Events require a blank line (`\n\n`) between events. The OpenAI Python SDK uses this to delimit SSE events. When our proxy stripped blank lines, consecutive events merged in the client's buffer. The SDK then tried to `json.loads()` what looked like two JSON objects on one line — and raised `JSONDecodeError: Extra data`.
The fix was a one-function helper applied at every yield point in the streaming pipeline:
```python
@staticmethod
def _sse_frame(chunk: str) -> str:
    """Guarantee every yielded SSE chunk ends with the required \\n\\n."""
    if chunk.endswith('\n\n'):
        return chunk
    return chunk.rstrip('\n') + '\n\n'
```
The lesson: when building a streaming proxy, every yield is a contract. The downstream client depends on framing that the upstream may or may not provide correctly — your proxy has to enforce it.
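The contract is easy to demonstrate. SSE clients split the stream on blank lines; strip them upstream and two events collapse into one. A minimal sketch (not how the SDK parses internally, just the framing rule):

```python
def parse_sse_events(stream: str):
    """Split an SSE stream into events the way clients do:
    on the blank line (\\n\\n) between events."""
    return [e for e in stream.split("\n\n") if e]

good = 'data: {"a": 1}\n\ndata: {"b": 2}\n\n'
bad = 'data: {"a": 1}\ndata: {"b": 2}\n'  # blank lines stripped upstream

# good parses as two events; bad collapses into one merged event whose
# payload is two JSON objects — exactly the "Extra data" failure mode.
```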
## The benchmark: 672 calls, 8 models
Before shipping we ran a systematic benchmark: 8 models, 84 calls per model, covering 7 task types (4 content + 3 tool-calling), run across sync and streaming modes with 3 trials each. We intentionally used plain prompts only — no response_format, no structured output, no system prompt engineering. That's how most production code actually works before a developer hits the parsing error.
| Failure type | % of failures | Example |
|---|---|---|
| Markdown fences | 95.5% | `` ```json\n{...}\n``` `` |
| Missing tool call | 1.5% | model returned `content` instead of a `tool_call` |
| Empty content | 1.5% | `delta.content = ""` with no data |
| Prose wrapper | 0.7% | `Here is your JSON: {...}` |
| Truncated response | 0.7% | `{"name": "Ali...` (cut off) |
We expected fences to dominate, but not by this much. The 95.5% figure means the hard problem is narrow: if you can reliably strip markdown wrappers, you fix almost all failures. Everything else is long-tail syntax cleanup.
After StreamFix: 33.3% → 98.4% strict parse rate. The remaining ~1.6% are cases where the model produced genuinely empty responses, missing tool calls, or truncated output — that no amount of syntax repair can fix.
## What we got wrong
We underestimated how much developers care about the pass-through path. Most responses from GPT-4o or Claude are already valid JSON. If a proxy adds 50ms of latency and complexity to 100% of requests to fix 33% of them, many teams prefer to just handle the exceptions in their retry logic. We spent a lot of time making the pass-through path as thin as possible: fast detection of already-valid JSON, zero-copy forwarding when no repair is needed.
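The shape of that fast path is simple, even if the real pipeline isn't. A sketch with hypothetical helpers (`forward_or_repair`, a stub `repair`), not StreamFix's actual code: attempt a strict parse first, and forward the original string untouched when it succeeds.

```python
import json

def forward_or_repair(body: str) -> str:
    """Fast path: if the response is already strict JSON, forward the
    original string unchanged (no re-serialization, no repair cost)."""
    try:
        json.loads(body)
        return body          # already valid: pass through as-is
    except json.JSONDecodeError:
        return repair(body)  # slow path: run the repair pipeline

def repair(body: str) -> str:
    """Stub for the repair pipeline: here, just strip a markdown fence."""
    text = body.strip()
    if text.startswith("```"):
        text = text.strip("`").removeprefix("json").strip()
    return text
```

The design point is that the happy path does one parse and zero transformations, so valid responses pay almost nothing for sitting behind the proxy.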
The schema validation feature was harder than expected. We added a "Contract Mode" where you pass a JSON Schema and we validate and re-request if the response doesn't conform. The problem is that non-streaming re-requests are easy; streaming re-requests aren't — you've already started emitting tokens. We ended up disabling Contract Mode for streaming, which limits its utility for real-time UIs.
Prompt engineering beats repair, but developers don't do it. Adding `response_format={"type": "json_object"}` to OpenAI API calls eliminates most failures. But most code in the wild doesn't use it — either because it's a newer feature, or the developer is using a model/provider that doesn't support it, or they just never hit the issue in testing. StreamFix targets the "I didn't know I needed this until it broke in production" case.
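For reference, here is what that opt-in looks like as a raw chat-completions request payload (model name and prompt illustrative; note that OpenAI's JSON mode also requires the word "JSON" to appear in the prompt):

```python
# Request payload for an OpenAI-compatible /chat/completions call
# with JSON mode enabled. Not all models or providers support it.
payload = {
    "model": "gpt-4o",
    "messages": [
        {"role": "user", "content": "Extract the invoice fields as JSON."}
    ],
    "response_format": {"type": "json_object"},
}
```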
## The stack
- FastAPI + asyncio for the proxy — async streaming is crucial, everything is an `AsyncGenerator`
- SQLite via SQLAlchemy async — simple enough for current scale: API key storage and usage logs
- Railway for deployment — zero-config, auto-deploy on git push, enough for early traction
- OpenRouter as the LLM backend — unified API for 200+ models, no need to manage individual provider keys
- No Redis, no Celery, no message queues — we specifically avoided operational complexity until we need it
## What's next
The repair logic is solid. The gap is distribution — getting developers to try it in the first place. We're currently in open beta: free tier, no credit card, 1,000 credits on signup. If you're hitting JSON errors in your LLM pipeline, give it five minutes.
The full benchmark data, methodology, and model breakdown is in the benchmark study. The repair patterns are documented in the engineer's guide. Drop an email to rozetyp@gmail.com if you want to talk through your specific case.