The inciting incident was a production incident. We were building a product that extracted structured data from documents using LLMs. The pipeline worked fine in testing. In production, with real models and real prompts, json.loads() raised exceptions on roughly one in three responses.
The errors weren't random. They were always the same few patterns: markdown fences (```json ... ```), trailing commas, unquoted keys. The underlying data was always correct — the model understood exactly what we wanted. It just couldn't stop formatting the output for a human audience.
We wrote a repair function. Then a better one. Then we open-sourced a benchmark, announced on r/LocalLLaMA, measuring exactly how often each model fails and what those failures look like. That led to StreamFix.
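A first-pass repair for those three patterns is only a few regexes. Here is a minimal sketch with a hypothetical `loads_lenient` helper — not StreamFix's actual code, and deliberately naive (the key-quoting step would mangle braces inside string values):

```python
import json
import re

def loads_lenient(raw: str):
    """Parse LLM output as JSON, repairing the three common failure
    patterns: markdown fences, trailing commas, unquoted keys."""
    text = raw.strip()
    # 1. Strip a ```json ... ``` (or bare ```) markdown fence.
    fence = re.match(r"^```(?:json)?\s*(.*?)\s*```$", text, re.DOTALL)
    if fence:
        text = fence.group(1)
    # 2. Remove trailing commas before a closing bracket or brace.
    text = re.sub(r",\s*([\]}])", r"\1", text)
    # 3. Quote bare object keys, e.g. {name: "x"} -> {"name": "x"}.
    #    Naive: would also fire on brace-and-colon patterns inside strings.
    text = re.sub(r'([{,]\s*)([A-Za-z_][A-Za-z0-9_]*)\s*:', r'\1"\2":', text)
    return json.loads(text)
```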
## The architecture decision: proxy, not library
The obvious approach is a Python library: `import streamfix; data = streamfix.loads(llm_output)`. We built that first. It works. But it has a fatal problem: it only runs on the final assembled response, which means it can't help with streaming.
Streaming LLM responses are increasingly the default — users expect to see output token-by-token. But the streaming path is where JSON breaks most interestingly. A markdown fence arrives as ``` tokens before the JSON, then the JSON, then a closing ```. A library post-processing the final string works fine. But code trying to render a UI component from partial JSON mid-stream chokes immediately.
A proxy can intercept the stream at the token level, buffer intelligently, strip unwanted prefix tokens, and re-emit only the valid JSON portion — in real-time. That's not possible from a library sitting downstream of the SDK.
## The streaming repair FSM
The core of StreamFix's streaming repair is a finite state machine. Each incoming token moves the FSM through states: WAITING → IN_FENCE → IN_JSON → AFTER_JSON. States for think-tag detection, prose prefix skipping, and partial syntax repair run in parallel.
```python
# Simplified illustration — actual implementation in app/core/streaming_repair.py
from enum import Enum

class State(Enum):
    WAITING = "waiting"        # before JSON starts
    IN_THINK = "in_think"      # inside <think> block — discard
    IN_FENCE = "in_fence"      # inside ``` fence — discard wrapper
    IN_JSON = "in_json"        # emitting repaired JSON tokens
    AFTER_JSON = "after_json"  # JSON closed — discard trailing prose
```
The tricky part is transition detection. We can't split on characters when tokens are arbitrary substrings. The opening ```` ```json ```` might arrive as one token or as three (`` ``` ``, `json`, a newline). So every state transition is checked against a rolling buffer, not individual tokens.
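The rolling-buffer idea can be sketched in a few lines. This is a hypothetical `detect_fence` helper under the assumptions above, not the real FSM in app/core/streaming_repair.py: keep buffering while the buffer is still a prefix of the fence marker, and only commit to a state once the buffer disambiguates.

```python
def detect_fence(tokens):
    """Detect an opening ```json fence even when it is split across
    arbitrary token boundaries, by matching a rolling buffer."""
    buffer = ""
    for token in tokens:
        buffer += token
        # Buffer is still a prefix of the fence marker: can't decide yet.
        if "```json".startswith(buffer.lstrip()):
            continue
        if buffer.lstrip().startswith("```json"):
            return True   # fence confirmed; JSON payload follows it
        return False      # no fence; buffer is the start of raw JSON
    return False
```

The same prefix-matching trick generalizes to the `<think>` and prose-prefix transitions: a token never has to carry a whole marker, only extend the buffer toward one.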
## The SSE framing bug we shipped to production
Our first production deploy had a subtle bug. The upstream generator was stripping blank lines from SSE chunks to reduce log noise. This broke everything silently.
Server-Sent Events require a blank line (`\n\n`) between events. The OpenAI Python SDK uses this to delimit SSE events. When our proxy stripped blank lines, consecutive events merged in the client's buffer. The SDK then tried to `json.loads()` what looked like two JSON objects on one line — and raised `JSONDecodeError: Extra data`.
The fix was a one-function helper applied at every yield point in the streaming pipeline:
```python
@staticmethod
def _sse_frame(chunk: str) -> str:
    """Guarantee every yielded SSE chunk ends with the required \\n\\n."""
    if chunk.endswith('\n\n'):
        return chunk
    return chunk.rstrip('\n') + '\n\n'
```
The lesson: when building a streaming proxy, every yield is a contract. The downstream client depends on framing that the upstream may or may not provide correctly — your proxy has to enforce it.
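The contract is easy to demonstrate. SSE clients split the stream on blank lines; strip them upstream and two events collapse into one. A minimal sketch (not how the SDK parses internally, just the framing rule):

```python
def parse_sse_events(stream: str):
    """Split an SSE stream into events the way clients do:
    on the blank line (\\n\\n) between events."""
    return [e for e in stream.split("\n\n") if e]

good = 'data: {"a": 1}\n\ndata: {"b": 2}\n\n'
bad = 'data: {"a": 1}\ndata: {"b": 2}\n'  # blank lines stripped upstream

# good parses as two events; bad collapses into one merged event whose
# payload is two JSON objects — exactly the "Extra data" failure mode.
```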
## The benchmark: 672 calls, 8 models
Before shipping we ran a systematic benchmark: 8 models, 84 calls per model, covering 7 task types (4 content + 3 tool-calling), run across sync and streaming modes with 3 trials each. We intentionally used plain prompts only — no response_format, no structured output, no system prompt engineering. That's how most production code actually works before a developer hits the parsing error.
| Failure type | % of failures | Example |
|---|---|---|
| Markdown fences | 95.5% | `` ```json\n{...}\n``` `` |
| Missing tool call | 1.5% | model returned `content` instead of a `tool_call` |
| Empty content | 1.5% | `delta.content = ""` with no data |
| Prose wrapper | 0.7% | `Here is your JSON: {...}` |
| Truncated response | 0.7% | `{"name": "Ali...` (cut off) |
We expected fences to dominate, but not by this much. The 95.5% figure means the hard problem is narrow: if you can reliably strip markdown wrappers, you fix almost all failures. Everything else is long-tail syntax cleanup.
After StreamFix: 33.3% → 98.4% strict parse rate. The remaining ~1.6% are cases where the model produced genuinely empty responses, missing tool calls, or truncated output — that no amount of syntax repair can fix.
## What we got wrong
We underestimated how much developers care about the pass-through path. Most responses from GPT-4o or Claude are already valid JSON. If a proxy adds 50ms of latency and complexity to 100% of requests to fix 33% of them, many teams prefer to just handle the exceptions in their retry logic. We spent a lot of time making the pass-through path as thin as possible: fast detection of already-valid JSON, zero-copy forwarding when no repair is needed.
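The shape of that fast path is simple, even if the real pipeline isn't. A sketch with hypothetical helpers (`forward_or_repair`, a stub `repair`), not StreamFix's actual code: attempt a strict parse first, and forward the original string untouched when it succeeds.

```python
import json

def forward_or_repair(body: str) -> str:
    """Fast path: if the response is already strict JSON, forward the
    original string unchanged (no re-serialization, no repair cost)."""
    try:
        json.loads(body)
        return body          # already valid: pass through as-is
    except json.JSONDecodeError:
        return repair(body)  # slow path: run the repair pipeline

def repair(body: str) -> str:
    """Stub for the repair pipeline: here, just strip a markdown fence."""
    text = body.strip()
    if text.startswith("```"):
        text = text.strip("`").removeprefix("json").strip()
    return text
```

The design point is that the happy path does one parse and zero transformations, so valid responses pay almost nothing for sitting behind the proxy.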
The schema validation feature was harder than expected. We added a "Contract Mode" where you pass a JSON Schema and we validate and re-request if the response doesn't conform. The problem is that non-streaming re-requests are easy; streaming re-requests aren't — you've already started emitting tokens. We ended up disabling Contract Mode for streaming, which limits its utility for real-time UIs.
Prompt engineering beats repair, but developers don't do it. Adding `response_format={"type": "json_object"}` to OpenAI API calls eliminates most failures. But most code in the wild doesn't use it — either because it's a newer feature, or the developer is using a model/provider that doesn't support it, or they just never hit the issue in testing. StreamFix targets the "I didn't know I needed this until it broke in production" case.
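For reference, here is what that opt-in looks like as a raw chat-completions request payload (model name and prompt illustrative; note that OpenAI's JSON mode also requires the word "JSON" to appear in the prompt):

```python
# Request payload for an OpenAI-compatible /chat/completions call
# with JSON mode enabled. Not all models or providers support it.
payload = {
    "model": "gpt-4o",
    "messages": [
        {"role": "user", "content": "Extract the invoice fields as JSON."}
    ],
    "response_format": {"type": "json_object"},
}
```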
## The stack
- FastAPI + asyncio for the proxy — async streaming is crucial, everything is an `AsyncGenerator`
- SQLite via SQLAlchemy async — simple enough for current scale: API key storage and usage logs
- Railway for deployment — zero-config, auto-deploy on git push, enough for early traction
- OpenRouter as the LLM backend — unified API for 200+ models, no need to manage individual provider keys
- No Redis, no Celery, no message queues — we specifically avoided operational complexity until we need it
## What's next
The repair logic is solid. The gap is distribution — getting developers to try it in the first place. We're currently in open beta: free tier, no credit card, 1,000 credits on signup. If you're hitting JSON errors in your LLM pipeline, give it five minutes.
The full benchmark data, methodology, and model breakdown is in the benchmark study. The repair patterns are documented in the engineer's guide. Drop an email to rozetyp@gmail.com if you want to talk through your specific case.