Tool-Call JSON Failures in LLM Agents:
What 288 Real API Calls Revealed
OpenAI's documentation quietly warns that tool-call arguments are "not guaranteed to be valid JSON." We decided to measure exactly how often that happens — and what breaks when it does.
TL;DR
- ✓ Strong models (GPT-4o-mini, DeepSeek-R1) produce clean tool arguments — 0% JSON errors
- ✗ Llama-3.3-70b returned malformed JSON in 71% of complex tool-call arguments
- ✓ StreamFix repaired 100% of the malformed arguments it received
- ⚠ Provider availability failures (503s, 404s) are a separate but related problem for agents
Methodology
We ran 288 real API calls across 6 models, 2 tool schemas, and 12 trials each, comparing direct OpenRouter access against the same calls routed via StreamFix. All calls used temperature=0 and tool_choice="required" to force tool use.
Simple schema
5 parameters, 1 enum, flight search
search_flights(origin, destination, date, passengers, cabin_class)
Complex schema
Nested objects, arrays, 3 levels deep
create_order(customer{}, items[{}], shipping_address{}, priority)
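Concretely, the two schemas look roughly like this as OpenAI-style tool definitions. This is a sketch: the nested property names and the enum values are illustrative, not the benchmark's verbatim schemas.

```python
# Sketch of the two benchmark tool schemas as OpenAI-style tool definitions.
# Nested field names and enum values are illustrative assumptions.
simple_tool = {
    "type": "function",
    "function": {
        "name": "search_flights",
        "parameters": {
            "type": "object",
            "properties": {
                "origin": {"type": "string"},
                "destination": {"type": "string"},
                "date": {"type": "string"},
                "passengers": {"type": "integer"},
                "cabin_class": {
                    "type": "string",
                    "enum": ["economy", "premium_economy", "business", "first"],
                },
            },
            "required": ["origin", "destination", "date",
                         "passengers", "cabin_class"],
        },
    },
}

complex_tool = {
    "type": "function",
    "function": {
        "name": "create_order",
        "parameters": {
            "type": "object",
            "properties": {
                # Three levels of nesting: object -> array -> object.
                "customer": {"type": "object", "properties": {
                    "name": {"type": "string"},
                    "email": {"type": "string"},
                }},
                "items": {"type": "array", "items": {
                    "type": "object", "properties": {
                        "product_id": {"type": "string"},
                        "quantity": {"type": "integer"},
                    },
                }},
                "shipping_address": {"type": "object", "properties": {
                    "street": {"type": "string"},
                    "city": {"type": "string"},
                }},
                "priority": {"type": "string"},
            },
        },
    },
}
```

Both definitions were passed in the `tools` array of each chat-completion request.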
| Model | Trials | Type |
|---|---|---|
| openai/gpt-4o-mini | 12 × 2 | Control (frontier) |
| deepseek/deepseek-r1 | 12 × 2 | Control (reasoning) |
| mistralai/mistral-small-24b | 12 × 2 | Mid-tier |
| meta-llama/llama-3.3-70b | 12 × 2 | Mid-tier open |
| qwen/qwen-2.5-7b-instruct | 12 × 2 | Small open |
| microsoft/phi-4 | 12 × 2 | Small frontier |
Results
Complex Schema Results (create_order)
% of trials where tool-call arguments were valid JSON
Deep Dive: Llama-3.3-70b Complex Schema
When Llama-3.3-70b returned tool-call arguments for the nested three-level schema (customer, items array, shipping address), the arguments were valid JSON only 29% of the time (2 of 7 successful tool calls). The dominant failure patterns were trailing commas and truncated nested objects: exactly the defects StreamFix's repair engine targets. When StreamFix received malformed arguments, it repaired 100% of them successfully.
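To make the repair class concrete, here is a minimal sketch that handles the two dominant patterns (trailing commas, truncated nesting). This is not StreamFix's actual engine; a production repairer must also cope with braces inside string values, missing commas between fields, and escaped quotes.

```python
import json
import re


def repair_json(s: str) -> str:
    """Minimal sketch: drop trailing commas and close truncated
    braces/brackets. Naive -- assumes no brace characters inside
    string values, and does not fix missing commas."""
    # Remove commas that sit immediately before a closing brace/bracket.
    s = re.sub(r",\s*([}\]])", r"\1", s)
    # Close any braces/brackets left open, in reverse order of opening.
    stack = []
    for ch in s:
        if ch in "{[":
            stack.append("}" if ch == "{" else "]")
        elif ch in "}]" and stack:
            stack.pop()
    return s + "".join(reversed(stack))


# A truncated argument string with a trailing comma, similar in shape
# to the failures we saw from Llama-3.3-70b:
broken = ('{"customer": {"name": "Jane", "email": "j@co.com",}, '
          '"items": [{"product_id": "SKU-789", "quantity": 3}')
fixed = repair_json(broken)
print(json.loads(fixed)["customer"]["name"])  # prints "Jane"
```

The stack walk restores the missing `]}` at the end, after which the string parses cleanly.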
What This Means for Agent Pipelines
In a typical agent loop (LangChain, CrewAI, AutoGen), when a tool-call argument fails to parse, the framework raises an exception and the entire agent run fails — not just one step. The cost isn't one bad API call; it's the whole multi-step run.
Wasted compute
A 10-step agent run that fails at step 7 wastes all 7 prior LLM calls
Silent failures
JSON parse errors in tool args often appear as vague framework exceptions, not obvious LLM errors
Compounds at scale
A 17% per-step success rate over 5 tool-call steps leaves roughly a 0.01% chance of completing a full run
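Assuming independent steps, the compounding arithmetic is quick to check (17% is Llama-3.3-70b's overall valid-argument rate on the complex schema, 2 of 12 trials):

```python
# Probability that every step of a multi-step agent run succeeds,
# assuming independent steps with the same per-step success rate.
per_step = 2 / 12            # ~17% valid tool-call arguments
steps = 5
full_run = per_step ** steps
print(f"{full_run:.5%}")     # prints "0.01286%" -- far below 1%
```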
```
AgentExecutionError: Tool 'create_order' failed
Caused by: json.JSONDecodeError: Expecting ',' delimiter: line 4 col 3
Tool call arguments were:
{"customer": {"name": "Jane", "email": "j@co.com",},
 "items": [{"product_id": "SKU-789" "quantity": 3}]   ← missing comma
```
With StreamFix, the proxy repairs the argument string before it reaches your framework's JSON parser. The agent never sees the error. The run continues.
```python
from langchain_openai import ChatOpenAI

# Route through StreamFix — tool-call arguments repaired automatically
llm = ChatOpenAI(
    model="meta-llama/llama-3.3-70b-instruct",
    api_key="sk_YOUR_STREAMFIX_KEY",
    base_url="https://streamfix.up.railway.app/v1",
)
```
Honest Caveats
GPT-4o-mini and DeepSeek-R1 produced 0 JSON errors across all trials. If you're only using frontier models, tool-call argument repair is likely a no-op for you.
Mistral-small's simple-schema success rate went from 8% (direct) to 100% (via StreamFix), but the direct calls were failing on 503 provider errors; StreamFix's improvement here came from handling those errors, not from repairing JSON.
Llama-3.3-70b sometimes returned plain text instead of a tool call (4 out of 12 trials). StreamFix can't force a model to use tools — it only repairs the JSON when a tool call IS returned.
A larger N would give tighter confidence intervals; we'll rerun with N=50 per model in a follow-up. The core pattern (Llama struggling with complex schemas) matches what we observed in our earlier 672-call streaming benchmark.
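For a sense of how wide the current intervals are, a standard 95% Wilson score interval for the 2-in-12 valid-argument rate spans roughly 5%–45% (our own back-of-envelope calculation, not a figure from the benchmark):

```python
from math import sqrt


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """95% Wilson score confidence interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    margin = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - margin, center + margin


lo, hi = wilson_interval(2, 12)
print(f"{lo:.1%} - {hi:.1%}")  # prints "4.7% - 44.8%"
```

At N=50 the same point estimate would shrink the interval to roughly half that width, which is why the rerun matters.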
When StreamFix Helps Most for Agents
✅ Use StreamFix if you're:
- Using open-source models (Llama, Mistral, Qwen) for cost reasons
- Running multi-step agent loops where one failure breaks everything
- Passing complex nested schemas as tool parameters
- Routing across multiple providers and want one reliability layer
⏭ Skip StreamFix if you're:
- Only using GPT-4o or Claude — both produce clean tool arguments
- Already using response_format everywhere
- Using constrained decoding (llama.cpp GBNF, vLLM guided decoding)
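For completeness, here is roughly what the response_format route looks like as an OpenAI-style request payload. This is a sketch; note that JSON mode constrains the assistant's message content, not tool-call arguments, so it guards a related but different failure surface (OpenAI's structured outputs, with "strict": true on the function definition, is the tool-call equivalent).

```python
# Sketch of an OpenAI-style JSON-mode request. The prompt text is
# illustrative; pass this dict to client.chat.completions.create(**request)
# with the official openai SDK.
request = {
    "model": "gpt-4o-mini",
    "response_format": {"type": "json_object"},  # forces valid JSON content
    "messages": [
        {"role": "system", "content": "Reply in JSON."},
        {"role": "user",
         "content": "Summarize this booking as JSON: SFO to JFK, 2 passengers."},
    ],
}
```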
Try it on your agent
One base_url change. 1,000 free requests. Works with LangChain, CrewAI, and any OpenAI-compatible framework.