Agent Reliability Benchmark · March 2026

Tool-Call JSON Failures in LLM Agents:
What 288 Real API Calls Revealed

OpenAI's documentation quietly warns that tool-call arguments are "not guaranteed to be valid JSON." We decided to measure exactly how often that happens — and what breaks when it does.

TL;DR

Mid-tier open models frequently emit malformed tool-call JSON on nested schemas: Llama-3.3-70b's arguments contained JSON errors in 71% of complex-schema trials. The frontier controls (GPT-4o-mini, DeepSeek-R1) produced zero errors on direct calls. Routing through StreamFix repaired 100% of the malformed arguments it received, lifting Llama's valid-JSON rate from 17% to 42%.

Methodology

We ran 288 real API calls across 6 models, 2 tool schemas, 12 trials each, comparing direct OpenRouter access vs the same calls routed via StreamFix. All calls used temperature=0 and tool_choice="required" to force tool use.

Simple schema

5 parameters, 1 enum, flight search

search_flights(origin, destination, date, passengers, cabin_class)

Complex schema

Nested objects, arrays, 3 levels deep

create_order(customer{}, items[{}], shipping_address{}, priority)
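For illustration, here is roughly what the complex tool definition and request parameters look like in the OpenAI tool-calling format. Field names beyond create_order(customer, items, shipping_address, priority) are our assumptions, not the benchmark's exact schema:

```python
# Sketch of the complex schema as an OpenAI-format tool definition.
# Nested property names are illustrative, not the benchmark's exact schema.
create_order_tool = {
    "type": "function",
    "function": {
        "name": "create_order",
        "parameters": {
            "type": "object",
            "properties": {
                "customer": {                      # level 2: nested object
                    "type": "object",
                    "properties": {
                        "name": {"type": "string"},
                        "email": {"type": "string"},
                        "phone": {"type": "string"},
                    },
                },
                "items": {                         # level 3: array of objects
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "product_id": {"type": "string"},
                            "quantity": {"type": "integer"},
                            "unit_price": {"type": "number"},
                        },
                    },
                },
                "shipping_address": {"type": "object"},
                "priority": {"type": "string"},
            },
            "required": ["customer", "items", "shipping_address"],
        },
    },
}

# The benchmark's request shape: deterministic, tool use forced.
request_kwargs = {
    "model": "meta-llama/llama-3.3-70b-instruct",
    "temperature": 0,
    "tool_choice": "required",
    "tools": [create_order_tool],
}
```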
Model                         Trials   Type
openai/gpt-4o-mini            12 × 2   Control (frontier)
deepseek/deepseek-r1          12 × 2   Control (reasoning)
mistralai/mistral-small-24b   12 × 2   Mid-tier
meta-llama/llama-3.3-70b      12 × 2   Mid-tier open
qwen/qwen-2.5-7b-instruct     12 × 2   Small open
microsoft/phi-4               12 × 2   Small frontier

Results

71%
of Llama-3.3-70b tool-call arguments contained JSON errors
on the complex nested schema (create_order), direct OpenRouter, 12 trials

Complex Schema Results (create_order)

% of trials where tool-call arguments were valid JSON

Model                Direct   StreamFix   Change
gpt-4o-mini          100%     100%        no change
deepseek-r1          100%     92%         −8% (noise)
mistral-small-24b    92%      100%        +8%
llama-3.3-70b ⚠️     17%      42%         +25%

Deep Dive: Llama-3.3-70b Complex Schema

When Llama-3.3-70b returned tool-call arguments for the nested 3-level schema (customer, items array, shipping address), the arguments were valid JSON only 29% of the time: 2 of the 7 trials that produced a tool call at all. Typical errors:

// Actual malformed output from Llama-3.3-70b:
{"customer": {"name": "Jane Smith", "email": "jane@example.com", "phone": "+1-555-0123",},
  "items": [{"product_id": "SKU-789", "quantity": 3, "unit_price": 29.99},
           {"product_id": "SKU-456", "quantity": 1, "unit_price": 149.99,}],
  "shipping_address": {/* truncated

Trailing commas and truncated nested objects were the dominant failure patterns — exactly the defects StreamFix's repair engine targets. When StreamFix received malformed arguments, it repaired 100% of them successfully.
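The two dominant defect classes above, trailing commas and truncated nesting, can be handled mechanically. Here is a minimal repair sketch for illustration only; this is not StreamFix's actual engine, and a production repairer must handle many more cases (e.g. commas inside string values):

```python
import json
import re

def repair_json(text: str) -> dict:
    """Minimal sketch of tool-call argument repair.
    Handles the two failure patterns seen in the benchmark:
    trailing commas and truncated nested objects."""
    # 1. Strip trailing commas before a closing brace/bracket.
    text = re.sub(r",\s*([}\]])", r"\1", text)
    # 2. Walk the text, tracking unmatched openers (outside strings).
    closers = []
    in_string = escape = False
    for ch in text:
        if escape:
            escape = False
            continue
        if ch == "\\":
            escape = True
        elif ch == '"':
            in_string = not in_string
        elif not in_string:
            if ch in "{[":
                closers.append("}" if ch == "{" else "]")
            elif ch in "}]" and closers:
                closers.pop()
    # 3. Close a string cut off mid-value, drop a dangling comma,
    #    then close every still-open brace/bracket.
    if in_string:
        text += '"'
    text = text.rstrip().rstrip(",")
    text += "".join(reversed(closers))
    return json.loads(text)
```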

What This Means for Agent Pipelines

In a typical agent loop (LangChain, CrewAI, AutoGen), when a tool-call argument fails to parse, the framework raises an exception and the entire agent run fails — not just one step. The cost isn't one bad API call; it's the whole multi-step run.
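A minimal sketch of that failure mode, with a hypothetical run_agent loop standing in for the framework's executor (the actual frameworks differ in detail, but all parse tool-call arguments before dispatch):

```python
import json

# Hypothetical stand-in for a framework's agent executor: each step's
# tool-call arguments arrive as a raw string and must parse before dispatch.
def run_agent(raw_tool_args: list[str]) -> list[dict]:
    completed = []
    for step, raw in enumerate(raw_tool_args, start=1):
        try:
            completed.append(json.loads(raw))  # the framework's parse step
        except json.JSONDecodeError as err:
            # Steps 1..step-1 are already-paid-for compute,
            # and the whole run dies here.
            raise RuntimeError(f"agent run failed at step {step}: {err}")
    return completed
```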

💸

Wasted compute

A 10-step agent run that fails at step 7 wastes all 7 prior LLM calls

⏱️

Silent failures

JSON parse errors in tool args often appear as vague framework exceptions, not obvious LLM errors

📈

Compounds at scale

A 17% per-step success rate compounded over 5 tool-call steps leaves roughly a 0.01% chance of completing a full run
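The compounding is easy to check with the per-step validity rates from the table above (a back-of-envelope sketch that assumes each step fails independently):

```python
# Per-step valid-JSON rates for llama-3.3-70b on the complex schema.
direct, proxied = 0.17, 0.42
steps = 5  # a modest multi-step agent run

# Probability that every one of the 5 tool calls parses cleanly.
p_direct = direct ** steps    # about 0.014% of runs complete
p_proxied = proxied ** steps  # about 1.3% of runs complete

print(f"direct:  {p_direct:.4%} chance of a clean {steps}-step run")
print(f"proxied: {p_proxied:.4%}")
```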

Example: LangChain agent failure
AgentExecutionError: Tool 'create_order' failed
  Caused by: json.JSONDecodeError: Expecting ',' delimiter: line 4 col 3

  Tool call arguments were:
  {"customer": {"name": "Jane", "email": "j@co.com",},
   "items": [{"product_id": "SKU-789" "quantity": 3}]  ← missing comma

With StreamFix, the proxy repairs the argument string before it reaches your framework's JSON parser. The agent never sees the error. The run continues.

from langchain_openai import ChatOpenAI

# Route through StreamFix — tool-call arguments repaired automatically
llm = ChatOpenAI(
    model="meta-llama/llama-3.3-70b-instruct",
    api_key="sk_YOUR_STREAMFIX_KEY",
    base_url="https://streamfix.up.railway.app/v1"
)

Honest Caveats

Strong models don't need this

GPT-4o-mini and DeepSeek-R1 produced zero JSON errors on direct calls across all trials. If you're only using frontier models, tool-call argument repair is likely a no-op for you.

Some results are availability, not JSON repair

Mistral-small's simple schema went from 8% → 100%, but direct calls hit 503 provider errors — StreamFix's improvement here came from handling those errors, not repairing JSON.

StreamFix doesn't fix "no tool call returned"

Llama-3.3-70b sometimes returned plain text instead of a tool call (4 out of 12 trials). StreamFix can't force a model to use tools — it only repairs the JSON when a tool call IS returned.

12 trials per model is a small sample

Larger N would give tighter confidence intervals; we'll rerun with N=50 per model in a follow-up. The core pattern (Llama struggling with complex schemas) matches what we observed in our earlier 672-call streaming benchmark.

When StreamFix Helps Most for Agents

✅ Use StreamFix if you're:

  • Using open-source models (Llama, Mistral, Qwen) for cost reasons
  • Running multi-step agent loops where one failure breaks everything
  • Passing complex nested schemas as tool parameters
  • Routing across multiple providers and want one reliability layer

⏭ Skip StreamFix if you're:

  • Only using GPT-4o or Claude — both produce clean tool arguments
  • Already using response_format everywhere
  • Using constrained decoding (llama.cpp GBNF, vLLM guided decoding)

Try it on your agent

One base_url change. 1,000 free requests. Works with LangChain, CrewAI, and any OpenAI-compatible framework.

base_url="https://streamfix.up.railway.app/v1"
Get Free Key →
Benchmark date: March 2, 2026 · 288 total API calls · 6 models tested · Raw data available on request