Agent Reliability Benchmark · March 2026

Tool-Call JSON Failures in LLM Agents:
What 288 Real API Calls Revealed

OpenAI's documentation quietly warns that tool-call arguments are "not guaranteed to be valid JSON." We decided to measure exactly how often that happens — and what breaks when it does.

TL;DR

Mid-tier open models frequently emit malformed tool-call JSON on nested schemas: Llama-3.3-70b's arguments contained JSON errors in 71% of complex-schema trials. The frontier controls (GPT-4o-mini, DeepSeek-R1) produced zero errors on direct calls. Routing through StreamFix repaired 100% of the malformed arguments it received, lifting Llama's valid-JSON rate from 17% to 42%.

Methodology

We ran 288 real API calls across 6 models, 2 tool schemas, 12 trials each, comparing direct OpenRouter access vs the same calls routed via StreamFix. All calls used temperature=0 and tool_choice="required" to force tool use.

Simple schema

5 parameters, 1 enum, flight search

search_flights(origin, destination, date, passengers, cabin_class)

Complex schema

Nested objects, arrays, 3 levels deep

create_order(customer{}, items[{}], shipping_address{}, priority)
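For illustration, here is roughly what the complex tool definition and request parameters look like in the OpenAI tool-calling format. Field names beyond create_order(customer, items, shipping_address, priority) are our assumptions, not the benchmark's exact schema:

```python
# Sketch of the complex schema as an OpenAI-format tool definition.
# Nested property names are illustrative, not the benchmark's exact schema.
create_order_tool = {
    "type": "function",
    "function": {
        "name": "create_order",
        "parameters": {
            "type": "object",
            "properties": {
                "customer": {                      # level 2: nested object
                    "type": "object",
                    "properties": {
                        "name": {"type": "string"},
                        "email": {"type": "string"},
                        "phone": {"type": "string"},
                    },
                },
                "items": {                         # level 3: array of objects
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "product_id": {"type": "string"},
                            "quantity": {"type": "integer"},
                            "unit_price": {"type": "number"},
                        },
                    },
                },
                "shipping_address": {"type": "object"},
                "priority": {"type": "string"},
            },
            "required": ["customer", "items", "shipping_address"],
        },
    },
}

# The benchmark's request shape: deterministic, tool use forced.
request_kwargs = {
    "model": "meta-llama/llama-3.3-70b-instruct",
    "temperature": 0,
    "tool_choice": "required",
    "tools": [create_order_tool],
}
```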
Model                         Trials   Type
openai/gpt-4o-mini            12 × 2   Control (frontier)
deepseek/deepseek-r1          12 × 2   Control (reasoning)
mistralai/mistral-small-24b   12 × 2   Mid-tier
meta-llama/llama-3.3-70b      12 × 2   Mid-tier open
qwen/qwen-2.5-7b-instruct     12 × 2   Small open
microsoft/phi-4               12 × 2   Small frontier

Results

71%
of Llama-3.3-70b tool-call arguments contained JSON errors
on the complex nested schema (create_order), direct OpenRouter, 12 trials

Complex Schema Results (create_order)

% of trials where tool-call arguments were valid JSON

Model                Direct   StreamFix   Change
gpt-4o-mini          100%     100%        no change
deepseek-r1          100%     92%         −8% (noise)
mistral-small-24b    92%      100%        +8%
llama-3.3-70b ⚠️     17%      42%         +25%

Deep Dive: Llama-3.3-70b Complex Schema

When Llama-3.3-70b returned tool-call arguments for the nested 3-level schema (customer, items array, shipping address), the arguments were valid JSON only 29% of the time: 2 of the 7 trials that produced a tool call at all. Typical errors:

// Actual malformed output from Llama-3.3-70b:
{"customer": {"name": "Jane Smith", "email": "jane@example.com", "phone": "+1-555-0123",},
  "items": [{"product_id": "SKU-789", "quantity": 3, "unit_price": 29.99},
           {"product_id": "SKU-456", "quantity": 1, "unit_price": 149.99,}],
  "shipping_address": {/* truncated

Trailing commas and truncated nested objects were the dominant failure patterns — exactly the defects StreamFix's repair engine targets. When StreamFix received malformed arguments, it repaired 100% of them successfully.
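The two dominant defect classes above, trailing commas and truncated nesting, can be handled mechanically. Here is a minimal repair sketch for illustration only; this is not StreamFix's actual engine, and a production repairer must handle many more cases (e.g. commas inside string values):

```python
import json
import re

def repair_json(text: str) -> dict:
    """Minimal sketch of tool-call argument repair.
    Handles the two failure patterns seen in the benchmark:
    trailing commas and truncated nested objects."""
    # 1. Strip trailing commas before a closing brace/bracket.
    text = re.sub(r",\s*([}\]])", r"\1", text)
    # 2. Walk the text, tracking unmatched openers (outside strings).
    closers = []
    in_string = escape = False
    for ch in text:
        if escape:
            escape = False
            continue
        if ch == "\\":
            escape = True
        elif ch == '"':
            in_string = not in_string
        elif not in_string:
            if ch in "{[":
                closers.append("}" if ch == "{" else "]")
            elif ch in "}]" and closers:
                closers.pop()
    # 3. Close a string cut off mid-value, drop a dangling comma,
    #    then close every still-open brace/bracket.
    if in_string:
        text += '"'
    text = text.rstrip().rstrip(",")
    text += "".join(reversed(closers))
    return json.loads(text)
```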

What This Means for Agent Pipelines

In a typical agent loop (LangChain, CrewAI, AutoGen), when a tool-call argument fails to parse, the framework raises an exception and the entire agent run fails — not just one step. The cost isn't one bad API call; it's the whole multi-step run.
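A minimal sketch of that failure mode, with a hypothetical run_agent loop standing in for the framework's executor (the actual frameworks differ in detail, but all parse tool-call arguments before dispatch):

```python
import json

# Hypothetical stand-in for a framework's agent executor: each step's
# tool-call arguments arrive as a raw string and must parse before dispatch.
def run_agent(raw_tool_args: list[str]) -> list[dict]:
    completed = []
    for step, raw in enumerate(raw_tool_args, start=1):
        try:
            completed.append(json.loads(raw))  # the framework's parse step
        except json.JSONDecodeError as err:
            # Steps 1..step-1 are already-paid-for compute,
            # and the whole run dies here.
            raise RuntimeError(f"agent run failed at step {step}: {err}")
    return completed
```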

💸

Wasted compute

A 10-step agent run that fails at step 7 wastes all 7 prior LLM calls

⏱️

Silent failures

JSON parse errors in tool args often appear as vague framework exceptions, not obvious LLM errors

📈

Compounds at scale

A 17% per-step success rate compounded over 5 tool-call steps leaves roughly a 0.01% chance of completing a full run
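The compounding is easy to check with the per-step validity rates from the table above (a back-of-envelope sketch that assumes each step fails independently):

```python
# Per-step valid-JSON rates for llama-3.3-70b on the complex schema.
direct, proxied = 0.17, 0.42
steps = 5  # a modest multi-step agent run

# Probability that every one of the 5 tool calls parses cleanly.
p_direct = direct ** steps    # about 0.014% of runs complete
p_proxied = proxied ** steps  # about 1.3% of runs complete

print(f"direct:  {p_direct:.4%} chance of a clean {steps}-step run")
print(f"proxied: {p_proxied:.4%}")
```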

Example: LangChain agent failure
AgentExecutionError: Tool 'create_order' failed
  Caused by: json.JSONDecodeError: Expecting ',' delimiter: line 4 col 3

  Tool call arguments were:
  {"customer": {"name": "Jane", "email": "j@co.com",},
   "items": [{"product_id": "SKU-789" "quantity": 3}]  ← missing comma

With StreamFix, the proxy repairs the argument string before it reaches your framework's JSON parser. The agent never sees the error. The run continues.

from langchain_openai import ChatOpenAI

# Route through StreamFix — tool-call arguments repaired automatically
llm = ChatOpenAI(
    model="meta-llama/llama-3.3-70b-instruct",
    api_key="sk_YOUR_STREAMFIX_KEY",
    base_url="https://streamfix.up.railway.app/v1"
)

Honest Caveats

Strong models don't need this

GPT-4o-mini and DeepSeek-R1 produced zero JSON errors on direct calls across all trials. If you're only using frontier models, tool-call argument repair is likely a no-op for you.

Some results are availability, not JSON repair

Mistral-small's simple schema went from 8% → 100%, but direct calls hit 503 provider errors — StreamFix's improvement here came from handling those errors, not repairing JSON.

StreamFix doesn't fix "no tool call returned"

Llama-3.3-70b sometimes returned plain text instead of a tool call (4 out of 12 trials). StreamFix can't force a model to use tools — it only repairs the JSON when a tool call IS returned.

12 trials per model is a small sample

Larger N would give tighter confidence intervals; we'll rerun with N=50 per model in a follow-up. The core pattern (Llama struggling with complex schemas) matches what we observed in our earlier 672-call streaming benchmark.

When StreamFix Helps Most for Agents

✅ Use StreamFix if you're:

  • Using open-source models (Llama, Mistral, Qwen) for cost reasons
  • Running multi-step agent loops where one failure breaks everything
  • Passing complex nested schemas as tool parameters
  • Routing across multiple providers and want one reliability layer

⏭ Skip StreamFix if you're:

  • Only using GPT-4o or Claude — both produce clean tool arguments
  • Already using response_format everywhere
  • Using constrained decoding (llama.cpp GBNF, vLLM guided decoding)

Try it on your agent

One base_url change. 1,000 free requests. Works with LangChain, CrewAI, and any OpenAI-compatible framework.

base_url="https://streamfix.up.railway.app/v1"
Get Free Key →
Benchmark date: March 2, 2026 · 288 total API calls · 6 models tested · Raw data available on request