Building Reliable Tool-Calling Agents: Avoiding the Pitfalls of LLM Function Calling in Production

Function calling (or tool use) is one of the most powerful capabilities in modern LLMs — and one of the most fragile in production. An LLM that calls tools can perform actions: query a database, send an email, call an API, execute code. This power means that when it goes wrong, it goes wrong consequentially. This post is about making it go right.

How Tool Calling Works and Where It Breaks

When you define tools for an LLM, you provide a schema describing each tool's name, purpose, and parameter types. The model decides when to call a tool and generates the arguments. Your code executes the tool and returns the result, which the model uses to continue its response.

The failure points are at every junction in this chain:

1The model calls the wrong tool for the task — because tool descriptions are ambiguous or overlapping.
2The model generates invalid arguments — a string where an integer is required, an out-of-range value, a hallucinated ID that does not exist.
3The tool execution fails — network timeout, database error, permission denied — and the model does not handle the error gracefully.
4The model misinterprets the tool result — acts on a success response as if it were a failure, or vice versa.
5The model calls tools in a loop — each call produces a result that triggers another call indefinitely.

Tool Design: The First Line of Defense

Most tool-calling failures originate in poor tool design, not model limitations. A well-designed tool is hard to misuse. An ambiguous one invites errors.

One tool, one purpose: A tool that does three things will be called for one of them when another was intended. Split multi-purpose tools.
Explicit, unambiguous descriptions: The description is what the model uses to decide whether to call the tool. 'Search the product catalog by exact SKU — use this only when you have a specific SKU' is better than 'Search products'.
Constrained parameter types: Use enums for parameters with a known valid set. Use specific types (integer, not string) for numeric IDs. The model cannot generate an invalid enum value if you constrain it.
Include examples in tool descriptions: A few example calls in the tool's description dramatically improve call accuracy, especially for tools with non-obvious argument formats.

python

# Well-designed tool definition
tools = [
    {
        "name": "lookup_customer",
        "description": (
            "Retrieve a customer record by their unique customer ID. "
            "Use this when you have a specific customer ID and need their details. "
            "Do NOT use this for searching by name — use search_customers instead. "
            "Example: lookup_customer(customer_id='CUST-12345')"
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "customer_id": {
                    "type": "string",
                    "pattern": "^CUST-[0-9]{5}$",
                    "description": "Customer ID in format CUST-XXXXX"
                }
            },
            "required": ["customer_id"]
        }
    }
]

Input Validation Before Execution

Never execute a tool call with LLM-generated arguments without validating them first. The model can and does generate arguments that look plausible but violate your system's constraints. Validate before execution; return a structured error to the model if validation fails.

python

from pydantic import BaseModel, ValidationError, field_validator
import re

class LookupCustomerArgs(BaseModel):
    customer_id: str

    @field_validator("customer_id")
    @classmethod
    def validate_format(cls, v: str) -> str:
        if not re.match(r"^CUST-[0-9]{5}$", v):
            raise ValueError(f"Invalid customer ID format: {v}")
        return v

def execute_tool_call(tool_name: str, raw_args: dict) -> dict:
    validators = {"lookup_customer": LookupCustomerArgs}

    if tool_name not in validators:
        return {"error": f"Unknown tool: {tool_name}"}

    try:
        args = validators[tool_name](**raw_args)
    except ValidationError as e:
        # Return structured error — the model can correct its call
        return {"error": "Invalid arguments", "details": str(e)}

    return run_tool(tool_name, args)

Retry Logic and Error Handling

When a tool call fails — validation error, network timeout, permission denied — the model should receive a structured error response and have the opportunity to retry or take a different path. The common mistake is catching the error silently and passing an empty or misleading result back to the model.

Return errors as structured tool results, not exceptions: the model can read and reason about a JSON error response; it cannot handle a Python exception.
Include actionable information in error messages: 'Customer ID CUST-99999 does not exist' is more useful to the model than 'Not found'.
Set a maximum retry count per tool call: allow the model to retry a failed call once after receiving the error. Beyond one retry, return a final error to the user rather than looping.
Distinguish transient from permanent errors: a database timeout is retryable; a validation error is not. Return different error types so the model can choose the appropriate response.

Preventing Runaway Tool Loops

A model that calls tools in a loop will exhaust your API budget and return nothing useful. The loop happens when the model's tool results do not resolve its uncertainty — it keeps calling tools searching for information it will never find.

Hard step limit: enforce a maximum number of tool calls per agent invocation (typically 10–20 for most use cases). When the limit is hit, return whatever has been gathered so far.
Loop detection: track the sequence of tool calls. If the same tool is called with the same arguments twice, interrupt and return an error.
Progress requirement: require that each tool call result in new information being added to the agent's context. If a tool result is semantically identical to a previous result, treat it as a loop.

Warning:For tools that have side effects — sending emails, writing to databases, calling external APIs — add an explicit confirmation step before execution, or restrict these tools to agents operating in a human-in-the-loop mode. An agent that loops on a side-effectful tool can cause irreversible damage.

Testing Tool-Calling Reliability

Tool-calling agents require a different testing approach from standard LLM outputs. For each tool, write unit tests that verify the model selects the correct tool and generates valid arguments for representative inputs:

Correct tool selection: given a query that should trigger tool X, verify the model calls X and not Y.
Valid argument generation: run 50 representative queries through the agent and validate that generated arguments pass your schema validators. Track the validation failure rate.
Error recovery: deliberately return error responses from tools and verify the model handles them gracefully rather than hallucinating a success.
Loop resistance: craft inputs designed to produce ambiguous tool results and verify the agent terminates within its step limit.