Tool Use & Function Calling

The mechanism that turns a chatbot into something that can actually do work. Without it, you have a conversation. With it, you have an agent.

Agents & Automation

The Technical Definition

Tool use (also called function calling) is the mechanism that lets a language model invoke external code. The model doesn’t run the function itself — it produces a structured request (“call this function with these arguments”) that your application executes, then feeds the result back into the model’s context. The model can then read the result and decide what to do next: call another function, ask the user a question, or produce a final answer.

This is the foundation of every agent. An LLM by itself is a text predictor; it can’t query your database, send an email, or read a file. Function calling is the wire that connects the model’s language ability to the systems where work actually happens.

What This Actually Means for Your Business

Every “AI agent” pitch you’ve heard in the last eighteen months is built on function calling. The vendor describes a system that books meetings, processes invoices, runs analyses, sends reports — but the only thing the model itself does is decide which function to call, with which arguments, in what order. Your engineers (or the vendor’s) wrote the actual functions. The model is the conductor; the functions are the orchestra.

This matters because the quality of your agent is mostly the quality of your tool definitions, not the quality of your model. A great LLM with badly-described functions will fail. A merely good LLM with crisp, well-documented functions and clear input schemas will work. The work of building a useful agent is mostly the work of designing the toolset — what functions to expose, what they accept, what they return, what error states they signal back. That’s traditional software engineering, not AI work.

Two operational realities sneak up on teams. First: every function call is an attack surface. A function that can update a record can update the wrong record. A function that can send an email can send it to the wrong person. The model doesn’t validate intent — it executes what its reasoning landed on. Permissioning, logging, and rollback are not optional. Second: function calling is non-deterministic in a way that traditional integrations are not. The same prompt can produce different function calls on different runs. Testing has to account for this. A single demo working doesn’t mean the system works.

The cost dimension also gets ignored. Each function call adds a round trip — model call, function execution, model call again — and each round trip is tokens billed and latency added. A workflow with five function calls is five times the cost and latency of a single response. At scale, that math compounds.

Reality Check

What the vendor says: “Our AI agent has access to over 200 tools and can take action across your entire tech stack.”

What that means in practice: They’ve defined 200 function schemas. Whether the model picks the right one, calls it with the right arguments, and handles errors correctly varies by task. More tools is not better — past about 30 to 50, the model’s accuracy at choosing the right one drops measurably. Ask which tools are actually used in production and how often the agent picks the wrong one.

What Operators Actually Do

The pattern that works: start with a small, sharp toolset. Five well-designed functions covering the critical workflow beat fifty mediocre ones. Add tools only when the missing capability is causing real friction. Every tool you add increases the attack surface, the test matrix, and the chance the model picks something wrong.

Smart teams also separate read tools from write tools and gate them differently. The agent can freely query — read records, search documents, pull data. Anything that writes (creates a record, sends a message, charges a card) goes through a confirmation step, an approval queue, or a strict permissions check. The cost of a wrong read is wasted compute. The cost of a wrong write is a customer incident.

The other pattern: structured logging of every function call, with arguments and results, queryable by your compliance and ops teams. When something goes wrong at 2 AM, you need to be able to see what the agent decided to do, what it actually did, and why. Without that, you’re debugging a black box.

The Questions to Ask

How many tools does the agent actually use, and how often does it pick the wrong one? Vendors brag about tool count. Operators care about tool selection accuracy. Ask for the number, not the marketing.
Which tools can write versus read, and what gates the writes? If the answer is “the model decides,” that’s the answer. Plan accordingly.
What’s the audit trail when the agent calls a function with the wrong arguments? Can your team reconstruct what happened, why, and what the impact was? If not, you’re operating without a flight recorder.

The Technical Definition

What This Actually Means for Your Business

Reality Check

What Operators Actually Do

The Questions to Ask

One operator. Every other Wednesday.