What is LLM Function Calling for Integrations? (2026 Architecture Guide)
LLM function calling lets AI agents trigger external APIs via structured JSON. Learn how to architect it for production B2B SaaS with multi-tenant OAuth, rate limits, and pagination.
You have built an AI agent that correctly identifies user intent, formats the required JSON arguments, and triggers a function call. In your local development environment, it works flawlessly. Then you deploy it to production. Within hours, your agent is trapped in an infinite pagination loop, blocked by aggressive rate limits, and failing because an OAuth token expired mid-reasoning.
The model is not the problem. The integration infrastructure is.
This guide explains exactly what LLM function calling is in the context of B2B SaaS integrations, why local prototypes fail in production, and how to architect a system that reliably connects AI agents to third-party APIs at scale.
What is LLM Function Calling? (The Basics)
LLM function calling (also called tool calling) is the mechanism by which a large language model outputs a structured JSON object describing which external API to call and with what arguments — instead of generating free-form text. The LLM never executes the call itself. Your application receives the JSON, executes the API request, returns the result, and the LLM uses that result to formulate its response.
Function calling is the bridge between an LLM's reasoning and real-world action. It lets an AI agent do things — create a Jira ticket, fetch CRM contacts, update an HRIS record — rather than just talk about doing them.
Here is the mechanical flow:
- Context Provision: You provide the LLM with a system prompt and a list of available "tools." Each tool includes a name, a description, and a JSON Schema defining its required inputs.
- Intent Parsing: The user asks a question (e.g., "List my open deals in Salesforce").
- Function Generation: The LLM recognizes it needs external data. It halts text generation and outputs a structured JSON object targeting your list_deals tool, populating the required fields.
- Execution: Your application intercepts this JSON, executes the actual HTTP request to the third-party API (like HubSpot or Salesforce), and captures the response.
- Resolution: Your application feeds the API response back to the LLM, which then summarizes the result for the user.
```mermaid
sequenceDiagram
    participant User
    participant App as Your Application
    participant LLM
    participant API as External API<br>(Salesforce, Jira, etc.)
    User->>App: "List my open deals in Salesforce"
    App->>LLM: User message + tool definitions
    LLM->>App: {"function": "list_deals",<br>"args": {"status": "open"}}
    App->>API: GET /deals?status=open
    API->>App: [deal_1, deal_2, deal_3]
    App->>LLM: Tool result: [deal_1, deal_2, deal_3]
    LLM->>App: "You have 3 open deals..."
    App->>User: Formatted response
```

Every major model provider — OpenAI, Anthropic, Google — now supports function calling natively. Frameworks like LangChain and LlamaIndex provide abstractions on top. The concept is straightforward. The engineering challenge lives entirely in what happens at step 4: executing the actual API call against a real third-party system in a production, multi-tenant environment.
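The loop above can be sketched in a few lines of Python. Everything here is illustrative: the `list_deals` function, the fake deal data, and the dispatch table stand in for a real tool registry and a real CRM request.

```python
import json

# Hypothetical local stand-in for the real CRM call the application
# would make in step 4 (Execution).
def list_deals(status):
    deals = {"open": ["Acme renewal", "Globex expansion", "Initech upsell"]}
    return deals.get(status, [])

TOOLS = {"list_deals": list_deals}

def execute_tool_call(llm_output):
    """The application, not the LLM, executes the call (step 4), then
    returns the result for the model to summarize (step 5)."""
    call = json.loads(llm_output)
    result = TOOLS[call["function"]](**call["args"])
    return json.dumps({"tool": call["function"], "result": result})

# The LLM emitted structured JSON instead of free-form text:
tool_result = execute_tool_call('{"function": "list_deals", "args": {"status": "open"}}')
```

The key property: the model only ever produces and consumes JSON; the application owns every network call.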
The SaaS Integration Bottleneck for AI Agents
The industry is rapidly shifting from generative AI to agentic AI. Gartner predicts that 40% of enterprise applications will be integrated with task-specific AI agents by the end of 2026, up from less than 5% today — marking one of the fastest transformations in enterprise technology since the adoption of the public cloud. Enterprise AI adoption has already crossed the mainstream threshold, with 78% of organizations using AI in at least one business function as of early 2025.
But there is a massive roadblock. According to the MuleSoft Connectivity Benchmark Report (produced in collaboration with Vanson Bourne and Deloitte Digital, surveying over 1,000 IT leaders), 95% of IT leaders say integration issues block AI adoption. The average enterprise manages 897 applications, yet only 29% are integrated. That means 71% of enterprise data is siloed in applications your AI agent cannot access. When an agent tries to answer a question that spans data in the CRM, the HRIS, and the ticketing system, it is operating with roughly a quarter of the available context.
Organizations estimate that integration challenges cost them an average of $6.8 million annually in lost productivity and delayed projects. If your AI agent cannot securely read from and write to the specific Jira, Workday, or Salesforce instances your customers use, it is nothing more than an expensive chatbot.
Why Local AI Agent Prototypes Fail in Production
Building a function-calling demo that works against a single Salesforce sandbox takes an afternoon. Shipping one that works across 50 customers, each with their own Salesforce org, HubSpot account, or Workday tenant — that takes months. Frameworks like LangChain and LlamaIndex make it trivial to build a local agent. You hardcode a personal API key, define a simple Python function, and watch the agent work. However, as we explored in our breakdown of integration platforms for LangChain and LlamaIndex, the gap between prototype and production is almost entirely an infrastructure problem, not a model problem.
Here is what breaks when you move to a multi-tenant B2B SaaS environment:
The Multi-Tenant OAuth Nightmare
Each of your customers authenticates against their own instance of a third-party app. You need to store, refresh, and rotate OAuth tokens per customer, per integration. LLMs do not understand OAuth 2.0 flows. When an access token expires while the LLM is planning a multi-step execution, the subsequent API call fails with a 401 Unauthorized error. If you feed that 401 back to the LLM, it will either hallucinate a fix, retry the same broken call in a loop, or apologize and stop trying — burning tokens and latency in the process. Token management — refreshing tokens, handling invalid_grant errors, and securing credentials — must happen entirely outside the LLM's context window.
Aggressive and Inconsistent Rate Limits
When an AI agent is tasked with summarizing 50 recent support tickets, it will try to fetch them as fast as possible. HubSpot's API gives you 500 requests per 10 seconds on a Professional plan. Salesforce has concurrent request limits that depend on the edition. Jira Cloud returns 429 Too Many Requests with a Retry-After header. As we've covered in our guide on handling third-party API rate limits, your AI agent has no concept of rate budgets. If a user asks "get me all the contacts from our CRM," the agent will happily try to paginate through 100,000 records, burning through rate limits in seconds. Feeding a 429 error back to the LLM is a mistake — it will either retry immediately (getting blocked again) or give up entirely.
Pagination Hallucinations
Every API handles pagination differently. Salesforce uses cursor-based pagination with a nextRecordsUrl. HubSpot uses after cursors. Some legacy APIs still use offset/limit. A few return all results in a single response unless you explicitly request pages. If you hand raw pagination metadata to an LLM and ask it to fetch the next page, it will frequently hallucinate the cursor string, try to decode base64-encoded cursors, strip URL parameters, or miscalculate offsets.
Undocumented API Edge Cases
Vendor API documentation is notoriously unreliable. Required fields are often missing from the spec. Date formats vary between endpoints in the same API. Custom fields are wildly different between customer instances. If your tool schema does not perfectly match the reality of the customer's specific Salesforce instance, the API will reject the payload with a 400 Bad Request. These are the papercuts that kill production integrations, and no amount of prompt engineering will fix them.
How to Architect LLM Tool Calling for Integrations
To build production-grade agentic workflows, you must physically separate the LLM reasoning loop from the API execution realities. You achieve this by introducing a proxy layer between your LLM agent and the third-party APIs — a layer that handles every piece of infrastructure the LLM should never have to think about.
```mermaid
sequenceDiagram
    participant User
    participant LLM as AI Agent (LLM)
    participant Proxy as Proxy API Layer
    participant Vendor as Third-Party API (CRM, ATS)
    User->>LLM: "Find recent high-priority tickets"<br>(Natural Language)
    LLM->>Proxy: Tool Call: list_tickets<br>{"priority": "high"}
    Note over Proxy: Injects OAuth Token<br>Normalizes Pagination<br>Applies Rate Limiting
    Proxy->>Vendor: GET /api/v2/tickets?priority=high
    Vendor-->>Proxy: 429 Too Many Requests
    Note over Proxy: Intercepts 429<br>Executes Exponential Backoff<br>Retries Request
    Proxy->>Vendor: GET /api/v2/tickets?priority=high
    Vendor-->>Proxy: 200 OK + Raw Data
    Proxy-->>LLM: Normalized JSON Response
    LLM-->>User: "I found 3 high-priority tickets..."<br>(Natural Language)
```

Here is what each component does and why it matters for function calling specifically:
OAuth token management outside the agent loop
The proxy layer maintains a secure, encrypted store of integrated accounts. When the LLM requests data, it simply passes an integrated_account_id. The proxy intercepts the request, looks up the correct OAuth credentials, checks the token expiry, and refreshes it if necessary before injecting the Bearer token into the outbound request. Token refreshes should happen proactively — before they expire — so the agent never sees an authentication failure. If a token does expire unexpectedly (revoked by the user, rotated by the provider), the proxy should attempt a refresh and replay the request transparently, only surfacing an error to the agent if re-authentication is truly required. The LLM never sees a token, and it never sees a 401.
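A minimal sketch of the proactive-refresh pattern, assuming a hypothetical `refresh_fn` that stands in for the provider's OAuth token-refresh endpoint (in production this store would be encrypted and persistent, not an in-memory dict):

```python
import time

class TokenStore:
    """Per-tenant token store that refreshes proactively, before expiry.
    `refresh_fn` is a stand-in for the provider's OAuth refresh call."""

    REFRESH_MARGIN = 300  # refresh 5 minutes before the token expires

    def __init__(self, refresh_fn):
        self._tokens = {}  # integrated_account_id -> (access_token, expires_at)
        self._refresh_fn = refresh_fn

    def store(self, account_id, access_token, expires_at):
        self._tokens[account_id] = (access_token, expires_at)

    def get(self, account_id):
        """Return a valid Bearer token; the agent never sees a 401."""
        token, expires_at = self._tokens[account_id]
        if time.time() >= expires_at - self.REFRESH_MARGIN:
            token, expires_at = self._refresh_fn(account_id)
            self._tokens[account_id] = (token, expires_at)
        return token

# Usage: a token expiring in 10 seconds is refreshed before it is handed out.
store = TokenStore(lambda account_id: ("fresh-token", time.time() + 3600))
store.store("acct_1", "stale-token", time.time() + 10)
```

Because the refresh happens on lookup, inside the proxy, the LLM's reasoning loop is never interrupted by an expiring credential.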
Rate limit interception with exponential backoff
When the proxy receives a 429 response, it should:
- Parse the Retry-After header (or use a sensible default)
- Apply exponential backoff with jitter
- Retry the request transparently
- Return the successful result to the agent
The LLM should never see a rate limit error. If it does, it will likely include the error in its reasoning, apologize to the user, and stop trying — exactly the wrong behavior when the correct action is to wait 2 seconds and retry. From the LLM's perspective, it simply experiences a slightly longer response time.
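That retry policy can be sketched as follows; `do_request` is a placeholder for the actual outbound HTTP call, and the payload shapes are illustrative:

```python
import random
import time

def backoff_delay(attempt, retry_after=None, base=1.0, cap=30.0):
    """Seconds to wait before retry `attempt` (0-indexed): honor the
    Retry-After header when present, else exponential backoff with jitter."""
    if retry_after is not None:
        return float(retry_after)
    return random.uniform(0, min(cap, base * (2 ** attempt)))

def fetch_with_retries(do_request, max_attempts=5):
    """`do_request` returns (status_code, payload). 429s are retried
    transparently; the agent only ever sees the successful payload."""
    for attempt in range(max_attempts):
        status, payload = do_request()
        if status != 429:
            return payload
        time.sleep(backoff_delay(attempt, payload.get("retry_after")))
    raise RuntimeError("rate limit retries exhausted")

# Usage: a request that is rate-limited twice, then succeeds.
responses = iter([(429, {"retry_after": 0}), (429, {"retry_after": 0}),
                  (200, {"results": ["ticket_1", "ticket_2"]})])
data = fetch_with_retries(lambda: next(responses))
```

The jitter matters: without it, many concurrent agent requests that were throttled together will retry together and get throttled again.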
Pagination normalization
Instead of forcing the LLM to learn 50 different pagination strategies, the proxy normalizes all pagination into a standard limit and next_cursor format. When the agent calls a list tool, the proxy handles full traversal (up to a configurable limit) and returns results with a standardized opaque cursor. The proxy translates that generic cursor back into the vendor-specific format — whether that is an offset, a page number, or a Salesforce nextRecordsUrl. This matters because if you pass a raw vendor cursor back to the model, it may try to interpret, decode, or modify it, corrupting the pagination state entirely.
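One way to sketch the cursor translation, assuming the proxy serializes vendor pagination state into an opaque base64 token that only the proxy ever decodes:

```python
import base64
import json

def encode_cursor(vendor_state):
    """Wrap vendor-specific pagination state (an offset, a page number, a
    Salesforce nextRecordsUrl) in one opaque token. The model is told to
    echo it back verbatim; only the proxy decodes it."""
    return base64.urlsafe_b64encode(json.dumps(vendor_state).encode()).decode()

def decode_cursor(opaque_cursor):
    return json.loads(base64.urlsafe_b64decode(opaque_cursor))

def normalize_page(records, vendor_next_state=None):
    """Translate any vendor page into the standard results/next_cursor shape."""
    return {
        "results": records,
        "next_cursor": encode_cursor(vendor_next_state) if vendor_next_state else None,
    }

# Usage: a Salesforce-style page with a nextRecordsUrl (value illustrative).
page = normalize_page(["contact_1", "contact_2"],
                      {"nextRecordsUrl": "/services/data/v60.0/query/01g-2000"})
```

Whatever the vendor does, the agent only ever sees `results` and an opaque `next_cursor`.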
Error normalization
Every API returns errors differently. Salesforce returns an array of error objects. HubSpot returns a single message string. Jira returns errorMessages (an array) alongside errors (an object). The proxy layer should normalize all of these into a consistent shape so the agent can reason about failures without needing API-specific error handling logic.
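A minimal sketch of such a normalizer, using the three error formats just described (the output shape is a convention of this example, not a standard):

```python
def normalize_error(provider, payload):
    """Map each vendor's documented error format into one consistent shape."""
    if provider == "salesforce":       # array of error objects
        messages = [err.get("message", "") for err in payload]
    elif provider == "hubspot":        # single message string
        messages = [payload.get("message", "")]
    elif provider == "jira":           # errorMessages array plus errors object
        messages = payload.get("errorMessages", []) + list(payload.get("errors", {}).values())
    else:
        messages = [str(payload)]
    return {"provider": provider, "messages": messages}

# Usage: three different vendor payloads, one shape out.
jira_error = normalize_error("jira", {"errorMessages": ["Issue does not exist"],
                                      "errors": {"summary": "Field is required"}})
```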
Unified APIs vs. Proxy APIs for Agentic Workflows
When evaluating integration infrastructure for AI agents, engineering teams usually debate between Unified APIs and Proxy APIs. Understanding the difference is critical for agent performance.
A Unified API normalizes data across multiple providers into a single schema. A call to /crm/contacts returns the same field names whether the underlying provider is Salesforce, HubSpot, or Pipedrive. This is excellent for standard programmatic syncing — building a generic dashboard, for example.
A Proxy API passes through the native data shape of each provider, handling only authentication, pagination, and rate limiting. You get the raw fields, custom objects, and provider-specific metadata exactly as the vendor's API returns them.
For agentic workflows, Proxy APIs are often the better choice — and this is a nuanced point that catches many teams off guard.
| Consideration | Unified API | Proxy API |
|---|---|---|
| Data fidelity | Normalized — some fields dropped or renamed | Full native shape with all custom fields |
| LLM context | Less context for reasoning | Rich context, provider-specific nuance |
| Custom fields | Often excluded or require mapping | Included automatically |
| Predictable schema | Yes, consistent across providers | No, varies by provider |
| Best for | Programmatic sync, dashboards | Agent reasoning, custom workflows |
Unified APIs achieve normalization by stripping away context. They drop custom fields, ignore vendor-specific objects, and flatten complex relationships. But AI agents thrive on context. If a customer uses a custom churn_risk_score field in Salesforce, the AI agent needs to see it to reason effectively. A rigid unified schema hides that field entirely. When a Unified API strips out fields like Territory__c or Lead_Score__c to fit a common model, it is removing exactly the information that makes the agent's response useful to that specific customer.
LLMs are surprisingly good at handling varied data shapes. They do not need every provider to return identically structured JSON — they need rich context to reason about.
The ideal architecture gives you both options. Use Unified APIs for deterministic sync jobs where schema consistency matters. Use Proxy APIs for agent workflows where data completeness matters. For more on this architectural distinction, read our guide to Proxy APIs.
Truto exposes both layers. The Proxy API returns the exact native shape of each provider's response, with all the infrastructure plumbing (OAuth, pagination, rate limits) handled transparently. The Unified API layer sits on top for cases where you need a normalized schema. For AI agent use cases, the Proxy API endpoints are automatically exposed as tools with descriptions and JSON schemas — ready for an LLM to call without any agent-side normalization code.
What Good Tool Definitions Actually Look Like
The quality of your tool descriptions directly determines how reliably the LLM picks the right tool and passes correct arguments. This is prompt engineering for tools, and it is where most teams underinvest.
Here is a real-world example of a well-structured tool definition for listing CRM contacts:
```json
{
  "name": "list_all_salesforce_contacts",
  "description": "List all contacts from the connected Salesforce account. Returns name, email, phone, account association, and any custom fields. Use this when the user asks about contacts, people, or stakeholders in their CRM.",
  "query_schema": {
    "type": "object",
    "properties": {
      "limit": {
        "type": "string",
        "description": "The number of records to fetch"
      },
      "next_cursor": {
        "type": "string",
        "description": "The cursor to fetch the next set of records. Always send back exactly the cursor value you received without decoding or modifying it."
      }
    }
  }
}
```

Notice the explicit instruction about next_cursor. Without it, models will try to decode base64-encoded cursors, strip URL parameters, or otherwise mangle the value. Telling the model to pass the cursor back verbatim is not optional — it is a hard requirement for reliable pagination in agentic workflows.
Zero Integration-Specific Code
Maintaining tool schemas for hundreds of APIs is exhausting. APIs change, new fields are added, and endpoints are deprecated. If your integration infrastructure relies on hardcoded logic — if (provider === 'hubspot') { ... } — you will spend all your engineering cycles updating tool descriptions instead of improving your core product.
The most resilient architectures use a generic execution engine driven entirely by declarative configuration. In this model, the API specification (resources, methods, query parameters) is stored as data. The system dynamically reads this data and translates it into JSON Schema for the LLM.
If a customer requests a new endpoint or you notice the LLM is struggling to use a specific tool, you simply update the text description in the configuration data. The system instantly reflects the updated schema in the /tools endpoint without requiring a single line of code or a service redeployment.
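A sketch of this configuration-driven pattern, with an assumed data shape (not Truto's actual format):

```python
# Illustrative declarative config, stored as data rather than code.
TOOL_CONFIG = [
    {
        "name": "list_all_salesforce_contacts",
        "description": "List all contacts from the connected Salesforce account.",
        "params": {
            "limit": "The number of records to fetch",
            "next_cursor": "Send back exactly the cursor you received.",
        },
    },
]

def build_tool_schemas(config):
    """Translate the stored configuration into LLM-ready tool definitions.
    Editing a description in the data changes the /tools output with no
    code change or redeployment."""
    return [
        {
            "name": tool["name"],
            "description": tool["description"],
            "parameters": {
                "type": "object",
                "properties": {
                    name: {"type": "string", "description": desc}
                    for name, desc in tool["params"].items()
                },
            },
        }
        for tool in config
    ]
```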
This is how Truto handles Agent Toolsets. It allows product managers to tweak tool descriptions in real-time to improve agent success rates, completely bypassing the engineering deployment queue.
Enter MCP (Model Context Protocol)
As the industry standardizes how LLMs interact with external systems, the Model Context Protocol (MCP) has emerged as the leading open standard for tool calling. Originally developed by Anthropic, MCP defines how AI models discover and invoke external tools. Think of it as what OpenAPI/Swagger did for REST APIs, but purpose-built for LLM function calling.
Instead of every framework (LangChain, LlamaIndex, CrewAI) having its own tool registration format, MCP provides a single protocol. The model connects to an MCP server, calls tools/list to discover available tools, and calls tools/call to execute one. The protocol uses JSON-RPC 2.0 over HTTP, and an MCP server exposes:
- Tool discovery (tools/list): Returns the available tools with their names, descriptions, and parameter schemas
- Tool execution (tools/call): Accepts a tool name and arguments, executes the underlying operation, and returns the result
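For illustration, here is how those two JSON-RPC 2.0 messages can be constructed; the `list_contacts` tool name and its arguments are hypothetical, since actual tool names come from the server's own catalog:

```python
import json

def jsonrpc_request(req_id, method, params=None):
    """Build a JSON-RPC 2.0 request envelope as MCP uses on the wire."""
    message = {"jsonrpc": "2.0", "id": req_id, "method": method}
    if params is not None:
        message["params"] = params
    return json.dumps(message)

# Discovery: ask the MCP server which tools it exposes.
discover = jsonrpc_request(1, "tools/list")

# Execution: invoke one discovered tool by name.
invoke = jsonrpc_request(2, "tools/call",
                         {"name": "list_contacts", "arguments": {"limit": "10"}})
```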
For integration platforms, MCP eliminates an entire class of glue code. Instead of writing custom tool registration logic for each LLM framework, you expose an MCP server and any MCP-compatible client — Claude, ChatGPT, or a custom agent — can discover and call your integration tools automatically.
Here is how a robust MCP implementation works in practice:
- Dynamic Tool Generation: The system iterates over every defined resource and method for an integration (e.g., list_contacts, create_ticket). It parses the required query and body schemas and assembles them into MCP-compliant tool definitions.
- Tag-Based Filtering: You do not want to expose 500 API endpoints to an LLM at once — it clutters the context window and degrades reasoning. The system allows you to filter tools by tags (e.g., only exposing tools tagged "support" to your customer service agent).
- Method Restrictions: You can configure the MCP server to only allow read operations, ensuring the agent cannot accidentally delete records during a hallucination.
- Dual Authentication: The MCP server requires a secure token to access, but can also be configured to require standard API token authentication, ensuring only authenticated users within your multi-tenant environment can trigger the tools.
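Tag-based and method-based scoping can be sketched in a few lines; the tool-dict fields used here (`tags`, `method`) are an assumed shape for illustration:

```python
def filter_tools(tools, allowed_tags=None, read_only=False):
    """Scope what an agent sees: keep only tools matching an allowed tag,
    and optionally only read (GET) operations."""
    selected = []
    for tool in tools:
        if allowed_tags and not set(tool.get("tags", [])) & set(allowed_tags):
            continue
        if read_only and tool.get("method", "GET").upper() != "GET":
            continue
        selected.append(tool)
    return selected

# Usage: a customer-service agent gets read-only, support-tagged tools.
catalog = [
    {"name": "list_tickets", "tags": ["support"], "method": "GET"},
    {"name": "delete_ticket", "tags": ["support"], "method": "DELETE"},
    {"name": "list_deals", "tags": ["crm"], "method": "GET"},
]
exposed = filter_tools(catalog, allowed_tags=["support"], read_only=True)
```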
By leveraging MCP, you decouple your agent framework from your integration infrastructure entirely. You can swap Claude for OpenAI, or LangChain for LlamaIndex, and the integration layer remains perfectly intact.
Truto generates MCP servers directly from existing integration configurations. When you connect a customer's Salesforce account, every configured resource (contacts, deals, tasks, custom objects) is exposed as a callable MCP tool — with descriptions, query schemas, and body schemas that are editable through the UI. Changes take effect immediately, no redeployment needed.
For a deeper dive into this protocol, see our 2026 Guide to MCP.
The Trade-offs You Should Know About
It would be dishonest to pretend that any architecture eliminates all complexity. Here are the real trade-offs:
Proxy APIs give you full data fidelity, but variable schemas. Your agent code needs to handle the fact that Salesforce contacts and HubSpot contacts have different field names. This is usually fine for LLMs (they are good at interpreting varied JSON), but it means you cannot write deterministic data processing logic against a fixed schema.
MCP is a young standard. Adoption is growing fast, but the ecosystem of MCP-compatible clients is still maturing. If you are locked into a framework that does not support MCP yet, you will need framework-specific tool adapters as a bridge.
Third-party API instability is everyone's problem. No proxy layer can save you when a vendor ships a breaking API change without notice, deprecates an endpoint, or has an extended outage. Your infrastructure should include monitoring, alerting, and graceful degradation — but the underlying vendor dependency remains.
Token costs scale with tool definitions. Tool definitions become part of the context on every LLM call, consuming tokens and affecting cost and latency. If you expose 200 tools to a single agent, the token overhead for tool descriptions alone can be significant. This is why tag-based and method-based filtering matters — scope each agent to only the tools it actually needs.
The Strategic Takeaway
LLM function calling is the bridge between AI reasoning and real-world execution. But building that bridge requires acknowledging the painful realities of B2B software engineering. Vendor APIs are messy, rate limits are aggressive, and OAuth is unforgiving.
The teams that ship successfully are the ones that separate the AI reasoning layer from the integration plumbing. Your LLM should think about what data to fetch and what actions to take. A proxy layer should handle how to authenticate, paginate, retry, and normalize. Mixing these concerns is how you end up with agents that break at 2 AM when an OAuth token expires.
By deploying a Proxy API layer that normalizes the mechanics of API communication while preserving the raw, native shape of the data, you give your agents the context they need and the reliability your enterprise customers demand.
If you are evaluating how to build this infrastructure, our comparison of unified APIs for LLM tool calling covers the current landscape. If you want to skip building the integration layer entirely and go straight to exposing third-party APIs as agent tools, that is the problem Truto exists to solve.
Frequently Asked Questions
- What is the difference between LLM function calling and tool calling?
- They are the same thing. Function calling and tool calling both refer to the mechanism by which an LLM outputs structured JSON to invoke an external API or function, rather than executing it directly. The terms coexist for historical reasons — OpenAI originally used 'function calling' while Anthropic and others adopted 'tool calling.'
- Why do AI agents fail when connecting to production SaaS APIs?
- Production environments require multi-tenant OAuth management, rate limit handling, pagination normalization, and error mapping across different providers. LLMs cannot handle token refreshes, 429 retries, or cursor-based pagination on their own, causing agents to break or hallucinate when these issues arise.
- Should I use Unified APIs or Proxy APIs for AI agent workflows?
- Proxy APIs are generally better for AI agents because they return the full native data shape including custom fields, giving the LLM richer context for reasoning. Unified APIs are better for deterministic programmatic integrations where schema consistency matters more than data completeness.
- What is MCP (Model Context Protocol) for LLM integrations?
- MCP is an open standard for how AI models discover and invoke external tools via JSON-RPC 2.0. An MCP server exposes tools/list for discovery and tools/call for execution, allowing any compatible LLM client to use integration tools without framework-specific glue code.
- How does a proxy layer improve LLM function calling reliability?
- A proxy layer sits between the LLM and third-party APIs, handling OAuth token refreshes, rate limit retries with exponential backoff, pagination normalization, and error standardization. This ensures the LLM only receives clean data and never has to reason about infrastructure failures.