Working on a voice-based travel booking system taught us a lesson we keep returning to: the hardest part of building real AI agents isn’t the AI.
It’s the architecture.
Phone-based travel enquiries look like a solved problem. You talk, the system listens, requirements get captured, proposals appear. It’s straightforward enough as a concept, but in practice, a single destination search can return 25,000 room variants and 50 MB of raw provider data. Factor in consultants in meetings who can’t type, callers who’ve never spoken to this company before and provider APIs that go down at the worst possible moment.
A chatbot can’t solve all of this. A system can.
This article is about the architectural pattern that made our system work: separating what the large language model (LLM) is responsible for from what code is responsible for. We’ll use a voice-to-booking PoC as the concrete reference point – but the principle applies well beyond travel, and well beyond voice.
What breaks when you chain too many LLM calls
Most agent frameworks today run on the same basic loop: the model picks a tool, calls it, observes the result, picks another tool. For simple tasks, this is fine.
For anything non-trivial – many steps, large payloads, flaky integrations, tight latency requirements – this loop becomes a liability.
Every additional LLM turn is another opportunity to misinterpret, hallucinate, or pick the wrong next action. At a 90% per-step success rate (which is generous), a 10-step chain compounds to roughly 35% end-to-end reliability (0.9¹⁰ ≈ 0.35). And errors in practice aren’t independent – they cascade.
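The compounding is worth seeing directly. A minimal sketch, assuming independent steps and the same generous 90% per-step success rate:

```typescript
// End-to-end reliability of an n-step chain where each step
// independently succeeds with probability p.
function chainReliability(p: number, steps: number): number {
  return Math.pow(p, steps);
}

// 10 steps at 90% per-step success: roughly 35% end-to-end.
console.log(chainReliability(0.9, 10).toFixed(3)); // 0.349
```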
Voice makes this concrete and painful. A phone call has maybe two seconds of tolerance for silence before the caller starts wondering if the line died. There’s no room for “model thinks… calls provider A… processes 50 MB response… calls provider B… synthesizes options…” A voice agent reasoning through a tool-call chain in real time will feel broken, because it is.
The fix is conceptually simple, even if the implementation requires discipline: stop using the LLM as the orchestration engine for long workflows. Use it as the planning layer. Let code handle the execution.
A mental model that helps
The CEO / management / execution framing is useful here.
A CEO makes high-leverage strategic decisions – expensive ones that only they can make. Management translates strategy into concrete work packages. Execution teams carry out the work at scale, predictably, at lower per-unit cost.
Tool-calling agents often force your CEO to do execution work one decision at a time. That’s both expensive (every loop iteration burns tokens) and unnecessarily slow.
The better structure: the LLM understands intent, picks an approach and produces a plan or code that implements the plan. The runtime executes deterministically from there. Parallel API calls, filtering, ranking, retries, persistence – all at normal compute cost, not LLM-token cost.
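To make the split concrete, here is a hypothetical sketch of what "the LLM plans once, the runtime executes deterministically" can look like. The step vocabulary (`search`, `filter`, `rank`) is a made-up example, not the production plan format:

```typescript
// A declarative plan the LLM emits in a single turn. The runtime below
// executes it step by step at normal compute cost, with no further model calls.
type Step =
  | { op: "search"; provider: string; locationId: string }
  | { op: "filter"; maxPrice: number }
  | { op: "rank"; topN: number };

function runPlan(plan: Step[], offers: { price: number }[]): { price: number }[] {
  let current = offers;
  for (const step of plan) {
    if (step.op === "filter") {
      current = current.filter(o => o.price <= step.maxPrice);
    }
    if (step.op === "rank") {
      current = [...current].sort((a, b) => a.price - b.price).slice(0, step.topN);
    }
    // "search" steps would call provider tools; omitted in this sketch.
  }
  return current;
}
```

The point is that retries, parallelism and ordering live in the runner, where they are deterministic and testable, while the model's only job was producing the plan.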
This isn’t a radically new idea. It’s closer to how you’d design a normal software system, with AI added where reasoning is required.
Why travel forces you to get this right
Travel search is an unusually revealing stress test because the data volumes are real and large.
For a single destination – Dubai is a good example – you can easily end up with 25,000 room variants after multiplying hotels by providers, board types, cancellation conditions and pricing tiers. Raw response payloads from provider APIs can reach 50 MB for a single query. This isn’t an edge case; it’s a normal request.
Pushing that volume through an LLM – even with chunking strategies – is slow, expensive and produces unreliable results. Large-context reasoning over noisy structured data is exactly where models struggle. You get confident-sounding output that doesn’t accurately reflect what was actually in the data.
The right approach is to use code for the heavy lifting: stream and filter provider results as they arrive, deduplicate near-identical offers, apply business rules deterministically (budget bands, board type preferences, preferred suppliers, cancellation policy tiers), rank candidates with a scoring function you can actually inspect and adjust. Only then should the LLM see anything, which should be a compact Top-N shortlist plus key constraints, not 50 MB of noise.
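A deterministic shortlisting pass along those lines might look like the following sketch. The types, scoring weights and field names are illustrative assumptions, not the production schema; the point is that every rule is inspectable code:

```typescript
// Hypothetical offer shape after provider responses are parsed.
interface RoomOffer {
  hotelId: string;
  provider: string;
  board: "RO" | "BB" | "HB" | "AI";
  pricePerNight: number;
  freeCancellation: boolean;
}

interface Constraints {
  maxPricePerNight: number;
  preferredBoards: Set<string>;
}

function shortlist(offers: RoomOffer[], c: Constraints, topN = 5): RoomOffer[] {
  // 1. Apply hard business rules deterministically (budget band).
  const inBudget = offers.filter(o => o.pricePerNight <= c.maxPricePerNight);

  // 2. Deduplicate near-identical offers (same hotel + board), keeping the cheapest.
  const best = new Map<string, RoomOffer>();
  for (const o of inBudget) {
    const key = `${o.hotelId}:${o.board}`;
    const prev = best.get(key);
    if (!prev || o.pricePerNight < prev.pricePerNight) best.set(key, o);
  }

  // 3. Rank with a scoring function you can read and adjust (weights are assumptions).
  const score = (o: RoomOffer) =>
    (c.preferredBoards.has(o.board) ? 10 : 0) +
    (o.freeCancellation ? 5 : 0) -
    o.pricePerNight / 100;

  return [...best.values()].sort((a, b) => score(b) - score(a)).slice(0, topN);
}
```

Only the output of a function like this – the Top-N shortlist plus the constraints that shaped it – ever reaches the model's context.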
This reduces latency, cuts costs and produces more auditable outputs. That last point matters when a consultant is going to put their name on a proposal.
The two-agent split
For this PoC, we built two agents with fundamentally different responsibilities and SLAs.
Voice Agent (real-time). This agent runs during the call. Its job is to keep the conversation flowing: identify the caller, collect missing information, handle the scenario gracefully (consultant calling from a client meeting, returning customer, first-time caller). It has a deliberately narrow tool set: it can trigger back-office processing, and it has one lightweight lookup tool for cases where a quick answer genuinely helps the conversation. It does not attempt to generate proposals on the call. When it has what it needs, it schedules the work and wraps up.
Execution Agent (background). This is where the actual work happens. It creates a plan, resolves location strings to the IDs that provider APIs actually accept (using a combination of vector search and a domain knowledge base), then generates TypeScript code that interacts with provider tools via an MCP server. It makes multiple calls in parallel, filters results, ranks candidates and runs that code in a controlled sandbox. Importantly, it can take its time and it can retry if something goes wrong.
The call ends before any of this processing begins. The caller gets confirmation that their enquiry is being prepared. The back-office agent does its work without any real-time SLA pressure from the conversation layer.
The technical mechanism that makes this decoupling work is quite simple: the Voice Agent fires an HTTP request to the Execution Agent and immediately receives 202 Accepted. That single response code is what prevents provider latency from ever touching the call experience, and it’s what makes the two layers independently scalable.
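In sketch form, the handshake is just an enqueue followed by an immediate acknowledgement. The types and queue here are illustrative stand-ins for the real transport:

```typescript
interface Enquiry {
  enquiryId: string;
  payload: unknown;
}

// Stand-in for a durable job queue consumed by the Execution Agent.
const jobQueue: { enquiry: Enquiry; enqueuedAt: number }[] = [];

// The handler the Voice Agent calls: schedule the work, acknowledge at once.
// 202 (Accepted) rather than 200 signals "work scheduled, not done" –
// no provider latency ever sits between request and response.
function acceptEnquiry(enquiry: Enquiry): { status: number; body: object } {
  jobQueue.push({ enquiry, enqueuedAt: Date.now() });
  return { status: 202, body: { status: "accepted", id: enquiry.enquiryId } };
}
```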
Code execution via MCP
The Execution Agent generating and running code – rather than calling tools in a loop – is the central architectural choice. The mechanism that makes it practical, safe and cost-effective is MCP: the Model Context Protocol.
MCP is an open standard, originally developed by Anthropic, that defines how AI models interact with tools and external systems. In most MCP setups, the model receives all available tool definitions upfront, including schemas, descriptions and parameters. This works fine at small scale. But when you have dozens of provider integrations and tool responses that return large datasets, the context window fills fast and the per-call cost grows accordingly.
The code execution pattern sidesteps this structurally. Instead of calling individual tools one by one and piping intermediate results back through the model, the agent writes code that interacts with the MCP server directly. The model plans once; the code executes many operations in a single pass; only the filtered, summarized result comes back to context. Anthropic’s engineering team documented a similar approach and measured a 98.7% reduction in token usage compared to equivalent tool-calling workflows. In our PoC, the AI layer (combining both agents) came out at roughly $0.13 per end-to-end execution. That figure isn’t accidental.
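A sketch of what generated code in this pattern might look like. The `McpClient` interface and tool names are hypothetical stand-ins for whatever the MCP server actually exposes; they are not the MCP SDK's real API:

```typescript
// Minimal tool-surface abstraction: the generated code can only reach
// what this client exposes, never arbitrary HTTP.
interface McpClient {
  callTool(name: string, args: Record<string, unknown>): Promise<unknown>;
}

async function searchAndSummarize(mcp: McpClient, locationId: string) {
  // Many operations in a single pass: parallel provider calls...
  const providers = ["providerA", "providerB", "providerC"];
  const results = await Promise.allSettled(
    providers.map(p => mcp.callTool(`${p}.search`, { locationId }))
  );

  // ...filtered and compacted in code, not in the model's context.
  const offers = results
    .filter((r): r is PromiseFulfilledResult<unknown> => r.status === "fulfilled")
    .flatMap(r => r.value as object[]);

  // Only a compact summary ever returns to the LLM.
  return { total: offers.length, sample: offers.slice(0, 5) };
}
```

The model plans and writes this once; the intermediate 50 MB never round-trips through a context window.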
The MCP boundary also becomes the enforcement point for the sandbox. The Execution Agent doesn’t run in an environment with access to arbitrary infrastructure; it only has access to what its MCP server exposes: specific provider search tools, filtering utilities and output templates with defined schemas. Generating code that calls an MCP-provided tool is categorically different from generating code that makes raw HTTP requests. The first is bounded; the second is not.
This distinction matters beyond security. Every tool invocation through the MCP server produces a structured log entry: what was called, what parameters went in, what came back. When something behaves unexpectedly, debugging means reading a trace, not reconstructing what the model was reasoning about at step six of a tool-calling chain. That’s a meaningful operational difference in a system that will run thousands of times in production.
How much does it cost?
Cost questions come up early in these conversations, so: across the three caller scenarios (consultant, returning customer, new caller), the average per-call cost in the PoC was approximately $0.48. Voice infrastructure accounted for most of it – around $0.10 per minute. The AI layer, Voice Agent plus Execution Agent, ran to roughly $0.13 per complete end-to-end flow.
Whether that’s cheap depends on what it replaces. For a travel consultancy where a missed call or a slow follow-up has measurable conversion impact, the economics aren’t complicated.
The more durable point is that the cost is predictable and optimizable. When the LLM’s role is bounded to planning steps, token usage is bounded too. The heavy work runs at compute cost.
What a production version actually needs
While our PoC validated the pattern, a production system would require more engineering:
- End-to-end distributed tracing (e.g., OpenTelemetry) across voice, execution and sandbox layers – not just per-component logs
- A durable job queue with idempotency guarantees, so retry logic doesn’t create duplicate records downstream
- Versioned prompts and tool contracts, decoupled so you can update one without breaking the other
- Explicit PII handling: what gets retained, what gets redacted, what appears in logs and for how long
- Per-call cost monitoring with alerting
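The idempotency point from the list above is worth a sketch: enqueue keyed on a deterministic idempotency key, so a retried request returns the existing job instead of creating a duplicate. The key derivation and in-memory store are illustrative assumptions; production would use a durable store:

```typescript
import { createHash } from "node:crypto";

// Stand-in for a durable idempotency store: key -> jobId.
const seen = new Map<string, string>();

function enqueueOnce(enquiry: { callerId: string; requestedAt: string }): string {
  // Deterministic key: the same logical request always hashes the same way.
  const key = createHash("sha256")
    .update(`${enquiry.callerId}:${enquiry.requestedAt}`)
    .digest("hex");

  const existing = seen.get(key);
  if (existing) return existing; // retry: same job back, nothing enqueued twice

  const jobId = `job-${seen.size + 1}`;
  seen.set(key, jobId);
  return jobId;
}
```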
None of this is unusual. It’s standard production engineering applied to a system that happens to include an LLM in the planning layer.
The pattern beyond travel
Voice was the example we looked at, but the principle can be applied elsewhere.
Whenever you’re building a workflow where tasks have many steps, data volumes are large, integrations are occasionally flaky, governance matters and cost needs to be predictable – the same split applies. LLMs for planning and adaptation. Code for execution. A runtime that makes execution auditable and enforceable.
Most agent frameworks today are optimized in the opposite direction: adding more tools to a loop is easy, almost frictionless. The pattern described here requires more design upfront. But it’s the pattern that consistently performs when a demo becomes a product and a product meets real operational conditions.
The agents that survive production are the ones that treat the LLM as an expensive, high-leverage component to be used deliberately, not as a general-purpose execution engine for every step in a workflow.
Want to learn more about how to design, develop and deploy agentic AI in your operations? Get in touch with our experts.
FAQ
What are LLM limitations?
Reliability degrades with chain length: at 90% per-step accuracy, a 10-step tool-calling chain delivers roughly 35% end-to-end reliability. LLMs also struggle with large, structured data – feeding 50 MB of raw results into a context window often produces inaccurate outputs. Every model call has latency and token cost, and outputs are non-deterministic, which makes LLMs a poor fit for workflows where auditability matters.
What is the model context protocol (MCP)?
MCP is an open standard developed by Anthropic that defines how AI models communicate with external tools and data sources. For code-executing agents, MCP turns a sandbox into a controlled API surface: the model interacts only with what the MCP server explicitly exposes. Every tool invocation is logged, bounded and inspectable.
What are some best practices for executing code with MCP in an Agentic system?
Keep the MCP server surface minimal – expose only the tools the agent genuinely needs. Treat generated code as untrusted and run it in a sandbox with resource limits and structured logging. Let the model generate a complete execution plan in one turn rather than iterating through tool calls; one well-structured generation beats five iterative calls. Finally, version prompts and MCP tool contracts independently so either can evolve without silently breaking the other.
What’s the difference between tool calling and code generation in an agent?
With tool calling, the model decides which tool to invoke next, observes the result, then decides again; one LLM turn per action. With code generation, the model writes a function that performs many actions, and the runtime executes it in one pass. For short, simple tasks the difference is minor. For workflows with many steps or large data volumes, code generation compresses LLM turns dramatically, moves control flow into deterministic code, and makes the whole execution inspectable as a software artifact rather than a sequence of model decisions.
When does a multi-agent architecture make sense vs a single agent?
A single agent is usually the right starting point as it means less coordination and simpler debugging. Multi-agent makes sense when two parts of a system have fundamentally different SLAs or resource profiles that can’t coexist in one loop. In the voice-to-booking case, the real-time conversation layer and the background processing layer have nothing in common: one needs sub-second responsiveness, the other needs to handle 50 MB payloads and parallel API calls. Forcing them into a single agent would mean optimizing for both at once, which in practice means doing neither well.
How do you handle failures in code-executing agents?
The sandbox treats partial results as valid outputs rather than failures. If three out of four provider calls return data, the agent works with what it has and notes what’s missing. Retries are handled in code, not by re-prompting the model; this keeps failure handling deterministic and cheap. For persistent failures, the job queue records the last successful state so the workflow can resume from there rather than restart from scratch. The key shift is that failures become software problems (logs, traces, known error codes) rather than opaque model behaviour to reconstruct after the fact.
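In code, that failure policy can be as simple as bounded retries per call plus a partial-results gather. This is a sketch under the assumptions above, not the production implementation:

```typescript
// Bounded, deterministic retry: handled in code, never by re-prompting.
async function withRetry<T>(fn: () => Promise<T>, attempts = 3): Promise<T> {
  let lastErr: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (e) {
      lastErr = e;
    }
  }
  throw lastErr;
}

// Partial results are a valid outcome: keep what succeeded, count what's missing.
async function gatherPartial<T>(calls: Array<() => Promise<T>>) {
  const settled = await Promise.allSettled(calls.map(c => withRetry(c)));
  return {
    results: settled
      .filter(s => s.status === "fulfilled")
      .map(s => (s as PromiseFulfilledResult<T>).value),
    missing: settled.filter(s => s.status === "rejected").length,
  };
}
```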
Does this pattern only apply to voice use cases?
No. Voice was the stress test, not the prerequisite. The same split between a planning layer and a code-executing back office applies to any workflow where tasks have many steps, data volumes are large, integrations are occasionally unreliable, or cost needs to be predictable. Document processing pipelines, complex search and ranking systems, multi-step data enrichment: the pattern fits wherever you’d otherwise end up with a long tool-calling chain and wonder why reliability is lower than expected.
About the author
Krystian Mikrut
A .NET engineer with over twelve years of experience, Krystian focuses on building reliable, scalable and maintainable systems. In his current role he combines hands-on development with technical leadership to help teams deliver solutions based on microservices, event-driven communication and cloud-native architecture. A background that includes leading development teams, shaping system architecture and supporting engineers through mentoring and knowledge sharing enables him to work on systems that handle millions of daily events while requiring high availability, careful design and smooth, zero-downtime deployments. A strong believer in simple, clean solutions, continuous improvement and steady, consistent progress, Krystian views himself as a “regular team member” who just happens to enjoy taking responsibility when needed.
