How to build an AI agent — practical guide from a team that ships them

The internet has roughly a hundred guides on how to build an AI agent that get you to a working demo. Almost none of them get you to a production agent that works on day 60.

This is the version we'd give a senior engineer joining our team, focused on the parts that actually matter when real users start using the thing.

Step 0: Pick the right use case

Before any code, decide if this should be an agent at all.

Good fit for an agent:

Multi-step task where the path varies.
Touches multiple systems (CRM, helpdesk, database, calendar).
"Mostly right" is fine; humans handle edge cases.
Volume justifies the build cost.

Bad fit:

Deterministic transformation (input → known steps → output). That's workflow automation. Build that instead.
High-stakes single action with no acceptable failure mode.
Genuinely novel work where there's no pattern to learn.
Low volume (say, 5 instances a month). The build cost won't pay back.

If your use case fails this test, building an agent is the wrong move. Stop here.

Step 1: Design the system, not the prompt

Most agent failures come from people optimizing prompts before designing the system. The system has several pieces; the prompt is one of them.

Design these before writing code:

The goal and success criteria

What is the agent supposed to accomplish? What does "success" mean concretely? How will you measure it? Write the answers down, share them, and get agreement.

The tools

What can the agent do? Each tool is a function with a name, a description, an input schema, and an output schema. The tool descriptions are part of how the agent understands its world, so write them carefully.

Common starting tools:

Read from the CRM (by ID, by query)
Update CRM record (with strict schema)
Search knowledge base (RAG)
Send email (with approval gate for production)
Lookup customer order
Issue refund (with approval gate)
Escalate to human (with reason)

The escalate-to-human tool is mandatory. Build it first.
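
As a minimal sketch of a carefully written tool definition, here is an escalate-to-human tool in the name / description / input-schema shape that the Anthropic Messages API accepts for tools. The tool name, the reason values, and the `validateInput` helper are illustrative assumptions, not part of any SDK.

```typescript
// Shape of a tool definition: name, description, and a JSON Schema for the input.
interface ToolDefinition {
  name: string;
  description: string;
  input_schema: {
    type: "object";
    properties: Record<string, { type: string; description: string; enum?: string[] }>;
    required: string[];
  };
}

// Hypothetical escalation tool. The description tells the model when to call it.
const escalateToHuman: ToolDefinition = {
  name: "escalate_to_human",
  description:
    "Hand the conversation to a human agent. Use when the user asks for a person, " +
    "when a request is outside policy, or after two failed attempts. " +
    "Example: the user threatens legal action over a late refund.",
  input_schema: {
    type: "object",
    properties: {
      reason: {
        type: "string",
        description: "Why the agent is escalating.",
        enum: ["user_request", "low_confidence", "destructive_action", "repeated_failure", "sensitive_topic"],
      },
      summary: { type: "string", description: "What has happened so far, for the human." },
    },
    required: ["reason", "summary"],
  },
};

// Minimal input check (illustrative): required fields present, enum values respected.
function validateInput(tool: ToolDefinition, input: Record<string, unknown>): string[] {
  const errors: string[] = [];
  for (const field of tool.input_schema.required) {
    if (input[field] === undefined) errors.push(`missing field: ${field}`);
  }
  for (const [field, spec] of Object.entries(tool.input_schema.properties)) {
    const value = input[field];
    if (spec.enum && value !== undefined && !spec.enum.includes(String(value))) {
      errors.push(`invalid value for ${field}: ${String(value)}`);
    }
  }
  return errors;
}
```

Validating tool input before executing it gives you a cheap place to catch a model that invents a reason or skips the summary.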

The memory

What does the agent remember within a session? Across sessions? About users? About facts in the world?

Most agents need at least:

Conversation history within session (passed in context).
User profile retrievable by ID.
Recent interactions retrievable by user ID.

Vector stores (pgvector, Pinecone) are useful for semantic retrieval. SQL is fine for structured lookup. Use both.
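
The structured side of that memory can be sketched as follows. This is an in-memory stand-in for a SQL table, under the assumption that profiles and interactions are keyed by user ID; the type names, field names, and `buildMemoryContext` are hypothetical, and semantic (vector) retrieval is left out for brevity.

```typescript
// Illustrative types; real field names depend on your CRM and data model.
interface UserProfile { id: string; name: string; plan: string }
interface Interaction { userId: string; at: number; summary: string }

// In-memory stand-in for the SQL side of memory (structured lookup by ID).
class MemoryStore {
  private profiles = new Map<string, UserProfile>();
  private interactions: Interaction[] = [];

  saveProfile(profile: UserProfile): void {
    this.profiles.set(profile.id, profile);
  }

  logInteraction(interaction: Interaction): void {
    this.interactions.push(interaction);
  }

  getProfile(userId: string): UserProfile | undefined {
    return this.profiles.get(userId);
  }

  // Most recent interactions first, capped at `limit`.
  recentInteractions(userId: string, limit: number): Interaction[] {
    return this.interactions
      .filter((i) => i.userId === userId)
      .sort((a, b) => b.at - a.at)
      .slice(0, limit);
  }
}

// Assemble the memory section that gets prepended to the model context.
function buildMemoryContext(store: MemoryStore, userId: string, limit = 3): string {
  const profile = store.getProfile(userId);
  const recent = store.recentInteractions(userId, limit);
  const lines = [
    profile ? `User: ${profile.name} (plan: ${profile.plan})` : "User: unknown",
    ...recent.map((i) => `Previous interaction: ${i.summary}`),
  ];
  return lines.join("\n");
}
```

Capping the number of retrieved interactions keeps the context small; raise the limit only if evals show the agent is missing relevant history.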

The escalation rules

When does the agent stop acting autonomously and hand off to a human?

Below a confidence threshold (you set the threshold).
When a destructive action is requested (refund > $X, deletion, mass action).
When the user explicitly asks for a human.
When the agent has tried twice and failed.
When sensitive topics come up (legal threats, accusations, safety).

Each escalation should hand the human the full context. The failure mode to avoid is the customer typing "I want a human" and the human starting from zero.
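
One way to keep these rules out of the prompt and in code is a small policy check that runs on every turn. The sketch below assumes you already track a confidence estimate, failed attempts, and a sensitive-topic flag; all type and field names are hypothetical, and the order of the checks is a design choice rather than a requirement.

```typescript
type EscalationReason =
  | "low_confidence"
  | "destructive_action"
  | "user_request"
  | "repeated_failure"
  | "sensitive_topic";

// Snapshot of one agent turn. All fields are illustrative assumptions.
interface TurnState {
  confidence: number;        // 0..1, however you choose to estimate it
  refundAmount?: number;     // set when the agent wants to issue a refund
  userAskedForHuman: boolean;
  failedAttempts: number;
  sensitiveTopic: boolean;
  transcript: string[];
}

interface EscalationPolicy { minConfidence: number; maxRefund: number; maxFailures: number }

// Returns the first matching reason, or null if the agent may continue.
function checkEscalation(state: TurnState, policy: EscalationPolicy): EscalationReason | null {
  if (state.userAskedForHuman) return "user_request";
  if (state.sensitiveTopic) return "sensitive_topic";
  if (state.refundAmount !== undefined && state.refundAmount > policy.maxRefund) return "destructive_action";
  if (state.failedAttempts >= policy.maxFailures) return "repeated_failure";
  if (state.confidence < policy.minConfidence) return "low_confidence";
  return null;
}

// Package the full context so the human does not start from zero.
function buildHandoff(state: TurnState, reason: EscalationReason) {
  return { reason, transcript: [...state.transcript], failedAttempts: state.failedAttempts };
}
```

Because the policy is plain data, the thresholds can be tuned per deployment without touching the prompt.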

The evals

What test cases will tell you the agent works? Build the eval suite before optimizing the prompts. Without it, you're guessing.

A starter eval suite for a customer support agent:

20 happy-path cases with expected actions and tone.
20 edge cases the agent should escalate.
20 sensitive cases (frustrated users, complex policies) the agent should handle carefully.
10 adversarial cases (prompt injection attempts, jailbreaks).

Run the eval suite after every prompt or model change. If quality drops, you'll see it immediately.
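
A starter eval runner can be very small. The sketch below assumes each case has a single expected action and reduces the agent to a synchronous "input in, action out" function for scoring; a real agent call would be async, and tone or safety checks would need their own graders. All names are hypothetical.

```typescript
type Category = "happy_path" | "edge_case" | "sensitive" | "adversarial";

interface EvalCase { id: string; category: Category; input: string; expectedAction: string }

// The agent under test, reduced to "input in, chosen action out" for scoring.
type AgentFn = (input: string) => string;

interface EvalReport {
  passRate: number;
  byCategory: Record<string, { passed: number; total: number }>;
  failures: string[]; // IDs of failing cases, for the daily review
}

function runEvals(cases: EvalCase[], agent: AgentFn): EvalReport {
  const byCategory: Record<string, { passed: number; total: number }> = {};
  const failures: string[] = [];
  let passed = 0;
  for (const c of cases) {
    const bucket = byCategory[c.category] ?? { passed: 0, total: 0 };
    byCategory[c.category] = bucket;
    bucket.total += 1;
    if (agent(c.input) === c.expectedAction) {
      bucket.passed += 1;
      passed += 1;
    } else {
      failures.push(c.id);
    }
  }
  return { passRate: cases.length ? passed / cases.length : 0, byCategory, failures };
}
```

Reporting per category matters: an overall pass rate can stay flat while the adversarial cases quietly regress.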

Step 2: Pick the model

Default: Anthropic Claude (we use Claude Sonnet for reasoning, Haiku for routing/cheap steps).

Alternatives:

OpenAI GPT-4o when you need their specific strengths (function calling shape, multimodal).
Open models (Llama, Mistral) for cost-sensitive or compliance-sensitive deployments.
Multi-model: route by task. Haiku for cheap classification, Sonnet for the hard cases, Opus for the few cases that need it.

Model choice is per task, not per project. Multi-model architectures are common in production agents.
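
Per-task routing can start as a plain function. This is a minimal sketch under the assumption that tasks are already labeled by kind and stakes; the task kinds and the routing rules are hypothetical, and the tiers are names only, to be mapped to pinned model IDs in your own config.

```typescript
type ModelTier = "haiku" | "sonnet" | "opus";

// Illustrative task description; your own labels will differ.
interface Task {
  kind: "classify" | "respond" | "plan";
  highStakes: boolean;
}

// Cheapest tier that is likely to handle the task, escalating only when needed.
function pickModelTier(task: Task): ModelTier {
  if (task.kind === "classify") return "haiku";          // routing and cheap steps
  if (task.kind === "plan" && task.highStakes) return "opus"; // the few cases that need it
  return "sonnet";                                        // default for reasoning
}
```

Keeping the router separate from the agent loop means a routing change can be evaluated on its own, with the same eval suite.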

Step 3: Build the MVP

Build the simplest agent that handles the happy path against real data:

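import Anthropic from "@anthropic-ai/sdk";
import { tools } from "./tools";
// SYSTEM_PROMPT, AgentContext, and executeTool are assumed to be defined elsewhere in your project.

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment
const MAX_TURNS = 10; // hard cap so a confused agent can't loop forever

async function runAgent(userInput: string, context: AgentContext) {
  const messages: Anthropic.MessageParam[] = [{ role: "user", content: userInput }];

  for (let turn = 0; turn < MAX_TURNS; turn++) {
    const response = await client.messages.create({
      model: "claude-sonnet-4-6",
      max_tokens: 4096,
      tools,
      messages,
      system: SYSTEM_PROMPT,
    });

    // Anything other than a tool request (end_turn, max_tokens, ...) ends the loop.
    if (response.stop_reason !== "tool_use") {
      return response;
    }

    // One response can contain several tool_use blocks; each needs a matching tool_result.
    const toolResults: Anthropic.ToolResultBlockParam[] = [];
    for (const block of response.content) {
      if (block.type !== "tool_use") continue;
      // executeTool is assumed to return a string (or content blocks) the API accepts.
      const result = await executeTool(block.name, block.input, context);
      toolResults.push({ type: "tool_result", tool_use_id: block.id, content: result });
    }
    messages.push({ role: "assistant", content: response.content });
    messages.push({ role: "user", content: toolResults });
  }

  throw new Error(`Agent exceeded ${MAX_TURNS} turns without finishing`);
}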

That's the loop. Real production code is more elaborate (parallel tool calls, structured outputs, retries, observability) but the shape is the same.

Use MCP (Model Context Protocol) for tool definitions if you can. It makes your tools portable across the models and frameworks that support it.

Step 4: Wire up observability before anything else

Every agent run should produce a trace:

The user input
The system prompt
Every tool call (name, input, output, latency, cost)
Every model call (model, tokens, cost)
The final response
The total time and total cost

Tools: LangSmith, Helicone, Langfuse, or your own logging. Without observability, debugging is impossible.
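
If you start with your own logging, the trace can be a small in-process recorder. The sketch below covers the fields listed above, under the assumption that the caller supplies the cost of each step. The `Trace` class and its method names are hypothetical, and steps are synchronous here for brevity; a real recorder would wrap async calls and also record errors.

```typescript
interface Span {
  kind: "model" | "tool";
  name: string;       // model ID or tool name
  input: unknown;
  output: unknown;
  latencyMs: number;
  costUsd: number;
}

class Trace {
  readonly userInput: string;
  readonly systemPrompt: string;
  readonly spans: Span[] = [];

  constructor(userInput: string, systemPrompt: string) {
    this.userInput = userInput;
    this.systemPrompt = systemPrompt;
  }

  // Time one step and record it. `costUsd` is supplied by the caller.
  record<T>(kind: Span["kind"], name: string, input: unknown, costUsd: number, fn: () => T): T {
    const start = Date.now();
    const output = fn();
    this.spans.push({ kind, name, input, output, costUsd, latencyMs: Date.now() - start });
    return output;
  }

  // Totals for the whole run.
  summary(): { steps: number; totalCostUsd: number; totalLatencyMs: number } {
    return {
      steps: this.spans.length,
      totalCostUsd: this.spans.reduce((sum, s) => sum + s.costUsd, 0),
      totalLatencyMs: this.spans.reduce((sum, s) => sum + s.latencyMs, 0),
    };
  }
}
```

Serializing one `Trace` per run to your log store is enough to answer "what did the agent see and do" for any user complaint.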

We've seen teams skip this in the name of speed. They always regret it within four weeks.

Step 5: Run evals; tune

Run your eval suite. Identify failures. Tune prompts, tool descriptions, or the system design (sometimes the prompt isn't the problem — the architecture is).

Common patterns we see:

Tool descriptions too vague. The agent picks the wrong tool. Fix: write clearer descriptions with concrete examples.
Missing context in memory. The agent doesn't remember a relevant prior fact. Fix: expand what gets retrieved before each call.
Hallucinated outputs. The agent makes up customer names or details. Fix: structured outputs with required fields validated against real data.
Sycophancy. The agent agrees with the user too readily. Fix: system prompt explicitly empowers the agent to disagree, escalate, or refuse.
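
For the hallucination fix, "validated against real data" can be as simple as checking every field of the structured output against the system of record before acting on it. A minimal sketch, assuming a refund draft as the structured output and an order lookup keyed by order ID; all names are hypothetical.

```typescript
// Structured output the agent is asked to produce (illustrative fields).
interface RefundDraft { customerId: string; orderId: string; amount: number }

// What the system of record knows about an order.
interface OrderRecord { orderId: string; customerId: string; total: number }

// Check every field the model produced against real data; return all problems found.
function validateDraft(draft: RefundDraft, orders: Map<string, OrderRecord>): string[] {
  const errors: string[] = [];
  const order = orders.get(draft.orderId);
  if (!order) {
    errors.push(`unknown order: ${draft.orderId}`);
    return errors; // nothing else can be checked without the order
  }
  if (order.customerId !== draft.customerId) {
    errors.push("order does not belong to this customer");
  }
  if (!(draft.amount > 0 && draft.amount <= order.total)) {
    errors.push("refund amount out of range");
  }
  return errors;
}
```

A non-empty error list is a natural trigger for a retry with the errors fed back to the model, or for escalation.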

Iterate. Each pass should improve eval scores measurably.

Step 6: Pilot with real users

When evals look solid, pilot with real users under controlled conditions. Initial setup:

Small user group (10–50 people).
All agent actions logged.
All escalations reviewed by a human within hours.
Daily review of failure cases.

This is where you find the edge cases you didn't think to put in the eval suite. Your eval suite grows with each one.

Plan to spend 2–4 weeks in pilot before broader rollout. Teams that skip this lose more time fighting production fires.

Step 7: Production deployment

Production setup needs:

Authentication and authorization. Who's allowed to ask the agent? What can the agent access on their behalf?
Rate limiting. Per user, per organization.
Budget controls. Daily / monthly model cost caps. Alerts on anomalies.
Monitoring. Latency, error rate, escalation rate, user satisfaction (if measurable).
Kill switch. A way to disable the agent for all users or specific cohorts immediately if needed.
Versioning. Pin the model version, the prompts, the tool definitions. Roll back cleanly.
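
The budget cap and kill switch from the list above can share one gate that every request passes through before the agent runs. This is a sketch under the assumption of a single process; in production the state would live in a shared store or feature-flag service. The `AgentGate` class and its method names are hypothetical.

```typescript
class AgentGate {
  private spentUsd = 0;
  private globallyDisabled = false;
  private disabledCohorts = new Set<string>();
  private dailyCapUsd: number;

  constructor(dailyCapUsd: number) {
    this.dailyCapUsd = dailyCapUsd;
  }

  // Kill switch: everyone, or one cohort at a time.
  disableAll(): void { this.globallyDisabled = true; }
  disableCohort(cohort: string): void { this.disabledCohorts.add(cohort); }

  // Budget tracking: call recordSpend after each model call, resetDay on a schedule.
  recordSpend(usd: number): void { this.spentUsd += usd; }
  resetDay(): void { this.spentUsd = 0; }

  // Returns null when the request may proceed, otherwise the reason it is blocked.
  check(cohort: string): string | null {
    if (this.globallyDisabled) return "kill switch: all users";
    if (this.disabledCohorts.has(cohort)) return `kill switch: cohort ${cohort}`;
    if (this.spentUsd >= this.dailyCapUsd) return "daily budget exhausted";
    return null;
  }
}
```

A blocked request should fall back to the human queue rather than fail silently, so the gate pairs naturally with the escalation path.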

Deployment platforms we use: Vercel, AWS Lambda, Cloudflare Workers, Modal — depending on workload shape and where the rest of your stack lives.

Step 8: Operate

The agent's not done at launch. Ongoing operations include:

Reviewing escalations weekly. Each one is a candidate for an eval case or a prompt improvement.
Tuning when models update. New model releases change behavior. Re-run evals before switching.
Expanding scope incrementally. Once trust builds, give the agent more autonomy or more tools.
Decommissioning failure modes. Some edge cases you never solve in the agent — formally hand them to humans.
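
"Re-run evals before switching" is easier to enforce as a gate than as a habit. A minimal sketch, assuming you have per-category pass rates for the current and candidate model; the function name and the 2-point tolerance are illustrative assumptions.

```typescript
// Pass rate per eval category, each between 0 and 1.
type EvalScores = Record<string, number>;

// Categories where the candidate scores worse than the baseline by more than `tolerance`.
function findRegressions(baseline: EvalScores, candidate: EvalScores, tolerance = 0.02): string[] {
  const regressions: string[] = [];
  for (const [category, base] of Object.entries(baseline)) {
    const next = candidate[category] ?? 0; // a missing category counts as a full regression
    if (base - next > tolerance) regressions.push(category);
  }
  return regressions;
}
```

An empty result means the switch is safe by your own eval suite's standard; anything else names the categories to investigate first.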

For most agents, a year of steady-state operations (the "run" phase) costs as much as the build, or more. Plan for it.

A reality check

Most "AI agents" you'll see launched in 2026 will be retired or rebuilt within 12 months. The reason is usually one of:

Wrong use case (built an agent for a workflow-automation problem).
Skipped evals.
No observability.
No human-in-the-loop.
Underestimated production ops cost.

The teams that succeed are the ones that approach agent-building as serious software engineering — not as a clever prompt. The pieces above are what serious engineering looks like in this domain.

Where to go from here

For a deeper conceptual primer: What is an AI agent.

For the higher-level decision of whether to build vs. buy: Build vs Buy AI Agent.

For a specific agent shape: see our AI Agent Development service page for production examples and engagement options.

If you'd rather we just build it for you: Start a Project.