
Building AI Agents: Lessons from the Trenches


Building AI agents that work in demos is easy. Building agents that work reliably in production? That’s where things get interesting.

Over the past year, I’ve built agents for everything from marketing automation at Robynn AI to helping local coffee shops automate their scheduling. Here’s what I’ve learned.

The Agent Architecture Spectrum

There’s a spectrum of agent architectures, and choosing the right one depends entirely on your use case:

Simple Tool-Calling Agents

For straightforward tasks with clear inputs and outputs, you don’t need complex orchestration. A simple loop works:

def simple_agent(task: str, tools: list[Tool]) -> str:
    messages = [{"role": "user", "content": task}]

    while True:
        response = llm.generate(messages, tools=tools)

        if response.is_complete:
            return response.content

        # Keep the assistant's tool-call turn in the history so the
        # model can see its own calls on the next iteration
        messages.append({"role": "assistant", "content": response.content,
                         "tool_calls": response.tool_calls})

        # Execute tool calls and feed the results back
        for tool_call in response.tool_calls:
            result = execute_tool(tool_call)
            messages.append({"role": "tool", "content": result})

This pattern handles 80% of use cases. Don’t over-engineer it. (This is part of my philosophy on boring technology—save complexity for where it matters.)

State Machine Agents

When you need deterministic control flow with AI-powered decisions at each step, state machines shine:

from langgraph.graph import END, StateGraph

workflow = StateGraph(AgentState)

# Define nodes
workflow.add_node("analyze", analyze_task)
workflow.add_node("plan", create_plan)
workflow.add_node("execute", execute_plan)
workflow.add_node("validate", validate_results)

# Define edges
workflow.set_entry_point("analyze")
workflow.add_edge("analyze", "plan")
workflow.add_conditional_edges(
    "plan",
    should_execute,
    {"execute": "execute", "revise": "plan"}
)
workflow.add_edge("execute", "validate")
workflow.add_edge("validate", END)

agent = workflow.compile()

Multi-Agent Systems

For complex domains, multiple specialized agents often outperform one generalist (a minimal coordinator sketch follows the list):

  • Coordinator: Routes tasks to specialists
  • Researcher: Gathers and synthesizes information
  • Executor: Takes actions
  • Validator: Checks results
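
Here's a minimal sketch of the coordinator piece. Everything in it is hypothetical and for illustration only: Researcher, Executor, and Validator are assumed to share a run(task) interface, and Route is a small Pydantic model the LLM fills in with its routing decision:

# Hypothetical sketch: each specialist exposes run(task) -> str
SPECIALISTS = {
    "research": Researcher(),
    "execute": Executor(),
    "validate": Validator(),
}

def coordinator(task: str) -> str:
    # Ask the LLM which specialist should handle the task;
    # Route is a Pydantic model with a single `specialist` field
    route = llm.generate(
        f"Which specialist should handle this task?\n{task}",
        response_format=Route,
    )
    return SPECIALISTS[route.specialist].run(task)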

The Patterns That Actually Work

1. Structured Outputs Are Non-Negotiable

Free-form text output from LLMs is unreliable. Always constrain the output:

from pydantic import BaseModel

class Action(BaseModel):
    tool: str
    arguments: dict

class ActionPlan(BaseModel):
    reasoning: str
    actions: list[Action]
    confidence: float

# The model is constrained to return JSON matching the ActionPlan schema
response = llm.generate(
    prompt,
    response_format=ActionPlan
)

This alone eliminates 50% of production bugs.

2. Retry with Exponential Backoff

LLM APIs fail. Rate limits hit. Network hiccups happen. Build resilience in:

from openai import RateLimitError  # or your provider's equivalent
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=4, max=10),
    retry=retry_if_exception_type(RateLimitError)
)
def call_llm(prompt: str) -> str:
    return llm.generate(prompt)

3. Observability from Day One

You can’t debug what you can’t see. Log everything:

  • Input prompts (anonymized)
  • Tool calls and results
  • Token usage
  • Latency breakdowns
  • Error rates by category

I use a simple pattern:

# trace, span, and log_metrics are thin wrappers around whatever
# tracing backend you use (OpenTelemetry, Langfuse, etc.)
@trace("agent_step")
def process_step(state: AgentState) -> AgentState:
    with span("llm_call"):
        response = llm.generate(state.messages)

    log_metrics({
        "tokens_used": response.usage.total_tokens,
        "latency_ms": response.latency,
        "tool_calls": len(response.tool_calls)
    })

    return state.update(response)
4. Human-in-the-Loop Escape Hatches

No agent should run completely autonomously in production. Always provide:

  • Confidence thresholds that trigger human review
  • Easy ways to pause and inspect state
  • Clear audit trails

from datetime import timedelta

# Escalate low-confidence actions instead of executing them blindly
if action.confidence < 0.8:
    await notify_human(
        action=action,
        context=state,
        timeout=timedelta(hours=1)
    )

Common Pitfalls

The “Just Add More Context” Trap

When an agent fails, the temptation is to add more instructions to the prompt. This usually makes things worse. Instead:

  1. Analyze why it failed
  2. Constrain the output format
  3. Add specific examples (few-shot), as sketched after this list
  4. Consider splitting into sub-tasks
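
For step 3, a couple of concrete input/output pairs usually do more than another paragraph of instructions. A rough sketch; the examples and field names here are made up for illustration:

FEW_SHOT_EXAMPLES = """
Input: Please cancel my order ORD-104532
Output: {"intent": "cancel_order", "order_id": "ORD-104532"}

Input: Where is my package??
Output: {"intent": "order_status", "order_id": null}
"""

prompt = f"{FEW_SHOT_EXAMPLES}\nInput: {customer_message}\nOutput:"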

Ignoring Edge Cases

Agents in production see inputs you never imagined. That “simple” task of parsing customer emails? Wait until you see:

  • Emails in multiple languages
  • Forwarded threads with nested quotes
  • Attachments described but not attached
  • Sarcasm and ambiguity

Build defensive parsing and graceful degradation.
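
Here's a sketch of what graceful degradation can look like. parse_email_with_llm, Ticket, and FallbackTicket are hypothetical names standing in for your own structured-output call and data models, and RateLimitError is the same provider exception used in the retry example:

from pydantic import ValidationError

def parse_customer_email(raw_email: str) -> Ticket | FallbackTicket:
    # Cheap deterministic checks first
    if not raw_email.strip():
        return FallbackTicket(reason="empty_email")

    try:
        # Structured output, validated against a Pydantic schema
        ticket = parse_email_with_llm(raw_email)
    except (ValidationError, RateLimitError):
        # Degrade gracefully: queue for human review instead of crashing
        return FallbackTicket(reason="parse_failed", raw=raw_email)

    if ticket.confidence < 0.7:
        return FallbackTicket(reason="low_confidence", raw=raw_email)

    return ticket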

Over-Relying on the LLM

Not everything needs AI. A well-crafted regex or a simple rule often beats an LLM call for:

  • Data validation
  • Format checking
  • Simple transformations
  • Deterministic routing

Save the LLM for actual reasoning tasks. Use boring, proven tools for everything else.
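
For example, looking up an order by ID is a regex and a database call, not a reasoning task. A rough sketch, with lookup_order and the ID format assumed purely for illustration:

import re

ORDER_ID_RE = re.compile(r"\bORD-\d{6}\b")  # assumed ID format

def route_message(message: str) -> str:
    # Deterministic rule: exact order-ID lookups never touch the LLM
    if match := ORDER_ID_RE.search(message):
        return lookup_order(match.group())

    # Anything ambiguous goes to the model
    return llm.generate(f"Classify and answer this customer message:\n{message}")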

What I’m Excited About

The field is moving fast. A few things I’m watching:

Tool Libraries: Standardized tool definitions (like MCP) make agents more portable and composable.

Smaller, Faster Models: Claude 3 Haiku and similar models make agent loops economically viable for more use cases.

Memory Systems: Long-term memory and retrieval are becoming table stakes for production agents.

Evaluation Frameworks: We desperately need better ways to test agent behavior at scale.

Wrapping Up

Building production AI agents is part software engineering, part prompt engineering, and part systems design. It’s one of those domains where being a generalist actually helps—you need to understand the full stack. The agents that work best are:

  • Simple where possible
  • Observable always
  • Constrained in their outputs
  • Resilient to failures
  • Honest about their limitations

Start simple, measure everything, and iterate based on real-world feedback. Showing up consistently and making incremental improvements is the path to agents that actually work.


Have questions about building AI agents? Drop me a line at architgupta941@gmail.com or find me on X.

Written by Archit Gupta

Software Engineer specializing in AI agents at Robynn AI. Pool enthusiast.