How to Build an AI Agent with AutoGen
Multi-agent systems beat single-agent approaches for complex tasks—and AutoGen is the fastest way to build them.
Microsoft AutoGen is a Python framework that enables you to build conversational multi-agent systems where agents collaborate by exchanging messages. Each agent runs code, calls tools, and makes decisions autonomously, but they coordinate through a conversation protocol to solve tasks that no single agent could handle alone.
TL;DR
- AutoGen v0.4 introduced a simpler architecture than v0.3 with better tool integration and human-in-the-loop support
- Start with a two-agent setup (assistant + user proxy) before scaling to group chats with 5+ agents
- Tool calling is native to AutoGen—register functions directly without separate SDKs
- GroupChat automatically routes messages between agents; manual message passing is a common pitfall
- Production deployments need cost controls, token limits, and human approval workflows for critical actions
Multi-agent workflows solve real problems faster than single chatbots. A legal document analyzer, a fact-checker, and a summarizer working together produce better results than one agent trying to do all three. Gartner estimates 40% of enterprise applications will feature AI agents by 2026, and organizations adopting multi-agent systems report 3-4 hour weekly time savings on coordination tasks alone. But building them requires thinking differently about system design.
This guide walks you through AutoGen v0.4 from installation to production. You'll move from "hello world" agents to a real multi-agent system with tool integration, failure handling, and cost controls.
Step 1: Install AutoGen and Set Up Your Environment
AutoGen requires Python 3.10 or higher. Install it via pip:
pip install pyautogen
For v0.4 specifically, verify your installation:
python -c "import autogen; print(autogen.__version__)"
You also need an LLM API key. AutoGen supports OpenAI, Azure, Anthropic, and local models. For this tutorial, we'll use OpenAI, but the patterns work for any provider.
Set your API key as an environment variable:
export OPENAI_API_KEY="your-api-key-here"
In your Python code, configure the LLM settings:
import os
import autogen

config_list = [
    {
        "model": "gpt-4-turbo",
        "api_key": os.environ["OPENAI_API_KEY"],  # read from the variable you just set
        "temperature": 0.7,
    }
]
Create a separate configuration file (recommended for production):
# config.py
import os

LLM_CONFIG = {
    "config_list": [
        {
            "model": "gpt-4-turbo",
            "api_key": os.environ["OPENAI_API_KEY"],
            "temperature": 0.7,
            "timeout": 120,
        }
    ],
    "cache_seed": 42,
}
The cache_seed parameter enables caching—identical prompts return cached results, cutting API costs by 40-60% in development. You'll revisit this setting for production use.
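Conceptually, the cache key is derived from the full request: same prompt, same parameters, same seed means a cache hit. The real key derivation is an AutoGen implementation detail, but it behaves roughly like this sketch:

```python
import hashlib
import json

def cache_key(prompt: str, params: dict, seed: int) -> str:
    """Illustrative cache key: any change to prompt, params, or seed busts it."""
    payload = json.dumps(
        {"prompt": prompt, "params": params, "seed": seed}, sort_keys=True
    )
    return hashlib.sha256(payload.encode()).hexdigest()
```

This is why changing cache_seed invalidates every cached response at once, which is handy when you want to force fresh runs.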
Step 2: Create Your First Agent
An AutoGen agent is a wrapper around an LLM with memory, tool access, and message handling. Let's build a simple assistant agent:
from autogen import AssistantAgent
from config import LLM_CONFIG

assistant = AssistantAgent(
    name="assistant",
    llm_config=LLM_CONFIG,
    system_message="You are a helpful AI assistant. Provide clear, concise answers.",
)
This agent has a system prompt, knows which LLM to use, and maintains conversation history. It can call tools, but we haven't registered any yet.
Create a user proxy agent that simulates a human:
from autogen import UserProxyAgent

user_proxy = UserProxyAgent(
    name="user",
    human_input_mode="TERMINATE",
    max_consecutive_auto_reply=10,
    code_execution_config=False,  # no code execution needed yet
)
With human_input_mode="TERMINATE", the proxy replies automatically and only pauses for human input when a termination condition or the auto-reply limit is reached—so it won't loop forever. max_consecutive_auto_reply=10 caps runaway agent chains.
Now initiate a conversation:
user_proxy.initiate_chat(
    assistant,
    message="What is the capital of France?",
)
Run this and you'll see the agent respond. It's basic, but it works. Most AutoGen tutorials stop here. You shouldn't.
Step 3: Add Tool Integration
Tools are what separate agents from chatbots. A tool is any Python function your agent can call to gather information or take action.
Define a simple tool:
import requests

def search_wikipedia(query: str) -> str:
    """Search Wikipedia and return a summary."""
    url = "https://en.wikipedia.org/w/api.php"
    params = {
        "action": "query",
        "list": "search",
        "srsearch": query,
        "format": "json",
    }
    response = requests.get(url, params=params, timeout=10)
    response.raise_for_status()
    results = response.json().get("query", {}).get("search", [])
    if results:
        return f"Found: {results[0]['title']} - {results[0]['snippet']}"
    return "No results found."
Register the tool twice: for the LLM on the assistant (so the model sees the tool's schema and can propose calls), and for execution on the user proxy (which actually runs the function):

assistant.register_for_llm(
    description="Search Wikipedia for information about a topic"
)(search_wikipedia)
user_proxy.register_for_execution()(search_wikipedia)
Now update your conversation:
user_proxy.initiate_chat(
    assistant,
    message="Find information about the history of the Eiffel Tower.",
)
The agent will recognize it can call search_wikipedia, invoke it, and incorporate results into its response. This is the pattern for any tool: define, register, call.
With cache_seed set, AutoGen caches LLM responses—an identical prompt returns the cached completion instead of making a new API call. Tool results themselves are not cached, so for rate-limited APIs, add caching inside the tool. For real-time data (stock prices, weather), skip caching entirely or use a short TTL.
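If you want explicit control over tool-result freshness, a small TTL cache wrapped around the tool works. This decorator is a plain-Python sketch, not an AutoGen feature:

```python
import time
from functools import wraps

def ttl_cache(ttl_seconds=60):
    """Cache a function's results for a limited time window."""
    def decorator(func):
        store = {}  # args -> (timestamp, result)
        @wraps(func)
        def wrapper(*args):
            now = time.time()
            if args in store and now - store[args][0] < ttl_seconds:
                return store[args][1]  # still fresh: reuse cached result
            result = func(*args)
            store[args] = (now, result)
            return result
        return wrapper
    return decorator

@ttl_cache(ttl_seconds=60)
def get_quote(ticker: str) -> str:
    # Stand-in for a real rate-limited API call.
    return f"price of {ticker}"
```

Register the wrapped function as the tool; repeated calls within the TTL hit the cache instead of the API.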
Step 4: Build a Two-Agent Collaboration
The real power emerges when agents talk to each other. Let's create a code reviewer and code writer:
code_writer = AssistantAgent(
    name="code_writer",
    llm_config=LLM_CONFIG,
    system_message="You are an expert Python developer. Write clean, efficient code.",
)

code_reviewer = AssistantAgent(
    name="code_reviewer",
    llm_config=LLM_CONFIG,
    system_message="You are a strict code reviewer. Check for bugs, security issues, and style. Give specific feedback.",
)
Create a user proxy to start the exchange:
user = UserProxyAgent(
    name="user",
    human_input_mode="TERMINATE",
    max_consecutive_auto_reply=15,
)
Initiate a multi-turn conversation:
user.initiate_chat(
    code_writer,
    message="Write a Python function to validate email addresses using regex.",
)
But here's the catch: initiate_chat only connects two agents. To get the reviewer involved, you have to shuttle messages between them yourself:

def chat_with_review(task: str, rounds: int = 3):
    code_writer.reset()
    code_reviewer.reset()
    draft = code_writer.generate_reply(
        messages=[{"role": "user", "content": task}]
    )
    for _ in range(rounds):
        review = code_reviewer.generate_reply(
            messages=[{"role": "user", "content": draft}]
        )
        draft = code_writer.generate_reply(
            messages=[{"role": "user", "content": f"Revise based on this review:\n{review}"}]
        )
    return draft

This is tedious. That's why GroupChat exists.
Step 5: Scale with GroupChat
GroupChat orchestrates multi-agent conversations automatically. Define your agents and let GroupChat route messages:
from autogen import GroupChat, GroupChatManager

agents = [code_writer, code_reviewer, user]

group_chat = GroupChat(
    agents=agents,
    messages=[],
    max_round=10,
    speaker_selection_method="auto",
)

manager = GroupChatManager(
    groupchat=group_chat,
    llm_config=LLM_CONFIG,
)

user.initiate_chat(
    manager,
    message="Write and review a function to parse CSV files.",
)
The speaker_selection_method="auto" uses the LLM to decide who speaks next based on context. Alternatives: "round_robin" (fixed rotation), "manual" (you decide), or a custom function.
GroupChat is where AutoGen shines. Five agents collaborating, each specialized, producing better output than any single agent. But it requires discipline.
GroupChat with more than 5 agents gets slow—each agent evaluates whether it should speak. Token costs climb fast. If you have 10+ agents, consider splitting into sub-groups or using a hierarchical approach with a manager agent routing tasks to specialists.
Step 6: Implement Multi-Agent Patterns for Complex Workflows
Real systems need patterns beyond free-form conversation. Here are three that work:
Pattern 1: Specialist Teams
Create sub-teams of agents. A research team (researcher + fact-checker) produces a report. A content team (writer + editor) refines it. Then they merge findings:
researchers = [researcher, fact_checker]
researchers_chat = GroupChat(
    agents=researchers + [user],
    messages=[],
    max_round=5,
    speaker_selection_method="auto",
)

content_team = [writer, editor]
content_chat = GroupChat(
    agents=content_team + [user],
    messages=[],
    max_round=5,
    speaker_selection_method="auto",
)

# Run research phase
user.initiate_chat(
    GroupChatManager(groupchat=researchers_chat, llm_config=LLM_CONFIG),
    message="Research AI agent market trends.",
)
research_output = user.last_message()["content"]

# Run content phase
user.initiate_chat(
    GroupChatManager(groupchat=content_chat, llm_config=LLM_CONFIG),
    message=f"Turn this research into a blog post: {research_output}",
)
Pattern 2: Approval Workflow
An agent proposes, another approves or rejects:
def requires_approval(agent, task):
    agent_response = agent.generate_reply(
        messages=[{"role": "user", "content": task}]
    )
    approver_decision = approver.generate_reply(
        messages=[{"role": "user", "content": f"Review this proposal:\n{agent_response}"}]
    )
    if "APPROVED" in approver_decision:
        return agent_response, True
    return agent_response, False
Use this for deployment decisions, financial transactions, or sensitive outputs.
Pattern 3: Hierarchical Routing
A manager agent receives tasks and routes them to specialists:
manager_system = """
You are a task router. Given a user request:
1. Identify the task type (data analysis / content creation / coding)
2. Route to the appropriate specialist team
3. Synthesize their output
Never make decisions directly—always delegate.
"""
manager = AssistantAgent(
    name="manager",
    llm_config=LLM_CONFIG,
    system_message=manager_system,
)
# Manager routes to teams
user.initiate_chat(
    manager,
    message="Analyze Q4 sales data and write a summary.",
)
The manager never sees the specialist tools; it only coordinates.
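A deterministic keyword pass can complement the LLM router for obvious cases and save a round-trip. A sketch, where the team names and keyword lists are made up for illustration:

```python
# Hypothetical keyword router: a cheap first pass before the LLM manager decides.
TEAMS = {
    "data": ["analyze", "chart", "csv", "metrics"],
    "content": ["write", "summarize", "blog", "edit"],
    "coding": ["implement", "debug", "refactor", "function"],
}

def route_task(task: str, teams=TEAMS) -> str:
    """Pick the team whose keywords best match the task, else 'general'."""
    text = task.lower()
    scores = {name: sum(kw in text for kw in kws) for name, kws in teams.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] else "general"
```

Tasks that score zero everywhere fall through to "general", where the LLM manager takes over.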
Step 7: Production Considerations
Development agents and production agents are different creatures.
Cost Control: Every API call costs money. Set strict limits directly in your llm_config:

llm_config = {
    "config_list": [...],
    "timeout": 60,
    "max_tokens": 500,   # per completion
    "temperature": 0.3,  # lower for consistency
    "cache_seed": None,  # disable caching in production
}
Monitor API usage:
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("autogen")
Set a budget per conversation—for example, a reply hook that halts once a message limit is hit:

MAX_MESSAGES = 50
message_count = {"n": 0}

def enforce_budget(recipient, messages=None, sender=None, config=None):
    message_count["n"] += 1
    if message_count["n"] > MAX_MESSAGES:
        return True, "Message budget exceeded—stopping this conversation."
    return False, None  # fall through to the normal reply

assistant.register_reply([autogen.Agent, None], enforce_budget, position=0)
Human-in-the-Loop: Not every decision should be automatic. For critical actions, ask for approval:
user_proxy = UserProxyAgent(
    name="user",
    human_input_mode="ALWAYS",  # require human approval on every turn
)
Or conditional approval:
def should_require_approval(message: str) -> bool:
    sensitive_keywords = ["delete", "deploy", "transfer", "approve"]
    return any(keyword in message.lower() for keyword in sensitive_keywords)

Keep the proxy on "TERMINATE" for routine traffic, and escalate to a human only when this check fires.
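One concrete way to wire up conditional approval is a reply hook. This is a sketch: the approve callback is an assumption—swap in input() or your own review UI:

```python
SENSITIVE = ["delete", "deploy", "transfer", "approve"]

def needs_approval(message: str) -> bool:
    return any(kw in message.lower() for kw in SENSITIVE)

def make_gate(approve):
    """Build a reply hook that escalates sensitive messages to a human."""
    def gate(recipient, messages=None, sender=None, config=None):
        last = (messages or [{}])[-1].get("content") or ""
        if needs_approval(last) and not approve(last):
            # Final reply: short-circuits the conversation here.
            return True, "Action rejected by human reviewer."
        return False, None  # not sensitive (or approved): continue normally
    return gate

# Registration (requires pyautogen):
# user_proxy.register_reply(
#     [autogen.Agent, None],
#     make_gate(lambda m: input(f"Approve?\n{m}\n[y/N] ").lower() == "y"),
#     position=0,
# )
```

Routine messages pass through untouched; only sensitive ones block on a human decision.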
Error Handling: Agents hallucinate and fail. Handle it gracefully:
def safe_chat(initiator, recipient, message, max_attempts=3):
    for attempt in range(max_attempts):
        try:
            result = initiator.initiate_chat(
                recipient,
                message=message,
                summary_method="reflection_with_llm",
            )
            return result.summary
        except Exception as e:
            logger.error(f"Attempt {attempt + 1} failed: {e}")
            if attempt == max_attempts - 1:
                return "Task failed after retries."
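Immediate retries can hammer a rate-limited API; spacing attempts out is kinder. A small helper for an exponential backoff schedule (the base and cap values are arbitrary):

```python
def backoff_delays(attempts: int, base: float = 1.0, cap: float = 30.0) -> list[float]:
    """Exponential backoff schedule: base, 2*base, 4*base, ... capped."""
    return [min(cap, base * (2 ** i)) for i in range(attempts)]
```

Sleep for delays[attempt] seconds before each retry in the loop above.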
Token Tracking: Know how many tokens your agents consume:
from autogen.token_count_utils import count_token

class TokenTracker:
    def __init__(self):
        self.total_tokens = 0

    def log_tokens(self, message: str, model: str = "gpt-4-turbo") -> int:
        tokens = count_token(message, model=model)
        self.total_tokens += tokens
        return tokens

tracker = TokenTracker()
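Token counts translate directly to dollars. A back-of-envelope estimator—the per-1K-token rates below are placeholders, so check your provider's current pricing:

```python
# Placeholder per-1K-token rates; substitute your provider's real pricing.
RATES = {"gpt-4-turbo": {"input": 0.01, "output": 0.03}}

def estimate_cost(model: str, input_tokens: int, output_tokens: int, rates=RATES) -> float:
    """Dollar estimate for one call: tokens / 1000 * per-1K rate, per direction."""
    r = rates[model]
    return input_tokens / 1000 * r["input"] + output_tokens / 1000 * r["output"]
```

Feed it the numbers from your TokenTracker to watch per-conversation spend.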
Step 8: Common Pitfalls and How to Avoid Them
Most tutorials skip this. Don't.
Pitfall 1: Agents Talking in Circles
Agents repeat the same point endlessly. Fix it with:
- A lower max_consecutive_auto_reply on your agents
- A max_round cap in GroupChat (default: 10)
- An explicit termination condition: "If you agree, say CONSENSUS."
group_chat = GroupChat(
    agents=agents,
    messages=[],
    max_round=8,  # hard stop
)
# GroupChat has no system_message parameter. Put "say CONSENSUS when you all
# agree" in each agent's system message, then construct the user proxy with:
# is_termination_msg=lambda m: "CONSENSUS" in (m.get("content") or "")
Pitfall 2: Tools Don't Get Called
The agent knows the tool exists but doesn't use it—usually because the system prompt doesn't mention it:
assistant = AssistantAgent(
    name="assistant",
    llm_config=LLM_CONFIG,
    system_message="You are an assistant. You have access to a search tool. Use it to find current information.",
)
Explicitly tell agents they have tools.
Pitfall 3: One Agent Dominates
In GroupChat, one agent speaks too much. Adjust speaker selection:
def custom_speaker_selection(last_speaker, groupchat):
    # Alternate speakers to ensure fair distribution
    if last_speaker is agent_a:
        return agent_b
    return agent_a

group_chat = GroupChat(
    agents=agents,
    messages=[],
    speaker_selection_method=custom_speaker_selection,
)
Pitfall 4: Forgetting Agent Resets
Memory persists between chats. If you reuse agents:
assistant.reset() # Clear chat history
user_proxy.reset()
Forgetting this causes agents to reference old conversations.
Pitfall 5: Tool Functions with Side Effects
If a tool deletes data or sends emails, test it outside AutoGen first:
# Test tool in isolation
result = search_wikipedia("Python")
print(result)
# Then register with the agents
assistant.register_for_llm(description="Search Wikipedia")(search_wikipedia)
user_proxy.register_for_execution()(search_wikipedia)
Comparing AutoGen with Other Frameworks
| Feature | AutoGen | CrewAI | LangChain |
|---|---|---|---|
| Multi-Agent Conversation | Native GroupChat | Task-based orchestration | Requires custom loops |
| Tool Integration | register_for_execution() | Tool decorator | Tool calling via LLM |
| Code Execution | Built-in (sandboxed) | Not built-in | Via LLM only |
| Learning Curve | Steep | Gentle | Moderate |
| Production Ready | Yes | Emerging | Yes, but manual setup |
AutoGen is the choice if you need agents that execute code and collaborate autonomously. CrewAI is simpler if you're new to agents. LangChain is best if you need maximum flexibility and don't mind writing scaffolding code.
What You've Built
You now have a production-capable multi-agent system. You can:
- Create specialized agents with distinct roles
- Register tools for agents to call
- Coordinate 3-5 agents in GroupChat
- Handle failures and human approvals
- Monitor costs and prevent runaway loops
The AI agent market reached $7.6B in 2025, with 79% of organizations adopting AI agents. 93% of business leaders believe AI agents give a competitive edge. The difference between successful deployments and failures is rarely the LLM—it's agent orchestration. AutoGen handles that orchestration well.
For deeper patterns, see our complete guide to building AI agents and comparisons with LangChain and CrewAI.
Should I use AutoGen v0.4 or v0.3?
Use v0.4. It's newer, simpler, and the team has deprecated v0.3 support. v0.4 has better tool integration and reduced boilerplate. Migration from v0.3 requires updates to agent initialization, but it's worth it.
How do I prevent agents from running forever?
Set max_consecutive_auto_reply on agents and max_round on GroupChat. Both are hard stops. Also set human_input_mode="TERMINATE" on user proxies—the proxy then pauses for human input when a termination condition is reached instead of continuing automatically.
Can AutoGen work with local models?
Yes. Configure any LLM via the config_list. Use Ollama or LM Studio for local inference. You'll sacrifice speed compared to cloud APIs, but you keep all data local. Recommended for sensitive workflows.
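A config sketch for a local Ollama endpoint—the model name is whatever you've pulled locally, and Ollama serves an OpenAI-compatible API at /v1 by default:

```python
config_list = [
    {
        "model": "llama3.1",  # any model you've pulled with `ollama pull`
        "base_url": "http://localhost:11434/v1",
        "api_key": "ollama",  # placeholder; local servers ignore it
    }
]
```

The rest of the agent code stays unchanged—only the config_list points at the local server.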
What's the typical cost for a multi-agent workflow?
A 5-agent GroupChat resolving in 8 rounds costs roughly $0.50-$2 with GPT-4 Turbo, depending on token usage. Caching cuts this 40-60% in development. For production, budget $0.10-$1 per task with proper limits and cheaper models like GPT-3.5 Turbo.
