The agent space moves fast and the vocabulary is inconsistent. This is a map of the full stack — what each layer does, what tools live there, and what actually matters in 2025.
Layer 1: The model
At the base, a language model that can reason and call tools. In 2025, your real choices are:
Claude 3.5/3.7 Sonnet — best for complex reasoning, long context, following instructions precisely. The right default for most agentic work.
GPT-4o — strong general performance, good tool-calling reliability, deep ecosystem of integrations.
Gemini 2.5 Pro/Flash — best-in-class context window, strong coding, Google ecosystem native. Flash is fast and cheap for high-volume agents.
Llama 3.x (self-hosted) — if you need data sovereignty, no API costs, or complete control. Performance gap is closing.
Model choice matters less than most people think. The bottleneck is usually tool design and orchestration, not the model.
Layer 2: Orchestration
The orchestration layer runs the agent loop — sending messages to the model, handling tool calls, managing state, deciding when to stop.
LangChain / LangGraph — the most popular, most complex. LangGraph is better for multi-agent workflows. High abstraction, high magic, harder to debug. Use if your team already knows it.
LlamaIndex — stronger on the retrieval/memory side. Good for document-heavy agents.
Anthropic's Claude SDK directly — underrated. If you're using Claude, the tool-calling API is clean and you often don't need an orchestration framework. Write the loop yourself in 50 lines.
No framework — for simple agents, the framework overhead isn't worth it. Build the loop yourself. You'll understand what's happening, and debugging is 10x easier.
The framework question mostly comes down to: how complex is your agent graph? Single-agent linear tasks need no framework. Multi-agent workflows with conditional routing benefit from LangGraph or similar.
Layer 3: Tools
Tools are the capabilities your agent can invoke. They're the most important architectural decision you'll make.
Built-in tools — web search, code execution, file access. Every major model provider offers some of these out of the box.
MCP servers — the emerging standard for connecting agents to external systems. Write once, use with any MCP-compatible host. The ecosystem already has servers for most common integrations.
Custom API wrappers — when you need to connect to internal systems, proprietary APIs, or anything not covered by existing MCP servers. Write thin wrappers, keep tools atomic.
Sandboxed execution environments — for agents that write and run code, you need isolation. E2B, Modal, or a custom container setup. Never run model-generated code outside a sandbox.
Tool design principles: atomic, bounded output, idempotent, clear errors. This is worth repeating because most agent failures trace back to bad tool design.
Layer 4: Memory
What the agent knows and how it's stored.
In-context — the conversation window. Fastest, most expensive, limited by context length. Everything the agent is actively working with.
Vector stores — semantic search over embedded documents or past interactions. Chroma (local), Pinecone, Weaviate, or pgvector (if you're already on Postgres). Use for knowledge bases and semantic retrieval.
Relational databases — for structured data the agent reads and writes. Facts, state, logs. Your regular database works here.
File systems — for agents that generate artifacts — code, reports, data files. GCS, S3, or local disk depending on your setup.
Most agents only need in-context memory + a database. Don't over-engineer the memory layer until you actually have a retrieval problem.
Layer 5: Evaluation
The layer most people skip. If you can't measure your agent, you can't improve it.
Trajectory evaluation — did the agent take reasonable steps to reach the goal? Even if the output is wrong, was the reasoning sound?
Output evaluation — is the final output correct? This requires defining what "correct" means for your task.
Tool-use evaluation — did the agent call the right tools with the right arguments? Hallucinated tool calls are a major failure mode.
Cost and latency tracking — how many tokens? How many steps? How long? These compound fast at scale.
Simple eval: a test set of 20-50 tasks with expected outputs. Run it on every change. A number that goes up or down tells you more than any amount of vibe-checking.
Layer 6: Deployment
Where the agent runs and how it's triggered.
Serverless — Cloud Run, Lambda, Vercel. Good for on-demand agents triggered by API calls or webhooks. Pay per invocation, cold starts are a concern for long-running agents.
Long-running processes — VMs or containers running a scheduler. Good for periodic agents (daily reports, regular syncs). This is what most production agents use.
Queues — for high-volume, fault-tolerant agent work. Cloud Tasks, SQS, RabbitMQ. Decouple triggering from execution. Handle retries and timeouts properly.
Human-in-the-loop hooks — for high-stakes actions, you want approval before execution. A Slack message asking "should I delete these 500 records?" before proceeding. Build this in from the start for consequential agents.
What actually matters
Stack debates are usually a distraction. What matters:
The tool design is the product. Spend 80% of your time there.
The system prompt is the agent's behavior policy. It matters more than the model choice.
The eval loop is how you improve. Without it, you're guessing.
Everything else is infrastructure. Pick the simplest option that works and move on.
Published under Field Notes. Not for sale. Share freely.