28.8 Agent Evaluation and Safety

Alright, let’s get real about agent evaluation and safety. This isn’t some academic footnote; it’s the difference between building a useful assistant and unleashing a digital Rube Goldberg machine that accidentally spends your entire AWS budget on cat food subscriptions. We’re not just teaching agents to use tools; we’re teaching them to use them responsibly. This is where the rubber meets the road, or more accurately, where the LLM meets the API that can actually change things in the real world.
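
To make "responsibly" a bit more concrete, here is a minimal sketch of one common guardrail: putting every tool call behind an allowlist and a spending cap. Everything in it (`GuardedExecutor`, `ToolDenied`, the per-call dollar costs) is a hypothetical name invented for illustration, not part of any particular framework.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


class ToolDenied(Exception):
    """Raised when the guard refuses to run a tool call."""


@dataclass
class GuardedExecutor:
    # Allowlist: tool name -> (function, assumed cost per call in dollars).
    tools: Dict[str, Tuple[Callable[..., str], float]]
    budget: float
    spent: float = 0.0
    audit_log: List[str] = field(default_factory=list)

    def call(self, name: str, **kwargs) -> str:
        # Anything not explicitly registered is refused, never guessed at.
        if name not in self.tools:
            self.audit_log.append(f"DENIED {name}: not on allowlist")
            raise ToolDenied(f"tool {name!r} is not on the allowlist")
        func, cost = self.tools[name]
        # Refuse the call before running it if it would exceed the budget.
        if self.spent + cost > self.budget:
            self.audit_log.append(f"DENIED {name}: budget exceeded")
            raise ToolDenied(f"calling {name!r} would exceed the budget")
        self.spent += cost
        self.audit_log.append(f"OK {name} {kwargs}")
        return func(**kwargs)


# Usage: one cheap search tool and a two-dollar budget.
executor = GuardedExecutor(
    tools={"search": (lambda query: f"results for {query}", 1.0)},
    budget=2.0,
)
print(executor.call("search", query="cat food"))
```

The design choice worth noticing is that the check happens outside the model: the agent can ask for anything it likes, but only the executor decides what actually runs, and the audit log records both outcomes.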

28.7 AutoGen and CrewAI: Multi-Agent Frameworks

Right, so you’ve got your single agent doing its ReAct thing, calling a tool, and feeling pretty clever. But let’s be honest, most real-world problems aren’t solved by one brilliant mind working in isolation. They’re solved by a group of specialists, some arguing, some delegating, and at least one making the coffee. Welcome to the wonderfully chaotic world of multi-agent systems. Frameworks like AutoGen and CrewAI exist to manage this chaos for you. They provide the scaffolding to define different agent personas, give them specific tools, and—most importantly—orchestrate the conversation between them. Think of it as being a director for a play where the actors are LLM instances and they’re all prone to going wildly off-script.

28.6 Multi-Agent Systems: Collaboration, Competition, and Communication

Right, so you’ve got your single agent doing its ReAct thing, using tools, feeling pretty clever. But let’s be honest, most real-world problems aren’t solved by a single brilliant mind working in isolation. They’re solved by teams, committees, and groups of specialists who (ideally) collaborate, (sometimes) bicker, and (occasionally) produce something greater than the sum of their parts. Welcome to multi-agent systems, where we take that single-agent brain and copy-paste it a few times to see what beautiful—or horrifying—chaos ensues.
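
Stripped of any framework, the core of a multi-agent setup is a shared message history and a rule for whose turn it is. The sketch below assumes a simple round-robin rule and uses two scripted functions (`writer`, `reviewer`) as stand-ins for LLM calls with different persona prompts; all of these names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Message = Tuple[str, str]  # (sender, text)


@dataclass
class Agent:
    name: str
    # Stand-in for an LLM call: reads the shared history, returns a reply.
    respond: Callable[[List[Message]], str]


def writer(history: List[Message]) -> str:
    # Produces a new draft each time it is asked.
    drafts = sum(1 for sender, _ in history if sender == "writer")
    return f"draft v{drafts + 1}"


def reviewer(history: List[Message]) -> str:
    # Scripted critic: rejects the first draft, approves the second.
    last_text = history[-1][1]
    return "APPROVED" if last_text.endswith("v2") else f"revise {last_text}"


def run_conversation(agents: List[Agent], task: str, max_turns: int = 6) -> List[Message]:
    history: List[Message] = [("user", task)]
    for turn in range(max_turns):
        agent = agents[turn % len(agents)]  # round-robin turn taking
        reply = agent.respond(history)
        history.append((agent.name, reply))
        if reply == "APPROVED":  # termination condition
            break
    return history


team = [Agent("writer", writer), Agent("reviewer", reviewer)]
for sender, text in run_conversation(team, "write a tagline"):
    print(f"{sender}: {text}")
```

The `max_turns` cap matters more than it looks: without an explicit stopping rule, two agents can politely disagree forever.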

28.5 Planning Agents: MRKL, Toolformer, HuggingGPT

Alright, let’s get our hands dirty with planning agents. You’ve seen the basic ReAct loop, which is like a friend who thinks out loud before doing something. Planning agents are that friend on a triple espresso, with a whiteboard and a disturbingly detailed Gantt chart. They don’t just plan the next action; they plan a whole sequence of them, often breaking your big, scary problem into smaller, chewable pieces before they even reach for a single tool.
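
Here is a rough sketch of the plan-first idea, assuming the simplest possible shape: the planner emits the full list of steps before any tool runs, and later steps can refer to earlier results with a `$step_id` placeholder. This is not the specific mechanism of MRKL, Toolformer, or HuggingGPT; the `Step` type, the placeholder syntax, and both tools are hypothetical.

```python
from typing import Any, Callable, Dict, List, NamedTuple


class Step(NamedTuple):
    id: str
    tool: str
    arg: str  # a literal value, or "$<step id>" to reuse an earlier result


# Hypothetical tools with made-up data, standing in for real APIs.
TOOLS: Dict[str, Callable[[Any], Any]] = {
    "price_of": lambda item: {"gpu": 400}.get(item, 0),
    "add_shipping": lambda amount: amount + 15,
}


def scripted_planner(goal: str) -> List[Step]:
    """Stand-in for an LLM planner: returns the whole plan up front."""
    return [
        Step("s1", "price_of", "gpu"),
        Step("s2", "add_shipping", "$s1"),
    ]


def execute(plan: List[Step]) -> Dict[str, Any]:
    results: Dict[str, Any] = {}
    for step in plan:
        # Resolve "$s1"-style references to the output of an earlier step.
        arg = results[step.arg[1:]] if step.arg.startswith("$") else step.arg
        results[step.id] = TOOLS[step.tool](arg)
    return results


plan = scripted_planner("total cost of a gpu, shipped")
print(execute(plan))
```

Compared with a step-at-a-time loop, the trade-off is that the plan can be inspected (or vetoed) before anything runs, but it cannot react to a surprising intermediate result unless you add replanning.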

28.4 Memory in Agents: Short-Term, Long-Term, Episodic

Right, let’s talk about memory. Because without it, your AI agent is just a glorified, one-shot API call with amnesia. It’s the difference between a colleague who remembers the entire project history and a new intern you have to re-introduce yourself to every single morning. The core problem is the context window: an LLM only sees the tokens you pass in on each call, and that budget is finite, so anything that doesn’t fit is simply gone. You’re basically trying to fit the entire plot of War and Peace into a tweet. We combat this with a strategy you’re already familiar with: not remembering everything, but remembering the right things. We break it down into three key types: short-term, long-term, and episodic.
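
One way to picture the three types side by side is as three small data structures feeding a single prompt. The sketch below is an assumption-heavy toy: `AgentMemory` and its methods are hypothetical names, and long-term recall uses plain keyword matching where a real system would more likely use embeddings and a vector store.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Dict, List


@dataclass
class AgentMemory:
    # Short-term: a sliding window over the most recent turns.
    short_term: Deque[str] = field(default_factory=lambda: deque(maxlen=3))
    # Long-term: durable facts, looked up by key.
    long_term: Dict[str, str] = field(default_factory=dict)
    # Episodic: summaries of past sessions, in order.
    episodes: List[str] = field(default_factory=list)

    def observe(self, message: str) -> None:
        self.short_term.append(message)  # oldest turn falls off automatically

    def remember_fact(self, key: str, value: str) -> None:
        self.long_term[key] = value

    def end_episode(self, summary: str) -> None:
        self.episodes.append(summary)
        self.short_term.clear()

    def build_context(self, query: str) -> str:
        """Assemble only what is relevant to the query, not everything."""
        words = {w.strip("?.,!").lower() for w in query.split()}
        facts = [f"{k}: {v}" for k, v in self.long_term.items() if k.lower() in words]
        last_episode = self.episodes[-1:]
        return "\n".join(facts + last_episode + list(self.short_term))


memory = AgentMemory()
memory.remember_fact("deadline", "Friday")
for i in range(5):
    memory.observe(f"turn {i}")
print(memory.build_context("When is the deadline?"))
```

The point of `build_context` is the selection step: the prompt gets the matching fact and the last few turns, while `turn 0` and `turn 1` have already been dropped.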

28.3 Tool Use: Function Calling and MCP

Right, let’s talk about getting these LLMs to actually do things. You see, an AI that can only talk is like a brilliant philosopher locked in a sensory deprivation tank. It can reason about the world, but it can’t interact with it. Its knowledge is frozen in time, limited to its training data. It can’t tell you the weather, can’t look up your latest database entry, and can’t book you a flight to Tahiti. This is where Tool Use, often called Function Calling, comes in. It’s the mechanism we use to give our boxed-in intellects a set of hands.
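
In practice the "hands" usually look like this: the model emits a structured request naming a tool and its arguments, and your code validates and runs it. The sketch below assumes a JSON shape of `{"name": ..., "arguments": {...}}`, loosely modeled on common function-calling formats rather than any one vendor's API; `get_weather`, `TOOLS`, and `dispatch` are hypothetical names, and the weather tool returns canned text.

```python
import json
from typing import Any, Dict


def get_weather(city: str) -> str:
    # Hypothetical tool; a real one would call a weather service.
    return f"Sunny in {city}"


# Registry: what the model is told it may call, plus expected argument types.
TOOLS: Dict[str, Dict[str, Any]] = {
    "get_weather": {
        "function": get_weather,
        "description": "Return the current weather for a city.",
        "parameters": {"city": str},
    }
}


def dispatch(raw_call: str) -> str:
    """Validate a model-produced JSON tool call, then execute it."""
    try:
        call = json.loads(raw_call)
    except json.JSONDecodeError as exc:
        return f"error: invalid JSON ({exc.msg})"
    if not isinstance(call, dict) or not isinstance(call.get("name"), str):
        return "error: expected an object with a string 'name'"
    spec = TOOLS.get(call["name"])
    if spec is None:
        return f"error: unknown tool {call['name']!r}"
    args = call.get("arguments", {})
    if not isinstance(args, dict) or set(args) - set(spec["parameters"]):
        return "error: unexpected arguments"
    for param, expected in spec["parameters"].items():
        if not isinstance(args.get(param), expected):
            return f"error: argument {param!r} must be {expected.__name__}"
    return spec["function"](**args)


# Usage: pretend the model produced this string.
print(dispatch('{"name": "get_weather", "arguments": {"city": "Tahiti"}}'))
```

Errors are returned as strings rather than raised so they can be fed back to the model as the tool result, which gives it a chance to correct a malformed call on the next turn.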

28.2 ReAct: Reasoning + Acting in Interleaved Steps

Right, let’s talk about ReAct. You’ve probably hit the wall with standard LLM prompting. You ask a question, it gives you an answer that sounds plausible but is, in fact, a beautiful and confident hallucination. It’s like asking for directions from a poet. ReAct is our first solid attempt to fix that by giving the model a way to do things to find the answer, not just make one up.
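
The interleaving is easier to see in code than in prose. Below is a minimal sketch of the loop, assuming a text format of `Thought:` / `Action: tool[input]` / `Observation:` / `Final Answer:` lines. The model is replaced by `scripted_model`, a hard-coded stand-in so the example runs without an API key, and `lookup` is a hypothetical tool backed by a one-entry table.

```python
import re
from typing import Callable, Dict


def lookup(query: str) -> str:
    # Hypothetical tool: a tiny table standing in for a real search API.
    facts = {"capital of France": "Paris"}
    return facts.get(query, "no result")


TOOLS: Dict[str, Callable[[str], str]] = {"lookup": lookup}
ACTION_RE = re.compile(r"Action: (\w+)\[(.*?)\]")


def scripted_model(transcript: str) -> str:
    """Stand-in for an LLM call: acts first, answers once it has evidence."""
    if "Observation:" not in transcript:
        return "Thought: I should look this up.\nAction: lookup[capital of France]"
    evidence = transcript.rsplit("Observation:", 1)[1].strip()
    return f"Thought: The observation answers it.\nFinal Answer: {evidence}"


def react(question: str, model: Callable[[str], str] = scripted_model,
          max_steps: int = 5) -> str:
    transcript = f"Question: {question}"
    for _ in range(max_steps):
        step = model(transcript)          # reason (and maybe pick an action)
        transcript += "\n" + step
        if "Final Answer:" in step:
            return step.split("Final Answer:", 1)[1].strip()
        match = ACTION_RE.search(step)
        if match is None:
            observation = "could not parse an action"
        else:
            name, arg = match.groups()
            tool = TOOLS.get(name)
            observation = tool(arg) if tool else f"unknown tool {name}"
        # Act, then feed the result back so the next thought can use it.
        transcript += f"\nObservation: {observation}"
    return "gave up after max_steps"


print(react("What is the capital of France?"))
```

Note that the final answer is built from the observation rather than from anything the stand-in "knows", which is the whole idea: the answer is grounded in something the agent actually fetched.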

28.1 What Is an AI Agent? Perception, Planning, Action

Right, let’s cut through the marketing fluff. When I say “AI agent,” I’m not talking about a chrome-plated automaton that’s going to file your TPS reports. At its core, an agent is a program that doesn’t just think; it does. It takes a high-level goal from you, like “find the best price for a new graphics card,” and breaks it down into a series of steps, using tools (like a web browser or a calculator) to execute them. It’s the difference between a student who memorizes the textbook and one who actually knows how to use the library, the lab, and a decent search engine.


...