28.1 What Is an AI Agent? Perception, Planning, Action

Right, let’s cut through the marketing fluff. When I say “AI agent,” I’m not talking about a chrome-plated automaton that’s going to file your TPS reports. At its core, an agent is a program that doesn’t stop at thinking: it acts. It takes a high-level goal from you, like “find the best price for a new graphics card,” and breaks it down into a series of steps, using tools (like a web browser or a calculator) to execute them. It’s the difference between a student who memorizes the textbook and one who actually knows how to use the library, the lab, and a decent search engine.

The classic, and still utterly relevant, way to understand this is through the ReAct framework: Reasoning and Acting. It’s a loop. The model doesn’t just barf out an answer; it thinks about what to do next, takes an action based on that thought, observes the result, and then loops until it has solved your problem or admitted defeat. This is the fundamental engine under the hood of most modern agentic systems.

The Perceive-Plan-Act Loop, Deconstructed

This isn’t just academic; it’s the operational blueprint. Every worthwhile agent you’ll build or use runs on a variation of this.

Perception is about ingesting the current state of the world. This is your input: the user’s query, the output from the last tool you used, the current context window of the conversation. The key here is that the agent’s “senses” are limited to the text you give it and the text returned by its tools. It’s not “seeing” a webpage; it’s receiving the HTML or cleaned-up text from a fetch_html tool. This distinction is everything. Garbage in, garbage out.
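To make the “it’s not seeing a webpage” point concrete, here is a minimal sketch of the cleanup step that might sit inside a tool like fetch_html. It uses only Python’s standard library; the class name, the helper name, the character cap, and the sample page (product and price included) are all made up for illustration, not taken from any particular framework.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and skips the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def html_to_observation(html: str, max_chars: int = 2000) -> str:
    """Turn raw HTML into the plain-text observation the model will read."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()[:max_chars]  # hypothetical cap to protect the context window


# Made-up sample page, for illustration only.
page = (
    "<html><head><style>p{color:red}</style></head>"
    "<body><h1>RTX 4070</h1><p>Price: $549</p><script>track()</script></body></html>"
)
print(html_to_observation(page))  # RTX 4070 Price: $549
```

Whatever this function throws away, the agent never knows existed, which is exactly why the cleanup step deserves as much care as the prompt.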

Planning is where the LLM earns its keep. Using the current context, it has to reason about the next step. This is almost always done through structured output: a JSON block in many frameworks, or labeled Thought and Action lines in the classic ReAct prompt. Either way, the format forces the model to articulate its thought process and its intended action. The “thought” is for its own benefit (and yours, for debugging); the “action” is the executable command. This structured output is what keeps the agent from being a chaotic mess. It’s a forcing function for logic.
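As a rough sketch of what that structured output can look like when JSON is the chosen format, the snippet below parses one model reply into a thought plus an executable action. The field names (thought, tool, tool_input) and the PlannedStep type are assumptions picked for this example; real frameworks each define their own schema.

```python
import json
from dataclasses import dataclass


@dataclass
class PlannedStep:
    thought: str      # free-text reasoning, kept for logs and debugging
    tool: str         # name of the tool to call
    tool_input: str   # argument passed to that tool


def parse_step(raw: str) -> PlannedStep:
    """Parse one model reply into a PlannedStep; raise ValueError if the shape is wrong."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("reply must be a JSON object")
    missing = {"thought", "tool", "tool_input"} - data.keys()
    if missing:
        raise ValueError(f"reply is missing fields: {sorted(missing)}")
    return PlannedStep(str(data["thought"]), str(data["tool"]), str(data["tool_input"]))


# A hand-written reply standing in for real model output.
reply = '{"thought": "I need the current price first.", "tool": "search", "tool_input": "RTX 4070 price"}'
step = parse_step(reply)
print(step.tool, "->", step.tool_input)  # search -> RTX 4070 price
```

Note that the thought is carried along but never executed. Only the tool name and its input drive what happens next.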

Action is the execution phase. The agent’s output gets parsed, and the system calls the tool named in the “action” with the specified parameters. This could be running a Python function to do math, calling an API to get the weather, or querying a database. The result of this action is then fed back into the Perception stage for the next cycle.
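Stripped of any framework, the three stages fit in a few dozen lines. The sketch below wires them together with a scripted stand-in for the LLM so that it runs offline; the tool registry, the step dictionary keys, the transcript format, and the step limit are all assumptions made for illustration.

```python
import operator
from typing import Callable, Dict, List

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}


def calculator(expression: str) -> str:
    """Toy tool: evaluates 'a op b', e.g. '549 * 2'."""
    left, symbol, right = expression.split()
    return str(OPS[symbol](float(left), float(right)))


TOOLS: Dict[str, Callable[[str], str]] = {"calculator": calculator}


def scripted_model(transcript: List[str]) -> dict:
    """Stand-in for the LLM: decides the next step from what has been observed so far."""
    observations = [line for line in transcript if line.startswith("Observation: ")]
    if not observations:
        return {"thought": "I should compute the total.",
                "tool": "calculator", "tool_input": "549 * 2"}
    answer = observations[-1][len("Observation: "):]
    return {"thought": "I have the number.", "final_answer": answer}


def run_agent(goal: str, model: Callable[[List[str]], dict], max_steps: int = 5) -> str:
    transcript = [f"Goal: {goal}"]
    for _ in range(max_steps):
        step = model(transcript)                           # plan
        if "final_answer" in step:
            return step["final_answer"]
        tool = TOOLS.get(step["tool"])
        if tool is None:
            observation = f"Unknown tool: {step['tool']}"
        else:
            observation = tool(step["tool_input"])         # act
        transcript.append(f"Observation: {observation}")   # perceive, next cycle
    return "Stopped: step limit reached"


print(run_agent("What do two cards at 549 cost?", scripted_model))  # 1098.0
```

The step limit is the “admit defeat” branch: without it, a model that keeps asking for a tool that does not exist would loop forever.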

Here’s a brutally simplified code example of what running this loop looks like in practice. We’re using LangChain here not because it’s perfect (it has its quirks, which we’ll get to) but because it clearly illustrates the concepts.

# NOTE: this uses LangChain's legacy agent API (initialize_agent, agent.run),
# which newer releases deprecate. It expects OPENAI_API_KEY and SERPAPI_API_KEY
# to be set in the environment.
from langchain.agents import AgentType, initialize_agent, load_tools
from langchain.llms import OpenAI

# First, we need the brains and the brawn.
llm = OpenAI(temperature=0)  # Temperature 0 for less randomness, more repeatable output
tools = load_tools(["serpapi", "llm-math"], llm=llm)  # A search tool and a calculator

# This is where the magic *isn't*. It's just careful engineering.
# The ZERO_SHOT_REACT_DESCRIPTION agent type is basically the textbook ReAct prompt.
agent = initialize_agent(
    tools,
    llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True  # This is non-negotiable for debugging. You want to see the thoughts.
)

# Now we let it loose on a problem that requires multiple steps.
question = (
    "What was the price of NVIDIA's stock when Tesla's Model 3 first shipped? "
    "How much would I have if I invested $1000 back then?"
)
result = agent.run(question)
print(result)

When you run this with verbose=True, you’ll see the glorious, clunky machinery in your terminal. A typical run goes like this, though the exact trace varies: it first thinks about needing to find the Tesla ship date, then acts by searching for it. It perceives that result, then thinks about needing the NVIDIA stock price on that date, and acts by searching for that. Finally, it thinks about calculating the investment growth and acts by using the calculator that llm-math provides. It’s a beautiful thing to watch.

Why This Loop Is Both Brilliant and Terrible

The ReAct pattern is brilliant because it leverages what LLMs are good at (reasoning and language) and offloads what they’re terrible at (factuality, calculation) to external tools. It makes them vastly more powerful and reliable.

It’s terrible because the entire system is hilariously brittle. The biggest pitfall? The context window is a prison. Every tool call, its result, and every “thought” consumes precious tokens. Long-running tasks can hit the window limit and just… forget what they were doing. It’s like a goldfish trying to assemble IKEA furniture. Best practice: use tools that return concise outputs. Write your functions to return summaries, not massive JSON blobs.
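Here is one hedged sketch of what “return summaries, not massive JSON blobs” can mean in code. The product lookup, its fields, and the character limits are invented for the example. The idea is to hand the model a hand-picked one-line summary where you can, and to fall back on a blunt length cap where you can’t.

```python
import json
from typing import Callable


def lookup_product(product_id: str) -> str:
    """Hypothetical tool that returns a large raw JSON blob (made-up data)."""
    record = {"id": product_id, "name": "RTX 4070", "price_usd": 549,
              "reviews": ["great card"] * 500}
    return json.dumps(record)


def lookup_product_summary(product_id: str) -> str:
    """Same lookup, but returns only the fields the agent is likely to need."""
    record = json.loads(lookup_product(product_id))
    return (f"{record['name']} (id {record['id']}): "
            f"${record['price_usd']}, {len(record['reviews'])} reviews")


def concise(tool: Callable[[str], str], max_chars: int = 300) -> Callable[[str], str]:
    """Fallback wrapper: cap any tool's output and say how much was cut."""
    def wrapped(tool_input: str) -> str:
        output = tool(tool_input)
        if len(output) <= max_chars:
            return output
        omitted = len(output) - max_chars
        return f"{output[:max_chars]} [truncated, {omitted} more characters]"
    return wrapped


print(len(lookup_product("gpu-1")))        # thousands of characters
print(lookup_product_summary("gpu-1"))     # RTX 4070 (id gpu-1): $549, 500 reviews
print(concise(lookup_product, max_chars=80)("gpu-1"))
```

Truncation is the crude option because it can cut the one field the agent needed. A purpose-built summary costs more effort per tool but spends far fewer tokens per call.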

The second pitfall is poor tool definition. The LLM doesn’t know what your tool does; it only knows the name and description you give it. If your description is vague, the agent will use the tool incorrectly or not at all. Be painfully explicit. Instead of “A function to get user data,” write “Fetches a user’s name and email address by their unique user ID. Input should be a string containing the ID.”
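To see why the description matters so much, it helps to look at what the model is actually shown. The sketch below is framework-free and entirely hypothetical (ToolSpec, get_user, the in-memory user table, and render_tool_prompt are names invented here), but most agent libraries do something similar: they paste each tool’s name and description into the prompt, and that text is all the model has to go on.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ToolSpec:
    name: str
    description: str            # the only documentation the model ever sees
    func: Callable[[str], str]


def get_user(user_id: str) -> str:
    """Hypothetical lookup against a tiny in-memory table (made-up data)."""
    users = {"42": ("Ada Lovelace", "ada@example.com")}
    if user_id not in users:
        return f"No user with ID {user_id}"
    name, email = users[user_id]
    return f"{name} <{email}>"


TOOLS: List[ToolSpec] = [
    ToolSpec(
        name="get_user",
        description=(
            "Fetches a user's name and email address by their unique user ID. "
            "Input should be a string containing the ID."
        ),
        func=get_user,
    ),
]


def render_tool_prompt(tools: List[ToolSpec]) -> str:
    """Build the tool section of the prompt: one 'name: description' line per tool."""
    return "\n".join(f"{tool.name}: {tool.description}" for tool in tools)


print(render_tool_prompt(TOOLS))
print(TOOLS[0].func("42"))  # Ada Lovelace <ada@example.com>
```

The function body never reaches the model. If the description does not say what goes in and what comes out, the agent is guessing.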

Finally, and this is the killer, the model can fail to produce output in the expected format, whether that is JSON or the labeled Thought/Action lines a ReAct prompt asks for. The structured output is a suggestion, not a law. Sometimes, especially if the task is complex or the model is flustered, it will just output free text, breaking the entire parsing system. Robust agent frameworks have layers of validation and correction for this, but it remains one of the most common points of failure. You must assume the output will be malformed and code accordingly.
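“Code accordingly” can start as small as the sketch below: try a strict parse, try to salvage a JSON object buried in chatter, and if both fail, ask again with a corrective nudge. This assumes JSON is the expected format. The function names, the retry count, the corrective message, and the canned replies standing in for a real model are all assumptions for illustration, not how any specific framework does it.

```python
import json
import re
from typing import Callable, Optional


def extract_json_object(text: str) -> Optional[dict]:
    """Try a strict parse first, then the outermost {...} span inside free text."""
    candidates = [text]
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match:
        candidates.append(match.group(0))
    for candidate in candidates:
        try:
            parsed = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict):
            return parsed
    return None


def plan_with_retries(ask_model: Callable[[str], str], prompt: str,
                      max_attempts: int = 3) -> dict:
    """Ask for a step, re-asking with a correction until the reply parses."""
    message = prompt
    for _ in range(max_attempts):
        step = extract_json_object(ask_model(message))
        if step is not None and "tool" in step:
            return step
        message = (prompt + "\nYour last reply was not a JSON object with a "
                   '"tool" field. Reply with JSON only.')
    raise ValueError(f"no valid step after {max_attempts} attempts")


# Canned replies standing in for a real model: one failure, then a chatty success.
replies = iter([
    "Sure! I think I should search for that.",
    'Here you go: {"tool": "search", "tool_input": "Model 3 first delivery date"}',
])
step = plan_with_retries(lambda _message: next(replies), "Pick the next step.")
print(step)  # {'tool': 'search', 'tool_input': 'Model 3 first delivery date'}
```

The hard cap on attempts matters as much as the parsing. A retry loop without one just turns a formatting failure into a billing failure.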

So, is the agentic future here? Yes, absolutely. Is it a seamless, hyper-intelligent experience? Don’t be ridiculous. It’s a powerful but incredibly fiddly way to get an LLM to sequentially call APIs without getting lost. Mastering that fiddliness is the real job.