23.4 Tree of Thought and Graph of Thought

Alright, let’s get our hands dirty with the next evolution in prompt engineering. You’ve mastered the basics: asking directly (Zero-Shot), giving a few examples (Few-Shot), and laying out your reasoning step-by-step (Chain-of-Thought). These are fantastic tools, but they’re linear. They march forward in a straight line. The problem is, most interesting problems aren’t straight lines; they’re sprawling trees or tangled webs. That’s where Tree of Thought (ToT) and Graph of Thought (GoT) come in. They’re the move from a single path of reasoning to exploring multiple paths simultaneously, and it’s a complete game-changer.

Think of Chain-of-Thought as a single detective trying to solve a case. They follow one lead at a time. ToT and GoT are the entire police department whiteboarding the case, brainstorming multiple theories, and following up on all of them in parallel. It’s brute-force intelligence, and it’s wildly effective for complex problem-solving.

Why Linear Reasoning Falls Short

Chain-of-Thought is brilliant, but it has a critical weakness: it can’t backtrack. If it makes a subtle error in step 3, it’s doomed to amplify that error through steps 4, 5, and 6, leading to a confidently wrong answer. It’s also terrible at problems that require exploration or creativity, like writing a poem with multiple constraints or planning a strategy game move. You need a way to generate multiple possibilities, evaluate them, and then decide which branch to explore further. This is exactly what you do in your own head when tackling a tough problem, and ToT/GoT are about externalizing that process for the LLM.

Implementing Tree of Thought: A Practical Example

The core idea of ToT is to generate multiple reasoning paths (“thoughts”) at a given step, evaluate their promise, and then prune the bad ones to focus resources on the most promising branches. You are the conductor, and the LLM is the orchestra, playing out each potential path.

Let’s say you’re using an LLM to play a word game like Scrabble. The goal is to find the highest-scoring word from your rack of letters: “CAtnip”. A linear Chain-of-Thought might fixate on the first decent word it sees. ToT will explore the space.

First, we need a prompt to generate candidate moves. This is our “thought generator”.

# Thought Generator Prompt
thought_generator_prompt = """
You are a Scrabble expert. Generate a list of 5 possible high-scoring words from the letters: 'C', 'A', 'T', 'N', 'I', 'P'. 
You can only use each letter once. 
Return ONLY a valid JSON array of strings, nothing else. Example: ["WORD1", "WORD2", "WORD3"]
Letters: {letters}
"""

Next, we need a way to evaluate these candidates. This is our “evaluator” or “state evaluator”.

# State Evaluator Prompt
evaluator_prompt = """
Evaluate the following word for Scrabble. Consider:
1. Its point value (use standard Scrabble letter values).
2. How common it is (common words are better).
Return a JSON object with: {{"word": "<word>", "score": <integer>, "reasoning": "<brief explanation>"}}
Word: {word}
"""

Now, the magic happens in the code that orchestrates this. We’ll generate thoughts, evaluate them, and then perhaps even expand on the best one.

import openai
import json

# A simple mock function to simulate Scrabble scoring
def calculate_scrabble_score(word):
    letter_values = {'a':1, 'c':3, 'i':1, 'n':1, 'p':3, 't':1}
    return sum(letter_values.get(letter, 0) for letter in word.lower())

def tree_of_thought(letters, max_depth=2):
    # Step 1: Generate initial candidate words (Thought Generation)
    response = openai.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": thought_generator_prompt.format(letters=letters)}]
    )
    candidates = json.loads(response.choices[0].message.content)
    
    print(f"Generated candidates: {candidates}")
    
    # Step 2: Evaluate each candidate (State Evaluation)
    evaluated_candidates = []
    for candidate in candidates:
        response = openai.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": evaluator_prompt.format(word=candidate)}]
        )
        evaluation = json.loads(response.choices[0].message.content)
        # Use our real scoring function for ground truth
        evaluation['true_score'] = calculate_scrabble_score(candidate)
        evaluated_candidates.append(evaluation)
    
    # Step 3: Sort by score and select the best one (Pruning & Selection)
    best_candidate = max(evaluated_candidates, key=lambda x: x['true_score'])
    print(f"Best candidate: {best_candidate}")
    
    # In a more complex ToT, you might now use this best candidate to generate *new* letters/options (e.g., what to do next in a game)
    # This would be the next 'depth' in the tree.
    return best_candidate

# Run it
best_move = tree_of_thought("CAtnip")
print(f"The best word is '{best_move['word']}' with a score of {best_move['true_score']}!")

This is a simple one-depth ToT. A full implementation might take the best candidate and then generate responses to that move, building a true tree of possibilities. The key is the framework: Generate -> Evaluate -> Prune -> Repeat.

Leveling Up to Graph of Thought

If ToT is a tree, Graph of Thought (GoT) is a general graph. This is the bleeding edge. In ToT, thoughts are independent branches. In GoT, thoughts can interact and be combined during the reasoning process. You can take a promising step from one branch and combine it with a step from another branch to synthesize a new, better solution.

For example, in a coding problem, one branch might have a great function for sorting, and another branch might have a great function for filtering. A GoT framework could be prompted to merge these two solutions into a single, optimal code block. This is incredibly powerful but also significantly more complex to orchestrate, as you now have to manage a graph of dependencies between different LLM-generated states. The implementation looks more like a complex graph traversal algorithm, and honestly, most current applications haven’t fully climbed this mountain yet—it’s more of a research concept showing us the direction we’re headed.

The Inevitable Pitfalls

This power doesn’t come free.

Cost and Latency: You’re making multiple LLM calls (sometimes many). This gets expensive and slow fast. This isn’t for every problem, just the hard ones worth the extra compute.
Orchestration Complexity: You’re no longer just writing a prompt; you’re writing a whole state machine. The code gets messy. Debugging a tree of LLM calls is a special kind of hell.
Evaluation is Everything: Your entire system’s performance hinges on your “evaluator” prompt. If it’s bad at judging the quality of thoughts, you’ll prune the good branches and water the weeds. This is the hardest part to get right.
Overkill: Using ToT to decide what to have for lunch is like using a particle accelerator to crack an egg. For simple facts or straightforward tasks, Zero-Shot or Chain-of-Thought is faster, cheaper, and just as effective.

The takeaway? When you hit a wall with linear reasoning, remember you have this tool. It feels less like chatting with an AI and more like building a reasoning engine. And that, frankly, is where the real magic is.