Chain-of-Thought Prompting

Add five words to almost any prompt and you'll get meaningfully better results:

"Let's think step-by-step."

That's chain-of-thought (CoT) prompting in its simplest form. It asks the model to write out its reasoning before answering, and that act of reasoning out loud can substantially improve accuracy on multi-step problems.
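As a minimal sketch of what "adding the phrase" looks like in code, the helper below appends the trigger to any prompt string. The function name `with_cot` and the sample question are illustrative assumptions, not part of any library.

```python
COT_TRIGGER = "Let's think step-by-step."


def with_cot(prompt: str, trigger: str = COT_TRIGGER) -> str:
    """Return the prompt with a chain-of-thought trigger on its own final line."""
    base = prompt.rstrip()
    if base.endswith(trigger):
        # The trigger is already there, so do not append it twice.
        return base
    return f"{base}\n\n{trigger}"


# Hypothetical question, used only to show the resulting prompt text.
print(with_cot("A shop sells pens at 3 for $2. How much do 12 pens cost?"))
```

The resulting string is what you would send to whichever model or API you use; nothing here depends on a specific provider.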

The Research

The paper "Large Language Models are Zero-Shot Reasoners" (Kojima et al., 2022) tested what happens when you add "Let's think step-by-step" to prompts, specifically on arithmetic and logical reasoning tasks where LLMs traditionally struggle.

Results on the MultiArith arithmetic benchmark:

Prompt type	Accuracy
Standard zero-shot	17.7%
Zero-shot + "Let's think step-by-step"	78.7%

A 61-percentage-point jump from five words. No examples, no complex setup -- just telling the model to reason out loud before answering.

The study also tested misleading phrases like "Don't think. Just feel." and irrelevant ones like "Abrakadabra!" Those scored at or near the zero-shot baseline rather than clearly improving on it. This suggests the gain comes from instructions that actually elicit step-by-step reasoning, not from adding just any extra text. "Let's think step-by-step" was the best of the templates tested.
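Zero-shot CoT is usually described as two model calls: one that elicits the reasoning, and a second that extracts a short final answer from it. The sketch below shows that shape under stated assumptions: `generate` is a hypothetical callable standing in for whatever LLM client you use, `fake_model` is a canned stand-in so the example runs offline, and the exact wording of `ANSWER_TRIGGER` is an assumption.

```python
from typing import Callable

REASONING_TRIGGER = "Let's think step-by-step."
ANSWER_TRIGGER = "Therefore, the answer is"  # assumed wording for the second call


def zero_shot_cot(question: str, generate: Callable[[str], str]) -> tuple[str, str]:
    """Run two calls: first elicit reasoning, then extract the final answer."""
    reasoning_prompt = f"Q: {question}\nA: {REASONING_TRIGGER}"
    reasoning = generate(reasoning_prompt).strip()
    # The second prompt contains the question, the reasoning, and the answer trigger.
    answer_prompt = f"{reasoning_prompt} {reasoning}\n{ANSWER_TRIGGER}"
    answer = generate(answer_prompt).strip()
    return reasoning, answer


def fake_model(prompt: str) -> str:
    """Canned stand-in for a real model call (hypothetical, for illustration)."""
    if prompt.endswith(ANSWER_TRIGGER):
        return " 6."
    return "John keeps 3 green apples. Sally has 3 green apples. 3 + 3 = 6."


reasoning, answer = zero_shot_cot("How many green apples are there in total?", fake_model)
print(reasoning)
print(answer)
```

In chat interfaces you rarely need the second call, because you can read the answer off the end of the reasoning yourself. It matters mostly when a program has to consume the answer.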

Why It Works

Remember: LLMs only "think" while they're generating output. They don't reason silently and then write the answer. They produce the answer token by token.

Without chain-of-thought, the model goes directly from your question to a final answer -- no intermediate reasoning. For simple tasks, this is fine. For multi-step problems, it is unreliable, and there is no reasoning to inspect:

Plain text

Question: John has 5 apples. He gives Mary 2 red ones and keeps the green ones.
Sally has 3 apples, all green. How many green apples total?

Zero-shot model: 6 (no intermediate reasoning shown, so there is nothing to check)

With chain-of-thought, the model is forced to work through the problem step by step in the output before concluding:

Plain text

Let's think step-by-step.

John has 5 apples total and gives away 2 red ones, so he has 3 left.
If the ones he gave were red, his remaining apples are not red.
The problem says he "keeps the green ones" -- so his remaining 3 are green.
Sally has 3 apples, all green.
Total green apples: John's 3 + Sally's 3 = 6.

Answer: 6

The answer is 6 in both cases, so the direct answer happened to be right here. The path is the important part: with each step written out, you can verify the reasoning, and the model is less likely to skip a step on problems where a pattern-matched guess would be wrong.
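The state tracking in the apple problem can be written as explicit intermediate values, which is roughly what the reasoning text does in words. This is only an illustration of the arithmetic; the function and variable names are made up for the example.

```python
def green_apple_steps() -> tuple[list[str], int]:
    """Track the apple counts one step at a time and return the steps and total."""
    steps = []

    john_total = 5
    john_gave_away_red = 2
    # John keeps the green ones, so whatever he has left is green.
    john_green = john_total - john_gave_away_red
    steps.append(f"John: {john_total} - {john_gave_away_red} red = {john_green} green")

    sally_green = 3
    steps.append(f"Sally: {sally_green} green")

    total_green = john_green + sally_green
    steps.append(f"Total: {john_green} + {sally_green} = {total_green}")
    return steps, total_green


steps, total = green_apple_steps()
for line in steps:
    print(line)
```

Each appended line is a checkpoint: if one of them is wrong, you can see exactly where the reasoning went off.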

Why LLMs Struggle with Math and Logic

LLMs are pattern predictors, not calculators. When you ask "what is 847 × 23?", the model isn't multiplying -- it's predicting the most likely tokens after that sequence. Simple arithmetic tends to work because facts like 2 + 2 = 4 appear constantly in training data. Complex multi-step arithmetic tends to fail because the model can't track state across calculation steps without writing them down.
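To make "writing the steps down" concrete, here is a small sketch that multiplies 847 by 23 through partial products and records each intermediate result, the way a step-by-step answer would. The helper name is illustrative, and it assumes `b` is a non-negative integer.

```python
def multiply_with_steps(a: int, b: int) -> tuple[list[str], int]:
    """Multiply a by b using one partial product per digit of b.

    Assumes b is a non-negative integer.
    """
    steps = []
    total = 0
    # Walk the digits of b from least to most significant.
    for place, digit_char in enumerate(reversed(str(b))):
        multiplier = int(digit_char) * 10 ** place
        partial = a * multiplier
        total += partial
        steps.append(f"{a} x {multiplier} = {partial} (running total {total})")
    return steps, total


steps, product = multiply_with_steps(847, 23)
for line in steps:
    print(line)
print(product)  # 19481
```

The running total is the state the model has no place to keep unless it is written into the output.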

Similarly, "flip a coin" in an LLM doesn't give you a 50/50 result. The model predicts the most likely token -- and if its training data has "heads" appearing more often in similar contexts, you'll get heads more often. LLMs are not random number generators.
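The difference between "most likely token" and a real coin flip can be sketched as follows. The probabilities are made up for illustration, and greedy selection is a simplification: real deployments often sample with a temperature, in which case outcomes follow the model's learned distribution, which still is not guaranteed to be 50/50.

```python
import random

# Made-up next-token probabilities, for illustration only.
NEXT_TOKEN_PROBS = {"heads": 0.55, "tails": 0.45}


def greedy_flip(probs: dict[str, float]) -> str:
    """Pick the single most likely token, as greedy decoding would."""
    return max(probs, key=probs.get)


def fair_flip(rng: random.Random) -> str:
    """Pick heads or tails with equal probability using a real PRNG."""
    return rng.choice(["heads", "tails"])


greedy_results = {greedy_flip(NEXT_TOKEN_PROBS) for _ in range(100)}
print(greedy_results)  # always the same single outcome

rng = random.Random(0)  # seeded so the run is repeatable
fair_results = [fair_flip(rng) for _ in range(1000)]
print(fair_results.count("heads"), fair_results.count("tails"))
```

If you need randomness in a workflow, generate it in code and pass the result to the model rather than asking the model to produce it.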

Chain-of-thought prompting compensates for this by forcing the model to externalize its reasoning, making each step explicit and checkable.

Example 1: Diagnosing a Plant Problem

Without CoT:

Plain text

My plant's leaves are turning yellow. What's wrong?
→ "Could be overwatering, underwatering, or nutrient deficiency."

With CoT:

Plain text

My plant's leaves are turning yellow. Diagnose this step-by-step.

Output:

Step 1: Check watering issues.
- Overwatering (most common): Are leaves yellow and wilting? Is the soil constantly wet? Does the pot have drainage holes? Overwatering suffocates roots, preventing nutrient absorption.
- Underwatering: Are leaves dry and crispy? Is the soil dried out?

Step 2: Examine light conditions...

Step 3: Assess nutrient deficiency...

Step 4: Look for other clues...

Quick diagnostic questions: [list]

The second response is actually useful -- it walks through a diagnostic process you can follow. The first is nearly meaningless.

Example 2: Building Complex Features Step-by-Step

Chain-of-thought isn't just for problem diagnosis. When building complex features, explicitly structuring the steps gets better results:

Plain text

Let's build a complete export/import system for a Prompt Library, step-by-step.

Step 1: Analyze what data needs to be exported
- All prompts with their metadata, ratings, and notes

Step 2: Design the export JSON schema
- Include version number for future compatibility
- Add export timestamp and statistics
- Provide complete prompts array

Step 3: Create the export function
- Gather all data from localStorage
- Validate data integrity
- Create a Blob and trigger download with timestamp in filename

Step 4: Create the import function
- Read the uploaded JSON file
- Validate structure and version number
- Check for duplicate IDs
- Merge or replace based on user choice

Step 5: Add error recovery
- Back up existing data before import
- Roll back on failure
- Provide detailed error messages

Implement this system. Let's think step-by-step.

Adding both the explicit steps and the "let's think step-by-step" instruction is redundant in a human sense, but models tend to respond better to explicit instruction. The phrase primes the model to maintain its reasoning chain throughout the implementation.
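If you write prompts like this often, the structure can be assembled from data. The sketch below builds a numbered-step prompt and ends it with the trigger phrase. The function name, parameters, and the shortened step list are assumptions for illustration.

```python
def build_stepwise_prompt(
    goal: str,
    steps: list[tuple[str, list[str]]],
    closing: str = "Implement this system.",
    trigger: str = "Let's think step-by-step.",
) -> str:
    """Assemble a prompt with numbered steps, bullet details, and a CoT trigger."""
    lines = [goal, ""]
    for number, (title, details) in enumerate(steps, start=1):
        lines.append(f"Step {number}: {title}")
        lines.extend(f"- {detail}" for detail in details)
        lines.append("")  # blank line between steps
    lines.append(f"{closing} {trigger}")
    return "\n".join(lines)


prompt = build_stepwise_prompt(
    "Let's build a complete export/import system for a Prompt Library, step-by-step.",
    [
        ("Analyze what data needs to be exported",
         ["All prompts with their metadata, ratings, and notes"]),
        ("Design the export JSON schema",
         ["Include version number for future compatibility"]),
    ],
)
print(prompt)
```

Keeping the steps as data makes it easy to reorder or drop a step and regenerate the prompt without retyping it.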

The Scale Effect

The accuracy gains from chain-of-thought grow with model size. On smaller models, zero-shot and zero-shot CoT perform similarly. On larger models, the gap becomes dramatic.

Since models are trending larger, chain-of-thought prompting is likely to keep paying off. Learning it now is an investment in a technique whose value should grow over time.

Variations That Work

The research tested multiple CoT templates, and the ones that genuinely instruct the model to reason all outperformed the baseline. Phrasings in that spirit:

Template	Works well for
"Let's think step-by-step"	General purpose -- use this by default
"First, ... Next, ... Finally, ..."	Sequential processes
"Let's think about this logically"	Reasoning and analysis
"Let's solve this by splitting it into steps"	Multi-part problems
"Diagnose this step-by-step"	Debugging, troubleshooting

The specific words matter less than the fact that you're asking for explicit reasoning. But "Let's think step-by-step" is the most researched and most reliable default.
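One way to apply the table is a small lookup that falls back to the default phrase. The task labels used as keys (`"general"`, `"debugging"`, and so on) are hypothetical names chosen for this sketch; the template strings come from the table.

```python
DEFAULT_TEMPLATE = "Let's think step-by-step."

# Keys are illustrative labels; values are the templates from the table.
COT_TEMPLATES = {
    "general": DEFAULT_TEMPLATE,
    "sequential": "First, ... Next, ... Finally, ...",
    "analysis": "Let's think about this logically",
    "multi_part": "Let's solve this by splitting it into steps",
    "debugging": "Diagnose this step-by-step",
}


def pick_template(task_type: str) -> str:
    """Return the template for a task type, or the default if it is unknown."""
    return COT_TEMPLATES.get(task_type, DEFAULT_TEMPLATE)


print(pick_template("debugging"))
print(pick_template("something else"))  # falls back to the default
```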

Practical Application

Add it to prompts where:

- The answer involves multiple steps
- You're debugging something
- You're doing any math or logic
- You want to understand why the model reached a conclusion
- The output keeps being wrong and you don't know why

It works in code generation too. When a model is generating a complex algorithm, "let's think step-by-step" often produces better code and inline explanations of the reasoning.
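When a program consumes a CoT response, it helps to separate the reasoning from the final answer. The sketch below assumes the prompt asked the model to finish with a line starting with "Answer:", as in the apple example. That marker is a convention you choose, not something models emit by default, and the function name is illustrative.

```python
from typing import Optional


def split_answer(response: str, marker: str = "Answer:") -> tuple[str, Optional[str]]:
    """Split a response into (reasoning, answer) using the last marker line.

    Returns (full response, None) when no line starts with the marker.
    """
    lines = response.strip().splitlines()
    # Search from the end so a marker inside the reasoning does not win.
    for index in range(len(lines) - 1, -1, -1):
        stripped = lines[index].strip()
        if stripped.startswith(marker):
            reasoning = "\n".join(lines[:index]).strip()
            return reasoning, stripped[len(marker):].strip()
    return response.strip(), None


sample = "John keeps 3 green apples.\nSally has 3 green apples.\n\nAnswer: 6"
reasoning, answer = split_answer(sample)
print(reasoning)
print(answer)
```

Logging the reasoning alongside the answer gives you something to inspect when the output keeps being wrong.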
