Chain-of-Thought Prompting

Five words — "Let's think step-by-step" — took model accuracy from 17.7% to 78.7% on math problems. Chain-of-thought is the most impactful technique per character written.

March 30, 2026 · 6 min read · 2 / 6

Add five words to almost any prompt and you'll get meaningfully better results:

"Let's think step-by-step."

That's chain-of-thought (CoT) prompting in its simplest form. It asks the model to show its reasoning process — and the act of reasoning out loud dramatically improves accuracy.

The Research

The paper "Large Language Models are Zero-Shot Reasoners" tested what happens when you add "Let's think step-by-step" to prompts, specifically on arithmetic and logical reasoning tasks where LLMs traditionally struggle.

Results on the MultiArith arithmetic benchmark:

Prompt type                                Accuracy
Standard zero-shot                         17.7%
Zero-shot + "Let's think step-by-step"     78.7%

A 61-percentage-point jump from five words. No examples, no complex setup — just telling the model to reason out loud before answering.
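In practice, zero-shot CoT is nothing more than string concatenation before the model call. A minimal sketch (the `with_cot` helper is illustrative, not something from the paper):

```python
def with_cot(prompt: str) -> str:
    """Append the zero-shot chain-of-thought trigger to any prompt.

    The phrase matches the setup in "Large Language Models are
    Zero-Shot Reasoners"; everything else here is illustrative.
    """
    return prompt.rstrip() + "\n\nLet's think step-by-step."


question = "John has 5 apples and gives away 2. How many are left?"
print(with_cot(question))
```

Whatever library you use to call the model, the technique is just this suffix on the user message.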

The study also tested misleading or irrelevant phrases like "Don't think, just feel" and "Abracadabra." Those stayed close to the unmodified baseline, which suggests the gain comes not from extra tokens in general but from instructions that actually elicit reasoning. "Let's think step-by-step" scored best among the templates tested.

Why It Works

Remember: LLMs only "think" while they're generating output. They don't reason silently and then write the answer. They produce the answer token by token.

Without chain-of-thought, the model goes directly from your question to a final answer — no intermediate reasoning. For simple tasks, this is fine. For multi-step problems, it fails:

Plain text
Question: John has 5 apples. He gives Mary 2 red ones and keeps the green ones. Sally has 3 apples, all green. How many green apples total?

Zero-shot model: 6 (correct here, but by pattern-matching rather than by tracking which apples are green)

With chain-of-thought, the model is forced to work through the problem step by step in the output before concluding:

Plain text
Let's think step-by-step.
John has 5 apples total, giving away 2 red ones, so he has 3 left.
If the ones he gave were red, his remaining apples are not red. The problem says he "keeps the green ones", so his remaining 3 are green.
Sally has 3 apples, all green.
Total green apples: John's 3 + Sally's 3 = 6.
Answer: 6

The answer was 6 either way. But the path to get there is the important part: reasoning out loud catches mistakes that jumping to a conclusion would miss.
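The state-tracking that the reasoning chain performs can be written out explicitly in code, which is essentially what the CoT trace is doing in prose (the variable names below are mine, not from the example):

```python
# Track each person's apples by color instead of pattern-matching totals.
john = {"red": 2, "green": 3}   # 5 apples; the 2 red ones will go to Mary
sally = {"red": 0, "green": 3}  # 3 apples, all green

john["red"] -= 2                # John gives Mary his 2 red apples

green_total = john["green"] + sally["green"]
print(green_total)              # 6, matching the chain-of-thought answer
```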

Why LLMs Struggle with Math and Logic

LLMs are pattern predictors, not calculators. When you ask "what is 847 × 23?", the model isn't multiplying — it's predicting the most likely tokens after that sequence. Simple arithmetic works because facts like 2 + 2 = 4 appear constantly in training data. Complex multi-step arithmetic fails because the model can't track state across calculation steps without writing them down.

Similarly, "flip a coin" in an LLM doesn't give you a 50/50 result. The model predicts the most likely token — and if its training data has "heads" appearing more often in similar contexts, you'll get heads more often. LLMs are not random number generators.

Chain-of-thought prompting compensates for this by forcing the model to externalize its reasoning, making each step explicit and checkable.
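The 847 × 23 example becomes checkable once each intermediate product is written down, which is exactly what a CoT trace does in prose:

```python
# Externalize 847 x 23 the way a chain-of-thought trace would:
# break one hard step into several easy ones and write each down.
a, b = 847, 23
partial_20 = a * 20            # 847 x 20 = 16940
partial_3 = a * 3              # 847 x  3 =  2541
total = partial_20 + partial_3 # 16940 + 2541 = 19481
print(total)                   # 19481

assert total == a * b          # every intermediate step is checkable
```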

Example 1: Diagnosing a Plant Problem

Without CoT:

Plain text
My plant's leaves are turning yellow. What's wrong?

→ "Could be overwatering, underwatering, or nutrient deficiency."

With CoT:

Plain text
My plant's leaves are turning yellow. Diagnose this step-by-step.

Output:

Step 1 — Check watering issues. Overwatering (most common): Are leaves yellow and wilting? Is the soil constantly wet? Does the pot have drainage holes? Overwatering suffocates roots, preventing nutrient absorption. Underwatering: Are leaves dry and crispy? Is the soil dried out?

Step 2 — Examine light conditions...

Step 3 — Assess nutrient deficiency...

Step 4 — Look for other clues...

Quick diagnostic questions: [list]

The second response is actually useful — it walks through a diagnostic process you can follow. The first is nearly meaningless.

Example 2: Building Complex Features Step-by-Step

Chain-of-thought isn't just for problem diagnosis. When building complex features, explicitly structuring the steps gets better results:

Plain text
Let's build a complete export/import system for a Prompt Library, step-by-step.

Step 1: Analyze what data needs to be exported
- All prompts with their metadata, ratings, and notes

Step 2: Design the export JSON schema
- Include version number for future compatibility
- Add export timestamp and statistics
- Provide complete prompts array

Step 3: Create the export function
- Gather all data from localStorage
- Validate data integrity
- Create a Blob and trigger download with timestamp in filename

Step 4: Create the import function
- Read the uploaded JSON file
- Validate structure and version number
- Check for duplicate IDs
- Merge or replace based on user choice

Step 5: Add error recovery
- Back up existing data before import
- Roll back on failure
- Provide detailed error messages

Implement this system. Let's think step-by-step.

Adding both the explicit steps and the "let's think step-by-step" instruction is redundant in a human sense — but the model responds better to explicit instruction. The phrase primes the model to maintain its reasoning chain throughout the implementation.
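The import-and-recovery steps of that prompt can be sketched in code. The original example targets a browser and localStorage; the Python stand-in below is purely illustrative, and `import_prompts` and `SUPPORTED_VERSION` are hypothetical names:

```python
import json

SUPPORTED_VERSION = 1  # hypothetical schema version for this sketch


def import_prompts(raw: str, existing: dict) -> dict:
    """Validate an export file and merge it into `existing` (id -> prompt).

    Mirrors the import and error-recovery steps: parse, check the version,
    reject duplicate IDs, and leave `existing` untouched on any failure
    (rollback by working on a copy).
    """
    data = json.loads(raw)
    if data.get("version") != SUPPORTED_VERSION:
        raise ValueError(f"unsupported export version: {data.get('version')}")
    merged = dict(existing)  # work on a copy so failures roll back cleanly
    for p in data.get("prompts", []):
        if p["id"] in merged:
            raise ValueError(f"duplicate prompt id: {p['id']}")
        merged[p["id"]] = p
    return merged
```

The design choice worth copying regardless of language: never mutate the live data during import; build the merged result on the side and swap it in only after every check passes.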

The Scale Effect

The accuracy gains from chain-of-thought grow with model size. On smaller models, zero-shot and zero-shot CoT perform similarly. On larger models, the gap becomes dramatic.

Since models are trending larger, chain-of-thought prompting will keep improving. Learning it now is investing in a technique that compounds over time.

Variations That Work

From the research, multiple CoT templates were tested. All outperformed the baseline:

Template                                         Works well for
"Let's think step-by-step"                       General purpose — use this by default
"First, ... Next, ... Finally, ..."              Sequential processes
"Let's think about this logically"               Reasoning and analysis
"Let's solve this by splitting it into steps"    Multi-part problems
"Diagnose this step-by-step"                     Debugging, troubleshooting

The specific words matter less than the fact that you're asking for explicit reasoning. But "Let's think step-by-step" is the most researched and most reliable default.
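The table above can be packaged as a small helper. The category keys are this article's groupings, not something from the research:

```python
# The tested templates from the table, keyed by the kind of task they
# fit best (the keys are this article's categories, not the paper's).
COT_TEMPLATES = {
    "general": "Let's think step-by-step.",
    "sequential": "First, ... Next, ... Finally, ...",
    "analysis": "Let's think about this logically.",
    "multi_part": "Let's solve this by splitting it into steps.",
    "debugging": "Diagnose this step-by-step.",
}


def cot_prompt(task: str, kind: str = "general") -> str:
    """Attach the best-fitting reasoning trigger to a task description."""
    return f"{task.rstrip()}\n\n{COT_TEMPLATES[kind]}"


print(cot_prompt("Why does my loop never terminate?", "debugging"))
```

Defaulting `kind` to `"general"` reflects the article's advice: when in doubt, use "Let's think step-by-step."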

Practical Application

Add it to prompts where:

  • The answer involves multiple steps
  • You're debugging something
  • You're doing any math or logic
  • You want to understand why the model reached a conclusion
  • The output keeps being wrong and you don't know why

It works in code generation too. When a model is generating a complex algorithm, "let's think step-by-step" often produces better code and inline explanations of the reasoning.

Practice: Try It Yourself

Exercise 1: Classic logic problem

Plain text
Without CoT: "If all Bloops are Razzles, and all Razzles are Lazzles, are all Bloops Lazzles?"

With CoT: "If all Bloops are Razzles, and all Razzles are Lazzles, are all Bloops Lazzles? Let's think step-by-step."

Compare the confidence and correctness of the two answers.
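You can check the syllogism itself mechanically by modeling "all X are Y" as a subset relation (the example sets below are made up):

```python
# Model "all X are Y" as X being a subset of Y, then check transitivity.
bloops = {"b1", "b2"}
razzles = bloops | {"r1"}    # all Bloops are Razzles
lazzles = razzles | {"l1"}   # all Razzles are Lazzles

assert bloops <= razzles and razzles <= lazzles
print(bloops <= lazzles)     # True: all Bloops are Lazzles
```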

Exercise 2: Word problem

Plain text
"A store sells apples for $0.50 each and oranges for $0.75 each. If I buy 3 apples and 4 oranges and pay with a $5 bill, how much change do I get? Let's think step-by-step."

Then try the same without the CoT instruction and see if the answer differs.
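For reference, the arithmetic the model should walk through, written out one step at a time:

```python
# Work the store problem the way the CoT prompt asks: one step at a time.
apples = 3 * 0.50         # 3 apples at $0.50 each = $1.50
oranges = 4 * 0.75        # 4 oranges at $0.75 each = $3.00
total = apples + oranges  # $4.50
change = 5.00 - total     # paid with a $5 bill
print(f"${change:.2f}")   # $0.50
```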

Exercise 3: Code debugging

Plain text
"Here's a function that should return the sum of all even numbers in an array, but it returns the wrong value. Debug this step-by-step. [paste buggy code]"

The step-by-step reasoning often surfaces the bug more clearly than asking "what's wrong with this?"
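If you need buggy code to paste in, here is one illustrative candidate (written for this exercise, not taken from anywhere); the inverted parity check is exactly the kind of bug a step-by-step trace surfaces:

```python
def sum_evens(nums):
    """Intended: return the sum of all even numbers in nums."""
    total = 0
    for n in nums:
        if n % 2:        # bug: `n % 2` is truthy for ODD numbers,
            total += n   # so this sums the odds instead of the evens
    return total


print(sum_evens([1, 2, 3, 4]))  # 4 (the odds 1 + 3), not the expected 6
# Fix: change the condition to `if n % 2 == 0:`
```

Tracing it step by step ("n is 1, 1 % 2 is 1, which is truthy, so we add it...") exposes the bug on the very first iteration.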
