Chain-of-Thought Prompting

Five words, "Let's think step-by-step", took model accuracy from 17.7% to 78.7% on an arithmetic reasoning benchmark. Chain-of-thought is arguably the most impactful technique per character written.

March 30, 2026

Add five words to almost any prompt and you'll often get meaningfully better results:

"Let's think step-by-step."

That's chain-of-thought (CoT) prompting in its simplest form. It asks the model to show its reasoning process -- and the act of reasoning out loud dramatically improves accuracy.

The Research

The paper "Large Language Models are Zero-Shot Reasoners" (Kojima et al., 2022) tested what happens when you add "Let's think step-by-step" to prompts, specifically on arithmetic and logical reasoning tasks where LLMs traditionally struggle.

Results on the MultiArith arithmetic benchmark:

| Prompt type | Accuracy |
| --- | --- |
| Standard zero-shot | 17.7% |
| Zero-shot + "Let's think step-by-step" | 78.7% |

A 61-percentage-point jump from five words. No examples, no complex setup -- just telling the model to reason out loud before answering.

The study also tested misleading and irrelevant phrases like "Don't think, just feel" and "Abracadabra." Those scored at or near the zero-shot baseline, far below the reasoning-oriented templates. That suggests the gain comes from prompting the model to produce reasoning tokens, not from adding just any extra phrase, with "Let's think step-by-step" performing best.
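The paper's zero-shot setup runs in two stages: one prompt elicits the reasoning, and a second prompt feeds that reasoning back and asks for the final answer. Here is a minimal Python sketch of that flow. The `generate` callable is a placeholder for whatever model client you use (an assumption, not a real library API), `fake_generate` is a hypothetical stand-in so the example runs offline, and the answer-extraction wording is paraphrased from the paper.

```python
from typing import Callable

REASONING_TRIGGER = "Let's think step by step."
# Answer-extraction wording, paraphrased from the paper; adjust per task.
ANSWER_TRIGGER = "Therefore, the answer (arabic numerals) is"


def build_reasoning_prompt(question: str) -> str:
    """Stage 1: ask the model to reason before answering."""
    return f"Q: {question}\nA: {REASONING_TRIGGER}"


def build_answer_prompt(question: str, reasoning: str) -> str:
    """Stage 2: feed the reasoning back and ask for the final answer only."""
    return f"{build_reasoning_prompt(question)} {reasoning}\n{ANSWER_TRIGGER}"


def zero_shot_cot(question: str, generate: Callable[[str], str]) -> str:
    """Run both stages with a caller-supplied text-generation function."""
    reasoning = generate(build_reasoning_prompt(question))
    return generate(build_answer_prompt(question, reasoning)).strip()


def fake_generate(prompt: str) -> str:
    """Hypothetical stand-in for a model call, used only to show the flow."""
    if prompt.endswith(ANSWER_TRIGGER):
        return " 6"
    return "John keeps 3 green apples and Sally has 3, so 3 + 3 = 6."


print(zero_shot_cot("How many green apples in total?", fake_generate))
```

In practice you would pass a function that calls your model in place of `fake_generate`. The two-stage split keeps the reasoning and the final answer separate, which makes the answer easier to parse.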

Why It Works

Remember: LLMs only "think" while they're generating output. They don't reason silently and then write the answer. They produce the answer token by token.

Without chain-of-thought, the model goes directly from your question to a final answer, with no intermediate reasoning. For simple tasks, this is fine. For multi-step problems, it often fails, and even when the number is right you cannot see how the model got there:

    Question: John has 5 apples. He gives Mary 2 red ones and keeps the green ones. Sally has 3 apples, all green. How many green apples total?

    Zero-shot model: 6 (a bare number with no visible reasoning, so there is no way to check whether it tracked state or just pattern-matched)

With chain-of-thought, the model is forced to work through the problem step by step in the output before concluding:

    Let's think step-by-step.
    John has 5 apples total, giving away 2 red ones, so he has 3 left.
    If the ones he gave were red, his remaining apples are not red.
    The problem says he "keeps the green ones", so his remaining 3 are green.
    Sally has 3 apples, all green.
    Total green apples: John's 3 + Sally's 3 = 6.
    Answer: 6

The answer is 6 here too, so the zero-shot output was not actually wrong. The path to get there is the important part: with the reasoning written out, each step can be checked, and reasoning out loud catches mistakes that jumping to a conclusion would miss.
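The same chain can be written as explicit intermediate values, which is roughly what the step-by-step output does in words. A small Python sketch, with variable names chosen only for illustration:

```python
# Each variable mirrors one line of the written-out reasoning.
john_total = 5
john_gave_away_red = 2
# The apples John keeps are the green ones.
john_green = john_total - john_gave_away_red
sally_green = 3
total_green = john_green + sally_green

steps = [
    f"John keeps {john_total} - {john_gave_away_red} = {john_green} green apples.",
    f"Sally has {sally_green} green apples.",
    f"Total green apples: {john_green} + {sally_green} = {total_green}.",
]
print("\n".join(steps))
```

Every intermediate value is visible, so a wrong step (for example, counting the red apples as green) would show up in the trace and not only in the final number.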

Why LLMs Struggle with Math and Logic

LLMs are pattern predictors, not calculators. When you ask "what is 847 × 23?", the model isn't multiplying: it's predicting the most likely tokens after that sequence. Simple arithmetic works because facts like 2 + 2 = 4 appear constantly in training data. Complex multi-step arithmetic often fails because the model can't track state across calculation steps without writing them down.
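To make "writing the steps down" concrete, here is a short Python sketch that breaks 847 × 23 into partial products, the way a step-by-step answer would. The function name is made up for this example, and it assumes a non-negative integer multiplier.

```python
def multiply_with_steps(a: int, b: int) -> tuple[int, list[str]]:
    """Multiply a by b one digit of b at a time, recording each partial product.

    Assumes b is a non-negative integer.
    """
    steps = []
    total = 0
    for position, digit_char in enumerate(reversed(str(b))):
        place_value = int(digit_char) * 10 ** position
        partial = a * place_value
        total += partial
        steps.append(f"{a} x {place_value} = {partial} (running total: {total})")
    return total, steps


result, steps = multiply_with_steps(847, 23)
print("\n".join(steps))
print("Answer:", result)
```

Each line carries the running total forward, so no step depends on state that was never written out.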

Similarly, "flip a coin" in an LLM doesn't give you a 50/50 result. The model predicts the most likely token -- and if its training data has "heads" appearing more often in similar contexts, you'll get heads more often. LLMs are not random number generators.
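The coin-flip point can be illustrated with a toy simulation in Python. The 0.7 bias is a made-up number standing in for a skewed token distribution, not a measurement of any real model.

```python
import random
from collections import Counter


def skewed_flip(rng: random.Random, p_heads: float = 0.7) -> str:
    """Mimic a model whose learned distribution favors "heads".

    The 0.7 default is invented for illustration only.
    """
    return "heads" if rng.random() < p_heads else "tails"


def fair_flip(rng: random.Random) -> str:
    """A pseudo-random coin flip with equal odds."""
    return rng.choice(["heads", "tails"])


rng = random.Random(0)  # seeded so the run is repeatable
skewed = Counter(skewed_flip(rng) for _ in range(10_000))
fair = Counter(fair_flip(rng) for _ in range(10_000))
print("skewed:", dict(skewed))
print("fair:  ", dict(fair))
```

If you need real randomness in an application, generate it in code and pass the result to the model.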

Chain-of-thought prompting compensates for this by forcing the model to externalize its reasoning, making each step explicit and checkable.

Example 1: Diagnosing a Plant Problem

Without CoT:

    My plant's leaves are turning yellow. What's wrong?
    → "Could be overwatering, underwatering, or nutrient deficiency."

With CoT:

    My plant's leaves are turning yellow. Diagnose this step-by-step.

Output:

Step 1: Check watering issues.

  • Overwatering (most common): Are leaves yellow and wilting? Is the soil constantly wet? Does the pot have drainage holes? Overwatering suffocates roots, preventing nutrient absorption.
  • Underwatering: Are leaves dry and crispy? Is the soil dried out?

Step 2: Examine light conditions...

Step 3: Assess nutrient deficiency...

Step 4: Look for other clues...

Quick diagnostic questions: [list]

The second response is actually useful -- it walks through a diagnostic process you can follow. The first is nearly meaningless.

Example 2: Building Complex Features Step-by-Step

Chain-of-thought isn't just for problem diagnosis. When building complex features, explicitly structuring the steps gets better results:

    Let's build a complete export/import system for a Prompt Library, step-by-step.

    Step 1: Analyze what data needs to be exported
    - All prompts with their metadata, ratings, and notes

    Step 2: Design the export JSON schema
    - Include version number for future compatibility
    - Add export timestamp and statistics
    - Provide complete prompts array

    Step 3: Create the export function
    - Gather all data from localStorage
    - Validate data integrity
    - Create a Blob and trigger download with timestamp in filename

    Step 4: Create the import function
    - Read the uploaded JSON file
    - Validate structure and version number
    - Check for duplicate IDs
    - Merge or replace based on user choice

    Step 5: Add error recovery
    - Back up existing data before import
    - Roll back on failure
    - Provide detailed error messages

    Implement this system. Let's think step-by-step.

Spelling out the steps and also adding "let's think step-by-step" would be redundant for a human reader, but the model tends to respond better to explicit instruction. The phrase nudges the model to keep its reasoning chain going throughout the implementation.
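For a sense of what the five steps in that prompt amount to, here is a compact Python sketch. It is an illustration under stated assumptions, not the output the prompt would produce: it uses an in-memory dict in place of localStorage and skips the Blob download, and the field names (`version`, `exportedAt`, `stats`) and `SCHEMA_VERSION` are hypothetical.

```python
import copy
import json
from datetime import datetime, timezone

SCHEMA_VERSION = 1  # hypothetical version number


def export_prompts(store: dict) -> str:
    """Steps 2-3: wrap the prompts in a versioned, timestamped payload."""
    prompts = list(store.values())
    payload = {
        "version": SCHEMA_VERSION,
        "exportedAt": datetime.now(timezone.utc).isoformat(),
        "stats": {"promptCount": len(prompts)},
        "prompts": prompts,
    }
    return json.dumps(payload, indent=2)


def import_prompts(store: dict, raw: str, mode: str = "merge") -> None:
    """Steps 4-5: validate, then merge or replace, rolling back on failure."""
    backup = copy.deepcopy(store)  # back up existing data before import
    try:
        payload = json.loads(raw)
        if not isinstance(payload, dict):
            raise ValueError("export file must contain a JSON object")
        if payload.get("version") != SCHEMA_VERSION:
            raise ValueError(f"unsupported version: {payload.get('version')!r}")
        prompts = payload.get("prompts")
        if not isinstance(prompts, list):
            raise ValueError("'prompts' must be a list")
        ids = [p["id"] for p in prompts]
        if len(ids) != len(set(ids)):
            raise ValueError("duplicate prompt IDs in import file")
        if mode == "replace":
            store.clear()
        elif mode != "merge":
            raise ValueError(f"unknown mode: {mode!r}")
        store.update({p["id"]: p for p in prompts})
    except Exception:
        store.clear()
        store.update(backup)  # roll back to the pre-import state
        raise


library = {"p1": {"id": "p1", "text": "Summarize this article.", "rating": 4}}
exported = export_prompts(library)
restored: dict = {}
import_prompts(restored, exported, mode="replace")
print(restored == library)
```

Each function maps to a numbered step in the prompt, which is the practical payoff of structuring the request: the steps double as a checklist for reviewing the result.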

The Scale Effect

The accuracy gains from chain-of-thought grow with model size. On smaller models, zero-shot and zero-shot CoT perform similarly. On larger models, the gap becomes dramatic.

Since models are trending larger, the gains from chain-of-thought prompting are likely to hold or grow. Learning it now is an investment in a technique that should stay useful.

Variations That Work

The research tested multiple CoT templates. The ones that encourage reasoning all outperformed the baseline:

| Template | Works well for |
| --- | --- |
| "Let's think step-by-step" | General purpose (use this by default) |
| "First, ... Next, ... Finally, ..." | Sequential processes |
| "Let's think about this logically" | Reasoning and analysis |
| "Let's solve this by splitting it into steps" | Multi-part problems |
| "Diagnose this step-by-step" | Debugging, troubleshooting |

The specific words matter less than the fact that you're asking for explicit reasoning. But "Let's think step-by-step" is the most researched and most reliable default.
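If you build prompts in code, the templates can live in a small lookup. Below is a minimal Python sketch. The task-type keys are labels invented for this example, and the "First, ... Next, ... Finally, ..." pattern is left out because it shapes the whole prompt instead of being appended to it.

```python
COT_TEMPLATES = {
    "general": "Let's think step-by-step.",
    "analysis": "Let's think about this logically.",
    "multi_part": "Let's solve this by splitting it into steps.",
    "debugging": "Diagnose this step-by-step.",
}


def with_cot(prompt: str, task_type: str = "general") -> str:
    """Append a chain-of-thought phrase, falling back to the general default."""
    template = COT_TEMPLATES.get(task_type, COT_TEMPLATES["general"])
    return f"{prompt.rstrip()}\n\n{template}"


print(with_cot("My plant's leaves are turning yellow.", "debugging"))
```

Unknown task types fall back to the general phrase, which matches the advice to use it as the default.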

Practical Application

Add it to prompts where:

  • The answer involves multiple steps
  • You're debugging something
  • You're doing any math or logic
  • You want to understand why the model reached a conclusion
  • The output keeps being wrong and you don't know why

It works in code generation too. When a model is generating a complex algorithm, "let's think step-by-step" often produces better code and inline explanations of the reasoning.

