Choosing the Right Model
Different models have different strengths, costs, and deprecation timelines. Picking the right model for the job is a core prompt engineering skill.
Most people find one model they like and use it for everything. This is understandable -- once you've gotten comfortable with how a model talks, switching feels like learning a new tool. But in professional contexts, model selection is a real engineering decision.
Models Are Different
All major LLMs are powerful. But they differ in:
| Factor | What it affects |
|---|---|
| Speed | How long you wait for a response |
| Cost | How much an API call costs (matters for apps running thousands of calls/day) |
| Capability | Complex reasoning, code quality, creative writing |
| Context window | How much conversation/code it can hold |
| Training cutoff | How current its knowledge is |
| Strengths | Some excel at code, others at analysis or creative writing |
The Cost Consideration
When building AI applications, model cost scales directly with usage. A real example:
An AI fintech app was using Haiku (smaller, faster, cheaper) to classify financial transactions. When Sonnet 4 was released, it cost roughly 80% more per call, and the app was running thousands of classifications per day.
The question wasn't "which model is smarter?" It was "does Sonnet 4 classify transactions measurably better than Haiku for this specific task?" For a classification task where Haiku was already 99% accurate, the answer was no. Haiku stayed.
Choosing the most capable model by default isn't engineering -- it's just spending more money.
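To put numbers on that trade-off, a back-of-the-envelope estimate is usually enough. The sketch below assumes per-million-token pricing; the `PRICING` table, the `estimateDailyCost` helper, and the workload figures are all hypothetical placeholders, not real provider rates, so plug in current prices before drawing conclusions.

```javascript
// Hypothetical prices in USD per million tokens. NOT real provider rates.
const PRICING = {
  small: { inputPerMTok: 1.0, outputPerMTok: 5.0 },
  large: { inputPerMTok: 3.0, outputPerMTok: 15.0 },
};

// Estimate daily spend for a workload with a fixed token footprint per call.
function estimateDailyCost(price, callsPerDay, inputTokens, outputTokens) {
  const perCall =
    (inputTokens * price.inputPerMTok + outputTokens * price.outputPerMTok) / 1_000_000;
  return perCall * callsPerDay;
}

// Assumed workload: 5,000 classifications/day, 400 input and 10 output tokens each.
const smallCost = estimateDailyCost(PRICING.small, 5000, 400, 10);
const largeCost = estimateDailyCost(PRICING.large, 5000, 400, 10);

// How many times more the larger model costs for the same workload.
const costRatio = largeCost / smallCost;
```

The absolute numbers matter less than the ratio: if the larger model doesn't buy a measurable accuracy gain on your task, that multiplier is pure overhead.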
Deprecation Is Real
Models get deprecated. Haiku could be gone tomorrow. GPT-4 is already deprecated. This has a practical implication:
When you build an AI application, write it so you can swap the model without rewriting everything.
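```javascript
// Flexible: the model name comes from config, not a hardcoded string
import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment
const MODEL = process.env.AI_MODEL ?? "claude-sonnet-4-6";

const response = await anthropic.messages.create({
  model: MODEL,
  max_tokens: 1024, // required by the Messages API
  messages: [...]
});
```
This way, when your model gets deprecated (and it will), you change one configuration value and re-test with the new model.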
Try Different Providers
There's an often-cited pattern in brand loyalty: once you find a brand you like at age 18-24, you tend to stay a loyal customer for life. The same is true for AI tools -- most people stick with their first comfortable model forever.
Push against this. Each provider and model has genuine strengths:
- Claude (Anthropic): Strong at nuanced reasoning, following complex instructions, long context
- GPT series (OpenAI): Widely used, strong code generation, great ecosystem
- Gemini (Google): Strong multimodal capabilities, improving rapidly
- Haiku / smaller models: Fast and cheap for high-volume, lower-complexity tasks
What's worth trying:
- Use a different model for a week and see how it handles your specific tasks
- When cost matters, benchmark smaller models against larger ones for your specific use case
- For AI applications, A/B test model outputs before committing
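Benchmarking doesn't need a framework to get started. One minimal approach, sketched here under assumptions: run each model over the same labeled examples, store the outputs, and score them offline. The example data, the model names `small` and `large`, and the helpers `accuracy` and `scoreModels` are all made up for illustration. In practice the predictions would come from real API calls.

```javascript
// Labeled examples from your own data (these are invented for illustration).
const labeled = [
  { input: "UBER *TRIP 8842", expected: "transport" },
  { input: "WHOLEFDS MKT #102", expected: "groceries" },
  { input: "NETFLIX.COM", expected: "subscriptions" },
  { input: "SHELL OIL 5521", expected: "transport" },
];

// Predictions collected by running each model on the same inputs, in order.
// Hypothetical outputs; real ones come from your API calls.
const predictions = {
  small: ["transport", "groceries", "subscriptions", "groceries"],
  large: ["transport", "groceries", "subscriptions", "transport"],
};

// Fraction of predictions that exactly match the expected label.
function accuracy(preds, examples) {
  if (preds.length !== examples.length) {
    throw new Error("predictions and examples must have the same length");
  }
  const correct = preds.filter((p, i) => p === examples[i].expected).length;
  return correct / examples.length;
}

// Score every model against the same labeled set.
function scoreModels(predsByModel, examples) {
  return Object.fromEntries(
    Object.entries(predsByModel).map(([name, preds]) => [name, accuracy(preds, examples)])
  );
}

const scores = scoreModels(predictions, labeled);
```

Four examples is far too few to decide anything; the point is the shape of the comparison. Use a few hundred real cases from your own workload before trusting the gap between two models.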
Same Model, Different Behavior
The same model (e.g., Claude Sonnet 4.6) will behave differently in:
- Claude.ai chat
- GitHub Copilot
- Cursor
- Your own API call
Why? The system message. Each tool sets a different invisible system prompt that shapes how the model responds. This isn't a bug -- it's by design.
It also means you can't directly compare "Claude in chat" vs. "Claude in Copilot" -- you're comparing Claude with different system messages.
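To see what that difference looks like at the API level, here is a rough sketch that builds three requests for the same model and the same user prompt, changing only the system prompt. The `buildRequest` helper and both system prompts are invented for illustration; actual products don't publish theirs in this form. The one real detail is that Anthropic's Messages API takes `system` as a top-level parameter rather than as a message role.

```javascript
// Build Messages API request parameters, varying only the system prompt.
function buildRequest(model, systemPrompt, userPrompt) {
  const request = {
    model,
    max_tokens: 512,
    messages: [{ role: "user", content: userPrompt }],
  };
  // `system` is a top-level field in Anthropic's Messages API, not a message role.
  if (systemPrompt) request.system = systemPrompt;
  return request;
}

const model = "claude-sonnet-4-6"; // model ID reused from the example earlier in this article
const question = "Explain what a race condition is.";

// Hypothetical system prompts standing in for what a chat app or editor might set.
const chatStyle = buildRequest(
  model,
  "You are a friendly general-purpose assistant.",
  question
);
const editorStyle = buildRequest(
  model,
  "You are a coding assistant inside an editor. Answer tersely, code first.",
  question
);
// A bare API call with no system prompt at all.
const bareApi = buildRequest(model, null, question);
```

All three requests name the same model and carry the same user message, yet you should expect three noticeably different answers. When you compare tools, this is the variable you're really testing.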
Which Model to Use for What
A rough guide for everyday decisions:
Complex architecture decisions, detailed code review:
→ Use the most capable model available (Sonnet 4.6, GPT-5, etc.)
Routine code generation, documentation:
→ Mid-tier (Claude Haiku, GPT-4o mini)
High-volume classification, extraction, summaries:
→ Smallest model that meets your accuracy bar (often Haiku-class)
Creative writing, brainstorming:
→ Try different models -- creative quality varies noticeably
The Prompting Technique Connection
Your prompting technique should also change based on model size:
- Smaller models respond better to simpler, very explicit prompts. Few-shot examples become more important to guide behavior.
- Larger models can handle more ambiguity and implicit context. A well-structured zero-shot prompt often works.
Chain-of-thought prompting (coming up) shows dramatically better results on larger models than smaller ones. Keep this in mind when debugging unexpected outputs from a smaller/cheaper model.
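One way to act on this is to let the prompt builder know which size of model it's talking to. The sketch below is an assumption about how you might structure that, not an established recipe: `buildPrompt` and the `tier` label are hypothetical names, and `tier` is something you assign yourself, not an API field.

```javascript
// Build a classification prompt whose shape depends on model size.
// `tier` is a label we assign ourselves ("small" or "large"), not an API field.
function buildPrompt(tier, instruction, input, examples = []) {
  const lines = [instruction];
  if (tier === "small") {
    // Smaller models: spell out the output format and include few-shot examples.
    lines.push("Answer with exactly one label and nothing else.");
    for (const ex of examples) {
      lines.push(`Input: ${ex.input}`, `Label: ${ex.label}`);
    }
  }
  // Larger models get the zero-shot form: instruction plus the input.
  lines.push(`Input: ${input}`, "Label:");
  return lines.join("\n");
}

// Made-up few-shot examples for a transaction classifier.
const fewShot = [
  { input: "UBER *TRIP 8842", label: "transport" },
  { input: "NETFLIX.COM", label: "subscriptions" },
];
const instruction = "Classify the transaction as transport, groceries, or subscriptions.";

const smallPrompt = buildPrompt("small", instruction, "SHELL OIL 5521", fewShot);
const largePrompt = buildPrompt("large", instruction, "SHELL OIL 5521", fewShot);
```

If a smaller model starts returning odd outputs, check which form of the prompt it received before blaming the model.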
Practical Checklist
When choosing a model for a new project:
- What's the task? (reasoning, code, classification, creative)
- How often will this run? (once vs. thousands/day)
- What's the acceptable error rate?
- What's my token budget per call?
- Can I swap the model if this one gets deprecated?
- Have I actually benchmarked this on real examples?
Don't commit to a model because it's the most popular or the one you used last time. Commit because you've tested it for your specific task.
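Once you have measured results, the checklist reduces to a simple rule: pick the cheapest model that meets your error rate and budget. A minimal sketch, assuming you've already benchmarked each candidate; the model names, error rates, per-call costs, and the `pickModel` helper are all placeholders for illustration.

```javascript
// Candidates with measured results from your own benchmark.
// All names and numbers are placeholders.
const candidates = [
  { name: "small-model", errorRate: 0.01, costPerCall: 0.0005 },
  { name: "mid-model", errorRate: 0.008, costPerCall: 0.002 },
  { name: "large-model", errorRate: 0.005, costPerCall: 0.006 },
];

// Return the cheapest candidate that satisfies both constraints, or null if none does.
function pickModel(models, { maxErrorRate, maxCostPerCall }) {
  const eligible = models.filter(
    (m) => m.errorRate <= maxErrorRate && m.costPerCall <= maxCostPerCall
  );
  if (eligible.length === 0) return null;
  return eligible.reduce((best, m) => (m.costPerCall < best.costPerCall ? m : best));
}

// High-volume classification: a loose error bar and a tight budget.
const forClassification = pickModel(candidates, { maxErrorRate: 0.01, maxCostPerCall: 0.001 });
// A stricter task: tighter error bar, more budget per call.
const forStrictTask = pickModel(candidates, { maxErrorRate: 0.006, maxCostPerCall: 0.01 });
// Constraints no candidate can meet.
const impossible = pickModel(candidates, { maxErrorRate: 0.001, maxCostPerCall: 0.001 });
```

A `null` result is useful information too: it means the requirements need revisiting, not that you should quietly fall back to the biggest model.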