Choosing the Right Model

Most people find one model they like and use it for everything. This is understandable -- once you've gotten comfortable with how a model talks, switching feels like learning a new tool. But in professional contexts, model selection is a real engineering decision.

Models Are Different

All major LLMs are powerful. But they differ in:

Factor	What it affects
Speed	How long you wait for a response
Cost	How much an API call costs (matters for apps running thousands of calls/day)
Capability	Complex reasoning, code quality, creative writing
Context window	How much conversation/code it can hold
Training cutoff	How current its knowledge is
Strengths	Some excel at code, others at analysis or creative writing

The Cost Consideration

When building AI applications, model cost scales directly with usage. A real example:

An AI fintech app was using Haiku (smaller, faster, cheaper) to classify financial transactions. When Sonnet 4 was released, it was ~80% more expensive, and the app was running thousands of classifications per day.

The question wasn't "which model is smarter?" It was "does Sonnet 4 classify transactions measurably better than Haiku for this specific task?" For a classification task where Haiku was already 99% accurate, the answer was no. Haiku stayed.

Choosing the most capable model by default isn't engineering -- it's just spending more money.
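
To make "cost scales with usage" concrete, here is a minimal sketch of the arithmetic. The prices and token counts are made-up placeholders (the larger model is simply priced 80% higher, mirroring the example above), and the names `ModelPricing`, `Workload`, and `dailyCostUsd` are hypothetical. Plug in your provider's current rates before drawing conclusions.

```typescript
// Hypothetical per-million-token prices in USD; replace with real rates.
interface ModelPricing {
  inputPerMTok: number;
  outputPerMTok: number;
}

interface Workload {
  callsPerDay: number;
  inputTokensPerCall: number;
  outputTokensPerCall: number;
}

// Estimated daily spend for a workload on a given model.
function dailyCostUsd(pricing: ModelPricing, load: Workload): number {
  const inputCost = (load.inputTokensPerCall / 1_000_000) * pricing.inputPerMTok;
  const outputCost = (load.outputTokensPerCall / 1_000_000) * pricing.outputPerMTok;
  return load.callsPerDay * (inputCost + outputCost);
}

// Illustrative numbers only, not real price lists.
const smallModel: ModelPricing = { inputPerMTok: 1, outputPerMTok: 5 };
const largeModel: ModelPricing = { inputPerMTok: 1.8, outputPerMTok: 9 };

const classification: Workload = {
  callsPerDay: 5_000,
  inputTokensPerCall: 400,
  outputTokensPerCall: 20,
};

const smallDaily = dailyCostUsd(smallModel, classification);
const largeDaily = dailyCostUsd(largeModel, classification);
```

With these placeholder numbers the two models come out at roughly $2.50 and $4.50 per day. The absolute figures matter less than the habit: put the extra spend next to the measured accuracy gain and decide whether the gain is worth it.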

Deprecation is Real

Models get deprecated. Haiku could be gone tomorrow. GPT-4 is already deprecated. This has a practical implication:

When you build an AI application, write it so you can swap the model without rewriting everything.

TypeScript

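import Anthropic from "@anthropic-ai/sdk";

// The client reads ANTHROPIC_API_KEY from the environment
const anthropic = new Anthropic();

// Flexible: model name is a config, not hardcoded
const MODEL = process.env.AI_MODEL ?? "claude-sonnet-4-6";

const response = await anthropic.messages.create({
  model: MODEL,
  max_tokens: 1024, // required by the Messages API
  messages: [{ role: "user", content: "..." }],
});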

This way, when your model gets deprecated (and it will), you change one configuration value and re-test with the new model.
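
If you also want to be able to swap providers, not just model names, one option is a thin adapter layer between your application and any SDK. The sketch below is an assumption about how you might structure that, not an established pattern from a library: `Complete`, `ModelEntry`, `registry`, `getModel`, and `summarize` are hypothetical names, and the adapters are stubs so the example runs offline.

```typescript
// A provider-agnostic shape for "send a prompt, get text back".
// All names here are hypothetical; they are not part of any SDK.
type Complete = (prompt: string) => Promise<string>;

interface ModelEntry {
  id: string; // model identifier you would pass to the provider
  complete: Complete; // adapter that would wrap the provider's SDK call
}

// Stub adapters stand in for real SDK calls.
const registry: Record<string, ModelEntry> = {
  primary: { id: "model-a", complete: async (p) => `[model-a] ${p}` },
  fallback: { id: "model-b", complete: async (p) => `[model-b] ${p}` },
};

// Application code asks for a role, never for a specific model ID.
function getModel(role: string): ModelEntry {
  const entry = registry[role];
  if (!entry) {
    throw new Error(`No model configured for role "${role}"`);
  }
  return entry;
}

async function summarize(text: string, role = "primary"): Promise<string> {
  return getModel(role).complete(`Summarize: ${text}`);
}
```

Replacing a deprecated model then means editing one registry entry (and re-running your tests), while `summarize` and its callers stay untouched.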

Try Different Providers

There's a familiar pattern in consumer behavior: find a brand you like at age 18-24 and you tend to stay a loyal customer for life. The same is true for AI tools -- most people stick with their first comfortable model forever.

Push against this. Each provider and model has genuine strengths:

Claude (Anthropic): Strong at nuanced reasoning, following complex instructions, long context
GPT series (OpenAI): Widely used, strong code generation, great ecosystem
Gemini (Google): Strong multimodal capabilities, improving rapidly
Haiku / smaller models: Fast and cheap for high-volume, lower-complexity tasks

What's worth trying:

Use a different model for a week and see how it handles your specific tasks
When cost matters, benchmark smaller models against larger ones for your specific use case
For AI applications, A/B test model outputs before committing
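
Benchmarking on your own task can be as simple as running each candidate over the same labeled examples and comparing accuracy. Here is a minimal sketch under stated assumptions: `Classifier`, `accuracy`, and `compareModels` are hypothetical names, the example data is invented, and the two stub classifiers stand in for real model calls so the harness runs without an API key.

```typescript
interface LabeledExample {
  input: string;
  expected: string;
}

// In a real setup, each Classifier would wrap one model's API call.
type Classifier = (input: string) => Promise<string>;

// Fraction of examples where the prediction matches the expected label.
async function accuracy(classify: Classifier, examples: LabeledExample[]): Promise<number> {
  let correct = 0;
  for (const ex of examples) {
    const predicted = await classify(ex.input);
    if (predicted.trim().toLowerCase() === ex.expected.toLowerCase()) {
      correct++;
    }
  }
  return examples.length === 0 ? 0 : correct / examples.length;
}

// Scores every candidate on the same examples.
async function compareModels(
  candidates: Record<string, Classifier>,
  examples: LabeledExample[],
): Promise<Record<string, number>> {
  const scores: Record<string, number> = {};
  for (const [name, classify] of Object.entries(candidates)) {
    scores[name] = await accuracy(classify, examples);
  }
  return scores;
}

// Invented examples and stub classifiers, for illustration only.
const examples: LabeledExample[] = [
  { input: "STARBUCKS #1234", expected: "dining" },
  { input: "LOCAL DINER", expected: "dining" },
  { input: "SHELL OIL 5567", expected: "fuel" },
  { input: "CITY TRANSIT PASS", expected: "transport" },
];
const baselineStub: Classifier = async () => "dining";
const rulesStub: Classifier = async (input) => (input.includes("SHELL") ? "fuel" : "dining");

const scoresPromise = compareModels(
  { "candidate-a": baselineStub, "candidate-b": rulesStub },
  examples,
);
```

Four examples are far too few for a real decision. Use a sample large enough that the accuracy difference you care about is not just noise.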

Same Model, Different Behavior

The same model (e.g., Claude Sonnet 4.6) will behave differently in:

Claude.ai chat
GitHub Copilot
Cursor
Your own API call

Why? Mainly the system message. Each tool sets a different system prompt, invisible to you, that shapes how the model responds. Tools may also use different sampling settings and attach different context. This isn't a bug -- it's by design.

It also means you can't directly compare "Claude in chat" vs. "Claude in Copilot" -- you're comparing Claude with different system messages.
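
To see what "same model, different system message" looks like in practice, here is a small sketch that builds two requests for the same question. The `ChatRequest` shape is a simplified assumption loosely modeled on chat-style APIs, `buildRequest` is a hypothetical helper, and both system prompts are invented for illustration; they are not the actual prompts of any product.

```typescript
// Simplified request shape; real SDKs have more fields.
interface ChatRequest {
  model: string;
  system: string;
  messages: { role: "user" | "assistant"; content: string }[];
}

function buildRequest(system: string, userText: string): ChatRequest {
  return {
    model: "claude-sonnet-4-6",
    system,
    messages: [{ role: "user", content: userText }],
  };
}

const question = "How do I reverse a string?";

// Invented system prompts: one chat-style, one editor-style.
const asChatAssistant = buildRequest(
  "You are a friendly general-purpose assistant. Explain your answer step by step.",
  question,
);
const asEditorPlugin = buildRequest(
  "You are a code completion engine. Reply with code only, no prose.",
  question,
);
```

Both requests name the same model and carry the same user message. Only the `system` field differs, and that alone is enough to produce very different responses.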

Which Model to Use for What

A rough guide for everyday decisions:

Complex architecture decisions, detailed code review:
→ Use the most capable model available (Sonnet 4.6, GPT-5, etc.)

Routine code generation, documentation:
→ A cheaper, faster tier (Claude Haiku, GPT-4o mini)

High-volume classification, extraction, summaries:
→ Smallest model that meets your accuracy bar (often Haiku-class)

Creative writing, brainstorming:
→ Try different models -- creative quality varies noticeably
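
If your application handles several kinds of tasks, you can encode a guide like this as a routing table instead of scattering model names through the code. This is a sketch under stated assumptions: the `Task` and `Tier` names, the mapping, and the placeholder model IDs are all hypothetical, and in practice the IDs would come from configuration.

```typescript
type Tier = "top" | "mid" | "small";

type Task =
  | "architecture"
  | "code-review"
  | "codegen"
  | "docs"
  | "classification"
  | "extraction"
  | "summary";

// Mirrors the rough guide above. Creative work is left out on purpose,
// because the guide suggests comparing models rather than fixing a tier.
const tierForTask: Record<Task, Tier> = {
  "architecture": "top",
  "code-review": "top",
  "codegen": "mid",
  "docs": "mid",
  "classification": "small",
  "extraction": "small",
  "summary": "small",
};

// Placeholder model IDs; swap in whatever your benchmarks support.
const modelForTier: Record<Tier, string> = {
  top: "example-large-model",
  mid: "example-mid-model",
  small: "example-small-model",
};

function pickModel(task: Task): string {
  return modelForTier[tierForTask[task]];
}
```

Treat the table as a starting default. If a benchmark shows the small tier meets your accuracy bar for a task, move that task down a tier.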

The Prompting Technique Connection

Your prompting technique should also change based on model size:

Smaller models respond better to simpler, very explicit prompts. Few-shot examples become more important to guide behavior.
Larger models can handle more ambiguity and implicit context. A well-structured zero-shot prompt often works.

Chain-of-thought prompting (coming up) shows dramatically better results on larger models than smaller ones. Keep this in mind when debugging unexpected outputs from a smaller/cheaper model.
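
The zero-shot versus few-shot difference described above is easy to see side by side. Below is a minimal sketch of two prompt builders for a classification task; the function names, the prompt wording, and the `Shot` type are hypothetical choices for illustration, not a recommended template.

```typescript
interface Shot {
  input: string;
  label: string;
}

// Zero-shot: instructions only. Often enough for a larger model.
function zeroShotPrompt(labels: string[], input: string): string {
  return [
    `Classify the transaction into one of: ${labels.join(", ")}.`,
    "Reply with the category name only.",
    "",
    `Transaction: ${input}`,
    "Category:",
  ].join("\n");
}

// Few-shot: same instructions plus worked examples, which tend to
// help smaller models stay on format.
function fewShotPrompt(labels: string[], shots: Shot[], input: string): string {
  const worked = shots
    .map((s) => `Transaction: ${s.input}\nCategory: ${s.label}`)
    .join("\n\n");
  return [
    `Classify the transaction into one of: ${labels.join(", ")}.`,
    "Reply with the category name only.",
    "",
    worked,
    "",
    `Transaction: ${input}`,
    "Category:",
  ].join("\n");
}

const labels = ["dining", "fuel", "transport"];
const shots: Shot[] = [
  { input: "SHELL OIL 5567", label: "fuel" },
  { input: "LOCAL DINER", label: "dining" },
];

const zeroShot = zeroShotPrompt(labels, "CITY TRANSIT PASS");
const fewShot = fewShotPrompt(labels, shots, "CITY TRANSIT PASS");
```

When a smaller model gives unexpected output, switching from the zero-shot builder to the few-shot one is a cheap first experiment before reaching for a larger model.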

Practical Checklist

When choosing a model for a new project:

What's the task? (reasoning, code, classification, creative)
How often will this run? (once vs. thousands/day)
What's the acceptable error rate?
What's my token budget per call?
Can I swap the model if this one gets deprecated?
Have I actually benchmarked this on real examples?

Don't commit to a model because it's the most popular or the one you used last time. Commit because you've tested it for your specific task.
