ML Foundations: Three Pillars

Explore the three major buckets of machine learning: from finding the best model to the massive compute power required for optimization.

May 15, 2026 · 3 min read · 4 / 5

The refund predictor we built in the previous posts ran on four users. We picked three features, set three thresholds, and checked them by hand. That is fine for a demo. It breaks completely at scale.

The discipline of machine learning exists to solve that scaling problem. It does so through three distinct areas of work.

The Search Space Problem

Before getting to the three pillars, here is the core problem they are all solving.

Imagine I have 10 features and I want to test 50 possible threshold values for each. The number of possible boundary combinations I would need to check is:

$$50^{10} \approx 9.77 \times 10^{16} \approx \text{98 quadrillion}$$

And for each of those combinations, I'd need to run every one of my 100,000 training examples through the model to measure accuracy. This is computationally impossible to brute-force. The three pillars are how we make it tractable.
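The arithmetic is easy to check directly. The short sketch below counts the combinations and the total number of example evaluations; the throughput of one billion evaluations per second is an assumption chosen only to give a sense of scale.

```python
# Size of the brute-force search described above.
n_features = 10          # features in the example
n_thresholds = 50        # candidate threshold values per feature
n_examples = 100_000     # training examples scored per combination

combinations = n_thresholds ** n_features
evaluations = combinations * n_examples

# Assumed throughput, purely for illustration: one billion evaluations per second.
evals_per_second = 1_000_000_000
seconds_per_year = 60 * 60 * 24 * 365
years = evaluations / evals_per_second / seconds_per_year

print(f"{combinations:,} combinations")
print(f"{evaluations:.2e} example evaluations")
print(f"about {years:,.0f} years at {evals_per_second:,} evaluations per second")
```

Under that assumed throughput the search would take on the order of 300,000 years, which is why checking every combination is not an option.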

Figure: the three pillars of machine learning (model selection, optimization, and representative data)

Pillar 1: Model Selection

A Decision Tree is one type of model. There are many others: logistic regression, support vector machines, random forests, gradient boosted trees, and neural networks.

Each makes different assumptions about the shape of the data and the type of patterns it can capture. A Decision Tree draws axis-aligned, rectangular boundaries. A neural network can draw curves. An SVM finds a maximum-margin hyperplane, often in a higher-dimensional space.

Choosing the right model architecture for the problem is the first major decision. Get it wrong and no amount of training will fix it.
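A toy comparison makes the point concrete. In this sketch the data is made up: the true boundary is a diagonal line, and two model families are compared on it. One is a depth-1 tree (a "stump") that can only threshold a single feature; the other is a linear rule on the sum of both features. All names here are illustrative.

```python
from itertools import product

# Toy data (made up for illustration): points on a 10x10 grid,
# labeled 1 when they fall above the diagonal x + y = 9.
points = list(product(range(10), repeat=2))
labels = [1 if x + y > 9 else 0 for x, y in points]

def accuracy(predict):
    """Fraction of points where predict(point) matches the label."""
    hits = sum(predict(p) == label for p, label in zip(points, labels))
    return hits / len(points)

# Model family A: a depth-1 tree (a "stump") that thresholds one feature.
# Search every feature, threshold, and direction and keep the best.
best_stump_accuracy = max(
    accuracy(lambda p, f=f, t=t, flip=flip: int((p[f] > t) != flip))
    for f in range(2)
    for t in range(10)
    for flip in (False, True)
)

# Model family B: a linear rule on the sum of both features.
linear_accuracy = accuracy(lambda p: int(p[0] + p[1] > 9))

print(f"best stump:  {best_stump_accuracy:.2f}")
print(f"linear rule: {linear_accuracy:.2f}")
```

Even with an exhaustive search over its parameters, the stump cannot reach the accuracy of the linear rule, because a single axis-aligned cut cannot follow a diagonal. A deeper tree could approximate the diagonal with a staircase of cuts, but it would need many more splits to do so.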

Pillar 2: Optimization

Once a model type is chosen, we need to find the parameters that maximize accuracy on the sample without overfitting. Since brute-force is off the table, we use optimization shortcuts.

For Decision Trees, the technique is a greedy algorithm: find the single best split (one feature and one threshold) for the data at hand, lock it in, then repeat on each resulting subset. This does not guarantee the globally optimal set of parameters, but it finds a good one fast.
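One greedy step can be sketched as follows. This version scans every feature and every observed threshold and keeps the single best split, as standard tree learners such as CART do. Gini impurity is used as the split score here, which is one common choice rather than anything this series prescribes. The feature names and rows are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels (0.0 means perfectly pure)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(rows, labels):
    """Greedy step: try every feature and threshold, keep the single best."""
    best = None  # (weighted impurity, feature index, threshold)
    for feature in range(len(rows[0])):
        for threshold in sorted({row[feature] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
            if not left or not right:
                continue  # a split that leaves one side empty separates nothing
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, feature, threshold)
    return best

# Hypothetical rows: (days_since_purchase, order_total); label 1 = refund.
rows = [(2, 40), (5, 90), (9, 35), (20, 80), (30, 45), (45, 95)]
labels = [1, 1, 1, 0, 0, 0]

score, feature, threshold = best_split(rows, labels)
print(feature, threshold, score)
```

The cost of one step is roughly features × thresholds × examples, instead of thresholds raised to the power of the number of features. A full tree applies the same step recursively to the left and right subsets.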

For neural networks, the technique is gradient descent: calculate which direction each parameter should move to reduce error, then take a small step in that direction. Repeat thousands of times.
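The same loop can be shown on a model far smaller than a neural network. The sketch below fits a straight line `y = w * x + b` to made-up data by gradient descent on the mean squared error. The data, learning rate, and step count are assumptions chosen for illustration.

```python
# Toy data (made up): y is roughly 2 * x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 7.1, 8.8]

w, b = 0.0, 0.0          # parameters, started at zero
learning_rate = 0.02     # size of each step (an assumed value)

def mean_squared_error(w, b):
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

initial_error = mean_squared_error(w, b)

for step in range(5000):
    # Gradient of the mean squared error with respect to w and b.
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / len(xs)
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / len(xs)
    # Step against the gradient, because that is the direction that lowers the error.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

final_error = mean_squared_error(w, b)
print(f"w={w:.2f}, b={b:.2f}, error {initial_error:.2f} -> {final_error:.4f}")
```

A neural network follows the same recipe with many more parameters, and the gradients are computed automatically by backpropagation rather than written out by hand.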

The key difference: greedy algorithms are fast but only locally optimal. Gradient descent is slower because it takes many small steps, but it scales to far more complex models.

Pillar 3: Representative Data

This one is not about math. It is about honesty.

If the sample does not reflect the population, the best algorithm in the world produces a confidently wrong model. If past human decisions embedded biases (for example, if refunds were systematically denied to users in certain zip codes), the model learns to replicate that bias. It does not know those patterns are wrong.

And as we saw with Max in the drift post, the population can change in ways the sample did not anticipate. Maintaining alignment between sample and population is an ongoing operational responsibility, not a one-time task.
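One simple way to watch for that kind of mismatch is to compare how often each category appears in the training sample and in recent live data. The sketch below is a minimal version of such a check; the function names, the plan types, and the 10-point tolerance are all hypothetical.

```python
from collections import Counter

def proportions(values):
    """Share of each category in a list of values."""
    counts = Counter(values)
    return {key: count / len(values) for key, count in counts.items()}

def drifted_categories(sample, live, tolerance=0.10):
    """Categories whose share differs by more than `tolerance` between the two lists."""
    sample_share = proportions(sample)
    live_share = proportions(live)
    drifted = {}
    for key in sorted(set(sample_share) | set(live_share)):
        gap = live_share.get(key, 0.0) - sample_share.get(key, 0.0)
        if abs(gap) > tolerance:
            drifted[key] = round(gap, 2)
    return drifted

# Hypothetical plan types in the training sample vs. recent live traffic.
training_sample = ["free"] * 70 + ["pro"] * 30
live_traffic = ["free"] * 45 + ["pro"] * 35 + ["enterprise"] * 20

print(drifted_categories(training_sample, live_traffic))
```

Here the check flags `enterprise`, which never appeared in the training sample, and `free`, whose share dropped sharply. A check like this says nothing about whether the labels themselves carry bias; that still requires a human review of how the past decisions were made.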

In the next post, we look at the practical workflow of taking all of this from a prototype into a live API.

The Essentials

  1. Model Selection: The architecture of the converter determines what kinds of patterns it can learn. Decision Trees, SVMs, and neural networks each have different strengths.
  2. Optimization: The technique for finding the best parameters efficiently. Greedy algorithms for Decision Trees; gradient descent for neural networks.
  3. Representative Data: No algorithm fixes a biased or unrepresentative sample. The model will confidently learn the wrong patterns.
  4. The Search Space: With 10 features and 50 possible values each, there are roughly 98 quadrillion combinations to check. Optimization exists to make this tractable.

Further Reading and Watching