ML Decision-Making Automation

Learn how to turn manual human decisions into automated patterns using features, samples, and historical labels.

May 15, 2026 · 3 min read · 1 / 5

The thing that tripped me up when I first thought about automating decisions was the data. Not the algorithm, not the math. Just: where does the data come from, and what shape does it need to be in?

In the previous post, we established that ML moves from writing explicit logic to discovering rules from examples. Here is what that looks like with concrete data: a refund request that needs to be approved or denied.

Building the Sample

Every automated decision system starts with a sample: a set of past events where we already know the outcome.

For a refund system, that sample is a database of past refund requests. Each request has two things attached to it:

  1. Background information: the details of the request.
  2. The decision: what a human actually did (refund or not).

We encode the decision as 1 (refund given) or -1 (refund denied). This is the label, the target we want our rules to reproduce.
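As a minimal sketch of that encoding, here is one way the mapping could look in Python. The raw decision strings (`"refund"`, `"denied"`) and the name `LABEL_FOR` are my own assumptions for illustration; a real refund database may record decisions differently.

```python
# Hypothetical raw decisions, as a human reviewer might have logged them.
raw_decisions = ["refund", "denied", "refund"]

# Assumed mapping, following the convention above:
# 1 = refund given, -1 = refund denied.
LABEL_FOR = {"refund": 1, "denied": -1}

# Turn each logged decision into its numeric label.
labels = [LABEL_FOR[decision] for decision in raw_decisions]

print(labels)  # [1, -1, 1]
```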

What Goes Into the Background Information?

Not all data is equally useful. The data points we actually feed into a model are called features. Picking the right ones matters enormously.

For a refund request, useful features might include:

  • Refund Amount: How much money is being requested?
  • Account Length: How long has this person been a customer?
  • IP Account Count: How many accounts are registered from the same IP address?
  • Time Since Delivery: How long after the delivery was this filed?
  • Item Type: Perishable food vs. a non-perishable item changes the risk profile.

The key insight: we are not using every piece of data available. We are selecting the features we believe carry signal, meaning patterns that correlate with whether a refund is legitimate.
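To make the selection step concrete, here is a rough sketch that keeps only the five features listed above and drops everything else. The record, its field names, and its values are all made up for illustration; they are assumptions, not a real schema.

```python
# Hypothetical raw record. Every field name and value here is invented.
raw_request = {
    "request_id": "r-001",          # bookkeeping, not a feature
    "browser": "firefox",           # available, but we choose not to use it
    "refund_amount": 120.0,         # dollars requested
    "account_age_days": 400,        # how long the person has been a customer
    "ip_account_count": 1,          # accounts registered from the same IP
    "days_since_delivery": 2,       # gap between delivery and the request
    "item_type": "perishable",      # perishable vs. non-perishable
}

# The features we believe carry signal (the list from this section).
FEATURE_NAMES = [
    "refund_amount",
    "account_age_days",
    "ip_account_count",
    "days_since_delivery",
    "item_type",
]

def select_features(record):
    """Keep only the chosen features; ignore every other field."""
    return {name: record[name] for name in FEATURE_NAMES}

features = select_features(raw_request)
print(features)
```

The point of the sketch is the explicit `FEATURE_NAMES` list: choosing what goes in it is a human judgment made before any model is involved.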

[Figure: Sample table showing James and Denver with features and labels]

Sample vs. Population

The sample is a subset of the population ($\forall$): every refund request that has ever happened or ever will. We care about the population, but we only have access to the sample.
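A toy illustration of why that gap matters, using entirely synthetic data. The "population" below is 1,000 invented requests, and the rule that generates their labels (deny accounts younger than 200 days) is an assumption made only so the numbers are easy to follow.

```python
# Synthetic "population": 1,000 made-up requests, one per account age.
# Toy assumption: requests from accounts younger than 200 days were denied.
population = [
    {"account_age_days": age, "label": -1 if age < 200 else 1}
    for age in range(1000)
]

def denial_rate(requests):
    """Fraction of requests labeled -1 (refund denied)."""
    denied = sum(1 for r in requests if r["label"] == -1)
    return denied / len(requests)

# A sample spread evenly across the population (every 10th request).
even_sample = population[::10]

# A skewed sample: only long-standing accounts made it into the database.
skewed_sample = [r for r in population if r["account_age_days"] >= 500]

print(denial_rate(population))     # 0.2
print(denial_rate(even_sample))    # 0.2
print(denial_rate(skewed_sample))  # 0.0
```

The evenly spread sample shows the same denial rate as the population, while the skewed one contains no denials at all. Rules derived from the skewed sample would never see what a denied request looks like.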

Two users from our sample:

  • James: 4 years active, $80 refund, 2 accounts on the IP. Label: 1 (Refund).
  • Denver: Active for under a week, $500 refund, 7 accounts on the IP. Label: -1 (Denied).

If I can find a set of rules that correctly produces 1 for James and -1 for Denver, and does so because it captured a real pattern, not a coincidence, I can apply those same rules to a future request and make a reasonable prediction.
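Here is a hand-written stand-in for such a set of rules, just to show the shape of the idea. The thresholds (30 days, 3 accounts) are arbitrary assumptions of mine, and Denver's account age is set to 5 days as a placeholder for "under a week". Nothing here is learned from data, and two examples are far too few to tell a real pattern from a coincidence.

```python
# James and Denver from the sample, with their known labels.
samples = [
    {"name": "James", "account_age_days": 4 * 365, "refund_amount": 80,
     "ip_account_count": 2, "label": 1},
    # 5 days is an assumed stand-in for "under a week".
    {"name": "Denver", "account_age_days": 5, "refund_amount": 500,
     "ip_account_count": 7, "label": -1},
]

def rule(request):
    """Hand-written rule; both thresholds are arbitrary assumptions."""
    suspicious = (
        request["account_age_days"] < 30
        and request["ip_account_count"] > 3
    )
    return -1 if suspicious else 1

# Check that the rule reproduces the historical labels.
predictions = [rule(s) for s in samples]
for s, predicted in zip(samples, predictions):
    print(s["name"], "label:", s["label"], "predicted:", predicted)
```

The rule reproduces both labels, but only because I wrote it with these two people in mind. Whether it says anything useful about a future request is exactly the question the sample alone cannot answer.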

That set of rules is a converter. In ML, it is called a model. The next post is about building one.

The Essentials

  1. The Automation Goal: Replace manual, human-led decisions by finding patterns that map background data to outcomes.
  2. Features: The specific data points chosen as inputs to the model. Good feature selection is half the work.
  3. Labels: The encoded historical outcome (1 or -1). The model learns to reproduce these.
  4. Sample vs. Population: The sample is what we have; the population is what we care about. A model only works if the sample reflects the population.

Further Reading and Watching