Production ML: Prototyping and Engineering

Learn the workflow of moving from a prototype to a production-ready model, from feature engineering to API deployment.

May 15, 2026 · 3 min read · 5 / 5

The three pillars of ML describe what needs to happen. This post is about what it actually looks like in practice, from a data scientist sketching boundaries on a whiteboard to a model sitting behind an API endpoint serving millions of predictions per day.

There are two distinct phases. They require different mindsets, different tools, and often different people.

Figure: Prototype to production: two phases with different roles, data sizes, and engineering concerns

Phase 1: Prototype

The goal here is not to build a production model. It is to understand the data well enough to know if a model is even possible.

A data scientist typically works on 10-20% of the final dataset during prototyping. Fast iteration matters more than perfect accuracy at this stage.

Pre-processing

Raw data is almost never clean. Pre-processing prepares it for the model:

  • Rounding dollar amounts so $19.99 and $20.00 are not treated as different signals.
  • Categorizing free-text fields into buckets. "Pad Thai with tofu" becomes "Food, Perishable."
  • Filling missing data so the model does not break on rows with incomplete information.
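The three steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a real pipeline: the keyword map, the field names (`amount`, `item`, `years_active`), and the default values are all hypothetical, and a production system would use a curated taxonomy and defaults derived from the data (a median, for example).

```python
from typing import Optional

# Hypothetical keyword-to-bucket map, for illustration only.
CATEGORY_KEYWORDS = {
    "pad thai": "Food, Perishable",
    "pizza": "Food, Perishable",
    "headphones": "Electronics",
}


def round_amount(amount: float) -> int:
    """Round to the nearest whole dollar (Python rounds exact ties to even)."""
    return round(amount)


def categorize(description: str) -> str:
    """Map a free-text description to a bucket via simple keyword matching."""
    text = description.lower()
    for keyword, bucket in CATEGORY_KEYWORDS.items():
        if keyword in text:
            return bucket
    return "Other"


def fill_missing(value: Optional[float], default: float) -> float:
    """Replace a missing value with a caller-supplied default."""
    return default if value is None else value


def preprocess(row: dict, defaults: dict) -> dict:
    """Apply rounding, categorization, and missing-value filling to one row."""
    amount = fill_missing(row.get("amount"), defaults["amount"])
    return {
        "amount": round_amount(amount),
        "category": categorize(row.get("item") or ""),
        "years_active": fill_missing(row.get("years_active"), defaults["years_active"]),
    }


raw = {"amount": 19.99, "item": "Pad Thai with tofu", "years_active": None}
print(preprocess(raw, {"amount": 0.0, "years_active": 1.0}))
```

Here the $19.99 order and a $20.00 order end up with the same `amount`, and the row with a missing `years_active` no longer breaks downstream code.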

Feature Engineering

This is where domain knowledge pays off. Instead of feeding raw features into the model, we create new ones.

For example: instead of "account age" and "number of refunds" separately, we create refund rate (refunds per year of account age). A 1-year account with 20 refunds is very different from a 10-year account with 20 refunds, and raw numbers alone don't capture that.
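As a rough sketch, the refund rate feature is a single division. The function name and the one-month floor on account age are assumptions added here so that brand-new accounts do not cause a division by zero; the right floor depends on the data.

```python
def refund_rate(num_refunds: int, account_age_years: float,
                min_age_years: float = 1 / 12) -> float:
    """Refunds per year of account age.

    The age is clamped to min_age_years (an assumed floor of one month)
    so that very new accounts do not divide by zero.
    """
    return num_refunds / max(account_age_years, min_age_years)


print(refund_rate(20, 1.0))   # 20 refunds on a 1-year account
print(refund_rate(20, 10.0))  # 20 refunds on a 10-year account
```

The two accounts from the example come out at 20 and 2 refunds per year, which is the difference the raw counts hide.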

Handling Imbalanced Data

In fraud detection, roughly 99% of refund requests are legitimate. A model that always predicts "approve" would be 99% accurate on the sample, but completely useless.

The metrics we care about are recall and precision on the fraud class. Recall asks: of all the fraudulent requests, how many did we catch? Precision asks: of all the requests we flagged as fraud, how many actually were?
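To make the gap between accuracy and the fraud-class metrics concrete, here is a small sketch that scores an "always approve" model. The label encoding (1 = fraud, 0 = legitimate) and the 1,000-request sample with 10 fraud cases are assumptions chosen to match the 99% figure above.

```python
def fraud_metrics(y_true, y_pred):
    """Accuracy plus precision and recall for the positive (fraud = 1) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return {
        "accuracy": correct / len(y_true),
        # Define both as 0.0 when the denominator is empty.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Assumed sample: 1,000 requests, 10 of them fraudulent.
y_true = [1] * 10 + [0] * 990
always_approve = [0] * 1000

print(fraud_metrics(y_true, always_approve))
# {'accuracy': 0.99, 'precision': 0.0, 'recall': 0.0}
```

The model scores 99% accuracy while catching none of the fraud, which is why accuracy alone is not a useful target here.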

One fix is SMOTE (Synthetic Minority Over-sampling Technique): generating synthetic fraud examples to give the model more signal on the rare class.
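The core idea of SMOTE is interpolation: pick a real fraud example, pick one of its nearest fraud neighbours, and create a new point somewhere on the line between them. The sketch below is a simplified, standard-library version written for illustration; the function name and parameters are made up here, and real projects would more likely use an existing implementation such as the `SMOTE` class in the imbalanced-learn package. Oversampling should be applied to the training split only, so the evaluation data still reflects the real class balance.

```python
import math
import random


def smote_sketch(minority, n_synthetic, k=3, seed=0):
    """Create synthetic minority points by interpolating toward near neighbours.

    minority: list of feature vectors (lists of floats) from the rare class.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority examples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # The k nearest minority neighbours of the base point, excluding itself.
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Made-up fraud examples with two features each.
fraud_examples = [[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [3.0, 2.0]]
for point in smote_sketch(fraud_examples, n_synthetic=3, k=2, seed=42):
    print(point)
```

Every synthetic point lies between two real fraud examples, so the new data stays inside the region the rare class already occupies.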

Phase 2: Production Engineering

Once the prototype proves out, the focus shifts to correctness, scale, and reliability.

Scaling the Training Data

The prototype ran on maybe 100,000 examples. The production model needs to train on millions. Gradient-boosted tree libraries such as XGBoost (Extreme Gradient Boosting) are a common choice for tabular data at this scale: they are optimized for large datasets, handle missing values natively, and produce feature importance scores that help explain what the model relies on.

Deploying the Trained Model

A trained model is just a set of numbers, the boundary values that produced the best accuracy on the training set. Those numbers get serialized and served behind an API:

JSON
// POST /api/v1/refund-predict
{
  "yearsActive": 1.5,
  "amount": 45.00,
  "ipAccounts": 2
}

// Response
{
  "prediction": 1,
  "confidence": 0.94
}

The model artifact gets loaded at startup, and each request runs through the same boundary-checking logic we built from scratch.
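A minimal sketch of that load-once, predict-per-request pattern, assuming the model is nothing more than a handful of boundary values stored as JSON. The boundary names and values, the majority-vote rule, and the reading of `1` as "fraud" are all hypothetical. The response here omits `confidence`, since how that number is computed depends on the model type.

```python
import json

# Hypothetical serialized artifact. In practice this would be read from a
# file or a model registry, and the values would come out of training.
ARTIFACT = json.dumps({
    "maxYearsActive": 2.0,
    "minAmount": 40.0,
    "minIpAccounts": 2,
})


def load_model(serialized: str) -> dict:
    """Deserialize the boundary values. Called once at startup."""
    return json.loads(serialized)


MODEL = load_model(ARTIFACT)


def predict(request: dict, model: dict = MODEL) -> dict:
    """Run one request through the boundary checks (assumed: 1 = fraud)."""
    checks = [
        request["yearsActive"] < model["maxYearsActive"],
        request["amount"] > model["minAmount"],
        request["ipAccounts"] >= model["minIpAccounts"],
    ]
    # Flag the request when a majority of boundary checks trip.
    return {"prediction": int(sum(checks) * 2 > len(checks))}


print(predict({"yearsActive": 1.5, "amount": 45.00, "ipAccounts": 2}))
```

Loading the artifact once at startup keeps each request cheap: the handler does no I/O and only compares the incoming features against numbers already in memory.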

Ongoing Monitoring

A model that was accurate at launch will drift over time. New fraud patterns emerge. Seasonal behaviour shifts. This is concept drift in practice -- the same problem we saw with Max and his bot farm.

Production ML includes continuous monitoring: tracking prediction distributions, flagging outliers, and triggering retraining when the model's accuracy drops below a threshold. Tools like MLflow track experiments, log metrics, and version model artifacts so you can roll back when something goes wrong.
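One simple way to track the prediction distribution is to compare the share of requests flagged in a recent window against the share seen at launch. The sketch below assumes that approach; the class name, window size, and tolerance are illustrative, and it does not use MLflow or any other tool's API. Checking accuracy directly would also need true labels, which often arrive much later than the predictions.

```python
from collections import deque


class DriftMonitor:
    """Flags drift when the recent flagged rate moves away from a baseline."""

    def __init__(self, baseline_rate: float, window: int = 1000,
                 tolerance: float = 0.05):
        self.baseline_rate = baseline_rate
        self.tolerance = tolerance
        self.recent = deque(maxlen=window)  # keeps only the latest predictions

    def record(self, prediction: int) -> None:
        """Store one prediction (assumed: 1 = flagged as fraud, 0 = approved)."""
        self.recent.append(prediction)

    def flagged_rate(self) -> float:
        """Share of flagged predictions in the current window."""
        return sum(self.recent) / len(self.recent) if self.recent else 0.0

    def needs_retraining(self) -> bool:
        """True once a full window deviates from the baseline by > tolerance."""
        if len(self.recent) < self.recent.maxlen:
            return False  # not enough data yet
        return abs(self.flagged_rate() - self.baseline_rate) > self.tolerance


monitor = DriftMonitor(baseline_rate=0.01, window=100, tolerance=0.05)
for prediction in [1] * 20 + [0] * 80:  # a window where 20% are flagged
    monitor.record(prediction)
print(monitor.flagged_rate(), monitor.needs_retraining())
```

A jump from 1% to 20% flagged does not prove the model is wrong, but it is a cheap signal that the incoming data has changed and that someone should look before accuracy numbers catch up.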

This entire section covers decision trees, sample vs. population, features, overfitting, and drift. It is the foundation. The next section drops the refund predictor entirely and introduces a new problem: detecting a smile in a grid of pixels. That problem will take us into neural networks.

The Essentials

  1. Pre-processing: Cleaning and shaping raw data before training. Missing values, rounding, and categorization all happen here.
  2. Feature Engineering: Creating new inputs from existing ones using domain knowledge. Often the highest-leverage work in the entire pipeline.
  3. Imbalanced Data: When one class is rare, accuracy is a misleading metric. Use precision and recall instead, and consider SMOTE to balance the training set.
  4. Production Deployment: A trained model is a serialized set of numbers served behind an API. Monitoring drift and triggering retraining are operational responsibilities, not afterthoughts.
