How great products are made: Rules of Machine Learning by Google, a Summary

Google recently published some nuggets of ML wisdom, i.e., their best practices in ML Engineering. I believe that everyone should know about such best practices, so let’s look at them.

Please read the following statement a couple of times before moving on:

Do machine learning like the great engineer you are, not like the great machine learning expert you aren’t.

Now that the truth has been spoken, we can proceed with a recap of the rules.

Phase 0 – Before ML: understand whether the time is right for building a machine learning system

Rule #1: Don’t be afraid to launch a product without machine learning: If ML is not absolutely required for your product, start with some simple heuristics and wait until you have enough data.

Rule #2: First, design and implement metrics: Add metrics, and then add some more – track as much as you can.

Rule #3: Choose machine learning over a complex heuristic: Start with simple heuristics, but move on to machine learning once the heuristics become complex and hard to maintain.

Phase 1 – Your First ML Pipeline

Rule #4: Keep the first model simple and get the infrastructure right: Don’t start with fancy models and features. Focus on fixing the infrastructure issues first, e.g., ensure that your (simple) features are correctly reaching the model.

Rule #5: Test the infrastructure independently from the machine learning: Have a testable infrastructure. Test data flow into the algorithm and its processing. Test getting models from the training algorithm.

Rule #6: Be careful about dropped data when copying pipelines: When you copy an existing pipeline, check that it does not drop data that your new pipeline needs.

Rule #7: Turn heuristics into features, or handle them externally: Don’t just discard existing heuristics related to the ML problem you are trying to solve. For example: don’t try to relearn the definition of “blacklisted”; if there is a heuristic computing some relevance score, use that score as a feature; if there is a heuristic based on many pieces of information, feed those inputs into the learning algorithm separately.


Rule #8: Know the freshness requirements of your system: Measure the relationship between performance degradation and frequency of model update, i.e., how much performance suffers when you have a day-old model vs. a week-old model.

Rule #9: Detect problems before exporting models: Mistakes in user-facing models are costly, so don’t be frugal with sanity checks before you export, e.g., verify that the model performs reasonably on held-out data.
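
As a rough illustration of such a sanity check, here is a minimal Python sketch that computes the area under the ROC curve (AUC) on held-out data and refuses to export below a threshold. The function names and the 0.75 threshold are made up for this example; use whatever check fits your own system.

```python
def auc(labels, scores):
    """Area under the ROC curve by pairwise comparison (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need both positive and negative examples")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def safe_to_export(labels, scores, min_auc=0.75):
    """Hypothetical export gate: True only if held-out AUC clears the bar."""
    return auc(labels, scores) >= min_auc


# Held-out labels and the candidate model's scores (toy values).
held_out_labels = [1, 0, 1, 0]
held_out_scores = [0.9, 0.8, 0.3, 0.1]
print(auc(held_out_labels, held_out_scores))
print(safe_to_export(held_out_labels, held_out_scores, min_auc=0.8))
```

The pairwise version is quadratic in the number of examples, which is fine for a small held-out sample but not for a large one.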

Rule #10: Watch for silent failures: Be wary of problems that easily go undetected, like stale tables. Manually checking the data once in a while doesn’t hurt.

Rule #11: Give feature columns owners and documentation: This rule is for large systems with lots of features.

Your First Objective (a number that your algorithm aims to optimize)

Rule #12: Don’t overthink which objective you choose to directly optimize: For example, if your goal is to increase the number of clicks and time spent on the site, then in the early stages optimizing for one will likely improve the other as well. So early on, don’t obsess about achieving an increase in one specific metric only.

Rule #13: Choose a simple, observable and attributable metric for your first objective: Go for directly observable user behaviors related to some particular action, e.g., “Did the user click the first document?”. Initially, avoid modeling indirect effects, e.g., “Did the user come back the next day?”. And do not try to model user satisfaction or happiness; these are hard to measure and are better inferred through proxies, e.g., a happy user will stay on your site longer. Human judgement is still needed to connect the objective you optimize to your product goals.

Rule #14: Starting with an interpretable model makes debugging easier: Don’t ignore the good old linear/logistic regression for some fancier black-box algorithm.

Rule #15: Separate Spam Filtering and Quality Ranking in a Policy Layer: While doing quality ranking assume that the content was posted in good faith.

Phase 2 – Feature Engineering

Rule #16: Plan to launch and iterate: What you should think about: Is it easy to add/remove/recombine features? Is it easy to create a new copy of the pipeline and verify its correctness? Is it possible to have two or three copies running in parallel?

Rule #17: Start with directly observed and reported features as opposed to learned features: A learned feature is, e.g., a feature generated by some unsupervised clustering algorithm or by the learner itself via deep learning.

Rule #18: Explore with features of content that generalize across contexts: For example, using features from YouTube search, like number of co-watches, as features for YouTube’s Watch-Next.

Rule #19: Use very specific features when you can: Don’t shy away from using groups of features where each feature applies to a very small fraction of the data, but the group’s overall coverage is above 90%. (Regularization can then help to eliminate features that cover too few examples.)

Rule #20: Combine and modify existing features to create new features in human-understandable ways: Two common approaches are “discretizations”, which turn a continuous feature into discrete features (think of histograms), and “crosses”, which combine two or more features. Say you have gender and nationality as two features; you can cross the two to create a combined feature for Norwegian women (beware of overfitting though).
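
To make the two ideas concrete, here is a small Python sketch using only the standard library. The age boundaries, feature values, and the "_x_" separator are assumptions picked for illustration, not something the rules prescribe.

```python
import bisect


def discretize(value, boundaries):
    """Return the index of the bucket that `value` falls into.

    `boundaries` must be sorted; a value equal to a boundary goes
    into the bucket to its right.
    """
    return bisect.bisect_right(boundaries, value)


def cross(*feature_values):
    """Combine several categorical values into one crossed feature."""
    return "_x_".join(str(v) for v in feature_values)


# Hypothetical age buckets: <18, 18-34, 35-64, 65+
age_bucket = discretize(25, [18, 35, 65])
print(age_bucket)                      # 1
print(cross("female", "norwegian"))    # female_x_norwegian
```

A crossed feature can take as many values as the product of its inputs, which is where the overfitting risk comes from.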

Rule #21: The number of feature weights you can learn in a linear model is roughly proportional to the amount of data you have: The key is to scale your learning to the size of your data — 1000 examples, a dozen features; 10M examples, a hundred thousand features.

Rule #22: Clean up features you are no longer using: Keep your infrastructure clean and avoid technical debt.

Human Analysis of the System: how to look at an existing model, and improve it

Rule #23: You are not a typical end user: When you have a product that looks decent enough to be released, stop testing “your baby” on your own. Further testing should be done either by paying laypeople via a crowdsourcing platform, or through a live experiment on real users. In this way, you not only avoid confirmation bias but also save the valuable time of your engineers, which makes it the less costly option.

Rule #24: Measure the delta between models: To measure the change that is going to be perceived by the user, calculate the difference in results between a new model and a model that is in production. If the difference is significant then ensure that the change is good enough. Also, make sure that when a model is compared with itself, the delta is low (ideally zero).
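
One simple way to put a number on this delta for a ranking system is sketched below in Python: run the same queries through both models and count how often the top results differ. The function name, the dictionary layout, and the choice of k are assumptions for this example; a position-weighted measure would be a natural refinement.

```python
def ranking_delta(old_results, new_results, k=3):
    """Fraction of shared queries whose top-k result lists differ."""
    queries = old_results.keys() & new_results.keys()
    if not queries:
        raise ValueError("no queries in common")
    changed = sum(old_results[q][:k] != new_results[q][:k] for q in queries)
    return changed / len(queries)


# Toy results: query -> ranked list of document ids.
production = {"cats": [1, 2, 3], "dogs": [4, 5, 6]}
candidate = {"cats": [1, 2, 3], "dogs": [4, 6, 5]}

print(ranking_delta(production, candidate))   # 0.5
print(ranking_delta(production, production))  # 0.0, a model vs. itself
```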

Rule #25: When choosing models, utilitarian performance trumps predictive power: If there is some change that improves loss but degrades the performance of the system, look for another feature.

Rule #26: Look for patterns in the measured errors, and create new features: Use examples where the model went wrong to look for trends falling outside your feature set. For example, if the model seems to demote longer posts, then add post length as a feature. (But don’t try to guesstimate “long”. Just add a lot of features for it, and let the model figure out the rest.)

Rule #27: Try to quantify observed undesirable behavior:  “Measure first, optimize second”.

Rule #28: Be aware that identical short-term behavior does not imply identical long-term behavior: If you want to know how a system would behave long-term, you need to have trained it only on data acquired when the model was live, which is very difficult.

Training-Serving Skew: difference in training vs serving performance 

Rule #29: The best way to make sure that you train like you serve is to save the set of features used at serving time, and then pipe those features to a log to use them at training time: The YouTube home page saw significant quality improvements when the team started to log features at serving time. If you can’t do this for every example, do it at least for a fraction of the data.
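
Here is a minimal Python sketch of the idea, assuming a JSON-lines log and a model that is just a callable; both are stand-ins chosen for this example, and a real system would write to its own logging infrastructure.

```python
import io
import json


def serve_and_log(features, model, log_file):
    """Score one request and append the exact features used to a log."""
    prediction = model(features)
    record = {"features": features, "prediction": prediction}
    log_file.write(json.dumps(record, sort_keys=True) + "\n")
    return prediction


def read_training_features(log_file):
    """Read the logged feature dicts back for use at training time."""
    return [json.loads(line)["features"] for line in log_file if line.strip()]


# Usage with an in-memory log and a toy model (both hypothetical).
log = io.StringIO()
toy_model = lambda f: 2.0 * f["watch_count"]
serve_and_log({"watch_count": 3.0}, toy_model, log)
log.seek(0)
logged = read_training_features(log)
print(logged)
```

Because training reads what serving actually saw, the two cannot drift apart through separate feature computation.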

Rule #30: Importance-weight sampled data, don’t arbitrarily drop it!: If you are going to sample example X with a 30% probability, then give it a weight of 10/3. (Don’t take files 1-12 and just ignore files 13-99.)
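
The arithmetic is simply weight = 1 / sampling probability. A small Python sketch, with a made-up function name and a fixed seed so that runs are repeatable:

```python
import random


def sample_with_weights(examples, keep_prob, seed=0):
    """Keep each example with probability `keep_prob`.

    Every kept example gets weight 1 / keep_prob, so the weighted
    total still estimates the size of the full data set.
    """
    rng = random.Random(seed)
    weight = 1.0 / keep_prob
    return [(ex, weight) for ex in examples if rng.random() < keep_prob]


sampled = sample_with_weights(range(10_000), keep_prob=0.3)
total_weight = sum(w for _, w in sampled)
print(len(sampled), round(total_weight))  # roughly 3000 kept, weights sum to roughly 10000
```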

Rule #31: Beware that if you join data from a table at training and serving time, the data in the table may change: If you are not logging features at serving time, and the table changes between training and serving time, your model’s prediction for the same example may then differ between training and serving. So either log features at serving time, or take snapshots of tables at regular intervals.

Rule #32: Re-use code between your training pipeline and your serving pipeline whenever possible: Share code between the two pipelines, and ideally use the same programming language for both, since different languages make code sharing very hard.

Rule #33: If you produce a model based on the data until January 5th, test the model on the data from January 6th and after: Testing on newer data helps to understand how the system will behave in production.
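
In code, this is a split on the example date rather than a random split. A minimal Python sketch, where the "day" field name and the year are assumptions made for the example:

```python
from datetime import date


def split_by_date(examples, last_training_day):
    """Train on everything up to the cutoff, test on what comes after."""
    train_set = [ex for ex in examples if ex["day"] <= last_training_day]
    test_set = [ex for ex in examples if ex["day"] > last_training_day]
    return train_set, test_set


examples = [
    {"day": date(2024, 1, 4), "label": 1},
    {"day": date(2024, 1, 5), "label": 0},
    {"day": date(2024, 1, 6), "label": 1},
    {"day": date(2024, 1, 7), "label": 0},
]
train_set, test_set = split_by_date(examples, date(2024, 1, 5))
print(len(train_set), len(test_set))  # 2 2
```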

Rule #34: In binary classification for filtering (such as spam detection or determining interesting emails), make small short-term sacrifices in performance for very clean data: To gather clean data, you could label 1% of all traffic as “held out”, and send all held out examples to the user. These held out examples can then become training data.

Rule #35: Beware of the inherent skew in ranking problems: When the ranking algorithm undergoes radical changes, and results start to change, then the data that the algorithm is going to see in the future has also changed. To handle the skew introduced by such changes: use higher regularization on features that cover more queries than on features that fire for only one query, allow features to have only positive weights (so that any good feature scores better than an “unknown” one), and don’t have document-only features.

Rule #36: Avoid feedback loops with positional features: Position highly affects click rates, and it is important to separate out positional features from the rest of the model. If you train your model with a positional feature, you won’t have this feature at serving time — you would be scoring candidates before their display order has been decided. However, you could sum a function of positional features with a function of the other features.
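
The additive form in the last sentence can be sketched as follows in Python. This is only an illustration under simple assumptions: a linear content term, a lookup table of per-position biases, and made-up weights and names.

```python
def score(content_features, content_weights, position=None, position_bias=None):
    """Additive score: g(content features) + f(position)."""
    content_term = sum(
        content_weights[name] * value
        for name, value in content_features.items()
    )
    if position is None:
        # Serving: display order is not decided yet, so leave out f(position).
        return content_term
    # Training: add the bias for the position the item was shown at.
    return content_term + position_bias[position]


weights = {"query_match": 2.0}   # hypothetical learned weight
bias = {1: 0.5, 2: 0.1}          # hypothetical learned position biases

training_score = score({"query_match": 1.0}, weights, position=1, position_bias=bias)
serving_score = score({"query_match": 1.0}, weights)
print(training_score, serving_score)  # 2.5 2.0
```

Because the positional term is only added and never crossed with content features, dropping it at serving time leaves the relative order of candidates intact.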

Rule #37: Measure Training/Serving Skew: Skew can be broken down into three parts: the difference between the performance on the training data and the holdout data, the difference between the performance on the holdout data and the “next-day” data, and the difference between the performance on the “next-day” data and the live data.
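
If you track the same metric at each of these stages, the gaps are simple differences. A small Python sketch, where the stage names and the metric values are invented for illustration:

```python
STAGES = ["training", "holdout", "next_day", "live"]


def skew_report(metric_by_stage):
    """Gap in the metric between each stage and the one after it."""
    return {
        f"{a} -> {b}": metric_by_stage[a] - metric_by_stage[b]
        for a, b in zip(STAGES, STAGES[1:])
    }


# Hypothetical accuracy measured at each stage.
report = skew_report(
    {"training": 0.90, "holdout": 0.88, "next_day": 0.85, "live": 0.80}
)
for gap_name, gap in report.items():
    print(f"{gap_name}: {gap:.2f}")
```

A large gap at one particular step tells you where to look first.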

Phase 3 – Slowed Growth, Optimization Refinement, and Complex Models

Rule #38: Don’t waste time on new features if unaligned objectives have become the issue: If the product goals are not covered by the existing model’s objective, then change either your objective or your product goals.

Rule #39: Launch decisions are a proxy for long-term product goals: Metrics measurable in A/B tests, like daily active users, monthly active users, and ROI, are only a proxy for longer-term goals like satisfying users, increasing users, satisfying partners, and profit. These long-term goals can then also be seen as proxies for having a useful, high quality product and a company that is not going to the graveyard soon. However, no metric covers the ultimate concern, “where is my product going to be five years from now?”

Rule #40: Keep ensembles simple: Each model should either be an ensemble only taking the input of other models, or a base model taking many features, but not both. Also, ensure that when the predicted probability of a base model increases, the predicted probability of the ensemble does not decrease.
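
The second property can be checked empirically. The Python sketch below sweeps one base model’s score from 0 to 1 while holding the others fixed and verifies that the ensemble output never goes down. The function name and the two toy ensembles are assumptions for this example, and a sweep like this is a spot check, not a proof.

```python
def is_monotone(ensemble, base_scores, index, steps=20):
    """True if raising base_scores[index] never lowers the ensemble output."""
    previous = None
    for i in range(steps + 1):
        probe = list(base_scores)
        probe[index] = i / steps
        current = ensemble(probe)
        if previous is not None and current < previous:
            return False
        previous = current
    return True


average = lambda scores: sum(scores) / len(scores)
contrarian = lambda scores: scores[1] - scores[0]  # penalizes base model 0

print(is_monotone(average, [0.5, 0.5], index=0))     # True
print(is_monotone(contrarian, [0.5, 0.5], index=0))  # False
```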

Rule #41: When performance plateaus, look for qualitatively new sources of information to add rather than refining existing signals: Start building the infrastructure for radically different features.

Rule #42: Don’t expect diversity, personalization, or relevance to be as correlated with popularity as you think they are: If your system is measuring clicks, time spent, watches etc., it means you are measuring the popularity of the content, so features for diversity or personalization may end up with less weight than you expect.

Rule #43: Your friends tend to be the same across different products. Your interests tend not to be: A model of how close two users are tends to transfer well from one product to another, whereas personalization features often do not. Still, knowing that a user has been active on two different products can be a useful piece of information.

And it’s a wrap! Keep these rules in mind when working on a real project: don’t try to reinvent the wheel, listen to the experts.

                                                    .  .  .

Thanks for reading! If you want to be notified when I write something new, you can sign up to the low volume mailing list.

Reference and Further Reading

Rules of Machine Learning: Best Practices for ML Engineering, Martin Zinkevich

















