Nitpicking Machine Learning Technical Debt

Revisiting a resurging NeurIPS 2015 paper (and 25 best practices more relevant than that for 2020)

UPDATE 07/09/2020: A Japanese translation of this post is now available (Japanese Translation Part 1, Japanese Translation Part 2), thanks to Hono Shirai.

Background for this post

I recently revisited the paper Hidden Technical Debt in Machine Learning Systems (Sculley et al. 2015) (which I’ll refer to as the Tech Debt Paper throughout this post for the sake of brevity and clarity). This was a paper presented at NeurIPS 2015, but it sort of fell to the background because at the time everyone was swooning over projects based on this new “Generative Adversarial Networks” technique from Ian Goodfellow.

Now the Tech Debt Paper is making a comeback. At the time of writing, 25 papers have cited it in the last 75 days. This is understandable, as machine learning has gotten to the point where we need to worry about technical debt. However, if a lot of people are going to be citing this paper (even if only to cite every paper that has the phrase “machine learning technical debt” in it), we should at least be aware of which parts have and have not stood the test of time. With that in mind, I figured it would save a lot of time and trouble for everyone involved to write up which parts are outdated, and point out the newer methods that have superseded them. Having worked at companies ranging from fast-growing startups to large companies like Google (the Tech Debt Paper authors’ company), and having seen the same machine learning technical debt mistakes being made everywhere, I felt qualified to comment on this.

Notice that ML Code is the tiny and insignificant black box. One of the good parts of the Tech Debt Paper, though sadly many reimplementations of this image don’t have the different sizes of the boxes, and thus miss the point.

This post covers some of the relevant points of the Tech Debt Paper, while also giving additional advice on top that’s not 5 years out of date. Some of this advice comes in the form of tools that didn’t exist back then…and some comes in the form of tools and techniques that definitely did exist, which the authors missed a huge opportunity by not bringing up.

Introduction

Tech debt is an analogy for the long-term buildup of costs when engineers make design choices for speed of deployment over everything else. Fixing technical debt can take a lot of work. It’s the stuff that turns “Move fast and break things” into “Oh no, we went too fast and gotta clean some of this up”

Less catchy. I can understand why Mark Zuckerberg was less forthcoming about the second slogan

Okay, we know technical debt in software is bad, but the authors of this paper assert that technical debt for ML systems specifically is even worse. The Tech Debt Paper proposes a few types of tech debt in ML, and for some of them a few solutions (much like how there are different recycling bins, different types of garbage code need different approaches 🚮). Given that the Tech Debt Paper was an opinion piece originally meant to get people’s attention, it’s important to note several pieces of advice from this work that may no longer be relevant, or may have better solutions in the modern day.

Part 1: ML tech debt is worse than you thought

You’re all probably familiar by now with technical debt. The Tech Debt Paper starts with a clarification that paying down technical debt doesn’t mean adding new capabilities to existing code. It’s the less glamorous work of writing unit tests, improving readability, adding documentation, getting rid of unused sections, and other such tasks for the sake of making future development easier. Well, since standard software engineering is a subset of the skills needed in machine learning engineering, the more familiar software engineering tech debt is just a subset of the space of possible ML tech debt.

I’m paraphrasing, but this is pretty much what Sculley et al. 2015 are saying

Part 2: The Nebulous Nature of Machine Learning

The Tech Debt Paper section after the intro goes into detail about how the nebulous nature of machine learning models makes dealing with tech debt harder. A big part of avoiding or correcting technical debt is making sure the code is properly organized and segregated. The problem is that we often use machine learning precisely in cases where the rules or needs are super hard to specify in real code. Instead of hardcoding the rules that turn data into outputs, more often than not we’re giving an algorithm the data and the outputs (and sometimes not even that) and asking it to produce the rules. We don’t even know what the rules that need segregating and organizing are.

Best Practice #1: Use interpretability/explainability tools.

This is where the problem of entanglement comes in. Basically, if you change anything about a model, you risk changing the performance of the whole system. For example, take a 100-feature model on health records for individuals and add a 101st feature (say, whether or not they smoked weed). Everything’s connected. It’s almost like dealing with a chaotic system (ironically enough, a few mathematicians have tried to describe neural networks as chaotic attractors, as though they were double pendulums or weather systems).

This, but there are hundreds of strings, and the blinds can go OUT the window too.

The authors suggest a few possible fixes like ensembling models or high-dimensional visualization tools, but even these fall short if any of the ensembled model outputs are correlated, or if the data is too high-dimensional. A lot of the recommendations for interpretable ML are a bit vague. With that in mind, I recommend checking out Facebook’s high-dimensional visualization tool, as well as reading by far the best resource I’ve seen on interpretable machine learning: “Interpretable Machine Learning” by Christoph Molnar (available online here)

Sometimes using more explainable model types, like decision trees, can help with this entanglement problem, but the jury’s still out on best practices for solving it in neural networks.

Best Practice #2: Use explainable model types if possible.
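To make #1 and #2 a bit more concrete, here’s a minimal sketch of what poking at a model with an interpretability tool looks like, using SHAP on a scikit-learn tree model. The data here is synthetic, and I’m only assuming you have the shap and scikit-learn packages installed:

```python
# A minimal sketch of inspecting a model with SHAP (synthetic data).
import numpy as np
import shap
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))                 # 500 fake records, 5 features
y = X[:, 0] + 0.5 * X[:, 3] + 0.1 * rng.normal(size=500)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Per-prediction feature attributions: a quick way to see which inputs
# a change would actually entangle with.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)        # shape: (n_samples, n_features)
print(np.abs(shap_values).mean(axis=0))       # mean |attribution| per feature
```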

Correction Cascades are what happens when some of the inputs to your nebulous machine learning model are themselves nebulous machine learning models. It’s just setting up this big domino rally of errors. It is extremely tempting to set up sequences of models like this, for example, when applying a pre-existing model to a new domain (or a “startup pivot” as so many insist on calling it). You might have an unsupervised dimensionality reduction step right before your random forest, but changing the t-SNE parameters suddenly tanks the performance of the rest of the model. In the worst case scenario, it’s impossible to improve any of the subcomponents without detracting from the performance of the entire system. Your machine learning pipeline goes from being positive sum to zero sum (that’s not a term from the Tech Debt Paper, I just felt like not adding it in was a missed opportunity).

Courtesy Randal Munroe of XKCD

As far as preventing this, one of the better techniques is a variant of greedy unsupervised layer-wise pretraining (or GULP). There’s still some disagreement on the mathematical reasons WHY this works so well, but basically you train the early models or early parts of your ensembles, freeze them, and then work your way up the rest of the sequence (again, not mentioning this in the Tech Debt Paper was another missed opportunity, especially since the technique has existed at least since 2007).

Best Practice #3: Always re-train downstream models in order.
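If you want a rough sketch of what “train the early models, freeze them, then work your way up” looks like in code, here’s one with scikit-learn, where a PCA step stands in for whatever upstream model you actually have (synthetic data, made-up file name):

```python
# A sketch of taming a correction cascade: fit the upstream reducer once,
# freeze it to disk, then retrain only the downstream model against its
# fixed output.
import joblib
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 50))
y = (X[:, :5].sum(axis=1) > 0).astype(int)

# Step 1: train the upstream model and freeze it.
reducer = PCA(n_components=10, random_state=0).fit(X)
joblib.dump(reducer, "reducer_v1.joblib")

# Step 2: downstream models always consume the frozen artifact, so retraining
# them never silently shifts the upstream representation underneath them.
frozen_reducer = joblib.load("reducer_v1.joblib")
clf = RandomForestClassifier(random_state=0).fit(frozen_reducer.transform(X), y)
print(clf.score(frozen_reducer.transform(X), y))
```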

Another inconvenient feature of machine learning models: more consumers might be relying on the outputs than you realize, beyond just other machine learning models. This is what the authors refer to as Undeclared Consumers. The issue here isn’t that the output data is unstructured or not formatted right, it’s that nobody’s taking stock of just how many systems depend on the outputs. For example, there are plenty of custom datasets on sites like Kaggle, many of which are themselves machine learning model outputs. A lot of projects and startups will often use datasets like this to build and train their initial machine learning models in lieu of having internal datasets of their own. Scripts and tasks that are dependent on these can find their data sources changing with little notice. The problem is compounded for APIs that don’t require any kind of sign-in to access data. Unless you have some kind of barrier to entry for accessing the model outputs, like access keys or service-level agreements, this is a pretty tricky one to handle. You may be just saving your model outputs to a file, and then someone else on the team may decide to use those outputs for a model of their own because, hey, why not, the data’s in the shared directory. Even if it’s experimental code, you should be careful about who’s accessing model outputs that aren’t verified yet. This tends to be a big problem with toolkits like JupyterLab (if I could go back in time and add any kind of warning to the Tech Debt Paper, it would be a warning about JupyterLab).

Basically fixing this type of technical debt involves cooperation between machine learning engineers and security engineers.

Best Practice #4: Set up access keys, directory permissions, and service-level-agreements.
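Even something low-tech helps here: write model outputs with restrictive permissions and keep an explicit manifest of who’s consuming them. A minimal sketch (the paths, team names, and manifest format are all things I made up for illustration):

```python
# Minimal sketch: restrictive permissions on model outputs plus an explicit
# manifest of registered consumers. Paths and names are placeholders.
import json
import os
import stat

OUTPUT_PATH = "predictions_v3.csv"
MANIFEST_PATH = "predictions_v3.consumers.json"

def write_predictions(rows):
    with open(OUTPUT_PATH, "w") as f:
        f.write("\n".join(rows))
    # Owner read/write only: nobody "just grabs" this from the shared directory.
    os.chmod(OUTPUT_PATH, stat.S_IRUSR | stat.S_IWUSR)

def register_consumer(team, contact):
    manifest = []
    if os.path.exists(MANIFEST_PATH):
        with open(MANIFEST_PATH) as f:
            manifest = json.load(f)
    manifest.append({"team": team, "contact": contact})
    with open(MANIFEST_PATH, "w") as f:
        json.dump(manifest, f, indent=2)

write_predictions(["id,score", "42,0.93"])
register_consumer("risk-team", "someone@example.com")
```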

You wish your downstream consumer graph was as simple or as comprehensive as this.

Part 3: Data Dependencies (on top of regular dependencies)

The third section goes a bit deeper into data dependency issues. More bad news: in addition to the regular code dependencies of software engineering, machine learning systems also depend on large data sources that are probably more unstable than the developers realize.

For example, your input data might take the form of a lookup table that’s changing underneath you, or a continuous data stream, or you might be using data from an API you don’t even own. Imagine if the host of the MolNet dataset decided to update it with more accurate numbers (ignoring for a moment how they would do this). While the data may reflect reality more accurately, countless models have been built against the old data, and many of their makers will suddenly find that their accuracy is tanking when they re-run a notebook that definitely worked just last week.

One of the proposals by the authors is to use data dependency tracking tools like Photon for versioning. That being said, in 2020 we also have newer tools like DVC, which literally just stands for “Data Version Control”, that make Photon obsolete for the most part. It behaves much the same way as git, and saves a DAG keeping track of the changes in a dataset/database. Two other great tools to use together for versioning are Streamlit (for keeping track of experiments and prototypes) and Netflix’s Metaflow. How much version control you do will come down to a tradeoff between extra storage and avoiding some giant gap in the training process. Still, insufficient or inappropriate versioning will lead to enormous survivorship bias (and thus wasted potential) when it comes to model training.

This is the DVC page. Go download it. Seriously, go download it now!

Best Practice #5: Use a data versioning tool.
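If you go the DVC route, its Python API (dvc.api) lets you pin a dataset version the same way you’d pin a package version. A rough sketch, assuming a repo where data/train.csv is already tracked by DVC and v1.2 is an existing git tag (both the path and the repo URL are placeholders):

```python
# A rough sketch of reading a *pinned* version of a dataset with DVC's Python API.
import dvc.api
import pandas as pd

with dvc.api.open(
    "data/train.csv",                                    # placeholder path
    repo="https://github.com/your-org/your-data-repo",   # placeholder repo URL
    rev="v1.2",                                          # git tag / commit to pin against
) as f:
    train_df = pd.read_csv(f)

print(train_df.shape)
```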

The data dependency horror show goes on. Compared to the unstable data dependencies, the underutilized ones might not seem as bad, but that’s how they get you! Basically, you need to keep a lookout for data that’s unused, data that was once used but is considered legacy now, and data that’s redundant because it’s heavily correlated with something else. If you’re managing a data pipeline where it turns out entire gigabytes are redundant, that will incur development costs all on its own.

The correlated data is especially tricky, because you need to figure out which variable is merely correlated and which is the causative one. This is a big problem in biological data. Tools like ANCOVA are increasingly outdated, and they’re unfortunately being used in scenarios where some of the ANCOVA assumptions definitely don’t apply. A few groups have tried proposing alternatives like ONION and Domain Aware Neural Networks, but many of these are improving upon fairly unimpressive standard approaches. Some companies like Microsoft and QuantumBlack have come up with packages for causal disentanglement (DoWhy and CausalNex, respectively). I’m particularly fond of DeepMind’s work on Bayesian Causal Reasoning. Most of these were not around at the time of the Tech Debt Paper’s writing, and many of these packages have their own usability debt, but it’s important to make it known that ANCOVA is not a one-size-fits-all solution to this.

Best Practice #6: Drop unused files, extraneous correlated features, and maybe use a causal inference toolkit.
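For the “extraneous correlated features” part, even a blunt pandas correlation check catches a surprising amount before you ever reach for a causal inference toolkit. A quick sketch (synthetic data; the 0.95 threshold is an arbitrary choice):

```python
# A blunt sketch for flagging near-duplicate (highly correlated) features
# with pandas before reaching for a causal inference toolkit.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"a": rng.normal(size=200)})
df["b"] = df["a"] * 2 + rng.normal(scale=0.01, size=200)   # nearly redundant with "a"
df["c"] = rng.normal(size=200)

corr = df.corr().abs()
# Keep only the upper triangle so each pair is counted once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
redundant = [col for col in upper.columns if (upper[col] > 0.95).any()]
print("Candidates to drop or investigate:", redundant)     # -> ['b']
```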

Anyway, the authors were a bit less pessimistic about the fixes for these. They suggested a static analysis of data dependencies, giving the one used by Google in their click-through predictions as an example. Since the Tech Debt Paper was published the pool of options for addressing this has grown a lot. For example, there are tools like Snorkel which lets you track which slices of data are being used for which experiments. Cloud Services like AWS and Azure have their own data dependency tracking services for DevOps, and there’s also tools like Red Gate SQL dependency tracker. So, yeah, looks like the authors were justified in being optimistic about that one.

Best Practice #7: Use any of the countless DevOps tools that track data dependencies.

Part 4: Frustratingly undefinable feedback loops

Now, we had a bit of a hope spot in the previous section, but the bad news doesn’t stop at data dependencies. Section 4 of the paper goes into how unchecked feedback loops can influence the machine learning development cycle. This can refer both to direct feedback loops, like in semi-supervised learning or reinforcement learning, and to indirect loops, like engineers basing their design choices on another machine learning model’s output. This is one of the least defined issues in the Tech Debt Paper, but countless other organizations are working on this feedback loop problem, including what seems like the entirety of OpenAI (at least that’s what the “Long Term Safety” section of their charter suggested, before all that “Capped Profit” hubbub). What I’m trying to say is that if you’re going to be doing research on direct or indirect feedback loops, you’ve got much better and more specific options than this paper.

This one goes back to the usual pattern, with proposed solutions that seem a bit more hopeless than the last section’s. The authors give bandit algorithms as an example of something resistant to direct feedback loops, but not only do those not scale, technical debt accumulates the most when you’re trying to build systems at scale. Useless. The indirect feedback fixes aren’t much better. In fact, the systems in the indirect feedback loop might not even be part of the same organization. This could be something like trading algorithms from different firms each trying to meta-game each other, but instead causing a flash crash. Or, for a more relevant example in biotech, suppose you have a model that’s predicting the error likelihood for a variety of pieces of lab equipment. As time goes on, the actual error rate could go down because people have become more practiced with the equipment, or possibly up because the scientists are using it more frequently but the calibrations haven’t increased in frequency to compensate. Ultimately, fixing this comes down to high-level design decisions, and making sure you check as many assumptions behind your model’s data (especially the independence assumption) as possible.

This is also an area where many principles and practices from security engineering become very useful (e.g., tracking the flow of data throughout a system, searching for ways the system can be abused before bad actors can make use of them).

These are just a few examples of direct feedback loops for one model. In reality, some of the blocks in this diagram may themselves be ML models. This doesn’t scratch the surface of indirect interactions

Best Practice #8: Check independence assumptions behind models (and work closely with security engineers).
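One cheap way to check that independence assumption is to look at whether today’s predictions are suspiciously correlated with tomorrow’s inputs. A toy sketch (entirely synthetic; the lag of 1 and the 0.3 threshold are arbitrary choices):

```python
# Toy sketch: check whether the model's predictions at time t are suspiciously
# correlated with the input feature at time t+1 -- one cheap smell test for a
# feedback loop between your model's output and its own future input data.
import numpy as np

rng = np.random.default_rng(0)
n = 500
feature = np.zeros(n)
predictions = np.zeros(n)
for t in range(1, n):
    # In this toy world, the feature partially follows yesterday's prediction.
    predictions[t - 1] = 0.8 * feature[t - 1] + 0.2 * rng.normal()
    feature[t] = 0.6 * predictions[t - 1] + rng.normal()

lagged_corr = np.corrcoef(predictions[:-1], feature[1:])[0, 1]
print(f"corr(prediction_t, feature_t+1) = {lagged_corr:.2f}")
if abs(lagged_corr) > 0.3:
    print("Warning: possible feedback loop between model output and input data.")
```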

By now, especially after the ANCOVA comments, you’re probably sensing a theme about testing assumptions. I wish this were something the authors had devoted at least an entire section to.

Part 5: Common no-no patterns in your ML code

The “Anti-patterns” section of the Tech Debt Paper was a little more actionable than the last one. This part went into higher-level patterns that are much easier to spot than indirect feedback loops.

(This is actually a table from the Tech Debt Paper, but with hyperlinks to actionable advice on how to fix them. This table was possibly redundant, as the authors discuss unique code smells and anti-patterns in ML, but these are all regular software engineering anti-patterns you should address in your code first.)

Design Defect              Type
Feature Envy               Code Smell
Data Class                 Code Smell
Long Method                Code Smell
Long Parameter List        Code Smell
Large Class                Code Smell
God Class                  Antipattern
Swiss Army Knife           Antipattern
Functional Decomposition   Antipattern
Spaghetti Code             Antipattern

The majority of these patterns revolve around the 90% or more of ML code that’s just maintaining the model. This is the plumbing that most people in a Kaggle competition might think doesn’t exist. Solving cell segmentation is a lot easier when you’re not spending most of your time digging through the code connecting the Tecan Evo camera to your model input.

Best Practice #9: Use regular code-reviews (and/or use automatic code-sniffing tools).

The first ML anti-pattern introduced is called “glue code”. This is all the code you write when you’re trying to fit data or tools from a general-purpose package into the super-specific model that you have. Anyone that’s ever tried doing something with packages like RDKit knows what I’m talking about. Basically, most of the stuff you shove into the utils.py file can count as this (everyone does it). These (hopefully) should be fixable by repackaging the dependencies as more specific API endpoints.

We all have utils.py files we’re not proud of

Best Practice #10: Repackage general-purpose dependencies into specific APIs.
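To make that concrete, here’s a sketch of hiding a general-purpose package (RDKit, in this case) behind the one narrow interface your project actually needs. The wrapper function name is something I made up; the point is that the rest of the codebase never imports RDKit directly:

```python
# featurize.py -- a sketch of wrapping a general-purpose dependency (RDKit)
# behind the narrow interface this project actually needs.
from typing import Optional

from rdkit import Chem
from rdkit.Chem import Descriptors


def molecular_weight(smiles: str) -> Optional[float]:
    """Return the molecular weight for a SMILES string, or None if it won't parse."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    return Descriptors.MolWt(mol)


# The rest of the codebase imports `molecular_weight`, never RDKit directly,
# so swapping out or upgrading RDKit touches exactly one file.
if __name__ == "__main__":
    print(molecular_weight("CCO"))  # ethanol, ~46.07
```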

“Pipeline jungles” are a little bit trickier, as this is where a lot of glue code accumulates. This is where all the transformations you add for every little new data source pile up into an ugly amalgam. Unlike with glue code, the authors pretty much recommend letting go and redesigning codebases like this from scratch. I want to say this is something that has more options nowadays, but when glue code turns into pipeline jungles, even tools like Uber’s Michelangelo can become part of the problem.

Of course, the advantage of the authors’ advice is that you can make this replacement code seem like an exciting new project with a cool name that’s also an obligatory Tolkien reference, like “Balrog” (and yes, ignoring the unfortunate implications of your project name isn’t just Palantir’s domain; you’re free to do that as well).

This, but in software form

Best Practice #11: Get rid of Pipeline jungles with top-down redesign/reimplementation.

On the subject of letting go, experimental code. Yes, you thought you could just save that experimental code for later. You thought you could just put it in an unused function or unreferenced file and it would be all fine. Unfortunately, stuff like this is part of why maintaining backwards compatibility can be such a pain in the neck. Anyone that’s taken a deep dive into the Tensorflow framework can see the remains of frameworks that were only partially absorbed, experimental code, or even incomplete “TODO” code that was left for some other engineer to take care of at a later date. You probably first came across these while trying to debug your mysteriously failing Tensorflow code. This certainly puts all the compatibility hiccups between Tensorflow 1.X and 2.X in a new light. Do yourself a favor, and don’t put off pruning your codebase for 5 years. Keep doing experiments, but set some criteria for when to quarantine an experiment away from the rest of the code.

Google Research’s mess of a github repo (or the top 5th, at least), and this is just the stuff that’s public. And no, the README isn’t equally detailed. Quite the opposite in fact.

Best Practice #12: Set regular checks and criteria for removing code, or put the code in a directory or on a disk far-removed from the business-critical stuff.

Speaking of old code, you know what software engineering has had for a while now? Really great abstractions! Everything from the concept of relational databases to views in web pages. There are entire branches of applied category theory devoted to figuring out the best ways to organize code like this. You know what applied category theory hasn’t quite caught up to yet? That’s right, machine learning code organization. Software engineering has had decades of throwing abstraction spaghetti at the wall and seeing what sticks. Machine learning? Aside from MapReduce (which is, like, not as impressive as relational databases), async parameter servers (which nobody can agree on how to do), and sync allreduce (which just sucks for most use-cases), we don’t have much to show.

High-level overview of the previously mentioned Sync AllReduce server architecture (or a seriously oversimplified version)

High-level overview of the previously mentioned Async parameter server architecture (or at least one version of it)

In fact, between groups doing research on random networks and PyTorch advertising how fluid the nodes in their neural networks are, machine learning has been throwing the abstraction spaghetti clean out the window! I don’t think the authors realized that this problem was going to get MUCH worse as time went on. My recommendation? Read more of the literature on the popular high-level abstractions, and maybe don’t use PyTorch for production code.

Best Practice #13: Stay up-to-date on abstractions that are becoming more solidified with time.

I’ve met plenty of senior machine learning engineers who have pet frameworks that they like to use for most problems. I’ve also seen many of the same engineers watch their favorite framework fall to pieces when applied to a new context, or get replaced by another framework that’s functionally indistinguishable. This is especially prevalent on teams doing anything with distributed machine learning. I want to make something absolutely clear:

Aside from MapReduce, you should avoid getting too attached to any single framework. If your “senior” machine learning engineer believes with all their being that Michelangelo is the bee’s knees and will solve everything, they’re probably not all that senior. While ML engineering has matured, it’s still relatively new. An actually “senior” senior ML engineer will probably focus on making workflows that are framework agnostic, since they know most of those frameworks are not long for this world. ⚰️

Now, the previous sections mentioned a bunch of distinct scenarios and qualities of technical debt in ML, but the authors also provide examples of higher-level anti-patterns to watch for in ML development.

Most of you reading this have probably heard the phrase “code smells” going around. You’ve probably used tools like good-smell or pep8 auto-checking (or even the hot new Black auto-formatter that everyone is using on their production Python code). Truth be told, I don’t like the term “code smell”. “Smell” always seems to imply something subtle, but the patterns described in the next section are pretty blatant. Nonetheless, the authors list a few types of code smells that indicate a high level of debt (beyond the usual types of code smells). For some reason, they only started listing the code smells halfway into the section on code smells.

The “Plain data” smell: You may have code that’s dealing with a lot of data in the form of numpy floats. There may be little information preserved about the nature of the data, such as whether your RNA read counts represent samples from a Bernoulli distribution, or whether your float is the log of a number. They don’t mention this in the Tech Debt Paper, but this is one area where using typing in Python can help out. Avoiding unnecessary use of floats, or floats with too much precision, will go a long way. Again, using the built-in decimal or typing packages will help a lot (and not just for code navigation, but also for speedups on CPUs).

Best Practice #14: Use packages like Typing and Decimal, and don’t use ‘float32’ for all data objects.
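Here’s roughly what that looks like in practice, with typing.NewType carrying the meaning that a bare float32 throws away, and Decimal used where binary floats are actually wrong. The domain-specific names are invented for illustration:

```python
# A sketch of using the typing module (and Decimal) so the data's meaning
# travels with it, instead of everything being an anonymous float32.
import math
from decimal import Decimal
from typing import NewType

ReadCount = NewType("ReadCount", int)             # raw RNA read counts are counts, not floats
LogExpression = NewType("LogExpression", float)   # explicitly a log-transformed value

def log_normalize(count: ReadCount, library_size: ReadCount) -> LogExpression:
    return LogExpression(math.log1p(count / library_size))

# Decimal for quantities where binary float rounding is actually wrong (e.g. money).
reagent_cost_usd = Decimal("19.99") * 3

print(log_normalize(ReadCount(120), ReadCount(1_000_000)))
print(reagent_cost_usd)   # 59.97 exactly, not 59.970000000000006
```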

A majority of hackathon code, and unfortunately a lot of VC-funded “AI” startup code, looks like this under the hood.

The “Prototyping” smell: Anyone that’s been in a hackathon knows that code slapped together in under 24 hours has a certain look to it. This ties back into the unused experimental code mentioned earlier. Yes, you might be all excited to try out the new PHATE dimensionality reduction tool for biological data, but either clean up your code or throw it out.

Best Practice #15: Don’t leave all works-in-progress in the same directory. Clean it up or toss it out.

The “Multi-language” smell: Speaking of language typing, multi-language codebases act almost like a multiplier for technical debt and make it pile up much faster. Sure, these languages all have their benefits. Python is great for building ideas fast. JavaScript is great for interfaces. C++ is great for graphics and making computations go fast. PHP…uhhh…okay, maybe not that one. Golang is useful if you’re working with Kubernetes (and you work at Google). But if you’re making these languages talk to each other, there will be a lot of spots for things to go wrong, whether it be broken endpoints or memory leaks. At least in machine learning, there are a few toolkits like Spark and Tensorflow that have similar semantics between languages. If you absolutely must use multiple languages, at least we now have that going for us post-2015.

C++ programmer switching to Python. Now picture an entire repo written by this person.

Best Practice #16: Make sure endpoints are accounted for, and use frameworks that have similar abstractions between languages.

(Calling this a code smell was a weird choice, as this is a pretty blatant pattern even by the standards of usual code smells.)

Part 6: Configuration Debt (boring but easy to fix)

The “Configuration debt” section of the Tech Debt Paper is probably the least exciting one, but the problem it describes is the easiest to fix. Basically, this is just the practice of making sure all the tuneable and configurable information about your machine learning pipeline is in one place, and that you don’t have to go searching through multiple directories just to figure out how many units your second LSTM layer had. Even if you’ve gotten into the habit of creating config files, the packages and technologies haven’t all caught up with you. Aside from some general principles, this part of the Tech Debt Paper doesn’t go into too much detail. I suspect that the authors of the Tech Debt Paper were more used to packages like Caffe (in which case yes, setting up configs with Caffe protobufs was objectively buggy and terrible).

Personally, I would suggest using a framework like tf.Keras or Chainer if you’re going to be setting up configuration files. Most cloud services have some version of configuration management, but outside of that you should at least be prepared to use a config.json file or parameter flags in your code.

Best Practice #17: Make it so you can set your file paths, hyperparameters, layer type and layer order, and other settings from one location.

A FANTASTIC example of a config file. Really wish the Tech Debt Paper had gone into more examples, or even pointed to some of the existing configuration file guides at the time.

If you’re going to be tuning these settings with a command line, try to use a package like Click instead of Argparse.
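A minimal sketch of what a single point of configuration can look like with Click (the hyperparameter names and defaults are just examples, not a recommendation):

```python
# train.py -- a sketch of keeping every tunable setting behind one CLI entry
# point using Click. Hyperparameter names and defaults are just examples.
import json

import click


@click.command()
@click.option("--config", "config_path", type=click.Path(exists=True), default=None,
              help="Optional JSON config file to load settings from.")
@click.option("--learning-rate", type=float, default=1e-3)
@click.option("--lstm-units", type=int, default=128)
@click.option("--data-dir", type=click.Path(), default="data/")
def train(config_path, learning_rate, lstm_units, data_dir):
    # In this toy sketch, command-line values (including their defaults)
    # win over anything in the config file.
    settings = {}
    if config_path:
        with open(config_path) as f:
            settings.update(json.load(f))
    settings.update(
        {"learning_rate": learning_rate, "lstm_units": lstm_units, "data_dir": data_dir}
    )
    click.echo(f"Training with: {settings}")
    # ... build the model and train it from `settings` ...


if __name__ == "__main__":
    train()
```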

Part 7: The real world dashing your dreams of solving this

Section 7 acknowledges that a lot of managing tech debt is preparing for the fact that you’re dealing with a constantly changing real world. For example, you might have a model where there’s some kind of decision threshold for converting a model output into a classification, or picking a True or False Boolean. Any group or company that works with biological or health data is familiar with how quickly diagnosis criteria can change. You shouldn’t assume the thresholds you work with will last forever, especially if you’re doing anything with Bayesian machine learning.

Your decision boundaries in your online-learning algorithm…12 minutes after deployment.

Best Practice #18: Monitor the models’ real-world performance and decision boundaries constantly.

The section stresses the importance of real-time monitoring; I can definitely get behind this. As for which things to monitor, the paper’s not a comprehensive guide but they give a few examples. One is to compare the summary statistics for your predicted labels with the summary statistics of the observed labels. It’s not foolproof, but it’s like checking a small animal’s weight. If something’s very wrong there, it can alert you to a separate problem very quickly.

Far too many tools to count. Everyone and their mother is making a monitoring startup these days. Here’s SageMaker & Weights & Biases as examples since they’re slightly less buggy than the others.

Best Practice #19: Make sure distribution of predicted labels is similar to distribution of observed labels.
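A bare-bones version of that check might look like this (synthetic labels, and the 5% alert threshold is an arbitrary choice):

```python
# A bare-bones sketch of BP #19: compare the distribution of predicted labels
# against observed labels and alert if they drift apart.
from collections import Counter

import numpy as np

rng = np.random.default_rng(0)
observed = rng.choice(["healthy", "flagged"], size=1000, p=[0.9, 0.1])
predicted = rng.choice(["healthy", "flagged"], size=1000, p=[0.7, 0.3])  # model gone weird

def label_fractions(labels):
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

obs, pred = label_fractions(observed), label_fractions(predicted)
for label in set(obs) | set(pred):
    gap = abs(obs.get(label, 0.0) - pred.get(label, 0.0))
    if gap > 0.05:
        print(f"ALERT: '{label}' predicted at {pred.get(label, 0):.2%} "
              f"vs observed {obs.get(label, 0):.2%}")
```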

If your system is making any kind of real-world decisions, you probably want to put some kind of rate limiter on it. Even if your system is NOT being trusted with millions of dollars for bidding on stocks, even if it’s just to alert you that something’s not right with the cell culture incubators, you will regret not setting some kind of action limit per unit of time.

Some of your earliest headaches with production ML will be with systems with no rate limiters.

Best Practice #20: Put limits on real-world decisions that can be made by machine learning systems.
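Something as simple as a sliding-window rate limiter wrapped around whatever real-world action the model can trigger is enough to start with. A sketch (the limits here are arbitrary):

```python
# A sketch of a simple rate limiter wrapped around whatever real-world action
# the model is allowed to trigger.
import time
from collections import deque


class ActionRateLimiter:
    def __init__(self, max_actions: int, per_seconds: float):
        self.max_actions = max_actions
        self.per_seconds = per_seconds
        self._timestamps = deque()

    def allow(self) -> bool:
        now = time.monotonic()
        # Drop timestamps that have aged out of the window.
        while self._timestamps and now - self._timestamps[0] > self.per_seconds:
            self._timestamps.popleft()
        if len(self._timestamps) < self.max_actions:
            self._timestamps.append(now)
            return True
        return False


limiter = ActionRateLimiter(max_actions=10, per_seconds=60.0)
if limiter.allow():
    print("OK to act on the model's decision (e.g. page a technician).")
else:
    print("Rate limit hit -- queue the action for human review instead.")
```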

You also want to be mindful of any changes with upstream producers of the data your ML pipeline is consuming. For example, any company running machine learning on human blood or DNA samples obviously wants to make sure those samples are all collected with a standardized procedure. If a bunch of samples are all coming from a certain demographic, the company should make sure that won’t skew their analysis. If you’re doing some kind of single-cell sequencing on cultured human cells, you want to make sure you’re not confusing cancer cells dying because a drug is working with, say, an intern accidentally letting the cell culture dehydrate. The authors say that ideally you want a system that can respond to these changes (e.g., logging, turning itself off, changing decision thresholds, alerting a technician or whoever does repairs) even when humans aren’t available.

You laugh now, but you might be more at risk than you realize for making this kind of mistake

Best Practice #21: Check assumptions behind input data.
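A sketch of what cheap input checks can look like: validate the batch, log loudly, and refuse to proceed rather than silently training on bad data (the column names and bounds here are invented for illustration):

```python
# A sketch of cheap upstream-data checks that alert and refuse to proceed.
import logging

import pandas as pd

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("data_checks")

EXPECTED_COLUMNS = {"sample_id", "collection_site", "viability_pct"}

def validate_batch(df: pd.DataFrame) -> bool:
    ok = True
    if not EXPECTED_COLUMNS.issubset(df.columns):
        log.warning("Missing columns: %s", EXPECTED_COLUMNS - set(df.columns))
        ok = False
    if "viability_pct" in df and (df["viability_pct"] > 100).any():
        log.warning("Viability over 100 percent -- upstream protocol may have changed.")
        ok = False
    if "collection_site" in df and df["collection_site"].nunique() == 1:
        log.warning("All samples from one site -- check for demographic skew.")
        ok = False
    return ok

batch = pd.DataFrame({"sample_id": [1, 2],
                      "collection_site": ["A", "A"],
                      "viability_pct": [95, 140]})
if not validate_batch(batch):
    print("Batch rejected; alert a human before retraining.")
```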

Part 8: The weirdly meta section

The penultimate section of the Tech Debt Paper goes on to mention other areas. The authors previously mentioned failure of abstraction as a type of technical debt, and apparently that extends to the authors not being able to fit all these technical debt types into the first 7 sections of the paper.

Sanity Checks

Moving on, it’s critically important to have sanity checks on the data. If you’re training a new model, you want to make sure your model is at least capable of overfitting to one type of category in the data. If it’s not converging on anything, you might want to check that the data isn’t random noise before tuning those hyperparameters. The authors weren’t that specific, but I figured that was a good test to mention.

Best Practice #22: Make sure your data isn’t all noise and no signal by making sure your model is at least capable of overfitting.
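The classic version of this test, sketched here with scikit-learn on synthetic data. I’ve also tacked on a shuffled-label check as a closely related sanity test, since it catches leakage for the same low cost:

```python
# Two quick sanity checks (scikit-learn, synthetic data):
# 1) the model should be able to overfit a tiny batch almost perfectly;
# 2) with shuffled labels, held-out accuracy should collapse to chance.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 20))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Check 1: overfit a tiny batch.
tiny_X, tiny_y = X[:32], y[:32]
overfit_acc = RandomForestClassifier(random_state=0).fit(tiny_X, tiny_y).score(tiny_X, tiny_y)
print(f"tiny-batch training accuracy: {overfit_acc:.2f}")        # expect ~1.0

# Check 2: shuffled labels should give roughly chance accuracy on held-out
# data; if not, something is leaking.
y_shuffled = rng.permutation(y)
X_tr, X_te, y_tr, y_te = train_test_split(X, y_shuffled, random_state=0)
leak_acc = RandomForestClassifier(random_state=0).fit(X_tr, y_tr).score(X_te, y_te)
print(f"held-out accuracy with shuffled labels: {leak_acc:.2f}")  # expect ~0.5
```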

Reproducibility

Reproducibility. I’m sure many of you on research teams have had a lot of encounters with this one. You’ve probably seen code without seed numbers, notebooks written out of order, repositories without package versions. Since the Tech Debt Paper was written, a few groups have tried making reproducibility checklists. Here’s a pretty good one that was featured on Hacker News about 4 months ago.

This is a pretty great one that I use, and encourage teams I work with to use

Best Practice #23: Use reproducibility checklists when releasing research code.
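If nothing else, pin your seeds and record package versions with every run. A minimal sketch (add framework-specific seeding, e.g. torch.manual_seed, for whatever you’re actually using):

```python
# A minimal reproducibility stub: pin seeds and record versions alongside
# every experiment.
import json
import random
import sys

import numpy as np

SEED = 1234
random.seed(SEED)
np.random.seed(SEED)

run_metadata = {
    "seed": SEED,
    "python": sys.version,
    "numpy": np.__version__,
}
with open("run_metadata.json", "w") as f:
    json.dump(run_metadata, f, indent=2)
```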

Industry people reconstructing code from academia know what I’m talking about

Process Management

Most of the types of technical debt discussed so far have referred to single machine learning models, but process management debt is what happens when you’re running tons of models at the same time, and you don’t have any plan for stopping all of them from waiting around for the one laggard to finish. It’s important not to ignore the system-level smells; this is also where checking the runtimes of your models becomes extremely important. Machine learning engineering has at least gotten better at thinking about high-level system design since the Tech Debt Paper’s writing.

An example of the types of charts used in bottleneck identification

Best Practice #24: Make a habit of checking and comparing runtimes for machine learning models.
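A tiny context manager is enough to start building that habit. A sketch (the “models” here are placeholder functions standing in for real inference calls):

```python
# A tiny sketch for habitually logging and comparing model runtimes.
import time
from contextlib import contextmanager

timings = {}

@contextmanager
def timed(name):
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = time.perf_counter() - start

def fake_model(n):   # placeholder standing in for a real model call
    return sum(i * i for i in range(n))

with timed("model_a"):
    fake_model(10_000)
with timed("model_b"):
    fake_model(1_000_000)

# The laggard everyone else ends up waiting for:
print(max(timings, key=timings.get), timings)
```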

Cultural Debt

Cultural debt is the really tricky type of debt. The authors point out that sometimes there’s a divide between research and engineering, and that it’s easier to encourage debt-correcting behavior in heterogeneous teams.

Personally, I’m not exactly a fan of that last part. I’ve witnessed many teams that have individuals that end up reporting to both the engineering directors and the research director. Making a subset of the engineers report to two different branches without the authority to make needed changes is not a solution for technical debt. It’s a solution insofar as a small subset of engineers take the brunt of the technical debt. The end result is that such engineers usually end up with No Authority Gauntlet Syndrome (NAGS), burn out, and are fired by whichever manager had the least of their objectives fulfilled by the engineer all while the most sympathetic managers are out at Burning Man. If heterogeneity helps, then it needs to be across the entire team.

Plus, I think the authors make some of the same mistakes many do when talking about team or company culture. Specifically, confusing culture with values. It’s really easy to list a few aspirational rules for a company or team and call them a culture. You don’t need an MBA to do that, but these are values more than actual culture. Culture is what people end up doing when they’re in situations that demand they choose between two competing values. This was what got Uber in so much trouble. Both competitiveness and honesty were part of their corporate values, but in the end, their culture demanded they emphasize competitiveness over everything else, even if that meant HR violating laws to keep absolute creeps at the company.

A vicious cycle

The issue with tech debt is that it comes up in a similar situation. Yes, it’s easy to talk about how much you want maintainable code. But, if everyone’s racing for a deadline, and writing documentation keeps getting shifted down in priority on the JIRA board, that debt is going to pile up despite your best efforts.

The various reasons for technical debt at the highest level

Best Practice #25: Set aside regular, non-negotiable time for dealing with technical debt (whatever form it might take).

Part 9: A technical debt litmus test

It’s important to remember that the ‘debt’ part is just a metaphor. As much as the authors try to make this seem like something that has more rigor, that’s all it is. Unlike most debts, machine learning technical debt is something that’s hard to measure. How fast your team is moving at any given time is usually a poor indicator of how much you have (despite what many fresh-out-of-college product managers seem to insist). Rather than a metric, the authors suggest 5 questions to ask yourself (paraphrased for clarity here):

  • How long would it take to get an algorithm from an arbitrary NeurIPS paper running on your biggest data source?
  • Which data dependencies touch the most (or fewest) parts of your code?
  • How much can you predict the outcome of changing one part of your system?
  • Is your ML model improvement system zero-sum or positive sum?
  • Do you even have documentation? Is there a lot of hand-holding through the ramp-up process for new people?

Of course, since 2015, other articles and papers have tried coming up with more precise scoring mechanisms (like scoring rubrics). Some of these have the benefit of creating an at-a-glance score that, even if imprecise, will help you track technical debt over time. Also, there have been a ton of advancements in the interpretable ML tools that were extolled as a solution to some types of technical debt. With that in mind, I’m going to recommend “Interpretable Machine Learning” by Christoph Molnar (available online here) again.

The 25 Best Practices in one place

Here are all the Best Practices I mentioned throughout in one spot. There are likely many more than this, but tools for fixing technical debt follow the Pareto Principle: 20% of the technical debt remedies can fix 80% of your problems.

  1. Use interpretability tools like SHAP values
  2. Use explainable model types if possible
  3. Always re-train downstream models
  4. Set up access keys, directory permissions, and service-level-agreements.
  5. Use a data versioning tool.
  6. Drop unused files, extraneous correlated features, and maybe use a causal inference toolkit.
  7. Use any of the countless DevOps tools that track data dependencies.
  8. Check independence assumptions behind models (and work closely with security engineers).
  9. Use regular code-reviews (and/or use automatic code-sniffing tools).
  10. Repackage general-purpose dependencies into specific APIs.
  11. Get rid of Pipeline jungles with top-down redesign/reimplementation.
  12. Set regular checks and criteria for removing code, or put the code in a directory or on a disk far-removed from the business-critical stuff.
  13. Stay up-to-date on abstractions that are becoming more solidified with time
  14. Use packages like Typing and Decimal, and don’t use ‘float32’ for all data objects
  15. Don’t leave all works-in-progress in the same directory. Clean it up or toss it out.
  16. Make sure endpoints are accounted for, and use frameworks that have similar abstractions between languages
  17. Make it so you can set your file paths, hyperparameters, layer type and layer order, and other settings from one location
  18. Monitor the models’ real-world performance and decision boundaries constantly
  19. Make sure distribution of predicted labels is similar to distribution of observed labels
  20. Put limits on real-world decisions that can be made by machine learning systems
  21. Check assumptions behind input data
  22. Make sure your data isn’t all noise and no signal by making sure your model is at least capable of overfitting
  23. Use reproducibility checklists when releasing research code
  24. Make a habit of checking and comparing runtimes for machine learning models
  25. Set aside regular, non-negotiable time for dealing with technical debt (whatever form it might take)

Cited as:

@article{mcateer2020mltechdebt,
  title   = "Nitpicking Machine Learning Technical Debt",
  author  = "McAteer, Matthew",
  journal = "matthewmcateer.me",
  year    = "2020",
  url     = "https://matthewmcateer.me/blog/machine-learning-technical-debt/"
}

If you notice mistakes and errors in this post, don’t hesitate to contact me at [contact at [firstname][lastname] dot me] and I will be very happy to correct them right away! Alternatively, you can follow me on Twitter and reach out to me there.

See you in the next post 😄
