Data loops are the bottleneck in applied AI

You can iterate quickly on code, but not on data. This is one of the major bottlenecks in companies trying to automate human activities, and solving it in a general way would change how every machine learning team works.

To understand how easy things could be, consider a simple web application. To make a small change to the site, e.g. changing the way dates are displayed, an engineer must at minimum a) make the change in source code, b) commit it to a version control system, and c) deploy the new version (which you can often do with a single command).

On my blog, this change was live three minutes after I had the idea. More complex applications often have extra steps for testing and a release schedule, but developers can at least test the design very quickly.

Now consider a similarly small change in a machine learning model like a face detector, which is supposed to draw a box around each face in a photo. Suppose we observe that the model often misses the faces of people who are looking away from the camera. The errors could be because faces at a significant angle were not part of the training dataset.
Whether the decision to omit these faces was implicit or explicit, a simple fix could do the trick: add more faces-at-an-angle to the training set. The typical data scientist’s workflow for doing that is the following:

  1. write a query to get a list of candidate images from the database
  2. download the files from AWS S3 or other storage
  3. review the faces yourself, or send them with a task description to an internal/external image labelling team
  4. gather the labels and integrate them into the training set

This process involves writing custom scripts for moving all that data around. If there’s a significant amount of manual work, you also need to talk to an internal data labelling team or outsourced partner. In my experience, a single iteration of this process takes no less than a workday and can run into several weeks when massive amounts of data are involved.
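
To make the workflow above concrete, here is a minimal sketch of what such a script might look like. Everything in it is an assumption made for illustration: the `images` table and its `yaw_degrees` column, the `fetch` callable standing in for a download from S3 or other storage, and the `label_fn` callable standing in for a human reviewer or labelling team. It only shows how the steps chain together, not how any particular team implements them.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class LabelledImage:
    image_id: int
    path: str
    boxes: list  # face boxes as (x, y, width, height) tuples


def find_candidates(conn, min_yaw_degrees):
    """Query the metadata database for images likely to show angled faces.

    The `images` table and `yaw_degrees` column are hypothetical.
    """
    rows = conn.execute(
        "SELECT id, path FROM images WHERE abs(yaw_degrees) >= ? ORDER BY id",
        (min_yaw_degrees,),
    )
    return rows.fetchall()


def run_iteration(conn, fetch, label_fn, training_set, min_yaw_degrees=45.0):
    """Run one pass of the data loop and return the number of images added.

    fetch:    callable(path) -> bytes, a stand-in for a storage download
    label_fn: callable(bytes) -> list of boxes, a stand-in for human labelling
    """
    known_ids = {item.image_id for item in training_set}
    added = 0
    for image_id, path in find_candidates(conn, min_yaw_degrees):
        if image_id in known_ids:
            continue  # already in the training set, do not label it twice
        image_bytes = fetch(path)      # download the file
        boxes = label_fn(image_bytes)  # review or send out for labelling
        training_set.append(LabelledImage(image_id, path, boxes))  # integrate
        added += 1
    return added


def build_demo_db():
    """Create a tiny in-memory database so the sketch runs without a server."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE images (id INTEGER PRIMARY KEY, path TEXT, yaw_degrees REAL)"
    )
    conn.executemany(
        "INSERT INTO images VALUES (?, ?, ?)",
        [
            (1, "faces/0001.jpg", 5.0),
            (2, "faces/0002.jpg", 60.0),
            (3, "faces/0003.jpg", -70.0),
        ],
    )
    return conn


if __name__ == "__main__":
    demo_storage = {f"faces/000{i}.jpg": b"fake image bytes" for i in (1, 2, 3)}
    dataset = []
    count = run_iteration(
        build_demo_db(),
        fetch=demo_storage.__getitem__,
        label_fn=lambda image_bytes: [(0, 0, 10, 10)],  # pretend labeller
        training_set=dataset,
    )
    print(f"added {count} images to the training set")
```

Passing the storage and labelling steps in as callables is a deliberate choice in this sketch: the slow, human part of the loop can be swapped for a stub while the plumbing is being written, which is most of the work on a small batch.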

While the software engineer is getting feedback on tens or even hundreds of ideas in a single day, the data scientist can try out maybe one idea per day, or fewer. The difference is enormous!
Sometimes there is no way around this: if you want to have a million images labelled, you will have to wait weeks. You could keep a large number of labellers always standing by, but this is expensive.

However, the workflow I described above is also typical for small datasets: you need to write data loading scripts for a hundred examples, just as you would for a million. It might only take an hour to look through a hundred images, but you’ll spend almost ten times as long creating the scripts for shuffling the data around.

The solution for iterating on small batches of data is to automate it as much as possible by building in-house tools. How to speed up iterating on millions of data points is much less clear, though. Tesla’s Director of AI, Andrej Karpathy, clearly thinks about it a lot and even shares at a very high level some ideas they have implemented, like using information from the future as annotation for the past: in a continuous video stream you can automatically label that a car is about to pass by looking at where it was five seconds later.
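
The "information from the future" idea can be illustrated with a small sketch. This is not Tesla's implementation; it is a toy version under stated assumptions: a hypothetical recording gives, for each frame, a timestamp in seconds and another car's position along the road relative to ours in metres (negative means behind us, positive means ahead). A frame is labelled "about to pass" if the car is behind us now and ahead of us in the first frame recorded at least five seconds later.

```python
from bisect import bisect_left


def label_about_to_pass(timestamps, positions, horizon_s=5.0):
    """Label each frame using where the other car is `horizon_s` seconds later.

    timestamps: sorted frame times in seconds (hypothetical recording)
    positions:  the other car's position relative to ours in metres,
                negative = behind us, positive = ahead of us
    Returns a list with True/False per frame, or None where the recording
    ends before the horizon and the future is therefore unknown.
    """
    labels = []
    for t, position_now in zip(timestamps, positions):
        # Index of the first frame recorded at or after t + horizon_s.
        future_index = bisect_left(timestamps, t + horizon_s)
        if future_index == len(timestamps):
            labels.append(None)  # no future frame available to look at
            continue
        position_later = positions[future_index]
        labels.append(position_now < 0 and position_later > 0)
    return labels


if __name__ == "__main__":
    # A car approaching from behind at 2 m/s, sampled once per second.
    times = [float(s) for s in range(9)]
    relative_positions = [-10.0 + 2.0 * s for s in range(9)]
    for t, label in zip(times, label_about_to_pass(times, relative_positions)):
        print(t, label)
```

No human looks at any frame here: the label for the past comes entirely from data recorded later, which is why this kind of annotation scales to millions of frames. The last few frames of any recording stay unlabelled, since their future has not been observed.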

I haven’t found any good open-source or commercial tools that help iterate on datasets faster the way git and GitHub do for code. I’ve created a list of data annotation tools, but most seem to address the box-drawing part, not the whole annotation workflow. My current explanation for this is that very few companies and teams iterate on their datasets. Whether this is because many real-world problems don’t require iterating on data, or because these teams simply focus on the wrong things, I don’t know.

Academia magnifies this problem. In research, datasets are usually fixed, and most papers work on improving models, chasing state-of-the-art (SOTA) performance on standard datasets. In industry, the approach is the opposite: a friend of mine working on self-driving software recently said he uses the same model architecture for all his prediction tasks; most of his time goes into getting an appropriate dataset.

A tool that allows iterating on datasets 10x faster than today will have a massive impact on the productivity of applied AI teams.