Iterating on data is the bottleneck in applied AI

June 2019

Code can be iterated on quickly but data cannot. This is one of the major bottlenecks in companies trying to automate human activity, and solving it generally would change the way every machine learning team works.

To understand how easy things could be, consider a simple web application. To make a small change to the site – e.g. changing the way dates are displayed – an engineer must minimally a) make the change in source code, b) commit it to a version control system, and c) deploy the new version (which can often be done with a single command).

On my personal website, this change was live three minutes after I had the idea. More complex applications built by teams often have extra steps for testing and a release schedule, but an idea can at least be tested very quickly.

Now consider a similarly small change to a machine learning model like a face detector, which is supposed to draw a box around each face in a photo. Suppose we observe that the model often misses the faces of people who are looking away from the camera. This could be because faces at a significant angle were not part of the training dataset.

Whether the decision to omit these faces was implicit or explicit, a simple fix could do the trick: add more faces-at-an-angle to the training set. The typical data scientist’s workflow for doing that is the following (a rough sketch in code follows the list):

  1. write query to get list of candidate images from the database
  2. download the images from AWS S3 or other storage
  3. review the images yourself, or send them with a task description to an internal/external image labelling team
  4. gather the labels and integrate them into the training set
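
For concreteness, here is a rough Python sketch of steps 1 and 2 above. The database schema, column names, and bucket name are all invented for illustration; steps 3 and 4 usually mean yet more one-off glue code on top of this.

    # Steps 1 and 2 of the workflow; every table, column and bucket
    # name here is hypothetical.
    import sqlite3
    import boto3

    def fetch_candidates(db_path):
        # Step 1: query for images likely to contain faces at an angle.
        conn = sqlite3.connect(db_path)
        rows = conn.execute(
            "SELECT image_id, s3_key FROM images WHERE head_yaw_degrees > 45"
        ).fetchall()
        conn.close()
        return rows

    def download_images(rows, bucket="training-data"):
        # Step 2: pull each candidate image down from S3 for review
        # or for handing off to a labelling team.
        s3 = boto3.client("s3")
        local_paths = []
        for image_id, s3_key in rows:
            path = f"/tmp/{image_id}.jpg"
            s3.download_file(bucket, s3_key, path)
            local_paths.append(path)
        return local_paths

None of this is hard, but every iteration on the dataset means writing, debugging, and babysitting another variation of these scripts.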

This process involves writing custom scripts to move all that data around. If there’s a significant amount of manual work, you also need to coordinate with an internal data labelling team or an outsourced partner. In my experience, a single iteration of this process takes no less than a workday, and can stretch into several weeks when other teams and larger amounts of data are involved.

While the software engineer gets feedback on tens or even hundreds of ideas in a single day, the data scientist can try out maybe one idea per day, or worse. The difference is enormous!

Sometimes there is no way around this: if you want a million images labelled, you will have to wait weeks. You could keep an enormous pool of labellers always standing by, but that is expensive.

However, the workflow I described above is also typical for small datasets: even for a hundred examples, a similar set of scripts has to be written. It might take only an hour to go through a hundred images, but almost ten times as long to create the scripts for shuffling the images and results around.

The solution for iterating on small batches of data is to automate as much of it as possible by building in-house tools. How to speed up iterating on millions of datapoints is much less obvious, though. Tesla’s Director of AI, Andrej Karpathy, clearly thinks about this a lot and has even shared, at a very high level, some ideas they have implemented, like using information from the future as annotation for the past: in a continuous video stream you can automatically label that a car is about to pass by looking at what you detect it doing ten seconds later.
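
As a toy illustration of that idea (not Tesla’s actual pipeline), here is a Python sketch that labels a frame as “car about to pass” whenever a pass is actually observed within the next ten seconds; the frame format is invented for illustration.

    # Toy sketch of using the future as annotation for the past: a frame
    # gets the label "car_about_to_pass" if a pass happens within the
    # next HORIZON_S seconds. The frame format is illustrative only.
    HORIZON_S = 10.0

    def autolabel(frames):
        # frames: list of (timestamp, passed_ego) pairs sorted by time,
        # where passed_ego is True if a car passed the ego vehicle then.
        labels = {}
        for i, (t, _) in enumerate(frames):
            future_pass = any(
                passed
                for ft, passed in frames[i + 1:]
                if ft - t <= HORIZON_S
            )
            labels[t] = "car_about_to_pass" if future_pass else "background"
        return labels

The appeal is that labels produced this way cost nothing: they come from the system’s own future observations instead of human annotators.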

I haven’t found any good open-source or commercial software or tools that would significantly help iterate on datasets faster, the way git and GitHub do with code. I’ve created a list of data annotation tools, but most seem to address the actual box-drawing part, not the whole annotation workflow. My current explanation is that very few companies and teams iterate on their datasets at all. Whether this is because many real-world problems don’t require iterating on data, or because those teams are simply focused on the wrong things, I don’t know.

Academia magnifies this problem. In research, datasets are usually fixed, and most papers work on improving models, chasing state-of-the-art (SOTA) performance on standard benchmarks. In industry the approach is the opposite: a friend working on self-driving software recently told me he uses the exact same model architecture for all of his prediction tasks; most of his time goes into getting an appropriate dataset.

A tool that allowed iterating on datasets 10x faster than today would have a huge impact on the productivity of applied AI teams.


About

Taivo Pungas is Automation Lead at Veriff, where he leads AI product teams.

Previously, he built self-driving robot software at Starship, worked on software and data science at several startups, and has been writing for years.