Fake it ’til you make it — the advantages of using synthetic data in computer vision models

Lina Avancini Colucci
Infinity AI
Nov 12, 2021


Written by Lina Colucci, Brinnae Bent, and Andrew Weitz

Collecting and labeling real-world data is an expensive and time-consuming prerequisite to training machine learning (ML) models. Generating synthetic data, on the other hand, leapfrogs the traditional data collection and labeling process.

In this article, we will introduce synthetic data and discuss the many benefits of using synthetic data for computer vision applications. We also share a few examples of how synthetic data is currently being used to solve computer vision problems. To see examples of synthetic data in action, be sure to check out our recent article on our open-source synthetic dataset for fitness applications, InfiniteForm.

If your preferred medium is video/audio, check out our recent presentation on the benefits of synthetic data.

The need for labeled data (and a lot of it)

As ML scientists, data is critical to the work that we do. Machine learning models can only be as good as the data they are trained on. Furthermore, the majority of machine learning work in industry is supervised ML, which means not only do you need a lot of data, you need a lot of labeled data.

ML models (especially deep learning approaches) have trended toward using more and more data. However, access to large, well-labeled datasets is often a barrier for startups and individual researchers. Source.

Real-world datasets require labeling, which is traditionally performed by human annotators, who add bounding boxes or other annotations to each frame of a video. This is incredibly time-consuming and expensive. For example, $2M spent per year on labeling is considered a small-scale project.

Labeling real-world data is expensive and time-consuming. Source

What is synthetic data?

Synthetic data includes any data — images, videos, time series, and more — that is generated entirely via computation, as opposed to being measured directly with a sensor in the real world. Synthetic data generation can broadly be organized into three groups: 1) image composition, 2) implicit generative models, and 3) rendering with traditional computer graphics software. Here, we will focus on the third category.

By generating synthetic data via rendering, researchers have explicit control over their dataset, can generate unlimited training samples, and do not need to go through the laborious process of labeling each sample. Custom scenes allow ML practitioners to control what people are doing, the camera parameters, the environment, and the lighting. Rendered synthetic data also comes with pixel-perfect labels that might be hard or even impossible to get via manual annotation. Labels can include semantic segmentation, 2D and 3D keypoints, bounding boxes, 3D cuboids, depth, surface normals, camera position, object attributes (e.g. human body dimensions), and more.
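As a toy illustration of why rendered data comes with labels "for free," consider generating an image programmatically: because we place the object ourselves, its pixel-perfect segmentation mask and bounding box are known by construction, with no human annotator in the loop. This is a minimal NumPy sketch, not a real renderer:

```python
import numpy as np

def render_synthetic_sample(h=64, w=64, top=10, left=20, size=16):
    """Toy 'renderer': paints a square object onto a blank image.

    Because the generator chooses the object's placement, the
    segmentation mask and bounding box fall out of the generation
    parameters themselves.
    """
    image = np.zeros((h, w), dtype=np.uint8)
    image[top:top + size, left:left + size] = 255  # the "object"

    # Pixel-perfect labels derived directly from the generation parameters.
    mask = (image == 255)
    bbox = (left, top, left + size, top + size)  # (x_min, y_min, x_max, y_max)
    return image, mask, bbox

image, mask, bbox = render_synthetic_sample()
```

A production pipeline renders photorealistic 3D scenes instead of a flat square, but the principle is the same: every label is a byproduct of the scene description.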

By giving ML practitioners explicit control over a dataset’s properties, synthetic data can also be used to minimize bias in downstream models. With human-centric datasets, for example, aspects of appearance such as skin tone, body shape, and clothing can all be customized to fit the distribution of the intended target population. Synthetic data doesn’t require footage of real humans, so it also protects people’s privacy. Finally, synthetic data is less expensive than collecting and labeling real-world data, making it poised to democratize access to larger datasets and empower anyone to solve machine learning problems at the scale of big tech companies.
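Controlling a dataset's demographic makeup can be as simple as sampling character attributes from explicit target distributions before rendering. The attribute names and probabilities below are hypothetical placeholders; in a real pipeline they would drive the character generator:

```python
import random
from collections import Counter

# Hypothetical target distributions (not from any real population study);
# a real pipeline would feed these choices into the scene/character renderer.
TARGET_DISTRIBUTIONS = {
    "skin_tone": {"I": 0.17, "II": 0.17, "III": 0.17,
                  "IV": 0.17, "V": 0.16, "VI": 0.16},
    "body_shape": {"slim": 0.3, "average": 0.4, "broad": 0.3},
}

def sample_character(rng):
    """Draw one synthetic character's attributes from the target distributions."""
    return {
        attr: rng.choices(list(dist), weights=list(dist.values()))[0]
        for attr, dist in TARGET_DISTRIBUTIONS.items()
    }

rng = random.Random(0)  # seeded for reproducible datasets
characters = [sample_character(rng) for _ in range(10_000)]
counts = Counter(c["skin_tone"] for c in characters)
```

Because the distribution is stated in code rather than inherited from whoever happened to be in front of a camera, it can be audited and adjusted directly.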

How is synthetic data being used?

Case Study: Tesla Self-driving Algorithms

During the recent Tesla AI Day presentation, the team at Tesla called out a few areas where they really lean on synthetic data:

1. When data is difficult to source. You would have to drive many miles to find people and a dog running in the middle of a highway in the real world. Alternatively, Tesla just synthetically generates those rare scenes.

2. When data is difficult to label. Crowded environments, for example, are difficult for human annotators to label accurately.

3. When data must be generated in a closed loop. Tesla essentially created a video game of simulated streets, cities, and highways, and they ask their self-driving algorithms to “play” the game.

Case Study: Microsoft HoloLens trains their hand tracking algorithms on synthetic data

Synthetic data powers the fully articulated hand tracking system on the Microsoft HoloLens 2. As shown in the figure below, the Microsoft team generated synthetic humans to perform a variety of different hand movements. One advantage of synthetic data is the ability to easily simulate different camera types (such as the egocentric view shown below).

An interesting anecdote from the project is that the model was originally trained using short sleeve shirts. When tested in the summertime, it worked well. But in the fall, when people started wearing long sleeve shirts and coats, the model performance started breaking down. Because the team was using a synthetically generated dataset, they could easily add long sleeves to the simulated data and retrain the model. If they had been using a real-world dataset, they would have had to recollect all of the data with different sleeve types, which would be time-consuming and expensive.

Case Study: Google uses synthetic data to train robots to pick up clear objects

Training a robot to interact with transparent objects is incredibly challenging. In order for a robot to pick up an object, the algorithms need to both localize an object and make good predictions about its depth. This is difficult to do with transparent objects because their appearance varies dramatically based on the background and lighting.

Using synthetic data, Google Robotics trained neural networks to see transparent objects, allowing robots to pick up and interact with them successfully! This was enabled by generating synthetic (yet photorealistic) images of transparent objects, along with their ground truth surface normals, segmentation and edge masks, and depth maps.

The non-obvious benefits of synthetic data

The obvious benefit of using synthetic data for training computer vision models is improved performance. Check out our human pose estimation model trained on synthetic data, which outperformed Google’s state-of-the-art model.

Beyond model performance boosts, working with synthetic data has a few non-obvious benefits:

1. Debug faster! You can test out hypotheses using synthetic images instead of waiting for real-world data. You can also use it to determine exactly what kind of real-world data you need to collect and how much needs to be collected.

2. Characterize Models. You can use synthetic data to precisely characterize the performance of models across a specific variable. For example, let’s say you wanted to determine model performance at different camera angles. With synthetic data, you can simply generate a sweep of data at different camera angles and then analyze model performance as a function of camera angle.

3. Tailor the Model’s Reach. With synthetic data, you can generate training data that is both tailored to a specific problem and precisely controlled to have the type and amount of variation you want. For example, let’s say you wanted to estimate poses in two completely different use cases — a remote fitness application and an automated grocery store (like Amazon Go). These different contexts will have very different data: the environment (e.g. living room vs. store), poses, lighting, and camera angles (e.g. floor-level vs. ceiling-mounted) will all vary quite a bit depending on the use case.
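The model-characterization idea in point 2 can be sketched as a small sweep harness: generate a batch of data at each camera angle, evaluate the model on it, and tabulate performance by angle. The generator and evaluator below are stand-in stubs (the degrading "accuracy" is purely illustrative, not measured data); this is scaffolding, not a real pipeline:

```python
def generate_batch(camera_angle_deg, n=100):
    """Stub generator: in practice this would render n labeled images
    with the camera pitched at the given angle."""
    return [{"angle": camera_angle_deg, "id": i} for i in range(n)]

def evaluate_model(batch):
    """Stub evaluator: returns a toy 'accuracy' that degrades as the
    camera tilts away from eye level (illustrative numbers only)."""
    angle = batch[0]["angle"]
    return max(0.0, 1.0 - 0.005 * abs(angle))

def sweep_camera_angles(angles):
    """Generate a dataset at each angle and record model performance."""
    return {a: evaluate_model(generate_batch(a)) for a in angles}

results = sweep_camera_angles(range(0, 91, 15))
```

The same harness works for any variable you can parameterize in the renderer: lighting, occlusion, body shape, lens distortion, and so on.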

Synthetic Data at Infinity AI

At Infinity AI, we provide on-demand synthetic data for computer vision teams through the Pixelate API. Interested in using synthetic data in your own projects? Get in touch at info@edgeanalytics.io or check out our recently-released open-source synthetic dataset for fitness applications — InfiniteForm.
