Infinity AI: For ML engineers, by ML engineers

Lina Avancini Colucci
Published in Infinity AI
5 min read · Jun 10, 2022

Our mission at Infinity AI is to build data-centric tools that accelerate progress in machine learning (ML). The past 7 years have seen massive commoditization of cutting-edge deep learning models due to the release of open-source, model-centric frameworks like TensorFlow (2015), PyTorch (2017), and Hugging Face’s model platform (2019). Now, the ML community is shifting its collective focus to the next major chapter of ML progress: the data itself.

This is the time for data-centric AI

A new era of data-centric AI has begun. Unlike the previous model-centric AI era, where engineers treated data almost like a static artifact and iterated on model architectures, data-centric AI treats datasets as something that can (and should) be curated and optimized. It makes sense — after all, an ML model is only as good as the data it is trained on. Bad data leads to poor model performance, and a data focus is necessary to unlock the full potential of AI.

Andrew Ng, one of the fathers of AI, called for a data-centric AI shift last year (2021) [REF].

Synthetic data is the ultimate form of data-centric AI. Data is no longer a bottleneck.

Synthetic data makes data available in continuous supply. Gathering, curating, and annotating a new dataset is no longer a massive operational burden. Synthetic data leverages computational resources rather than human labor: engineers can simply make an API call and get endless fresh data. The key use cases of synthetic data include the ability to:

  1. Train models: bootstrap new ML models fully with synthetic data, or mix synthetic and real-world data together to improve model performance.
  2. Characterize models: generate sweeps of synthetic data across specific parameters (like camera position, lighting, or body types) and characterize model performance as a function of a parameter.
  3. Fix failure cases: generate synthetic digital twins of real-world failure cases (sample amplification) and re-train on the expanded training dataset to fix the failures.
  4. Boost privacy and minimize data bias: comply with the strictest data protection laws by utilizing synthetic data rather than requiring real people’s faces or identifiable data; generate balanced, diverse datasets for equitable ML.
  5. Do robust testing (CI/CD): ensure that models pass “data unit tests” before deploying them into production; generate reliable data for integration testing; conduct CI/CD for large-scale data pipelines.
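To make the "characterize models" use case concrete, here is a minimal sketch of a parameter sweep. Everything in it is an assumption for illustration: `generate_sample` is a toy stand-in for a synthetic data generator (it is not the Infinity API), `toy_model` is a fixed-threshold classifier, and "lighting" is reduced to a single number that scales how strongly the label shows up in one pixel value.

```python
import random


def generate_sample(lighting, rng):
    """Hypothetical generator: one (pixel, label) pair rendered at a given lighting level."""
    label = rng.randint(0, 1)
    # Brighter scenes separate the two classes more clearly; noise is constant.
    pixel = label * lighting + rng.gauss(0.0, 0.2)
    return pixel, label


def toy_model(pixel):
    """Toy classifier with a threshold tuned for bright scenes (lighting near 1.0)."""
    return 1 if pixel > 0.5 else 0


def accuracy_at(lighting, n_samples=2000, seed=0):
    """Generate n_samples at one lighting level and score the model on them."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_samples):
        pixel, label = generate_sample(lighting, rng)
        correct += int(toy_model(pixel) == label)
    return correct / n_samples


def sweep(lightings):
    """Model accuracy as a function of the swept parameter."""
    return {lighting: accuracy_at(lighting) for lighting in lightings}


if __name__ == "__main__":
    for lighting, acc in sweep([0.2, 0.4, 0.6, 0.8, 1.0]).items():
        print(f"lighting={lighting:.1f}  accuracy={acc:.3f}")
```

In this toy setup accuracy falls as lighting drops, which is the kind of curve a sweep is meant to expose. With a real generator the swept parameter could be camera position or body type instead.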

We became enamored with synthetic data at our previous ML consulting company because it could directly plug into existing ML pipelines and immediately address the key limitations of real-world data. Synthetic data had pixel-perfect labels and reduced data gaps by giving precise control over data distributions. Most importantly, it just worked!

All the top tech companies started using it last year

We got hooked on synthetic data in late 2020. It turns out a lot of the big tech companies also became fans of training on synthetic data:

Tesla: for self-driving. Watch demo. (Plus, Andrej Karpathy, Head of AI at Tesla, often tweets about the benefits of synthetic data [1] [2].)

Microsoft HoloLens: for hand tracking. Watch demo.

Apple: for natural scene understanding. Read paper.

Microsoft: for eye tracking. Read paper.

NVIDIA: for self-driving. See demo.

Google: for pose estimation. Read paper.

These announcements are all from 2021 or early 2022. Synthetic data is the future and the time for it is now.

There are interesting, emergent properties of synthetic data

The ML community has a scarcity mentality with regard to data today. Synthetic data turns this into an abundance mentality. What happens when data is no longer a limited resource? It turns out that a lot of interesting things happen:

The Data IDE

Synthetic data turns data into something interactive, like an IDE where you can easily run and test code, see the results, and log error messages. Synthetic data enables us to imagine a unit test for an ML model. How can we verify a model is going to perform well within the distributions we want it to perform well in? Synthetic data can give precise control over the data specs.
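One way to picture a unit test for an ML model is sketched below. It assumes a hypothetical `DataSpec` that describes the distribution the model must handle, a toy `generate_dataset` standing in for a real synthetic data generator, and a toy threshold model; none of these names come from an actual library. The test function uses a plain `assert`, so a runner such as pytest could collect it.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class DataSpec:
    """Hypothetical spec: the lighting range the model is expected to handle."""
    min_lighting: float
    max_lighting: float
    n_samples: int = 1000
    seed: int = 0


def generate_dataset(spec):
    """Toy stand-in for a synthetic data generator: returns (pixel, label) pairs."""
    rng = random.Random(spec.seed)
    samples = []
    for _ in range(spec.n_samples):
        lighting = rng.uniform(spec.min_lighting, spec.max_lighting)
        label = rng.randint(0, 1)
        pixel = label * lighting + rng.gauss(0.0, 0.1)
        samples.append((pixel, label))
    return samples


def toy_model(pixel):
    """Toy classifier: predicts class 1 when the pixel is bright enough."""
    return 1 if pixel > 0.4 else 0


def accuracy(model, samples):
    """Fraction of samples the model labels correctly."""
    return sum(model(x) == y for x, y in samples) / len(samples)


def test_bright_scenes_meet_accuracy_bar():
    """Data unit test: the model must reach 95% accuracy inside the spec."""
    spec = DataSpec(min_lighting=0.8, max_lighting=1.0)
    assert accuracy(toy_model, generate_dataset(spec)) >= 0.95
```

The spec is the point of the design: the data distribution under test is written down explicitly, so a failing test names the conditions the model cannot handle.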

Model-in-the-Loop

Synthetic data lets us approach every ML problem like a reinforcement learning (RL) problem. We can hook up a parameterized synthetic data generator and an ML model in a closed-loop way, and have the generator synthesize whatever data makes the model perform well.
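The closed loop can be sketched in a few lines. The version below is a simple greedy heuristic, not reinforcement learning proper: in each round it finds the generator setting where the model scores worst, asks the generator for more data at that setting, and refits. The generator, the lighting parameter, and the threshold "model" are all hypothetical toys chosen so the example runs on the standard library alone.

```python
import random

LIGHTINGS = [0.4, 0.6, 0.8, 1.0]  # generator settings the loop can choose from


def generate(lighting, n, rng):
    """Hypothetical parameterized generator: returns n (pixel, label) pairs."""
    samples = []
    for _ in range(n):
        label = rng.randint(0, 1)
        samples.append((label * lighting + rng.gauss(0.0, 0.1), label))
    return samples


def fit_threshold(train):
    """Toy 'model': a threshold halfway between the two class means."""
    pos = [x for x, y in train if y == 1]
    neg = [x for x, y in train if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def accuracy(threshold, samples):
    """Fraction of samples on the correct side of the threshold."""
    return sum((x > threshold) == bool(y) for x, y in samples) / len(samples)


def closed_loop(rounds=3, seed=0):
    """Return the (worst lighting, worst accuracy) pair seen at each evaluation."""
    rng = random.Random(seed)
    eval_sets = {lighting: generate(lighting, 1000, rng) for lighting in LIGHTINGS}
    train = generate(1.0, 200, rng)  # start with bright scenes only
    history = []
    for round_idx in range(rounds + 1):
        threshold = fit_threshold(train)
        scores = {lighting: accuracy(threshold, s) for lighting, s in eval_sets.items()}
        worst = min(scores, key=scores.get)
        history.append((worst, scores[worst]))
        if round_idx < rounds:
            # Feedback step: synthesize more data where the model is weakest.
            train += generate(worst, 200, rng)
    return history


if __name__ == "__main__":
    for worst, score in closed_loop():
        print(f"worst lighting={worst:.1f}  accuracy={score:.3f}")
```

In this toy the worst-case accuracy rises over the rounds because the generator keeps filling the gap the model exposes. A real system would swap in an actual training step and a richer search over generator parameters.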

Queryable Data

When synthetic data is available in abundance, we have the ability to cherry-pick (and create) data for specific needs, versus being limited to what is both available and open-source on platforms like Kaggle. In practice, this means that we can build ML models that we want, when we want them. In other words, synthetic data allows us to build an ML pipeline that is goal-driven versus data-availability-driven.

It’s all about the people

When we started Edge Analytics — an ML consulting company — we wanted to work on the hardest, most interesting problems with the best people we’d ever met. We got to do that for over 3 years and it was the joy of a lifetime. Working at Edge even made the pandemic fly by — the work was so fun and meaningful.

In order to move on to anything else we had to be convinced that it was at least an order of magnitude greater in impact than what we were already pursuing at Edge. We were convinced synthetic data was a remarkable tool for our own consulting work, but weren’t sure it was the right thing to build an entire company around (at least as the first product). So we gave ourselves a few months to talk ourselves out of it. We listed out the potential weak points of synthetic data as a startup idea and rigorously went about testing each one. Eventually, we reached a threshold level of conviction around both the tech and business opportunity that synthetic data represented. We spun out Infinity AI and raised our seed round at the end of 2021.

Our mission as individuals is to advance the state of the art in ML. We did this for 3 years at Edge, and we saw an opportunity to do that with synthetic data at Infinity. But one thing remains the same: we’re still building a team of the best people we’ve ever met to pursue the most interesting work of our lives.

Thank you to Lauren Friedman, Joshua d’Arcy, Diana Kimball Berlin, and Sidney Primas for reading early drafts of this blog and providing feedback.

Infinity AI. We’re a dedicated team of engineers who build tools to accelerate progress in ML. We’d love to share our journey of building this company with you. Follow us on LinkedIn or let’s find some time to chat!

Reach out at lina@toinfinity.ai (I read every email and love meeting new people).

Some of our past blogs:
January: InfiniteRep: An open-source synthetic dataset for remote fitness and PT applications
February: Announcing the Infinity API: Data at your fingertips
March: New Infinity API Features
April: Tools for the Infinity API
