9 Products To Build For ML Practitioners Who Need To Generate Synthetic Data

List of synthetic data tasks that appear in an ML practitioner's job. Insights on use cases and available products on the market.

Jul 19, 2022

In this post, we’ll talk about:

Tasks that require ML practitioners to generate synthetic data
Use cases of synthetic data
Products available today to perform those tasks

If you have a question, submit it here and I’ll get back to you with my thoughts.

AI is being infused into all forms of software products. But AI systems need large quantities of data to perform well. Many times, we don't have access to large quantities of data.

How do we build these AI models? Using synthetic data.

Synthetic data is data that has been artificially generated by software. It is fed into training algorithms, and the outcome is a trained model that can be used in production. The quality of the synthetic data matters a lot: it needs to mimic real-world data as closely as possible, or the resulting AI models will be useless in the real world.

ML practitioners need to generate this synthetic data, but it's a lot of work. To serve this need, dozens of startups have popped up in the last 3 years. Synthetic data generation has garnered a lot of attention because of the role it plays in building modern AI systems.

In this post, I've listed 9 products that you can build for ML practitioners who need to generate synthetic data in different scenarios, along with examples of products that are already available for each task.

Product #1: Generating image and video data

This has been one of the most popular use cases for synthetic data. ML practitioners need image and video data to build models for self-driving cars, autonomous robots, video surveillance, deepfakes, and more.

Examples: Parallel Domain, Scale Synthetic, Cognata, Cvedia, Synthesis AI, Deep Vision, Lexset, Neurolabs

Product #2: Generating text data

Synthetic text data can be used in a wide variety of scenarios. Text data is available in abundance, large language models are being trained every day, and many open source models are available. ML practitioners need synthetic text data to train language systems for speech recognition, customer support, and more.

Examples: GPT-3, PaLM, OPT-175B, BigScience BLOOM, Amazon Comprehend, Cohere, and thousands of models available on Hugging Face

Product #3: Generating digital twins of training datasets

This is useful when you have a small dataset and want to augment it by adding synthetic data. You need the synthetic data to be similar to the existing data, but not exactly the same. ML practitioners need a system that can understand the characteristics of existing data and augment it by creating synthetic data.

Examples: Gretel, Datomize, Synthesized, Anyverse, Autonama AI, Sogeti
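To make the idea of "similar, but not exactly the same" concrete, here is a minimal sketch that is not how any of the products above work. It assumes a small, purely numeric dataset and models each column as an independent Gaussian, which ignores correlations between columns. The names fit_columns and sample_rows, and the sample values, are hypothetical.

```python
import random
import statistics

def fit_columns(rows):
    """Estimate a mean and standard deviation for each numeric column."""
    columns = list(zip(*rows))
    return [(statistics.mean(c), statistics.stdev(c)) for c in columns]

def sample_rows(params, n, seed=0):
    """Draw n synthetic rows, sampling each column independently.

    Assumption: columns are independent and roughly Gaussian.
    Real generators model the joint distribution instead.
    """
    rng = random.Random(seed)
    return [[rng.gauss(mu, sigma) for mu, sigma in params] for _ in range(n)]

# Hypothetical small dataset with two numeric features.
real = [[5.1, 3.5], [4.9, 3.0], [6.2, 2.9], [5.8, 2.7], [5.5, 3.2]]

params = fit_columns(real)
synthetic = sample_rows(params, n=100)

# The augmented dataset keeps the real rows and appends the synthetic ones.
augmented = real + synthetic
```

A real product would also need to handle categorical columns, correlations, and outliers, which is where most of the work lies.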

Product #4: Generating data while preserving privacy

This is useful when you can't use data directly to train your model and need to anonymize it to preserve privacy. Many times this is required by law as well. In these situations, ML practitioners need a product that can generate data that's similar in nature to the existing data while preserving privacy by anonymizing and masking sensitive information.

Examples: Betterdata, Facteus, Generatrix, Diveplane Geminai
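As a rough illustration of the anonymizing and masking step, here is a small sketch using only the Python standard library. The record fields, the SECRET_KEY value, and the helper names are all hypothetical. Keyed hashing and masking like this are only pseudonymization; on their own they do not guarantee privacy, and the products above go further by generating new records.

```python
import hashlib
import hmac

# Hypothetical secret; in practice this would come from a secrets manager.
SECRET_KEY = b"replace-with-a-real-secret"

def pseudonymize(value, key=SECRET_KEY):
    """Replace an identifier with a stable keyed hash (truncated for brevity)."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()[:12]

def mask_email(email):
    """Keep the first character and the domain, mask the rest."""
    local, _, domain = email.partition("@")
    return local[:1] + "***@" + domain

def anonymize(record):
    """Return a copy of the record with direct identifiers removed or coarsened."""
    return {
        "user_id": pseudonymize(record["user_id"]),
        "email": mask_email(record["email"]),
        "age_band": f"{record['age'] // 10 * 10}s",  # e.g. 34 becomes "30s"
        "amount": record["amount"],
    }

# Hypothetical input record.
record = {"user_id": "u-1029", "email": "jane.doe@example.com", "age": 34, "amount": 120.5}
safe = anonymize(record)
```

Because the hash is keyed and deterministic, the same user maps to the same pseudonym across records, so joins still work without exposing the original ID.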

Product #5: Generating financial data

Financial firms use synthetic data to build systems that can detect fraud and money laundering. They also use these systems to better understand customer transactions. ML practitioners need a product that can generate synthetic data that looks like financial transactions.

Examples: Hazy
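Here is a toy sketch of what "data that looks like financial transactions" could mean in code. Everything in it is an assumption made for illustration: the schema, the 2% fraud rate, the log-normal amounts, and the idea that fraudulent amounts skew larger. It is not based on how Hazy or any other product generates data.

```python
import random
from datetime import datetime, timedelta

def generate_transactions(n, fraud_rate=0.02, seed=0):
    """Generate n fake transactions with a labeled fraud flag."""
    rng = random.Random(seed)
    start = datetime(2022, 1, 1)
    rows = []
    for i in range(n):
        is_fraud = rng.random() < fraud_rate
        # Assumption: fraudulent amounts are drawn from a larger distribution.
        amount = rng.lognormvariate(5.0 if is_fraud else 3.5, 1.0)
        rows.append({
            "txn_id": i,
            "account": f"ACC{rng.randint(1, 50):04d}",
            # Random timestamp within a 30-day window.
            "timestamp": start + timedelta(minutes=rng.randint(0, 60 * 24 * 30)),
            "amount": round(amount, 2),
            "is_fraud": is_fraud,
        })
    return rows

transactions = generate_transactions(1000)
fraud_count = sum(t["is_fraud"] for t in transactions)
```

The labeled is_fraud column is what makes data like this usable for training a fraud detector; a realistic generator would learn these distributions from real transactions rather than hard-coding them.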

Product #6: Generating healthcare data

Healthcare firms use synthetic data for clinical trials and healthcare analytics. Insurance companies use petabytes of synthetic data, generating histories of insurance claims, treatment information, and other pertinent medical information. ML practitioners need a product that can generate healthcare data to be fed into a training algorithm.

Examples: MDClone, Syntegra, Veil AI

Product #7: Generating 3D structures

This is useful to companies building solutions in automotive, smart office, AR, VR, and fitness. These firms need 3D models of humans and objects to build their models. ML practitioners need a product that can generate this synthetic data so that they can train their models.

Examples: Bifrost, Datagen, OneView

Product #8: Generating test data

ML practitioners need products that can create test datasets so that they can test their ML models. They usually do this by splitting the existing dataset into training and test sets. But synthetic test data generators can help ML practitioners make their systems even more robust by covering cases that the held-out split doesn't.

Examples: BizDataX, Curiosity, GenRocket
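The sketch below contrasts the usual train/test split with one simple way of producing extra synthetic test cases: adding small random noise to held-out examples. The helper names, the 5% noise level, and the toy data are hypothetical, and noise perturbation is only one possible technique, not what the products above necessarily do.

```python
import random

def train_test_split(examples, test_fraction=0.2, seed=0):
    """Shuffle the examples and hold out a fraction for testing."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def perturb(examples, noise=0.05, seed=1):
    """Create synthetic test cases by scaling each feature by up to +/- noise."""
    rng = random.Random(seed)
    return [
        ([x * (1 + rng.uniform(-noise, noise)) for x in features], label)
        for features, label in examples
    ]

# Hypothetical dataset of (features, label) pairs.
data = [([float(i), float(i) * 2], i % 2) for i in range(1, 11)]

train, test = train_test_split(data)
synthetic_test = perturb(test)
```

A model that scores well on test but poorly on synthetic_test is sensitive to small input changes, which is the kind of robustness gap synthetic test data is meant to expose.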

Product #9: Generating data with open source tools

This is for ML practitioners who want to customize everything from scratch and retain full control over how the generator model works. You can use open source tools and build a solution on top of them.

Examples: Twinify, Synner, Synthea

Where to go from here?

The goal of this post is to show the tasks that appear in an ML practitioner's life when they're working on synthetic data. If you're a builder of ML products, you can talk to potential users and get an idea of the status quo across these various tasks. You should see what they need help with and build a product that suits their needs.

If you find this newsletter valuable, consider subscribing to it and sharing it with your friends.

Infinite Curiosity Newsletter
