Product Category Memo #7: Data Labeling
In-depth analysis of products that are used to label data for machine learning
Welcome to Infinite Curiosity, a weekly newsletter at the intersection of Artificial Intelligence and Startups. Tech enthusiasts across 200 countries have been reading what I write. Subscribe to this newsletter for free to receive it in your inbox every week:
Hello friends,
Welcome to the 7th edition of Product Category Memo. The goal of this segment is to do an in-depth analysis of a specific subsector in ML. In this post, we’ll talk about:
Why do we need products to label data for machine learning
What are the characteristics of a good data labeling product
How to decide what type of labeling system to use
What products are competing in this market
What factors drive pricing
How these products acquire customers
I asked DALL-E to generate an image of a robot sorting through images to get them labeled. Looks like it still has a lot of work to do. Let’s dive in.
Why do we need products to label data?
Machine Learning (ML) is being used in production across a number of use cases. You need labeled data to build supervised learning systems. This is especially relevant for use cases in image recognition, video analysis, text processing, and speech recognition.
For example, let's say you're building a product for a manufacturing facility. The goal of the product is to recognize the type of hardware component in a given image. For the sake of simplicity, let's assume that there are only three types of components.
To do this, you first need to provide many examples of what each component looks like. The ML system will learn from these images and build a model that can recognize these parts. If you collect 50,000 images for each category, then someone needs to label 150,000 images to build this ML model. That's a lot of work. This is where data labeling tools come into play.
What happens if you don't have a good labeling system? Here's what you’ll have to deal with:
models won't be able to predict accurately
models won't stay relevant as time goes by
the cost of labeling becomes unsustainable
models might end up getting biased
What are the characteristics of a good data labeling product?
The output of a data labeling solution is used by anybody who needs to build ML models. It's usually data scientists and ML engineers.
Data labeling can be done manually by humans. These humans can be in-house employees or part-time contractors.
You need to provide these human labelers with a tool to do the labeling work. You can build this tool in-house. Or you can use a product available on the market.
Data labeling can also be done in a fully automated way by software products. But we are not at a point where we can blindly trust the labels provided by software, so humans still have to verify the labels that the software provides.
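One way to keep humans in the loop without having them re-check every single label is to send only the low-confidence labels to a reviewer. Here's a minimal sketch of that idea in Python. It assumes the automated labeler returns a confidence score between 0 and 1 for each label. The `Prediction` class, the `route_for_review` function, and the 0.9 threshold are hypothetical names and values I picked for illustration, not the API of any product mentioned in this post.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    """A label proposed by an automated labeler (hypothetical structure)."""
    item_id: str
    label: str
    confidence: float  # assumed to lie in the range [0, 1]


def route_for_review(predictions, threshold=0.9):
    """Split predictions into auto-accepted ones and ones a human should check.

    The threshold is an assumption; a real project would tune it against
    its own quality requirements.
    """
    accepted, needs_review = [], []
    for p in predictions:
        if p.confidence >= threshold:
            accepted.append(p)
        else:
            needs_review.append(p)
    return accepted, needs_review


# Example with made-up predictions for the three-component use case.
preds = [
    Prediction("img_001", "bolt", 0.98),
    Prediction("img_002", "gear", 0.61),
    Prediction("img_003", "bracket", 0.93),
]
accepted, needs_review = route_for_review(preds)
print("auto-accepted:", [p.item_id for p in accepted])
print("send to human:", [p.item_id for p in needs_review])
```

A lower threshold means less human work but more unverified labels, so where you set it depends on how tolerant you are of labeling errors.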
Here are the features that data scientists and ML engineers look for in a data labeling product:
can provide accurate labels
is low-cost
is fast
can show the confidence level for each label
can quantify the error rate
doesn't require ML practitioners to do any heavy lifting
is easy to use
can label data of different types
can use Active Learning to improve the quality of datasets by identifying which data points are the most useful to label next
is collaborative
doesn't require labelers to keep switching context
can use consensus to help reduce errors and biases of individual labelers
has auditing functionality to verify the accuracy of labels
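To make the consensus and confidence-level features a bit more concrete, here's a minimal sketch of majority voting across several labelers. It's one simple way to do consensus, not how any particular product implements it. The `consensus_label` function is a hypothetical helper, and the "agreement" score it returns is only a rough proxy for confidence, not a calibrated probability.

```python
from collections import Counter


def consensus_label(votes):
    """Return (label, agreement) for one item labeled by several people.

    agreement is the share of labelers who picked the most common label.
    If two or more labels tie for first place, the label is None so the
    item can be escalated to an auditor.
    """
    if not votes:
        raise ValueError("need at least one vote")
    counts = Counter(votes).most_common()
    top_label, top_count = counts[0]
    agreement = top_count / len(votes)
    if len(counts) > 1 and counts[1][1] == top_count:
        return None, agreement  # tie: no consensus
    return top_label, agreement


# Example with made-up votes from three labelers per image.
votes_by_item = {
    "img_001": ["bolt", "bolt", "gear"],
    "img_002": ["gear", "bracket", "bolt"],
}
for item_id, votes in votes_by_item.items():
    label, agreement = consensus_label(votes)
    status = label if label is not None else "needs audit"
    print(f"{item_id}: {status} (agreement {agreement:.2f})")
```

Items with low agreement are good candidates for the auditing step, since they're where individual labelers are most likely to have made mistakes or read the guidelines differently.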
What points do you need to consider for the labeling work?
Here are 7 key considerations before you decide on how to label your data:
Available budget: How much can you afford to spend to label this data?
Quality requirements: How tolerant are you of labeling errors?
Speed requirements: How quickly do you want the labels? Do you need to do this on a recurring basis?
Requirement of domain knowledge: Do you need domain knowledge to label the data?
Data privacy rules: What are the privacy rules for the dataset in question? Can you share it with contractors around the world to get it labeled?
Types of data: How many different data types do you have to label (e.g. images, text, video, audio, lidar)?
Role of labeling within the business: Is labeling a core part of your company's existence? Or are you doing this to support your main product offering?
What products are competing in this market?
There are a number of open source as well as commercial offerings in the market.
Open source:
Label Studio
CVAT
Audio-annotator
Doccano
ImgLab
Universal-data-tool
LabelMe
Labelimg
VoTT
Commercial:
Scale AI
Labelbox
Appen
Datasaur
SuperAnnotate
Amazon SageMaker Ground Truth
Snorkel AI
Sloth
Tagtog
Dataturk
Playment
LightTag
V7
Supervise.ly
Encord
Dataloop
Hive Data
Sama
What factors drive pricing?
Here are the factors that drive pricing:
Number of datapoints that need to be labeled
Number of labels per datapoint
Number of validators per label
Number of labelers doing the work
Number of labeling workflows being used
Analytics offering around the labeling work
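To see how these levers can combine, here's a back-of-the-envelope sketch in Python. Vendors in this market generally quote custom prices, so treat the formula and every rate in it as made-up placeholders. The `estimate_labeling_cost` function and all of its default prices are hypothetical and don't reflect any real vendor's pricing.

```python
def estimate_labeling_cost(
    datapoints,
    labels_per_datapoint,
    validators_per_label,
    labelers,
    workflows,
    price_per_label=0.05,         # hypothetical rate per label
    price_per_validation=0.02,    # hypothetical rate per validation pass
    price_per_labeler_seat=50.0,  # hypothetical fee per labeler
    price_per_workflow=200.0,     # hypothetical fee per workflow
    analytics_fee=0.0,            # hypothetical flat fee for analytics
):
    """Return an itemized cost estimate built from the pricing factors."""
    label_tasks = datapoints * labels_per_datapoint
    validation_tasks = label_tasks * validators_per_label
    return {
        "labeling": label_tasks * price_per_label,
        "validation": validation_tasks * price_per_validation,
        "seats": labelers * price_per_labeler_seat,
        "workflows": workflows * price_per_workflow,
        "analytics": analytics_fee,
    }


# Example: 150,000 images, one label each, two validators per label,
# ten labelers, and a single workflow.
quote = estimate_labeling_cost(150_000, 1, 2, 10, 1)
for name, amount in quote.items():
    print(f"{name:>10}: ${amount:,.2f}")
print(f"{'total':>10}: ${sum(quote.values()):,.2f}")
```

The point of the sketch is that the per-datapoint factors multiply each other. Doubling the number of validators per label doubles the validation line, which is why quality requirements have such a big effect on the final bill.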
How do these products acquire customers?
These products typically acquire customers in the following ways:
Enterprise model: These products need top-down adoption to drive usage. They deliver labeled data as the output and are used by customers who don't want the hassle of labeling. Companies that have been successful in this sector have taken this approach. Because labeling requirements differ so drastically from one customer to the next, these companies don't usually publish standardized pricing. Instead, they use the levers mentioned in the previous section to quote prices to each customer. These are premium-priced products targeted at large companies.
Open-source adoption: Companies provide labeling tools, and customers use them to do the labeling work themselves. These products appeal to customers who want to keep the labeling product in-house rather than use a third-party provider. They aim to get individual practitioners to adopt them by offering an open source package. That's how a product makes its way into an organization.
Types of data it can label: Companies have a variety of data requirements. They may need to label images, text, lidar, and more. They usually prefer to go with a single company that can handle all their needs (as opposed to using multiple labeling products), so a product needs to be able to label a variety of data types.
Integration capabilities: A product needs to play well with all the tools being used by the ML teams. If your product doesn't integrate well, it won't be adopted by companies.