Welcome to Infinite Curiosity, a weekly newsletter that explores the intersection of Artificial Intelligence and Startups. Tech enthusiasts across 200 countries have been reading what I write. Subscribe to this newsletter for free to receive it in your inbox every week:
As LLMs get adopted across more and more sectors, researchers are working on figuring out how to make them faster/better/cheaper. There are a few relevant dimensions: building better architectures, building better algorithms, using faster hardware, using better data, and more. This post focuses on the data aspect.
I recently had an excellent conversation with Ari Morcos (founder/CEO of Datology) on this topic. You can listen to it on Spotify or Apple Podcasts.
LLMs are trained on large datasets. The training process is, in some sense, a way to extract all the useful information from the training data.
Each data point you use to train the model costs money. But not every data point adds an equal amount of information.
In fact, some data points don't add any information. And some data points are actually detrimental. What can we do here? This is where algorithmic data curation comes into play.
What exactly is algorithmic data curation?
It refers to the process of using ML algorithms to manage, organize, and enhance the quality of a dataset. This process is crucial in environments where data is vast and varied. Manually verifying the fidelity of each data point is not possible. You need automated systems to ensure that the data remains useful and accessible for analysis and decision-making.
Why bother with algorithmic data curation for LLMs?
There are three key reasons:
Simply increasing data size doesn't necessarily improve model quality. This is especially true if the data adds redundancy without adding new information.
You need to discern and filter out semantic duplicates, and manage semantic redundancy so the dataset captures natural variability without excessive replication.
Identifying and removing bad data (misleading or incorrect information) is crucial as it disproportionately impacts model performance.
What are the key components of algorithmic data curation?
There are many components, but I'll discuss 7 points here:
Assessing conceptual complexity: A sophisticated element of data curation is the algorithm's ability to assess and understand the conceptual complexity inherent in the data. This involves recognizing different categories and sub-categories within the data, and determining how much variation is necessary for the model to learn effectively about each category.
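Here's a rough sketch of one way this could look in code: cluster document embeddings and treat each cluster's spread as a crude proxy for how much variation that concept needs. The embeddings, cluster count, and weighting heuristic below are all illustrative assumptions, not a prescribed method.

```python
import numpy as np
from sklearn.cluster import KMeans

# Placeholder: an (n_docs, dim) array of document embeddings from any
# embedding model. Random data here just to keep the sketch runnable.
embeddings = np.random.rand(10_000, 384)
n_clusters = 50  # assumed: roughly one cluster per "concept"

kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(embeddings)

# Per-cluster spread: mean distance of members to their centroid.
# Tight clusters (simple, homogeneous concepts) likely need fewer examples;
# loose clusters (complex, varied concepts) may warrant keeping more.
spread = np.zeros(n_clusters)
for c in range(n_clusters):
    members = embeddings[kmeans.labels_ == c]
    spread[c] = np.linalg.norm(members - kmeans.cluster_centers_[c], axis=1).mean()

# One possible heuristic: sample from each cluster in proportion to its spread.
sampling_weights = spread / spread.sum()
```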
Data ordering and batching: The order in which data is presented to the model and how it is batched can significantly affect learning outcomes. Algorithmic data curation includes optimizing these aspects to enhance learning efficiency and model performance. This might involve strategies that present data in a sequence that is most likely to reinforce learning or adjust the model's exposure to various data types over time.
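One simple way to sketch this is a curriculum-style ordering: score each example by difficulty, front-load the easier ones, and shuffle the rest. The `difficulty` field and the warmup fraction are assumptions made purely for illustration.

```python
import random

def curriculum_batches(examples, batch_size, warmup_fraction=0.2):
    """Yield batches that present easier examples first, then mix the rest.

    Assumes each example dict carries a precomputed 'difficulty' score,
    e.g. the loss assigned by a small reference model (a placeholder here).
    """
    ordered = sorted(examples, key=lambda ex: ex["difficulty"])
    cutoff = int(len(ordered) * warmup_fraction)
    head, tail = ordered[:cutoff], ordered[cutoff:]
    random.shuffle(tail)  # avoid a strictly easy-to-hard ordering after warmup
    ordered = head + tail
    for i in range(0, len(ordered), batch_size):
        yield ordered[i:i + batch_size]
```

A real training pipeline would do this at the data-loader level, but the idea is the same: the order and composition of batches is itself a curation decision.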
Semantic deduplication: This involves identifying data points that are essentially identical in meaning but may appear different due to variations in presentation or format. For example, two descriptions of the same event by different authors might be treated as duplicates even though they are worded differently. This helps in reducing redundancy without losing any unique information.
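Here's a minimal sketch of semantic deduplication, assuming you have an embedding model handy (the model name and similarity threshold are just placeholders):

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed dependency

texts = [
    "The launch was delayed by two days due to bad weather.",
    "Bad weather pushed the launch back by 48 hours.",
    "The company reported record quarterly revenue.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")        # placeholder model
emb = model.encode(texts, normalize_embeddings=True)   # unit vectors, so dot = cosine

keep, threshold = [], 0.85  # tunable assumption
for i in range(len(texts)):
    # Keep example i only if it isn't too similar to anything already kept.
    if all(np.dot(emb[i], emb[j]) < threshold for j in keep):
        keep.append(i)

deduped = [texts[i] for i in keep]  # the two launch sentences collapse to one
```

This pairwise pass is quadratic, so at real scale you'd reach for approximate nearest-neighbor search instead. The point is just that "duplicate" is defined in embedding space, not by exact string match.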
Handling semantic redundancy: Managing semantic redundancy is important for maintaining the diversity of examples without overwhelming the model with near-identical data. This involves differentiating between when multiple data points effectively convey the same information vs when they add meaningful diversity. The balance between redundancy and diversity is key to training robust models.
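One way to sketch that balance: group examples into semantic clusters and cap how many each cluster can contribute, so common concepts don't drown out rare ones. The cluster count and per-cluster cap below are arbitrary assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Placeholder: an (n, dim) array of text embeddings for the corpus.
emb = np.random.rand(5_000, 384)

labels = KMeans(n_clusters=200, n_init=10, random_state=0).fit_predict(emb)

# Cap each semantic cluster's contribution: enough examples to capture the
# concept's natural variability, but not hundreds of near-identical copies.
max_per_cluster = 10
rng = np.random.default_rng(0)
kept = []
for c in np.unique(labels):
    members = np.flatnonzero(labels == c)
    kept.extend(rng.choice(members, size=min(max_per_cluster, len(members)),
                           replace=False).tolist())
```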
Filtering out bad data: Identifying and removing incorrect/misleading data is critical. In unsupervised learning, the data isn't labeled. So determining what constitutes "bad" data can be challenging but is crucial for maintaining the model's accuracy. Techniques might include identifying out-of-distribution samples or using heuristic methods to flag data likely to be erroneous.
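Here's a toy sketch of the heuristic side of this: a few cheap checks that flag text likely to be low quality. The specific thresholds are illustrative assumptions, not recommendations.

```python
import re

def looks_bad(text: str) -> bool:
    """Flag text that is probably low-value training data."""
    if len(text) < 200:                                  # too short to carry much signal
        return True
    alpha_ratio = sum(ch.isalpha() for ch in text) / max(len(text), 1)
    if alpha_ratio < 0.6:                                # mostly markup, symbols, or numbers
        return True
    words = re.findall(r"\w+", text.lower())
    if words and len(set(words)) / len(words) < 0.3:     # highly repetitive boilerplate
        return True
    return False

raw_docs = [
    "<div><span>404</span></div>",     # too short: fails the length check
    "buy now buy now buy now " * 20,   # repetitive spam: fails the repetition check
]
clean_docs = [d for d in raw_docs if not looks_bad(d)]   # both get filtered out
```

Catching misleading-but-fluent text is much harder and usually needs model-based signals rather than string heuristics.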
Algorithmic efficiency: Efficient algorithms are necessary to handle the vast scales of data typically involved in training LLMs. This includes using techniques like MinHash for deduplication, which serves as a simple starting point for more complex operations that must be performed at scale.
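A small example of MinHash-based near-duplicate detection using the open-source datasketch library (the threshold and number of permutations are common defaults, chosen here just for illustration):

```python
from datasketch import MinHash, MinHashLSH

def signature(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for token in set(text.lower().split()):
        m.update(token.encode("utf-8"))
    return m

docs = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "the quick brown fox jumped over a lazy dog",
    "c": "an entirely different sentence about language models",
}

# LSH buckets similar signatures together, so we find near-duplicate
# candidates without comparing every pair of documents.
lsh = MinHashLSH(threshold=0.7, num_perm=128)
sigs = {doc_id: signature(text) for doc_id, text in docs.items()}
for doc_id, sig in sigs.items():
    lsh.insert(doc_id, sig)

print(lsh.query(sigs["a"]))  # 'a' (and likely 'b') come back as near-duplicates
```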
Embedding and vectorization: Utilizing embedding spaces to represent data points allows the algorithms to understand and manipulate data at a more abstract level. This facilitates more sophisticated operations like clustering similar data points or identifying outliers. This is crucial for effectively parsing and organizing large datasets into manageable subsets.
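As a final sketch: once everything lives in an embedding space, even simple geometric operations become curation tools. Here, distance to the k-th nearest neighbor serves as a crude outlier score (the neighbor count and the 1% cutoff are arbitrary assumptions).

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Placeholder: normalized embeddings for every document in the corpus.
emb = np.random.rand(20_000, 384)

# Each point's distance to its 10th nearest neighbor: points that sit far
# from everything else may be noise, spam, or off-domain text worth reviewing.
nn = NearestNeighbors(n_neighbors=11).fit(emb)   # 11 = the point itself + 10 neighbors
dists, _ = nn.kneighbors(emb)
outlier_score = dists[:, -1]

# Flag the most isolated ~1% of the corpus for inspection or removal.
flagged = np.argsort(outlier_score)[-len(emb) // 100:]
```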
If you're a founder or an investor who has been thinking about this, I'd love to hear from you. I’m at prateek at moxxie dot vc.
If you are getting value from this newsletter, consider subscribing for free and sharing it with 1 friend who’s curious about AI: