What Is Empirical Risk Minimization
What is risk. True risk vs empirical risk. How to minimize empirical risk.
In this post, we’ll talk about:
The concept of risk
True risk vs empirical risk
Why we need to minimize empirical risk
How to minimize empirical risk
Empirical Risk Minimization is an important concept in machine learning, especially when it comes to supervised learning. What's the goal of supervised learning? To find a model that solves a given problem. Contrary to popular belief, it's not about building a model that best fits the training dataset. Why? Because the training dataset is only a sample; it doesn't represent the whole population.
Why is that an issue?
When it comes to solving supervised learning problems, we don't have access to every single data point that represents each class in its entirety. For example, let's say we want to build a model that can look at an image and tell us if it's a cat or a dog. If our 100 training images happen to contain only black cats and white dogs, the model will incorrectly learn that color is the feature that differentiates cats from dogs. Show it an image of a white cat, and it will classify it as a dog.
To build a truly accurate model, we would need a dataset containing images of every cat and dog in the world. Even though it would be a fun activity (no doubt!), it wouldn't be practical! So what do we do? We use the next best thing available to us -- a training dataset that's representative of the classes. We select a small number of cats and dogs to build our dataset, and hope that this sample is representative of the whole population.
We can think of supervised learning as choosing a function that achieves a given goal. We have to choose this function from a set of candidate functions. How do we measure the effectiveness of the chosen function given that we don't know what the actual distribution looks like? How do we find the function that best represents the true solution? That's where the concept of risk comes in.
What is risk?
The word "risk" has many definitions. But in this context, we're talking about error. To understand it, we need to talk about the idea of a loss function. Given a set of inputs and outputs, a loss function measures the difference between the predicted output and the true output. This difference is called error or "risk".
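To make this concrete, here's a minimal sketch of two common loss functions. The code and function names are my own illustration, not from any particular library:

```python
# Two common loss functions, each measuring the error of a single prediction.

def squared_loss(y_true, y_pred):
    # Penalizes large errors more heavily; typically used for regression.
    return (y_true - y_pred) ** 2

def zero_one_loss(y_true, y_pred):
    # 1 if the prediction is wrong, 0 if it's right; typically used for classification.
    return 0 if y_true == y_pred else 1

print(squared_loss(3.0, 2.5))       # 0.25
print(zero_one_loss("cat", "dog"))  # 1
```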
But it's applicable only to the given set of inputs and outputs. The risk changes as the inputs and outputs change. We want to know what the risk is over all the possibilities. We want to compute the risk for every single cat and dog in the world. This is where "true risk" comes into the picture.
What is true risk?
True risk is the average loss over all the possibilities, where each possibility is weighted by how likely it is to occur. But the problem in the real world is that we don't have access to all the possibilities. There's no universal dataset of all the cats and dogs that exist!
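In standard textbook notation (the formula isn't in the original post, but this is the usual definition): if f is our model, L is the loss function, and P is the true but unknown distribution over inputs x and labels y, the true risk is the expected loss:

```latex
R(f) = \mathbb{E}_{(x, y) \sim P}\left[ L(f(x), y) \right]
```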
If we put it in mathematical terms, we say that we don't know what the true distribution looks like. If we did, then we wouldn't need machine learning in the first place. How do we assess the risk here? This is where empirical risk comes into the picture.
What is empirical risk and why do we need to minimize it?
We assume that the samples in our training dataset are drawn from the same distribution as the whole population. The dataset serves as an approximation of this true distribution. If we compute the average loss over the data points in our dataset, it's called empirical risk. Why is it "empirical risk" and not "true risk"? Because we're using a dataset that's only a subset of the whole population.
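In the same notation as before, given n training samples (x_1, y_1), ..., (x_n, y_n), the empirical risk is simply the average loss over those samples:

```latex
\hat{R}_n(f) = \frac{1}{n} \sum_{i=1}^{n} L\big(f(x_i), y_i\big)
```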
When we build our machine learning model, we need to pick the function that minimizes the empirical risk. In other words, the difference between the predicted output and the actual output, averaged over the data points in our dataset, should be as small as possible. This process of finding the function that minimizes the empirical risk is called empirical risk minimization. In an ideal world, we would directly minimize the true risk. But we don't have the data that would allow us to do that. So we do the next best thing and minimize the empirical risk, hoping that it's close to the true risk. By minimizing the empirical risk, we aim to minimize the true risk.
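Here's a toy version of that search in code. Everything below (the candidate functions, the dataset, the names) is invented for illustration; real models search a much richer function space with gradient-based optimization rather than by enumeration:

```python
def squared_loss(y_true, y_pred):
    return (y_true - y_pred) ** 2

# A tiny training dataset of (input, output) pairs.
dataset = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8)]

# A small set of candidate functions (our "hypothesis space").
candidates = {
    "f(x) = x":     lambda x: x,
    "f(x) = x + 1": lambda x: x + 1,
    "f(x) = 2x":    lambda x: 2 * x,
}

def empirical_risk(f, data):
    # Average loss over the training samples.
    return sum(squared_loss(y, f(x)) for x, y in data) / len(data)

# Empirical risk minimization: keep the candidate with the lowest average loss.
best = min(candidates, key=lambda name: empirical_risk(candidates[name], dataset))
print(best)  # f(x) = 2x -- the best fit to this particular sample
```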
Where to go from here?
The size of the dataset has a big impact on empirical risk minimization. As we get more data, the empirical risk gets closer to the true risk (this is the law of large numbers at work: the average over a sample converges to the expectation over the whole distribution). This fits well with common knowledge in machine learning: a bigger dataset helps make your model better. The complexity of the true distribution also affects how well we can approximate it. If it's too complex, we need a lot of data to get a good approximation.
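We can watch this convergence happen with a quick simulation. Everything here is made up for illustration: we fix a model, define a "true" data-generating process whose true risk we can work out by hand, and measure the empirical risk on samples of growing size:

```python
import random

random.seed(0)

def model(x):
    # The fixed model we're evaluating: f(x) = 2x.
    return 2 * x

def draw_sample():
    # The "true" data-generating process: y = 2x plus Gaussian noise.
    x = random.uniform(0, 1)
    y = 2 * x + random.gauss(0, 0.5)
    return x, y

def empirical_risk(n):
    # Average squared loss over n fresh samples from the true distribution.
    total = 0.0
    for _ in range(n):
        x, y = draw_sample()
        total += (y - model(x)) ** 2
    return total / n

# The model matches the trend exactly, so the true risk is the noise
# variance: E[noise^2] = 0.5^2 = 0.25.
for n in (10, 100, 10_000):
    print(n, round(empirical_risk(n), 3))
# As n grows, the printed values settle near 0.25.
```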