How To Measure The Quality Of Images Created By AI Models

What's the problem. What to measure. What metrics are available.

May 10, 2022


AI models can now create images of people who don't exist, write tweets, and compose music that has never been heard before. These are called generative AI models. Their main purpose is to generate content that looks realistic to humans, or that is unrealistic in a very specific way, such as fantastical images generated from a natural language prompt. If the output is good, a human wouldn't be able to tell that an AI model created it.

When you build a predictive AI model (an image classifier, for example), you measure its performance using a labeled dataset. The dataset is split into a training set and a test set, and all the labels are known beforehand. You compare the labels output by the model (i.e. the predictions) with the actual labels (i.e. the ground truth) on the test set. You then use this difference to measure how accurate the model is. This works well only when you have the ground truth available.
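As a minimal sketch of that comparison, assume a toy cat/dog classifier. The labels and the `accuracy` helper below are made up for illustration and don't come from any library:

```python
def accuracy(predictions, ground_truth):
    """Fraction of predictions that match the known labels."""
    if len(predictions) != len(ground_truth):
        raise ValueError("predictions and ground_truth must have the same length")
    matches = sum(p == t for p, t in zip(predictions, ground_truth))
    return matches / len(ground_truth)


# Hypothetical test-set labels and the model's outputs for the same images.
ground_truth = ["cat", "dog", "dog", "cat", "dog"]
predictions = ["cat", "dog", "cat", "cat", "dog"]

print(accuracy(predictions, ground_truth))  # 4 of the 5 predictions match
```

Notice that the function needs `ground_truth` as an input. That's exactly the piece a generative model doesn't have.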

But what about generative AI models where we don't have the ground truth? They create images of people who don't exist. How can we measure the quality of such an image to see if it's realistic enough? And how do we know if the model can do this consistently across a large number of outputs?

Why do we need a way to measure the quality of the generated output?

Let's say we have four different generative AI models that create images of human faces. How can we determine which model is creating the most realistic images?

Just by looking at the created images, we can see that some models are better than others. Some outputs look more realistic than others. But how can we quantify this difference? By using mathematical metrics that can assign scores to these models.

Why can't we just ask people to vote on how realistic these images are?

One way to assess the quality of the generated images is to show them to a group of people. And then ask each person to vote on it. This will give us a sense of how realistic the images are.

But this method is subjective. The result depends on the group of people and their own biases. It's also slow and expensive to repeat every time the model changes. To measure quality objectively, we need to define metrics that can be computed automatically from the outputs of generative models. This will enable us to track quality and use it to make progress. To address this, researchers came up with the Inception Score.

What is Inception Score?

Inception Score is a metric to evaluate the quality of images generated by an AI model. It specifically applies to synthetic images, e.g. faces of people who don't exist. It can be calculated mathematically and it allows us to assign a score to a given model. It was proposed in the 2016 paper "Improved Techniques for Training GANs" by Salimans et al.

To calculate the Inception Score, we use a pre-trained neural network for image classification (the Inception network, which is where the name comes from). We take a large number of generated images and run each one through this classifier to get the probability of the image belonging to each class. We then combine these predictions into a single number, which is the Inception Score.

The goal of Inception Score is to capture two properties of generated images:

Is the model specific enough? The model should be able to generate images that look like a specific item or object. It can't put the head of a horse on the body of a dolphin. That's not realistic.
Is the model diverse enough? The model should be able to generate a wide range of objects. For example, it shouldn't just generate images of tables. Even if they're realistic, only being able to generate images of tables is too limited.

The lowest possible Inception Score is 1.0 and the highest possible value is the number of classes supported by the classification model. If the classification dataset consists of 200 categories of objects, then the Inception Score can range from 1 to 200.
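Here's a rough sketch of how those predictions turn into a score. The usual formula is the exponential of the average KL divergence between each image's predicted class distribution and the average distribution over all images. The function name, the `eps` smoothing term, and the toy prediction matrices are my own assumptions for illustration. A real pipeline would get `probs` from the pre-trained classifier, and published implementations typically average the score over several splits of the images.

```python
import numpy as np


def inception_score(probs, eps=1e-12):
    """Inception Score from an (n_images, n_classes) array of class probabilities.

    Each row is the classifier's predicted distribution p(y|x) for one image.
    """
    probs = np.asarray(probs, dtype=float)
    # p(y): the average prediction over all generated images.
    marginal = probs.mean(axis=0, keepdims=True)
    # KL(p(y|x) || p(y)) for each image; eps avoids log(0).
    kl = np.sum(probs * (np.log(probs + eps) - np.log(marginal + eps)), axis=1)
    return float(np.exp(kl.mean()))


n_classes = 200

# Specific AND diverse: each image is confidently one class, all classes appear equally.
specific_and_diverse = np.eye(n_classes)

# Specific but not diverse: every image is confidently the same class ("only tables").
only_tables = np.zeros((n_classes, n_classes))
only_tables[:, 0] = 1.0

# Not specific: the classifier is equally unsure about every image.
not_specific = np.full((n_classes, n_classes), 1.0 / n_classes)

print(inception_score(specific_and_diverse))  # close to 200, the upper bound
print(inception_score(only_tables))           # close to 1, the lower bound
print(inception_score(not_specific))          # close to 1, the lower bound
```

The score only climbs when both properties hold at once. A model that fails either one ends up near the bottom of the range.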

What's the problem with Inception Score?

When we build a generative AI model, we want it to learn to create outputs that look similar to the data in the training dataset. One of the key issues with Inception Score is that it doesn't take into account the images that were used to train the generator. It only evaluates the distribution of images that are generated by the model.

For example, let's say we want to build an AI model that can create faces of humans. We use a big dataset of human faces to train the model. After the training, the model starts creating very realistic pictures of all types of furniture. The Inception Score for this AI model will be high because each picture looks like a specific type of object (i.e. furniture) and there's a variety of furniture. But does it meet our goal of creating human faces? No.
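To make that blind spot concrete, here's a small sketch under made-up assumptions: a classifier with 10 classes, where classes 0 to 4 stand in for "face-like" categories and classes 5 to 9 stand in for furniture. The class layout, the probability values, and the function name are all hypothetical.

```python
import numpy as np


def inception_score(probs, eps=1e-12):
    """Inception Score from an (n_images, n_classes) array of class probabilities."""
    probs = np.asarray(probs, dtype=float)
    marginal = probs.mean(axis=0, keepdims=True)
    kl = np.sum(probs * (np.log(probs + eps) - np.log(marginal + eps)), axis=1)
    return float(np.exp(kl.mean()))


rng = np.random.default_rng(0)
n_images, n_classes = 500, 10

# On-target generator: confident predictions spread across the face-like classes 0-4.
on_target = np.full((n_images, n_classes), 0.01)
labels = rng.integers(0, 5, size=n_images)
on_target[np.arange(n_images), labels] = 0.91  # each row still sums to 1

# Off-target generator: the same predictions, shifted onto the furniture classes 5-9.
off_target = np.roll(on_target, shift=5, axis=1)

score_on = inception_score(on_target)
score_off = inception_score(off_target)
print(score_on, score_off)  # the two scores match
```

The training images never appear anywhere in the calculation, so the score has no way of knowing that one of these generators missed the goal.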

We need a metric that aligns with human judgment of quality. And it needs to be a number that encapsulates the similarity level between the training dataset and generated dataset. That's why people came up with a different metric called Fréchet Inception Distance (FID) to evaluate the generated images.

What is Fréchet Inception Distance?

Before we dive into FID, we need to know what Fréchet Distance is. It's a metric named after Maurice Fréchet and is used in mathematics to measure the similarity between two curves. It takes into account the location and ordering of the points along those curves.
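To get a feel for why ordering matters, here's a sketch of the discrete Fréchet distance, a common approximation that only looks at the sampled points of each curve rather than the continuous curves. The function name and the toy curves are my own choices for illustration.

```python
import math


def discrete_frechet(curve_a, curve_b):
    """Discrete Fréchet distance between two curves given as lists of points."""
    if not curve_a or not curve_b:
        raise ValueError("both curves need at least one point")
    n, m = len(curve_a), len(curve_b)
    # ca[i][j] = smallest "leash length" needed to walk both curves from their
    # starts to points i and j without ever moving backwards on either curve.
    ca = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = math.dist(curve_a[i], curve_b[j])
            if i == 0 and j == 0:
                ca[i][j] = d
            elif i == 0:
                ca[i][j] = max(ca[0][j - 1], d)
            elif j == 0:
                ca[i][j] = max(ca[i - 1][0], d)
            else:
                best_previous = min(ca[i - 1][j], ca[i - 1][j - 1], ca[i][j - 1])
                ca[i][j] = max(best_previous, d)
    return ca[n - 1][m - 1]


line = [(0, 0), (1, 0), (2, 0)]
parallel = [(0, 1), (1, 1), (2, 1)]      # same shape, one unit above
reversed_parallel = parallel[::-1]       # same points, opposite direction

same_direction = discrete_frechet(line, parallel)
opposite_direction = discrete_frechet(line, reversed_parallel)
print(same_direction, opposite_direction)
```

The two comparisons use exactly the same points, yet the reversed curve gets a larger distance, because the walk has to start and end at mismatched ends.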

How is it relevant here? Well, there's another interesting application of the Fréchet Distance metric. In addition to measuring the similarity between curves, we can use it to measure the similarity between probability distributions as well.

Using the Inception object recognition network, the images in both the training dataset and the generated dataset are converted into lower-dimensional feature vectors that capture their important features. Each set of feature vectors is then summarized as a multivariate Gaussian (a mean and a covariance), and the Fréchet Distance is calculated between these two distributions to see how similar they are. A lower FID means the generated images are closer to the training images, and two identical distributions give a score of 0. This allows us to compare the group of generated images with the group of training images. FID keeps what is useful about the Inception Score while also accounting for the similarity between training images and generated images.
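Here's a minimal sketch of that calculation. For two Gaussians, the distance is the squared difference between the means plus a term that compares the covariances. The function name is my own, and the random 8-dimensional vectors are stand-ins for real Inception features (which would come from running both image sets through the pre-trained network). This version uses only NumPy and gets the matrix square root term from eigenvalues, which is a simplification compared with library implementations.

```python
import numpy as np


def frechet_distance(features_a, features_b):
    """Fréchet distance between Gaussians fitted to two (n_samples, n_features) arrays."""
    features_a = np.asarray(features_a, dtype=float)
    features_b = np.asarray(features_b, dtype=float)
    mu_a, mu_b = features_a.mean(axis=0), features_b.mean(axis=0)
    cov_a = np.atleast_2d(np.cov(features_a, rowvar=False))
    cov_b = np.atleast_2d(np.cov(features_b, rowvar=False))
    # Tr((cov_a cov_b)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b. Clip tiny negative values from rounding.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    mean_term = np.sum((mu_a - mu_b) ** 2)
    return float(mean_term + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)


rng = np.random.default_rng(0)

# Stand-in features: "training" images, a generator that matches them, and one that doesn't.
training = rng.normal(loc=0.0, scale=1.0, size=(2000, 8))
good_generator = rng.normal(loc=0.0, scale=1.0, size=(2000, 8))
bad_generator = rng.normal(loc=3.0, scale=1.0, size=(2000, 8))

fid_good = frechet_distance(training, good_generator)
fid_bad = frechet_distance(training, bad_generator)
print(fid_good, fid_bad)  # the matching generator gets the much lower score
```

Unlike the Inception Score, this function takes the training features as an input, so a generator that drifts away from the training data gets penalized.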

Where to go from here?

Generative AI models are becoming more pervasive and are being used in many applications. These models are helping people compose emails, write code, fix errors in documents, write tweets, create images for their blog posts, and synthesize music. These models are on their way to becoming our copilots for many of our daily tasks.




Infinite Curiosity Newsletter
