Welcome to Infinite Curiosity, a weekly newsletter that explores the intersection of Artificial Intelligence and Startups. Tech enthusiasts across 200 countries have been reading what I write. Subscribe to this newsletter for free to receive it in your inbox every week:
Hello friends,
Vector Search has emerged in recent years as an efficient way to look through huge amounts of data and extract knowledge from it. In this post, I decided to discuss how it works and how developers can use it to infuse search capabilities into their applications.
In this post, we’ll talk about:
What is a vector
Why do we need vectors
What does a good vector representation look like
How are vectors created
What is vector search
What algorithms can be used for vector search
Let’s dive in.
What is a vector?
In this context, a vector refers to a vector embedding. It's a key concept in machine learning that's especially relevant in natural language processing, semantic analysis, and search applications. ML algorithms need numbers to work with, so before an ML algorithm can analyze data (e.g. words), we need to convert that data into numbers. A simple example would be to convert a word into a 1-dimensional array consisting of 'n' numbers. This array is called a vector. Each number in this array is a feature of that word. You can choose the value of 'n' depending on how many features you want.
We can convert a single word into a vector. Or we can convert an entire sentence into a vector. The goal is basically to convert the data from its native form to numerical form so that an ML algorithm can extract useful information from it. These vectors are then used for applications such as recommendation engines, search engines, language translation, and many more.
A key point to note here is that this concept is not just limited to text. You can convert images, video, audio, and other forms of data into vectors as well.
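To make this concrete, here's a toy sketch in Python. The numbers are made up purely for illustration; a real embedding model would learn the values and typically use hundreds of dimensions.

```python
import numpy as np

# A toy 5-dimensional vector for the word "shoes".
# The values are made up for illustration; a real embedding model
# would learn them and typically use hundreds of features.
shoes = np.array([0.12, -0.45, 0.33, 0.08, -0.71])

print(shoes.shape)  # (5,) -> a 1-dimensional array with n = 5 features
```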
Why do we need vectors?
Let's take the example of text-based search. When a user inputs a word or a phrase, we need to go through our database of words and retrieve the right results. Why not just compare the input phrase with everything in the database and pick the exact match? Why bother converting words into vectors?
Matching words with each other is called keyword matching and you can certainly do it. But it's not good for situations where you have to do conceptual matching. If someone types "red footwear", you may not find an exact match in your database because all the items are labeled "shoes", "sandals", or other similar terms. Keyword matching won't show any results. That's where vectors come into the picture. The vector representation of "shoes" will be close to that of "footwear", so a vector-based search will retrieve those items and show the results to the user.
What does a good vector representation look like?
Proximity in the vector space should correspond to proximity in meaning in the real world: if the distance between two vectors is small, the two words they represent have similar conceptual meaning.
In the above example, the vector representations of "footwear" and "shoes" should be close to each other. But the vector representations of "footwear" and "football" should be far apart because they are unrelated concepts (even though the words look similar in English).
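Here's a toy sketch of that property. The vectors below are hand-picked, hypothetical values purely for illustration, not real embeddings.

```python
import numpy as np

# Hand-picked toy vectors (hypothetical values, for illustration only).
footwear = np.array([0.90, 0.10, 0.05])
shoes    = np.array([0.85, 0.15, 0.10])
football = np.array([0.05, 0.90, 0.60])

# Distance in vector space should mirror closeness in meaning.
print(np.linalg.norm(footwear - shoes))     # small distance -> similar concepts
print(np.linalg.norm(footwear - football))  # large distance -> unrelated concepts
```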
How are the vectors created?
There are a couple of ways to create these vectors:
Manually: You can create a function that converts words into vectors. This function is created by humans using domain knowledge. This is called feature engineering.
Using ML: You can train an ML model on a large body of text so that it learns to convert words into vectors. A sketch of using a pre-trained embedding model this way follows below.
If the ML model is good, then the corresponding vector representations will also be good. We've discussed the definition of "good" in the previous section.
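As an illustration of the ML route, here's a minimal sketch that assumes the open-source sentence-transformers library and its pre-trained all-MiniLM-L6-v2 model; any other embedding model would work along the same lines.

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer

# Assumption: the pre-trained all-MiniLM-L6-v2 model, which maps text
# to 384-dimensional vectors. Any embedding model would work similarly.
model = SentenceTransformer("all-MiniLM-L6-v2")

words = ["shoes", "footwear", "football"]
vectors = model.encode(words)

print(vectors.shape)  # (3, 384) -> one 384-dimensional vector per input
```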
What is vector search?
Let's continue with the example of a search engine. You have a database of text. And now you want to make it queryable. Why? Because when a user comes in and types something into the search bar, you need to retrieve the relevant objects from your database and show them to the user. To do this, you convert your database of objects into a database of vectors.
When the user types something into the search bar, you convert that phrase into a vector. You then conduct a similarity search against the vectors in your database to retrieve the relevant results. This is called vector search. It's the process of searching through the list of vectors to choose the ones that are similar to the input vector.
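Here's a minimal sketch of that flow with a small in-memory database. It assumes the item vectors have already been created by an embedding model (the random vectors and labels below are stand-ins for real data) and uses the simplest possible brute-force search.

```python
import numpy as np

def cosine_similarity(a, b):
    # Dot product divided by the product of the vectors' norms.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Stand-ins for vectors produced by an embedding model.
database_vectors = np.random.rand(1000, 384)      # 1000 items, 384 features each
item_labels = [f"item_{i}" for i in range(1000)]  # hypothetical item names

query_vector = np.random.rand(384)                # the user's query, as a vector

# Score every item against the query and keep the 5 most similar.
scores = np.array([cosine_similarity(query_vector, v) for v in database_vectors])
top_k = np.argsort(scores)[::-1][:5]

for idx in top_k:
    print(item_labels[idx], scores[idx])
```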
What algorithms can be used for vector search?
An algorithm's goal here is to search through a database of vectors and find similar objects. It needs to do it quickly and efficiently. To start off, how do we measure the similarity level between two vectors? Here are two metrics that are frequently used:
Euclidean distance: It's calculated by taking the square root of the sum of the squared differences between the corresponding features of the two vectors.
Cosine similarity: The difference between cosine similarity and Euclidean distance is that cosine similarity only looks at the angle between the two vectors and ignores their magnitudes. It's the dot product of the vectors divided by the product of their lengths (norms). Both metrics are sketched in code below.
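Here's a direct translation of both metrics into code, as a minimal sketch using NumPy.

```python
import numpy as np

def euclidean_distance(a, b):
    # Square root of the sum of squared differences between corresponding features.
    return np.sqrt(np.sum((a - b) ** 2))

def cosine_similarity(a, b):
    # Dot product divided by the product of the vectors' lengths (norms).
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 3.0, 4.0])

print(euclidean_distance(a, b))  # ~1.73 -> smaller means more similar
print(cosine_similarity(a, b))   # ~0.99 -> closer to 1 means more similar
```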
Now that we have a way to measure the similarity level between two vectors, let's see what algorithms can be used to retrieve the vectors from our database that are similar to the input vector:
k Nearest Neighbors (kNN): This algorithm looks at the input vector, goes through all the vectors in the database, and picks the 'k' closest vectors.
Approximate Nearest Neighbors (ANN): kNN tends to be slow because going through the entire database of vectors takes a long time. We need a faster way to find the matches. ANN is a way to address that. The tradeoff here would be that the matches will be approximate, not exact. ANN methods use various techniques such as indexing, hashing, and clustering to speed up the search for those 'k' closest vectors.
HNSW (Hierarchical Navigable Small World): HNSW is another way to speed up the search for the 'k' nearest neighbors. In HNSW, the search for the closest vectors is conducted over a multi-layered structure where each layer is a proximity graph; you start at the top layer and navigate down the hierarchy until you reach the closest vectors. A minimal example using an open-source HNSW library is sketched below.
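As an illustration, here's a minimal sketch using hnswlib, one open-source implementation of HNSW. The random vectors and parameter values are assumptions for the example; in practice you'd index your real embeddings and tune the parameters.

```python
# pip install hnswlib
import numpy as np
import hnswlib

dim = 384           # dimensionality of the vectors
num_items = 10_000  # size of the toy database

# Random vectors stand in for real embeddings in this example.
database_vectors = np.random.rand(num_items, dim).astype(np.float32)

# Build the HNSW index. M and ef_construction control graph
# connectivity and build-time accuracy/speed tradeoffs.
index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=num_items, ef_construction=200, M=16)
index.add_items(database_vectors, np.arange(num_items))
index.set_ef(50)    # query-time accuracy/speed tradeoff

# Find the 5 approximate nearest neighbors of a query vector.
query_vector = np.random.rand(dim).astype(np.float32)
labels, distances = index.knn_query(query_vector, k=5)
print(labels, distances)
```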
This is an active area of research where new methods keep popping up. The size of the database keeps increasing rapidly. So the goal of these new methods is to make this search process faster and more efficient for larger datasets.
If you are getting value from this newsletter, consider subscribing for free and sharing with your friends: