“You Do Not Need a Vector Database” is the provocative title of a recent blog post (with code) by Dr. Yucheng Low, co-founder of XetHub. In this post, I’ll explain what a vector database is, why Dr. Low says you don’t need one, and provide some context for his argument.
Background
To make sense of Dr. Low’s article, it’s important to understand why vectors of numbers have become such an important tool for search systems and why that popularity can translate into unnecessary expense for organizations that deploy vector-based systems.
Vectors of Numbers
The word “vector” in computer programming means “a sequence of numbers.” For example, this vector represents the location of The White House in Washington DC:
[38.897957, -77.036560]
While this vector represents characteristics of an electric vehicle:
[38990, 175, 341, 2]
Both of these vectors represent a point in some mathematical space. The first vector represents the location of the White House in a 2-dimensional space where the first dimension is the latitude and the second dimension is the longitude. The second vector represents the price, range improvement with a 15-minute charge, the total range, and the number of motors for an electric vehicle (in this case, a Tesla Model 3).
Clearly, context is required to understand the content and meaning of a vector.
Bags of Words
For the first few decades of text processing, one common way to represent the contents of a document was with a vector where each number represented the number of times a specific word appeared in the text: the first element in the vector might count the word “the” (the most common word), the second “be” (the second most common word), and so on.
This is sometimes called a “bag of words” approach because the vector doesn’t represent the order in which the words occur: it’s as if you wrote each of the document’s words onto a note card and put them all into a bag. The sentences “I like the chocolate ice cream” and “I cream the ice like chocolate” have the same bag-of-words representation, even though they have very different meanings.
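To make this concrete, here is a minimal sketch in Python using collections.Counter; the six-word vocabulary is an illustrative stand-in for a real dictionary of thousands of words:

from collections import Counter

# A tiny fixed vocabulary; real systems use thousands of words.
vocabulary = ["the", "i", "like", "chocolate", "ice", "cream"]

def bag_of_words(text):
    """Count how often each vocabulary word appears in the text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

a = bag_of_words("I like the chocolate ice cream")
b = bag_of_words("I cream the ice like chocolate")
print(a)       # [1, 1, 1, 1, 1, 1]
print(a == b)  # True: same words, different order, identical vectors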
One of the curious discoveries about the bag-of-words approach was that even though the vectors had no semantic content, simple mathematical operations on the vectors were semantically relevant. For example, if you created a dictionary of the 10,000 most popular words in the English language and used it to represent a bunch of documents, each with its own vector, then you could find similar documents by clustering the points in 10,000-dimensional space. Of course, it took decades of research to figure out how to make this work. The ultimate result was the popular search engines and text retrieval systems of the 1990s and early 2000s.
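As a toy illustration of that clustering idea, here is a minimal sketch using scikit-learn (my choice of library, not something from Low’s post); the four-document corpus is made up, and the vocabulary is tiny rather than 10,000 words:

# Requires: pip install scikit-learn
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "the cat sat on the mat",
    "a cat and a dog played",
    "stock prices rose sharply",
    "markets and stock trading fell",
]

# Bag-of-words vectors: one row per document, one column per word.
vectors = CountVectorizer().fit_transform(documents)

# Cluster the points; similar documents land in the same cluster.
labels = KMeans(n_clusters=2, n_init=10).fit_predict(vectors)
print(labels)  # e.g. [0 0 1 1]: the two animal texts group together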
Word Embeddings
Over the past decade, researchers have developed new approaches for representing the semantic content of phrases, sentences, and documents. These approaches use various techniques to create multidimensional vector spaces representing concepts, and then find where a word, phrase, or entire document resides, or “embeds,” in that space. The resulting vectors are called embeddings.
Enterprises can easily create word embeddings entirely on their own servers using high-performance open-source tools like the spaCy natural language processing library. There are also commercial word embedding services, such as the OpenAI text embedding API. Each system offers multiple language models for creating embeddings, and embeddings created with one model are incompatible with embeddings created with another.
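For example, here is a minimal sketch with spaCy; the en_core_web_md model (a medium-sized English model that ships with word vectors) is my assumption about your setup:

# Requires: pip install spacy
#           python -m spacy download en_core_web_md
import spacy

nlp = spacy.load("en_core_web_md")  # a model that includes word vectors

doc = nlp("You do not need a vector database.")
print(doc.vector.shape)  # (300,): the document's 300-dimensional embedding

other = nlp("Vector databases are optional for search.")
print(doc.similarity(other))  # similarity score between the two embeddings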
With word embeddings, you can score the similarity between two documents simply by computing the distance between their embeddings. That “distance” is in a multidimensional space, although the math is the same as computing distance in two or three dimensions. This makes it possible to find documents similar to an exemplar document by computing the distance between your exemplar’s embedding and the embedding of every other document in your corpus. Likewise, you can find documents responsive to a question by finding the documents whose embeddings are closest to the embedding of the question.
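Here is a minimal sketch of that brute-force scoring with numpy, using random vectors as stand-ins for real embeddings (cosine similarity is one common measure; Euclidean distance also works):

import numpy as np

rng = np.random.default_rng(0)
question = rng.random(300)        # stand-in for the question's embedding
corpus = rng.random((1000, 300))  # stand-ins for 1,000 document embeddings

# Cosine similarity between the question and every document at once.
scores = (corpus @ question) / (
    np.linalg.norm(corpus, axis=1) * np.linalg.norm(question)
)
best = int(np.argmax(scores))  # index of the most responsive document
print(best, scores[best])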
Vector Databases
Vector databases implement this process of storing and operating on multidimensional vectors. For example, you can store a million vectors in a database (representing a million documents in your Google Drive, perhaps) and have the database find the ten vectors closest to the vector for your search. Or you can give the database two documents you think are similar (on the same topic) and ask how many other similar vectors (documents) it has stored.
These are straightforward queries to resolve if each vector has only a single element, that is, if a single number indexes each document. But as vectors gain dimensions, the indexing tricks that work in low-dimensional spaces break down (the so-called “curse of dimensionality”). It’s always possible to answer these questions by having the query examine every vector in the database, but that brute-force scan is computationally slow for large collections. Vector databases like Chroma, Faiss, Pinecone, Weaviate, and Qdrant all implement approximate-search tricks and heuristics to improve performance, although their accuracy can suffer in the process.
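As a minimal sketch, here is Faiss (picked because it appears in the list above); IndexFlatL2 is its exact, brute-force index, while the approximate index types are where the speed tricks live:

# Requires: pip install faiss-cpu numpy
import numpy as np
import faiss

dim = 300
vectors = np.random.rand(100_000, dim).astype("float32")  # one per document

index = faiss.IndexFlatL2(dim)  # exact search; approximate indexes trade accuracy for speed
index.add(vectors)

query = np.random.rand(1, dim).astype("float32")
distances, ids = index.search(query, 10)  # the ten closest vectors
print(ids[0])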
Low: You don’t need a vector database (for RAG)
We are now in a position to understand Low’s article!
Yucheng Low is the co-founder and CEO of XetHub, a Seattle-area company that makes data management tools to help data scientists manage datasets in the gigabyte-to-petabyte range. He earned his PhD at Carnegie Mellon University, where he worked on GraphLab.
The big idea in Low’s article is that a hybrid approach combining traditional text retrieval algorithms with vector technology can produce better search results for a popular AI application called “retrieval augmented generation” (RAG) than simply using a vector database.
Low’s Demonstration
Low’s article compares three text retrieval approaches:
Approach #1 – Traditional keyword search with heuristics – the “Best Match 25” (BM25) algorithm developed in the 1990s.
Approach #2 – Retrieval using just vector embeddings – what you would get with a high-performance vector database.
Approach #3 – First using BM25 to retrieve 1,000 results, and then picking the best of those results using vector embeddings (a minimal sketch of this hybrid follows the list).
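Here is that sketch of approach #3, using the rank_bm25 package for the keyword step; the embed() callable and the three-document corpus are illustrative stand-ins, not code from Low’s post:

# Requires: pip install rank-bm25 numpy
import numpy as np
from rank_bm25 import BM25Okapi

documents = [
    "Vector databases store embeddings for similarity search.",
    "BM25 is a classic keyword ranking function from information retrieval.",
    "Chocolate ice cream is best enjoyed in summer.",
]
bm25 = BM25Okapi([d.lower().split() for d in documents])

def hybrid_search(question, embed, top_n=1000, final_k=10):
    # Step 1: BM25 keyword search narrows the corpus to top_n candidates.
    keyword_scores = bm25.get_scores(question.lower().split())
    candidates = np.argsort(keyword_scores)[::-1][:top_n]

    # Step 2: rerank just those candidates by embedding similarity.
    q = embed(question)
    similarities = [float(np.dot(embed(documents[i]), q)) for i in candidates]
    reranked = np.argsort(similarities)[::-1][:final_k]
    return [documents[candidates[i]] for i in reranked]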
Low finds that the hybrid approach, using BM25 and then picking the best results (“reranking”) with vector embeddings, produces better results than either BM25 or vector embeddings on their own.
These findings are important, Low argues, because they show you don’t need a vector database if you are only working with a few thousand vectors. Modern computers are fast enough to sort through a few thousand vectors in multidimensional space and find the one that’s the closest match to the question you’re asking. You only need a vector database when sorting through millions or billions of vectors.
What’s RAG, Anyway?
Retrieval augmented generation (RAG) is an approach that’s been developed over the past few years to improve the usefulness of large language model (LLM) AI systems such as ChatGPT.
RAG solves two important problems with these chatbot systems. First, systems implementing RAG can provide users with the “references” that back up the LLM’s answers. Second, RAG can help keep the chatbot on target and aligned with the goals of the organization deploying the LLM.
With RAG, the system first takes the user’s question and searches through a large set of documents, finding the parts of each document that might be responsive. These document parts and the user’s question are then provided to the LLM with a prompt that says something along the lines of, “Given the following documents, please answer this question.”
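Here is a minimal sketch of that final assembly step; retrieve() and llm() are placeholders for your search system and language model, not a specific API:

def answer_with_rag(question, retrieve, llm, k=5):
    # Step 1: find the k passages most likely to contain the answer.
    passages = retrieve(question, k)

    # Step 2: hand the passages and the question to the LLM together.
    context = "\n\n".join(passages)
    prompt = (
        "Given the following documents, please answer this question.\n\n"
        f"Documents:\n{context}\n\n"
        f"Question: {question}"
    )
    return llm(prompt)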
(Remember, it’s important to say ‘please’ and ‘thank you’ to your ChatBot — it will give you better answers if you are polite.)
The Copilot feature of Microsoft’s Bing search engine uses RAG to improve results and provide links and context. For example, when I ask Copilot “Do I need a vector database to do RAG?”, the user interface shows me that it first uses Bing to search the Internet, then uses an LLM to formulate its answer. At the bottom of the answer are links to websites with more information, presumably some of the websites that Bing used as part of the RAG process.
The Surprising Conclusion
Because today’s computers can brute-force their way through a few thousand vectors almost instantly, it’s significantly more efficient to do an initial search using traditional search technology and then use vector embeddings to rerank the results; no vector index is required for that second step.
What some people may find surprising about Low’s demonstration is that he also showed that he got better RAG results when using this hybrid approach. But this isn’t surprising to me: hybrid approaches combining multiple AI techniques generally do better than a single, generalized approach.
This doesn’t mean there is no need for vector databases; you just don’t need them to do RAG.