Building LLM applications with vector search in Azure Cognitive Services
Tools like Semantic Kernel, TypeChat, and LangChain make it possible to build applications around generative AI technologies like Azure OpenAI. That’s because they allow you to put constraints around the underlying large language model (LLM), using it as a tool for building and running natural language interfaces.
At heart, an LLM is a tool for navigating a semantic space, where a deep neural network predicts the next token in a chain that follows on from your initial prompt. Where a prompt is open-ended, the LLM can overrun its inputs, producing content that seems plausible but is in fact complete nonsense.
Just as we tend to trust the outputs from search engines, we also tend to trust the outputs of LLMs, as we see them as just another facet of a familiar technology. But training large language models on trustworthy data from sites like Wikipedia, Stack Overflow, and Reddit doesn’t impart an understanding of the content; it merely imparts an ability to generate text that follows the same patterns as text in those sources. Sometimes the output may be correct, but other times it will be wrong.
How can we avoid false and nonsensical output from our large language models, and ensure that our users get accurate and sensible answers to their queries?
Constraining large language models with semantic memory
What we need to do is constrain the LLM, ensuring that it generates text only from a much smaller set of data. That’s where Microsoft’s new LLM-based development stack comes in. It provides the necessary tooling to rein in the model and keep it from delivering errors.
You can constrain an LLM by using a tool like TypeChat to force a specific output format, or by using an orchestration pipeline like Semantic Kernel to work with additional sources of trusted information, in effect “grounding” the model in a known semantic space. Here the LLM can do what it’s good at, summarizing a constructed prompt and generating text based on that prompt, without overruns (or at least with a significantly reduced chance of overruns occurring).
What Microsoft calls “semantic memory” is the foundation of this last approach. Semantic memory uses vector search to build a prompt that grounds the LLM’s output in factual data. A vector database manages the context for the initial prompt, a vector search finds stored data that matches the initial user query, and the LLM generates text based on that data. You can see this approach in action in Microsoft’s Bing Chat, which uses Bing’s native vector search tools to build answers drawn from its search database.
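Here’s a minimal sketch of that grounding step in Python, using the openai package’s Azure client. The deployment name, environment variables, and sample passages are placeholders; in practice the passages would come from a vector search against your own data.

```python
# Minimal sketch of grounding an LLM with retrieved passages.
# Assumes the openai Python package (v1.x) and an Azure OpenAI chat deployment;
# all names and the sample passages are placeholders.
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",
)

# In a real application these come back from a vector search.
retrieved_passages = [
    "Contoso warranty claims must be filed within 30 days of purchase.",
    "Warranty claims are submitted through the Contoso support portal.",
]

# Build a prompt that constrains the model to the retrieved context.
context = "\n".join(f"- {p}" for p in retrieved_passages)
messages = [
    {"role": "system",
     "content": "Answer using only the context below. If the answer is not "
                "in the context, say you don't know.\n\nContext:\n" + context},
    {"role": "user", "content": "How do I file a warranty claim?"},
]

response = client.chat.completions.create(model="gpt-4", messages=messages)
print(response.choices[0].message.content)
```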
Semantic memory makes vector databases and vector search the means of delivering usable, grounded, LLM-based applications. You can use any of the growing number of open source vector databases, or add vector indexes to familiar SQL and NoSQL databases. One new entrant that looks particularly useful extends Azure Cognitive Search, adding a vector index to your data and new APIs for querying that index.
Adding vector indexing to Azure Cognitive Search
Azure Cognitive Search builds on Microsoft’s own work on search tooling, offering a mix of familiar Lucene queries and its own natural language query tool. Azure Cognitive Search is a software-as-a-service platform, hosting your private data and using Cognitive Service APIs to access your content. Microsoft recently added support for building and using vector indexes, allowing you to use similarity searches to rank relevant results from your data and use them in AI-based applications. That makes Azure Cognitive Search an ideal tool for use in Azure-hosted LLM applications built using Semantic Kernel and Azure OpenAI, with Semantic Kernel plug-ins for Cognitive Search in both C# and Python.
Like all Azure services, Azure Cognitive Search is a managed service that works with other Azure services, allowing you to index and search across a wide range of Azure storage services, hosting text and images as well as audio and video. Data is stored in multiple regions, offering high availability and reducing latency and response times. As an added benefit, for enterprise applications, you can use Microsoft Entra ID (the new name for Azure Active Directory) to control access to your private data.
Generating and storing embedding vectors for your content
One thing to note is that Azure Cognitive Search is a “bring your own embedding vector” service. Cognitive Search will not generate the required vector embeddings for you, so you will have to use either Azure OpenAI or the OpenAI embedding APIs to create embeddings for your content. That may require chunking large files so that you stay inside the token limits of the service. Be prepared to create new tables for vector indexed data where necessary.
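Here’s what that preparation step might look like in Python, using the Azure OpenAI client to embed naïvely chunked text. The deployment name, file name, and character-based chunk size are assumptions; production code should chunk on token counts rather than characters.

```python
# Sketch of generating embeddings for chunked content with Azure OpenAI.
# Deployment name, file name, and chunk size are assumptions.
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",
)

def chunk_text(text: str, max_chars: int = 4000) -> list[str]:
    """Naive fixed-size chunking to stay under the embedding model's token limit."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

document = open("contoso-handbook.txt", encoding="utf-8").read()
chunks = chunk_text(document)

# text-embedding-ada-002 returns 1,536-dimensional vectors.
result = client.embeddings.create(model="text-embedding-ada-002", input=chunks)
vectors = [item.embedding for item in result.data]
```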
Vector search in Azure Cognitive Search uses a nearest neighbor model to return a selected number of documents that are similar to the original query. This uses a vector embedding of your original query in a call to the vector index, returning similar vectors from the database along with the indexed content, ready for use in an LLM prompt.
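The underlying idea is simple enough to show with toy data: rank stored vectors by their cosine similarity to a query vector and keep the nearest few.

```python
# Illustration of the nearest neighbor idea behind vector search: rank stored
# vectors by cosine similarity to the query vector (toy data, not the service).
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

query_vector = np.array([0.1, 0.9, 0.2])
stored = {
    "doc-1": np.array([0.1, 0.8, 0.3]),
    "doc-2": np.array([0.9, 0.1, 0.1]),
}

# Sort document IDs by similarity to the query, most similar first.
ranked = sorted(stored, key=lambda k: cosine_similarity(query_vector, stored[k]),
                reverse=True)
print(ranked)  # doc-1 comes first: it is closest to the query vector
```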
Microsoft uses vector stores like this as part of Azure Machine Learning’s Retrieval Augmented Generation (RAG) design pattern, working with its prompt flow tooling. RAG uses the vector index in Cognitive Search to build context that forms the foundation of an LLM prompt. This gives you a low code approach to building and using your vector index, for example setting the number of similar documents that a query returns.
Getting started with vector search in Azure Cognitive Search
Using Azure Cognitive Search for vector queries is straightforward. Start by creating resources for Azure OpenAI and Cognitive Search in the same region, which will let you load your search index with embeddings with minimal latency. You’ll need to make calls to both the Azure OpenAI APIs and the Cognitive Search APIs to load the index, so it’s a good idea to ensure that your code can respond to the service’s rate limits by managing retries for you. And because you’re working with service APIs, you should use asynchronous calls both to generate embeddings and to load the index.
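A sketch of that pattern, assuming the openai package’s async Azure client; the backoff delays, attempt count, and deployment name are arbitrary choices.

```python
# Sketch of an async embedding call with retry and exponential backoff to
# absorb 429 rate-limit responses. Delays, attempt count, and deployment
# name are assumptions.
import asyncio
import os
from openai import AsyncAzureOpenAI, RateLimitError

client = AsyncAzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",
)

async def embed_with_retries(texts, deployment="text-embedding-ada-002",
                             attempts=5):
    delay = 1.0
    for attempt in range(attempts):
        try:
            result = await client.embeddings.create(model=deployment, input=texts)
            return [item.embedding for item in result.data]
        except RateLimitError:
            if attempt == attempts - 1:
                raise
            await asyncio.sleep(delay)  # back off before retrying
            delay *= 2

# vectors = asyncio.run(embed_with_retries(["some chunked text"]))
```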
Vectors are stored as vector fields in a search index, where each vector is an array of floating point numbers with a fixed number of dimensions. The vectors are mapped by a Hierarchical Navigable Small World (HNSW) proximity graph, which sorts vectors into neighborhoods of similar vectors, speeding up the process of searching the vector index.
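Here’s roughly what defining such an index looks like with the azure-search-documents Python SDK. The class names below track version 11.4 and have shifted between preview releases, and the field names, dimensions, and configuration names are all assumptions, so treat this as a sketch rather than a definitive schema.

```python
# Sketch of a search index with an HNSW-backed vector field, based on the
# azure-search-documents 11.4 object model. Names and dimensions are assumptions.
import os
from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import (
    SearchIndex, SimpleField, SearchableField, SearchField, SearchFieldDataType,
    VectorSearch, HnswAlgorithmConfiguration, VectorSearchProfile,
)

index = SearchIndex(
    name="docs-index",
    fields=[
        SimpleField(name="id", type=SearchFieldDataType.String, key=True),
        SearchableField(name="content", type=SearchFieldDataType.String),
        SearchField(
            name="contentVector",
            type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
            searchable=True,
            vector_search_dimensions=1536,  # matches text-embedding-ada-002
            vector_search_profile_name="hnsw-profile",
        ),
    ],
    vector_search=VectorSearch(
        algorithms=[HnswAlgorithmConfiguration(name="hnsw-config")],
        profiles=[VectorSearchProfile(name="hnsw-profile",
                                      algorithm_configuration_name="hnsw-config")],
    ),
)

index_client = SearchIndexClient(
    endpoint=os.environ["SEARCH_ENDPOINT"],
    credential=AzureKeyCredential(os.environ["SEARCH_ADMIN_KEY"]),
)
index_client.create_or_update_index(index)
```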
Once you have defined the index schema for your vector search you can load the data into your Cognitive Search index. It’s important to note that your data could have more than one vector associated with it. For example, if you’re using Cognitive Search to host corporate documents you might have separate vectors for key document metadata terms as well as for the document content. Your data set must be stored as JSON documents, which should simplify using results to assemble prompt context. The index doesn’t need to contain your source documents, as it supports working with most common Azure storage options.
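Loading the index is then a matter of uploading JSON documents whose fields match that schema, for example (placeholder field names and data):

```python
# Sketch of loading JSON documents, including a content vector, into the index.
# Field names match the hypothetical schema above.
import os
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

search_client = SearchClient(
    endpoint=os.environ["SEARCH_ENDPOINT"],
    index_name="docs-index",
    credential=AzureKeyCredential(os.environ["SEARCH_ADMIN_KEY"]),
)

content_embedding = [0.0] * 1536  # placeholder; use the embedding generated earlier

documents = [
    {
        "id": "1",
        "content": "Contoso warranty claims must be filed within 30 days of purchase.",
        "contentVector": content_embedding,
    },
]

result = search_client.upload_documents(documents=documents)
print([r.succeeded for r in result])  # one result per uploaded document
```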
Running a query requires first making a call to your chosen embedding model with the body of your query. This returns a multi-dimensional vector you can use to search your chosen index. When calling the vector search APIs, indicate your target vector index, the number of matches you require, and the related text fields in the index. You’ll also need to choose an appropriate similarity metric for your query; the cosine metric is the most commonly used.
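Putting those steps together, a pure vector query might look like this. The index, field, and deployment names are the same assumptions used above, and VectorizedQuery reflects the 11.4 azure-search-documents SDK.

```python
# Sketch of a vector query: embed the query text, then ask the index for the
# k nearest neighbors over the contentVector field. Names are assumptions.
import os
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery
from openai import AzureOpenAI

openai_client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",
)
search_client = SearchClient(
    endpoint=os.environ["SEARCH_ENDPOINT"],
    index_name="docs-index",
    credential=AzureKeyCredential(os.environ["SEARCH_QUERY_KEY"]),
)

query = "How do I file a warranty claim?"
embedding = openai_client.embeddings.create(
    model="text-embedding-ada-002", input=query).data[0].embedding

results = search_client.search(
    search_text=None,  # pure vector search, no keyword component
    vector_queries=[VectorizedQuery(vector=embedding, k_nearest_neighbors=3,
                                    fields="contentVector")],
    select=["id", "content"],
)
for doc in results:
    print(doc["id"], doc["content"])
```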
Going beyond simple text vectors
There’s much more to Azure Cognitive Search’s vector capabilities than simply matching text. Cognitive Search can work with multilingual embeddings to support searches across documents in many languages. You can use more complex APIs too. For example, you could mix in the Bing-derived semantic search tools in a hybrid search that can provide more accurate results, improving the quality of the output from your LLM-powered application.
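Reusing the search client, query embedding, and VectorizedQuery from the previous sketch, a hybrid query simply combines keyword text with the vector query, optionally adding semantic ranking; the semantic configuration name here is an assumption and must already exist on the index.

```python
# Sketch of a hybrid query: keyword and vector search combined in one call,
# reranked with semantic ranking. The semantic configuration name is an
# assumption; search_client, embedding, and VectorizedQuery are defined above.
results = search_client.search(
    search_text="warranty claim",                        # keyword component
    vector_queries=[VectorizedQuery(vector=embedding,    # vector component
                                    k_nearest_neighbors=3,
                                    fields="contentVector")],
    query_type="semantic",
    semantic_configuration_name="default",
    select=["id", "content"],
)
for doc in results:
    print(doc["id"], doc["content"])
```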
Microsoft is quickly productizing the tools and techniques it used to build its own GPT-4-powered Bing search engine and its various Copilots. Orchestration engines like Semantic Kernel and Azure AI Studio’s prompt flow are at the heart of Microsoft’s approach to using large language models. Now that those foundations have been laid, we’re seeing the company roll out more of the requisite supporting technologies. Vector search and a vector index are key to delivering accurate responses. By building on familiar tooling to deliver these, Microsoft will help keep our costs and our learning curves to a minimum.