Redis Vector Database - Redis University

Pre-trained machine learning models simplify the job of data scientists and avoid lengthy and complex operations to turn objects into the corresponding vector embedding. However, managing massive datasets for development and production environments becomes challenging, especially when real-time throughput, scalability, and high availability are not negotiable requirements.

The availability of machine learning models has boosted the rise of modern use cases and, consequently, the development and adoption of vector databases. Vector databases can store vectors and index and search the vector space efficiently.

Vector databases resolve the problem of managing vectors and their operations, so they must meet specific throughput requirements, ensuring they can handle increasing volumes of data and queries. Hence, it is crucial to ensure the scalability of the data layer and guarantee high availability, with high uptime and uninterrupted access to the stored vector data in case of disaster or maintenance operations.

Vector databases accelerate semantic search

In the following units, we will learn how Redis Stack is designed to perform vector search across millions of vectors with real-time performance. In addition, we will discover how Redis Enterprise and Redis Cloud are designed for high availability and scalability and allow the design of production-ready modern applications.

Modeling vectors in Redis

All the Redis database flavors can store, index, and search vectors. This means that you can work with vectors using the Redis Stack distribution in your development environment and also for functional testing. Redis Enterprise and Redis Enterprise Cloud are built upon the Redis Stack capabilities, but they also offer a robust set of features to work efficiently with vectors at scale.

Redis as a Vector Database

First, it is important to highlight that before the native support for vectors was introduced in Redis Stack Server 6.2.2-v1 in 2022, vectors would be stored in Redis as a string, so serializing the vector and storing it in the desired data structure. An example using the String:

SET vec "0.00555776,0.06124274,-0.05503812,-0.08395513,-0.09052192,-0.01091553,-0.06539601,0.01099653,-0.07732834,0.0536432"

Redis can store any arbitrary object once serialization and deserialization routines are available. A vector is just another object that Redis can store when serialized to the String type. However, Redis has no awareness of the intrinsic nature of the stored object and does not offer any feature to search through the space of vectors.

Since Redis Stack Server 6.2.2-v1, vectors can be stored as Hash or JSON documents, providing flexibility in how data is structured and accessed. Multiple indexing methods are supported, including FLAT and HNSW, enabling users to choose the most suitable approach for their specific use cases. Users can privilege precision over speed with the FLAT method or ensure high throughput with a little compromise on accuracy using HNSW. Additionally, Redis offers support for various distance metrics such as L2, IP, and COSINE, further enhancing the precision and efficiency of vector searches for specific types of embeddings. With these features, Redis becomes a flexible solution for businesses seeking to harness the power of vector data in diverse applications, from recommendation engines to similarity search tasks.

Storing vectors: the hash and JSON data types

Both the Hash and the JSON data types are suitable vector containers. In the following examples, we will show how to work with such data types. Let’s calculate the vector embedding first, using the free all-MiniLM-L6-v2 embedding model from the HuggingFace library. This model maps texts of up to 256 words to a 384-dimensional dense vector space.

text = "Understanding vector search is easy, but understanding all the mathematics behind a vector is not!"
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
embedding = model.encode(text)

Note that Redis does not generate vectors; this is the responsibility of the client application to choose the desired library (HuggingFace, OpenAI, Cohere, and more)

Next, we will store the vector embedding using the desired data structure and learn the syntax to create the index on the vector field stored in the document of choice. If you have already worked with Redis secondary indexing capabilities, you know how to use the commands FT.CREATE and FT.SEARCH. Vectors can be indexed using the VECTOR data type, which adds to the existing TEXT, TAG, NUMERIC, GEO and GEOSHAPE types.

Working with hashes

The vector embedding we have just generated can be stored in a Hash as a binary blob within the document itself, together with the rest of the fields. This means that if our document is structured as follows:

{
    "content": "Understanding vector search is easy, but understanding all the mathematics behind a vector is not!",
    "genre": "technical"
}

then we will include the vector embedding in the document itself:

{
    "content": "Understanding vector search is easy, but understanding all the mathematics behind a vector is not!",
    "genre": "technical",
    "embedding": "..."
}

In the following Python code sample, the utility astype from the numPy library for scientific computing is used: it casts the vector to the desired binary blob format, required by Redis for indexing purposes.

blob = embedding.astype(np.float32).tobytes()
r.hset('doc:1', mapping = {'embedding': blob,
                           'genre': 'technical',
                           'content': text})

Hash documents can be indexed with FT.CREATE using the VECTOR index type. We can also index other fields in the same index definition, like the TEXT and TAG fields in the following instructions. Indexing several fields in the same index enables hybrid searches, which we’ll show later.

FT.CREATE doc_idx ON HASH PREFIX 1 doc: SCHEMA content AS content TEXT genre AS genre TAG embedding VECTOR HNSW 6 TYPE FLOAT32 DIM 384 DISTANCE_METRIC COSINE

Note how we have specified:

the dimension of the vectors, set by the specific embedding model all-MiniLM-L6-v2
the indexing method, HNSW
the vector type, FLOAT32 in the example
the distance, COSINE in the example

Refer to the documentation to learn more about these options.

Working with JSON documents

When using the JSON type to store the vectors, differently from the hash, vectors must be stored as arrays of floats instead of binary blobs. In this Python code sample, the numPy library converts the vector embedding to a list and stores it with the original text and the desired data.

vector = embedding.tolist()
doc = {
    'embedding': vector,
    'genre': 'technical',
    'content': text
}
r.json().set("doc:1", '$', doc)

Redis long-time users are familiar with the Hash data type and may opt for it based on its simplicity, speed, and reduced memory footprint. Users that have experience with document stores, instead, may privilege the JSON format for better interoperability.

Note that one JSON document can store and index multiple vector embeddings. Certain data models may benefit from this feature for specific data representations and document searches. For example, if a large document is split into several chunks, these can all be stored under the same JSON document together with their associated representation as a vector.

Indexing the JSON document can be achieved similarly to the hash:

FT.CREATE doc_idx ON JSON PREFIX 1 doc: SCHEMA $.content as content TEXT $.genre AS genre TAG $.embedding VECTOR HNSW 6 TYPE FLOAT32 DIM 384 DISTANCE_METRIC COSINE

Once the data is inserted and the index created using the desired data type, searching for similarity is straightforward.

Activity. Searching vectors

We know how to generate vector embeddings and create the corresponding index on the vector field. Let’s go back to the former example when we introduced the concept of cosine similarity, and let’s run the example to store the sentences once they are vectorized and search them.

You can create a Python environment and install the required libraries to run the example as follows:

python -m venv redisvenv
source ./redisvenv/bin/activate

pip install numpy
pip install sentence\_transformers
pip install redis

Once your virtual environment is configured, you can move on to the rest of the tasks.

Download the code provided in the file vector_search.py.
Study the code example. In particular, focus on the conversion of the embedding to binary blob and how it is stored in the hash data structure.
Configure your Redis Cloud (or local instance) database host, port, username and password in the file.
Connect to the database using RedisInsight or redis-cli and flush the database with FLUSHALL.
Execute the example. The first time the sample is executed, the requested embedding model all-MiniLM-L6-v2 is downloaded and stored. Wait patiently, this can take a few seconds.

The former script will print on the terminal the two closest results using the k-nearest neighbors algorithm (KNN):

Result{2 total, docs: [Document {'id': 'doc:1', 'payload': None, 'score': '0.0570845603943', 'content': 'That is a very happy person'}, Document {'id': 'doc:2', 'payload': None, 'score': '0.305422723293', 'content': 'That is a happy dog'}]}

Expectedly, the best match is “That is a very happy person”, having a shorter distance from the test sentence “That is a happy person”.

Note that the cosine distance is complementary to cosine similarity and can be obtained by subtracting the value of the cosine similarity from 1.

Data types, distances and indexing methods

Using Redis as a Vector Database, you have several options to make at design time that will influence your data model, the correctness of the results, and the overall performance of your application. The three main aspects you will evaluate in this unit are:

The data type
The distance
The indexing methods

Let’s cover the main points to consider when designing your application.

Choosing the right data type

Redis can store and manage vectors in Hash or JSON data types, as discussed. Besides the intrinsic structural differences between the Hash and the JSON, let’s make some considerations.

First, the JSON data type supports the same features as the Hash data type when performing VSS. There are some slight differences, though, to take into account when working with a determined type.

Searching
- When using Hashes, storing and searching vectors requires using the binary blob format.
- For JSON documents, formats used for storing and searching are asymmetric: vectors must be stored as lists rather than binary blobs (model.encode(text).astype(np.float32).tolist()), but to perform VSS, JSON requires the binary blob format model.encode(text).astype(np.float32).tobytes()
Indexing. The Hash can index a single vector, defined by the FT.CREATE command. The JSON format, instead, can store and have multiple vectors indexed, identified by a JSONPath expression
Footprint. JSON has a larger memory footprint compared to the Hash

Choosing the right distance

We mentioned that similarity between vectors can be measured through different methods; currently, we support three among the most popular: Euclidean distance, Internal product, and Cosine similarity.

L2. The Euclidean distance is the default distance metric used by many algorithms, and it generally gives good results. Conceptually, it should be used when we compare observations whose features are continuous: numeric variables like height, weight, or salaries, for example, although it should be noted that it works best with low-dimensional data and where the magnitude of the vectors is important to be measured.
COSINE. Cosine similarity considers the cosine of the angle formed by two vectors (when the angle is close to 0, the cosine tends to 1, representing the maximum similarity). The cosine similarity does not account for the magnitude of the vectors being compared. The cosine distance is complementary to cosine similarity (obtained by subtracting the cosine similarity value from 1). This distance is appropriate when the magnitude of the vectors is not important in the description of the unstructured data
IP. The inner product looks at both the angle between the vectors and their magnitude. Note that this distance is equivalent to cosine similarity if vectors are normalized.

Depending on the model used to represent the unstructured data, one distance may fit better than the others.

Choosing the indexing method

When a new vector is added to Redis, it can be indexed by one of the two indexing methods:

Flat index (FLAT)

You can use the FLAT indexing method for smaller datasets. This method compares the test vector to all the vectors in the index, one by one. This is a more accurate but much slower and compute-intensive approach

Hierarchical Navigable Small World graphs (HNSW)

For more extensive datasets, it becomes difficult to compare the test vector to every single vector in the index, so a probabilistic approach is adopted through the HNSW algorithm. This method provides speedy search results. This approach trades some accuracy for significant performance improvements.

Activity. Vector search with range queries

Redis supports Range Queries for vector search, a way of filtering query results by the distance between the stored vectors and a query vector in terms of the relevant vector field distance metric. You can think of it as a geo query by radius, where we return all the points within a certain distance of a given point, except that the radius is the distance between the vectors. As an example, we can modify the query written in the previous example with:

q = Query("@embedding:[VECTOR_RANGE $radius $vec]=>{$YIELD_DISTANCE_AS: score}").return_field("score").return_field("content").dialect(2)
res = r.ft("doc_idx").search(q, query_params={"vec": model.encode(sentence).astype(np.float32).tobytes(), "radius":0.1})

This time, rather than specifying that we want the two best matches, we specify that we would rather have all the sentences with a distance score under 0.1. Executing the example with the modification produces:

Result{1 total, docs: [Document {'id': 'doc:1', 'payload': None, 'score': '0.0570845603943', 'content': 'That is a very happy person'}]}

The matching sentence is the expected result. You can learn more about range queries in the documentation.

Activity. Vector search with hybrid queries

Vector search can be combined with the other querying mechanisms (including range queries), giving us the possibility to run hybrid queries. For example, we can modify the query in the previous example as follows to indicate that we want to filter by category. Using the TAG field specified for the genre:

q = Query("@genre:{pets}=>[KNN 2 @embedding $vec AS score]").return_field("score").return_field("content").dialect(2)

The result is returned correctly.

Result{1 total, docs: [Document {'id': 'doc:2', 'payload': None, 'score': '0.305422723293', 'content': 'That is a happy dog'}]}

You can learn more about hybrid queries in the documentation.

Activity. Working with RedisVL

In the introduction to vector search, we have provided examples using the redis-py standard client library for Python, supported by Redis. In this unit, we will provide you with a small example program that uses the RedisVL client for Python to store and manipulate data in Redis. RedisVL is an experimental library, and the API may be subject to change. However, for prototyping purposes, it may accelerate the development, as it abstracts several operations to model, store, and index unstructured data.

The code is located in the src/python/ folder in the course GitHub repository. You should have already cloned this repository to your machine as part of the initial course setup step.

Follow the instructions in the README.md file if you’d like to run the code in your local environment.

Code Walkthrough

The example is a Python version of the simple vector search example already introduced along the course, where we:

Instantiate the proper embedding model
Create the index with the desired fields
Create vectors from the three sentences using the model, and store them
Consider a sample sentence, calculate the embedding, and perform vector search

Embedding model creation

The embedding model we will be using in this example proceeds from HuggingFace. Installing the dependency is done with:

  pip install redisvl
  pip install sentence_transformers

The chosen model is all-MiniLM-L6-v2, which maps sentences and paragraphs to a 384-dimensional dense vector space.

hf = HFTextVectorizer(model="sentence-transformers/all-MiniLM-L6-v2")

Index creation

In this example, we are modeling simple documents having this structure:

{
    "content": "This is a content",
    "genre": "just-a-genre",
    "embedding": "..."
}

Provided there is no nested information in our document, the Hash data type fulfills the purpose. In addition to creating an index for the vector embedding, we will also create a full-text index of type TEXT for the content field and an index of type TAG for the genre field. The relevant options for the VECTOR index type, such as the Euclidean distance and the vector dimension, are also specified. You can learn more about the rest of the options from the documentation. The index is defined by the schema.yaml file.

index:
    name: vector\_idx
    prefix: doc

fields:
    text:
        - name: content
    tag:
        - name: genre
    vector:
        - name: embedding
          dims: 384
          distance\_metric: l2
          algorithm: HNSW

Index creation follows the import of the schema:

index = SearchIndex.from\_yaml("schema.yaml")

Vector embedding generation

Vector embeddings can be created using the instantiated model. Note that embeddings are stored in Hashes using the binary blob format used in the example.

data = \[
    {'content': 'That is a very happy person', 'genre': 'persons', 'embedding': hf.embed('That is a very happy person', as\_buffer=True)},
    {'content': 'That is a happy dog', 'genre': 'pets', 'embedding': hf.embed('That is a happy dog', as\_buffer=True)},
    {'content': 'Today is a sunny day', 'genre': 'weather', 'embedding': hf.embed('Today is a sunny day', as\_buffer=True)}
\]

index.load(data)

Remember that JSON must store vector embeddings as arrays of floats. Hence, notice the flag to be used for JSON documents: as_buffer=False

Perform the search

Finally, considering the test sentence “That is a happy person”, we perform the KNN search and return the score of the search and the content of the best matches. In this example, we are returning the three documents so that you can analyze the score returned by the field vector_distance.

query = VectorQuery(
    vector=hf.embed('That is a happy person'),
    vector\_field\_name="embedding",
    return\_fields=\["content"\],
    num\_results=3,
)

results = index.query(query)

Implementing a text-based recommender system

The idea behind a recommender system using vector search is to transform the relevant information (title, description, date of creation, authors, and more) into the corresponding vector embedding and store it in the same document as the original data. Then, when visualizing an entry (an article from a digital newspaper or any other media from the web), it is possible to leverage the stored vector embedding for that entry and feed into a vector search operation to semantically similar content.

Let’s consider the following example.

If you want to run the example first, jump to the bottom of this article to learn how to do so.

Writing a recommender system

We will use the dataset of books available under /data/books
The source code of the example is available as the Jupyter notebook books.ipynb

You can refer to the source code for the details to load the books and generate the embeddings. Books will be stored in the following JSON format and using the Redis Stack JSON data type.

  {
      "author": "Martha Wells",
      "id": "43",
      "description": "My risk-assessment module predicts a 53 percent chance of a human-on-human massacre before the end of the contract." A short story published in Wired.com magazine on December 17, 2018.",
      "editions": \[
        "english"
      \],
      "genres": \[
        "adult",
        "artificial intelligence",
        "fantasy",
        "fiction",
        "humor",
        "science fiction",
        "science fiction (dystopia)",
        "short stories",
        "space"
      \],
      "inventory": \[
        {
          "status": "available",
          "stock\_id": "43\_1"
        }
      \],
      "metrics": {
        "rating\_votes": 274,
        "score": 4.05
      },
      "pages": 369,
      "title": "Compulsory",
      "url": "https://www.goodreads.com/book/show/56033969-compulsory",
      "year\_published": 2018
    }

The relevant section in the example is the implementation of semantic search, delivered by this snippet of code:

def get_recommendation(key):
embedding = r.json().get(key)
embedding_as_blob = np.array(embedding['embedding'], dtype=np.float32).tobytes()
q = Query("*=>[KNN 5 @embedding $vec AS score]").return_field("$.title").sort_by("score", asc=True).dialect(2).paging(1, 5)
res = r.ft("books_idx").search(q, query_params={"vec": embedding_as_blob})
return res

The previous snippet does the following:

Given a document, it extracts the vector embedding for that document from the JSON entry
It converts the vector embedding, stored as an array of floats, to a binary array
It executes Vector Similarity Search to find similarities and get the most similar books
It pages the results, excluding the first result. Hence, paging starts from 1 rather than 0. In the first position, we would find the entry itself, having a distance from the test vector equal to zero

Launching the execution of the example for the two known movies: “It” and “Transformers: The Ultimate Guide” :

print(get\_recommendation('book:26415'))
print(get\_recommendation('book:9'))

We obtain the related recommendations:

Result{5 total, docs: \[Document {'id': 'book:3008', 'payload': None, '$.title': 'Wayward'}, Document {'id': 'book:2706', 'payload': None, '$.title': 'Before the Devil Breaks You'}, Document {'id': 'book:23187', 'payload': None, '$.title': 'Neverwhere'}, Document {'id': 'book:942', 'payload': None, '$.title': 'The Dead'}\]}

Result{5 total, docs: \[Document {'id': 'book:15', 'payload': None, '$.title': 'Transformers Volume 1: For All Mankind'}, Document {'id': 'book:3', 'payload': None, '$.title': 'Transformers: All Fall Down'}, Document {'id': 'book:110', 'payload': None, '$.title': 'Transformers: Exodus: The Official History of the War for Cybertron (Transformers'}, Document {'id': 'book:2', 'payload': None, '$.title': 'Transformers Generation One, Vol. 1'}\]}

Performing range search

In this example, we executed a KNN search and retrieved the documents with the closest distance from the document being considered. Alternatively, we can perform a vector search range search to retrieve results having the desired distance from the sample vector embedding. The related code is:

def get_recommendation_by_range(key):
embedding = r.json().get(key)
embedding_as_blob = np.array(embedding['embedding'], dtype=np.float32).tobytes()
q = Query("@embedding:[VECTOR_RANGE $radius $vec]=>{$YIELD_DISTANCE_AS: score}")\
  .return_fields("title")\
  .sort_by("score", asc=True)\
  .paging(1, 5)\
  .dialect(2)

# Find all vectors within a radius from the query vector
query_params = {
  "radius": 3,
  "vec": embedding_as_blob
}

res = r.ft("books_idx").search(q, query_params)
return res

Computing this vector search range search returns similar results.

Result{1486 total, docs: \[Document {'id': 'book:3008', 'payload': None, 'title': 'Wayward'}, Document {'id': 'book:2706', 'payload': None, 'title': 'Before the Devil Breaks You'}, Document {'id': 'book:23187', 'payload': None, 'title': 'Neverwhere'}, Document {'id': 'book:942', 'payload': None, 'title': 'The Dead'}, Document {'id': 'book:519', 'payload': None, 'title': 'The Last Days of Magic'}\]}

Result{1486 total, docs: \[Document {'id': 'book:15', 'payload': None, 'title': 'Transformers Volume 1: For All Mankind'}, Document {'id': 'book:3', 'payload': None, 'title': 'Transformers: All Fall Down'}, Document {'id': 'book:110', 'payload': None, 'title': 'Transformers: Exodus: The Official History of the War for Cybertron (Transformers'}, Document {'id': 'book:2', 'payload': None, 'title': 'Transformers Generation One, Vol. 1'}, document {'id': 'book:37', 'payload': None, 'title': 'How to Build a Robot Army: Tips on Defending Planet Earth Against Alien Invaders, Ninjas, and Zombies'}\]}

We have provided you with a Jupyter notebook that includes the entire example. Follow this procedure to create and activate your Python virtual environment:

python -m venv vssvenv
source vssvenv/bin/activate

Once done, install the required modules defined by the requirements.txt requirements file, available under /src/jupyter

  pip install -r requirements.txt

Ensure that you have database host, port, username and password for your Redis Cloud database at hand (alternatively, a Redis Stack instance is running). Complete the configuration of the environment by setting the environment variable that configures your Redis instance (default is localhost on port 6379).

Connect to the database using RedisInsight or redis-cli and flush the database with FLUSHALL.
Configure the environment variable to connect export REDIS_URL=redis://user:password@host:port

Now, you can start the notebook, execute all the cells, and check the results.

jupyter notebook books.ipynb

Large Language Models

The rise of conversational Artificial Intelligence (AI) has taken the world by storm in the early months of 2023, thanks to the advent of powerful Large Language Models (LLMs) such as ChatGPT latest releases. ChatGPT versions 3.5 and 4, presented around March 2023, have surprised users with unprecedented quality answers, the ability to solve complex and structured problems, produce ideas, organize and edit texts, and generate source code, all of this using natural and conversational questions in the desired language have impressed the world in a wide variety of scenarios and use cases. While such a paradigm shift has been driven by the ChatGPT assistant available for free to the public, the possibility of turning the usual applications and services into smart assistants has been accelerated by pay-as-you-go service models by OpenAI and other providers, democratizing the access to such advanced capabilities.

Running LLMs on-premise is hard. Besides the massive amount of resources to hold billions of parameters in memory required to generate an almost real-time answer to a question (which is referred to as prompt), designing such systems for scalability and elasticity requires a non-negligible effort. Making such services available to developers for rapid prototyping is another factor contributing to the vast adoption of LLMs and the surge of many heterogeneous services in different areas where a user requires dedicated and customized attention, such as recommendations, assistance, financial advisory, troubleshooting, and more.

Finally, factors that have influenced the rapid ascent of conversational AI services and have led to the release of increasingly advanced algorithms are massive datasets available for training (the Internet), computational power and efficiency provided by modern Graphical Processing Units (GPUs), and the advance of distributed systems and architectures. Training such systems is extremely time-consuming and resource-intensive. For example, training ChatGPT 4 took over a month and dozens of GPUs, which led to freezing the training set in time and cutting off forthcoming knowledge. This intrinsic feature of LLMs poses a constraint on several kinds of applications: working with fresh data is not possible when integrating an LLM technology into a service. Given this background, new techniques are becoming popular to circumvent such limitations and enable LLMs to assist the user even in the case of recent updates to the corpus of knowledge and provide answers when the model was not trained on specific content.

The challenge of outdated information

If you have ever asked ChatGPT the following question:

❓ What is the newest data ChatGPT is trained on?

You may already have realized that ChatGPT training happened at some point in the past, which means that the newest data may already be quite in the past (and the same is true for other LLMs).

❗ My training is based on the GPT-3.5 architecture, and my knowledge is current up until
September 2021. Therefore, any events, information, or developments that have occurred
after that date are outside my training data, and I may not have the most up-to-date
information on them.

This is why it is impossible to answer questions like:

❓ What are the relevant facts of 2023?

❗ I apologize for any inconvenience, but as of my last knowledge update in September 2021, I do not have access to information or events that have occurred in 2023 or beyond. My training data only goes up to that point, and I am not able to browse the internet or access real-time information.
To get information about events and facts relevant to 2023, I would recommend checking reliable news sources, websites, or databases that provide up-to-date information on current events and developments.

Retraining LLMs to include the latest and fresh knowledge is expensive and not viable in the immediate term, even in the case of resorting to a custom LLM trained on-premise, so two principal methods are gaining traction to overcome this limitation and enable the latest information in the system: fine-tuning and Retrieval Augmented Generation (RAG).

Redis, the Vector Database for conversational AI use cases

Redis, as a high-performance, in-memory data platform, can play a pivotal role in addressing the challenges of LLM-based use cases. Here’s how:

Context Retrieval for RAG. Pairing Redis Enterprise with LLMs enables these models to access external contextual knowledge. This contextual knowledge is crucial for providing accurate and context-aware responses, preventing the model from generating incorrect or ‘hallucinated’ answers. By storing and indexing vectors that model unstructured data, Redis Enterprise ensures that the LLM can retrieve relevant information quickly and effectively, enhancing its response quality.
LLM Conversation Memory. Redis Enterprise allows the persistence of all conversation history (memories) as embeddings in a vector database to improve model quality and personalization. When a conversational agent interacts with the LLM, it can check for relevant memories to aid or personalize the LLM’s behavior. This feature enables seamless topic transitions during conversations and reduces misunderstandings.
Semantic Caching. LLM completions can be computationally expensive. Redis Enterprise helps reduce the overall costs of ML-powered applications by caching input prompts and evaluating cache hits based on semantic similarity using vector search. This caching mechanism ensures that frequently requested information is readily available, optimizing response times and resource utilization.
Fine-tuning and Retrieval Augmented Generation (RAG)

General purpose LLMs can be extended and turned into a specific purpose model by training part of the model (which does not imply nor require retraining the model in its entirety, but merely adjusting some model’s parameters, while most remain unchanged). Fine-tuning involves training the model with specific data (typically prepared following a conversation format), which results in higher task-specific training. However, this approach comes with drawbacks, such as the need for retraining when fresh knowledge is required. Conversely, RAG represents a simplified and convenient way to instruct the model with the desired information when an interaction with the model is required.

RAG, presented by Meta in 2020, allows LLMs to incorporate external knowledge sources through retrieval mechanisms, extending the model capabilities with the latest information. This method enables language models to perform similarly to humans, with little information collected from the environment and in real-time. RAG has been demonstrated to be very effective. However, it requires careful prompt engineering, management of fresh knowledge, and the orchestration of different components. The following picture summarizes the flow when a user interacts with a chatbot assistant by asking a question.

Retrieval Augmented Generation with Redis

We can simplify the architecture by considering the following three phases:

Preparation. The knowledge we want to make available to increase the expertise of our LLM assistant is collected, transformed, ingested, and indexed. This requires a specific data preprocessing pipeline, with connectors to the data source and downstream connectors to the target database. In the implementation we will explore in this article, Redis is the chosen Vector Database. The data can be represented by articles, documents, books, and any textual source to specialize our chat. Of the many indexing strategies available, vector databases have been demonstrated to be effective at indexing and searching unstructured data stored in vectorial format.
Retrieval. In this phase, the information (or context) relevant to the user’s question is retrieved. Database semantic search assists in this task: the question is converted to a vector embedding, and vector search is performed to retrieve the relevant results from the database. vector search can be configured and performed with hybrid or range search strategies to determine what results best describe the question and can likely contain an answer. The assumption is that the question and the answer will be semantically similar.
Generation. Time of prompt engineering: with the relevant context and the question in our hands, we proceed to create a prompt and instruct the LLM to elaborate and return a response. Composing the right prompt to leverage the provided context (and eventually the previous historical interactions in the chat) is crucial to getting a relevant answer to the question and guardrail the output.

LLM conversation memory

Current LLM services do not store any conversation history. So, conversations are stateless, which is the same. This means that once a question is asked and the answer generated, we cannot refer to previous passages in the conversation. Keeping the context of the conversation in memory and providing the LLM with the entire conversation (as a list of pairs question + response) together with the new question is the responsibility of the client application.

> Review the body of the OpenAI chat completion API, which accepts messages: the list of messages comprising the ongoing conversation. However, sending back to the LLM the entire conversation may not be convenient for two main reasons.

First, we should filter out irrelevant interactions from the current conversation when these do not relate to the last question. So, in practice, imagine a conversation about food interrupted by a few questions about coding and then additional questions about the former food context. Storing all the questions and responses and their corresponding vector embeddings in the user’s session enables vector search to find those semantically similar interactions to the last question. Using this method, we can pick the relevant portion of the conversation.

LLM conversation memory

The second reason that motivates smart conversation history management is cost reduction. LLM-as-a-service models charge the user based on the number of tokens in the question and the answer. This means that the longer the context, the more expensive the LLM service.

The idea behind the LLM Conversation Memory is to improve the model quality and personalization through an adaptive memory.

Persist all conversation history (memories) as embeddings in a vector database.
A conversational agent checks for relevant memories to aid or personalize the LLM behavior.
Allows users to change topics without misunderstandings seamlessly.
Semantic caching is used with large user bases or commonly asked questions. As usual with caching, this use case is about improving the application’s responsiveness and reducing costs when using LLM-as-a-service. Because LLM completions are expensive, it helps to reduce the overall costs of the ML-powered application.

Semantic caching with Redis

In practical terms, if a semantic cache is in place, whenever a new question is asked, this will be vectorized, and semantic search will be executed to find out if this question was already asked (we may use vector search with range search and establish a radius to refine the results). If the same question has already been asked, the LLM does not intervene to generate the answer, and the cached response is returned. Otherwise, the LLM produces a new response, which is cached for future searches.

Use vector database to cache input prompts
Cache hits evaluated by semantic similarity

Note that the RedisVL client library makes semantic caching available out-of-the-box.

Setting up a RAG Chatbot

Bookmark this page

Prototyping an ML-powered chatbot is not an impossible mission. The many frameworks and libraries available, together with the simplicity of getting an API token from the chosen LLM service provider, can assist you in setting up a proof-of-concept in a few hours and lines of code. Sticking to the three phases mentioned earlier (preparation, generation, and retrieval), let’s proceed to create a chatbot assistant, a movie expert you can consult to get recommendations from and ask for specific movies.

If you want to run the example first, jump to the bottom of this article to learn how to do so.

Preparation

Imagine a movie expert who may answer questions or recommend movies based on criteria (genre, your favorite cast, or rating). A smart, automated chatbot will be trained on a corpus of popular films, which, for this example, we have downloaded from Kaggle: the IMDB movies dataset, with more than 10,000 movies and plenty of relevant information. An entry in the dataset stores the following information:

{
  "names": "The Super Mario Bros. Movie",
  "date\_x": "04/05/2023",
  "score": 76.0,
  "genre": "Animation, Adventure, Family, Fantasy, Comedy",
  "overview": "While working underground to fix a water main, Brooklyn plumbers---and brothers---Mario and Luigi are transported down a mysterious pipe and wander into a magical new world. But when the brothers are separated, Mario embarks on an epic quest to find Luigi.",
  "crew": \[
    "Chris Pratt, Mario (voice)",
    "Anya Taylor-Joy, Princess Peach (voice)",
    "Charlie Day, Luigi (voice)",
    "Jack Black, Bowser (voice)",
    "Keegan-Michael Key, Toad (voice)",
    "Seth Rogen, Donkey Kong (voice)",
    "Fred Armisen, Cranky Kong (voice)",
    "Kevin Michael Richardson, Kamek (voice)",
    "Sebastian Maniscalco, Spike (voice)"
  \],
  "status": "Released",
  "orig\_lang": "English",
  "budget\_x": 100000000.0,
  "revenue": 724459031.0,
  "country": "AU"
}

As mentioned, to enable context retrieval, we will capture the semantics of the data using an embedding model and we will store the embedding vector in the database, which will perform indexing using the desired method (FLAT or HNSW), distance (L2, IP or COSINE) and the required vector dimension. In particular, the index definition depends on the dimension of the vector specified by DIM, which is set by the chosen embedding model. The chosen embedding model we will use along this example is the open-source all-MiniLM-L6-v2 sentence transformer, which converts the provided paragraphs to a 384 dimensional dense vector space.

Note that embedding models support the conversion of texts up to a certain size. The chosen model warns that “input text longer than 256-word pieces is truncated”. This is not an issue for our movie dataset because we expect to convert paragraphs whose length is shorter than the limit. However, a text chunking strategy to map a document to multiple vector embeddings is needed for longer texts or even entire books.

Now we can parse the CSV dataset and import it in JSON format into Redis so that we can read a movie entry with:

JSON.GET moviebot:movie:2 $.names $.overview
{"$.overview":\["While working underground to fix a water main, Brooklyn plumbers---and brothers---Mario and Luigi are transported down a mysterious pipe and wander into a magical new world. But when the brothers are separated, Mario embarks on an epic quest to find Luigi."\],"$.names":\["The Super Mario Bros. Movie"\]}

We can read any nested entry or multiple entries in JSON documents stored in Redis Enterprise using the JSONPath syntax. However, we need an index to perform searches. So, we will proceed to create an index for this dataset, define the schema aligned to the data structure, and specify the embedding model and distance metric to be used for semantic search with vector search as long as the vector dimension set by the chosen embedding model, 384 in this case. A possible index definition could be:

FT.CREATE movie\_idx ON JSON PREFIX 1 moviebot:movie: SCHEMA $.crew AS crew TEXT $.overview AS overview TEXT $.genre AS genre TAG SEPARATOR , $.names AS names TAG SEPARATOR , $.overview\_embedding AS embedding VECTOR HNSW 6 TYPE FLOAT32 DIM 384 DISTANCE\_METRIC COSINE

This definition enables searches on several fields. As an example, we can perform a full-text search:

FT.SEARCH movie\_idx @overview:'While working underground' RETURN 1 names
1) (integer) 1
2) "moviebot:movie:2"
3) 1) "names"
   2) "The Super Mario Bros. Movie"

Or retrieve a movie by exact title match:

    FT.SEARCH movie\_idx @names:{Interstellar} RETURN 1 overview
    1) (integer) 1
    2) "moviebot:movie:190"
    3) 1) "overview"
       2) "The adventures of a group of explorers who make use of a newly discovered wormhole to surpass the limitations on human space travel and conquer the vast distances involved in an interstellar voyage."

Secondary index search is certainly relevant to assist the retrieval of contextual information or additional details, or even when the codebase is tightly coupled to the LLM using function calling capabilities. We want to answer questions using information that spans the entire dataset (such as the average rating of all the movies of a specific genre). However, for this proof-of-concept, we will resort to vector search only, and the index defined accordingly.

The final step to complete the preparation phase is deciding what will be indexed by the database; for that, we need to prepare the paragraph to be transformed by the embedding model. We can capture as much information as we want. In the following Python excerpt, we will extract one entry and format the string movie.

result = conn.json().get(key, "$.names", "$.overview", "$.crew", "$.score", "$.genre")
movie = f"movie title is: {result\['$.names'\]\[0\]}\\n"
movie += f"movie genre is: {result\['$.genre'\]\[0\]}\\n"
movie += f"movie crew is: {result\['$.crew'\]\[0\]}\\n"
movie += f"movie score is: {result\['$.score'\]\[0\]}\\n"
movie += f"movie overview is: {result\['$.overview'\]\[0\]}\\n"

Now, we can transform this string to a vector using the chosen model and store the vector in the same JSON entry, so the vector is packed together with the original entry in a compact object.

from sentence\_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
embedding = model.encode(movie).astype(np.float32).tolist()
conn.json().set(key, "$.overview\_embedding", embedding)

Repeating the operation for all the movies in the dataset completes the preparation phase.

Retrieval

In this phase, we deal with the question from the user. The interaction is usually collected in the front end of a web application, using a standard input form, so you can capture it and forward it to the back end for processing. As anticipated, the question and the context will be semantically similar, so a proven technique to instruct the LLM with a context is to transform the question to vector embedding, then perform vector search to collect the desired number of outputs, and finally construct the prompt. A Python sample code to perform vector search in Redis follows:

context = ""
q = Query("@embedding:\[VECTOR\_RANGE $radius $vec\]=>{$YIELD\_DISTANCE\_AS: score}") \\
    .sort\_by("score", asc=True) \\
    .return\_fields("overview", "names", "score", "$.crew", "$.genre", "$.score") \\
    .paging(0, 3) \\
    .dialect(2)

# Find all vectors within VSS\_MINIMUM\_SCORE of the query vector
query\_params = {
    "radius": VSS\_MINIMUM\_SCORE,
    "vec": model.encode(query).astype(np.float32).tobytes()
}

res = conn.ft("movie\_idx").search(q, query\_params)

if (res is not None) and len(res.docs):
    it = iter(res.docs\[0:\])
    for x in it:
        movie = f"movie title is: {x\['names'\]}\\n"
        movie += f"movie genre is: {x\['$.genre'\]}\\n"
        movie += f"movie crew is: {x\['$.crew'\]}\\n"
        movie += f"movie score is: {x\['$.score'\]}\\n"
        movie += f"movie overview is: {x\['overview'\]}\\n"
        context += movie + "\\n"

The search command performs a vector search range search and filters results exceeding a certain score specified by VSS_MINIMUM_SCORE and collects three samples. In the example, we extract the desired metadata from the results and concatenate it to create a context for the interaction.

In our example, the dataset provides a short overview of the movie and other information, so we can construct the context by concatenating the information in a string. However, the context window supported by LLMs is limited by a maximum number of tokens (learn more from the OpenAI tokenizer page). In addition, the LLM service provider charges you by the overall number of input and output tokens, so limiting the number of tokens provided in the context and instructing the model to return an output limited in size may be convenient.

Having retrieved the required information, the prompt you construct should include the knowledge you want the LLM to use for generating responses. It should provide clear instructions for handling user queries and accessing the indexed data. An example might be:

prompt = '''Use the provided information to answer the search query the user has sent.
The information in the database provides three movies, choose the one or the ones that fit most.
If you can't answer the user's question, say "Sorry, I am unable to answer the question,
try to refine your question". Do not guess. You must deduce the answer exclusively
from the information provided.
The answer must be formatted in markdown or HTML.
Do not make things up. Do not add personal opinions. Do not add any disclaimer.

Search query:

{}

Information in the database:

{}
'''.format(query, context)

Formatting the prompt with the context and the query from the user completes the retrieval phase, and we are ready to interact with the LLM.

Generation

In the final phase, which concludes this example, we will forward the prompt to the LLM. We will use an OpenAI endpoint to leverage the GPT-3.5 model gpt-3.5-turbo-0613, but we may have used the desired model. Whatever the choice, using an LLM-as-a-service is the best way to set up and prepare a demonstration without major efforts, which a local LLM will imply. To go ahead with GPT-3.5, create your OpenAI token and specify it using the environment variable OPENAI_API_KEY.

export OPENAI_API_KEY="1234567890abcdefghijklmnopqrstuvwxyz"

Using the OpenAI ChatCompletion API is straightforward, refer to the API documentation to learn the details. To send the request you will need to specify, besides the chosen model, at least:

The system message sets the context and the tone of the conversation. It is typically the first message in the interaction and guides the model’s behavior during the conversation. For example, you may specify here that you would like the interaction customized for primary school students. This would tune the tone accordingly and produce responses suitable for youngsters.
The stream behavior. If you have used the ChatGPT online chatbot, you have noticed how the first word is returned almost immediately after submitting the question. This happens because the response is built and streamed as it is produced on the server. This creates a good user experience but requires managing a stream of information and updating the user interface accordingly. In this example, we will go for a batched response, so the time to the first word equals the time to get the full response, which is easier to implement.
The messages in the conversation. In this example, we are submitting the prompt just produced, but note that you won’t be able to ask the LLM questions such as Can you refine the previous response to be shorter? or make a summary of the conversation kept so far. Conversations with the LLM are stateless, so the model does not keep track of any interaction. It is delegated to the client application to store the messages, search for them when required, and forward them as a list of messages.

system_msg = 'You are a smart and knowledgeable AI assistant with expertise in all kinds of movies. You are a very friendly and helpful AI. You are empowered to recommend movies based on the provided context. Do NOT make anything up. Do NOT engage in topics that are not about movies.';

try:
    response = openai.ChatCompletion.create(model="gpt-3.5-turbo-0613",
                                            stream=False,
                                            messages=[{"role": "system", "content": system_msg},
                                                        {"role": "user", "content": prompt}])
    return response["choices"][0]["message"]["content"]
except openai.error.OpenAIError as e:
    # Handle the error here
    if "context window is too large" in str(e):
        print("Error: Maximum context length exceeded. Please shorten your input.")
        return "Maximum context length exceeded"
    else:
        print("An unexpected error occurred:", e)
        return "An unexpected error occurred"

Congratulations, you have completed the setup of a movie expert chatbot! Now follow the activity proposed in the next unit to run the complete implementation.

Activity. Setting up a RAG Chatbot

We have provided you with a Jupyter notebook that includes the entire example and opens an input form to chat with the Generative AI within the notebook. Follow this procedure to create and activate your Python virtual environment:

python -m venv vssvenv
source vssvenv/bin/activate

Once done, install the required modules defined by the requirements.txt requirements file, available under /src/jupyter

pip install -r requirements.txt

Connect to the database using RedisInsight or redis-cli and flush the database with FLUSHALL.
Configure the environment variable to connect export REDIS_URL=redis://user:password@host:port
Configure the OpenAI token using the environment variable: export OPENAI_API_KEY="your-openai-token"

Now, you can start the notebook and execute all the cells.

jupyter notebook moviebot.ipynb

The execution of the notebook will open an input field. Type your question (e.g., Recommend three science fiction movies) and check the result!

Redis Vector Database - Redis University

Modeling vectors in Redis

Storing vectors: the hash and JSON data types

Working with hashes

Working with JSON documents

Activity. Searching vectors

Data types, distances and indexing methods

Choosing the right data type

Choosing the right distance

Choosing the indexing method

Flat index (FLAT)

Hierarchical Navigable Small World graphs (HNSW)

Activity. Vector search with range queries

Activity. Vector search with hybrid queries

Activity. Working with RedisVL

Code Walkthrough

Embedding model creation

Index creation

Vector embedding generation

Perform the search

Implementing a text-based recommender system

Writing a recommender system

Performing range search

Large Language Models

The challenge of outdated information

Redis, the Vector Database for conversational AI use cases

Fine-tuning and Retrieval Augmented Generation (RAG)

LLM conversation memory

Setting up a RAG Chatbot

Preparation

Retrieval

Generation

Activity. Setting up a RAG Chatbot