RAG Model Documentation Demo

In this notebook, we are going to implement a simple RAG model for automating the process of answering RFP questions using GenAI. We will see how to initialize an embedding model, a retrieval model, and a generation model with LangChain components and use them within the ValidMind Developer Framework to run tests against each of them. Finally, we will put them together in a pipeline, run it to get end-to-end results, and run tests against the pipeline as a whole.

About ValidMind

ValidMind is a platform for managing model risk, including risk associated with AI and statistical models.

You use the ValidMind Developer Framework to automate documentation and validation tests, and then use the ValidMind AI Risk Platform UI to collaborate on model documentation. Together, these products simplify model risk management, facilitate compliance with regulations and institutional standards, and enhance collaboration between yourself and model validators.

Before you begin

This notebook assumes you have basic familiarity with Python, including an understanding of how functions work. If you are new to Python, you can still run the notebook but we recommend further familiarizing yourself with the language.

If you encounter errors due to missing modules in your Python environment, install the modules with pip install, and then re-run the notebook. For more help, refer to Installing Python Modules.

New to ValidMind?

If you haven’t already seen our Get started with the ValidMind Developer Framework guide, we recommend exploring the available resources for developers. There, you can learn more about documenting models, find code samples, and read our developer reference.

For access to all features available in this notebook, create a free ValidMind account.

Signing up is FREE — Sign up now

Key concepts

  • FunctionModels: ValidMind offers support for creating VMModel instances from Python functions. This enables us to support any “model” by simply using the provided function as the model’s predict method.
  • PipelineModels: ValidMind models (VMModel instances) of any type can be piped together to create a model pipeline. This allows model components to be created and tested/documented independently, and then combined into a single model for end-to-end testing and documentation. We use the | operator to pipe models together (see the short sketch after this list).
  • RAG: RAG stands for Retrieval Augmented Generation and refers to a wide range of GenAI applications where some form of retrieval is used to add context to the prompt so that the LLM that generates content can refer to it when creating its output. In this notebook, we are going to implement a simple RAG setup using LangChain components.
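
To make these concepts concrete, here is a minimal, illustrative sketch of the pattern used throughout this notebook: wrapping plain Python functions as ValidMind models with vm.init_model and piping them together. The function names and input_id values here are hypothetical, and the sketch assumes vm.init() has already been called (as shown in the ValidMind Initialization section below):

import validmind as vm

# Any Python function can act as a model: it receives one row of inputs as a
# dict and its return value becomes that row's prediction.
def normalize(input):
    return input["question"].strip().lower()

# Downstream steps can read upstream outputs under the upstream model's input_id.
def count_words(input):
    return len(input["normalizer"].split())

vm_normalizer = vm.init_model(input_id="normalizer", predict_fn=normalize)
vm_counter = vm.init_model(input_id="counter", predict_fn=count_words)

# Pipe the two function models into a single pipeline model for end-to-end use.
vm_toy_pipeline = vm.init_model(vm_normalizer | vm_counter, input_id="toy_pipeline")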

Pre-requisites

Let’s go ahead and install the validmind library if it’s not already installed. Then we can install the qdrant-client library for our vector store and langchain for everything else:

%pip install -q validmind
%pip install -q qdrant-client langchain langchain-openai sentencepiece

ValidMind Initialization

Now we will import and initialize the ValidMind framework so we can connect to our project in the ValidMind platform. This will let us log inputs, plots, and test results to our model documentation.

# Replace with your code snippet

import validmind as vm

vm.init(
    api_host="https://api.prod.validmind.ai/api/v1/tracking",
    api_key="...",
    api_secret="...",
    project="...",
)

Read OpenAI API Key

We will need an OpenAI API key to use the text-embedding-3-small model for our embeddings, the gpt-3.5-turbo model for our generator, and the gpt-4o model for our LLM-as-Judge tests. If you don’t have an OpenAI API key, you can get one by signing up at OpenAI. Then create a .env file in the root of your project and the following cell will load the key from there. Alternatively, you can uncomment the line below to set the key directly (not recommended for security reasons).

# load openai api key
import os

from dotenv import load_dotenv

load_dotenv()

# os.environ["OPENAI_API_KEY"] = "sk-..."

if not "OPENAI_API_KEY" in os.environ:
    raise ValueError("OPENAI_API_KEY is not set")

Dataset Loader

Great, now that we have our dependencies installed, the developer framework initialized and connected to our model documentation project, and our OpenAI API key set up, we can load our datasets. We will use the synthetic RFP dataset included with ValidMind for this notebook. This dataset contains a variety of RFP questions and ground truth answers that we will use both as the source in which our retriever searches for similar question-answer pairs and as the test set for evaluating the performance of our RAG model. To do this, we just have to load it and call the preprocess function to split the data into train and test sets.

# Import the sample dataset from the library
from validmind.datasets.llm.rag import rfp

raw_df = rfp.load_data()
train_df, test_df = rfp.preprocess(raw_df)
vm_train_ds = vm.init_dataset(
    train_df,
    text_column="question",
    target_column="ground_truth",
)

vm_test_ds = vm.init_dataset(
    test_df,
    text_column="question",
    target_column="ground_truth",
)

vm_test_ds.df.head()

Data validation

Now that we have loaded our dataset, we can run some data validation tests right away to start assessing and documenting the quality of our data. Since we are using a text dataset, we can use ValidMind’s built-in suite of text data quality tests to check for duplicates, missing values, and other common text data issues. We can also run tests to check the sentiment and toxicity of our data.

Duplicates

First, let’s check for duplicates in our dataset. We can use the validmind.data_validation.Duplicates test and pass our dataset:

result = vm.tests.run_test(
    test_id="validmind.data_validation.Duplicates",
    inputs={"dataset": vm_train_ds},
)
result.log()

Stop Words

Next, let’s check for stop words in our dataset. We can use the validmind.data_validation.nlp.StopWords test and pass our dataset:

vm.tests.run_test(
    test_id="validmind.data_validation.nlp.StopWords",
    inputs={
        "dataset": vm_train_ds,
    },
).log()

Punctuations

Next, let’s check the punctuation in our dataset. We can use the validmind.data_validation.nlp.Punctuations test:

vm.tests.run_test(
    test_id="validmind.data_validation.nlp.Punctuations",
    inputs={
        "dataset": vm_train_ds,
    },
).log()

Common Words

Next, let’s check for common words in our dataset. We can use the validmind.data_validation.nlp.CommonWords test:

vm.tests.run_test(
    test_id="validmind.data_validation.nlp.CommonWords",
    inputs={
        "dataset": vm_train_ds,
    },
).log()

Language Detection

For documentation purposes, we can detect and log the languages used in the dataset with the validmind.data_validation.nlp.LanguageDetection test:

vm.tests.run_test(
    test_id="validmind.data_validation.nlp.LanguageDetection",
    inputs={
        "dataset": vm_train_ds,
    },
).log()

Toxicity Score

Now, let’s go ahead and run the validmind.data_validation.nlp.Toxicity test to compute a toxicity score for our dataset:

vm.tests.run_test(
    "validmind.data_validation.nlp.Toxicity",
    inputs={
        "dataset": vm_train_ds,
    },
).log()

Polarity and Subjectivity

We can also run the validmind.data_validation.nlp.PolarityAndSubjectivity test to compute the polarity and subjectivity of our dataset:

vm.tests.run_test(
    "validmind.data_validation.nlp.PolarityAndSubjectivity",
    inputs={
        "dataset": vm_train_ds,
    },
).log()

Sentiment

Finally, we can run the validmind.data_validation.nlp.Sentiment test to plot the sentiment of our dataset:

vm.tests.run_test(
    "validmind.data_validation.nlp.Sentiment",
    inputs={
        "dataset": vm_train_ds,
    },
).log()

Embedding Model

Now that we have our dataset loaded and have run some data validation tests to assess and document the quality of our data, we can go ahead and initialize our embedding model. We will use the text-embedding-3-small model from OpenAI, wrapped in the OpenAIEmbeddings class from LangChain. This model will be used to “embed” our questions, both when inserting the question-answer pairs from the “train” set into the vector store and when embedding incoming questions to make predictions with our RAG model.

from langchain_openai import OpenAIEmbeddings

embedding_client = OpenAIEmbeddings(model="text-embedding-3-small")


def embed(input):
    """Returns a text embedding for the given text"""
    return embedding_client.embed_query(input["question"])


vm_embedder = vm.init_model(input_id="embedding_model", predict_fn=embed)

What we have done here is initialize the OpenAIEmbeddings class so it uses OpenAI’s text-embedding-3-small model. We then created an embed function that takes an input dictionary and uses the embed_query method of the embedding client to compute the embedding of the question. We use a plain function since that is how ValidMind supports any custom model; we will use this strategy for the retrieval and generation models as well, but you could also use, say, a HuggingFace model directly. See the ValidMind Documentation for more information on which model types are directly supported. Finally, we use the init_model function from the ValidMind framework to create a VMModel object that can be used in ValidMind tests. This also logs the model to our model documentation project, and any test that uses the model will be linked to the logged model and its metadata.

Assign Predictions

To precompute the embeddings for our test set, we can call the assign_predictions method of the vm_test_ds object we created above. This will compute the embedding for each question in the test set and store it in a special prediction column of the test set that is linked to our vm_embedder model. This will allow us to use these embeddings later when we run tests against our embedding model.

vm_test_ds.assign_predictions(vm_embedder)
print(vm_test_ds)

Run tests

Now that everything is set up for the embedding model, we can run some tests to assess and document the quality of our embeddings. We will use the validmind.model_validation.embeddings.* tests to compute a variety of metrics against our model.

from validmind.tests import run_test

result = run_test(
    "validmind.model_validation.embeddings.StabilityAnalysisRandomNoise",
    inputs={
        "model": vm_embedder,
        "dataset": vm_test_ds,
    },
    params={"probability": 0.3},
).log()
result = run_test(
    "validmind.model_validation.embeddings.StabilityAnalysisSynonyms",
    inputs={
        "model": vm_embedder,
        "dataset": vm_test_ds,
    },
    params={"probability": 0.3},
).log()
result = run_test(
    "validmind.model_validation.embeddings.StabilityAnalysisTranslation",
    inputs={
        "model": vm_embedder,
        "dataset": vm_test_ds,
    },
    params={
        "source_lang": "en",
        "target_lang": "fr",
    },
).log()
result = run_test(
    "validmind.model_validation.embeddings.CosineSimilarityHeatmap",
    inputs={
        "model": vm_embedder,
        "dataset": vm_test_ds,
    },
).log()
result = run_test(
    "validmind.model_validation.embeddings.EuclideanDistanceHeatmap",
    inputs={
        "model": vm_embedder,
        "dataset": vm_test_ds,
    },
).log()
result = run_test(
    "validmind.model_validation.embeddings.PCAComponentsPairwisePlots",
    inputs={
        "model": vm_embedder,
        "dataset": vm_test_ds,
    },
    params={"n_components": 3},
).log()
result = run_test(
    "validmind.model_validation.embeddings.TSNEComponentsPairwisePlots",
    inputs={
        "model": vm_embedder,
        "dataset": vm_test_ds,
    },
    params={"n_components": 3, "perplexity": 20},
).log()

Setup Vector Store

Great, now that we have assessed our embedding model and verified that it is performing well, we can use it to compute embeddings for the question-answer pairs in the “train” set. We will then use these embeddings to insert the question-answer pairs into a vector store. We will use an in-memory Qdrant vector database for demo purposes, but any other option would work just as well here. We will use the Qdrant vector store class from LangChain to interact with it, which lets us insert and search for embeddings in the vector store.

Generate embeddings for the Train Set

We can use the same assign_predictions method from earlier, except this time on the vm_train_ds object, to compute the embeddings for the question-answer pairs in the “train” set.

vm_train_ds.assign_predictions(vm_embedder)
print(vm_train_ds)

Insert embeddings and questions into Vector DB

Now that we have computed the embeddings for our question-answer pairs in the “train” set, we can go ahead and insert them into the vector store:

from langchain_community.vectorstores import Qdrant
from langchain_openai import OpenAIEmbeddings
from langchain_community.document_loaders import DataFrameLoader

# load documents from dataframe
loader = DataFrameLoader(train_df, page_content_column="question")
docs = loader.load()
# choose model using embedding client
embedding_client = OpenAIEmbeddings(model="text-embedding-3-small")

# setup vector datastore
qdrant = Qdrant.from_documents(
    docs,
    embedding_client,
    location=":memory:",  # Local mode with in-memory storage only
    collection_name="rfp_rag_collection",
)

Retrieval Model

Now that we have an embedding model and a vector database set up and loaded with our data, we need a retrieval model that can search for similar question-answer pairs for a given input question. Once created, we can initialize this as a ValidMind model and assign predictions to it, just like our embedding model.

def retrieve(input):
    contexts = []

    for result in qdrant.similarity_search_with_score(input["question"]):
        document, score = result
        context = f"Q: {document.page_content}\n"
        context += f"A: {document.metadata['ground_truth']}\n"

        contexts.append(context)

    return contexts


vm_retriever = vm.init_model(input_id="retrieval_model", predict_fn=retrieve)
vm_test_ds.assign_predictions(model=vm_retriever)
print(vm_test_ds)

Generation Model

As the final piece of this simple RAG pipeline, we can create and initialize a generation model that will use the retrieved context to generate an answer to the input question. We will use the gpt-3.5-turbo model from OpenAI.

from openai import OpenAI

from validmind.models import Prompt


system_prompt = """
You are an expert RFP AI assistant.
You are tasked with answering new RFP questions based on existing RFP questions and answers.
You will be provided with the existing RFP questions and answer pairs that are the most relevant to the new RFP question.
After that you will be provided with a new RFP question.
You will generate an answer and respond only with the answer.
Ignore your pre-existing knowledge and answer the question based on the provided context.
""".strip()

openai_client = OpenAI()


def generate(input):
    response = openai_client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": "\n\n".join(input["retrieval_model"])},
            {"role": "user", "content": input["question"]},
        ],
    )

    return response.choices[0].message.content


vm_generator = vm.init_model(
    input_id="generation_model",
    predict_fn=generate,
    prompt=Prompt(template=system_prompt),
)

Let’s test it out real quick:

import pandas as pd

vm_generator.predict(
    pd.DataFrame(
        {"retrieval_model": [["My name is anil"]], "question": ["what is my name"]}
    )
)

Prompt Evaluation

Now that we have our generator model initialized, we can run some LLM-as-Judge tests to evaluate the system prompt. This will allow us to get an initial sense of how well the prompt meets a few best practices for prompt engineering. These tests use an LLM to rate the prompt on a scale of 1-10 against the following criteria:

  • Exemplar Bias: When using multi-shot prompting, does the prompt contain an unbiased distribution of examples?
  • Delimitation: When using complex prompts containing examples, contextual information, or other elements, is the prompt formatted in such a way that each element is clearly separated?
  • Clarity: How clearly the prompt states the task.
  • Conciseness: How succinctly the prompt states the task.
  • Instruction Framing: Whether the prompt contains negative instructions.
  • Specificity: How specifically the prompt defines the task.

result = run_test(
    "validmind.prompt_validation.Bias",
    inputs={
        "model": vm_generator,
    },
)
result = run_test(
    "validmind.prompt_validation.Clarity",
    inputs={
        "model": vm_generator,
    },
)
result = run_test(
    "validmind.prompt_validation.Conciseness",
    inputs={
        "model": vm_generator,
    },
)
result = run_test(
    "validmind.prompt_validation.Delimitation",
    inputs={
        "model": vm_generator,
    },
)
result = run_test(
    "validmind.prompt_validation.NegativeInstruction",
    inputs={
        "model": vm_generator,
    },
)
result = run_test(
    "validmind.prompt_validation.Specificity",
    inputs={
        "model": vm_generator,
    },
)

Setup RAG Pipeline Model

Now that we have all of our individual “component” models set up and initialized, we need a way to put them together in a single “pipeline”. We can use the PipelineModel class to do this. This ValidMind model type simply wraps any number of other ValidMind models and runs them in sequence. We can use the pipe (|) operator - in Python this is normally the bitwise OR operator, but we have overloaded it for easy pipeline creation - to chain our models together. We can then initialize this pipeline model and assign predictions to it just like any other model.

vm_rag_model = vm.init_model(vm_retriever | vm_generator, input_id="rag_model")

We can assign_predictions to the pipeline model just like we did with the individual models. This will run the pipeline on the test set and store the results in the test set for later use.

vm_test_ds.assign_predictions(model=vm_rag_model)
print(vm_test_ds)
vm_test_ds.df.head(5)

Run tests

Let’s go ahead and run some of our new RAG tests against our model…

Note: these tests are still being developed and are not yet in a stable state. We are using advanced tests here that use LLM-as-Judge and other strategies to assess things like the relevancy of the retrieved context to the input question and the correctness of the generated answer when compared to the ground truth. There is more to come in this area so stay tuned!

import warnings

warnings.filterwarnings("ignore")

Answer Similarity

The concept of Answer Semantic Similarity pertains to the assessment of the semantic resemblance between the generated answer and the ground truth. This evaluation is based on the ground truth and the answer, with values falling within the range of 0 to 1. A higher score signifies a better alignment between the generated answer and the ground truth.

Measuring the semantic similarity between answers can offer valuable insights into the quality of the generated response. This evaluation utilizes a cross-encoder model to calculate the semantic similarity score.
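
As a rough illustration of the idea (not the test’s exact implementation, which may rely on a cross-encoder or other model under the hood), semantic similarity between an answer and its ground truth can be approximated with the cosine similarity of their embeddings, for example by reusing the embedding_client we set up earlier. The example strings below are hypothetical:

import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Illustrative only: compare a generated answer against its ground truth.
answer_emb = embedding_client.embed_query("We encrypt all data at rest and in transit.")
truth_emb = embedding_client.embed_query("All data is encrypted in transit and at rest.")
print(cosine_similarity(answer_emb, truth_emb))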

run_test(
    "validmind.model_validation.ragas.AnswerSimilarity",
    inputs={"dataset": vm_test_ds},
    params={
        "question_column": "question",
        "answer_column": "rag_model_prediction",
        "ground_truth_column": "ground_truth",
        "contexts_column": "retrieval_model_prediction",
    },
).log()

Context Entity Recall

This test gives a measure of recall of the retrieved context, based on the number of entities present in both ground_truths and contexts relative to the number of entities present in the ground_truths alone. Simply put, it measures what fraction of the entities in ground_truths are recalled in the contexts. This test is useful in fact-based use cases like tourism help desks, historical QA, etc., and can help evaluate the retrieval mechanism for entities by comparing against the entities present in ground_truths: in cases where entities matter, we need contexts that cover them.
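
Conceptually, the score is the fraction of ground-truth entities that also appear in the retrieved contexts. Here is a minimal sketch of that ratio; the real test extracts entities from the text automatically, whereas the entity sets below are hard-coded for illustration:

# Hypothetical entity sets for a single example.
ground_truth_entities = {"ValidMind", "SOC 2", "2023"}
context_entities = {"ValidMind", "SOC 2", "ISO 27001"}

# Fraction of ground-truth entities that the retrieved contexts recall.
entity_recall = len(ground_truth_entities & context_entities) / len(ground_truth_entities)
print(entity_recall)  # 2 of 3 ground-truth entities recalled -> ~0.67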

result = run_test(
    "validmind.model_validation.ragas.ContextEntityRecall",
    inputs={"dataset": vm_test_ds},
    params={
        "question_column": "question",
        "answer_column": "rag_model_prediction",
        "ground_truth_column": "ground_truth",
        "contexts_column": "retrieval_model_prediction",
    },
)
result.log()

Context Precision

Context Precision is a test that evaluates whether the ground-truth relevant items present in the contexts are ranked highly. Ideally, all the relevant chunks should appear at the top ranks. This test is computed using the question, the ground_truth, and the contexts, with values ranging between 0 and 1, where higher scores indicate better precision.
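
Roughly speaking, the score rewards rankings where relevant chunks appear early; one common way to compute this is to average precision@k over the positions k that hold a relevant chunk. A minimal sketch of that idea follows. The relevance flags here are hypothetical; the actual test derives them from the question and ground_truth:

# Relevance of each retrieved chunk, in ranked order (True = relevant).
relevance = [True, True, False, True]

precisions_at_relevant_ranks = []
hits = 0
for k, is_relevant in enumerate(relevance, start=1):
    if is_relevant:
        hits += 1
        precisions_at_relevant_ranks.append(hits / k)  # precision@k at this rank

# Average precision@k over the ranks that hold a relevant chunk.
context_precision = sum(precisions_at_relevant_ranks) / max(len(precisions_at_relevant_ranks), 1)
print(context_precision)  # (1/1 + 2/2 + 3/4) / 3 ≈ 0.92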

result = run_test(
    "validmind.model_validation.ragas.ContextPrecision",
    inputs={"dataset": vm_test_ds},
    params={
        "question_column": "question",
        "answer_column": "rag_model_prediction",
        "ground_truth_column": "ground_truth",
        "contexts_column": "retrieval_model_prediction",
    },
)
result.log()

Faithfulness

This measures the factual consistency of the generated answer against the given context. It is calculated from the answer and the retrieved context, and the score is scaled to the (0, 1) range, with higher values being better.

The generated answer is regarded as faithful if all the claims made in the answer can be inferred from the given context. To calculate this, a set of claims is first identified from the generated answer. Each of these claims is then cross-checked against the given context to determine whether it can be inferred from the context.
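
In other words, the final score is the fraction of claims in the answer that are supported by the retrieved context. A minimal sketch of that ratio; in the real test the claims and their verdicts are produced automatically, whereas here they are hard-coded for illustration:

# Hypothetical per-claim verdicts for one generated answer.
claim_supported = [True, True, False]  # 2 of 3 claims can be inferred from the context

faithfulness = sum(claim_supported) / len(claim_supported)
print(faithfulness)  # ~0.67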

result = run_test(
    "validmind.model_validation.ragas.Faithfulness",
    inputs={"dataset": vm_test_ds},
    params={
        "question_column": "question",
        "answer_column": "rag_model_prediction",
        "ground_truth_column": "ground_truth",
        "contexts_column": "retrieval_model_prediction",
    },
)
result.log()

Answer Relevance

The Answer Relevancy test focuses on assessing how pertinent the generated answer is to the given prompt. A lower score is assigned to answers that are incomplete or contain redundant information, and higher scores indicate better relevancy. This test is computed using the question, the context, and the answer.

The Answer Relevancy is defined as the mean cosine similarity of the original question to a number of artificial questions, which were generated (reverse engineered) based on the answer.

Please note that even though in practice the score will range between 0 and 1 most of the time, this is not mathematically guaranteed, since cosine similarity ranges from -1 to 1.

Note: This is a reference free test. If you’re looking to compare ground truth answer with generated answer refer to Answer Correctness.

An answer is deemed relevant when it directly and appropriately addresses the original question. Importantly, our assessment of answer relevance does not consider factuality but instead penalizes cases where the answer lacks completeness or contains redundant details. To calculate this score, the LLM is prompted to generate an appropriate question for the generated answer multiple times, and the mean cosine similarity between these generated questions and the original question is measured. The underlying idea is that if the generated answer accurately addresses the initial question, the LLM should be able to generate questions from the answer that align with the original question.
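
A minimal sketch of that final averaging step, assuming a list of reverse-engineered questions has already been generated from the answer (they are hard-coded here for illustration) and reusing the embedding_client along with the cosine_similarity helper from the Answer Similarity sketch above:

original_question = "Is customer data encrypted at rest?"

# Hypothetical questions an LLM might generate back from the answer.
generated_questions = [
    "Do you encrypt customer data when it is stored?",
    "Is data encrypted at rest in your platform?",
]

original_emb = embedding_client.embed_query(original_question)
similarities = [
    cosine_similarity(original_emb, embedding_client.embed_query(q))
    for q in generated_questions
]

# Mean cosine similarity between the original question and the generated questions.
answer_relevancy = sum(similarities) / len(similarities)
print(answer_relevancy)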

result = run_test(
    "validmind.model_validation.ragas.AnswerRelevance",
    inputs={"dataset": vm_test_ds},
    params={
        "question_column": "question",
        "answer_column": "rag_model_prediction",
        "ground_truth_column": "ground_truth",
        "contexts_column": "retrieval_model_prediction",
    },
)
result.log()

Context Recall

Context recall measures the extent to which the retrieved context aligns with the annotated answer, treated as the ground truth. It is computed based on the ground truth and the retrieved context, and the values range between 0 and 1, with higher values indicating better performance.

To estimate context recall from the ground truth answer, each sentence in the ground truth answer is analyzed to determine whether it can be attributed to the retrieved context or not. In an ideal scenario, all sentences in the ground truth answer should be attributable to the retrieved context.

result = run_test(
    "validmind.model_validation.ragas.ContextRecall",
    inputs={"dataset": vm_test_ds},
    params={
        "question_column": "question",
        "answer_column": "rag_model_prediction",
        "ground_truth_column": "ground_truth",
        "contexts_column": "retrieval_model_prediction",
    },
)
result.log()

Answer Correctness

The assessment of Answer Correctness involves gauging the accuracy of the generated answer when compared to the ground truth. This evaluation relies on the ground truth and the answer, with scores ranging from 0 to 1. A higher score indicates a closer alignment between the generated answer and the ground truth, signifying better correctness.

Answer correctness encompasses two critical aspects: semantic similarity between the generated answer and the ground truth, as well as factual similarity. These aspects are combined using a weighted scheme to formulate the answer correctness score.

Factual correctness quantifies the factual overlap between the generated answer and the ground truth answer. This is done using the following concepts, combined as sketched after the list:

  • TP (True Positive): Facts or statements that are present in both the ground truth and the generated answer.
  • FP (False Positive): Facts or statements that are present in the generated answer but not in the ground truth.
  • FN (False Negative): Facts or statements that are present in the ground truth but not in the generated answer.
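
A minimal sketch of how such a score could be assembled from these counts, assuming an F1-style factual score and an illustrative weighting between factuality and semantic similarity (the actual claim extraction and weighting are handled internally by the test):

# Hypothetical claim counts for one example.
tp, fp, fn = 4, 1, 2

# F1-style factual score built from the counts above.
factual_score = tp / (tp + 0.5 * (fp + fn))

# Semantic similarity score, e.g. from an embedding comparison as sketched earlier.
semantic_score = 0.85

# Weighted combination (weights here are illustrative, not necessarily the test's defaults).
answer_correctness = 0.75 * factual_score + 0.25 * semantic_score
print(answer_correctness)
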
result = run_test(
    "validmind.model_validation.ragas.AnswerCorrectness",
    inputs={"dataset": vm_test_ds},
    params={
        "question_column": "question",
        "answer_column": "rag_model_prediction",
        "ground_truth_column": "ground_truth",
        "contexts_column": "retrieval_model_prediction",
    },
)
result.log()

Aspect Critique

This test is designed to assess submissions based on predefined aspects such as harmlessness and correctness. Ragas Critiques offers a range of predefined aspects like correctness, harmfulness, etc., and users also have the flexibility to define their own aspects for evaluating submissions according to their specific criteria. The output of an aspect critique is binary, indicating whether or not the submission aligns with the defined aspect. This evaluation is performed using the ‘answer’ as input.

result = run_test(
    "validmind.model_validation.ragas.AspectCritique",
    inputs={"dataset": vm_test_ds},
    params={
        "question_column": "question",
        "answer_column": "rag_model_prediction",
        "ground_truth_column": "ground_truth",
        "contexts_column": "retrieval_model_prediction",
    },
)
result.log()

Conclusion

In this notebook, we have seen how we can use LangChain and ValidMind together to build, evaluate, and document a simple RAG model as it is developed. This is a great example of the interactive development experience that ValidMind is designed to support: we can quickly iterate on our model and document it as we go. We have also seen how ValidMind supports non-traditional “models” through a functional interface and how we can build pipelines of many models to support complex GenAI workflows.

This is still a work in progress and we are actively developing new tests to support more advanced GenAI workflows. We are also keeping an eye on the most popular GenAI models and libraries to explore direct integrations. Stay tuned for more updates and new features in this area!