Evaluating LLMs with MLRun#

Evaluating large language models (LLMs) is crucial throughout the ML lifecycle. During development, thorough evaluation lets you refine prompts, select models, and tune hyperparameters. In production, real-time evaluation and guardrails ensure that LLM responses are reliable, consistent, and relevant in real-world applications.

Challenges in evaluating LLMs#

Evaluating LLMs comes with its own set of challenges:

  • Lack of Standardization: There is no single, universally accepted evaluation framework or metrics suite for LLMs, which makes it difficult to compare and benchmark models across tasks. Benchmark datasets such as GSM8K exist for evaluating LLM performance; however, they are not always representative of real-world performance.

  • Complexity of Evaluation Tasks: Many evaluation tasks are complex and multifaceted, involving aspects such as factual accuracy, coherence, fluency, and relevance. Additionally, LLMs are prone to hallucination, meaning that the final response may deviate from provided, factually correct context.

  • Subjectivity in Evaluation: Evaluation metrics and tasks can be subjective, making it challenging to determine the "ground truth" or what constitutes a correct answer. This subjectivity can lead to varying evaluation results across different evaluators or even the same evaluator at different times.

  • Human Judgement Resources: Human judgement is expensive and slow, making large-scale manual evaluation impractical. This limitation highlights the need for automated, scalable evaluation methods.

Metrics overview#

Open source frameworks such as DeepEval offer a range of metrics to evaluate various aspects of an LLM's output. In particular, the following metrics compare the LLM's response against provided context (for example, from a RAG system) to ensure the response is high quality, factually correct, and representative of the external knowledge base.

This example uses the following metrics (a minimal instantiation sketch follows the list):

  • Answer Relevancy: Measures the quality of an LLM's generator by evaluating how relevant the actual output is compared to the provided input.

  • Faithfulness: Evaluates whether the actual output factually aligns with the contents of the retrieval context.

  • Contextual Precision: Assesses the LLM's retriever by evaluating whether nodes in the retrieval context that are relevant to the given input are ranked higher than irrelevant ones.

  • Contextual Recall: Measures the quality of an LLM's retriever by evaluating the extent to which the retrieval context aligns with the expected output.

  • Contextual Relevancy: Evaluates the overall relevance of the information presented in the retrieval context for a given input.
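Each of these metrics is implemented in deepeval as a class that scores a test case between 0 and 1 and passes or fails it against a configurable threshold (0.5 by default, as seen in the outputs below). As a minimal sketch, a metric can be instantiated with a stricter threshold and an explicit evaluation model (the 0.7 value here is illustrative, not part of this example):

from deepeval.metrics import AnswerRelevancyMetric

# Pass only if the relevancy score is at least 0.7 (the default threshold is 0.5)
answer_relevancy = AnswerRelevancyMetric(
    threshold=0.7,  # illustrative, stricter-than-default threshold
    model="gpt-3.5-turbo-0125",  # LLM used as the evaluation judge
)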

Prerequisites#

# %pip install --upgrade deepeval
# %pip install "protobuf<3.20"

Setup#

import os

import mlrun
import pandas as pd
from deepeval import evaluate
from deepeval.metrics import (
    AnswerRelevancyMetric,
    ContextualPrecisionMetric,
    ContextualRecallMetric,
    ContextualRelevancyMetric,
    FaithfulnessMetric,
)
from deepeval.models.base_model import DeepEvalBaseLLM
from deepeval.test_case import LLMTestCase
from mlrun.utils import create_class

# OpenAI
OPENAI_API_KEY = ""
OPENAI_BASE_URL = "https://api.openai.com/v1"
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
os.environ["OPENAI_BASE_URL"] = OPENAI_BASE_URL
OPENAI_MODEL = "gpt-3.5-turbo-0125"

# Ollama
OLLAMA_URL = "http://ollama.default.svc.cluster.local:11434"
OLLAMA_MODEL = "llama3"


# Custom langchain class for deepeval
class DeepEvalLangchainLLM(DeepEvalBaseLLM):
    def __init__(self, model):
        self.model = model

    def load_model(self):
        return self.model

    def generate(self, prompt: str) -> str:
        # Synchronous generation via the langchain chat model
        chat_model = self.load_model()
        return chat_model.invoke(prompt).content

    async def a_generate(self, prompt: str) -> str:
        # Async generation, used by deepeval for concurrent evaluation
        chat_model = self.load_model()
        res = await chat_model.ainvoke(prompt)
        return res.content

    def get_model_name(self):
        # Langchain chat models expose the model name under different attributes
        # (e.g. ChatOllama uses `model`, ChatOpenAI uses `model_name`)
        try:
            return self.model.model
        except AttributeError:
            pass
        try:
            return self.model.model_name
        except AttributeError:
            pass
        return "Custom Langchain Model"

Select OpenAI or Ollama#

MODE = "openai"  # also supports ollama

if MODE == "openai":
    model = OPENAI_MODEL

elif MODE == "ollama":
    llm_class = "langchain_community.chat_models.ChatOllama"
    llm_kwargs = {"model": OLLAMA_MODEL, "base_url": OLLAMA_URL}
    llm = create_class(llm_class)(**llm_kwargs)
    model = DeepEvalLangchainLLM(model=llm)
else:
    raise ValueError(f"Mode {MODE} not supported")

print(f"Using mode: {MODE.upper()}\n")
print(f"LLM model: {model}")
Using mode: OPENAI

LLM model: gpt-3.5-turbo-0125

Example evaluation task#

test_case = LLMTestCase(
    input="I'm on an F-1 visa, gow long can I stay in the US after graduation?",
    actual_output="You can stay up to 30 days after completing your degree.",
    expected_output="You can stay up to 60 days after completing your degree.",
    retrieval_context=[
        """If you are in the U.S. on an F-1 visa, you are allowed to stay for 60 days after completing
        your degree, unless you have applied for and been approved to participate in OPT."""
    ],
)

faithfulness = FaithfulnessMetric(model=model)

results = evaluate(test_cases=[test_case], metrics=[faithfulness])
Evaluating test cases...
Event loop is already running. Applying nest_asyncio patch to allow async execution...


======================================================================

Metrics Summary

  - ❌ Faithfulness (score: 0.0, threshold: 0.5, strict: False, evaluation model: gpt-3.5-turbo-0125, reason: The score is 0.00 because the actual output directly contradicts the retrieval context by stating you are allowed to stay for 30 days after completing your degree instead of the correct 60 days., error: None)

For test case:

  - input: I'm on an F-1 visa, how long can I stay in the US after graduation?
  - actual output: You can stay up to 30 days after completing your degree.
  - expected output: You can stay up to 60 days after completing your degree.
  - context: None
  - retrieval context: ['If you are in the U.S. on an F-1 visa, you are allowed to stay for 60 days after completing\n        your degree, unless you have applied for and been approved to participate in OPT.']

======================================================================

Overall Metric Pass Rates

FaithfulnessMetric: 0.00% pass rate

======================================================================
timeout has no effect in blocking mode
✅ Tests finished! Run "deepeval login" to view evaluation results on the web.

Create an evaluation function#

%%writefile evaluate_llm.py

import mlrun
import pandas as pd
from deepeval import evaluate
from deepeval.test_case import LLMTestCase
from mlrun.utils import create_class


@mlrun.handler(outputs=["evaluation"])
def evaluate_llm(test_cases: list[dict], metrics: list[str], model: str):
    """
    Evaluate test cases with the given deepeval metrics and return the
    results as a dataframe (logged by MLRun as the `evaluation` artifact).

    :param test_cases: list of dicts of LLMTestCase fields (input,
        actual_output, expected_output, retrieval_context, etc.)
    :param metrics:    list of fully qualified deepeval metric class names
    :param model:      evaluation model used to score the test cases
    """
    results = evaluate(
        test_cases=[LLMTestCase(**t) for t in test_cases],
        metrics=[create_class(m)(model=model) for m in metrics],
    )

    # Flatten the results into one row per test case per metric
    rows = []
    for i, result in enumerate(results):
        for metric in result.metrics:
            rows.append(
                {
                    "test": f"test_case_{i}",
                    "actual_output": result.actual_output,
                    "expected_output": result.expected_output,
                    "context": result.context,
                    "retrieval_context": result.retrieval_context,
                    "user_input": result.input,
                    "test_success": result.success,
                    "metric_success": metric.success,
                    "metric": metric.__name__,
                    "evaluation_model": metric.evaluation_model,
                    "metric_score": metric.score,
                    "metric_reason": metric.reason,
                    "evaluation_cost": metric.evaluation_cost,
                    "metric_threshold": metric.threshold,
                    "metric_error": metric.error,
                }
            )
    return pd.DataFrame(rows)
Overwriting evaluate_llm.py
project = mlrun.get_or_create_project("evaluate")
> 2024-06-12 16:01:54,211 [info] Project loaded successfully: {'project_name': 'evaluate'}
evaluation_fn = project.set_function(
    name="evaluate-llm",
    func="evaluate_llm.py",
    kind="job",
    image="mlrun/mlrun",
    handler="evaluate_llm",
)

# Only relevant for OpenAI
evaluation_fn.set_envs(
    {"OPENAI_API_KEY": OPENAI_API_KEY, "OPENAI_BASE_URL": OPENAI_BASE_URL}
)
<mlrun.runtimes.kubejob.KubejobRuntime at 0x7f461c116b80>

Run an evaluation job#

evaluation_run = project.run_function(
    evaluation_fn,
    params={
        "test_cases": [
            dict(
                input="I'm on an F-1 visa, gow long can I stay in the US after graduation?",
                actual_output="You can stay up to 30 days after completing your degree.",
                expected_output="You can stay up to 60 days after completing your degree.",
                retrieval_context=[
                    """If you are in the U.S. on an F-1 visa, you are allowed to stay for 60 days after completing
                    your degree, unless you have applied for and been approved to participate in OPT."""
                ],
            ),
            dict(
                input="What are some benefits of MLRun?",
                actual_output="MLRun is an MLOps orchestration framework that enables you to develop, train, deploy, and manage machine learning models in a serverless environment. It provides a set of tools and APIs for building, testing, and deploying model serving functions.",
                expected_output="MLRun is an open MLOps platform for quickly building and managing continuous ML applications across their lifecycle. MLRun integrates into your development and CI/CD environment and automates the delivery of production data, ML pipelines, and online applications. MLRun significantly reduces engineering efforts, time to production, and computation resources. With MLRun, you can choose any IDE on your local machine or on the cloud. MLRun breaks the silos between data, ML, software, and DevOps/MLOps teams, enabling collaboration and fast continuous improvements.",
                retrieval_context=[
                    """Instead of a siloed, complex, and manual process, MLRun enables production pipeline design using a modular strategy, where the different parts contribute to a continuous, automated, and far simpler path from research and development to scalable production pipelines without refactoring code, adding glue logic, or spending significant efforts on data and ML engineering.""",
                    """MLRun uses Serverless Function technology: write the code once, using your preferred development environment and simple "local" semantics, and then run it as-is on different platforms and at scale. MLRun automates the build process, execution, data movement, scaling, versioning, parameterization, output tracking, CI/CD integration, deployment to production, monitoring, and more.""",
                ],
            ),
        ],
        "metrics": [
            "deepeval.metrics.AnswerRelevancyMetric",
            "deepeval.metrics.FaithfulnessMetric",
            "deepeval.metrics.ContextualPrecisionMetric",
            "deepeval.metrics.ContextualRecallMetric",
            "deepeval.metrics.ContextualRelevancyMetric",
        ],
        "model": OPENAI_MODEL,
    },
    local=True,
)
> 2024-06-12 16:24:21,581 [info] Storing function: {'name': 'evaluate-llm-evaluate-llm', 'uid': '795196166bf64eb896d1501109bc197e', 'db': 'http://mlrun-api:8080'}
Evaluating test cases...
Event loop is already running. Applying nest_asyncio patch to allow async execution...


timeout has no effect in blocking mode


======================================================================

Metrics Summary

  - ✅ Answer Relevancy (score: 1.0, threshold: 0.5, strict: False, evaluation model: gpt-3.5-turbo-0125, reason: The score is 1.00 because the response provided directly answers the question with relevant information., error: None)
  - ❌ Faithfulness (score: 0.0, threshold: 0.5, strict: False, evaluation model: gpt-3.5-turbo-0125, reason: The score is 0.00 because the actual output does not align with the information presented in the retrieval context, indicating a lack of faithfulness., error: None)
  - ✅ Contextual Precision (score: 1.0, threshold: 0.5, strict: False, evaluation model: gpt-3.5-turbo-0125, reason: The score is perfect because the relevant context directly answers the question, providing a clear and concise response., error: None)
  - ✅ Contextual Recall (score: 1.0, threshold: 0.5, strict: False, evaluation model: gpt-3.5-turbo-0125, reason: The score is 1.00 because the sentence perfectly matches the information retrieved from the 1st node in the retrieval context., error: None)
  - ✅ Contextual Relevancy (score: 1.0, threshold: 0.5, strict: False, evaluation model: gpt-3.5-turbo-0125, reason: The score is 1.00 because the input directly relates to the topic of F-1 visa duration after graduation., error: None)

For test case:

  - input: I'm on an F-1 visa, how long can I stay in the US after graduation?
  - actual output: You can stay up to 30 days after completing your degree.
  - expected output: You can stay up to 60 days after completing your degree.
  - context: None
  - retrieval context: ['If you are in the U.S. on an F-1 visa, you are allowed to stay for 60 days after completing\n                    your degree, unless you have applied for and been approved to participate in OPT.']

======================================================================

Metrics Summary

  - ✅ Answer Relevancy (score: 1.0, threshold: 0.5, strict: False, evaluation model: gpt-3.5-turbo-0125, reason: The score is 1.00 because the answer is completely relevant to the input., error: None)
  - ❌ Faithfulness (score: 0.3333333333333333, threshold: 0.5, strict: False, evaluation model: gpt-3.5-turbo-0125, reason: The score is 0.33 because the actual output includes information about MLRun enabling the development, training, deployment, and management of machine learning models in a serverless environment, as well as providing tools and APIs for building, testing, and deploying model serving functions, which were not mentioned in the retrieval context., error: None)
  - ❌ Contextual Precision (score: 0, threshold: 0.5, strict: False, evaluation model: gpt-3.5-turbo-0125, reason: The score is 0.00 because the irrelevant nodes are ranked higher than the relevant nodes, as the first node does not directly address the benefits of MLRun and the second node discusses MLRun's features instead of its benefits in reducing engineering efforts, time to production, and computation resources., error: None)
  - ✅ Contextual Recall (score: 1.0, threshold: 0.5, strict: False, evaluation model: gpt-3.5-turbo-0125, reason: The score is 1.00 because all the expected sentences can be directly attributed to the nodes in the retrieval context, indicating perfect contextual recall., error: None)
  - ✅ Contextual Relevancy (score: 1.0, threshold: 0.5, strict: False, evaluation model: gpt-3.5-turbo-0125, reason: The score is 1.00 because the input directly asks about the benefits of MLRun., error: None)

For test case:

  - input: What are some benefits of MLRun?
  - actual output: MLRun is an MLOps orchestration framework that enables you to develop, train, deploy, and manage machine learning models in a serverless environment. It provides a set of tools and APIs for building, testing, and deploying model serving functions.
  - expected output: MLRun is an open MLOps platform for quickly building and managing continuous ML applications across their lifecycle. MLRun integrates into your development and CI/CD environment and automates the delivery of production data, ML pipelines, and online applications. MLRun significantly reduces engineering efforts, time to production, and computation resources. With MLRun, you can choose any IDE on your local machine or on the cloud. MLRun breaks the silos between data, ML, software, and DevOps/MLOps teams, enabling collaboration and fast continuous improvements.
  - context: None
  - retrieval context: ['Instead of a siloed, complex, and manual process, MLRun enables production pipeline design using a modular strategy, where the different parts contribute to a continuous, automated, and far simpler path from research and development to scalable production pipelines without refactoring code, adding glue logic, or spending significant efforts on data and ML engineering.', 'MLRun uses Serverless Function technology: write the code once, using your preferred development environment and simple "local" semantics, and then run it as-is on different platforms and at scale. MLRun automates the build process, execution, data movement, scaling, versioning, parameterization, output tracking, CI/CD integration, deployment to production, monitoring, and more.']

======================================================================

Overall Metric Pass Rates

AnswerRelevancyMetric: 100.00% pass rate
FaithfulnessMetric: 0.00% pass rate
ContextualPrecisionMetric: 50.00% pass rate
ContextualRecallMetric: 100.00% pass rate
ContextualRelevancyMetric: 100.00% pass rate

======================================================================
timeout has no effect in blocking mode
✅ Tests finished! Run "deepeval login" to view evaluation results on the web.
Converting input from bool to <class 'numpy.uint8'> for compatibility.
project   iter  start            state      name
evaluate  0     Jun 12 16:24:21  completed  evaluate-llm-evaluate-llm

labels: v3io_user=nick, kind=local, owner=nick, host=jupyter-nick-9dccd9cf6-kp69z
parameters: test_cases and metrics as passed above, model=gpt-3.5-turbo-0125
artifacts: evaluation

> to track results use the .show() or .logs() methods or click here to open in UI
> 2024-06-12 16:24:37,013 [info] Run execution finished: {'status': 'completed', 'name': 'evaluate-llm-evaluate-llm'}

View the logged output#

evaluation_run.artifact("evaluation").show()
test actual_output expected_output context retrieval_context user_input test_success metric_success metric evaluation_model metric_score metric_reason evaluation_cost metric_threshold metric_error
0 test_case_0 You can stay up to 30 days after completing yo... You can stay up to 60 days after completing yo... None [If you are in the U.S. on an F-1 visa, you ar... I'm on an F-1 visa, how long can I stay in the... False True Answer Relevancy gpt-3.5-turbo-0125 1.000000 The score is 1.00 because the response provide... 0.000485 0.5 None
1 test_case_0 You can stay up to 30 days after completing yo... You can stay up to 60 days after completing yo... None [If you are in the U.S. on an F-1 visa, you ar... I'm on an F-1 visa, how long can I stay in the... False False Faithfulness gpt-3.5-turbo-0125 0.000000 The score is 0.00 because the actual output do... 0.000917 0.5 None
2 test_case_0 You can stay up to 30 days after completing yo... You can stay up to 60 days after completing yo... None [If you are in the U.S. on an F-1 visa, you ar... I'm on an F-1 visa, how long can I stay in the... False True Contextual Precision gpt-3.5-turbo-0125 1.000000 The score is perfect because the relevant cont... 0.000559 0.5 None
3 test_case_0 You can stay up to 30 days after completing yo... You can stay up to 60 days after completing yo... None [If you are in the U.S. on an F-1 visa, you ar... I'm on an F-1 visa, how long can I stay in the... False True Contextual Recall gpt-3.5-turbo-0125 1.000000 The score is 1.00 because the sentence perfect... 0.000469 0.5 None
4 test_case_0 You can stay up to 30 days after completing yo... You can stay up to 60 days after completing yo... None [If you are in the U.S. on an F-1 visa, you ar... I'm on an F-1 visa, how long can I stay in the... False True Contextual Relevancy gpt-3.5-turbo-0125 1.000000 The score is 1.00 because the input directly r... 0.000150 0.5 None
5 test_case_1 MLRun is an MLOps orchestration framework that... MLRun is an open MLOps platform for quickly bu... None [Instead of a siloed, complex, and manual proc... What are some benefits of MLRun? False True Answer Relevancy gpt-3.5-turbo-0125 1.000000 The score is 1.00 because the answer is comple... 0.000607 0.5 None
6 test_case_1 MLRun is an MLOps orchestration framework that... MLRun is an open MLOps platform for quickly bu... None [Instead of a siloed, complex, and manual proc... What are some benefits of MLRun? False False Faithfulness gpt-3.5-turbo-0125 0.333333 The score is 0.33 because the actual output in... 0.001351 0.5 None
7 test_case_1 MLRun is an MLOps orchestration framework that... MLRun is an open MLOps platform for quickly bu... None [Instead of a siloed, complex, and manual proc... What are some benefits of MLRun? False False Contextual Precision gpt-3.5-turbo-0125 0.000000 The score is 0.00 because the irrelevant nodes... 0.000776 0.5 None
8 test_case_1 MLRun is an MLOps orchestration framework that... MLRun is an open MLOps platform for quickly bu... None [Instead of a siloed, complex, and manual proc... What are some benefits of MLRun? False True Contextual Recall gpt-3.5-turbo-0125 1.000000 The score is 1.00 because all the expected sen... 0.001079 0.5 None
9 test_case_1 MLRun is an MLOps orchestration framework that... MLRun is an open MLOps platform for quickly bu... None [Instead of a siloed, complex, and manual proc... What are some benefits of MLRun? False True Contextual Relevancy gpt-3.5-turbo-0125 1.000000 The score is 1.00 because the input directly a... 0.000136 0.5 None
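Because the evaluation artifact is logged as a dataset, you can also load it back into pandas for further analysis. A minimal sketch using MLRun's standard DataItem.as_df():

# Load the evaluation results back as a dataframe
df = evaluation_run.artifact("evaluation").as_df()

# For example, filter for failing metrics (metric_success may be stored as 0/1)
failed = df[df["metric_success"] == 0]
print(failed[["test", "metric", "metric_score", "metric_reason"]])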