Evaluating LLMs with MLRun#
This example guides you through setup, creating an evaluation function, running an evaluation job, and viewing the logged output.
Evaluating large language models (LLMs) is crucial throughout the ML lifecycle. During development, thorough evaluation enables users to refine prompts, select models, and tune hyperparameters. During production, real-time evaluation and guardrails ensure that responses from LLMs are reliable, consistent, and relevant in real-world applications.
Challenges in evaluating LLMs#
Evaluating Large Language Models (LLMs) comes with its own set of challenges:
Lack of Standardization: There is no single, universally accepted evaluation framework or metrics suite for LLMs, which makes it difficult to compare and benchmark models across tasks. Benchmark datasets such as GSM8K exist; however, they are not always representative of real-world performance.
Complexity of Evaluation Tasks: Many evaluation tasks are complex and multifaceted, covering aspects such as factual accuracy, coherence, fluency, and relevance. In addition, LLMs are prone to hallucination, meaning the final response may deviate from the provided, factually correct context.
Subjectivity in Evaluation: Evaluation metrics and tasks can be subjective, making it challenging to determine the "ground truth" or what constitutes a correct answer. This subjectivity can lead to varying evaluation results across different evaluators or even the same evaluator at different times.
Human Judgement Resources: Human judgement is expensive and does not scale, making large-scale manual evaluation impractical. This limitation highlights the need for automated and scalable evaluation methods.
Metrics overview#
Open-source frameworks such as DeepEval offer a range of metrics for evaluating various aspects of an LLM's output. In particular, the following metrics compare the LLM's response with provided context (for example, from a RAG system) to ensure that the response is high quality, factually correct, and representative of the external knowledge base.
This example uses the following metrics; a standalone sketch showing how a single metric is measured follows the list:
Answer Relevancy: Measures the quality of an LLM's generator by evaluating how relevant the actual output is compared to the provided input.
Faithfulness: Evaluates whether the actual output factually aligns with the contents of the retrieval context.
Contextual Precision: Assesses the LLM's retriever by evaluating whether nodes in the retrieval context that are relevant to the given input are ranked higher than irrelevant ones.
Contextual Recall: Measures the quality of an LLM's retriever by evaluating the extent to which the retrieval context aligns with the expected output.
Contextual Relevancy: Evaluates the overall relevance of the information presented in the retrieval context for a given input.
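Each of these metrics is implemented as a DeepEval class that takes a judge model and a pass/fail threshold. As a minimal standalone sketch (assuming the packages listed in the prerequisite below are installed and an OpenAI key is configured, so DeepEval can use its default judge model), a single metric can be measured on its own:

from deepeval.metrics import FaithfulnessMetric
from deepeval.test_case import LLMTestCase

# Measure one metric on a single test case; the score and reason are set on the metric object
metric = FaithfulnessMetric(threshold=0.5)
test_case = LLMTestCase(
    input="How long can I stay after graduation?",
    actual_output="Up to 30 days after completing your degree.",
    retrieval_context=["F-1 students may stay 60 days after completing their degree."],
)
metric.measure(test_case)
print(metric.score, metric.reason)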
Prerequisite#
Install the required packages by running this command (one time only):
# %pip install --upgrade deepeval==2.5.5 "protobuf<3.20" mlrun==1.8.0rc45 transformers torch torchvision lm-format-enforcer
Setup#
import os
os.environ["DEEPEVAL_UPDATE_WARNING_OPT_OUT"] = "YES"
import json
import mlrun
import pandas as pd
from deepeval import evaluate
from deepeval.metrics import (
AnswerRelevancyMetric,
ContextualPrecisionMetric,
ContextualRecallMetric,
ContextualRelevancyMetric,
FaithfulnessMetric,
)
from deepeval.models.base_model import DeepEvalBaseLLM
from deepeval.test_case import LLMTestCase
from mlrun.utils import create_class
from pydantic import BaseModel
from lmformatenforcer import JsonSchemaParser
from lmformatenforcer.integrations.transformers import (
build_transformers_prefix_allowed_tokens_fn,
)
import transformers
from deepeval.metrics import JsonCorrectnessMetric
import torch
class QwenDeepEvalBaseLLM(DeepEvalBaseLLM):
    """DeepEval judge model that wraps a local transformers text-generation pipeline
    and constrains its output to the JSON schemas DeepEval supplies."""

    def __init__(self, model_name: str, model, device=None):
        self.model_name = model_name
        self.device = (
            device if device else ("cuda" if torch.cuda.is_available() else "cpu")
        )
        self.model = model

    def load_model(self):
        """Return the loaded model."""
        return self.model

    def generate(self, prompt: str, schema: BaseModel) -> BaseModel:
        """Generate a schema-valid JSON response for the given prompt."""
        # Constrain decoding so the output conforms to the JSON schema DeepEval supplies
        parser = JsonSchemaParser(schema.model_json_schema())
        prefix_function = build_transformers_prefix_allowed_tokens_fn(
            self.model.tokenizer, parser
        )
        with torch.no_grad():
            output_dict = self.model(prompt, prefix_allowed_tokens_fn=prefix_function)
        # Strip the prompt from the generated text and load the JSON
        output = output_dict[0]["generated_text"][len(prompt):]
        json_result = json.loads(output)
        # Return a valid pydantic object according to the schema DeepEval supplied
        return schema(**json_result)

    async def a_generate(self, prompt: str, schema: BaseModel) -> BaseModel:
        """Asynchronous version of the generate method."""
        return self.generate(prompt, schema)

    def get_model_name(self) -> str:
        """Return the name of the model."""
        return self.model_name
Select OpenAI or Qwen Model#
This example supports OpenAI's gpt-4o or Qwen2-0.5B.
Use Qwen2-0.5B for local, simple tests and CPU-only environments.
MODE = "qwen" # or openai
# If using openai mode, set the API key and base URL
OPENAI_API_KEY = ""
OPENAI_BASE_URL = "https://api.openai.com/v1"
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
os.environ["OPENAI_BASE_URL"] = OPENAI_BASE_URL
if MODE == "openai":
name = model = "gpt-4o"
if MODE == "qwen":
name = "Qwen/Qwen2-0.5B"
model = transformers.pipeline(
"text-generation",
model="Qwen/Qwen2-0.5B",
framework="pt",
device_map="cpu",
do_sample=True,
num_return_sequences=1,
max_new_tokens=10000,
)
model = QwenDeepEvalBaseLLM(model_name="Qwen/Qwen2-0.5B", model=model, device="cpu")
print(f"Using mode: {MODE.upper()}\n")
print(f"LLM model: {name}")
Sliding Window Attention is enabled but not implemented for `sdpa`; unexpected results may be encountered.
Device set to use cpu
Using mode: QWEN
LLM model: Qwen/Qwen2-0.5B
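Before handing the wrapper to DeepEval, you can sanity-check its constrained JSON generation directly with the model object created above. This is a minimal sketch: the Verdict schema below is hypothetical and used only for illustration; DeepEval supplies its own schemas when it calls generate().

from pydantic import BaseModel

# Hypothetical schema, for illustration only
class Verdict(BaseModel):
    verdict: str
    reason: str

if MODE == "qwen":
    sample = model.generate(
        "Reply in JSON with keys 'verdict' and 'reason': is the sky blue?",
        schema=Verdict,
    )
    print(sample)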
Example evaluation task#
test_case = LLMTestCase(
input="I'm on an F-1 visa, how long can I stay in the US after graduation?",
actual_output="You can stay up to 30 days after completing your degree.",
expected_output="You can stay up to 60 days after completing your degree.",
retrieval_context=[
"""If you are in the U.S. on an F-1 visa, you are allowed to stay for 60 days after completing
your degree, unless you have applied for and been approved to participate in OPT."""
],
)
faithfulness = FaithfulnessMetric(model=model)
results = evaluate(test_cases=[test_case], metrics=[faithfulness], use_cache=True)
✨ You're running DeepEval's latest Faithfulness Metric! (using Qwen/Qwen2-0.5B, strict=False, async_mode=True)...
Event loop is already running. Applying nest_asyncio patch to allow async execution...
Evaluating 1 test case(s) in parallel: |██████████|100% (1/1) [Time Taken: 02:23, 143.41s/test case]
======================================================================
Metrics Summary
- ✅ Faithfulness (score: 0.5, threshold: 0.5, strict: False, evaluation model: Qwen/Qwen2-0.5B, reason: The score is.50 because The actual output [Your Output] doesn't match the retrieval context [Your Retrieval Context] due to the following contradictions:
- False Positive [Your Output],
- False Negative [Your Output],
- False Controversial [Your Output],
- False Controversial [Your Retrieval Context], and
- False Controversial [Your Output] when compared to the actual output of [Your Output] using their score of [Your Retrieval Score ] based on information in their respective columns. They [Your Retrieval Context] should have 10% more confidence in [Your Output] compared to [Your Retrieval Context]., error: None)
For test case:
- input: I'm on an F-1 visa, how long can I stay in the US after graduation?
- actual output: You can stay up to 30 days after completing your degree.
- expected output: You can stay up to 60 days after completing your degree.
- context: None
- retrieval context: ['If you are in the U.S. on an F-1 visa, you are allowed to stay for 60 days after completing\n your degree, unless you have applied for and been approved to participate in OPT.']
======================================================================
Overall Metric Pass Rates
Faithfulness: 100.00% pass rate
======================================================================
✓ Tests finished 🎉! Run 'deepeval login' to save and analyze evaluation results on Confident AI. ✨👀 Looking for a place for your LLM test data to live 🏡❤️ ? Use Confident AI to get & share testing reports, experiment with models/prompts, and catch regressions for your LLM system. Just run 'deepeval login' in the CLI.
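The evaluate() call also returns the results programmatically, so you can inspect each metric without parsing the console output. A minimal sketch (the attribute names match the deepeval version pinned above and are the same ones used by the evaluation function below):

# Walk the evaluation results returned above
for test_result in results.test_results:
    print(f"Input: {test_result.input}")
    for metric_data in test_result.metrics_data:
        print(
            f"  {metric_data.name}: score={metric_data.score}, "
            f"threshold={metric_data.threshold}, success={metric_data.success}"
        )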
Create an evaluation function#
%%writefile evaluate_llm.py
import os
import mlrun
import pandas as pd
from deepeval import evaluate
from deepeval.models.base_model import DeepEvalBaseLLM
from deepeval.test_case import LLMTestCase
from mlrun.utils import create_class
@mlrun.handler(outputs=["evaluation"])
def evaluate_llm(
    test_cases: list[dict],
    metrics: list[str],
    model: str,
):
    # Build DeepEval test cases and metric objects, then run the evaluation
    results = evaluate(
        test_cases=[LLMTestCase(**t) for t in test_cases],
        metrics=[create_class(m)(model=model) for m in metrics],
        use_cache=True,
    )
    # Flatten the per-test, per-metric results into a tabular structure
    rows = []
    for i, result in enumerate(results.test_results):
        for metric in result.metrics_data:
            result_dict = {
                "test": f"test_case_{i}",
                "actual_output": result.actual_output,
                "expected_output": result.expected_output,
                "context": result.context,
                "retrieval_context": result.retrieval_context,
                "user_input": result.input,
                "test_success": result.success,
                "metric_success": metric.success,
                "metric": metric.name,
                "evaluation_model": metric.evaluation_model,
                "metric_score": metric.score,
                "metric_reason": metric.reason,
                "evaluation_cost": metric.evaluation_cost,
                "metric_threshold": metric.threshold,
                "metric_error": metric.error,
            }
            rows.append(result_dict)
    df = pd.DataFrame(rows)
    return df
Overwriting evaluate_llm.py
project = mlrun.get_or_create_project("evaluate")
> 2025-04-22 09:21:40,230 [info] Project loaded successfully: {"project_name":"evaluate"}
evaluation_fn = project.set_function(
name="evaluate-llm",
func="evaluate_llm.py",
kind="job",
image="mlrun/mlrun",
handler="evaluate_llm",
)
# Store OpenAI credentials as k8s secrets
if MODE == "openai":
project.set_secrets(
{"OPENAI_API_KEY": OPENAI_API_KEY, "OPENAI_BASE_URL": OPENAI_BASE_URL}
)
metrics = [
"deepeval.metrics.AnswerRelevancyMetric",
"deepeval.metrics.FaithfulnessMetric",
"deepeval.metrics.ContextualPrecisionMetric",
"deepeval.metrics.ContextualRecallMetric",
"deepeval.metrics.ContextualRelevancyMetric",
]
if MODE == "qwen":
metrics = [
"deepeval.metrics.AnswerRelevancyMetric",
"deepeval.metrics.FaithfulnessMetric",
]
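The handler resolves each metric from its import path with mlrun.utils.create_class, so you can verify the paths locally before submitting the job. A minimal sketch:

from mlrun.utils import create_class

# Resolve each metric path to its DeepEval class (fails if a path cannot be imported)
for metric_path in metrics:
    print(create_class(metric_path))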
Run an evaluation job#
%%time
evaluation_run = project.run_function(
evaluation_fn,
params={
"test_cases": [
dict(
input="I'm on an F-1 visa, how long can I stay in the US after graduation?",
actual_output="You can stay up to 30 days after completing your degree.",
expected_output="You can stay up to 60 days after completing your degree.",
retrieval_context=[
"""If you are in the U.S. on an F-1 visa, you are allowed to stay for 60 days after completing
your degree, unless you have applied for and been approved to participate in OPT."""
],
),
dict(
input="What are some benefits of MLRun?",
actual_output="MLRun is an MLOps orchestration framework that enables you to develop, train, deploy, and manage machine learning models in a serverless environment. It provides a set of tools and APIs for building, testing, and deploying model serving functions.",
expected_output="MLRun is an open MLOps platform for quickly building and managing continuous ML applications across their lifecycle. MLRun integrates into your development and CI/CD environment and automates the delivery of production data, ML pipelines, and online applications. MLRun significantly reduces engineering efforts, time to production, and computation resources. With MLRun, you can choose any IDE on your local machine or on the cloud. MLRun breaks the silos between data, ML, software, and DevOps/MLOps teams, enabling collaboration and fast continuous improvements.",
retrieval_context=[
"""Instead of a siloed, complex, and manual process, MLRun enables production pipeline design using a modular strategy, where the different parts contribute to a continuous, automated, and far simpler path from research and development to scalable production pipelines without refactoring code, adding glue logic, or spending significant efforts on data and ML engineering.""",
"""MLRun uses Serverless Function technology: write the code once, using your preferred development environment and simple "local" semantics, and then run it as-is on different platforms and at scale. MLRun automates the build process, execution, data movement, scaling, versioning, parameterization, output tracking, CI/CD integration, deployment to production, monitoring, and more.""",
],
),
],
"metrics": metrics,
"model": model,
},
local=True,
)
> 2025-04-22 09:21:41,057 [info] Storing function: {"db":"http://mlrun-api:8080","name":"evaluate-llm-evaluate-llm","uid":"545dd486f49b4f8aae6e11d59d7f191d"}
✨ You're running DeepEval's latest Answer Relevancy Metric! (using Qwen/Qwen2-0.5B, strict=False, async_mode=True)...
✨ You're running DeepEval's latest Faithfulness Metric! (using Qwen/Qwen2-0.5B, strict=False, async_mode=True)...
Event loop is already running. Applying nest_asyncio patch to allow async execution...
Evaluating 2 test case(s) in parallel: |██████████|100% (2/2) [Time Taken: 05:12, 156.17s/test case]
======================================================================
Metrics Summary
- ✅ Answer Relevancy (score: 1.0, threshold: 0.5, strict: False, evaluation model: Qwen/Qwen2-0.5B, reason: The score is 1.00 because there is no relevant statement in the actual output that has >30 words, error: None)
- ❌ Faithfulness (score: 0.0, threshold: 0.5, strict: False, evaluation model: Qwen/Qwen2-0.5B, reason: The score is 0.00 because the actual output does not align with the retrieval context, error: None)
For test case:
- input: What are some benefits of MLRun?
- actual output: MLRun is an MLOps orchestration framework that enables you to develop, train, deploy, and manage machine learning models in a serverless environment. It provides a set of tools and APIs for building, testing, and deploying model serving functions.
- expected output: MLRun is an open MLOps platform for quickly building and managing continuous ML applications across their lifecycle. MLRun integrates into your development and CI/CD environment and automates the delivery of production data, ML pipelines, and online applications. MLRun significantly reduces engineering efforts, time to production, and computation resources. With MLRun, you can choose any IDE on your local machine or on the cloud. MLRun breaks the silos between data, ML, software, and DevOps/MLOps teams, enabling collaboration and fast continuous improvements.
- context: None
- retrieval context: ['Instead of a siloed, complex, and manual process, MLRun enables production pipeline design using a modular strategy, where the different parts contribute to a continuous, automated, and far simpler path from research and development to scalable production pipelines without refactoring code, adding glue logic, or spending significant efforts on data and ML engineering.', 'MLRun uses Serverless Function technology: write the code once, using your preferred development environment and simple "local" semantics, and then run it as-is on different platforms and at scale. MLRun automates the build process, execution, data movement, scaling, versioning, parameterization, output tracking, CI/CD integration, deployment to production, monitoring, and more.']
======================================================================
Metrics Summary
- ✅ Answer Relevancy (score: 1.0, threshold: 0.5, strict: False, evaluation model: Qwen/Qwen2-0.5B, reason: I'm on an F-1 visa, how long can I stay in the US after graduation? I need to apply for a US visa and my time in the US should not exceed two years to give up the F-1 visa, after graduation. Thank you, error: None)
- ✅ Faithfulness (score: 1.0, threshold: 0.5, strict: False, evaluation model: Qwen/Qwen2-0.5B, reason: The score is <faithfulness_score> because <your_reason>.
, error: None)
For test case:
- input: I'm on an F-1 visa, how long can I stay in the US after graduation?
- actual output: You can stay up to 30 days after completing your degree.
- expected output: You can stay up to 60 days after completing your degree.
- context: None
- retrieval context: ['If you are in the U.S. on an F-1 visa, you are allowed to stay for 60 days after completing\n your degree, unless you have applied for and been approved to participate in OPT.']
======================================================================
Overall Metric Pass Rates
Answer Relevancy: 100.00% pass rate
Faithfulness: 50.00% pass rate
======================================================================
✓ Tests finished 🎉! Run 'deepeval login' to save and analyze evaluation results on Confident AI. ✨👀 Looking for a place for your LLM test data to live 🏡❤️ ? Use Confident AI to get & share testing reports, experiment with models/prompts, and catch regressions for your LLM system. Just run 'deepeval login' in the CLI.
Converting input from bool to <class 'numpy.uint8'> for compatibility.
| field | value |
|---|---|
| project | evaluate |
| uid | |
| iter | 0 |
| start | Apr 22 09:21:41 |
| end | NaT |
| state | completed |
| kind | run |
| name | evaluate-llm-evaluate-llm |
| labels | v3io_user=shapira kind=local owner=shapira host=jupyter-shapira-665ddf954b-jscr6 |
| inputs | |
| parameters | test_cases=[{'input': "I'm on an F-1 visa, how long can I stay in the US after graduation?", 'actual_output': 'You can stay up to 30 days after completing your degree.', 'expected_output': 'You can stay up to 60 days after completing your degree.', 'retrieval_context': ['If you are in the U.S. on an F-1 visa, you are allowed to stay for 60 days after completing\n your degree, unless you have applied for and been approved to participate in OPT.']}, {'input': 'What are some benefits of MLRun?', 'actual_output': 'MLRun is an MLOps orchestration framework that enables you to develop, train, deploy, and manage machine learning models in a serverless environment. It provides a set of tools and APIs for building, testing, and deploying model serving functions.', 'expected_output': 'MLRun is an open MLOps platform for quickly building and managing continuous ML applications across their lifecycle. MLRun integrates into your development and CI/CD environment and automates the delivery of production data, ML pipelines, and online applications. MLRun significantly reduces engineering efforts, time to production, and computation resources. With MLRun, you can choose any IDE on your local machine or on the cloud. MLRun breaks the silos between data, ML, software, and DevOps/MLOps teams, enabling collaboration and fast continuous improvements.', 'retrieval_context': ['Instead of a siloed, complex, and manual process, MLRun enables production pipeline design using a modular strategy, where the different parts contribute to a continuous, automated, and far simpler path from research and development to scalable production pipelines without refactoring code, adding glue logic, or spending significant efforts on data and ML engineering.', 'MLRun uses Serverless Function technology: write the code once, using your preferred development environment and simple "local" semantics, and then run it as-is on different platforms and at scale. MLRun automates the build process, execution, data movement, scaling, versioning, parameterization, output tracking, CI/CD integration, deployment to production, monitoring, and more.']}] metrics=['deepeval.metrics.AnswerRelevancyMetric', 'deepeval.metrics.FaithfulnessMetric'] model=<__main__.QwenDeepEvalBaseLLM object at 0x7f550150b520> |
| results | |
| artifact_uris | evaluation=store://datasets/evaluate/evaluate-llm-evaluate-llm_evaluation#0@545dd486f49b4f8aae6e11d59d7f191d^fb1d004984e59cf6958022ebe095bd251d91e154 |
> 2025-04-22 09:26:55,309 [info] Run execution finished: {"name":"evaluate-llm-evaluate-llm","status":"completed"}
CPU times: user 29min 36s, sys: 8.89 s, total: 29min 45s
Wall time: 5min 15s
View the logged output#
evaluation_run.artifact("evaluation").show()
| | test | actual_output | expected_output | context | retrieval_context | user_input | test_success | metric_success | metric | evaluation_model | metric_score | metric_reason | evaluation_cost | metric_threshold | metric_error |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | test_case_0 | MLRun is an MLOps orchestration framework that... | MLRun is an open MLOps platform for quickly bu... | None | [Instead of a siloed, complex, and manual proc... | What are some benefits of MLRun? | False | True | Answer Relevancy | Qwen/Qwen2-0.5B | 1.0 | The score is 1.00 because there is no relevant... | 0.0 | 0.5 | None |
| 1 | test_case_0 | MLRun is an MLOps orchestration framework that... | MLRun is an open MLOps platform for quickly bu... | None | [Instead of a siloed, complex, and manual proc... | What are some benefits of MLRun? | False | False | Faithfulness | Qwen/Qwen2-0.5B | 0.0 | The score is 0.00 because the actual output do... | 0.0 | 0.5 | None |
| 2 | test_case_1 | You can stay up to 30 days after completing yo... | You can stay up to 60 days after completing yo... | None | [If you are in the U.S. on an F-1 visa, you ar... | I'm on an F-1 visa, how long can I stay in the... | True | True | Answer Relevancy | Qwen/Qwen2-0.5B | 1.0 | I'm on an F-1 visa, how long can I stay in the... | NaN | 0.5 | None |
| 3 | test_case_1 | You can stay up to 30 days after completing yo... | You can stay up to 60 days after completing yo... | None | [If you are in the U.S. on an F-1 visa, you ar... | I'm on an F-1 visa, how long can I stay in the... | True | True | Faithfulness | Qwen/Qwen2-0.5B | 1.0 | The score is <faithfulness_score> because <you... | NaN | 0.5 | None |
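You can also load the logged dataset back as a pandas DataFrame for further analysis, for example to list only the metrics that fell below their threshold. A minimal sketch using the artifact's as_df() accessor (boolean columns may be stored as 0/1, as the conversion warning above indicates):

# Load the logged evaluation dataset as a DataFrame
eval_df = evaluation_run.artifact("evaluation").as_df()

# Show only the metrics that failed their threshold
failed = eval_df[eval_df["metric_success"] == 0]
print(failed[["test", "metric", "metric_score", "metric_reason"]])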