Deploying an LLM using MLRun

This notebook illustrates deploying an LLM using MLRun. It shows how to fetch a dataset (a list of articles), scrape those articles, chunk and index them into a vector store, and then deploy an open-source Hugging Face model to an endpoint. You can then send requests to the endpoint and get responses enriched by retrieval-augmented generation (RAG) over the data that you just downloaded.

Since this tutorial is for illustrative purposes, it uses minimal resources: CPU rather than GPU, and a small amount of data.

In this tutorial:

MLRun installation and configuration
Set up the vector database in the cluster
Build the vector DB
Serving the function

MLRun installation and configuration#

Before running this notebook, make sure the mlrun package is installed (pip install mlrun) and that you have configured access to the MLRun service.

# Install MLRun if not installed, run this only once. Restart the notebook after the install
# %pip install mlrun

import json
import mlrun

Get or create a new project

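First create, load, or use (get) an MLRun project. The get_or_create_project method tries to load the project from the MLRun DB. If the project does not exist, it creates a new one. Because user_project=True, the current username is appended to the project name, which is why the log output shows genai-tutorial-admin.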

project = mlrun.get_or_create_project(
    "genai-tutorial", "./", user_project=True, allow_cross_project=True
)

> 2026-03-25 07:32:31,391 [info] Project loaded successfully: {"project_name":"genai-tutorial-admin"}

Set up the vector database in the cluster#

The next two steps import a predefined dataset and load it into a vector database. The vector database is then stored in the data layer of the cluster.

If you're not using Iguazio's Jupyter, download fetch-vectordb-data.py.

# The model used is the free, open-source Phi-2
MODEL_ID = "microsoft/phi-2"

# Define the dataset for the VectorDB
DATA_SET = mlrun.get_sample_path("data/genai-tutorial/labelled_newscatcher_dataset.csv")

# The location of the VectorDB files
CACHE_DIR = mlrun.mlconf.artifact_path
CACHE_DIR = (
    CACHE_DIR.replace("v3io://", "/v3io").replace("{{run.project}}", project.name)
    + "/cache"
)

# An s3:// artifact path indicates MLRun Community Edition (CE), where the v3io mount is not used
is_ce = CACHE_DIR.startswith("s3://")
is_ce

False
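To make the path handling concrete, here is a minimal sketch of the same string transformation applied to two made-up artifact-path templates. The template values and the bucket name are assumptions for illustration only; the value of mlrun.mlconf.artifact_path on your cluster may differ. A v3io:// path is mapped to the /v3io mount, while an s3:// path is left unchanged, which is what the is_ce check detects.

```python
def resolve_cache_dir(artifact_path: str, project_name: str) -> str:
    """Apply the same replacements as the CACHE_DIR expression above."""
    local_path = artifact_path.replace("v3io://", "/v3io").replace(
        "{{run.project}}", project_name
    )
    return local_path + "/cache"


# Hypothetical artifact-path templates (assumptions, not read from a real cluster)
v3io_dir = resolve_cache_dir(
    "v3io:///projects/{{run.project}}/artifacts", "genai-tutorial-admin"
)
s3_dir = resolve_cache_dir(
    "s3://my-bucket/projects/{{run.project}}/artifacts", "genai-tutorial-admin"
)

print(v3io_dir)  # /v3io/projects/genai-tutorial-admin/artifacts/cache
print(s3_dir)  # s3://my-bucket/projects/genai-tutorial-admin/artifacts/cache
print(s3_dir.startswith("s3://"))  # True
```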

Build an image from mlrun/mlrun that includes the langchain and torch packages#

The image can be created with mlrun/mlrun as base:

commands = [
    "pip install chromadb==0.5.0 langchain==0.2.3 langchain-community==0.2.4 langchain-core==0.2.5 langchain-text-splitters==0.2.1 clean-text==0.6.0 transformers==4.41.2",
    "pip install torch --index-url https://download.pytorch.org/whl/cpu",
    "pip install --upgrade requests requests-toolbelt",
]

import sys

# The protobuf reinstall is applied only on Python 3.11 and later
minor_version = sys.version_info.minor
if minor_version >= 11:
    commands.append("pip install --upgrade --force-reinstall protobuf")
else:
    print(f"minor_version {minor_version} not supported")
commands

['pip install chromadb==0.5.0 langchain==0.2.3 langchain-community==0.2.4 langchain-core==0.2.5 langchain-text-splitters==0.2.1 clean-text==0.6.0 transformers==4.41.2',
 'pip install torch --index-url https://download.pytorch.org/whl/cpu',
 'pip install --upgrade requests requests-toolbelt',
 'pip install --upgrade --force-reinstall protobuf']

Run the following cell to build the image. Building an image takes quite some time, so once it has been built successfully there is no need to run the cell again.

project.build_image(
    image=".llm-demo-data",
    base_image="mlrun/mlrun",
    set_as_default=False,
    commands=commands,
)

Fetch the dataset for the vector DB and save it in the cluster:

fetch = project.set_function(
    name="fetch-vectordb-data",
    func="src/fetch-vectordb-data.py",
    kind="job",
    image=".llm-demo-data",
)

ret = project.run_function(
    name="fetch-vectordb-data-run",
    function="fetch-vectordb-data",
    handler="handler",
    params={"data_set": DATA_SET},
)

> 2026-03-25 07:35:48,102 [info] Storing function: {"db":"http://mlrun-api:8080","name":"fetch-vectordb-data-run","uid":"c98f11b437884433a08763e3dfd748b6"}

Unexpected run keyword argument 'local' was ignored.

> 2026-03-25 07:35:48,384 [info] Job is running in the background, pod: fetch-vectordb-data-run-6xlxz
Since the GPL-licensed package `unidecode` is not installed, using Python's `unicodedata` package which yields worse results.
WARNING:root:USER_AGENT environment variable not set, consider setting it to identify your requests.
Fetching pages: 100%|##########| 80/80 [00:31<00:00,  2.50it/s]
> 2026-03-25 07:36:46,288 [info] Dataset dowloaded and logged
> 2026-03-25 07:36:46,374 [info] To track results use the CLI: {"info_cmd":"mlrun get run c98f11b437884433a08763e3dfd748b6 -p genai-tutorial-admin","logs_cmd":"mlrun logs c98f11b437884433a08763e3dfd748b6 -p genai-tutorial-admin"}
> 2026-03-25 07:36:46,374 [info] Or click for UI: {"ui_url":"https://dashboard.default-tenant.app.cust-cs-il.iguazio-cd0.com/mlprojects/genai-tutorial-admin/jobs/monitor-jobs/fetch-vectordb-data-run/c98f11b437884433a08763e3dfd748b6/overview"}
> 2026-03-25 07:36:46,375 [info] Run execution finished: {"name":"fetch-vectordb-data-run","status":"completed"}

project	uid	iter	start	end	state	kind	name	labels	inputs	parameters	results	artifacts
genai-tutorial-admin	...dfd748b6	0	Mar 25 07:36:07	2026-03-25 07:36:46.362719+00:00	completed	run	fetch-vectordb-data-run	v3io_user=admin kind=job owner=admin mlrun/client_version=1.11.0-rc40 mlrun/client_python_version=3.11.14 host=fetch-vectordb-data-run-6xlxz		data_set=https://s3.wasabisys.com/iguazio/data/genai-tutorial/labelled_newscatcher_dataset.csv		vector-db-dataset

> to track results use the .show() or .logs() methods or click here to open in UI

> 2026-03-25 07:36:51,791 [info] Run execution finished: {"name":"fetch-vectordb-data-run","status":"completed"}

ret.outputs

{'vector-db-dataset': 'store://datasets/genai-tutorial-admin/fetch-vectordb-data-run_vector-db-dataset:latest@c98f11b437884433a08763e3dfd748b6^6e34a7a65d46b4bc74ead332f37cecd8dba8b258'}
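The output value is a store:// URI. Reading the value printed above, it appears to combine an artifact kind, the project name, the artifact key, a tag, and two identifiers; the part after @ matches the run uid shown in the log. The following sketch splits such a URI with a regular expression. The pattern and the field names producer and uid are inferred from this single example and are assumptions, not MLRun's own parser.

```python
import re

# Pattern inferred from the one URI shown in this tutorial (an assumption)
STORE_URI = re.compile(
    r"^store://(?P<kind>[^/]+)/(?P<project>[^/]+)/(?P<key>[^:@^]+)"
    r"(?::(?P<tag>[^@^]+))?"
    r"(?:@(?P<producer>[^^]+))?"
    r"(?:\^(?P<uid>.+))?$"
)


def split_store_uri(uri: str) -> dict:
    """Split a store:// URI into its visible parts; missing parts are None."""
    match = STORE_URI.match(uri)
    if match is None:
        raise ValueError(f"not a store:// URI: {uri!r}")
    return match.groupdict()


parts = split_store_uri(
    "store://datasets/genai-tutorial-admin/"
    "fetch-vectordb-data-run_vector-db-dataset:latest"
    "@c98f11b437884433a08763e3dfd748b6"
    "^6e34a7a65d46b4bc74ead332f37cecd8dba8b258"
)
for name, value in parts.items():
    print(f"{name}: {value}")
```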

Build the vector DB#

Build the vector DB in the data layer and load the data into it.

If you're not using Iguazio's Jupyter, download build-vector-db.py.

# Build the vector DB using the image
build_vectordb = project.set_function(
    name="build-vectordb",
    func="src/build-vector-db.py",
    kind="job",
    image=".llm-demo-data",
)

if not is_ce:
    build_vectordb.apply(mlrun.auto_mount())
    print("Applying mlrun.auto_mount!")
else:
    print("Not applying mlrun.auto_mount!")

Applying mlrun.auto_mount!

build_vectordb_run = project.run_function(
    function="build-vectordb",
    inputs={"df": ret.outputs["vector-db-dataset"]},
    params={"cache_dir": CACHE_DIR},
    handler="handler_chroma",
    outputs=["vect_db"],
)

VECTORDB_PATH = build_vectordb_run.outputs["vect_db"]
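The build job chunks the scraped articles before indexing them. The contents of build-vector-db.py are not reproduced in this notebook, so the following is only a sketch of the general idea: fixed-size character chunks with an overlap, written in plain Python. The function name chunk_text and its default sizes are hypothetical; the actual script presumably relies on the langchain text splitters installed in the image.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into character chunks where neighbours share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start : start + chunk_size])
        # Stop once the current chunk reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghij"
print(chunk_text(sample, chunk_size=4, overlap=1))  # ['abcd', 'defg', 'ghij']
```

The overlap keeps a sentence that straddles a chunk boundary retrievable from at least one chunk, at the cost of storing some text twice.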

Serving the function#

If you're not using Iguazio's Jupyter, download serving.py. Now you can deploy the Nuclio function that serves the LLM:

serve_func = project.set_function(
    name="serve-llm",
    func="src/serving.py",
    image=".llm-demo-data",
    kind="nuclio",
)

# Transferring the model and VectorDB path to the serving functions
serve_func.set_envs(
    env_vars={
        "MODEL_ID": MODEL_ID,
        "CACHE_DIR": CACHE_DIR,
        "VECTORDB_PATH": VECTORDB_PATH,
    }
)

# Since the model is held in memory, use only one replica and one worker
# Since this runs on CPU only, inference might take ~1 minute, so the timeouts are increased
serve_func.spec.min_replicas = 1
serve_func.spec.max_replicas = 1
serve_func.with_http(worker_timeout=240, gateway_timeout=600, workers=1)
serve_func.set_config("spec.readinessTimeoutSeconds", 3600)

> 2026-03-25 07:44:30,598 [warning] Adding HTTP trigger despite the default HTTP trigger creation being disabled

<mlrun.runtimes.nuclio.function.RemoteRuntime at 0x7f27c9ed5b10>

if not is_ce:
    serve_func.apply(mlrun.auto_mount())
    print("Applying mlrun.auto_mount!")
else:
    print("Not applying mlrun.auto_mount!")

Applying mlrun.auto_mount!

serve_func = project.deploy_function(function="serve-llm")

> 2026-03-25 07:44:37,597 [info] Starting remote function deploy
2026-03-25 07:44:37  (info) Deploying function
2026-03-25 07:44:37  (info) Building
2026-03-25 07:44:38  (info) Staging files and preparing base images
2026-03-25 07:44:38  (warn) Using user provided base image, runtime interpreter version is provided by the base image
2026-03-25 07:44:38  (info) Building processor image
2026-03-25 07:46:33  (info) Build complete
2026-03-25 07:48:29  (info) Function deploy complete
> 2026-03-25 07:48:33,973 [info] Successfully deployed function: {"external_invocation_urls":["genai-tutorial-admin-serve-llm.default-tenant.app.cust-cs-il.iguazio-cd0.com/"],"internal_invocation_urls":["nuclio-genai-tutorial-admin-serve-llm.default-tenant.svc.cluster.local:8080"]}

Test Serving Function#

Invoke the inference endpoint of the LLM that is now hosted in the cluster:

body = {
    "question": "What are some new developments in space travel?",
    "topic": "science",
}

resp = serve_func.function.invoke("/", body=json.dumps(body))

print(resp["response"])

Hi there!

Thanks for your question. There have been some exciting developments in space travel recently. One of the most notable is the successful launch of a rocket from Earth, which poses a potential threat to space travel. This launch has sparked a lot of interest and discussion in the scientific community.

In addition to this, there have been some exciting advancements in the field of space exploration. For example, NASA recently announced that they have discovered a new exoplanet that could potentially support life. This is a major breakthrough in our understanding of the universe and could have significant implications for future space travel.

Overall, there is a lot of exciting news in the world of space travel right now. I hope this helps answer your question!

Best regards,
[Your Name]

print(resp["sources"])

['https://www.express.co.uk/news/science/1324095/space-news-spacex-rocket-launch-stars-elon-musk-Comet-Neowise-latest']
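The sources field lists the article that the context was retrieved from. The retrieval logic lives in serving.py, which is not reproduced here, so the following is only a toy sketch of the general idea: filter the documents by topic, then rank them by similarity to the question. It uses bag-of-words cosine similarity instead of real embeddings, and the documents, URLs, and function names are all hypothetical. Whether the serving function filters on topic in exactly this way is an assumption.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lower-case the text and count its alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(question: str, documents: list[dict], topic: str, k: int = 1) -> list[dict]:
    """Keep documents with the requested topic, then return the k most similar."""
    query = bag_of_words(question)
    candidates = [doc for doc in documents if doc["topic"] == topic]
    ranked = sorted(
        candidates,
        key=lambda doc: cosine(query, bag_of_words(doc["text"])),
        reverse=True,
    )
    return ranked[:k]


# Made-up documents for illustration only
documents = [
    {"topic": "science", "source": "https://example.com/rockets",
     "text": "New rocket launches change space travel"},
    {"topic": "science", "source": "https://example.com/cells",
     "text": "Biologists map cell growth"},
    {"topic": "sport", "source": "https://example.com/match",
     "text": "Space for new players in the travel squad"},
]

top = retrieve("What are some new developments in space travel?", documents, "science")
print([doc["source"] for doc in top])
```

The sport document shares several words with the question but is excluded by the topic filter before ranking.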

print(resp["prompt"])

The instruction below describes a task. Write a response that appropriately completes the request.

### Instruction:
User question:
What are some new developments in space travel?

Context:
Space news: Rocket launches from earth pose threat to space travel | Science | News | Express.co.uk Express. Home of the Daily and Sunday Express. Online Games Horoscopes Express Rated Shop Paper Newsletters Login Register Your Account Newsletters BookmarksPremium Sign OutUkUs 6C Search News Politics Royal Showbiz & TV Sport Finance Travel Life & Style Shopping Premium Articles UK Royal Weather Politics Defence World US Science History Weird Nature InYourArea HomeNewsScience Space horror: Rocket

### Response:
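The prompt printed above follows a simple instruction template: a fixed preamble, the user question, the retrieved context, and a response marker. As a sketch of how such a prompt could be assembled, the following reproduces the visible layout with Python string formatting. The names PROMPT_TEMPLATE and build_prompt are hypothetical, and serving.py may build its prompt differently.

```python
# Layout copied from the prompt printed above; the helper itself is an assumption
PROMPT_TEMPLATE = (
    "The instruction below describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n"
    "User question:\n{question}\n\n"
    "Context:\n{context}\n\n"
    "### Response:\n"
)


def build_prompt(question: str, context_chunks: list[str]) -> str:
    """Join the retrieved chunks and fill them into the template."""
    context = "\n".join(chunk.strip() for chunk in context_chunks)
    return PROMPT_TEMPLATE.format(question=question.strip(), context=context)


prompt = build_prompt(
    "What are some new developments in space travel?",
    ["Space news: Rocket launches from earth pose threat to space travel"],
)
print(prompt)
```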

# Reference the functions stored in the MLRun DB so the workflow can use them by name
project.set_function(f"db://{project.name}/fetch-vectordb-data")
project.set_function(f"db://{project.name}/build-vectordb")
project.set_function(f"db://{project.name}/serve-llm")
project.set_source(f"db://{project.name}")
project.save()

Run E2E Workflow#

%%writefile workflow.py
import mlrun
from kfp import dsl


@dsl.pipeline(name="GenAI demo")
def kfpipeline(data_set, cache_dir, model_id):
    project = mlrun.get_current_project()

    # Step 1: fetch and scrape the articles
    fetch = project.run_function(
        function="fetch-vectordb-data",
        name="fetch-vectordb-data-run",
        handler="handler",
        params={"data_set": data_set},
        outputs=["vector-db-dataset"],
    )

    # Step 2: build the vector DB from the fetched dataset
    vectordb_build = project.run_function(
        function="build-vectordb",
        inputs={"df": fetch.outputs["vector-db-dataset"]},
        params={"cache_dir": cache_dir},
        handler="handler_chroma",
        outputs=["vect_db"],
    )

    # Step 3: configure the serving function and deploy it after the vector DB is built
    serve_func = project.get_function("serve-llm")
    serve_func.set_envs(
        env_vars={
            "MODEL_ID": model_id,
            "CACHE_DIR": cache_dir,
            "VECTORDB_PATH": vectordb_build.outputs["vect_db"],
        }
    )
    serve_func.spec.min_replicas = 1
    serve_func.spec.max_replicas = 1
    serve_func.with_http(worker_timeout=120, gateway_timeout=150, workers=1)
    serve_func.set_config("spec.readinessTimeoutSeconds", 3600)

    deploy = project.deploy_function("serve-llm", verbose=True).after(vectordb_build)

project.set_workflow("main", "workflow.py", embed=True)
project.save()

Note that the workflow may take up to 20 minutes to complete.#

run_id = project.run(
    "main",
    arguments={"cache_dir": CACHE_DIR, "data_set": DATA_SET, "model_id": MODEL_ID},
    engine="remote",
    watch=True,
)