Deploying an LLM using MLRun

This notebook illustrates deploying an LLM using MLRun.

Since this tutorial is for illustrative purposes, it uses minimal resources: a CPU rather than a GPU, and a small amount of data.

In this tutorial:

See also:



MLRun installation and configuration

Before running this notebook, make sure the mlrun package is installed (pip install mlrun) and that you have configured access to the MLRun service.

# Install MLRun if not installed, run this only once. Restart the notebook after the install
%pip install mlrun
import json
import mlrun
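If you are not sure whether the client is configured, a quick check of the environment can help. This is a minimal sketch, not part of MLRun: missing_mlrun_settings is a hypothetical helper, MLRUN_DBPATH is the variable that points the client at the MLRun API, and treating V3IO_USERNAME and V3IO_ACCESS_KEY as required is an assumption that applies only to Iguazio-hosted clusters.

```python
import os
from typing import List, Mapping

# MLRUN_DBPATH points the client at the MLRun API service.
REQUIRED_VARS = ("MLRUN_DBPATH",)
# Assumption: these are only needed when the cluster is hosted on Iguazio.
IGUAZIO_VARS = ("V3IO_USERNAME", "V3IO_ACCESS_KEY")


def missing_mlrun_settings(env: Mapping[str, str], on_iguazio: bool = False) -> List[str]:
    """Hypothetical helper: return expected variables that are unset or empty."""
    expected = REQUIRED_VARS + (IGUAZIO_VARS if on_iguazio else ())
    return [name for name in expected if not env.get(name)]


print("Missing settings:", missing_mlrun_settings(os.environ) or "none")
```

An empty result only means the variables exist; it does not prove the service is reachable.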

Get or create a new project

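First, create, load, or get an existing MLRun project. The get_or_create_project method tries to load the project from the MLRun DB; if the project does not exist, it creates a new one. With user_project=True, your user name is appended to the project name, so each user gets a separate project (for example, genai-tutorial-shapira in the logs below).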

project = mlrun.get_or_create_project("genai-tutorial", user_project=True)
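The get-or-create behavior described above can be pictured with a small in-memory model. This is a hypothetical sketch, not MLRun code: InMemoryProjectDB and its get_or_create method are illustrative names, and the "-<user>" suffix mirrors the project name that appears in this notebook's logs.

```python
from typing import Dict


class InMemoryProjectDB:
    """Hypothetical stand-in for the MLRun DB, used only to illustrate the logic."""

    def __init__(self) -> None:
        self.projects: Dict[str, dict] = {}

    def get_or_create(self, name: str, user: str, user_project: bool = False) -> dict:
        # With user_project=True the user name is appended to the project name.
        full_name = f"{name}-{user}" if user_project else name
        if full_name in self.projects:
            # The project already exists: load it instead of creating a new one.
            return self.projects[full_name]
        project = {"name": full_name, "functions": {}}
        self.projects[full_name] = project
        return project
```

Calling get_or_create twice with the same arguments returns the same project object, which is why rerunning the notebook reuses the existing project.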

Set up the vector database in the cluster

These two steps import a predefined dataset and load it into a vector database, which is then stored in the data layer of the cluster.

If you're not using Iguazio's Jupyter, download fetch-vectordb-data.py.

The image for this step can be created using the following Dockerfile (contains MLRun and Chroma DB):

FROM mlrun/mlrun
RUN pip install chromadb langchain langchain-community langchain-core langchain-text-splitters clean-text

# The model used is the free, open-source Phi-2
MODEL_ID = "microsoft/phi-2"

# Define the dataset for the VectorDB
DATA_SET = mlrun.get_sample_path("data/genai-tutorial/labelled_newscatcher_dataset.csv")

# The location of the VectorDB files
CACHE_DIR = mlrun.mlconf.artifact_path
CACHE_DIR = (
    CACHE_DIR.replace("v3io://", "/v3io").replace("{{run.project}}", project.name)
    + "/cache"
)
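The CACHE_DIR expression rewrites a v3io:// URL into the /v3io mount path (presumably so that Chroma, which works with local paths, can read and write it through the auto-mounted volume) and fills in the project name. The same logic as a small testable function; the example artifact path "v3io:///projects/{{run.project}}/artifacts" is an assumption, so check mlrun.mlconf.artifact_path on your cluster.

```python
def to_cache_dir(artifact_path: str, project_name: str) -> str:
    """Mirror the notebook's CACHE_DIR computation as a pure function."""
    # "v3io:///projects/..." becomes "/v3io/projects/...", the mounted path.
    path = artifact_path.replace("v3io://", "/v3io")
    # Resolve the MLRun template placeholder with the actual project name.
    path = path.replace("{{run.project}}", project_name)
    return path + "/cache"


print(to_cache_dir("v3io:///projects/{{run.project}}/artifacts", "genai-tutorial-shapira"))
```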

Fetch the dataset for the vector DB and save it in the cluster:

fetch = project.set_function(
    name="fetch-vectordb-data",
    func="src/fetch-vectordb-data.py",
    kind="job",
    image="gcr.io/iguazio/mlrun-genai/llm-demo-data:1.8.0",
)
ret = project.run_function(
    name="fetch-vectordb-data-run",
    function="fetch-vectordb-data",
    handler="handler",
    params={"data_set": DATA_SET},
)
> 2025-05-18 12:53:05,592 [info] Storing function: {"db":"http://mlrun-api:8080","name":"fetch-vectordb-data-run","uid":"6a889fe2f00948088281b3c3d973c6c6"}
> 2025-05-18 12:53:05,895 [info] Job is running in the background, pod: fetch-vectordb-data-run-wrjq6
Since the GPL-licensed package `unidecode` is not installed, using Python's `unicodedata` package which yields worse results.
USER_AGENT environment variable not set, consider setting it to identify your requests.
See API reference for updated usage: https://python.langchain.com/api_reference/community/document_loaders/langchain_community.document_loaders.web_base.WebBaseLoader.html
Fetching pages:   0%|          | 0/80 [00:00<?, ?it/s]Error fetching https://en.brinkwire.com/science/scientists-hope-to-clone-almost-extinct-sumatran-rhinoceros/ with attempt 1/3: Cannot connect to host en.brinkwire.com:443 ssl:default [Name or service not known]. Retrying...
Fetching pages:  10%|#         | 8/80 [00:01<00:11,  6.01it/s]Error fetching https://en.brinkwire.com/science/scientists-hope-to-clone-almost-extinct-sumatran-rhinoceros/ with attempt 2/3: Cannot connect to host en.brinkwire.com:443 ssl:default [Name or service not known]. Retrying...
Fetching pages:  34%|###3      | 27/80 [00:04<00:08,  6.61it/s]Error fetching https://en.brinkwire.com/science/scientists-hope-to-clone-almost-extinct-sumatran-rhinoceros/, skipping due to continue_on_failure=True
Fetching pages: 100%|##########| 80/80 [00:27<00:00,  2.88it/s]
> 2025-05-18 12:53:46,724 [info] Dataset dowloaded and logged
> 2025-05-18 12:53:46,821 [info] To track results use the CLI: {"info_cmd":"mlrun get run 6a889fe2f00948088281b3c3d973c6c6 -p genai-tutorial-shapira","logs_cmd":"mlrun logs 6a889fe2f00948088281b3c3d973c6c6 -p genai-tutorial-shapira"}
> 2025-05-18 12:53:46,821 [info] Or click for UI: {"ui_url":"https://dashboard.default-tenant.app.cust-cs-il.iguazio-cd0.com/mlprojects/genai-tutorial-shapira/jobs/monitor-jobs/fetch-vectordb-data-run/6a889fe2f00948088281b3c3d973c6c6/overview"}
> 2025-05-18 12:53:46,822 [info] Run execution finished: {"name":"fetch-vectordb-data-run","status":"completed"}
project: genai-tutorial-shapira
iter: 0
start: May 18 12:53:09
end: 2025-05-18 12:53:46.808379+00:00
state: completed
kind: run
name: fetch-vectordb-data-run
labels: v3io_user=shapira, kind=job, owner=shapira, mlrun/client_version=1.8.0-rc60, mlrun/client_python_version=3.9.18, host=fetch-vectordb-data-run-wrjq6
parameters: data_set=https://s3.wasabisys.com/iguazio/data/genai-tutorial/labelled_newscatcher_dataset.csv
artifacts: vector-db-dataset

> to track results use the .show() or .logs() methods or click here to open in UI
> 2025-05-18 12:53:52,395 [info] Run execution finished: {"name":"fetch-vectordb-data-run","status":"completed"}
ret.outputs
{'vector-db-dataset': 'store://datasets/genai-tutorial-shapira/fetch-vectordb-data-run_vector-db-dataset:latest@6a889fe2f00948088281b3c3d973c6c6^ba1fc664e7d7e18592b216be8a884ff7bf432dd3'}

Build the vector DB

Build the vector DB in the data layer and load the data into it.

The image for this step can be created using the following Dockerfile (contains MLRun and Chroma DB):

FROM mlrun/mlrun
RUN pip install chromadb langchain langchain-community langchain-core langchain-text-splitters clean-text

If you're not using Iguazio's Jupyter, download build-vector-db.py.

# Build the vector DB using the image
build_vectordb = project.set_function(
    name="build-vectordb",
    func="src/build-vector-db.py",
    kind="job",
    image="gcr.io/iguazio/mlrun-genai/llm-demo-data:1.8.0",
).apply(mlrun.auto_mount())
project.run_function(
    function="build-vectordb",
    inputs={"df": ret.outputs["vector-db-dataset"]},
    params={"cache_dir": CACHE_DIR},
    handler="handler_chroma",
)
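build-vector-db.py itself is not shown here. To make the idea concrete, here is a toy, dependency-free sketch of what a vector DB does: store each document with a vector and metadata, then return the documents closest to a question. It is a stand-in, not Chroma: the bag-of-words "embedding" replaces a real embedding model, and the topic filter is an assumption based on the "topic" field that the serving request sends later.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Optional, Tuple


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (a real DB uses a neural model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorStore:
    """Hypothetical in-memory vector store with a metadata filter."""

    def __init__(self) -> None:
        self.rows: List[Tuple[Counter, str, Dict[str, str]]] = []

    def add(self, text: str, metadata: Dict[str, str]) -> None:
        self.rows.append((embed(text), text, metadata))

    def query(self, question: str, k: int = 1, topic: Optional[str] = None) -> List[dict]:
        q = embed(question)
        # Keep only documents whose metadata matches the requested topic, if any.
        candidates = [r for r in self.rows if topic is None or r[2].get("topic") == topic]
        ranked = sorted(candidates, key=lambda r: cosine(q, r[0]), reverse=True)
        return [{"text": text, **meta} for _, text, meta in ranked[:k]]


store = ToyVectorStore()
store.add("Rocket launches pose a threat to space travel", {"topic": "science", "source": "a"})
store.add("Stock markets rally after rate cut", {"topic": "finance", "source": "b"})
print(store.query("new developments in space travel", topic="science"))
```

Filtering on metadata before ranking keeps the nearest-neighbor search inside the requested topic, which is one plausible reason the request carries a topic at all.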

Serving the function

The image for this step can be created using the following Dockerfile (contains Chroma DB, Transformers, TensorFlow, and PyTorch):

FROM mlrun/mlrun
RUN pip install chromadb transformers tensorflow torch

If you're not using Iguazio's Jupyter, download serving.py. Now you can deploy the Nuclio function that serves the LLM:

serve_func = project.set_function(
    name="serve-llm",
    func="src/serving.py",
    image="gcr.io/iguazio/mlrun-genai/llmserve:1.0",
    kind="nuclio",
).apply(mlrun.auto_mount())

# Pass the model ID and the VectorDB path to the serving function
serve_func.set_envs(env_vars={"MODEL_ID": MODEL_ID, "CACHE_DIR": CACHE_DIR})

# Since the model is held in memory, use only one replica and one worker
# Since this runs on CPU only, inference can take about a minute, so increase the timeouts
serve_func.spec.min_replicas = 1
serve_func.spec.max_replicas = 1
serve_func.with_http(worker_timeout=120, gateway_timeout=150, workers=1)
serve_func.set_config("spec.readinessTimeoutSeconds", 1200)
serve_func = project.deploy_function(function="serve-llm")
> 2025-05-18 13:00:49,712 [info] Starting remote function deploy
2025-05-18 13:00:50  (info) Deploying function
2025-05-18 13:00:50  (info) Building
2025-05-18 13:00:50  (info) Staging files and preparing base images
2025-05-18 13:00:50  (warn) Using user provided base image, runtime interpreter version is provided by the base image
2025-05-18 13:00:50  (info) Building processor image
2025-05-18 13:05:36  (info) Build complete
2025-05-18 13:08:12  (info) Function deploy complete
> 2025-05-18 13:08:14,341 [info] Successfully deployed function: {"external_invocation_urls":["genai-tutorial-shapira-serve-llm.default-tenant.app.cust-cs-il.iguazio-cd0.com/"],"internal_invocation_urls":["nuclio-genai-tutorial-shapira-serve-llm.default-tenant.svc.cluster.local:8080"]}
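The external invocation URL in the log above suggests the function can also be called from outside MLRun with a plain HTTP POST. A minimal standard-library sketch, assuming the endpoint accepts the same JSON body that function.invoke sends in the test below and that you add the right scheme yourself (the log prints the host without one). build_llm_request is a hypothetical helper; it only builds the request.

```python
import json
import urllib.request


def build_llm_request(base_url: str, question: str, topic: str) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for the serving function."""
    body = json.dumps({"question": question, "topic": topic}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# To send it (CPU inference is slow, so allow a long timeout):
# with urllib.request.urlopen(req, timeout=150) as resp:
#     answer = json.loads(resp.read())
```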

Test Serving Function

body = {
    "question": "What are some new developments in space travel?",
    "topic": "science",
}
resp = serve_func.function.invoke("/", body=json.dumps(body))
print(resp["response"])
Some new developments in space travel include the successful launch of a rocket from Earth, which poses a threat to space travel. Additionally, there have been advancements in the field of science, with new discoveries and breakthroughs being made. The news also covers topics such as politics, royal affairs, showbiz and TV, sports, finance, travel, life and style, comment, and world news.
print(resp["sources"])
['https://www.express.co.uk/news/science/1324095/space-news-spacex-rocket-launch-stars-elon-musk-Comet-Neowise-latest']
print(resp["prompt"])
The instruction below describes a task. Write a response that appropriately completes the request.

### Instruction:
User question:
What are some new developments in space travel?

Context:
Space news: Rocket launches from earth pose threat to space travel | Science | News | Express.co.uk Express. Home of the Daily and Sunday Express. Puzzles Horoscopes Express Rated Shop Paper Newsletters Login Register Your Account Newsletters Bookmarks Sign OutUkUs 14C Find us on FacebookFollow us on WhatsApp Follow us on TwitterFind us on Instagram Find us on Youtube Search HOME News Politics Royal Showbiz & TV Sport Finance Travel Life & Style Comment UK World Politics Royal US Weather Science

### Response:
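The printed prompt shows that serving.py fills a fixed instruction template with the user question and the retrieved context. A sketch that approximates that template from the output above; PROMPT_TEMPLATE and build_prompt are hypothetical names, and the exact whitespace in the real serving.py may differ.

```python
from typing import List

# Template reconstructed from the printed prompt; whitespace is approximate.
PROMPT_TEMPLATE = (
    "The instruction below describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n"
    "User question:\n{question}\n\n"
    "Context:\n{context}\n\n"
    "### Response:\n"
)


def build_prompt(question: str, context_chunks: List[str]) -> str:
    # Join the retrieved chunks and place them after the question.
    context = "\n".join(chunk.strip() for chunk in context_chunks)
    return PROMPT_TEMPLATE.format(question=question, context=context)


print(build_prompt("What are some new developments in space travel?", ["Space news: ..."]))
```

Ending the prompt with "### Response:" leaves the model to continue from that marker, which is how instruction-style prompts for small models such as Phi-2 are commonly framed.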
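
Register the functions in the project by their DB references, set the project source, and save the project so that the workflow can use them:

project.set_function(f"db://{project.name}/fetch-vectordb-data")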
project.set_function(f"db://{project.name}/build-vectordb")
project.set_function(f"db://{project.name}/serve-llm")
project.set_source(f"db://{project.name}")
project.save()
<mlrun.projects.project.MlrunProject at 0x7f18519288e0>

Run E2E Workflow

%%writefile workflow.py
import mlrun
from kfp import dsl


@dsl.pipeline(name="GenAI demo")
def kfpipeline(data_set, cache_dir):
    project = mlrun.get_current_project()

    # Fetch the raw dataset and log it as the "vector-db-dataset" artifact
    fetch = project.run_function(
        function="fetch-vectordb-data",
        name="fetch-vectordb-data-run",
        handler="handler",
        params={"data_set": data_set},
        outputs=["vector-db-dataset"],
    )

    # Load the fetched dataset into the Chroma vector DB under cache_dir
    vectordb_build = project.run_function(
        function="build-vectordb",
        inputs={"df": fetch.outputs["vector-db-dataset"]},
        params={"cache_dir": cache_dir},
        handler="handler_chroma",
    )

    # Deploy the serving function only after the vector DB is built
    project.deploy_function("serve-llm").after(vectordb_build)
Writing workflow.py
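The workflow never orders the first two steps explicitly: Kubeflow Pipelines infers that dependency because the build step consumes fetch.outputs, while the deploy step is ordered explicitly with .after(). A library-free sketch of the resulting execution order, using Python's graphlib (3.9+) rather than KFP; the step names are taken from workflow.py.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps it depends on, mirroring workflow.py.
pipeline = {
    "fetch-vectordb-data": set(),
    "build-vectordb": {"fetch-vectordb-data"},  # data dependency via outputs
    "serve-llm": {"build-vectordb"},  # explicit .after() dependency
}

order = list(TopologicalSorter(pipeline).static_order())
print(order)  # ['fetch-vectordb-data', 'build-vectordb', 'serve-llm']
```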
project.set_workflow("main", "workflow.py", embed=True)
project.save()
<mlrun.projects.project.MlrunProject at 0x7f18519288e0>
run_id = project.run(
    "main",
    arguments={
        "cache_dir": CACHE_DIR,
        "data_set": DATA_SET,
    },
    engine="remote",
    watch=True,
)