Serving pre-trained ML/DL models


This notebook demonstrates how to serve standard ML/DL models using MLRun Serving.

Make sure you have gone over the basics in the MLRun Quick Start Tutorial.

MLRun serving can produce managed real-time serverless pipelines from various tasks, including MLRun models or standard model files. The pipelines use the Nuclio real-time serverless engine, which can be deployed anywhere. Nuclio is a high-performance open-source "serverless" framework that's focused on data, I/O, and compute-intensive workloads.

MLRun serving supports advanced real-time data processing and model serving pipelines.
For more details and examples, see the MLRun serving pipelines documentation.

In this tutorial

Using pre-built MLRun serving classes and images
Create and test the serving function
Deploy the serving function
Build a custom serving class
Build an advanced model serving graph

MLRun installation and configuration#

Before running this notebook, make sure the mlrun package is installed (pip install mlrun) and that you have configured access to the MLRun service.

# Install MLRun if not installed, run this only once. Restart the notebook after the install!
%pip install mlrun

Get or create a new project

You should create, load or use (get) an MLRun Project. The get_or_create_project() method tries to load the project from the MLRun DB. If the project does not exist, it creates a new one.

import mlrun

project = mlrun.get_or_create_project(
    "tutorial", context="./", user_project=True, allow_cross_project=True
)

Using pre-built MLRun serving classes#

MLRun contains built-in serving functionality for the major ML/DL frameworks (Scikit-Learn, TensorFlow.Keras, ONNX, XGBoost, LightGBM, and PyTorch).

The following table specifies, for each framework, the corresponding MLRun ModelServer serving class and its dependencies:

framework	serving class	dependencies
Scikit-Learn	mlrun.frameworks.sklearn.SKLearnModelServer	scikit-learn
TensorFlow.Keras	mlrun.frameworks.tf_keras.TFKerasModelServer	tensorflow
ONNX	mlrun.frameworks.onnx.ONNXModelServer	onnxruntime
XGBoost	mlrun.frameworks.xgboost.XGBoostModelServer	xgboost
LightGBM	mlrun.frameworks.lgbm.LGBMModelServer	lightgbm
PyTorch	mlrun.frameworks.pytorch.PyTorchModelServer	torch

For GPU support, use the mlrun/mlrun-gpu image, which adds GPU drivers and support.

Example using SKlearn and TF Keras models

The following two examples show how to specify the parameters. They use standard pre-trained models (trained on the iris dataset) stored in the MLRun samples repository. (You can use your own models instead.)

models_dir = mlrun.get_sample_path("models/serving/")

suffix = mlrun.__version__.split("-")[0].replace(".", "_")

framework = "sklearn"  # change to 'keras' to try the 2nd option
kwargs = {}
if framework == "sklearn":
    serving_class = "mlrun.frameworks.sklearn.SKLearnModelServer"
    model_path = models_dir + f"sklearn-{suffix}.pkl"
    image = "mlrun/mlrun"
    requirements = []
else:
    serving_class = "mlrun.frameworks.tf_keras.TFKerasModelServer"
    model_path = models_dir + "keras.h5"
    image = "mlrun/mlrun"  # or mlrun/mlrun-gpu when using GPUs
    kwargs["labels"] = {"model-format": "h5"}
    requirements = ["tensorflow==2.8.1"]
    %pip install tensorflow==2.8.1
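
The suffix line above derives a file-name suffix from the installed MLRun version: it keeps the part before the first hyphen and replaces dots with underscores. A minimal sketch of that string handling in isolation follows. The version strings are made up for illustration, and the helper name version_suffix is hypothetical.

```python
def version_suffix(version: str) -> str:
    """Mirror the expression used above: keep the text before the first '-',
    then replace dots with underscores."""
    return version.split("-")[0].replace(".", "_")


# Illustrative version strings only; not tied to any specific MLRun release.
for v in ("1.7.0", "1.7.0-rc3"):
    print(v, "->", f"sklearn-{version_suffix(v)}.pkl")
```

Both strings map to the same suffix, so a pre-release build looks for the same sample file as the matching release.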

Log the model#

The model and its metadata are first registered in MLRun's Model Registry. Use the log_model() method to specify the model files and metadata (metrics, schema, parameters, etc.).

import pandas as pd

model_object = project.log_model(
    f"{framework}-model",
    model_file=model_path,
    training_set=pd.DataFrame(
        data=[[1.5, 1.5, 1.5, 1.5]],
        columns=[
            "sepal_length_cm",
            "sepal_width_cm",
            "petal_length_cm",
            "petal_width_cm",
        ],
    ),
    **kwargs,
)

Create and test the serving function#

Create a new serving function, specifying its name and the correct image (with your desired framework).

If you want to add specific packages to the base image, specify the requirements attribute. For example:
serving_fn = mlrun.new_function(name="serving", image=image, kind="serving", requirements=["tensorflow==2.8.1"])

The following example uses a basic topology of a model router and adds a single model behind it. (You can add multiple models to the same function.)

import pandas as pd

serving_fn = mlrun.new_function(
    name="model-serving", image=image, kind="serving", requirements=requirements
)
serving_fn.add_model(
    framework, model_path=model_object.uri, class_name=serving_class, to_list=True
)

# Plot the serving topology input -> router -> model
serving_fn.plot(rankdir="LR")

[Figure: serving topology plot (input -> router -> model)]
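
The router in this topology selects a model by the name embedded in the request path, which is why the calls in this tutorial use paths such as /v2/models/sklearn/infer. As a rough illustration of that dispatch idea, here is a plain-Python sketch. It is not MLRun's router implementation: ToyRouter and its methods are hypothetical names, and the "model" is a dummy function.

```python
from typing import Callable, Dict, Optional


class ToyRouter:
    """Hypothetical stand-in that dispatches /v2/models/<name>/<operation> paths."""

    def __init__(self) -> None:
        self.models: Dict[str, Callable[[list], list]] = {}

    def add_model(self, name: str, predict: Callable[[list], list]) -> None:
        self.models[name] = predict

    def handle(self, path: str, body: Optional[dict] = None) -> dict:
        parts = [p for p in path.split("/") if p]
        if parts[:2] != ["v2", "models"]:
            raise ValueError(f"unsupported path: {path}")
        if len(parts) == 2:
            # No model name in the path: list the registered models.
            return {"models": list(self.models)}
        name = parts[2]
        operation = parts[3] if len(parts) > 3 else "infer"
        if name not in self.models:
            raise KeyError(f"model {name!r} is not registered")
        if operation != "infer" or body is None:
            raise ValueError("this sketch only supports 'infer' with a body")
        return {"model_name": name, "outputs": self.models[name](body["inputs"])}


router = ToyRouter()
# Dummy "model": returns the index of the largest feature in each row.
router.add_model("sklearn", lambda rows: [row.index(max(row)) for row in rows])

listing = router.handle("/v2/models/")
result = router.handle("/v2/models/sklearn/infer", {"inputs": [[5.2, 2.7, 3.9, 1.4]]})
print(listing)
print(result)
```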

Simulate the model server locally (using the mock_server)

# Create a mock server that represents the serving pipeline
server = serving_fn.to_mock_server()

Test the mock model server endpoint

List the served models

server.test("/v2/models/", method="GET")

{'models': ['sklearn']}

Infer using test data

sample = {"inputs": [[5.2, 2.7, 3.9, 1.4], [6.4, 3.1, 5.5, 1.8]]}
server.test(path=f"/v2/models/{framework}/infer", body=sample)

X does not have valid feature names, but RandomForestClassifier was fitted with feature names

{'id': 'a7e055cecf4f49f6af68df7da401ceb9',
 'model_name': 'sklearn',
 'outputs': [1, 2],
 'timestamp': '2025-05-16 11:52:42.498906+00:00'}
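
The response carries one entry in outputs per row in inputs, in the same order. A small sketch that pairs them up, using the request and response values shown above. The class names are an assumption (the conventional iris label order); this tutorial does not state how the sample model encodes its classes, so check them against your own model.

```python
# Assumed label order (conventional iris encoding); verify against your own model.
ASSUMED_CLASS_NAMES = ["setosa", "versicolor", "virginica"]

request_body = {"inputs": [[5.2, 2.7, 3.9, 1.4], [6.4, 3.1, 5.5, 1.8]]}
# Relevant fields copied from the response shown above.
response_body = {"model_name": "sklearn", "outputs": [1, 2]}


def pair_predictions(request: dict, response: dict) -> list:
    """Pair each input row with the (assumed) class name of its prediction."""
    inputs, outputs = request["inputs"], response["outputs"]
    if len(inputs) != len(outputs):
        raise ValueError("expected one output per input row")
    return [(row, ASSUMED_CLASS_NAMES[label]) for row, label in zip(inputs, outputs)]


pairs = pair_predictions(request_body, response_body)
for row, name in pairs:
    print(row, "->", name)
```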

See more API options and parameters in Model serving API.

Deploy the serving function#

Deploy the serving function, then use invoke() to test it with the provided samples through two APIs: infer and infer_dict.

project.deploy_function(serving_fn)

> 2025-05-16 11:52:42,514 [info] Starting remote function deploy
2025-05-16 11:52:42  (info) Deploying function
2025-05-16 11:52:42  (info) Building
2025-05-16 11:52:42  (info) Staging files and preparing base images
2025-05-16 11:52:42  (warn) Using user provided base image, runtime interpreter version is provided by the base image
2025-05-16 11:52:42  (info) Building processor image
2025-05-16 11:53:38  (info) Build complete
2025-05-16 11:54:30  (info) Function deploy complete
> 2025-05-16 11:54:33,606 [info] Model endpoint creation task completed with state succeeded
> 2025-05-16 11:54:33,606 [info] Successfully deployed function: {"external_invocation_urls":["tutorial-iguazio-model-serving.default-tenant.app.iguazio.com/"],"internal_invocation_urls":["nuclio-tutorial-iguazio-model-serving.default-tenant.svc.cluster.local:8080"]}

DeployStatus(state=ready, outputs={'endpoint': 'http://tutorial-iguazio-model-serving.default-tenant.app.iguazio.com/', 'name': 'tutorial-iguazio-model-serving'})

serving_fn.invoke(path=f"/v2/models/{framework}/infer", body=sample)

> 2025-05-16 11:54:33,656 [info] Invoking function: {"method":"POST","path":"http://nuclio-tutorial-iguazio-model-serving.default-tenant.svc.cluster.local:8080/v2/models/sklearn/infer"}

{'id': 'b67671c5-7e86-459f-8b8d-e763fe488233',
 'model_name': 'sklearn',
 'outputs': [1, 2],
 'timestamp': '2025-05-16 11:54:33.680048+00:00',
 'model_endpoint_uid': '38beeeee25e64c51a0ff5185dc31e208'}
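
The log line above shows that invoke() issues an HTTP POST to the function, so any HTTP client can send the same JSON body. The sketch below only builds such a request with the Python standard library, using the external endpoint URL printed by the deploy step above. Your URL will differ, and any authentication your cluster requires is not shown. The actual send is left commented out.

```python
import json
import urllib.request

# Endpoint copied from the deploy output above; replace it with your own.
endpoint = "http://tutorial-iguazio-model-serving.default-tenant.app.iguazio.com/"
sample = {"inputs": [[5.2, 2.7, 3.9, 1.4], [6.4, 3.1, 5.5, 1.8]]}

http_request = urllib.request.Request(
    url=endpoint.rstrip("/") + "/v2/models/sklearn/infer",
    data=json.dumps(sample).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
print(http_request.get_method(), http_request.full_url)

# Uncomment to send the request (requires network access to the endpoint):
# with urllib.request.urlopen(http_request) as resp:
#     print(json.loads(resp.read()))
```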

sample_dict = {
    "inputs": [
        {
            "sepal_length_cm": 5.2,
            "sepal_width_cm": 2.7,
            "petal_length_cm": 3.9,
            "petal_width_cm": 1.4,
        },
        {
            "sepal_length_cm": 6.4,
            "sepal_width_cm": 3.1,
            "petal_width_cm": 1.8,
            "petal_length_cm": 5.5,
        },
    ]
}

serving_fn.invoke(path=f"/v2/models/{framework}/infer_dict", body=sample_dict)

> 2025-05-16 11:54:33,691 [info] Invoking function: {"method":"POST","path":"http://nuclio-tutorial-iguazio-model-serving.default-tenant.svc.cluster.local:8080/v2/models/sklearn/infer_dict"}

{'id': 'a554d816-65e0-48b2-b715-fb8818f3676f',
 'model_name': 'sklearn',
 'outputs': [1, 2],
 'timestamp': '2025-05-16 11:54:33.696408+00:00',
 'model_endpoint_uid': '38beeeee25e64c51a0ff5185dc31e208'}
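
Both calls return the same outputs even though the second record lists petal_width_cm before petal_length_cm, which suggests that infer_dict matches values by feature name rather than by position. A sketch of that idea in plain Python follows. It is an illustration only, not MLRun's implementation: it assumes the feature order is the column order of the training_set logged earlier, and records_to_rows is a hypothetical helper.

```python
# Assumed feature order: the training_set columns used when logging the model.
FEATURE_ORDER = [
    "sepal_length_cm",
    "sepal_width_cm",
    "petal_length_cm",
    "petal_width_cm",
]


def records_to_rows(records: list, feature_order: list) -> list:
    """Convert name-keyed records into positional rows in a fixed feature order."""
    rows = []
    for i, record in enumerate(records):
        missing = [name for name in feature_order if name not in record]
        if missing:
            raise KeyError(f"record {i} is missing features: {missing}")
        rows.append([record[name] for name in feature_order])
    return rows


records = [
    {"sepal_length_cm": 5.2, "sepal_width_cm": 2.7, "petal_length_cm": 3.9, "petal_width_cm": 1.4},
    # Key order differs here on purpose, as in the sample above.
    {"sepal_length_cm": 6.4, "sepal_width_cm": 3.1, "petal_width_cm": 1.8, "petal_length_cm": 5.5},
]
rows = records_to_rows(records, FEATURE_ORDER)
print(rows)
```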

Build a custom serving class#

Model serving classes implement the full model serving functionality, which includes loading models, pre- and post-processing, prediction, explainability, and model monitoring.

Model serving classes must inherit from mlrun.serving.V2ModelServer and, at a minimum, implement two methods: load(), which downloads the model file(s) and loads the model into memory, and predict(), which accepts the request payload and returns the prediction/inference results.

For more detailed information on custom serving classes, see Build your own model serving class.

The following code demonstrates a minimal scikit-learn (a.k.a. sklearn) serving-class implementation:

from cloudpickle import load
import numpy as np
import mlrun


class ClassifierModel(mlrun.serving.V2ModelServer):
    def load(self) -> None:
        """Load and initialize the model and/or other elements."""
        model_file, extra_data = self.get_model(".pkl")
        # Use a context manager so the file handle is closed after loading
        with open(model_file, "rb") as f:
            self.model = load(f)

    def predict(self, body: dict) -> list:
        """Generate model predictions from sample."""
        # The request body carries the feature rows under the "inputs" key
        feats = np.asarray(body["inputs"])
        result: np.ndarray = self.model.predict(feats)
        return result.tolist()
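
Because load() and predict() are ordinary methods, the prediction logic can be exercised before it is wired into a serving function. The sketch below deliberately does not import MLRun: StubServer is a hypothetical stand-in that only mimics the load-then-predict call order described above, and ThresholdModel is a dummy model invented for the example.

```python
class ThresholdModel:
    """Dummy model: predicts 1 when the mean of a row exceeds a threshold."""

    def __init__(self, threshold: float) -> None:
        self.threshold = threshold

    def predict(self, rows: list) -> list:
        return [int(sum(row) / len(row) > self.threshold) for row in rows]


class StubServer:
    """Hypothetical stand-in for a serving class (not mlrun.serving.V2ModelServer)."""

    def __init__(self) -> None:
        self.model = None

    def load(self) -> None:
        # A real serving class would fetch and deserialize the model file here.
        self.model = ThresholdModel(threshold=3.0)

    def predict(self, body: dict) -> list:
        if self.model is None:
            raise RuntimeError("load() must be called before predict()")
        return self.model.predict(body["inputs"])


server = StubServer()
server.load()
outputs = server.predict({"inputs": [[5.2, 2.7, 3.9, 1.4], [0.5, 0.5, 0.5, 0.5]]})
print(outputs)
```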

To create a function that incorporates the code of the new class (in serving.py), use code_to_function:

serving_fn = mlrun.code_to_function(
    "serving", filename="serving.py", kind="serving", image="mlrun/mlrun"
)
serving_fn.add_model("my_model", model_path=model_file, class_name="ClassifierModel")

Build an advanced model serving graph#

MLRun graphs enable building and running DAGs (directed acyclic graphs). Graphs are composed of individual steps. The first graph element accepts an Event object, transforms/processes the event, and passes the result to the next step in the graph, and so on. The final result can be written to a destination (file, DB, stream, etc.) or returned to the caller (one of the graph steps can be marked with .respond()).

Serving graphs can be composed of pre-defined graph steps, block-type elements (model servers, routers, ensembles, data readers and writers, data engineering tasks, validators, etc.), custom steps, or native Python classes/functions. A graph can have data processing steps, model ensembles, model servers, post-processing, etc. Graphs can auto-scale and span multiple function containers (connected through streaming protocols).
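
To make the event flow concrete, here is a toy, single-process sketch of a linear chain of steps in which one step is marked as the responder. It is not the MLRun graph API: Event, Step, and run_chain are hypothetical names, and real graphs can branch and span multiple containers, which this sketch does not attempt.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class Event:
    body: Any


@dataclass
class Step:
    name: str
    handler: Callable[[Any], Any]
    responder: bool = False  # similar in spirit to marking a step with .respond()


def run_chain(steps: List[Step], event: Event) -> Any:
    """Pass the event through each step in order. Return the responder's result,
    or the last step's result when no step is marked as the responder."""
    response = None
    has_responder = False
    for step in steps:
        event = Event(body=step.handler(event.body))
        if step.responder:
            response, has_responder = event.body, True
    return response if has_responder else event.body


sink: list = []  # stands in for an output destination such as a file or stream
steps = [
    Step("scale", lambda rows: [[x * 10 for x in row] for row in rows]),
    Step("sum", lambda rows: [sum(row) for row in rows], responder=True),
    Step("store", lambda sums: sink.append(sums) or sums),
]
result = run_chain(steps, Event(body=[[1, 2], [3, 4]]))
print(result, sink)
```

In this sketch the caller receives the output of the "sum" step, while the later "store" step still runs and writes to the sink.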

See the Advanced Model Serving Graph Notebook Example.

Done!#

Congratulations! You've completed Part 3 of the MLRun getting-started tutorial. Proceed to Part 4: Projects and automated ML pipeline to learn how to create an automated pipeline for your project.