Serving pre-trained ML/DL models#

This notebook demonstrates how to serve standard ML/DL models using MLRun Serving.

Make sure you have gone over the basics in the MLRun Quick Start Tutorial.

MLRun serving can produce managed real-time serverless pipelines from various tasks, including MLRun models or standard model files. The pipelines use the Nuclio real-time serverless engine, which can be deployed anywhere. Nuclio is a high-performance open-source "serverless" framework that's focused on data, I/O, and compute-intensive workloads.

MLRun serving supports advanced real-time data processing and model serving pipelines.
For more details and examples, see the MLRun serving pipelines documentation.

Tutorial steps:

  • MLRun installation and configuration
  • Using pre-built MLRun serving classes
  • Logging the model and creating, testing, and deploying the serving function
  • Building a custom serving class
  • Building an advanced model serving graph

MLRun installation and configuration#

Before running this notebook, make sure the mlrun package is installed (pip install mlrun) and that you have configured access to the MLRun service.

# Install MLRun if not installed, run this only once. Restart the notebook after the install!
%pip install mlrun
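If you are running against a remote MLRun service, point the SDK at it before loading the project; inside a managed MLRun environment this is already configured. A minimal sketch, assuming a hypothetical service URL:

import os

# Optional: connect to a remote MLRun API service (hypothetical URL; replace with yours,
# or skip this entirely when the environment is already configured)
os.environ["MLRUN_DBPATH"] = "https://mlrun-api.example.com"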

Get or create a new project

You should create, load or use (get) an MLRun Project. The get_or_create_project() method tries to load the project from the MLRun DB. If the project does not exist, it creates a new one.

import mlrun

project = mlrun.get_or_create_project("tutorial", context="./", user_project=True)
> 2024-08-06 09:21:08,324 [info] Project loaded successfully: {"project_name":"tutorial"}

Using pre-built MLRun serving classes#

MLRun contains built-in serving functionality for the major ML/DL frameworks (Scikit-Learn, TensorFlow.Keras, ONNX, XGBoost, LightGBM, and PyTorch).

The following table specifies, for each framework, the corresponding MLRun ModelServer serving class and its dependencies:

| framework        | serving class                                 | dependencies |
| ---------------- | --------------------------------------------- | ------------ |
| SciKit-Learn     | mlrun.frameworks.sklearn.SklearnModelServer   | scikit-learn |
| TensorFlow.Keras | mlrun.frameworks.tf_keras.TFKerasModelServer  | tensorflow   |
| ONNX             | mlrun.frameworks.onnx.ONNXModelServer         | onnxruntime  |
| XGBoost          | mlrun.frameworks.xgboost.XGBoostModelServer   | xgboost      |
| LightGBM         | mlrun.frameworks.lgbm.LGBMModelServer         | lightgbm     |
| PyTorch          | mlrun.frameworks.pytorch.PyTorchModelServer   | torch        |

For GPU support, use the mlrun/mlrun-gpu image (which adds GPU drivers and support).
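For example, a minimal sketch of a GPU serving function that requests one GPU for its pods (assumes your cluster has GPU nodes; adjust the limits to your environment):

# A hedged sketch: serving function on the GPU image, requesting one GPU per replica
gpu_serving_fn = mlrun.new_function(
    name="serving-gpu", image="mlrun/mlrun-gpu", kind="serving"
)
gpu_serving_fn.with_limits(gpus=1)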

Example using SKlearn and TF Keras models

See how to specify the parameters in the following two examples. These use standard pre-trained models (trained on the Iris dataset) stored in the MLRun samples repository. (You can use your own models instead.)

models_dir = mlrun.get_sample_path("models/serving/")

# Pick the sample model file matching the MLRun/Python version (avoids pickle compatibility warnings)
import sys

suffix = (
    mlrun.__version__.split("-")[0].replace(".", "_")
    if sys.version_info[1] > 7
    else "3.7"
)

framework = "sklearn"  # change to 'keras' to try the 2nd option
kwargs = {}
if framework == "sklearn":
    serving_class = "mlrun.frameworks.sklearn.SklearnModelServer"
    model_path = models_dir + f"sklearn-{suffix}.pkl"
    image = "mlrun/mlrun"
    requirements = []
else:
    serving_class = "mlrun.frameworks.tf_keras.TFKerasModelServer"
    model_path = models_dir + "keras.h5"
    image = "mlrun/mlrun"  # or mlrun/mlrun-gpu when using GPUs
    kwargs["labels"] = {"model-format": "h5"}
    requirements = ["tensorflow==2.8.1"]
    %pip install tensorflow==2.8.1

Log the model#

The model and its metadata are first registered in MLRun's Model Registry. Use the log_model() method to specify the model files and metadata (metrics, schema, parameters, etc.).

import pandas as pd

model_object = project.log_model(
    f"{framework}-model",
    model_file=model_path,
    training_set=pd.DataFrame(
        data=[[1.5, 1.5, 1.5, 1.5]],
        columns=[
            "sepal_length_cm",
            "sepal_width_cm",
            "petal_length_cm",
            "petal_width_cm",
        ],
    ),
    **kwargs,
)
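The call returns a model artifact object; its store URI is what the serving function references in the next step:

# The logged model's store URI is passed to add_model() below
print(model_object.uri)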

Create and test the serving function#

Create a new serving function, specify its name and the correct image (with your desired framework).

If you want to add specific packages to the base image, specify the requirements attribute, for example:

serving_fn = mlrun.new_function(name="serving", image=image, kind="serving", requirements=["tensorflow==2.8.1"])

The following example uses a basic topology of a model router and adds a single model behind it. (You can add multiple models to the same function.)

import pandas as pd

serving_fn = mlrun.new_function(
    name="model-serving", image=image, kind="serving", requirements=requirements
)
serving_fn.add_model(
    framework, model_path=model_object.uri, class_name=serving_class, to_list=True
)

# Plot the serving topology input -> router -> model
serving_fn.plot(rankdir="LR")
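As noted above, you can register more than one model behind the same router. A hedged, optional sketch (it reuses the same artifact purely for illustration; in practice point model_path at a different logged model):

# Optional, illustrative only: a second model entry behind the same router;
# each model gets its own /v2/models/<name>/... endpoint
serving_fn.add_model(
    "sklearn-2", model_path=model_object.uri, class_name=serving_class, to_list=True
)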

Simulate the model server locally (using the mock_server)

# Create a mock server that represents the serving pipeline
server = serving_fn.to_mock_server()
> 2024-08-06 09:21:08,641 [warning] run command, file or code were not specified
> 2024-08-06 09:21:14,893 [info] model sklearn was loaded
> 2024-08-06 09:21:14,894 [info] Loaded ['sklearn']

Test the mock model server endpoint

  • List the served models

server.test("/v2/models/", method="GET")
{'models': ['sklearn']}
  • Infer using test data

sample = {"inputs": [[5.2, 2.7, 3.9, 1.4], [6.4, 3.1, 5.5, 1.8]]}
server.test(path=f"/v2/models/{framework}/infer", body=sample)
X does not have valid feature names, but RandomForestClassifier was fitted with feature names
{'id': 'e75c42b1daf9494d9900e7fdef193f0e',
 'model_name': 'sklearn',
 'outputs': [1, 2]}

See more API options and parameters in Model serving API.
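For example, a hedged sketch of fetching a single model's metadata through the per-model GET endpoint (response fields may vary by MLRun version):

# Query metadata (name, inputs/outputs schema, etc.) for one served model
server.test(f"/v2/models/{framework}", method="GET")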

Deploy the serving function#

Deploy the serving function and use invoke() to test it with the provided sample, using both APIs: infer and infer_dict.

project.deploy_function(serving_fn)
> 2024-08-06 09:21:15,039 [info] Starting remote function deploy
2024-08-06 09:21:15  (info) Deploying function
2024-08-06 09:21:15  (info) Building
2024-08-06 09:21:15  (info) Staging files and preparing base images
2024-08-06 09:21:15  (warn) Using user provided base image, runtime interpreter version is provided by the base image
2024-08-06 09:21:15  (info) Building processor image
2024-08-06 09:23:30  (info) Build complete
2024-08-06 09:23:40  (info) Function deploy complete
> 2024-08-06 09:23:46,810 [info] Successfully deployed function: {"external_invocation_urls":["tutorial-admin-serving.default-tenant.app.dev13.lab.iguazeng.com/"],"internal_invocation_urls":["nuclio-tutorial-admin-serving.default-tenant.svc.cluster.local:8080"]}
DeployStatus(state=ready, outputs={'endpoint': 'http://tutorial-admin-serving.default-tenant.app.dev13.lab.iguazeng.com/', 'name': 'tutorial-admin-serving'})
serving_fn.invoke(path=f"/v2/models/{framework}/infer", body=sample)
> 2024-08-06 09:23:46,895 [info] Invoking function: {"method":"POST","path":"http://nuclio-tutorial-admin-serving.default-tenant.svc.cluster.local:8080/v2/models/sklearn/infer"}
{'id': 'e26ca488-cd0a-4898-af27-17d19cf0b5ab',
 'model_name': 'sklearn',
 'outputs': [1, 2],
 'model_version': 'latest'}
sample_dict = {
    "inputs": [
        {
            "sepal_length_cm": 5.2,
            "sepal_width_cm": 2.7,
            "petal_length_cm": 3.9,
            "petal_width_cm": 1.4,
        },
        {
            "sepal_length_cm": 6.4,
            "sepal_width_cm": 3.1,
            "petal_width_cm": 1.8,
            "petal_length_cm": 5.5,
        },
    ]
}

serving_fn.invoke(path=f"/v2/models/{framework}/infer_dict", body=sample_dict)
> 2024-08-06 09:23:47,202 [info] Invoking function: {"method":"POST","path":"http://nuclio-tutorial-admin-serving.default-tenant.svc.cluster.local:8080/v2/models/sklearn/infer_dict"}
{'id': '1911b48b-2cc4-4bcb-8270-4d6e98247c2b',
 'model_name': 'sklearn',
 'outputs': [1, 2],
 'model_version': 'latest'}

Build a custom serving class#

Model serving classes implement the full model serving functionality, which includes loading models, pre- and post-processing, prediction, explainability, and model monitoring.

Model serving classes must inherit from mlrun.serving.V2ModelServer, and at the minimum implement the load() (download the model file(s) and load the model into memory) and predict() (accept request payload and return prediction/inference results) methods.

For more detailed information on custom serving classes, see Build your own model serving class.

The following code demonstrates a minimal scikit-learn (a.k.a. sklearn) serving-class implementation:

from cloudpickle import load
import numpy as np
from typing import List
import mlrun

class ClassifierModel(mlrun.serving.V2ModelServer):
    def load(self) -> None:
        """load and initialize the model and/or other elements"""
        model_file, extra_data = self.get_model('.pkl')
        self.model = load(open(model_file, 'rb'))

    def predict(self, body: dict) -> list:
        """Generate model predictions from sample."""
        feats = np.asarray(body['inputs'])
        result: np.ndarray = self.model.predict(feats)
        return result.tolist()

To create a function that incorporates the code of the new class (in serving.py), use code_to_function():

serving_fn = mlrun.code_to_function('serving', filename='serving.py', kind='serving', image='mlrun/mlrun')
serving_fn.add_model('my_model', model_path=model_file, class_name='ClassifierModel')
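Before deploying, you can exercise the custom class locally with a mock server, in the same way as above. A minimal sketch, assuming model_file points at a pickled scikit-learn model:

# Simulate the custom serving class locally before deploying
server = serving_fn.to_mock_server()
server.test("/v2/models/my_model/infer", body={"inputs": [[5.2, 2.7, 3.9, 1.4]]})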

Build an advanced model serving graph#

MLRun graphs enable building and running DAGs (directed acyclic graphs). Graphs are composed of individual steps. The first graph element accepts an Event object, transforms/processes the event and passes the result to the next step in the graph, and so on. The final result can be written out to a destination (file, DB, stream, etc.) or returned back to the caller (one of the graph steps can be marked with .respond()).

The serving graphs can be composed of pre-defined graph steps, block-type elements (model servers, routers, ensembles, data readers and writers, data engineering tasks, validators, etc.), custom steps, or from native python classes/functions. A graph can have data processing steps, model ensembles, model servers, post-processing, etc. Graphs can auto-scale and span multiple function containers (connected through streaming protocols).
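For illustration, here is a hedged sketch of a small flow graph with preprocessing, a model step, and post-processing, where the last step returns the response to the caller. The preprocess/postprocess handlers and the ClassifierModel class are assumptions; they would be defined in the function's code (e.g. in serving.py):

# A hedged sketch of a flow-topology serving graph (handler and class names are assumed)
graph_fn = mlrun.code_to_function(
    "advanced-serving", filename="serving.py", kind="serving", image="mlrun/mlrun"
)
graph = graph_fn.set_topology("flow", engine="async")
graph.to(handler="preprocess", name="preprocess") \
    .to(class_name="ClassifierModel", name="classify", model_path=model_object.uri) \
    .to(handler="postprocess", name="postprocess").respond()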

See the Advanced Model Serving Graph Notebook Example.

Done!#

Congratulations! You've completed Part 3 of the MLRun getting-started tutorial. Proceed to Part 4: ML Pipeline to learn how to create an automated pipeline for your project.