Realtime monitoring and drift detection#

This tutorial illustrates the basic model monitoring capabilities of MLRun: deploying a model to a live endpoint and calculating data drift.

See the overview of model monitoring in Model monitoring description, and make sure you have reviewed the basics in the MLRun Quick Start Tutorial.

MLRun installation and configuration#

Before running this notebook, make sure mlrun is installed and that you have configured access to the MLRun service.

Set up the project#

First, import the dependencies and create an MLRun project. The project contains all of your models, functions, datasets, and other resources.

%config Completer.use_jedi = False
import os
import uuid

import pandas as pd
from sklearn.datasets import load_iris

import mlrun
from mlrun import import_function, get_or_create_project

project_name = "tutorial"
project = get_or_create_project(project_name, context="./")
> 2024-09-10 11:51:45,392 [info] Server and client versions are not the same but compatible: {'parsed_server_version': Version(major=1, minor=7, patch=0, prerelease='rc40', build=None), 'parsed_client_version': Version(major=1, minor=6, patch=3, prerelease=None, build=None)}
> 2024-09-10 11:51:45,430 [info] Loading project from path: {'project_name': 'tutorial', 'path': './'}
> 2024-09-10 11:52:00,855 [info] Project loaded successfully: {'project_name': 'tutorial', 'path': './', 'stored_in_db': True}

Note

This tutorial does not focus on training a model. Instead, it starts with a trained model and its corresponding training dataset.

Enable model monitoring#

Model monitoring is enabled per project. enable_model_monitoring() brings up the controller, schedules it according to the base_period, and deploys the writer.

The controller runs, by default, every 10 minutes, which is also the minimum interval. You can modify the frequency with the base_period parameter. To change the base_period, first run disable_model_monitoring(), then run enable_model_monitoring() with the new base_period value.

# Set the monitoring credentials/connections (here all backed by V3IO)
project.set_model_monitoring_credentials(None, "v3io", "v3io", "v3io")

# Deploy the controller, writer, and stream; base_period is in minutes
project.enable_model_monitoring(base_period=1)
'Submitted the model-monitoring controller, writer and stream deployment'
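
If you later want a different interval, a minimal sketch (assuming monitoring is already enabled; base_period is in minutes):

# Sketch: switch the controller to a 20-minute interval.
# disable_model_monitoring() must run before re-enabling with a new base_period.
project.disable_model_monitoring()
project.enable_model_monitoring(base_period=20)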

Log the model artifacts#

See full parameter details in log_model().

First download the pickle file.
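
If you don't have src/model.pkl, a minimal sketch that produces an equivalent pickle (a quick RandomForestClassifier trained on the Iris data, purely for illustration; the tutorial's own pre-trained model may differ):

# Optional sketch: train a quick classifier and pickle it to src/model.pkl
import os
import pickle
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
clf = RandomForestClassifier(random_state=42)
clf.fit(iris["data"], iris["target"])

os.makedirs("src", exist_ok=True)
with open("src/model.pkl", "wb") as f:
    pickle.dump(clf, f)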

# Build the training-set DataFrame that serves as the monitoring reference
iris = load_iris()
train_set = pd.DataFrame(
    iris["data"],
    columns=["sepal_length_cm", "sepal_width_cm", "petal_length_cm", "petal_width_cm"],
)

# Log the pre-trained model together with its training set
model_name = "RandomForestClassifier"
project.log_model(
    model_name,
    model_file="src/model.pkl",
    training_set=train_set,
    framework="sklearn",
)
<mlrun.artifacts.model.ModelArtifact at 0x7fc1921f7850>
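
You can fetch the logged model back from the project registry at any point; a quick sketch (the exact URI format may vary by version):

# Sketch: retrieve the logged model artifact and print its store URI
model_artifact = project.get_artifact(model_name)
print(model_artifact.uri)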

Import, enable monitoring, and deploy the serving function#

Import the model server function from the MLRun Function Hub, add the model that was logged via experiment tracking, and enable drift detection.

The model monitoring infrastructure was already enabled in Enable model monitoring. Now, enable monitoring on this specific function and its related models with set_tracking(). This tracks all inferences and predictions, which are used for drift detection.

Then you deploy the serving function, with drift detection enabled, in a single line of code.

The result of this step is that the model-monitoring stream pod writes data to Parquet, partitioned by model endpoint. Every base period, the controller checks for new data and, each time it finds some, sends it to the relevant app.

# Import the serving function
serving_fn = import_function(
    "hub://v2_model_server", project=project_name, new_name="serving"
)

# Add the model to the serving function's routing spec
serving_fn.add_model(
    model_name, model_path=f"store://models/{project_name}/{model_name}:latest"
)

# Enable monitoring on this serving function
serving_fn.set_tracking()

serving_fn.spec.build.requirements = ["scikit-learn"]

# Deploy the serving function
project.deploy_function(serving_fn)
> 2024-09-10 12:01:43,675 [info] Starting remote function deploy
2024-09-10 12:01:44  (info) Deploying function
2024-09-10 12:01:44  (info) Building
2024-09-10 12:01:44  (info) Staging files and preparing base images
2024-09-10 12:01:44  (warn) Using user provided base image, runtime interpreter version is provided by the base image
2024-09-10 12:01:44  (info) Building processor image
2024-09-10 12:05:09  (info) Build complete
2024-09-10 12:05:53  (info) Function deploy complete
> 2024-09-10 12:05:57,200 [info] Successfully deployed function: {'internal_invocation_urls': ['nuclio-tutorial-serving.default-tenant.svc.cluster.local:8080'], 'external_invocation_urls': ['tutorial-serving.default-tenant.app.vmdev94.lab.iguazeng.com/']}
DeployStatus(state=ready, outputs={'endpoint': 'http://tutorial-serving.default-tenant.app.vmdev94.lab.iguazeng.com/', 'name': 'tutorial-serving'})

View deployed resources#

At this point, you should see the model-monitoring-controller job in the UI under Projects | Jobs and Workflows.

Invoke the model#

See full parameter details in invoke().

import json
from random import choice
from time import sleep

iris = load_iris()
iris_data = iris["data"].tolist()

model_name = "RandomForestClassifier"
serving_1 = project.get_function("serving")

# Send 150 single-row inference requests with a short random pause between them
for _ in range(150):
    data_point = choice(iris_data)
    serving_1.invoke(
        f"v2/models/{model_name}/infer", json.dumps({"inputs": [data_point]})
    )
    sleep(choice([0.01, 0.04]))
> 2024-09-11 09:00:18,459 [info] Invoking function: {'method': 'POST', 'path': 'http://nuclio-tutorial-serving.default-tenant.svc.cluster.local:8080/v2/models/RandomForestClassifier/infer'}
> 2024-09-11 09:00:18,901 [info] Invoking function: {'method': 'POST', 'path': 'http://nuclio-tutorial-serving.default-tenant.svc.cluster.local:8080/v2/models/RandomForestClassifier/infer'}
> 2024-09-11 09:00:18,962 [info] Invoking function: {'method': 'POST', 'path': 'http://nuclio-tutorial-serving.default-tenant.svc.cluster.local:8080/v2/models/RandomForestClassifier/infer'}
> 2024-09-11 09:00:19,030 [info] Invoking function: {'method': 'POST', 'path': 'http://nuclio-tutorial-serving.default-tenant.svc.cluster.local:8080/v2/models/RandomForestClassifier/infer'}
> 2024-09-11 09:00:19,094 [info] Invoking function: {'method': 'POST', 'path': 'http://nuclio-tutorial-serving.default-tenant.svc.cluster.local:8080/v2/models/RandomForestClassifier/infer'}
> 2024-09-11 09:00:19,123 [info] Invoking function: {'method': 'POST', 'path': 'http://nuclio-tutorial-serving.default-tenant.svc.cluster.local:8080/v2/models/RandomForestClassifier/infer'}
> 2024-09-11 09:00:19,152 [info] Invoking function: {'method': 'POST', 'path': 'http://nuclio-tutorial-serving.default-tenant.svc.cluster.local:8080/v2/models/RandomForestClassifier/infer'}
> 2024-09-11 09:00:19,217 [info] Invoking function: {'method': 'POST', 'path': 'http://nuclio-tutorial-serving.default-tenant.svc.cluster.local:8080/v2/models/RandomForestClassifier/infer'}

At this stage you can see the model endpoints and minimal metadata (for example, last prediction and average latency) on the Models | Model Endpoints page.

../_images/model_endpoint_1.png

You can also see the basic statistics in Grafana.
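
You can also retrieve the endpoints programmatically; a minimal sketch using the MLRun DB client (attribute names may differ slightly across MLRun versions):

# Sketch: list the project's model endpoints via the MLRun DB API
db = mlrun.get_run_db()
endpoints = db.list_model_endpoints(project=project_name)
for endpoint in endpoints.endpoints:
    print(endpoint.metadata.uid, endpoint.spec.model)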

Register and deploy the model-monitoring apps#

The next step is to deploy the model-monitoring apps that generate the full metadata. Add each monitoring app to the project using set_model_monitoring_function(), then deploy it using deploy_function().

This example illustrates two monitoring apps:

  • The first is the default monitoring app.

  • The second integrates Evidently as an MLRun function to create MLRun artifacts.

After the apps are deployed, they appear in the UI under Real-time functions (Nuclio).

Default monitoring app#

First download the demo_app.
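
To see the shape of such an application, here is a hypothetical minimal sketch of a monitoring app class. The module paths, the do_tracking() signature, and the context attributes are assumptions that vary across MLRun versions; treat the downloaded demo_app.py as the authoritative version.

# Hypothetical sketch of src/demo_app.py -- not the tutorial's actual file
from mlrun.common.schemas.model_monitoring.constants import (
    ResultKindApp,
    ResultStatusApp,
)
from mlrun.model_monitoring.applications import (
    ModelMonitoringApplicationBase,
    ModelMonitoringApplicationResult,
)


class DemoMonitoringApp(ModelMonitoringApplicationBase):
    def do_tracking(self, monitoring_context) -> ModelMonitoringApplicationResult:
        # monitoring_context.sample_df (assumed attribute) holds the current window
        print(f"Got a window with {len(monitoring_context.sample_df)} rows")
        # Return a fixed result, just to show the expected result structure
        return ModelMonitoringApplicationResult(
            name="data_drift_test",
            value=0.5,
            kind=ResultKindApp.data_drift,
            status=ResultStatusApp.potential_detection,
        )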

# Register the first app, implemented in src/demo_app.py, under the name "myApp"
my_app = project.set_model_monitoring_function(
    func="src/demo_app.py",
    application_class="DemoMonitoringApp",
    name="myApp",
)

project.deploy_function(my_app)
> 2024-09-10 12:07:14,544 [info] Starting remote function deploy
2024-09-10 12:07:14  (info) Deploying function
2024-09-10 12:07:15  (info) Building
2024-09-10 12:07:15  (info) Staging files and preparing base images
2024-09-10 12:07:15  (warn) Using user provided base image, runtime interpreter version is provided by the base image
2024-09-10 12:07:15  (info) Building processor image
2024-09-10 12:09:00  (info) Build complete
2024-09-10 12:09:20  (info) Function deploy complete
> 2024-09-10 12:09:27,034 [info] Successfully deployed function: {'internal_invocation_urls': ['nuclio-tutorial-myapp.default-tenant.svc.cluster.local:8080'], 'external_invocation_urls': ['']}
DeployStatus(state=ready, outputs={'endpoint': 'http://', 'name': 'tutorial-myapp'})

Evidently app#

First download evidently_app.
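
As a rough, standalone illustration of the kind of drift report such an app produces internally (plain Evidently 0.4.x usage against the training set; not the tutorial's actual application code):

# Standalone sketch of an Evidently 0.4.x data-drift report; illustrative only
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

report = Report(metrics=[DataDriftPreset()])
report.run(
    reference_data=train_set,  # training data as the reference distribution
    current_data=train_set.sample(50, random_state=1),  # stand-in for live data
)
report.save_html("drift_report.html")  # or report.as_dict() for raw metrics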

# Register the second app, implemented in src/evidently_app.py, under the name "MyEvidentlyApp"
my_evidently_app = project.set_model_monitoring_function(
    func="src/evidently_app.py",
    image="mlrun/mlrun",
    requirements=[
        "evidently~=0.4.32",
    ],
    name="MyEvidentlyApp",
    application_class="DemoEvidentlyMonitoringApp",
    evidently_workspace_path=os.path.abspath(
        f"/v3io/projects/{project_name}/artifacts/evidently_workspace"
    ),
    evidently_project_id=str(uuid.uuid4()),
)

project.deploy_function(my_evidently_app)
> 2024-09-10 12:10:16,691 [info] Starting remote function deploy
2024-09-10 12:10:17  (info) Deploying function
2024-09-10 12:10:17  (info) Building
2024-09-10 12:10:17  (info) Staging files and preparing base images
2024-09-10 12:10:17  (warn) Using user provided base image, runtime interpreter version is provided by the base image
2024-09-10 12:10:17  (info) Building processor image
2024-09-10 12:12:58  (info) Build complete
2024-09-10 12:13:22  (info) Function deploy complete
> 2024-09-10 12:13:29,643 [info] Successfully deployed function: {'internal_invocation_urls': ['nuclio-tutorial-myevidentlyapp.default-tenant.svc.cluster.local:8080'], 'external_invocation_urls': ['']}
DeployStatus(state=ready, outputs={'endpoint': 'http://', 'name': 'tutorial-myevidentlyapp'})

Invoke the model again#

Every base_period, the controller checks for new datasets to send to the apps. Invoking the model a second time ensures that the previous window has closed, so the data sent to the apps covers a full monitoring window. From this point on, the applications are triggered by the controller: it checks the Parquet DB every 10 minutes (or your non-default base_period) and streams any new data to the apps.

import json
from random import choice
from time import sleep

iris = load_iris()
iris_data = iris["data"].tolist()

model_name = "RandomForestClassifier"
serving_1 = project.get_function("serving")

# Send another 150 requests so the previous monitoring window closes
for _ in range(150):
    data_point = choice(iris_data)
    serving_1.invoke(
        f"v2/models/{model_name}/infer", json.dumps({"inputs": [data_point]})
    )
    sleep(choice([0.01, 0.04]))
> 2024-09-11 09:32:28,831 [info] Invoking function: {'method': 'POST', 'path': 'http://nuclio-tutorial-serving.default-tenant.svc.cluster.local:8080/v2/models/RandomForestClassifier/infer'}
> 2024-09-11 09:32:29,171 [info] Invoking function: {'method': 'POST', 'path': 'http://nuclio-tutorial-serving.default-tenant.svc.cluster.local:8080/v2/models/RandomForestClassifier/infer'}
> 2024-09-11 09:32:29,198 [info] Invoking function: {'method': 'POST', 'path': 'http://nuclio-tutorial-serving.default-tenant.svc.cluster.local:8080/v2/models/RandomForestClassifier/infer'}
> 2024-09-11 09:32:29,222 [info] Invoking function: {'method': 'POST', 'path': 'http://nuclio-tutorial-serving.default-tenant.svc.cluster.local:8080/v2/models/RandomForestClassifier/infer'}
> 2024-09-11 09:32:29,250 [info] Invoking function: {'method': 'POST', 'path': 'http://nuclio-tutorial-serving.default-tenant.svc.cluster.local:8080/v2/models/RandomForestClassifier/infer'}
> 2024-09-11 09:32:29,275 [info] Invoking function: {'method': 'POST', 'path': 'http://nuclio-tutorial-serving.default-tenant.svc.cluster.local:8080/v2/models/RandomForestClassifier/infer'}

View the application results#

../_images/mm-myapp.png

And if you've used Evidently:

../_images/mm-logger-dashb-evidently.png

And an example of one of the Evidently graphs:

../_images/mm-evidently.png

View the status of the model monitoring jobs#

View the model monitoring jobs in Jobs and Workflows. Model monitoring jobs run continuously; therefore, they should have a blue dot, indicating that the function is running. (A green dot indicates that the job completed.)

For more information on the UI, see Model monitoring using the platform UI.

../_images/mm-monitor-jobs.png

View detailed drift dashboards#

Grafana has detailed dashboards that show additional information on each model in the project.

For more information on the dashboards, see Model monitoring in the Grafana dashboards.

The Overview dashboard displays the model endpoint IDs of a specific project. Only deployed models with Model Monitoring enabled are displayed. Endpoint IDs are URIs used to provide access to performance data and drift detection statistics of a deployed model.

grafana_dashboard_1

The Model Monitoring Details dashboard displays the real-time performance data of the selected model, including graphs of individual features over time.

grafana_dashboard_2

The Model Monitoring Performance dashboard displays drift and operational metrics over time.

grafana_dashboard_3

Done!#

Congratulations! You’ve completed Part 5 of the MLRun getting-started tutorial. To continue, proceed to Part 6 Batch inference and drift detection.