Table of Contents#

Using MLRun#

MLRun is an open MLOps platform for quickly building and managing continuous ML applications across their lifecycle. MLRun integrates into your development and CI/CD environment and automates the delivery of production data, ML pipelines, and online applications. MLRun significantly reduces engineering efforts, time to production, and computation resources. With MLRun, you can choose any IDE on your local machine or on the cloud. MLRun breaks the silos between data, ML, software, and DevOps/MLOps teams, enabling collaboration and fast continuous improvements.

Get started with MLRun using the Tutorials and examples and the Installation and setup guide.

This page explains how MLRun addresses the MLOps tasks, and presents the MLRun core components.

See the data stores, development tools, services, platforms, and other components supported by MLRun's open architecture in the MLRun ecosystem section.

MLOps tasks#

Project management and CI/CD automation
Ingest and process data
Develop and train models
Deploy models and apps
Monitor and alert

The MLOps development workflow section describes the different tasks and stages in detail. MLRun can be used to automate and orchestrate all the different tasks or just specific tasks (and integrate them with what you have already deployed).

Project management and CI/CD automation#

In MLRun the assets, metadata, and services (data, functions, jobs, artifacts, models, secrets, etc.) are organized into projects. Projects can be imported/exported as a whole, mapped to git repositories or IDE projects (in PyCharm, VSCode, etc.), which enables versioning, collaboration, and CI/CD. Project access can be restricted to a set of users and roles. more...
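
For example, a minimal sketch of creating a project, attaching it to a Git source, and exporting it (the project name and Git URL here are hypothetical):

import mlrun

# Create (or load) a project whose context directory is ./my-proj
project = mlrun.get_or_create_project("my-project", context="./my-proj", user_project=True)

# Map the project to a Git repository so its functions and workflows are versioned with the code
project.set_source("git://github.com/my-org/my-project.git#main", pull_at_runtime=True)

# Persist the project spec and export the whole project as an archive for sharing or CI/CD
project.save()
project.export("my-project.zip")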

Ingest and process data#

MLRun provides abstract interfaces to various offline and online data sources, supports batch or realtime data processing at scale, data lineage and versioning, structured and unstructured data, and more. In addition, the MLRun Feature store automates the collection, transformation, storage, catalog, serving, and monitoring of data features across the ML lifecycle and enables feature reuse and sharing. more...

Develop and train models#

MLRun allows you to easily build ML pipelines that take data from various sources or the Feature Store and process it, train models at scale with multiple parameters, test models, track each experiment, and register, version and deploy models, etc. MLRun provides scalable built-in or custom model training services that integrate with any framework and can work with 3rd party training/auto-ML services. You can also bring your own pre-trained model and use it in the pipeline. more...

Deploy models and applications#

MLRun rapidly deploys and manages production-grade real-time or batch application pipelines using elastic and resilient serverless functions. MLRun addresses the entire ML application: intercepting application/user requests, running data processing tasks, inferencing using one or more models, driving actions, and integrating with the application logic. more...

Monitor and alert#

Observability is built into the different MLRun objects (data, functions, jobs, models, pipelines, etc.), eliminating the need for complex integrations and code instrumentation. With MLRun, you can observe the application/model resource usage and model behavior (drift, performance, etc.), define custom app metrics, and trigger alerts or retraining jobs. more...

MLRun core components#

MLRun includes the following major components:

Project management & automation (SDK, API, etc.)
Serverless functions
Data & artifacts
Feature store
Batch runs & workflows
Real-time pipelines
Monitoring

Project management: A service (API, SDK, DB, UI) that manages the different project assets (data, functions, jobs, workflows, secrets, etc.) and provides central control and metadata layer.

Serverless functions: An automatically deployed software package with one or more methods and runtime-specific attributes (such as image, libraries, command, arguments, resources, etc.).

Data and artifacts: Glueless connectivity to various data sources, metadata management, catalog, and versioning for structured/unstructured artifacts.

Feature store: Automatically collects, prepares, catalogs, and serves production data features for development (offline) and real-time (online) deployment using minimal engineering effort.

Batch Runs and workflows: Execute one or more functions with specific parameters and collect, track, and compare all their results and artifacts.

Real-time serving pipeline: Rapid deployment of scalable data and ML pipelines using real-time serverless technology, including API handling, data preparation/enrichment, model serving, ensembles, driving and measuring actions, etc.

Real-time monitoring: Monitors data, models, resources, and production components and provides a feedback loop for exploring production data, identifying drift, alerting on anomalies or data quality issues, triggering retraining jobs, measuring business impact, etc.
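
To illustrate the serverless function component described above, here is a minimal sketch of packaging local code as an MLRun function (the file name, handler, and requirements are hypothetical):

import mlrun

# Package a local Python file as an MLRun "job" function with its runtime attributes
fn = mlrun.code_to_function(
    name="trainer",
    filename="trainer.py",    # hypothetical source file
    kind="job",               # runtime engine: job, nuclio, serving, spark, dask, ...
    image="mlrun/mlrun",
    handler="train",
    requirements=["scikit-learn"],
)

Resource requests and limits (CPU, memory, GPU) can also be set on the function object before running or deploying it.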

MLRun architecture#

MLRun started as a community effort to map the different components in the ML project lifecycle, provide a common metadata layer, and automate the operationalization process (a.k.a. MLOps).

Instead of a siloed, complex, and manual process, MLRun enables production pipeline design using a modular strategy, where the different parts contribute to a continuous, automated, and far simpler path from research and development to scalable production pipelines without refactoring code, adding glue logic, or spending significant efforts on data and ML engineering.

MLRun uses Serverless Function technology: write the code once, using your preferred development environment and simple "local" semantics, and then run it as-is on different platforms and at scale. MLRun automates the build process, execution, data movement, scaling, versioning, parameterization, output tracking, CI/CD integration, deployment to production, monitoring, and more.

Those easily developed data or ML "functions" can then be published or loaded from a hub and used later to form offline or real-time production pipelines with minimal engineering efforts.

(Figure: MLRun flow)


MLRun deployment#

MLRun has two main components, the service and the client (SDK):

  • The MLRun service runs over Kubernetes (can also be deployed using local Docker for demo and test purposes). It can orchestrate and integrate with other open source frameworks, as shown in the following diagram.

  • The MLRun client SDK is installed in your development environment and interacts with the service using REST API calls.

(Figure: MLRun flow)


MLRun: an integrated and open approach#

Data preparation, model development, model and application delivery, and end to end monitoring are tightly connected: they cannot be managed in silos. This is where MLRun MLOps orchestration comes in. ML, data, and DevOps/MLOps teams collaborate using the same set of tools, practices, APIs, metadata, and version control.

MLRun provides an open architecture that supports your existing development tools, services, and practices through an open API/SDK and pluggable architecture.

MLRun simplifies and accelerates the time to production!

(Figure: MLRun pipeline)



While each component in MLRun is independent, the integration provides much greater value and simplicity. For example:

  • The training jobs obtain features from the feature store and update the feature store with metadata, which is later used in serving and monitoring.

  • The real-time pipeline enriches incoming events with features stored in the feature store. It can also use feature metadata (policies, statistics, schema, etc.) to impute missing data or validate data quality.

  • The monitoring layer collects real-time inputs and outputs from the real-time pipeline and compares them with the features data/metadata from the feature store or model metadata generated by the training layer. Then, it writes all the fresh production data back to the feature store so it can be used for various tasks such as data analysis, model retraining (on fresh data), and model improvements.

When one of the components detailed above is updated, it immediately impacts the feature generation, the model serving pipeline, and the monitoring. MLRun applies versioning to each component, as well as rolling upgrades across components.

MLRun ecosystem#

This section lists the data stores, development tools, services, platforms, etc., supported by MLRun's open ecosystem.

Data stores#
  • Object (S3, GS, az)

  • Files, NFS

  • Pandas/Spark DF

  • BigQuery

  • Snowflake

  • Databricks

  • Redis

  • Iguazio V3IO object/key-value

  • SQL sources

Event sources#
  • HTTP

  • Cron

  • Kafka

  • Iguazio V3IO streams

Execution frameworks#
  • Nuclio

  • Spark

  • Dask

  • Horovod/MPI

  • K8s Jobs

Dev environments#
  • PyCharm

  • VSCode

  • Jupyter

  • Colab

  • AzureML

  • SageMaker

  • Codespaces

  • Others (set with environment variables)

Machine learning frameworks#
  • SKLearn

  • XGBoost

  • LGBM

  • TF / Keras

  • PyTorch

  • ONNX

  • Custom

Platforms#
  • Kubernetes

    • AWS EKS

    • Azure AKS

    • GKE

    • VMWare

  • Local (e.g., Kubernetes engine on Docker Desktop)

  • Docker

    • Linux/KVM

    • NVIDIA DGX

CI/CD#
  • Jenkins

  • GitHub Actions

  • GitLab CI/CD

  • KFP

Browser#

MLRun runs on Chrome and Firefox.

MLOps development workflow#

ML applications require you to implement the following stages in a scalable and reproducible way:

  1. Ingest and process data

  2. Develop and train models

  3. Deploy models and applications

  4. Monitor and alert

MLRun automates the MLOps work, simplifying and accelerating the time to production.

Ingest and process data#

There is no ML without data. Before everything else, ML teams need access to historical and/or online data from multiple sources, and they must catalog and organize the data in a way that allows for simple and fast analysis (for example, by storing data in columnar data structures, such as Parquet).

In most cases, the raw data cannot be used as-is for machine learning algorithms for various reasons such as:

  • The data is low quality (missing fields, null values, etc.) and requires cleaning and imputing.

  • The data needs to be converted to numerical or categorical values which can be processed by algorithms.

  • The data is unstructured in text, JSON, image, or audio formats, and needs to be converted to tabular or vector formats.

  • The data needs to be grouped or aggregated to make it meaningful.

  • The data is encoded or requires joins with reference information.

The ML process starts with manual exploratory data analysis and feature engineering on small data extractions. In order to bring accurate models into production, ML teams must work on larger datasets and automate the process of collecting and preparing the data.

Furthermore, batch collection and preparation methodologies such as ETL, SQL queries, and batch analytics don't work well for operational or real-time pipelines. As a result, ML teams often build separate data pipelines which use stream processing, NoSQL, and containerized microservices. 80% of data today is unstructured, so an essential part of building operational data pipelines is to convert unstructured textual, audio, and visual data into a machine learning- or deep learning-friendly data organization.

(Figure: data collection and preparation)

MLOps solutions should incorporate a feature store that defines the data collection and transformations just once for both batch and real-time scenarios, processes features automatically without manual involvement, and serves the features from a shared catalog to training, serving, and data governance applications. Feature stores must also extend beyond traditional analytics and enable advanced transformations on unstructured data and complex layouts.
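
As a rough sketch of that idea using MLRun's feature store API (the feature set name, entity, and dataframe below are hypothetical):

import pandas as pd

import mlrun.feature_store as fstore

# A hypothetical raw dataframe keyed by a business entity (ticker)
stocks_df = pd.DataFrame({"ticker": ["AAPL", "MSFT"], "price": [150.0, 300.0]})

# Define the feature set once; the same definition can back both batch and real-time ingestion
stocks_set = fstore.FeatureSet("stocks", entities=[fstore.Entity("ticker")])

# Ingest the batch data and register the features in the shared catalog
fstore.ingest(stocks_set, stocks_df)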

Develop and train models#

Whether it's deep learning or machine learning, MLRun allows you to train your models at scale and capture all the relevant metadata for experiment tracking and lineage.

With MLOps, ML teams build machine learning pipelines that automatically collect and prepare data, select optimal features, run training using different parameter sets or algorithms, evaluate models, and run various model and system tests. All the executions, along with their data, metadata, code, and results, must be versioned and logged, providing quick visualization of the results, comparison with past runs, and an understanding of which data was used to produce each model.

Pipelines can be more complex—for example, when ML teams need to develop a combination of models, or use Deep Learning or NLP.

(Figure: training pipeline)

ML pipelines can be triggered manually, or preferably triggered automatically when:

  • The code, packages or parameters change

  • The input data or feature engineering logic changes

  • Concept drift is detected, and the model needs to be re-trained with fresh data

ML pipelines:

  • Are built using micro-services (containers or serverless functions), usually over Kubernetes.

  • Have all their inputs (code, package dependencies, data, parameters) and the outputs (logs, metrics, data/features, artifacts, models) tracked for every step in the pipeline, in order to reproduce and/or explain the experiment results.

  • Use versioning for all the data and artifacts used throughout the pipeline.

  • Store code and configuration in versioned Git repositories.

  • Use Continuous Integration (CI) techniques to automate pipeline initiation, testing, and the review and approval process.

Pipelines should be executed over scalable services or functions, which can span elastically over multiple servers or containers. This way, jobs complete faster, and computation resources are freed up once they complete, saving significant costs.

The resulting models are stored in a versioned model repository along with metadata, performance metrics, required parameters, statistical information, etc. Models can be loaded later into batch or real-time serving micro-services or functions.
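
To make this concrete, here is a hedged sketch of a two-step MLRun workflow; it assumes "data-prep" and "trainer" functions are already registered in the project (as in the tutorials below):

# workflow.py -- a minimal sketch, not a definitive implementation
from kfp import dsl

import mlrun


@dsl.pipeline(name="train-pipeline")
def pipeline(label_column: str = "label"):
    # Step 1: prepare the dataset
    ingest = mlrun.run_function("data-prep", outputs=["dataset"])

    # Step 2: train a model on the prepared dataset
    mlrun.run_function(
        "trainer",
        inputs={"dataset": ingest.outputs["dataset"]},
        params={"label_column": label_column},
        outputs=["model"],
    )

The workflow can then be registered and executed with something like project.set_workflow("main", "./workflow.py") followed by project.run("main", watch=True).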

Deploy models and applications#

With MLRun, in addition to batch inference, you can deploy a robust and scalable real-time pipeline for more complex and online scenarios. MLRun uses Nuclio, an open source serverless framework, for creating real-time pipelines for model deployment.

Once an ML model has been built, it needs to be integrated with real-world data and the business application or front-end services. The entire application, or parts thereof, need to be deployed without disrupting the service. Deployment can be extremely challenging if the ML components aren’t treated as an integral part of the application or production pipeline.

Production pipelines usually consist of:

  • Real-time data collection, validation, and feature engineering logic

  • One or more model serving services

  • API services and/or application integration logic

  • Data and model monitoring services

  • Resource monitoring and alerting services

  • Event, telemetry, and data/features logging services

The different services are interdependent. For example, if the inputs to a model change, the feature engineering logic must be upgraded along with the model serving and model monitoring services. These dependencies require online production pipelines (graphs) to reflect these changes.

(Figure: building online ML services)

Production pipelines can be more complex when using unstructured data, deep learning, NLP or model ensembles, so having flexible mechanisms to build and wire up the pipeline graphs is critical.

Production pipelines are usually interconnected with fast streaming or messaging protocols, so they should be elastic to address traffic and demand fluctuations, and they should allow non-disruptive upgrades to one or more elements of the pipeline. These requirements are best addressed with fast serverless technologies.

Production pipeline development and deployment flow:

  1. Develop production components:

    • API services and application integration logic

    • Feature collection, validation, and transformation

    • Model serving graphs

  2. Test online pipelines with simulated data

  3. Deploy online pipelines to production

  4. Monitor models and data and detect drift

  5. Retrain models and re-engineer data when needed

  6. Upgrade pipeline components (non-disruptively) when needed

Monitor and alert#

Once the model is deployed, use MLRun to track the operational statistics as well as identify drift. When drift is identified, MLRun can trigger the training pipeline to train a new model.
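
A hedged sketch of what enabling this looks like in code, assuming a serving function object like the ones created in the tutorials below:

# Enable model monitoring (tracking) on a serving function before deploying it,
# so its inputs/outputs are streamed to the monitoring layer and drift can be detected
serving_fn.set_tracking()
project.deploy_function(serving_fn)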

AI services and applications are becoming an essential part of any business. This trend brings with it liabilities, which drive further complexity. ML teams need to add data, code and experiment tracking, monitor data to detect quality problems, and monitor models to detect concept drift and improve model accuracy through the use of AutoML techniques and ensembles, and so on.

Nothing lasts forever, not even carefully constructed models that have been trained using mountains of well-labeled data. ML teams need to react quickly to adapt to constantly changing patterns in real-world data. Monitoring machine learning models is a core component of MLOps to keep deployed models current and predicting with the utmost accuracy, and to ensure they deliver value long-term.

Tutorials and Examples#

The following tutorials provide a hands-on introduction to using MLRun to implement a data science workflow and automate machine-learning operations (MLOps).



Make sure you start with the Quick start tutorial to understand the basics.

Introduction to MLRun - Use serverless functions to train and deploy models

Quick start tutorial#


Introduction to MLRun - Use serverless functions to train and deploy models

This notebook provides a quick overview of developing and deploying machine learning applications using the MLRun MLOps orchestration framework.


Install MLRun#

MLRun has a backend service that can run locally or over Kubernetes (preferred). See the instructions for installing it locally using Docker or over a Kubernetes cluster. Alternatively, you can use Iguazio's managed MLRun service.

Before you start, make sure the MLRun client package is installed and configured properly:

This notebook uses scikit-learn and NumPy. If they are not installed in your environment, run !pip install scikit-learn~=1.4 numpy~=1.26.

# Install MLRun and sklearn, run this only once (restart the notebook after the install !!!)
%pip install mlrun scikit-learn~=1.4 numpy~=1.26

Restart the notebook kernel after the pip installation.

import mlrun

Configure the client environment#

The MLRun client connects to the local or remote MLRun service/cluster using a REST API. To configure the service address, credentials, and default settings, use the mlrun.set_environment() method or environment variables (see details in Set up your client environment).

You can skip this step when using MLRun Jupyter notebooks or Iguazio's managed notebooks.
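
For example, a minimal sketch using environment variables (the service address and credentials below are placeholders):

import os

import mlrun

# Point the client at the remote MLRun API service
os.environ["MLRUN_DBPATH"] = "https://mlrun-api.example.com"

# Credentials, when the service requires them (e.g., on the Iguazio platform)
os.environ["V3IO_USERNAME"] = "my-user"
os.environ["V3IO_ACCESS_KEY"] = "my-access-key"

# Alternatively, load the settings from an env file:
# mlrun.set_env_from_file("mlrun.env")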

Define MLRun project and ML functions#

An MLRun project is a container for all your work on a particular activity or application. Projects host functions, workflows, artifacts, secrets, and more. Projects have access control and can be accessed by one or more users. They are usually associated with a Git repository and interact with CI/CD frameworks for automation. See the MLRun Projects documentation.

Create a new project

project = mlrun.get_or_create_project("quick-tutorial", "./", user_project=True)
> 2022-09-20 13:19:49,414 [info] loaded project quick-tutorial from MLRun DB

MLRun serverless functions specify the source code, base image, extra package requirements, runtime engine kind (batch job, real-time serving, spark, dask, etc.), and desired resources (cpu, gpu, mem, storage, …). The runtime engines (local, job, Nuclio, Spark, etc.) automatically transform the function code and spec into fully managed and elastic services that run over Kubernetes. Function source code can come from a single file (.py, .ipynb, etc.) or a full archive (git, zip, tar). MLRun can execute an entire file/notebook or specific function classes/handlers.

Note

@mlrun.handler is a decorator that logs the returned values to MLRun as configured. This example uses the default settings, so it logs a dataset (pd.DataFrame) and a string value based on the types of the returned objects. In addition to logging outputs, the decorator can parse incoming inputs to the required type. For more info, see the mlrun.handler documentation.

Function code

Run the following cell to generate the data prep file (or copy it manually):

%%writefile data-prep.py

import pandas as pd
from sklearn.datasets import load_breast_cancer

import mlrun


@mlrun.handler(outputs=["dataset", "label_column"])
def breast_cancer_generator():
    """
    A function which generates the breast cancer dataset
    """
    breast_cancer = load_breast_cancer()
    breast_cancer_dataset = pd.DataFrame(
        data=breast_cancer.data, columns=breast_cancer.feature_names
    )
    breast_cancer_labels = pd.DataFrame(data=breast_cancer.target, columns=["label"])
    breast_cancer_dataset = pd.concat(
        [breast_cancer_dataset, breast_cancer_labels], axis=1
    )

    return breast_cancer_dataset, "label"
Overwriting data-prep.py

Create a serverless function object from the code above, and register it in the project

data_gen_fn = project.set_function(
    "data-prep.py",
    name="data-prep",
    kind="job",
    image="mlrun/mlrun",
    handler="breast_cancer_generator",
)
project.save()  # save the project with the latest config
<mlrun.projects.project.MlrunProject at 0x7ff72063d460>

Run your data processing function and log artifacts#

Functions are executed (using the CLI or SDK run command) with an optional handler, various params, inputs, and resource requirements. This generates a run object that can be tracked through the CLI, UI, and SDK. Multiple functions can be executed and tracked as part of a multi-stage pipeline (workflow).

Note

When a function has additional package requirements, or needs to include the content of a source archive, you must first build the function using the project.build_function() method.

The local flag indicates whether the function is executed locally or "teleported" and executed in the Kubernetes cluster. The execution progress and results can be viewed in the UI (see hyperlinks below).


Run using the SDK

gen_data_run = project.run_function("data-prep", local=True)
> 2022-09-20 13:22:59,351 [info] starting run data-prep-breast_cancer_generator uid=1ea3533192364dbc8898ce328988d0a3 DB=http://mlrun-api:8080
project uid iter start state name labels inputs parameters results artifacts
quick-tutorial-iguazio 0 Sep 20 13:22:59 completed data-prep-breast_cancer_generator
v3io_user=iguazio
kind=
owner=iguazio
host=jupyter-5654cb444f-c9wk2
label_column=label
dataset

> to track results use the .show() or .logs() methods or click here to open in UI
> 2022-09-20 13:22:59,693 [info] run executed, status=completed

Print the run state and outputs

gen_data_run.state()
'completed'
gen_data_run.outputs
{'label_column': 'label',
 'dataset': 'store://artifacts/quick-tutorial-iguazio/data-prep-breast_cancer_generator_dataset:1ea3533192364dbc8898ce328988d0a3'}

Print the output dataset artifact (DataItem object) as a dataframe

gen_data_run.artifact("dataset").as_df().head()
mean radius mean texture mean perimeter mean area mean smoothness mean compactness mean concavity mean concave points mean symmetry mean fractal dimension ... worst texture worst perimeter worst area worst smoothness worst compactness worst concavity worst concave points worst symmetry worst fractal dimension label
0 17.99 10.38 122.80 1001.0 0.11840 0.27760 0.3001 0.14710 0.2419 0.07871 ... 17.33 184.60 2019.0 0.1622 0.6656 0.7119 0.2654 0.4601 0.11890 0
1 20.57 17.77 132.90 1326.0 0.08474 0.07864 0.0869 0.07017 0.1812 0.05667 ... 23.41 158.80 1956.0 0.1238 0.1866 0.2416 0.1860 0.2750 0.08902 0
2 19.69 21.25 130.00 1203.0 0.10960 0.15990 0.1974 0.12790 0.2069 0.05999 ... 25.53 152.50 1709.0 0.1444 0.4245 0.4504 0.2430 0.3613 0.08758 0
3 11.42 20.38 77.58 386.1 0.14250 0.28390 0.2414 0.10520 0.2597 0.09744 ... 26.50 98.87 567.7 0.2098 0.8663 0.6869 0.2575 0.6638 0.17300 0
4 20.29 14.34 135.10 1297.0 0.10030 0.13280 0.1980 0.10430 0.1809 0.05883 ... 16.67 152.20 1575.0 0.1374 0.2050 0.4000 0.1625 0.2364 0.07678 0

5 rows × 31 columns

Train a model using an MLRun built-in hub function#

MLRun provides a Function Hub that hosts a set of pre-implemented and validated ML, DL, and data processing functions.

You can import the auto_trainer hub function, which can: train an ML model using a variety of ML frameworks; generate various metrics and charts; and log the model along with its metadata into the MLRun model registry.

# Import the function
trainer = mlrun.import_function("hub://auto_trainer")

See the auto_trainer function usage instructions in the Function Hub or by typing trainer.doc()

Run the function on the cluster (if one is available)

trainer_run = project.run_function(
    trainer,
    inputs={"dataset": gen_data_run.outputs["dataset"]},
    params={
        "model_class": "sklearn.ensemble.RandomForestClassifier",
        "train_test_split_size": 0.2,
        "label_columns": "label",
        "model_name": "cancer",
    },
    handler="train",
)
> 2022-09-20 13:23:14,811 [info] starting run auto-trainer-train uid=84057e1510174611a5d2de0671ee803e DB=http://mlrun-api:8080
> 2022-09-20 13:23:14,970 [info] Job is running in the background, pod: auto-trainer-train-dzjwz
Matplotlib created a temporary config/cache directory at /tmp/matplotlib-3pzdch1o because the default path (/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
> 2022-09-20 13:23:20,953 [info] Sample set not given, using the whole training set as the sample set
> 2022-09-20 13:23:21,143 [info] training 'cancer'
> 2022-09-20 13:23:22,561 [info] run executed, status=completed
final state: completed
project uid iter start state name labels inputs parameters results artifacts
quick-tutorial-iguazio 0 Sep 20 13:23:20 completed auto-trainer-train
v3io_user=iguazio
kind=job
owner=iguazio
mlrun/client_version=1.1.0
host=auto-trainer-train-dzjwz
dataset
model_class=sklearn.ensemble.RandomForestClassifier
train_test_split_size=0.2
label_columns=label
model_name=cancer
accuracy=0.956140350877193
f1_score=0.967741935483871
precision_score=0.9615384615384616
recall_score=0.974025974025974
feature-importance
test_set
confusion-matrix
roc-curves
calibration-curve
model

> to track results use the .show() or .logs() methods or click here to open in UI
> 2022-09-20 13:23:24,216 [info] run executed, status=completed

View the job progress results and the selected run in the MLRun UI

(Figure: train job in the MLRun UI)


Results (metrics) and artifacts are generated and tracked automatically by MLRun

trainer_run.outputs
{'accuracy': 0.956140350877193,
 'f1_score': 0.967741935483871,
 'precision_score': 0.9615384615384616,
 'recall_score': 0.974025974025974,
 'feature-importance': 'v3io:///projects/quick-tutorial-iguazio/artifacts/auto-trainer-train/0/feature-importance.html',
 'test_set': 'store://artifacts/quick-tutorial-iguazio/auto-trainer-train_test_set:84057e1510174611a5d2de0671ee803e',
 'confusion-matrix': 'v3io:///projects/quick-tutorial-iguazio/artifacts/auto-trainer-train/0/confusion-matrix.html',
 'roc-curves': 'v3io:///projects/quick-tutorial-iguazio/artifacts/auto-trainer-train/0/roc-curves.html',
 'calibration-curve': 'v3io:///projects/quick-tutorial-iguazio/artifacts/auto-trainer-train/0/calibration-curve.html',
 'model': 'store://artifacts/quick-tutorial-iguazio/cancer:84057e1510174611a5d2de0671ee803e'}
# Display HTML output artifacts
trainer_run.artifact("confusion-matrix").show()

Build, test, and deploy the model serving functions#

MLRun serving can produce managed, real-time, serverless pipelines composed of various data processing and ML tasks. The pipelines use the Nuclio real-time serverless engine, which can be deployed anywhere. For more details and examples, see MLRun serving graphs.

Create a model serving function

serving_fn = mlrun.new_function(
    "serving",
    image="mlrun/mlrun",
    kind="serving",
    requirements=["scikit-learn~=1.3.0"],
)

Add a model

The basic serving topology supports a router with multiple child models attached to it. The function.add_model() method allows you to add models and specify the name, model_path (to a model file, dir, or artifact), and the serving class (built-in or user defined).

serving_fn.add_model(
    "cancer-classifier",
    model_path=trainer_run.outputs["model"],
    class_name="mlrun.frameworks.sklearn.SklearnModelServer",
)
<mlrun.serving.states.TaskStep at 0x7ff6da1ac190>
# Plot the serving graph topology
serving_fn.spec.graph.plot(rankdir="LR")

Simulating the model server locally

# Create a mock (simulator of the real-time function)
server = serving_fn.to_mock_server()
> 2022-09-20 13:24:24,867 [warning] run command, file or code were not specified
> 2022-09-20 13:24:25,240 [info] model cancer-classifier was loaded
> 2022-09-20 13:24:25,241 [info] Loaded ['cancer-classifier']

Test the mock model server endpoint

  • List the served models

server.test("/v2/models/", method="GET")
{'models': ['cancer-classifier']}
  • Infer using test data

my_data = {
    "inputs": [
        [
            1.371e01,
            2.083e01,
            9.020e01,
            5.779e02,
            1.189e-01,
            1.645e-01,
            9.366e-02,
            5.985e-02,
            2.196e-01,
            7.451e-02,
            5.835e-01,
            1.377e00,
            3.856e00,
            5.096e01,
            8.805e-03,
            3.029e-02,
            2.488e-02,
            1.448e-02,
            1.486e-02,
            5.412e-03,
            1.706e01,
            2.814e01,
            1.106e02,
            8.970e02,
            1.654e-01,
            3.682e-01,
            2.678e-01,
            1.556e-01,
            3.196e-01,
            1.151e-01,
        ]
    ]
}
server.test("/v2/models/cancer-classifier/infer", body=my_data)
X does not have valid feature names, but RandomForestClassifier was fitted with feature names
{'id': '27d3f10a36ce465f841d3e19ca404889',
 'model_name': 'cancer-classifier',
 'outputs': [0]}
  • Read the model name, version, and schema (input and output features), as shown in the sketch below
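
A hedged sketch of that check against the mock server, assuming the v2 model metadata endpoint:

# Query the model's metadata (name, version, input/output schema) from the mock server
server.test("/v2/models/cancer-classifier/", method="GET")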

Deploy a real-time serving function (over Kubernetes or Docker)

This section requires Nuclio to be installed (over k8s or Docker).

Use the deploy_function() method to build and deploy a Nuclio serving function from your serving function code. You can deploy the function object (serving_fn) or reference pre-registered project functions.

project.deploy_function(serving_fn)
> 2022-09-20 13:24:34,823 [info] Starting remote function deploy
2022-09-20 13:24:35  (info) Deploying function
2022-09-20 13:24:35  (info) Building
2022-09-20 13:24:35  (info) Staging files and preparing base images
2022-09-20 13:24:35  (info) Building processor image
2022-09-20 13:25:35  (info) Build complete
2022-09-20 13:26:05  (info) Function deploy complete
> 2022-09-20 13:26:06,030 [info] successfully deployed function: {'internal_invocation_urls': ['nuclio-quick-tutorial-iguazio-serving.default-tenant.svc.cluster.local:8080'], 'external_invocation_urls': ['quick-tutorial-iguazio-serving-quick-tutorial-iguazio.default-tenant.app.alexp-edge.lab.iguazeng.com/']}
DeployStatus(state=ready, outputs={'endpoint': 'http://quick-tutorial-iguazio-serving-quick-tutorial-iguazio.default-tenant.app.alexp-edge.lab.iguazeng.com/', 'name': 'quick-tutorial-iguazio-serving'})
  • Test the live endpoint

serving_fn.invoke("/v2/models/cancer-classifier/infer", body=my_data)
> 2022-09-20 13:26:06,094 [info] invoking function: {'method': 'POST', 'path': 'http://nuclio-quick-tutorial-iguazio-serving.default-tenant.svc.cluster.local:8080/v2/models/cancer-classifier/infer'}
{'id': '2533b72a-6d94-4c51-b960-02a2deaf84b6',
 'model_name': 'cancer-classifier',
 'outputs': [0]}

Done!#

Congratulations! You've completed Part 1 of the MLRun getting-started tutorial. Proceed to Part 2: Train, Track, Compare, and Register Models to learn how to train an ML model.

Train, compare, and register models#

This notebook provides a quick overview of training ML models using the MLRun MLOps orchestration framework.

Make sure you reviewed the basics in the MLRun Quick start tutorial.


MLRun installation and configuration#

Before running this notebook, make sure the mlrun and scikit-learn packages are installed (pip install mlrun scikit-learn~=1.3) and that you have configured access to the MLRun service.

# Install MLRun if not installed, run this only once (restart the notebook after the install !!!)
%pip install mlrun

Define an MLRun project and a training function#

You should create, load, or use (get) an MLRun project that holds all your functions and assets.

Get or create a new project

The get_or_create_project() method tries to load the project from the MLRun DB. If the project does not exist, it creates a new one.

import mlrun

project = mlrun.get_or_create_project("tutorial", context="./", user_project=True)
> 2022-09-20 13:55:10,543 [info] loaded project tutorial from None or context and saved in MLRun DB

Add (auto) MLOps to your training function

Training functions generate models and various model statistics. You'll want to store the models along with all the relevant data, metadata, and measurements. MLRun can apply all the MLOps functionality automatically ("Auto-MLOps") by simply using the framework-specific apply_mlrun() method.

This is the line to add to your code, as shown in the training function below.

apply_mlrun(model=model, model_name="my_model", x_test=x_test, y_test=y_test)

apply_mlrun() manages the training process and automatically logs the framework-specific model object along with its details, data, metadata, and metrics. It accepts the model object and various optional parameters. When you specify the x_test and y_test data, it generates various plots and calculations to evaluate the model. Metadata and parameters are automatically recorded (from the MLRun context object) and therefore don't need to be specified.

Function code

Run the following cell to generate the trainer.py file (or copy it manually):

%%writefile trainer.py

import pandas as pd

from sklearn import ensemble
from sklearn.model_selection import train_test_split

import mlrun
from mlrun.frameworks.sklearn import apply_mlrun


@mlrun.handler()
def train(
    dataset: pd.DataFrame,
    label_column: str = "label",
    n_estimators: int = 100,
    learning_rate: float = 0.1,
    max_depth: int = 3,
    model_name: str = "cancer_classifier",
):
    # Initialize the x & y data
    x = dataset.drop(label_column, axis=1)
    y = dataset[label_column]

    # Train/Test split the dataset
    x_train, x_test, y_train, y_test = train_test_split(
        x, y, test_size=0.2, random_state=42
    )

    # Pick an ideal ML model
    model = ensemble.GradientBoostingClassifier(
        n_estimators=n_estimators, learning_rate=learning_rate, max_depth=max_depth
    )

    # -------------------- The only line you need to add for MLOps -------------------------
    # Wraps the model with MLOps (test set is provided for analysis & accuracy measurements)
    apply_mlrun(model=model, model_name=model_name, x_test=x_test, y_test=y_test)
    # --------------------------------------------------------------------------------------

    # Train the model
    model.fit(x_train, y_train)
Overwriting trainer.py

Create a serverless function object from the code above, and register it in the project

trainer = project.set_function(
    "trainer.py", name="trainer", kind="job", image="mlrun/mlrun", handler="train"
)

Run the training function and log the artifacts and model#

Create a dataset for training

import pandas as pd
from sklearn.datasets import load_breast_cancer

breast_cancer = load_breast_cancer()
breast_cancer_dataset = pd.DataFrame(
    data=breast_cancer.data, columns=breast_cancer.feature_names
)
breast_cancer_labels = pd.DataFrame(data=breast_cancer.target, columns=["label"])
breast_cancer_dataset = pd.concat([breast_cancer_dataset, breast_cancer_labels], axis=1)

breast_cancer_dataset.to_csv("cancer-dataset.csv", index=False)

Run the function (locally) using the generated dataset

trainer_run = project.run_function(
    "trainer",
    inputs={"dataset": "cancer-dataset.csv"},
    params={"n_estimators": 100, "learning_rate": 1e-1, "max_depth": 3},
    local=True,
)
> 2022-09-20 13:56:57,630 [info] starting run trainer-train uid=b3f1bc3379324767bee22f44942b96e4 DB=http://mlrun-api:8080
project uid iter start state name labels inputs parameters results artifacts
tutorial-iguazio 0 Sep 20 13:56:57 completed trainer-train
v3io_user=iguazio
kind=
owner=iguazio
host=jupyter-5654cb444f-c9wk2
dataset
n_estimators=100
learning_rate=0.1
max_depth=3
accuracy=0.956140350877193
f1_score=0.965034965034965
precision_score=0.9583333333333334
recall_score=0.971830985915493
feature-importance
test_set
confusion-matrix
roc-curves
calibration-curve
model

> to track results use the .show() or .logs() methods or click here to open in UI
> 2022-09-20 13:56:59,356 [info] run executed, status=completed

View the auto-generated results and artifacts

trainer_run.outputs
{'accuracy': 0.956140350877193,
 'f1_score': 0.965034965034965,
 'precision_score': 0.9583333333333334,
 'recall_score': 0.971830985915493,
 'feature-importance': 'v3io:///projects/tutorial-iguazio/artifacts/trainer-train/0/feature-importance.html',
 'test_set': 'store://artifacts/tutorial-iguazio/trainer-train_test_set:b3f1bc3379324767bee22f44942b96e4',
 'confusion-matrix': 'v3io:///projects/tutorial-iguazio/artifacts/trainer-train/0/confusion-matrix.html',
 'roc-curves': 'v3io:///projects/tutorial-iguazio/artifacts/trainer-train/0/roc-curves.html',
 'calibration-curve': 'v3io:///projects/tutorial-iguazio/artifacts/trainer-train/0/calibration-curve.html',
 'model': 'store://artifacts/tutorial-iguazio/cancer_classifier:b3f1bc3379324767bee22f44942b96e4'}
trainer_run.artifact("feature-importance").show()

Export model files + metadata into a zip (requires MLRun 1.1.0 or later)

You can export() the model package (files + metadata) into a zip, and load it on a remote system/cluster by running model = project.import_artifact(key, path).

trainer_run.artifact("model").meta.export("model.zip")

Hyper-parameter tuning and model/experiment comparison#

Run a GridSearch with a couple of parameters, and select the best run with respect to the max accuracy.
(For more details, see MLRun Hyper-Param and Iterative jobs.)

For basic usage, you can run the hyperparameter tuning job by using the arguments:

  • hyperparams for the hyperparameters options and values of choice.

  • selector for specifying how to select the best model.

Running a remote function

To run the hyper-param task over the cluster, the input data must be available to the job, for example in object storage or the MLRun versioned artifact store.

The following line logs (and uploads) the dataframe as a project artifact:

dataset_artifact = project.log_dataset(
    "cancer-dataset", df=breast_cancer_dataset, index=False
)

Run the function over the remote Kubernetes cluster (local is not set):

hp_tuning_run = project.run_function(
    "trainer",
    inputs={"dataset": dataset_artifact.uri},
    hyperparams={
        "n_estimators": [10, 100, 1000],
        "learning_rate": [1e-1, 1e-3],
        "max_depth": [2, 8],
    },
    selector="max.accuracy",
)
> 2022-09-20 13:57:28,217 [info] starting run trainer-train uid=b7696b221a174f66979be01138797f19 DB=http://mlrun-api:8080
> 2022-09-20 13:57:28,365 [info] Job is running in the background, pod: trainer-train-xfzfp
Matplotlib created a temporary config/cache directory at /tmp/matplotlib-zuih5pkq because the default path (/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
> 2022-09-20 13:58:07,356 [info] best iteration=3, used criteria max.accuracy
> 2022-09-20 13:58:07,750 [info] run executed, status=completed
final state: completed
project uid iter start state name labels inputs parameters results artifacts
tutorial-iguazio 0 Sep 20 13:57:33 completed trainer-train
v3io_user=iguazio
kind=job
owner=iguazio
mlrun/client_version=1.1.0
dataset
best_iteration=3
accuracy=0.9649122807017544
f1_score=0.9722222222222222
precision_score=0.958904109589041
recall_score=0.9859154929577465
feature-importance
test_set
confusion-matrix
roc-curves
calibration-curve
model
iteration_results
parallel_coordinates

> to track results use the .show() or .logs() methods or click here to open in UI
> 2022-09-20 13:58:13,925 [info] run executed, status=completed

View Hyper-param results and the selected run in the MLRun UI

(Figure: hyper-param run results in the MLRun UI)

Interactive Parallel Coordinates Plot

(Figure: interactive parallel coordinates plot)


List the generated models and compare the different runs

hp_tuning_run.outputs
{'best_iteration': 3,
 'accuracy': 0.9649122807017544,
 'f1_score': 0.9722222222222222,
 'precision_score': 0.958904109589041,
 'recall_score': 0.9859154929577465,
 'feature-importance': 'v3io:///projects/tutorial-iguazio/artifacts/trainer-train/3/feature-importance.html',
 'test_set': 'store://artifacts/tutorial-iguazio/trainer-train_test_set:b7696b221a174f66979be01138797f19',
 'confusion-matrix': 'v3io:///projects/tutorial-iguazio/artifacts/trainer-train/3/confusion-matrix.html',
 'roc-curves': 'v3io:///projects/tutorial-iguazio/artifacts/trainer-train/3/roc-curves.html',
 'calibration-curve': 'v3io:///projects/tutorial-iguazio/artifacts/trainer-train/3/calibration-curve.html',
 'model': 'store://artifacts/tutorial-iguazio/cancer_classifier:b7696b221a174f66979be01138797f19',
 'iteration_results': 'v3io:///projects/tutorial-iguazio/artifacts/trainer-train/0/iteration_results.csv',
 'parallel_coordinates': 'v3io:///projects/tutorial-iguazio/artifacts/trainer-train/0/parallel_coordinates.html'}
# List the models in the project (can apply filters)
models = project.list_models()
for model in models:
    print(f"uri: {model.uri}, metrics: {model.metrics}")
uri: store://models/tutorial-iguazio/cancer_classifier#0:b3f1bc3379324767bee22f44942b96e4, metrics: {'accuracy': 0.956140350877193, 'f1_score': 0.965034965034965, 'precision_score': 0.9583333333333334, 'recall_score': 0.971830985915493}
uri: store://models/tutorial-iguazio/cancer_classifier#1:b7696b221a174f66979be01138797f19, metrics: {'accuracy': 0.956140350877193, 'f1_score': 0.965034965034965, 'precision_score': 0.9583333333333334, 'recall_score': 0.971830985915493}
uri: store://models/tutorial-iguazio/cancer_classifier#2:b7696b221a174f66979be01138797f19, metrics: {'accuracy': 0.956140350877193, 'f1_score': 0.965034965034965, 'precision_score': 0.9583333333333334, 'recall_score': 0.971830985915493}
uri: store://models/tutorial-iguazio/cancer_classifier#3:b7696b221a174f66979be01138797f19, metrics: {'accuracy': 0.9649122807017544, 'f1_score': 0.9722222222222222, 'precision_score': 0.958904109589041, 'recall_score': 0.9859154929577465}
uri: store://models/tutorial-iguazio/cancer_classifier#4:b7696b221a174f66979be01138797f19, metrics: {'accuracy': 0.6228070175438597, 'f1_score': 0.7675675675675676, 'precision_score': 0.6228070175438597, 'recall_score': 1.0}
uri: store://models/tutorial-iguazio/cancer_classifier#5:b7696b221a174f66979be01138797f19, metrics: {'accuracy': 0.6228070175438597, 'f1_score': 0.7675675675675676, 'precision_score': 0.6228070175438597, 'recall_score': 1.0}
uri: store://models/tutorial-iguazio/cancer_classifier#6:b7696b221a174f66979be01138797f19, metrics: {'accuracy': 0.956140350877193, 'f1_score': 0.965034965034965, 'precision_score': 0.9583333333333334, 'recall_score': 0.971830985915493}
uri: store://models/tutorial-iguazio/cancer_classifier#7:b7696b221a174f66979be01138797f19, metrics: {'accuracy': 0.9385964912280702, 'f1_score': 0.951048951048951, 'precision_score': 0.9444444444444444, 'recall_score': 0.9577464788732394}
uri: store://models/tutorial-iguazio/cancer_classifier#8:b7696b221a174f66979be01138797f19, metrics: {'accuracy': 0.9473684210526315, 'f1_score': 0.9577464788732394, 'precision_score': 0.9577464788732394, 'recall_score': 0.9577464788732394}
uri: store://models/tutorial-iguazio/cancer_classifier#9:b7696b221a174f66979be01138797f19, metrics: {'accuracy': 0.9473684210526315, 'f1_score': 0.9577464788732394, 'precision_score': 0.9577464788732394, 'recall_score': 0.9577464788732394}
uri: store://models/tutorial-iguazio/cancer_classifier#10:b7696b221a174f66979be01138797f19, metrics: {'accuracy': 0.6228070175438597, 'f1_score': 0.7675675675675676, 'precision_score': 0.6228070175438597, 'recall_score': 1.0}
uri: store://models/tutorial-iguazio/cancer_classifier#11:b7696b221a174f66979be01138797f19, metrics: {'accuracy': 0.6228070175438597, 'f1_score': 0.7675675675675676, 'precision_score': 0.6228070175438597, 'recall_score': 1.0}
uri: store://models/tutorial-iguazio/cancer_classifier#12:b7696b221a174f66979be01138797f19, metrics: {'accuracy': 0.9385964912280702, 'f1_score': 0.951048951048951, 'precision_score': 0.9444444444444444, 'recall_score': 0.9577464788732394}
# To view the full model object use:
# print(models[0].to_yaml())
# Compare the runs (generate interactive parallel coordinates plot and a table)
project.list_runs(name="trainer-train", iter=True).compare()
uid iter start state name parameters results
12 Sep 20 13:57:59 completed trainer-train
n_estimators=1000
learning_rate=0.001
max_depth=8
accuracy=0.9385964912280702
f1_score=0.951048951048951
precision_score=0.9444444444444444
recall_score=0.9577464788732394
11 Sep 20 13:57:57 completed trainer-train
n_estimators=100
learning_rate=0.001
max_depth=8
accuracy=0.6228070175438597
f1_score=0.7675675675675676
precision_score=0.6228070175438597
recall_score=1.0
10 Sep 20 13:57:55 completed trainer-train
n_estimators=10
learning_rate=0.001
max_depth=8
accuracy=0.6228070175438597
f1_score=0.7675675675675676
precision_score=0.6228070175438597
recall_score=1.0
9 Sep 20 13:57:53 completed trainer-train
n_estimators=1000
learning_rate=0.1
max_depth=8
accuracy=0.9473684210526315
f1_score=0.9577464788732394
precision_score=0.9577464788732394
recall_score=0.9577464788732394
8 Sep 20 13:57:51 completed trainer-train
n_estimators=100
learning_rate=0.1
max_depth=8
accuracy=0.9473684210526315
f1_score=0.9577464788732394
precision_score=0.9577464788732394
recall_score=0.9577464788732394
7 Sep 20 13:57:50 completed trainer-train
n_estimators=10
learning_rate=0.1
max_depth=8
accuracy=0.9385964912280702
f1_score=0.951048951048951
precision_score=0.9444444444444444
recall_score=0.9577464788732394
6 Sep 20 13:57:45 completed trainer-train
n_estimators=1000
learning_rate=0.001
max_depth=2
accuracy=0.956140350877193
f1_score=0.965034965034965
precision_score=0.9583333333333334
recall_score=0.971830985915493
5 Sep 20 13:57:43 completed trainer-train
n_estimators=100
learning_rate=0.001
max_depth=2
accuracy=0.6228070175438597
f1_score=0.7675675675675676
precision_score=0.6228070175438597
recall_score=1.0
4 Sep 20 13:57:42 completed trainer-train
n_estimators=10
learning_rate=0.001
max_depth=2
accuracy=0.6228070175438597
f1_score=0.7675675675675676
precision_score=0.6228070175438597
recall_score=1.0
3 Sep 20 13:57:38 completed trainer-train
n_estimators=1000
learning_rate=0.1
max_depth=2
accuracy=0.9649122807017544
f1_score=0.9722222222222222
precision_score=0.958904109589041
recall_score=0.9859154929577465
2 Sep 20 13:57:36 completed trainer-train
n_estimators=100
learning_rate=0.1
max_depth=2
accuracy=0.956140350877193
f1_score=0.965034965034965
precision_score=0.9583333333333334
recall_score=0.971830985915493
1 Sep 20 13:57:33 completed trainer-train
n_estimators=10
learning_rate=0.1
max_depth=2
accuracy=0.956140350877193
f1_score=0.965034965034965
precision_score=0.9583333333333334
recall_score=0.971830985915493
1 Sep 20 13:57:33 completed trainer-train
n_estimators=10
learning_rate=0.1
max_depth=2
accuracy=0.956140350877193
f1_score=0.965034965034965
precision_score=0.9583333333333334
recall_score=0.971830985915493
2 Sep 20 13:57:33 completed trainer-train
n_estimators=100
learning_rate=0.1
max_depth=2
accuracy=0.956140350877193
f1_score=0.965034965034965
precision_score=0.9583333333333334
recall_score=0.971830985915493
3 Sep 20 13:57:33 completed trainer-train
n_estimators=1000
learning_rate=0.1
max_depth=2
accuracy=0.9649122807017544
f1_score=0.9722222222222222
precision_score=0.958904109589041
recall_score=0.9859154929577465
4 Sep 20 13:57:33 completed trainer-train
n_estimators=10
learning_rate=0.001
max_depth=2
accuracy=0.6228070175438597
f1_score=0.7675675675675676
precision_score=0.6228070175438597
recall_score=1.0
5 Sep 20 13:57:33 completed trainer-train
n_estimators=100
learning_rate=0.001
max_depth=2
accuracy=0.6228070175438597
f1_score=0.7675675675675676
precision_score=0.6228070175438597
recall_score=1.0
6 Sep 20 13:57:33 completed trainer-train
n_estimators=1000
learning_rate=0.001
max_depth=2
accuracy=0.956140350877193
f1_score=0.965034965034965
precision_score=0.9583333333333334
recall_score=0.971830985915493
7 Sep 20 13:57:33 completed trainer-train
n_estimators=10
learning_rate=0.1
max_depth=8
accuracy=0.9385964912280702
f1_score=0.951048951048951
precision_score=0.9444444444444444
recall_score=0.9577464788732394
8 Sep 20 13:57:33 completed trainer-train
n_estimators=100
learning_rate=0.1
max_depth=8
accuracy=0.9473684210526315
f1_score=0.9577464788732394
precision_score=0.9577464788732394
recall_score=0.9577464788732394
9 Sep 20 13:57:33 completed trainer-train
n_estimators=1000
learning_rate=0.1
max_depth=8
accuracy=0.9473684210526315
f1_score=0.9577464788732394
precision_score=0.9577464788732394
recall_score=0.9577464788732394
10 Sep 20 13:57:33 completed trainer-train
n_estimators=10
learning_rate=0.001
max_depth=8
accuracy=0.6228070175438597
f1_score=0.7675675675675676
precision_score=0.6228070175438597
recall_score=1.0
11 Sep 20 13:57:33 completed trainer-train
n_estimators=100
learning_rate=0.001
max_depth=8
accuracy=0.6228070175438597
f1_score=0.7675675675675676
precision_score=0.6228070175438597
recall_score=1.0
12 Sep 20 13:57:33 completed trainer-train
n_estimators=1000
learning_rate=0.001
max_depth=8
accuracy=0.9385964912280702
f1_score=0.951048951048951
precision_score=0.9444444444444444
recall_score=0.9577464788732394
0 Sep 20 13:56:57 completed trainer-train
n_estimators=100
learning_rate=0.1
max_depth=3
accuracy=0.956140350877193
f1_score=0.965034965034965
precision_score=0.9583333333333334
recall_score=0.971830985915493
0 Sep 20 13:56:16
error
trainer-train
n_estimators=100
learning_rate=0.1
max_depth=3
0 Sep 20 13:56:03
error
trainer-train
n_estimators=100
learning_rate=0.1
max_depth=3
0 Sep 20 13:55:25 running trainer-train
n_estimators=100
learning_rate=0.1
max_depth=3

Build and test the model serving functions#

MLRun serving can produce managed, real-time, serverless pipelines composed of various data processing and ML tasks. The pipelines use the Nuclio real-time serverless engine, which can be deployed anywhere. For more details and examples, see the MLRun Serving Graphs documentation.

Create a model serving function from your code

serving_fn = mlrun.new_function("serving", image="mlrun/mlrun", kind="serving")
serving_fn.add_model(
    "cancer-classifier",
    model_path=hp_tuning_run.outputs["model"],
    class_name="mlrun.frameworks.sklearn.SklearnModelServer",
)
<mlrun.serving.states.TaskStep at 0x7feb1f55faf0>
# Create a mock (simulator of the real-time function)
server = serving_fn.to_mock_server()

my_data = {
    "inputs": [
        [
            1.371e01,
            2.083e01,
            9.020e01,
            5.779e02,
            1.189e-01,
            1.645e-01,
            9.366e-02,
            5.985e-02,
            2.196e-01,
            7.451e-02,
            5.835e-01,
            1.377e00,
            3.856e00,
            5.096e01,
            8.805e-03,
            3.029e-02,
            2.488e-02,
            1.448e-02,
            1.486e-02,
            5.412e-03,
            1.706e01,
            2.814e01,
            1.106e02,
            8.970e02,
            1.654e-01,
            3.682e-01,
            2.678e-01,
            1.556e-01,
            3.196e-01,
            1.151e-01,
        ]
    ]
}
server.test("/v2/models/cancer-classifier/infer", body=my_data)
> 2022-09-20 14:12:35,714 [warning] run command, file or code were not specified
> 2022-09-20 14:12:35,859 [info] model cancer-classifier was loaded
> 2022-09-20 14:12:35,860 [info] Loaded ['cancer-classifier']
/conda/envs/mlrun-extended/lib/python3.8/site-packages/sklearn/base.py:450: UserWarning:

X does not have valid feature names, but GradientBoostingClassifier was fitted with feature names
{'id': 'd9aee47cadc042ebbd9474ec0179a446',
 'model_name': 'cancer-classifier',
 'outputs': [0]}

Done!#

Congratulations! You've completed Part 2 of the MLRun getting-started tutorial. Proceed to Part 3: Model serving to learn how to deploy and serve your model using a serverless function.

Serving pre-trained ML/DL models#

This notebook demonstrates how to serve standard ML/DL models using MLRun Serving.

Make sure you went over the basics in the MLRun Quick start tutorial.

MLRun serving can produce managed real-time serverless pipelines from various tasks, including MLRun models or standard model files. The pipelines use the Nuclio real-time serverless engine, which can be deployed anywhere. Nuclio is a high-performance open-source "serverless" framework that's focused on data, I/O, and compute-intensive workloads.

MLRun serving supports advanced real-time data processing and model serving pipelines.
For more details and examples, see the MLRun serving pipelines documentation.


MLRun installation and configuration#

Before running this notebook, make sure the mlrun package is installed (pip install mlrun) and that you have configured access to the MLRun service.

# Install MLRun if not installed, run this only once. Restart the notebook after the install!
%pip install mlrun

Get or create a new project

You should create, load or use (get) an MLRun Project. The get_or_create_project() method tries to load the project from the MLRun DB. If the project does not exist, it creates a new one.

import mlrun

project = mlrun.get_or_create_project("tutorial", context="./", user_project=True)

Using pre-built MLRun serving classes#

MLRun contains built-in serving functionality for the major ML/DL frameworks (Scikit-Learn, TensorFlow.Keras, ONNX, XGBoost, LightGBM, and PyTorch).

The following table specifies, for each framework, the corresponding MLRun ModelServer serving class and its dependencies:

framework        | serving class                                | dependencies
-----------------|----------------------------------------------|-------------
SciKit-Learn     | mlrun.frameworks.sklearn.SklearnModelServer  | scikit-learn
TensorFlow.Keras | mlrun.frameworks.tf_keras.TFKerasModelServer | tensorflow
ONNX             | mlrun.frameworks.onnx.ONNXModelServer        | onnxruntime
XGBoost          | mlrun.frameworks.xgboost.XGBoostModelServer  | xgboost
LightGBM         | mlrun.frameworks.lgbm.LGBMModelServer        | lightgbm
PyTorch          | mlrun.frameworks.pytorch.PyTorchModelServer  | torch

For GPU support use the mlrun/mlrun-gpu image (adding GPU drivers and support).

Example using scikit-learn and TF Keras models

See how to specify the parameters in the following two examples. These use standard pre-trained models (using the iris dataset) stored in the MLRun samples repository. (You can use your own models instead.)

models_dir = mlrun.get_sample_path("models/serving/")

# We choose the correct model to avoid pickle warnings
import sys

suffix = (
    mlrun.__version__.split("-")[0].replace(".", "_")
    if sys.version_info[1] > 7
    else "3.7"
)

framework = "sklearn"  # change to 'keras' to try the 2nd option
kwargs = {}
if framework == "sklearn":
    serving_class = "mlrun.frameworks.sklearn.SklearnModelServer"
    model_path = models_dir + f"sklearn-{suffix}.pkl"
    image = "mlrun/mlrun"
    requirements = []
else:
    serving_class = "mlrun.frameworks.tf_keras.TFKerasModelServer"
    model_path = models_dir + "keras.h5"
    image = "mlrun/mlrun"  # or mlrun/mlrun-gpu when using GPUs
    kwargs["labels"] = {"model-format": "h5"}
    requirements = ["tensorflow"]
Log the model#

The model and its metadata are first registered in MLRun's Model Registry. Use the log_model() method to specify the model files and metadata (metrics, schema, parameters, etc.).

model_object = project.log_model(f"{framework}-model", model_file=model_path, **kwargs)

Create and test the serving function#

Create a new serving function, specify its name and the correct image (with your desired framework).

If you want to add specific packages to the base image, specify the requirements attribute, for example:

serving_fn = mlrun.new_function("serving", image=image, kind="serving", requirements=["tensorflow==2.8.1"])

The following example uses a basic topology of a model router and adds a single model behind it. (You can add multiple models to the same function.)

serving_fn = mlrun.new_function("serving", image=image, kind="serving", requirements=requirements)
serving_fn.add_model(
    framework, model_path=model_object.uri, class_name=serving_class, to_list=True
)

# Plot the serving topology input -> router -> model
serving_fn.plot(rankdir="LR")

Simulate the model server locally (using the mock_server)

# Create a mock server that represents the serving pipeline
server = serving_fn.to_mock_server()

Test the mock model server endpoint

  • List the served models

server.test("/v2/models/", method="GET")
{'models': ['sklearn']}
  • Infer using test data

sample = {
    "inputs": [
        {
            "sepal_length_cm": {0: 5.2, 1: 6.4},
            "sepal_width_cm": {0: 2.7, 1: 3.1},
            "petal_length_cm": {0: 3.9, 1: 5.5},
            "petal_width_cm": {0: 1.4, 1: 1.8},
        }
    ]
}
server.test(path=f"/v2/models/{framework}/infer", body=sample)
X does not have valid feature names, but RandomForestClassifier was fitted with feature names
{'id': '826608653e95452b9ac48fcca1ab8c47',
 'model_name': 'sklearn',
 'outputs': [0, 2]}

See more API options and parameters in Model serving API.

Deploy the serving function#

Deploy the serving function and use invoke to test it with the provided sample.

project.deploy_function(serving_fn)
> 2023-03-13 08:33:34,552 [info] Starting remote function deploy
2023-03-13 08:33:34  (info) Deploying function
2023-03-13 08:33:34  (info) Building
2023-03-13 08:33:34  (info) Staging files and preparing base images
2023-03-13 08:33:34  (info) Building processor image
2023-03-13 08:34:29  (info) Build complete
2023-03-13 08:34:39  (info) Function deploy complete
> 2023-03-13 08:34:45,922 [info] successfully deployed function: {'internal_invocation_urls': ['nuclio-tutorial-yonis-serving.default-tenant.svc.cluster.local:8080'], 'external_invocation_urls': ['tutorial-yonis-serving-tutorial-yonis.default-tenant.app.vmdev30.lab.iguazeng.com/']}
DeployStatus(state=ready, outputs={'endpoint': 'http://tutorial-yonis-serving-tutorial-yonis.default-tenant.app.vmdev30.lab.iguazeng.com/', 'name': 'tutorial-yonis-serving'})
serving_fn.invoke(path=f"/v2/models/{framework}/infer", body=sample)
> 2023-03-13 08:34:46,009 [info] invoking function: {'method': 'POST', 'path': 'http://nuclio-tutorial-yonis-serving.default-tenant.svc.cluster.local:8080/v2/models/sklearn/infer'}
{'id': 'b699f7e6-2d3b-4fa4-9534-fa6b9fa3f423',
 'model_name': 'sklearn',
 'outputs': [0, 2]}

Build a custom serving class#

Model serving classes implement the full model serving functionality, which includes loading models, pre- and post-processing, prediction, explainability, and model monitoring.

Model serving classes must inherit from mlrun.serving.V2ModelServer, and at the minimum implement the load() (download the model file(s) and load the model into memory) and predict() (accept request payload and return prediction/inference results) methods.

For more detailed information on custom serving classes, see Build your own model serving class.

The following code demonstrates a minimal scikit-learn (a.k.a. sklearn) serving-class implementation:

from cloudpickle import load
import numpy as np
from typing import List
import mlrun

class ClassifierModel(mlrun.serving.V2ModelServer):
    def load(self):
        """load and initialize the model and/or other elements"""
        model_file, extra_data = self.get_model('.pkl')
        self.model = load(open(model_file, 'rb'))

    def predict(self, body: dict) -> List:
        """Generate model predictions from sample."""
        feats = np.asarray(body['inputs'])
        result: np.ndarray = self.model.predict(feats)
        return result.tolist()

To create a function that incorporates the code of the new class (in serving.py), use code_to_function:

serving_fn = mlrun.code_to_function('serving', filename='serving.py', kind='serving', image='mlrun/mlrun')
serving_fn.add_model('my_model', model_path=model_file, class_name='ClassifierModel')

Build an advanced model serving graph#

MLRun graphs enable building and running DAGs (directed acyclic graphs). Graphs are composed of individual steps. The first graph element accepts an Event object, transforms/processes the event and passes the result to the next step in the graph, and so on. The final result can be written out to a destination (file, DB, stream, etc.) or returned back to the caller (one of the graph steps can be marked with .respond()).

The serving graphs can be composed of pre-defined graph steps, block-type elements (model servers, routers, ensembles, data readers and writers, data engineering tasks, validators, etc.), custom steps, or native Python classes/functions. A graph can have data processing steps, model ensembles, model servers, post-processing, etc. Graphs can auto-scale and span multiple function containers (connected through streaming protocols).
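
For illustration, a minimal flow-graph sketch could look like the following. This is only a sketch: it assumes that serving.py (from the previous section) also defines preprocess and postprocess handler functions, and it reuses the ClassifierModel class and the model_object logged earlier in this notebook. The step names and chain are illustrative, not a prescribed layout.

import mlrun

# Build a serving function with a custom flow topology (sketch only)
graph_fn = mlrun.code_to_function("advanced", filename="serving.py", kind="serving", image="mlrun/mlrun")
graph = graph_fn.set_topology("flow", engine="async")

# Chain the steps: preprocess -> model server -> postprocess, then return the response to the caller
graph.to(handler="preprocess") \
    .to(class_name="ClassifierModel", name="classifier", model_path=model_object.uri) \
    .to(handler="postprocess").respond()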

See the Advanced Model Serving Graph Notebook Example.

Done!#

Congratulations! You've completed Part 3 of the MLRun getting-started tutorial. Proceed to Part 4: ML Pipeline to learn how to create an automated pipeline for your project.

Projects and automated ML pipeline#

This notebook demonstrates how to work with projects, source control (git), and automating the ML pipeline.

Make sure you went over the basics in MLRun Quick Start Tutorial.

MLRun Project is a container for all your work on a particular activity: all the associated code, functions, jobs, workflows, data, models, and artifacts. Projects can be mapped to git repositories to enable versioning, collaboration, and CI/CD.

You can create project definitions using the SDK or a yaml file and store those in the MLRun DB, a file, or an archive. Once the project is loaded you can run jobs/workflows that refer to any project element by name, allowing separation between configuration and code. See load projects for details.

Projects contain workflows that execute the registered functions in a sequence/graph (DAG), and that can reference project parameters, secrets, and artifacts by name. MLRun currently supports two workflow engines, local (for simple tasks) and Kubeflow Pipelines (for more complex/advanced tasks). MLRun also supports a real-time workflow engine (see Online serving pipelines (graphs)).
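
For example, the workflow engine can be selected when running a registered workflow (a sketch only; it assumes a workflow named 'main' has already been registered on the project, as shown later in this tutorial):

# Run the registered 'main' workflow on the Kubeflow Pipelines engine
# (use engine="local" for simple local runs)
project.run("main", engine="kfp", watch=True)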

An ML Engineer can gather the different functions created by the data engineer and data scientist and create this automated pipeline.

Tutorial steps:

MLRun installation and configuration#

Before running this notebook, make sure the mlrun package is installed (pip install mlrun) and that you have configured the access to the MLRun service.

# Install MLRun if not installed, run this only once (restart the notebook after the install !!!)
%pip install mlrun

Set up the project and functions#

Get or create a project

There are three ways to create/load MLRun projects:

  • mlrun.projects.new_project() — Create a new MLRun project and optionally load it from a yaml/zip/git template.

  • mlrun.projects.load_project() — Load a project from a context directory or remote git/zip/tar archive.

  • mlrun.projects.get_or_create_project() — Load a project from the MLRun DB if it exists, or from a specified context/archive.

Projects refer to a context directory that holds all the project code and configuration. Its default value is "./", which is the directory the MLRun client runs from. The context dir is usually mapped to a git repository and/or to an IDE (PyCharm, VSCode, etc.) project.
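
For reference, the first two options look roughly like this (a sketch only; the project name is a placeholder and the git URL reuses the example archive shown later in this tutorial):

import mlrun

# Create a brand-new project in the local context directory (the name is illustrative)
new_proj = mlrun.projects.new_project("my-new-project", context="./")

# Load an existing project definition from a remote git archive into the local context
loaded_proj = mlrun.projects.load_project(context="./", url="git://github.com/mlrun/project-archive.git")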

import mlrun

project = mlrun.get_or_create_project("tutorial", context="./", user_project=True)
> 2022-09-20 14:59:47,322 [info] loaded project tutorial from MLRun DB

Register project functions#

To run workflows, you must save the definitions for the functions in the project so that function objects are initialized automatically when you load a project or when running a project version in automated CI/CD workflows. In addition, you might want to set/register other project attributes such as global parameters, secrets, and data.

Functions are registered using the set_function() command, where you can specify the code, requirements, image, etc. Functions can be created from a single code/notebook file or have access to the entire project context directory. (Adding the with_repo=True flag ensures that the project context is cloned into the function runtime environment.)

Function registration examples:

    # Example: register a notebook file as a function
    project.set_function('mynb.ipynb', name='test-function', image="mlrun/mlrun", handler="run_test")

    # Define a job (batch) function that uses code/libs from the project repo
    project.set_function(
        name="myjob", handler="my_module.job_handler",
        image="mlrun/mlrun", kind="job", with_repo=True,
    )

Function code

Run the following cell to generate the data prep file (or copy it manually):

%%writefile data-prep.py

import pandas as pd
from sklearn.datasets import load_breast_cancer

import mlrun


@mlrun.handler(outputs=["dataset", "label_column"])
def breast_cancer_generator():
    """
    A function that generates the breast cancer dataset
    """
    breast_cancer = load_breast_cancer()
    breast_cancer_dataset = pd.DataFrame(
        data=breast_cancer.data, columns=breast_cancer.feature_names
    )
    breast_cancer_labels = pd.DataFrame(data=breast_cancer.target, columns=["label"])
    breast_cancer_dataset = pd.concat(
        [breast_cancer_dataset, breast_cancer_labels], axis=1
    )

    return breast_cancer_dataset, "label"
Overwriting data-prep.py

Register the function above in the project

project.set_function(
    "data-prep.py",
    name="data-prep",
    kind="job",
    image="mlrun/mlrun",
    handler="breast_cancer_generator",
)
<mlrun.runtimes.kubejob.KubejobRuntime at 0x7fd96c30a0a0>

Register additional project objects and metadata

You can define other objects (workflows, artifacts, secrets) and parameters in the project and use them in your functions, for example:

    # Register a simple named artifact in the project (to be used in workflows)  
    data_url = 'https://s3.wasabisys.com/iguazio/data/iris/iris.data.raw.csv'
    project.set_artifact('data', target_path=data_url)

    # Add a multi-stage workflow (./workflow.py) to the project with the name 'main' and save the project 
    project.set_workflow('main', "./workflow.py")
    
    # Read env vars from dict or file and set as project secrets
    project.set_secrets({"SECRET1": "value"})
    project.set_secrets(file_path="secrets.env")
    
    project.spec.params = {"x": 5}

Save the project

# Save the project in the db (and into the project.yaml file)
project.save()
<mlrun.projects.project.MlrunProject at 0x7fd96c2fdb50>

When you save the project it stores the project definitions in the project.yaml. This allows reconstructing the project in a remote cluster or a CI/CD system.

See the generated project file: project.yaml.

Work with GIT and archives#

Push the project code/metadata into an archive#

Use standard git commands to push the current project tree into a remote git repository. Make sure you .save() the project before pushing it.

git remote add origin <server>
git commit -m "Commit message"
git push origin master

Alternatively, you can use MLRun SDK calls:

  • project.create_remote(git_uri, branch=branch) — to register the remote Git path

  • project.push() — save the project state and commit/push updates to the remote git repo
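
For example (a sketch only; the repository URL and branch below are placeholders for your own remote):

# Register a remote git repository for the project and push the current state
project.create_remote("https://github.com/<org>/<repo>.git", branch="main")
project.push(branch="main", message="update project definitions")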

You can also save the project content and metadata into a local or remote .zip archive, for example:

project.export("../archive1.zip")
project.export("s3://my-bucket/archive1.zip")
project.export(f"v3io://projects/{project.name}/archive1.zip")

Load a project from local/remote archive#

The project metadata and context (code and configuration) can be loaded and initialized using the load_project() method. When url (of the git/zip/tar) is specified, it clones a remote repo into the local context dir.

# Load the project and run the 'main' workflow
project = mlrun.load_project(context="./", name="myproj", url="git://github.com/mlrun/project-archive.git")
project.run("main", arguments={'data': data_url})

Projects can also be loaded and executed using the CLI:

mlrun project -n myproj -u "git://github.com/mlrun/project-archive.git" .
mlrun project -r main -w -a data=<data-url> .
# load the project in the current context dir
project = mlrun.load_project("./")

Build and run automated ML pipelines and CI/CD#

A pipeline is created by running an MLRun "workflow". The following code defines a workflow and writes it to a file in your local directory, with the file name workflow.py. The workflow describes a directed acyclic graph (DAG) which is executed using the local, remote, or kubeflow engines.

See running a multi-stage workflow. The defined pipeline includes the following steps:

  • Generate/prepare the data (ingest).

  • Train the model (train).

  • Deploy the model as a real-time serverless function (serving).

Note

A pipeline can also include continuous build integration and deployment (CI/CD) steps, such as building container images and deploying models.

%%writefile './workflow.py'

from kfp import dsl
import mlrun

# Create a Kubeflow Pipelines pipeline
@dsl.pipeline(name="breast-cancer-demo")
def pipeline(model_name="cancer-classifier"):
    # Run the ingestion function with the new image and params
    ingest = mlrun.run_function(
        "data-prep",
        name="get-data",
        outputs=["dataset"],
    )

    # Train a model using the auto_trainer hub function
    train = mlrun.run_function(
        "hub://auto_trainer",
        inputs={"dataset": ingest.outputs["dataset"]},
        params = {
            "model_class": "sklearn.ensemble.RandomForestClassifier",
            "train_test_split_size": 0.2,
            "label_columns": "label",
            "model_name": model_name,
        }, 
        handler='train',
        outputs=["model"],
    )

    # Deploy the trained model as a serverless function
    serving_fn = mlrun.new_function("serving", image="mlrun/mlrun", kind="serving")
    serving_fn.with_code(body=" ")
    mlrun.deploy_function(
        serving_fn,
        models=[
            {
                "key": model_name,
                "model_path": train.outputs["model"],
                "class_name": 'mlrun.frameworks.sklearn.SklearnModelServer',
            }
        ],
    )
Writing ./workflow.py

Run the workflow

# Run the workflow
run_id = project.run(
    workflow_path="./workflow.py",
    arguments={"model_name": "cancer-classifier"},
    watch=True,
)
Pipeline running (id=6907fa23-dcdc-49bd-adbd-dfb7f8d25997), click here to view the details in MLRun UI

Run Results

Workflow 6907fa23-dcdc-49bd-adbd-dfb7f8d25997 finished, state=Succeeded

  • auto-trainer-train (started Sep 20 15:00:35, completed)
    parameters: model_class=sklearn.ensemble.RandomForestClassifier, train_test_split_size=0.2, label_columns=label, model_name=cancer-classifier
    results: accuracy=0.956140350877193, f1_score=0.9635036496350365, precision_score=0.9565217391304348, recall_score=0.9705882352941176

  • get-data (started Sep 20 15:00:07, completed)
    results: label_column=label

View the pipeline in MLRun UI

workflow


Run workflows using the CLI

With MLRun you can use a single command to load the code from local dir or remote archive (Git, zip, …) and execute a pipeline. This can be very useful for integration with CI/CD frameworks and practices. See CI/CD integration for more details.

The following command loads the project from the current dir (.) and executes the workflow with an argument, for running on k8s.

mlrun project -r ./workflow.py -w -a model_name=classifier2 .

Test the deployed model endpoint#

Now that your model is deployed using the pipeline, you can invoke it as usual:

serving_fn = project.get_function("serving")
# Create a mock (simulator of the real-time function)
my_data = {
    "inputs": [
        [
            1.371e01,
            2.083e01,
            9.020e01,
            5.779e02,
            1.189e-01,
            1.645e-01,
            9.366e-02,
            5.985e-02,
            2.196e-01,
            7.451e-02,
            5.835e-01,
            1.377e00,
            3.856e00,
            5.096e01,
            8.805e-03,
            3.029e-02,
            2.488e-02,
            1.448e-02,
            1.486e-02,
            5.412e-03,
            1.706e01,
            2.814e01,
            1.106e02,
            8.970e02,
            1.654e-01,
            3.682e-01,
            2.678e-01,
            1.556e-01,
            3.196e-01,
            1.151e-01,
        ]
    ]
}
serving_fn.invoke("/v2/models/cancer-classifier/infer", body=my_data)
> 2022-09-20 15:09:02,664 [info] invoking function: {'method': 'POST', 'path': 'http://nuclio-tutorial-iguazio-serving.default-tenant.svc.cluster.local:8080/v2/models/cancer-classifier/infer'}
{'id': '7ecaf987-bd79-470e-b930-19959808b678',
 'model_name': 'cancer-classifier',
 'outputs': [0]}

Done!#

Congratulations! You’ve completed Part 4 of the MLRun getting-started tutorial. To continue, proceed to Part 5 Model monitoring and drift detection.

You might also want to explore the additional MLRun demos.

Model monitoring and drift detection#

This tutorial illustrates leveraging the model monitoring capabilities of MLRun to deploy a model to a live endpoint and calculate data drift.

Make sure you have reviewed the basics in MLRun Quick Start Tutorial.

Tutorial steps:

MLRun installation and configuration#

Before running this notebook make sure mlrun is installed and that you have configured the access to the MLRun service.

# Install MLRun if not installed, run this only once (restart the notebook after the install !!!)
%pip install mlrun tqdm ipywidgets

Set up the project#

First, import the dependencies and create an MLRun project. This contains all of the models, functions, datasets, etc.:

import os

import mlrun
import pandas as pd
project = mlrun.get_or_create_project(name="tutorial", context="./", user_project=True)
> 2023-03-12 17:02:37,120 [info] loaded project tutorial from MLRun DB

Note

This tutorial does not focus on training a model. Instead, it starts with a trained model and its corresponding training dataset.

Next, log the following model file and dataset to deploy and calculate data drift. The model is an AdaBoostClassifier from sklearn, and the dataset is in CSV format.

# We choose the correct model to avoid pickle warnings
import sys

suffix = (
    mlrun.__version__.split("-")[0].replace(".", "_")
    if sys.version_info[1] > 7
    else "3.7"
)

model_path = mlrun.get_sample_path(f"models/model-monitoring/model-{suffix}.pkl")
training_set_path = mlrun.get_sample_path("data/model-monitoring/iris_dataset.csv")

Log the model with training data#

Log the model using MLRun experiment tracking. This is usually done in a training pipeline, but you can also bring in your pre-trained models from other sources. See Working with data and model artifacts and Automated experiment tracking for more information.

model_name = "RandomForestClassifier"
model_artifact = project.log_model(
    key=model_name,
    model_file=model_path,
    framework="sklearn",
    training_set=pd.read_csv(training_set_path),
    label_column="label",
)
# the model artifact unique URI
model_artifact.uri
'store://models/tutorial-yonis/RandomForestClassifier#0:9e8859ee-dc11-4874-a4f7-ebdce46a5a82'

Import and deploy the serving function#

Import the model server function from the MLRun Function Hub. Additionally, mount the filesystem, add the model that was logged via experiment tracking, and enable drift detection.

The core line here is serving_fn.set_tracking(), which creates the required infrastructure behind the scenes to perform drift detection. See the Model monitoring overview for more info on what is deployed.

# Import the serving function from the Function Hub and mount filesystem
serving_fn = mlrun.import_function("hub://v2_model_server", new_name="serving")

# Add the model to the serving function's routing spec
serving_fn.add_model(model_name, model_path=model_artifact.uri)

# Enable model monitoring
serving_fn.set_tracking()
Deploy the serving function with drift detection#

Deploy the serving function with drift detection enabled with a single line of code:

mlrun.deploy_function(serving_fn)
> 2023-03-12 17:02:38,651 [info] Starting remote function deploy
2023-03-12 17:02:40  (info) Deploying function
2023-03-12 17:02:40  (info) Building
2023-03-12 17:02:40  (info) Staging files and preparing base images
2023-03-12 17:02:40  (info) Building processor image
2023-03-12 17:03:40  (info) Build complete
2023-03-12 17:03:51  (info) Function deploy complete
> 2023-03-12 17:03:52,969 [info] successfully deployed function: {'internal_invocation_urls': ['nuclio-tutorial-yonis-serving.default-tenant.svc.cluster.local:8080'], 'external_invocation_urls': ['tutorial-yonis-serving-tutorial-yonis.default-tenant.app.vmdev30.lab.iguazeng.com/']}
DeployStatus(state=ready, outputs={'endpoint': 'http://tutorial-yonis-serving-tutorial-yonis.default-tenant.app.vmdev30.lab.iguazeng.com/', 'name': 'tutorial-yonis-serving'})

View deployed resources#

At this point, you should see the newly deployed model server, as well as a model-monitoring-stream, and a scheduled job (in yellow). The model-monitoring-stream collects, processes, and saves the incoming requests to the model server. The scheduled job does the actual calculation (by default every hour).

Note

You will not see model-monitoring-batch jobs listed until they actually run (by default every hour).

drift_table_plot

Simulate production traffic#

Next, use the following code to simulate incoming production data using elements from the training set. Because the data is coming from the same training set you logged, you should not expect any data drift.

Note

By default, the drift calculation starts via the scheduled hourly batch job after receiving 10,000 incoming requests.

import json
import logging
from random import choice

from tqdm.notebook import tqdm

# Suppress print messages
logging.getLogger(name="mlrun").setLevel(logging.WARNING)

# Get training set as list
iris_data = (
    pd.read_csv(training_set_path).drop("label", axis=1).to_dict(orient="split")["data"]
)

# Simulate traffic using random elements from training set
for i in tqdm(range(12_000)):
    data_point = choice(iris_data)
    serving_fn.invoke(
        f"v2/models/{model_name}/infer", json.dumps({"inputs": [data_point]})
    )

# Resume normal logging
logging.getLogger(name="mlrun").setLevel(logging.INFO)

View drift calculations and status#

Once data drift has been calculated, you can view it in the MLRun UI. This includes a high-level overview of the model status:

model_endpoint_1

A more detailed view on model information and overall drift metrics:

model_endpoint_2

As well as a view for feature-level distributions and drift metrics:

model_endpoint_3

View detailed drift dashboards#

Finally, there are also more detailed Grafana dashboards that show additional information on each model in the project:

For more information on accessing these dashboards, see Model monitoring using Grafana dashboards.

grafana_dashboard_1

Graphs of individual features over time:

grafana_dashboard_2

As well as drift and operational metrics over time:

grafana_dashboard_3

Add MLOps to existing code#

This tutorial showcases how easy it is to apply MLRun on your existing code. With only 7 lines of code, you get:

  • Experiment tracking — Track every single run of your experiment to learn what yielded the best results.

  • Automatic Logging — Log datasets, metric results, and plots with one line of code. MLRun takes care of all the rest.

  • Parameterization — Enable running your code with different parameters, run hyperparameters tuning and get the most out of your code.

  • Resource management — Control the amount of resources available for your experiment.

Use this kaggle code by Sylas as an example, part of the competition New York City Taxi Fare Prediction.

Tutorial steps:

Get the data#

You can download the original data from kaggle. However, since the original data is 5.7GB in size, this demo uses sampled data. Because this demo uses MLRun's DataItem to pass the datasets, the sampled data is downloaded automatically. If you want to look at the data, you can download it: training set, and testing set.
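
If you want a quick look at the sampled data itself, you can also read it through MLRun's DataItem interface (a sketch; it uses the same training-set URL that is passed to the run later in this tutorial):

import mlrun

# Fetch the sampled training CSV as a DataItem and read it into a pandas DataFrame
train_item = mlrun.get_dataitem("https://s3.us-east-1.wasabisys.com/iguazio/data/nyc-taxi/train.csv")
train_df = train_item.as_df()
print(train_df.shape)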

Code review#

Use the original code with the minimum changes required to apply MLRun to it. The code itself is straightforward:

  1. Read the training data and perform feature engineering on it to preprocess it for training.

  2. Train a LightGBM regression model using LightGBM's train function.

  3. Read the testing data and save the contest expected submission file.

You can download the script.py file, or copy/paste it from here:

import gc

import lightgbm as lgbm
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# [MLRun] Import MLRun:
import mlrun
from mlrun.frameworks.lgbm import apply_mlrun

# [MLRun] Get MLRun's context:
context = mlrun.get_or_create_ctx("apply-mlrun-tutorial")

# [MLRun] Reading train data from context instead of local file:
train_df = context.get_input("train_set", "./train.csv").as_df()
# train_df =  pd.read_csv('./train.csv')

# Drop rows with null values
train_df = train_df.dropna(how="any", axis="rows")


def clean_df(df):
    return df[
        (df.fare_amount > 0)
        & (df.fare_amount <= 500)
        &
        # (df.passenger_count >= 0) & (df.passenger_count <= 8)  &
        (
            (df.pickup_longitude != 0)
            & (df.pickup_latitude != 0)
            & (df.dropoff_longitude != 0)
            & (df.dropoff_latitude != 0)
        )
    ]


train_df = clean_df(train_df)


# To Compute Haversine distance
def sphere_dist(pickup_lat, pickup_lon, dropoff_lat, dropoff_lon):
    """
    Return distance along great radius between pickup and drop-off coordinates.
    """
    # Define earth radius (km)
    R_earth = 6371
    # Convert degrees to radians
    pickup_lat, pickup_lon, dropoff_lat, dropoff_lon = map(
        np.radians, [pickup_lat, pickup_lon, dropoff_lat, dropoff_lon]
    )
    # Compute distances along lat, lon dimensions
    dlat = dropoff_lat - pickup_lat
    dlon = dropoff_lon - pickup_lon

    # Compute haversine distance
    a = (
        np.sin(dlat / 2.0) ** 2
        + np.cos(pickup_lat) * np.cos(dropoff_lat) * np.sin(dlon / 2.0) ** 2
    )
    return 2 * R_earth * np.arcsin(np.sqrt(a))


def sphere_dist_bear(pickup_lat, pickup_lon, dropoff_lat, dropoff_lon):
    """
    Return the bearing between pickup and drop-off coordinates.
    """
    # Convert degrees to radians
    pickup_lat, pickup_lon, dropoff_lat, dropoff_lon = map(
        np.radians, [pickup_lat, pickup_lon, dropoff_lat, dropoff_lon]
    )
    # Compute distances along lat, lon dimensions
    dlon = pickup_lon - dropoff_lon

    # Compute bearing distance
    a = np.arctan2(
        np.sin(dlon * np.cos(dropoff_lat)),
        np.cos(pickup_lat) * np.sin(dropoff_lat)
        - np.sin(pickup_lat) * np.cos(dropoff_lat) * np.cos(dlon),
    )
    return a


def radian_conv(degree):
    """
    Return radian.
    """
    return np.radians(degree)


def add_airport_dist(dataset):
    """
    Return minimum distance from pickup or drop-off coordinates to each airport.
    JFK: John F. Kennedy International Airport
    EWR: Newark Liberty International Airport
    LGA: LaGuardia Airport
    SOL: Statue of Liberty
    NYC: Newyork Central
    """
    jfk_coord = (40.639722, -73.778889)
    ewr_coord = (40.6925, -74.168611)
    lga_coord = (40.77725, -73.872611)
    sol_coord = (40.6892, -74.0445)  # Statue of Liberty
    nyc_coord = (40.7141667, -74.0063889)

    pickup_lat = dataset["pickup_latitude"]
    dropoff_lat = dataset["dropoff_latitude"]
    pickup_lon = dataset["pickup_longitude"]
    dropoff_lon = dataset["dropoff_longitude"]

    pickup_jfk = sphere_dist(pickup_lat, pickup_lon, jfk_coord[0], jfk_coord[1])
    dropoff_jfk = sphere_dist(jfk_coord[0], jfk_coord[1], dropoff_lat, dropoff_lon)
    pickup_ewr = sphere_dist(pickup_lat, pickup_lon, ewr_coord[0], ewr_coord[1])
    dropoff_ewr = sphere_dist(ewr_coord[0], ewr_coord[1], dropoff_lat, dropoff_lon)
    pickup_lga = sphere_dist(pickup_lat, pickup_lon, lga_coord[0], lga_coord[1])
    dropoff_lga = sphere_dist(lga_coord[0], lga_coord[1], dropoff_lat, dropoff_lon)
    pickup_sol = sphere_dist(pickup_lat, pickup_lon, sol_coord[0], sol_coord[1])
    dropoff_sol = sphere_dist(sol_coord[0], sol_coord[1], dropoff_lat, dropoff_lon)
    pickup_nyc = sphere_dist(pickup_lat, pickup_lon, nyc_coord[0], nyc_coord[1])
    dropoff_nyc = sphere_dist(nyc_coord[0], nyc_coord[1], dropoff_lat, dropoff_lon)

    dataset["jfk_dist"] = pickup_jfk + dropoff_jfk
    dataset["ewr_dist"] = pickup_ewr + dropoff_ewr
    dataset["lga_dist"] = pickup_lga + dropoff_lga
    dataset["sol_dist"] = pickup_sol + dropoff_sol
    dataset["nyc_dist"] = pickup_nyc + dropoff_nyc

    return dataset


def add_datetime_info(dataset):
    # Convert to datetime format
    dataset["pickup_datetime"] = pd.to_datetime(
        dataset["pickup_datetime"], format="%Y-%m-%d %H:%M:%S UTC"
    )

    dataset["hour"] = dataset.pickup_datetime.dt.hour
    dataset["day"] = dataset.pickup_datetime.dt.day
    dataset["month"] = dataset.pickup_datetime.dt.month
    dataset["weekday"] = dataset.pickup_datetime.dt.weekday
    dataset["year"] = dataset.pickup_datetime.dt.year

    return dataset


train_df = add_datetime_info(train_df)
train_df = add_airport_dist(train_df)
train_df["distance"] = sphere_dist(
    train_df["pickup_latitude"],
    train_df["pickup_longitude"],
    train_df["dropoff_latitude"],
    train_df["dropoff_longitude"],
)

train_df["bearing"] = sphere_dist_bear(
    train_df["pickup_latitude"],
    train_df["pickup_longitude"],
    train_df["dropoff_latitude"],
    train_df["dropoff_longitude"],
)
train_df["pickup_latitude"] = radian_conv(train_df["pickup_latitude"])
train_df["pickup_longitude"] = radian_conv(train_df["pickup_longitude"])
train_df["dropoff_latitude"] = radian_conv(train_df["dropoff_latitude"])
train_df["dropoff_longitude"] = radian_conv(train_df["dropoff_longitude"])


train_df.drop(columns=["key", "pickup_datetime"], inplace=True)

y = train_df["fare_amount"]
train_df = train_df.drop(columns=["fare_amount"])


print(train_df.head())

x_train, x_test, y_train, y_test = train_test_split(
    train_df, y, random_state=123, test_size=0.10
)

del train_df
del y
gc.collect()

params = {
    "boosting_type": "gbdt",
    "objective": "regression",
    "nthread": 4,
    "num_leaves": 31,
    "learning_rate": 0.05,
    "max_depth": -1,
    "subsample": 0.8,
    "bagging_fraction": 1,
    "max_bin": 5000,
    "bagging_freq": 20,
    "colsample_bytree": 0.6,
    "metric": "rmse",
    "min_split_gain": 0.5,
    "min_child_weight": 1,
    "min_child_samples": 10,
    "scale_pos_weight": 1,
    "zero_as_missing": True,
    "seed": 0,
    # "categorical_feature": "name:year,month,day,weekday",
}

train_set = lgbm.Dataset(x_train, y_train)
valid_set = lgbm.Dataset(x_test, y_test)

# [MLRun] Apply MLRun on the LightGBM module:
apply_mlrun(context=context)

model = lgbm.train(
    params,
    num_boost_round=10000,
    train_set=train_set,
    valid_sets=[valid_set],
    callbacks=[lgbm.early_stopping(stopping_rounds=500)],
)

del x_train
del y_train
del x_test
del y_test
gc.collect()

# [MLRun] Reading test data from context instead of local file:
test_df = context.get_input("test_set", "./test.csv").as_df()
# test_df =  pd.read_csv('./test.csv')
print(test_df.head())
test_df = add_datetime_info(test_df)
test_df = add_airport_dist(test_df)
test_df["distance"] = sphere_dist(
    test_df["pickup_latitude"],
    test_df["pickup_longitude"],
    test_df["dropoff_latitude"],
    test_df["dropoff_longitude"],
)

test_df["bearing"] = sphere_dist_bear(
    test_df["pickup_latitude"],
    test_df["pickup_longitude"],
    test_df["dropoff_latitude"],
    test_df["dropoff_longitude"],
)
test_df["pickup_latitude"] = radian_conv(test_df["pickup_latitude"])
test_df["pickup_longitude"] = radian_conv(test_df["pickup_longitude"])
test_df["dropoff_latitude"] = radian_conv(test_df["dropoff_latitude"])
test_df["dropoff_longitude"] = radian_conv(test_df["dropoff_longitude"])


test_key = test_df["key"]
test_df = test_df.drop(columns=["key", "pickup_datetime"])

# Predict from test set
prediction = model.predict(test_df, num_iteration=model.best_iteration)
submission = pd.DataFrame({"key": test_key, "fare_amount": prediction})

# [MLRun] Log the submission instead of saving it locally:
context.log_dataset(key="taxi_fare_submission", df=submission, format="csv")
# submission.to_csv('taxi_fare_submission.csv',index=False)

This demo focuses on reviewing the changes/additions made to the original code so that you can apply MLRun on top of it. Seven lines of code are added/replaced, as you can see in the sections below:

Initialization#
Imports#

On lines 9-10, add 2 imports:

  • mlrun — Import MLRun of course.

  • apply_mlrun — Use the apply_mlrun function from MLRun's frameworks, a sub-package for common ML/DL frameworks integrations with MLRun.

import mlrun
from mlrun.frameworks.lgbm import apply_mlrun
MLRun context#

To get parameters and inputs into the code, you need to get MLRun's context. Use the function get_or_create_ctx.

Line 13:

context = mlrun.get_or_create_ctx("apply-mlrun-tutorial")
Get Training Set#

In the original code the training set was read from a local file. Now you want to get it from the user who runs the code. Use the context to get the "train_set" input with the get_input method. To maintain the original logic, include the default path for when the training set is not provided by the user.

Line 16:

train_df = context.get_input("train_set", "./train.csv").as_df()  
# Instead of: `train_df =  pd.read_csv('./train.csv')`
Apply MLRun#

Now use the apply_mlrun function from MLRun's LightGBM framework integration. MLRun automatically wraps the LightGBM module and enables automatic logging and evaluation.

Line 209:

apply_mlrun(context=context)
Logging the dataset#

Similar to the way you got the training set, you get the test dataset as an input from the MLRun context.

Line 226:

test_df = context.get_input("test_set", "./test.csv").as_df()
# Instead of: `test_df =  pd.read_csv('./test.csv')`
Save the submission#

Finally, instead of saving the result locally, log the submission to MLRun.

Line 258:

context.log_dataset(key="taxi_fare_submission", df=submission, format="csv")  
# Instead of: `submission.to_csv('taxi_fare_submission.csv',index=False)`

Run the script with MLRun#

Now you can run the script and see MLRun in action.

import mlrun
Create a project#

Create a project using the function get_or_create_project. To read more about MLRun projects, see Projects.

project = mlrun.get_or_create_project(
    name="apply-mlrun-tutorial", context="./", user_project=True
)
> 2022-08-09 18:21:26,785 [info] loaded project apply-mlrun-tutorial from MLRun DB
Create a function#

Create an MLRun function using the function code_to_function. To read more about MLRun functions, see Functions.

script_function = mlrun.code_to_function(
    filename="./src/script.py",
    name="apply-mlrun-tutorial-function",
    kind="job",
    image="mlrun/mlrun",
    requirements=["lightgbm"],
)
script_function.deploy()
Run the function#

Now you can run the function, providing it with the inputs you want. Use the dataset links to send them to the function. MLRun downloads and reads them into a pd.DataFrame automatically.

script_run = script_function.run(
    inputs={
        "train_set": "https://s3.us-east-1.wasabisys.com/iguazio/data/nyc-taxi/train.csv",
        "test_set": "https://s3.us-east-1.wasabisys.com/iguazio/data/nyc-taxi/test.csv",
    },
)
> 2022-08-09 18:21:26,851 [info] starting run apply-mlrun-tutorial-function uid=8d82ef16a15d4151a16060c13b133170 DB=http://mlrun-api:8080
> 2022-08-09 18:21:27,017 [info] handler was not provided running main (./script.py)
> 2022-08-09 18:21:39,330 [info] logging run results to: http://mlrun-api:8080
   pickup_longitude  pickup_latitude  ...  distance   bearing
0         -1.288826         0.710721  ...  1.030764 -2.918897
1         -1.291824         0.710546  ...  8.450134 -0.375217
2         -1.291242         0.711418  ...  1.389525  2.599961
3         -1.291319         0.710927  ...  2.799270  0.133905
4         -1.290987         0.711536  ...  1.999157 -0.502703

[5 rows x 17 columns]
[LightGBM] [Warning] bagging_fraction is set=1, subsample=0.8 will be ignored. Current value: bagging_fraction=1
[LightGBM] [Warning] Met categorical feature which contains sparse values. Consider renumbering to consecutive integers started from zero
[LightGBM] [Warning] bagging_fraction is set=1, subsample=0.8 will be ignored. Current value: bagging_fraction=1
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.008352 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 55092
[LightGBM] [Info] Number of data points in the train set: 194071, number of used features: 17
[LightGBM] [Warning] bagging_fraction is set=1, subsample=0.8 will be ignored. Current value: bagging_fraction=1
[LightGBM] [Info] Start training from score 11.335573
                           key  ... passenger_count
0  2015-01-27 13:08:24.0000002  ...               1
1  2015-01-27 13:08:24.0000003  ...               1
2  2011-10-08 11:53:44.0000002  ...               1
3  2012-12-01 21:12:12.0000002  ...               1
4  2012-12-01 21:12:12.0000003  ...               1

[5 rows x 7 columns]
  • apply-mlrun-tutorial-function (project: apply-mlrun-tutorial-guyl, uid: ...3b133170, iter: 0, started Aug 09 18:21:39, completed)
    labels: v3io_user=guyl, kind=, owner=guyl, host=jupyter-guyl-66857b7999-xnncv
    inputs: train_set, test_set
    results: valid_0_rmse=3.905279481685527
    artifacts: valid_0_rmse_plot, valid_0-feature-importance, valid_0, model, taxi_fare_submission
> 2022-08-09 18:22:02,987 [info] run executed, status=completed

Review outputs#

To view the outputs yielded by the MLRun automatic logging and evaluation, call the outputs property on the run object:

script_run.outputs
{'valid_0_rmse': 3.905279481685527,
 'valid_0_rmse_plot': 'v3io:///projects/apply-mlrun-tutorial-guyl/artifacts/apply-mlrun-tutorial-function/0/valid_0_rmse_plot.html',
 'valid_0-feature-importance': 'v3io:///projects/apply-mlrun-tutorial-guyl/artifacts/apply-mlrun-tutorial-function/0/valid_0-feature-importance.html',
 'valid_0': 'store://artifacts/apply-mlrun-tutorial-guyl/apply-mlrun-tutorial-function_valid_0:8d82ef16a15d4151a16060c13b133170',
 'model': 'store://artifacts/apply-mlrun-tutorial-guyl/model:8d82ef16a15d4151a16060c13b133170',
 'taxi_fare_submission': 'store://artifacts/apply-mlrun-tutorial-guyl/apply-mlrun-tutorial-function_taxi_fare_submission:8d82ef16a15d4151a16060c13b133170'}

MLRun automatically detects all the metrics calculated and collects the data along with the training. Here there was one validation set named valid_0 and the RMSE metric was calculated on it. You can see the per-iteration RMSE plot, the final score, and the feature importance plot.

You can explore the different artifacts by calling the artifact function like this:

script_run.artifact("valid_0_rmse_plot").show()
script_run.artifact("valid_0-feature-importance").show()

And of course, you can also see the submission that was logged:

script_run.artifact("taxi_fare_submission").show()
key fare_amount
0 2015-01-27 13:08:24.0000002 10.281408
1 2015-01-27 13:08:24.0000003 11.019641
2 2011-10-08 11:53:44.0000002 4.898061
3 2012-12-01 21:12:12.0000002 7.758042
4 2012-12-01 21:12:12.0000003 15.298775
... ... ...
9909 2015-05-10 12:37:51.0000002 9.117569
9910 2015-01-12 17:05:51.0000001 10.850885
9911 2015-04-19 20:44:15.0000001 55.048856
9912 2015-01-31 01:05:19.0000005 20.110280
9913 2015-01-18 14:06:23.0000006 7.081041

9914 rows × 2 columns

Batch inference and drift detection#

This tutorial leverages a function from the MLRun Function Hub to perform batch inference using a logged model and a new prediction dataset. The function also calculates data drift by comparing the new prediction dataset with the original training set.

Make sure you have reviewed the basics in MLRun Quick Start Tutorial.

Tutorial steps:

MLRun installation and configuration#

Before running this notebook make sure mlrun is installed and that you have configured the access to the MLRun service.

# Install MLRun if not installed, run this only once (restart the notebook after the install !!!)
%pip install mlrun

Set up a project#

First, import the dependencies and create an MLRun project. The project contains all of your models, functions, datasets, etc.:

import mlrun
import pandas as pd
project = mlrun.get_or_create_project("tutorial", context="./", user_project=True)
> 2023-09-14 11:28:28,749 [info] Loading project from path: {'project_name': 'tutorial', 'path': './'}
> 2023-09-14 11:28:44,183 [info] Project loaded successfully: {'project_name': 'tutorial', 'path': './', 'stored_in_db': True}

Note

This tutorial does not focus on training a model. Instead, it starts with a trained model and its corresponding training and prediction dataset.

You'll use the following model files and datasets to perform the batch prediction. The model is a DecisionTreeClassifier from sklearn and the datasets are in parquet format.

# We choose the correct model to avoid pickle warnings
import sys

suffix = (
    mlrun.__version__.split("-")[0].replace(".", "_")
    if sys.version_info[1] > 7
    else "3.7"
)

model_path = mlrun.get_sample_path(f"models/batch-predict/model-{suffix}.pkl")
training_set_path = mlrun.get_sample_path("data/batch-predict/training_set.parquet")
prediction_set_path = mlrun.get_sample_path("data/batch-predict/prediction_set.parquet")

View the data#

The training data has 20 numerical features and a binary (0,1) label:

pd.read_parquet(training_set_path).head()
feature_0 feature_1 feature_2 feature_3 feature_4 feature_5 feature_6 feature_7 feature_8 feature_9 ... feature_11 feature_12 feature_13 feature_14 feature_15 feature_16 feature_17 feature_18 feature_19 label
0 0.572754 0.171079 0.403080 0.955429 0.272039 0.360277 -0.995429 0.437239 0.991556 0.010004 ... 0.112194 -0.319256 -0.392631 -0.290766 1.265054 1.037082 -1.200076 0.820992 0.834868 0
1 0.623733 -0.149823 -1.410537 -0.729388 -1.996337 -1.213348 1.461307 1.187854 -1.790926 -0.981600 ... 0.428653 -0.503820 -0.798035 2.038105 -3.080463 0.408561 1.647116 -0.838553 0.680983 1
2 0.814168 -0.221412 0.020822 1.066718 -0.573164 0.067838 0.923045 0.338146 0.981413 1.481757 ... -1.052559 -0.241873 -1.232272 -0.010758 0.806800 0.661162 0.589018 0.522137 -0.924624 0
3 1.062279 -0.966309 0.341471 -0.737059 1.460671 0.367851 -0.435336 0.445308 -0.655663 -0.196220 ... 0.641017 0.099059 1.902592 -1.024929 0.030703 -0.198751 -0.342009 -1.286865 -1.118373 1
4 0.195755 0.576332 -0.260496 0.841489 0.398269 -0.717972 0.810550 -1.058326 0.368610 0.606007 ... 0.195267 0.876144 0.151615 0.094867 0.627353 -0.389023 0.662846 -0.857000 1.091218 1

5 rows × 21 columns

The prediction data has 20 numerical features, but no label - this is what you will predict:

pd.read_parquet(prediction_set_path).head()
feature_0 feature_1 feature_2 feature_3 feature_4 feature_5 feature_6 feature_7 feature_8 feature_9 feature_10 feature_11 feature_12 feature_13 feature_14 feature_15 feature_16 feature_17 feature_18 feature_19
0 -2.059506 -1.314291 2.721516 -2.132869 -0.693963 0.376643 3.017790 3.876329 -1.294736 0.030773 0.401491 2.775699 2.361580 0.173441 0.879510 1.141007 4.608280 -0.518388 0.129690 2.794967
1 -1.190382 0.891571 3.726070 0.673870 -0.252565 -0.729156 2.646563 4.782729 0.318952 -0.781567 1.473632 1.101721 3.723400 -0.466867 -0.056224 3.344701 0.194332 0.463992 0.292268 4.665876
2 -0.996384 -0.099537 3.421476 0.162771 -1.143458 -1.026791 2.114702 2.517553 -0.154620 -0.465423 -1.723025 1.729386 2.820340 -1.041428 -0.331871 2.909172 2.138613 -0.046252 -0.732631 4.716266
3 -0.289976 -1.680019 3.126478 -0.704451 -1.149112 1.174962 2.860341 3.753661 -0.326119 2.128411 -0.508000 2.328688 3.397321 -0.932060 -1.442370 2.058517 3.881936 2.090635 -0.045832 4.197315
4 -0.294866 1.044919 2.924139 0.814049 -1.455054 -0.270432 3.380195 2.339669 1.029101 -1.171018 -1.459395 1.283565 0.677006 -2.147444 -0.494150 3.222041 6.219348 -1.914110 0.317786 4.143443

Log the model with training data#

Next, log the model using MLRun experiment tracking. This is usually done in a training pipeline, but you can also bring in your pre-trained models from other sources. See Working with data and model artifacts and Automated experiment tracking for more information.

In this example, you are logging a training set with the model for future comparison, however you can also directly pass in your training set to the batch prediction function.

model_artifact = project.log_model(
    key="model",
    model_file=model_path,
    framework="sklearn",
    training_set=pd.read_parquet(training_set_path),
    label_column="label",
)
# the model artifact unique URI
model_artifact.uri
'store://models/tutorial-iguazio/model#0:3ba91513-7dae-45b2-b118-d2197ade55a3'

Import and run the batch inference function#

Next, import the batch inference function from the MLRun Function Hub:

fn = mlrun.import_function("hub://batch_inference_v2")
Run batch inference#

Finally, perform the batch prediction by passing in your model and datasets. In addition, you can trigger the drift analysis batch job on the provided dataset by passing "trigger_monitoring_job": True.

If you do perform drift analysis, a new model endpoint record is generated. Model endpoint is a unique MLRun entity that includes statistics and important details about your model and function. You can perform the drift analysis on an existing model endpoint, but you need to make sure that you don't mix unrelated datasets that could affect the final drift analysis process. In general, it's recommended to perform the drift analysis on a new model endpoint to avoid possible analysis conflicts.

See the corresponding batch inference example notebook for an exhaustive list of other parameters that are supported:

run = project.run_function(
    fn,
    inputs={
        "dataset": prediction_set_path,
    },
    params={
        "model_path": model_artifact.uri,
        "perform_drift_analysis": True,
        "trigger_monitoring_job": True,
    },
)
> 2023-09-14 11:34:23,114 [info] Storing function: {'name': 'batch-inference-v2-infer', 'uid': 'bfe783edeaaa46c98d3508418deae6c2', 'db': 'http://mlrun-api:8080'}
> 2023-09-14 11:34:23,809 [info] Job is running in the background, pod: batch-inference-v2-infer-xx8zv
> 2023-09-14 11:34:27,850 [info] Loading model...
> 2023-09-14 11:34:28,573 [info] Loading data...
> 2023-09-14 11:34:30,136 [info] Calculating prediction...
> 2023-09-14 11:34:30,139 [info] Logging result set (x | prediction)...
> 2023-09-14 11:34:30,431 [info] Performing drift analysis...
> 2023-09-14 11:34:32,308 [info] Storing function: {'name': 'model-monitoring-batch', 'uid': '33b8a8e07b0542d08ac6f81f6d97f240', 'db': 'http://mlrun-api:8080'}
> 2023-09-14 11:34:32,555 [info] Job is running in the background, pod: model-monitoring-batch-lm7gd
> 2023-09-14 11:34:48,049 [info] Initializing BatchProcessor: {'project': 'tutorial-iguazio'}
divide by zero encountered in log
> 2023-09-14 11:34:48,465 [info] Drift result: {'drift_result': defaultdict(<class 'dict'>, {'feature_13': {'tvd': 0.03959999999999999, 'hellinger': 0.04519449310948248, 'kld': 0.013240526944533322}, 'tvd_sum': 5.6988, 'tvd_mean': 0.2713714285714286, 'hellinger_sum': 6.916628739424696, 'hellinger_mean': 0.3293632733059379, 'kld_sum': 39.76276059167618, 'kld_mean': 1.8934647900798178, 'feature_16': {'tvd': 0.6359999999999999, 'hellinger': 0.8003245177804857, 'kld': 4.682651890289595}, 'feature_12': {'tvd': 0.599, 'hellinger': 0.807957523155725, 'kld': 4.574238261717538}, 'feature_0': {'tvd': 0.022600000000000002, 'hellinger': 0.033573681953213544, 'kld': 0.007454628335938183}, 'label': {'tvd': 0.0456, 'hellinger': 0.032273432278649546, 'kld': 0.00833405069284858}, 'feature_7': {'tvd': 0.6646, 'hellinger': 0.7949812589747411, 'kld': 4.94920993092669}, 'feature_17': {'tvd': 0.03280000000000001, 'hellinger': 0.038955714991485355, 'kld': 0.009013894995753259}, 'feature_18': {'tvd': 0.04240000000000001, 'hellinger': 0.046474652187650754, 'kld': 0.015894237456896394}, 'feature_3': {'tvd': 0.03840000000000001, 'hellinger': 0.04913963802969945, 'kld': 0.017331946342858503}, 'feature_4': {'tvd': 0.03859999999999999, 'hellinger': 0.04691128230500036, 'kld': 0.016210210229378626}, 'feature_5': {'tvd': 0.049600000000000005, 'hellinger': 0.05408439667580992, 'kld': 0.017044292121321365}, 'feature_10': {'tvd': 0.040399999999999985, 'hellinger': 0.04473407115759961, 'kld': 0.016107595511326803}, 'feature_2': {'tvd': 0.6882, 'hellinger': 0.7900559843329186, 'kld': 5.214389130958726}, 'feature_9': {'tvd': 0.042800000000000005, 'hellinger': 0.04656727009349971, 'kld': 0.01261823423079203}, 'feature_6': {'tvd': 0.6598000000000002, 'hellinger': 0.7926084404395208, 'kld': 4.680064555912078}, 'feature_8': {'tvd': 0.038400000000000004, 'hellinger': 0.039720263747100804, 'kld': 0.009435438488007906}, 'feature_14': {'tvd': 0.038, 'hellinger': 0.05472944756352956, 'kld': 0.022306296862012256}, 'feature_1': {'tvd': 0.0434, 'hellinger': 0.046301454033261864, 'kld': 0.014581581489297469}, 'feature_11': {'tvd': 0.6384000000000001, 'hellinger': 0.8058863402254881, 'kld': 4.800461353025042}, 'feature_19': {'tvd': 0.7812000000000001, 'hellinger': 0.7993397396310429, 'kld': 6.966803905103182}, 'feature_15': {'tvd': 0.5189999999999999, 'hellinger': 0.746815136758792, 'kld': 3.715368630042354}})}
> 2023-09-14 11:34:48,465 [info] Drift status: {'endpoint_id': '66112327e1633e85ceba587c58fa0e56833bf311', 'drift_status': 'NO_DRIFT', 'drift_measure': 0.3003673509386833}
> 2023-09-14 11:34:48,473 [info] Generate a new V3IO KV schema file: {'kv_table_path': 'pipelines/tutorial-iguazio/model-endpoints/endpoints/'}
> 2023-09-14 11:34:48,494 [warning] Could not write drift measures to TSDB: {'err': Error("cannot call API - write error: backend Write failed: failed to create adapter: No TSDB schema file found at 'v3io-webapi:8081/users/pipelines/tutorial-iguazio/model-endpoints/events/'."), 'tsdb_path': 'pipelines/tutorial-iguazio/model-endpoints/events/', 'endpoint': '66112327e1633e85ceba587c58fa0e56833bf311'}
> 2023-09-14 11:34:48,494 [info] Done updating drift measures: {'endpoint_id': '66112327e1633e85ceba587c58fa0e56833bf311'}
> 2023-09-14 11:34:48,601 [info] Run execution finished: {'status': 'completed', 'name': 'model-monitoring-batch'}
> 2023-09-14 11:34:48,700 [info] To track results use the CLI: {'info_cmd': 'mlrun get run 33b8a8e07b0542d08ac6f81f6d97f240 -p tutorial-iguazio', 'logs_cmd': 'mlrun logs 33b8a8e07b0542d08ac6f81f6d97f240 -p tutorial-iguazio'}
> 2023-09-14 11:34:48,700 [info] Or click for UI: {'ui_url': 'https://dashboard.default-tenant.app.dev63.lab.iguazeng.com/mlprojects/tutorial-iguazio/jobs/monitor/33b8a8e07b0542d08ac6f81f6d97f240/overview'}
> 2023-09-14 11:34:48,701 [info] Run execution finished: {'status': 'completed', 'name': 'model-monitoring-batch'}
> 2023-09-14 11:34:50,772 [info] Run execution finished: {'status': 'completed', 'name': 'batch-inference-v2-infer'}
  • batch-inference-v2-infer (project: tutorial-iguazio, iter: 0, started Sep 14 11:34:27, completed)
    labels: v3io_user=iguazio, kind=job, owner=iguazio, mlrun/client_version=1.5.0-rc12, mlrun/client_python_version=3.9.16, host=batch-inference-v2-infer-xx8zv
    inputs: dataset
    parameters: model_path=store://models/tutorial-iguazio/model#0:3ba91513-7dae-45b2-b118-d2197ade55a3, perform_drift_analysis=True, trigger_monitoring_job=True
    results: batch_id=08683420069e1a367216aa745f02ef9f73a87f595a333571b417c6ac, drift_status=False, drift_metric=0.3003673509386833
    artifacts: prediction, drift_table_plot, features_drift_results

> to track results use the .show() or .logs() methods or click here to open in UI
> 2023-09-14 11:34:52,077 [info] Run execution finished: {'status': 'completed', 'name': 'batch-inference-v2-infer'}

Predictions and drift status#

These are the batch predictions on the prediction set from the model:

run.artifact("prediction").as_df().head()
feature_0 feature_1 feature_2 feature_3 feature_4 feature_5 feature_6 feature_7 feature_8 feature_9 ... feature_11 feature_12 feature_13 feature_14 feature_15 feature_16 feature_17 feature_18 feature_19 label
0 -2.059506 -1.314291 2.721516 -2.132869 -0.693963 0.376643 3.017790 3.876329 -1.294736 0.030773 ... 2.775699 2.361580 0.173441 0.879510 1.141007 4.608280 -0.518388 0.129690 2.794967 0
1 -1.190382 0.891571 3.726070 0.673870 -0.252565 -0.729156 2.646563 4.782729 0.318952 -0.781567 ... 1.101721 3.723400 -0.466867 -0.056224 3.344701 0.194332 0.463992 0.292268 4.665876 1
2 -0.996384 -0.099537 3.421476 0.162771 -1.143458 -1.026791 2.114702 2.517553 -0.154620 -0.465423 ... 1.729386 2.820340 -1.041428 -0.331871 2.909172 2.138613 -0.046252 -0.732631 4.716266 0
3 -0.289976 -1.680019 3.126478 -0.704451 -1.149112 1.174962 2.860341 3.753661 -0.326119 2.128411 ... 2.328688 3.397321 -0.932060 -1.442370 2.058517 3.881936 2.090635 -0.045832 4.197315 0
4 -0.294866 1.044919 2.924139 0.814049 -1.455054 -0.270432 3.380195 2.339669 1.029101 -1.171018 ... 1.283565 0.677006 -2.147444 -0.494150 3.222041 6.219348 -1.914110 0.317786 4.143443 1

5 rows × 21 columns

There is also a drift table plot that compares the drift between the training data and prediction data per feature:

run.artifact("drift_table_plot").show()

Drift Table Plot

Finally, you also get a numerical drift metric and boolean flag denoting whether or not data drift is detected:

run.status.results
{'batch_id': '08683420069e1a367216aa745f02ef9f73a87f595a333571b417c6ac',
 'drift_status': False,
 'drift_metric': 0.3003673509386833}
# Data/concept drift per feature
import json

json.loads(run.artifact("features_drift_results").get())
{'feature_13': 0.04239724655474124,
 'feature_16': 0.7181622588902428,
 'feature_12': 0.7034787615778625,
 'feature_0': 0.028086840976606773,
 'label': 0.03893671613932477,
 'feature_7': 0.7297906294873706,
 'feature_17': 0.03587785749574268,
 'feature_18': 0.04443732609382538,
 'feature_3': 0.043769819014849734,
 'feature_4': 0.042755641152500176,
 'feature_5': 0.05184219833790496,
 'feature_10': 0.042567035578799796,
 'feature_2': 0.7391279921664593,
 'feature_9': 0.04468363504674985,
 'feature_6': 0.7262042202197605,
 'feature_8': 0.039060131873550404,
 'feature_14': 0.046364723781764774,
 'feature_1': 0.04485072701663093,
 'feature_11': 0.7221431701127441,
 'feature_19': 0.7902698698155215,
 'feature_15': 0.6329075683793959}

Examining the drift results in the dashboard#

This section reviews the main charts and statistics that can be found on the platform dashboard. See Model monitoring overview to learn more about the available model monitoring features and how to use them.

Before analyzing the results in the visual dashboards, run another batch infer job, but this time with a lower drift threshold, to get a drifted result. The drift decision rule is the per-feature mean of the Total Variation Distance (TVD) and Hellinger distance scores. By default, the threshold is 0.7, but you can modify it through the batch infer process. As seen above, the drift result in this case was ~0.3. Reduce the threshold value to 0.2 in the following run to generate a drifted result.
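
As a quick aside, the ~0.3 drift measure reported above can be reproduced from the per-feature means printed in the monitoring job log (a rough sketch of the decision rule using the logged values; the actual calculation is performed by the model-monitoring batch job):

# Reproduce the reported drift measure from the per-feature means in the job log above
tvd_mean = 0.2713714285714286        # mean Total Variation Distance across features
hellinger_mean = 0.3293632733059379  # mean Hellinger distance across features

drift_measure = (tvd_mean + hellinger_mean) / 2
print(drift_measure)  # ~0.3004, below the default 0.7 threshold, so no drift was flagged

With that in mind, rerun the batch inference with the reduced thresholds: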

run = project.run_function(
    fn,
    inputs={
        "dataset": prediction_set_path,
    },
    params={
        "model_path": model_artifact.uri,
        "perform_drift_analysis": True,
        "trigger_monitoring_job": True,
        "model_endpoint_name": "drifted-model-endpoint",
        "model_endpoint_drift_threshold": 0.2,
        "model_endpoint_possible_drift_threshold": 0.1,
    },
)
> 2023-09-14 14:00:21,477 [info] Storing function: {'name': 'batch-inference-v2-infer', 'uid': '02af70631bbe4d7aaec58cbef4bf725d', 'db': 'http://mlrun-api:8080'}
> 2023-09-14 14:00:21,796 [info] Job is running in the background, pod: batch-inference-v2-infer-vthxf
> 2023-09-14 14:00:45,833 [info] Loading model...
> 2023-09-14 14:00:46,530 [info] Loading data...
> 2023-09-14 14:00:55,812 [info] Calculating prediction...
> 2023-09-14 14:00:55,816 [info] Logging result set (x | prediction)...
> 2023-09-14 14:00:56,120 [info] Performing drift analysis...
> 2023-09-14 14:00:57,897 [info] Storing function: {'name': 'model-monitoring-batch', 'uid': '7e3fa8d0dd12430d8b1934fdf76b6720', 'db': 'http://mlrun-api:8080'}
> 2023-09-14 14:00:58,138 [info] Job is running in the background, pod: model-monitoring-batch-6hcb8
> 2023-09-14 14:01:13,295 [info] Initializing BatchProcessor: {'project': 'tutorial-iguazio'}
divide by zero encountered in log
> 2023-09-14 14:01:13,741 [info] Drift result: {'drift_result': defaultdict(<class 'dict'>, {'feature_8': {'tvd': 0.038400000000000004, 'hellinger': 0.039720263747100804, 'kld': 0.009435438488007906}, 'tvd_sum': 5.6988, 'tvd_mean': 0.2713714285714286, 'hellinger_sum': 6.916628739424697, 'hellinger_mean': 0.32936327330593795, 'kld_sum': 39.76276059167616, 'kld_mean': 1.8934647900798172, 'label': {'tvd': 0.0456, 'hellinger': 0.032273432278649546, 'kld': 0.00833405069284858}, 'feature_14': {'tvd': 0.038, 'hellinger': 0.05472944756352956, 'kld': 0.022306296862012256}, 'feature_1': {'tvd': 0.0434, 'hellinger': 0.046301454033261864, 'kld': 0.014581581489297469}, 'feature_0': {'tvd': 0.022600000000000002, 'hellinger': 0.033573681953213544, 'kld': 0.007454628335938183}, 'feature_2': {'tvd': 0.6882, 'hellinger': 0.7900559843329186, 'kld': 5.214389130958726}, 'feature_15': {'tvd': 0.5189999999999999, 'hellinger': 0.746815136758792, 'kld': 3.715368630042354}, 'feature_17': {'tvd': 0.03280000000000001, 'hellinger': 0.038955714991485355, 'kld': 0.009013894995753259}, 'feature_12': {'tvd': 0.599, 'hellinger': 0.807957523155725, 'kld': 4.574238261717538}, 'feature_11': {'tvd': 0.6384000000000001, 'hellinger': 0.8058863402254881, 'kld': 4.800461353025042}, 'feature_18': {'tvd': 0.04240000000000001, 'hellinger': 0.046474652187650754, 'kld': 0.015894237456896394}, 'feature_9': {'tvd': 0.042800000000000005, 'hellinger': 0.04656727009349971, 'kld': 0.01261823423079203}, 'feature_10': {'tvd': 0.040399999999999985, 'hellinger': 0.04473407115759961, 'kld': 0.016107595511326803}, 'feature_16': {'tvd': 0.6359999999999999, 'hellinger': 0.8003245177804857, 'kld': 4.682651890289595}, 'feature_5': {'tvd': 0.049600000000000005, 'hellinger': 0.05408439667580992, 'kld': 0.017044292121321365}, 'feature_13': {'tvd': 0.03959999999999999, 'hellinger': 0.04519449310948248, 'kld': 0.013240526944533322}, 'feature_4': {'tvd': 0.03859999999999999, 'hellinger': 0.04691128230500036, 'kld': 0.016210210229378626}, 'feature_19': {'tvd': 0.7812000000000001, 'hellinger': 0.7993397396310429, 'kld': 6.966803905103182}, 'feature_3': {'tvd': 0.03840000000000001, 'hellinger': 0.04913963802969945, 'kld': 0.017331946342858503}, 'feature_7': {'tvd': 0.6646, 'hellinger': 0.7949812589747411, 'kld': 4.94920993092669}, 'feature_6': {'tvd': 0.6598000000000002, 'hellinger': 0.7926084404395208, 'kld': 4.680064555912078}})}
> 2023-09-14 14:01:13,741 [info] Drift status: {'endpoint_id': '4a68274cf8a733c9ea3c4497a9fe24cf623e378f', 'drift_status': 'DRIFT_DETECTED', 'drift_measure': 0.3003673509386833}
> 2023-09-14 14:01:13,791 [warning] Could not write drift measures to TSDB: {'err': Error("cannot call API - write error: backend Write failed: failed to create adapter: No TSDB schema file found at 'v3io-webapi:8081/users/pipelines/tutorial-iguazio/model-endpoints/events/'."), 'tsdb_path': 'pipelines/tutorial-iguazio/model-endpoints/events/', 'endpoint': '4a68274cf8a733c9ea3c4497a9fe24cf623e378f'}
> 2023-09-14 14:01:13,792 [info] Done updating drift measures: {'endpoint_id': '4a68274cf8a733c9ea3c4497a9fe24cf623e378f'}
> 2023-09-14 14:01:13,889 [info] Run execution finished: {'status': 'completed', 'name': 'model-monitoring-batch'}
> 2023-09-14 14:01:14,265 [info] To track results use the CLI: {'info_cmd': 'mlrun get run 7e3fa8d0dd12430d8b1934fdf76b6720 -p tutorial-iguazio', 'logs_cmd': 'mlrun logs 7e3fa8d0dd12430d8b1934fdf76b6720 -p tutorial-iguazio'}
> 2023-09-14 14:01:14,265 [info] Or click for UI: {'ui_url': 'https://dashboard.default-tenant.app.dev63.lab.iguazeng.com/mlprojects/tutorial-iguazio/jobs/monitor/7e3fa8d0dd12430d8b1934fdf76b6720/overview'}
> 2023-09-14 14:01:14,265 [info] Run execution finished: {'status': 'completed', 'name': 'model-monitoring-batch'}
> 2023-09-14 14:01:16,296 [info] Run execution finished: {'status': 'completed', 'name': 'batch-inference-v2-infer'}
project uid iter start state name labels inputs parameters results artifacts
tutorial-iguazio 0 Sep 14 14:00:45 completed batch-inference-v2-infer
v3io_user=iguazio
kind=job
owner=iguazio
mlrun/client_version=1.5.0-rc12
mlrun/client_python_version=3.9.16
host=batch-inference-v2-infer-vthxf
dataset
model_path=store://models/tutorial-iguazio/model#0:3ba91513-7dae-45b2-b118-d2197ade55a3
perform_drift_analysis=True
trigger_monitoring_job=True
model_endpoint_name=drifted-model-endpoint
model_endpoint_drift_threshold=0.2
model_endpoint_possible_drift_threshold=0.1
batch_id=7e505020685f41b05b2fa191977a085c8d47e89cfef145d1626c0284
drift_status=True
drift_metric=0.3003673509386833
prediction
drift_table_plot
features_drift_results

> to track results use the .show() or .logs() methods or click here to open in UI
> 2023-09-14 14:01:20,495 [info] Run execution finished: {'status': 'completed', 'name': 'batch-inference-v2-infer'}

Now you can observe the drift result:

run.status.results
{'batch_id': '7e505020685f41b05b2fa191977a085c8d47e89cfef145d1626c0284',
 'drift_status': True,
 'drift_metric': 0.3003673509386833}
Model Endpoints#

In the Projects page > Model endpoint summary list, you can see the two new model endpoints, including their drift status:

Model Endpoints Summary List

You can zoom into one of the model endpoints to get an overview of the selected endpoint, including the calculated statistical drift metrics:

Model Endpoint Overview

Press Features Analysis to see details of the drift analysis in a table format with each feature in the selected model on its own line, including the predicted label:

Model Endpoint Feature Analysis

Next steps#

In a production setting, you probably want to incorporate this as part of a larger pipeline or application.

For example, if you use this function for the prediction capabilities, you can pass the prediction output as the input to another pipeline step, store it in an external location like S3, or send it to an application or user.

If you use this function for the drift detection capabilities, you can use the drift_status and drift_metric outputs to automate further pipeline steps, send a notification, or kick off a re-training pipeline.
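
For instance, a minimal sketch of branching on the drift outputs (the "retrain" function name here is hypothetical and not part of this tutorial):

# Act on the batch drift outputs
drift_detected = run.status.results["drift_status"]
drift_metric = run.status.results["drift_metric"]

if drift_detected:
    # Kick off a re-training job defined elsewhere in the project (hypothetical function)
    project.run_function("retrain", inputs={"dataset": prediction_set_path})
else:
    print(f"No drift detected (metric={drift_metric:.3f}); keeping the current model")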

Feature store example (stocks)#

This notebook demonstrates the following:

  • Generate features and feature-sets

  • Build complex transformations and ingest to offline and real-time data stores

  • Fetch feature vectors for training

  • Save feature vectors for re-use in real-time pipelines

  • Access features and their statistics in real-time

Note

By default, this demo works with the online feature store, which is currently not part of the Open Source MLRun default deployment.


Get started#

Install the latest MLRun package and restart the notebook.

Setting up the environment and project:

import mlrun

mlrun.get_or_create_project("stocks", "./")
> 2023-02-05 11:43:17,605 [info] Created and saved project stocks: {'from_template': None, 'overwrite': False, 'context': './', 'save': True}
> 2023-02-05 11:43:17,607 [info] created project stocks and saved in MLRun DB
<mlrun.projects.project.MlrunProject at 0x7f689811ea10>

Create sample data for demo#

import pandas as pd

quotes = pd.DataFrame(
    {
        "time": [
            pd.Timestamp("2016-05-25 13:30:00.023"),
            pd.Timestamp("2016-05-25 13:30:00.023"),
            pd.Timestamp("2016-05-25 13:30:00.030"),
            pd.Timestamp("2016-05-25 13:30:00.041"),
            pd.Timestamp("2016-05-25 13:30:00.048"),
            pd.Timestamp("2016-05-25 13:30:00.049"),
            pd.Timestamp("2016-05-25 13:30:00.072"),
            pd.Timestamp("2016-05-25 13:30:00.075"),
        ],
        "ticker": ["GOOG", "MSFT", "MSFT", "MSFT", "GOOG", "AAPL", "GOOG", "MSFT"],
        "bid": [720.50, 51.95, 51.97, 51.99, 720.50, 97.99, 720.50, 52.01],
        "ask": [720.93, 51.96, 51.98, 52.00, 720.93, 98.01, 720.88, 52.03],
    }
)

trades = pd.DataFrame(
    {
        "time": [
            pd.Timestamp("2016-05-25 13:30:00.023"),
            pd.Timestamp("2016-05-25 13:30:00.038"),
            pd.Timestamp("2016-05-25 13:30:00.048"),
            pd.Timestamp("2016-05-25 13:30:00.048"),
            pd.Timestamp("2016-05-25 13:30:00.048"),
        ],
        "ticker": ["MSFT", "MSFT", "GOOG", "GOOG", "AAPL"],
        "price": [51.95, 51.95, 720.77, 720.92, 98.0],
        "quantity": [75, 155, 100, 100, 100],
    }
)


stocks = pd.DataFrame(
    {
        "ticker": ["MSFT", "GOOG", "AAPL"],
        "name": ["Microsoft Corporation", "Alphabet Inc", "Apple Inc"],
        "exchange": ["NASDAQ", "NASDAQ", "NASDAQ"],
    }
)

import datetime


def move_date(df, col):
    max_date = df[col].max()
    now_date = datetime.datetime.now()
    delta = now_date - max_date
    df[col] = df[col] + delta
    return df


quotes = move_date(quotes, "time")
trades = move_date(trades, "time")
View the demo data#
quotes
time ticker bid ask
0 2021-05-23 09:04:07.013574 GOOG 720.50 720.93
1 2021-05-23 09:04:07.013574 MSFT 51.95 51.96
2 2021-05-23 09:04:07.020574 MSFT 51.97 51.98
3 2021-05-23 09:04:07.031574 MSFT 51.99 52.00
4 2021-05-23 09:04:07.038574 GOOG 720.50 720.93
5 2021-05-23 09:04:07.039574 AAPL 97.99 98.01
6 2021-05-23 09:04:07.062574 GOOG 720.50 720.88
7 2021-05-23 09:04:07.065574 MSFT 52.01 52.03
trades
time ticker price quantity
0 2021-05-23 09:04:07.041766 MSFT 51.95 75
1 2021-05-23 09:04:07.056766 MSFT 51.95 155
2 2021-05-23 09:04:07.066766 GOOG 720.77 100
3 2021-05-23 09:04:07.066766 GOOG 720.92 100
4 2021-05-23 09:04:07.066766 AAPL 98.00 100
stocks
ticker name exchange
0 MSFT Microsoft Corporation NASDAQ
1 GOOG Alphabet Inc NASDAQ
2 AAPL Apple Inc NASDAQ

Define, infer and ingest feature sets#

import mlrun.feature_store as fstore
from mlrun.feature_store.steps import *
from mlrun.features import MinMaxValidator
Build and ingest simple feature set (stocks)#
# add feature set without time column (stock ticker metadata)
stocks_set = fstore.FeatureSet("stocks", entities=[fstore.Entity("ticker")])
fstore.ingest(stocks_set, stocks, infer_options=fstore.InferOptions.default())
name exchange
ticker
MSFT Microsoft Corporation NASDAQ
GOOG Alphabet Inc NASDAQ
AAPL Apple Inc NASDAQ
Build an advanced feature set - with feature engineering pipeline#

Define a feature set with custom data processing and time aggregation functions:

# create a new feature set
quotes_set = fstore.FeatureSet("stock-quotes", entities=[fstore.Entity("ticker")])

Define a custom pipeline step (python class)

class MyMap(MapClass):
    def __init__(self, multiplier=1, **kwargs):
        super().__init__(**kwargs)
        self._multiplier = multiplier

    def do(self, event):
        event["multi"] = event["bid"] * self._multiplier
        return event

Build and show the transformation pipeline

Use storey stream processing classes along with library and custom classes:

quotes_set.graph.to("MyMap", multiplier=3).to(
    "storey.Extend", _fn="({'extra': event['bid'] * 77})"
).to("storey.Filter", "filter", _fn="(event['bid'] > 51.92)").to(FeaturesetValidator())

quotes_set.add_aggregation("ask", ["sum", "max"], "1h", "10m", name="asks1")
quotes_set.add_aggregation("ask", ["sum", "max"], "5h", "10m", name="asks5")
quotes_set.add_aggregation("bid", ["min", "max"], "1h", "10m", name="bids")

# add feature validation policy
quotes_set["bid"] = fstore.Feature(validator=MinMaxValidator(min=52, severity="info"))

# add default target definitions and plot
quotes_set.set_targets()
quotes_set.plot(rankdir="LR", with_targets=True)
_images/01e548158b906df8cc3f4f282097c5f50e603245dd43f2314bc9ff5ef6aa1447.svg

Test and show the pipeline results locally (allows you to quickly develop and debug)

fstore.preview(
    quotes_set,
    quotes,
    entity_columns=["ticker"],
    timestamp_key="time",
    options=fstore.InferOptions.default(),
)
info! bid value is smaller than min, key=['MSFT'] time=2021-05-23 09:04:07.013574 args={'min': 52, 'value': 51.95}
info! bid value is smaller than min, key=['MSFT'] time=2021-05-23 09:04:07.020574 args={'min': 52, 'value': 51.97}
info! bid value is smaller than min, key=['MSFT'] time=2021-05-23 09:04:07.031574 args={'min': 52, 'value': 51.99}
asks1_sum_1h asks1_max_1h asks5_sum_5h asks5_max_5h bids_min_1h bids_max_1h time bid ask multi extra
ticker
GOOG 720.93 720.93 720.93 720.93 720.50 720.50 2021-05-23 09:04:07.013574 720.50 720.93 2161.50 55478.50
MSFT 51.96 51.96 51.96 51.96 51.95 51.95 2021-05-23 09:04:07.013574 51.95 51.96 155.85 4000.15
MSFT 103.94 51.98 103.94 51.98 51.95 51.97 2021-05-23 09:04:07.020574 51.97 51.98 155.91 4001.69
MSFT 155.94 52.00 155.94 52.00 51.95 51.99 2021-05-23 09:04:07.031574 51.99 52.00 155.97 4003.23
GOOG 1441.86 720.93 1441.86 720.93 720.50 720.50 2021-05-23 09:04:07.038574 720.50 720.93 2161.50 55478.50
AAPL 98.01 98.01 98.01 98.01 97.99 97.99 2021-05-23 09:04:07.039574 97.99 98.01 293.97 7545.23
GOOG 2162.74 720.93 2162.74 720.93 720.50 720.50 2021-05-23 09:04:07.062574 720.50 720.88 2161.50 55478.50
MSFT 207.97 52.03 207.97 52.03 51.95 52.01 2021-05-23 09:04:07.065574 52.01 52.03 156.03 4004.77
# print the feature set object
print(quotes_set.to_yaml())
kind: FeatureSet
metadata:
  name: stock-quotes
spec:
  entities:
  - name: ticker
    value_type: str
  features:
  - name: asks1_sum_1h
    value_type: float
    aggregate: true
  - name: asks1_max_1h
    value_type: float
    aggregate: true
  - name: asks5_sum_5h
    value_type: float
    aggregate: true
  - name: asks5_max_5h
    value_type: float
    aggregate: true
  - name: bids_min_1h
    value_type: float
    aggregate: true
  - name: bids_max_1h
    value_type: float
    aggregate: true
  - name: bid
    value_type: float
    validator:
      kind: minmax
      severity: info
      min: 52
  - name: ask
    value_type: float
  - name: multi
    value_type: float
  - name: extra
    value_type: float
  partition_keys: []
  timestamp_key: time
  source:
    path: None
  targets:
  - name: parquet
    kind: parquet
  - name: nosql
    kind: nosql
  graph:
    states:
      MyMap:
        kind: task
        class_name: MyMap
        class_args:
          multiplier: 3
      storey.Extend:
        kind: task
        class_name: storey.Extend
        class_args:
          _fn: '({''extra'': event[''bid''] * 77})'
        after:
        - MyMap
      filter:
        kind: task
        class_name: storey.Filter
        class_args:
          _fn: (event['bid'] > 51.92)
        after:
        - storey.Extend
      FeaturesetValidator:
        kind: task
        class_name: mlrun.feature_store.steps.FeaturesetValidator
        class_args:
          featureset: .
          columns: null
        after:
        - filter
      Aggregates:
        kind: task
        class_name: storey.AggregateByKey
        class_args:
          aggregates:
          - name: asks1
            column: ask
            operations:
            - sum
            - max
            windows:
            - 1h
            period: 10m
          - name: asks5
            column: ask
            operations:
            - sum
            - max
            windows:
            - 5h
            period: 10m
          - name: bids
            column: bid
            operations:
            - min
            - max
            windows:
            - 1h
            period: 10m
          table: .
        after:
        - FeaturesetValidator
  output_path: v3io:///projects/{{run.project}}/artifacts
status:
  state: created
  stats:
    ticker:
      count: 8
      unique: 3
      top: MSFT
      freq: 4
    asks1_sum_1h:
      count: 8.0
      mean: 617.9187499999999
      min: 51.96
      max: 2162.74
      std: 784.8779804245735
      hist:
      - - 4
        - 1
        - 0
        - 0
        - 0
        - 0
        - 1
        - 0
        - 0
        - 0
        - 0
        - 0
        - 0
        - 1
        - 0
        - 0
        - 0
        - 0
        - 0
        - 1
      - - 51.96
        - 157.499
        - 263.03799999999995
        - 368.57699999999994
        - 474.11599999999993
        - 579.655
        - 685.194
        - 790.733
        - 896.2719999999999
        - 1001.8109999999999
        - 1107.35
        - 1212.889
        - 1318.4279999999999
        - 1423.9669999999999
        - 1529.5059999999999
        - 1635.0449999999998
        - 1740.5839999999998
        - 1846.1229999999998
        - 1951.6619999999998
        - 2057.2009999999996
        - 2162.74
    asks1_max_1h:
      count: 8.0
      mean: 308.59625
      min: 51.96
      max: 720.93
      std: 341.7989955655851
      hist:
      - - 4
        - 1
        - 0
        - 0
        - 0
        - 0
        - 0
        - 0
        - 0
        - 0
        - 0
        - 0
        - 0
        - 0
        - 0
        - 0
        - 0
        - 0
        - 0
        - 3
      - - 51.96
        - 85.4085
        - 118.857
        - 152.3055
        - 185.754
        - 219.2025
        - 252.65099999999998
        - 286.0995
        - 319.54799999999994
        - 352.9964999999999
        - 386.44499999999994
        - 419.89349999999996
        - 453.3419999999999
        - 486.7904999999999
        - 520.2389999999999
        - 553.6875
        - 587.136
        - 620.5844999999999
        - 654.0329999999999
        - 687.4815
        - 720.93
    asks5_sum_5h:
      count: 8.0
      mean: 617.9187499999999
      min: 51.96
      max: 2162.74
      std: 784.8779804245735
      hist:
      - - 4
        - 1
        - 0
        - 0
        - 0
        - 0
        - 1
        - 0
        - 0
        - 0
        - 0
        - 0
        - 0
        - 1
        - 0
        - 0
        - 0
        - 0
        - 0
        - 1
      - - 51.96
        - 157.499
        - 263.03799999999995
        - 368.57699999999994
        - 474.11599999999993
        - 579.655
        - 685.194
        - 790.733
        - 896.2719999999999
        - 1001.8109999999999
        - 1107.35
        - 1212.889
        - 1318.4279999999999
        - 1423.9669999999999
        - 1529.5059999999999
        - 1635.0449999999998
        - 1740.5839999999998
        - 1846.1229999999998
        - 1951.6619999999998
        - 2057.2009999999996
        - 2162.74
    asks5_max_5h:
      count: 8.0
      mean: 308.59625
      min: 51.96
      max: 720.93
      std: 341.7989955655851
      hist:
      - - 4
        - 1
        - 0
        - 0
        - 0
        - 0
        - 0
        - 0
        - 0
        - 0
        - 0
        - 0
        - 0
        - 0
        - 0
        - 0
        - 0
        - 0
        - 0
        - 3
      - - 51.96
        - 85.4085
        - 118.857
        - 152.3055
        - 185.754
        - 219.2025
        - 252.65099999999998
        - 286.0995
        - 319.54799999999994
        - 352.9964999999999
        - 386.44499999999994
        - 419.89349999999996
        - 453.3419999999999
        - 486.7904999999999
        - 520.2389999999999
        - 553.6875
        - 587.136
        - 620.5844999999999
        - 654.0329999999999
        - 687.4815
        - 720.93
    bids_min_1h:
      count: 8.0
      mean: 308.41125
      min: 51.95
      max: 720.5
      std: 341.59667259325835
      hist:
      - - 4
        - 1
        - 0
        - 0
        - 0
        - 0
        - 0
        - 0
        - 0
        - 0
        - 0
        - 0
        - 0
        - 0
        - 0
        - 0
        - 0
        - 0
        - 0
        - 3
      - - 51.95
        - 85.3775
        - 118.80499999999999
        - 152.2325
        - 185.65999999999997
        - 219.08749999999998
        - 252.515
        - 285.94249999999994
        - 319.36999999999995
        - 352.79749999999996
        - 386.22499999999997
        - 419.6524999999999
        - 453.0799999999999
        - 486.50749999999994
        - 519.935
        - 553.3625
        - 586.79
        - 620.2175
        - 653.645
        - 687.0725
        - 720.5
    bids_max_1h:
      count: 8.0
      mean: 308.42625
      min: 51.95
      max: 720.5
      std: 341.58380276661245
      hist:
      - - 4
        - 1
        - 0
        - 0
        - 0
        - 0
        - 0
        - 0
        - 0
        - 0
        - 0
        - 0
        - 0
        - 0
        - 0
        - 0
        - 0
        - 0
        - 0
        - 3
      - - 51.95
        - 85.3775
        - 118.80499999999999
        - 152.2325
        - 185.65999999999997
        - 219.08749999999998
        - 252.515
        - 285.94249999999994
        - 319.36999999999995
        - 352.79749999999996
        - 386.22499999999997
        - 419.6524999999999
        - 453.0799999999999
        - 486.50749999999994
        - 519.935
        - 553.3625
        - 586.79
        - 620.2175
        - 653.645
        - 687.0725
        - 720.5
    time:
      count: 8
      mean: '2021-05-23 09:04:07.035699200'
      min: '2021-05-23 09:04:07.013574'
      max: '2021-05-23 09:04:07.065574'
    bid:
      count: 8.0
      mean: 308.42625
      min: 51.95
      max: 720.5
      std: 341.58380276661245
      hist:
      - - 4
        - 1
        - 0
        - 0
        - 0
        - 0
        - 0
        - 0
        - 0
        - 0
        - 0
        - 0
        - 0
        - 0
        - 0
        - 0
        - 0
        - 0
        - 0
        - 3
      - - 51.95
        - 85.3775
        - 118.80499999999999
        - 152.2325
        - 185.65999999999997
        - 219.08749999999998
        - 252.515
        - 285.94249999999994
        - 319.36999999999995
        - 352.79749999999996
        - 386.22499999999997
        - 419.6524999999999
        - 453.0799999999999
        - 486.50749999999994
        - 519.935
        - 553.3625
        - 586.79
        - 620.2175
        - 653.645
        - 687.0725
        - 720.5
    ask:
      count: 8.0
      mean: 308.59
      min: 51.96
      max: 720.93
      std: 341.79037903369954
      hist:
      - - 4
        - 1
        - 0
        - 0
        - 0
        - 0
        - 0
        - 0
        - 0
        - 0
        - 0
        - 0
        - 0
        - 0
        - 0
        - 0
        - 0
        - 0
        - 0
        - 3
      - - 51.96
        - 85.4085
        - 118.857
        - 152.3055
        - 185.754
        - 219.2025
        - 252.65099999999998
        - 286.0995
        - 319.54799999999994
        - 352.9964999999999
        - 386.44499999999994
        - 419.89349999999996
        - 453.3419999999999
        - 486.7904999999999
        - 520.2389999999999
        - 553.6875
        - 587.136
        - 620.5844999999999
        - 654.0329999999999
        - 687.4815
        - 720.93
    multi:
      count: 8.0
      mean: 925.27875
      min: 155.85000000000002
      max: 2161.5
      std: 1024.7514082998375
      hist:
      - - 4
        - 1
        - 0
        - 0
        - 0
        - 0
        - 0
        - 0
        - 0
        - 0
        - 0
        - 0
        - 0
        - 0
        - 0
        - 0
        - 0
        - 0
        - 0
        - 3
      - - 155.85000000000002
        - 256.13250000000005
        - 356.415
        - 456.6975
        - 556.98
        - 657.2625
        - 757.545
        - 857.8275
        - 958.11
        - 1058.3925
        - 1158.6750000000002
        - 1258.9575
        - 1359.2399999999998
        - 1459.5225
        - 1559.8049999999998
        - 1660.0875
        - 1760.37
        - 1860.6525000000001
        - 1960.935
        - 2061.2175
        - 2161.5
    extra:
      count: 8.0
      mean: 23748.82125
      min: 4000.15
      max: 55478.5
      std: 26301.95281302916
      hist:
      - - 4
        - 1
        - 0
        - 0
        - 0
        - 0
        - 0
        - 0
        - 0
        - 0
        - 0
        - 0
        - 0
        - 0
        - 0
        - 0
        - 0
        - 0
        - 0
        - 3
      - - 4000.15
        - 6574.0675
        - 9147.985
        - 11721.9025
        - 14295.82
        - 16869.7375
        - 19443.655000000002
        - 22017.572500000002
        - 24591.49
        - 27165.4075
        - 29739.325
        - 32313.2425
        - 34887.16
        - 37461.0775
        - 40034.995
        - 42608.9125
        - 45182.83
        - 47756.747500000005
        - 50330.665
        - 52904.582500000004
        - 55478.5
  preview:
  - - asks1_sum_1h
    - asks1_max_1h
    - asks5_sum_5h
    - asks5_max_5h
    - bids_min_1h
    - bids_max_1h
    - time
    - bid
    - ask
    - multi
    - extra
  - - 720.93
    - 720.93
    - 720.93
    - 720.93
    - 720.5
    - 720.5
    - 2021-05-23T09:04:07.013574
    - 720.5
    - 720.93
    - 2161.5
    - 55478.5
  - - 51.96
    - 51.96
    - 51.96
    - 51.96
    - 51.95
    - 51.95
    - 2021-05-23T09:04:07.013574
    - 51.95
    - 51.96
    - 155.85000000000002
    - 4000.15
  - - 103.94
    - 51.98
    - 103.94
    - 51.98
    - 51.95
    - 51.97
    - 2021-05-23T09:04:07.020574
    - 51.97
    - 51.98
    - 155.91
    - 4001.69
  - - 155.94
    - 52.0
    - 155.94
    - 52.0
    - 51.95
    - 51.99
    - 2021-05-23T09:04:07.031574
    - 51.99
    - 52.0
    - 155.97
    - 4003.23
  - - 1441.86
    - 720.93
    - 1441.86
    - 720.93
    - 720.5
    - 720.5
    - 2021-05-23T09:04:07.038574
    - 720.5
    - 720.93
    - 2161.5
    - 55478.5
  - - 98.01
    - 98.01
    - 98.01
    - 98.01
    - 97.99
    - 97.99
    - 2021-05-23T09:04:07.039574
    - 97.99
    - 98.01
    - 293.96999999999997
    - 7545.23
  - - 2162.74
    - 720.93
    - 2162.74
    - 720.93
    - 720.5
    - 720.5
    - 2021-05-23T09:04:07.062574
    - 720.5
    - 720.88
    - 2161.5
    - 55478.5
  - - 207.97
    - 52.03
    - 207.97
    - 52.03
    - 51.95
    - 52.01
    - 2021-05-23T09:04:07.065574
    - 52.01
    - 52.03
    - 156.03
    - 4004.77
Ingest data into offline and online stores#

This writes to both targets (Parquet and NoSQL).

# save ingest data and print the FeatureSet spec
df = fstore.ingest(quotes_set, quotes)
info! bid value is smaller than min, key=['MSFT'] time=2021-05-23 09:04:07.013574 args={'min': 52, 'value': 51.95}
info! bid value is smaller than min, key=['MSFT'] time=2021-05-23 09:04:07.020574 args={'min': 52, 'value': 51.97}
info! bid value is smaller than min, key=['MSFT'] time=2021-05-23 09:04:07.031574 args={'min': 52, 'value': 51.99}
info! bid value is smaller than min, key=['MSFT'] time=2021-05-23 09:04:07.013574 args={'min': 52, 'value': 51.95}
info! bid value is smaller than min, key=['MSFT'] time=2021-05-23 09:04:07.020574 args={'min': 52, 'value': 51.97}
info! bid value is smaller than min, key=['MSFT'] time=2021-05-23 09:04:07.031574 args={'min': 52, 'value': 51.99}

Get an offline feature vector for training#

Example of combining features from 3 sources with a time-travel join across the 3 tables.

Specify a set of features and request the feature vector offline result as a dataframe:

features = [
    "stock-quotes.multi",
    "stock-quotes.asks5_sum_5h as total_ask",
    "stock-quotes.bids_min_1h",
    "stock-quotes.bids_max_1h",
    "stocks.*",
]

vector = fstore.FeatureVector(
    "stocks-vec", features, description="stocks demo feature vector"
)
vector.save()
resp = fstore.get_offline_features(
    vector, entity_rows=trades, entity_timestamp_column="time"
)
resp.to_dataframe()
price quantity multi total_ask bids_min_1h bids_max_1h name exchange
0 51.95 75 155.97 155.94 51.95 51.99 Microsoft Corporation NASDAQ
1 51.95 155 155.97 155.94 51.95 51.99 Microsoft Corporation NASDAQ
2 720.77 100 2161.50 2162.74 720.50 720.50 Alphabet Inc NASDAQ
3 720.92 100 2161.50 2162.74 720.50 720.50 Alphabet Inc NASDAQ
4 98.00 100 293.97 98.01 97.99 97.99 Apple Inc NASDAQ

Initialize an online feature service and use it for real-time inference#

service = fstore.get_online_feature_service("stocks-vec")

Request the feature vector statistics, which can be used for imputing or validation

service.vector.get_stats_table()
count mean min max std hist unique top freq
multi 8.0 925.27875 155.85 2161.50 1024.751408 [[4, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,... NaN NaN NaN
total_ask 8.0 617.91875 51.96 2162.74 784.877980 [[4, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0,... NaN NaN NaN
bids_min_1h 8.0 308.41125 51.95 720.50 341.596673 [[4, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,... NaN NaN NaN
bids_max_1h 8.0 308.42625 51.95 720.50 341.583803 [[4, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,... NaN NaN NaN
name 3.0 NaN NaN NaN NaN NaN 3.0 Alphabet Inc 1.0
exchange 3.0 NaN NaN NaN NaN NaN 1.0 NASDAQ 3.0
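
These statistics can also drive imputing of missing values when the online service is created, via the impute_policy argument of get_online_feature_service. A sketch applied to this vector ("$mean" substitutes a feature's mean from the statistics above):

# A sketch: create an online service that imputes missing values using
# the feature means from the vector statistics
imputed_service = fstore.get_online_feature_service(
    "stocks-vec", impute_policy={"*": "$mean"}
)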

Real-time feature vector request

service.get([{"ticker": "GOOG"}, {"ticker": "MSFT"}])
[{'asks5_sum_5h': 2162.74,
  'bids_min_1h': 720.5,
  'bids_max_1h': 720.5,
  'multi': 2161.5,
  'name': 'Alphabet Inc',
  'exchange': 'NASDAQ',
  'total_ask': None},
 {'asks5_sum_5h': 207.97,
  'bids_min_1h': 51.95,
  'bids_max_1h': 52.01,
  'multi': 156.03,
  'name': 'Microsoft Corporation',
  'exchange': 'NASDAQ',
  'total_ask': None}]
service.get([{"ticker": "AAPL"}])
[{'asks5_sum_5h': 98.01,
  'bids_min_1h': 97.99,
  'bids_max_1h': 97.99,
  'multi': 293.97,
  'name': 'Apple Inc',
  'exchange': 'NASDAQ',
  'total_ask': None}]
service.close()

MLflow tracker#

This tutorial demonstrates how to seamlessly integrate and transfer logs from MLflow to MLRun,
creating a unified and powerful platform for your machine learning experiments.

You can combine MLflow and MLRun for a comprehensive solution for managing, tracking, and deploying machine learning models.

This notebook guides you through the process of:

  1. Setting up the integration between MLflow and MLRun.

  2. Extracting data, metrics, and artifacts from MLflow experiments.

  3. Creating MLRun artifacts and projects to organize and manage the transferred data.

  4. Leveraging MLRun's capabilities for model deployment and data processing.

By the end of this tutorial, you will have an understanding of how to establish a smooth flow of data between MLflow and MLRun.

MLRun installation and configuration#

Before running this notebook, make sure the mlrun package is installed (pip install mlrun) and that you have configured access to the MLRun service.

# Install MLRun and scikit-learn if not already installed. Run this only once. Restart the notebook after the install!
%pip install mlrun scikit-learn~=1.4 xgboost

Create an MLflow Xgboost function#

The training.py file contains only MLflow code and has no dependency on MLRun.

%%writefile training.py

import mlflow
import mlflow.xgboost
import xgboost as xgb
from mlflow import log_metric
from sklearn import datasets
from sklearn.metrics import accuracy_score, log_loss
from sklearn.model_selection import train_test_split

def example_xgb_run():
    # Prepare, train, and test data
    iris = datasets.load_iris()
    X = iris.data
    y = iris.target
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    # Enable auto logging
    mlflow.xgboost.autolog()

    dtrain = xgb.DMatrix(X_train, label=y_train)
    dtest = xgb.DMatrix(X_test, label=y_test)

    with mlflow.start_run():
        # Train model
        params = {
            "objective": "multi:softprob",
            "num_class": 3,
            "learning_rate": 0.3,
            "eval_metric": "mlogloss",
            "colsample_bytree": 1.0,
            "subsample": 1.0,
            "seed": 42,
        }
        model = xgb.train(params, dtrain, evals=[(dtrain, "train")])
        
        # Evaluate model
        y_proba = model.predict(dtest)
        y_pred = y_proba.argmax(axis=1)
        loss = log_loss(y_test, y_proba)
        acc = accuracy_score(y_test, y_pred)
        
        # Log metrics by hand
        mlflow.log_metrics({"log_loss": loss, "accuracy": acc})
Overwriting training.py

Log the data from MLflow in MLRun#

Change the MLRun configuration to use the tracker#
import mlrun

mlrun.mlconf.external_platform_tracking.enabled = True

To run the tracking, set mlrun.mlconf.external_platform_tracking.mlflow.match_experiment_to_runtime to True.
This makes the MLRun run ID the same as the MLflow experiment ID.

Create the project and function#
# Set the tracking
mlrun.mlconf.external_platform_tracking.mlflow.match_experiment_to_runtime = True

# Create a project for this demo:
project = mlrun.get_or_create_project(name="mlflow-tracking-example", context="./")

# Create an MLRun function using the example training file (all the functions must be located in it):
training_func = project.set_function(
    func="training.py",
    name="example-xgb-run",
    kind="job",
    image="mlrun/mlrun",
)
> 2023-12-06 09:01:32,156 [info] Project loaded successfully: {'project_name': 'mlflow-tracking-example'}
> 2023-12-06 09:01:32,164 [warning] Failed to add git metadata, ignore if path is not part of a git repo.: {'path': './', 'error': '/User/mlflow'}
Run the function#

After running the function, you can look at the UI and see that all metrics and parameters are logged in MLRun.

# Run the example code using mlrun
train_run = training_func.run(
    local=True,
    handler="example_xgb_run",
)
> 2023-12-06 09:02:10,520 [info] Storing function: {'name': 'example-xgb-run-example-xgb-run', 'uid': '4185accc906648a9a56f73d97829c1d0', 'db': 'http://mlrun-api:8080'}
> 2023-12-06 09:02:10,626 [warning] `mlconf.external_platform_tracking.mlflow.match_experiment_to_runtime` is set to True but the MLFlow experiment name environment variable ('MLFLOW_EXPERIMENT_NAME') is set for using the name: 'example-xgb-run-example-xgb-run'. This name will be overriden with MLRun's runtime name as set in the MLRun configuration: 'example-xgb-run-example-xgb-run'.
[0]	train-mlogloss:0.74723
[1]	train-mlogloss:0.54060
[2]	train-mlogloss:0.40276
[3]	train-mlogloss:0.30789
[4]	train-mlogloss:0.24051
[5]	train-mlogloss:0.19086
[6]	train-mlogloss:0.15471
[7]	train-mlogloss:0.12807
[8]	train-mlogloss:0.10722
[9]	train-mlogloss:0.09053
project uid iter start state name labels inputs parameters results artifacts
mlflow-tracking-example 0 Dec 06 09:02:10 completed example-xgb-run-example-xgb-run
v3io_user=zeevr
kind=local
owner=zeevr
host=jupyter-zeev-8c4f96bdf-6j652
mlflow-user=iguazio
mlflow-run-name=adaptable-perch-39
mlflow-run-id=3290451d92f24ea5988f8debd9d51670
mlflow-experiment-id=175092470379844344
colsample_bytree=1.0
custom_metric=None
early_stopping_rounds=None
eval_metric=mlogloss
learning_rate=0.3
maximize=None
num_boost_round=10
num_class=3
objective=multi:softprob
seed=42
subsample=1.0
verbose_eval=True
accuracy=1.0
log_loss=0.06621863381213823
train-mlogloss=0.09053360810503364
feature_importance_weight_json
feature_importance_weight_png
model

> to track results use the .show() or .logs() methods or click here to open in UI
> 2023-12-06 09:02:19,281 [info] Run execution finished: {'status': 'completed', 'name': 'example-xgb-run-example-xgb-run'}
Examine the results#
train_run.outputs
{'accuracy': 1.0,
 'log_loss': 0.06621863381213823,
 'train-mlogloss': 0.09053360810503364,
 'feature_importance_weight_json': 'store://artifacts/mlflow-tracking-example/example-xgb-run-example-xgb-run_feature_importance_weight_json:ad1fec0b0df04083b034e3839460b623',
 'feature_importance_weight_png': 'store://artifacts/mlflow-tracking-example/example-xgb-run-example-xgb-run_feature_importance_weight_png:ad1fec0b0df04083b034e3839460b623',
 'model': 'store://artifacts/mlflow-tracking-example/example-xgb-run-example-xgb-run_model:ad1fec0b0df04083b034e3839460b623'}
train_run.status.results
{'accuracy': 1.0,
 'log_loss': 0.06621863381213823,
 'train-mlogloss': 0.09053360810503364}
train_run.artifact("feature_importance_weight_png").show()
_images/1b6228c27e1bc3f9d49fc175c81ab5cf9926f1c328579523194e5ccddbdd0648.png
You can also examine the results using the UI#

Look at collected artifacts:

_images/mlflow-artifacts.png

And at results:

_images/mlflow-results.png

Use the function for model serving#

Implement the load and predict functions#
%%writefile serving.py

import zipfile
from typing import Any, Dict, List, Union

import mlflow
import numpy as np
import os
import mlrun
from mlrun.serving.v2_serving import V2ModelServer
import xgboost as xgb
import pandas as pd

class MLFlowModelServer(V2ModelServer):
    """
    The MLflow tracker model serving class inherits the V2ModelServer class, resulting in automatic
    initialization by the model server. It can run locally as part of a nuclio serverless function,
    or as part of a real-time pipeline.
    """

    def load(self):
        """
        loads a model that was logged by the MLFlow tracker model
        """
        # Unzip the model dir and then use mlflow's load function
        model_file, _ = self.get_model(".zip")
        model_path_unzip = model_file.replace(".zip", "")

        with zipfile.ZipFile(model_file, "r") as zip_ref:
            zip_ref.extractall(model_path_unzip)
            
        self.model = mlflow.pyfunc.load_model(model_path_unzip)

    def predict(self, request: Dict[str, Any]) -> list:
        """
        Infer the inputs through the model. The inferred data
        is read from the "inputs" key of the request.

        :param request: The request to the model using xgboost's predict. 
                The input to the model is read from the "inputs" key.

        :return: The model's prediction on the given input.
        """
        
        # Get the inputs and set to accepted type:
        inputs = pd.DataFrame(request["inputs"])

        # Predict using the model's predict function:
        predictions = self.model.predict(inputs)

        # Return as list:
        return predictions.tolist()
Overwriting serving.py
Create the server and serving function#
serving_func = project.set_function(
    func="serving.py",
    name="example-xgb-server",
    kind="serving",
    image="mlrun/mlrun",
)
> 2023-11-07 13:21:19,690 [warning] Failed to add git metadata, ignore if path is not part of a git repo.: {'path': './', 'error': '/User'}
# Add the model
serving_func.add_model(
    "mlflow_xgb_model",
    class_name="MLFlowModelServer",
    model_path=train_run.outputs["model"],
)
<mlrun.serving.states.TaskStep at 0x7f8450794d60>
# Create a mock server
server = serving_func.to_mock_server()
> 2023-11-07 13:21:19,833 [info] model mlflow_xgb_model was loaded
> 2023-11-07 13:21:19,834 [info] Loaded ['mlflow_xgb_model']
Test the model#
# An example taken randomly from the dataset that the model was trained on
x = [[5.1, 3.5, 1.4, 0.2]]
result = server.test("/v2/models/mlflow_xgb_model/predict", {"inputs": x})
# Look at the result, it shows the probability of the given example to be each of the
# irises featured in the dataset
result
{'id': '4188f3585d9d42b7b184324f713c9c26',
 'model_name': 'mlflow_xgb_model',
 'outputs': [[0.9505813121795654, 0.025876399129629135, 0.02354232780635357]]}

MLRun cheat sheet#

Table of contents#

MLRun setup#

Docs: Set up your client environment, Installation and setup guide

MLRun server/client overview#

MLRun has two main components, the service and the client (SDK+UI):

  • MLRun service runs over Kubernetes (can also be deployed using local Docker for demo and test purposes) - see installation documentation for more information

  • MLRun client SDK is installed in your development environment via pip and interacts with the service using REST API calls

Remote connection (laptop, CI/CD, etc.)#

Docs: Configure remote environment

Localhost: Create an mlrun.env file for environment variables. MLRUN_DBPATH stores the URL of the MLRun API service endpoint. Since it is localhost, username and access_key are not required:

mlrun config set -a http://localhost:8080
# MLRun DB
MLRUN_DBPATH=<URL endpoint of the MLRun APIs service endpoint; e.g., "https://mlrun-api.default-tenant.app.mycluster.iguazio.com">

Iguazio MLOps Platform (not MLRun CE):

mlrun config set -a https://mlrun-api.default-tenant.app.xxx.iguazio-cd1.com -u joe -k mykey -e 
# this is another env file
V3IO_USERNAME=joe
V3IO_ACCESS_KEY=mykey
MLRUN_DBPATH=https://mlrun-api.default-tenant.app.xxx.iguazio-cd1.com

Connect via MLRun Python SDK:

# Use local service
mlrun.set_environment("http://localhost:8080", artifact_path="./")
# Use remote service
mlrun.set_environment("<remote-service-url>", access_key="xyz", username="joe")

MLRun projects#

Docs: Projects and automation

General workflow#

Docs: Create, save, and use projects

# Create or load a project
project = mlrun.get_or_create_project(name="my-project", context="./")

# Add a function to the project
project.set_function(name='train_model', func='train_model.py', kind='job', image='mlrun/mlrun')

# Add a workflow (pipeline) to the project
project.set_workflow(name='training_pipeline', workflow_path='training_pipeline.py')

# Save the project and generate the project.yaml file
project.save()

# Run pipeline via project
project.run(name="training_pipeline", arguments={...})
Git integration#

Docs: Create and use functions

An MLRun project can be backed by a Git repo. Functions consume the repo and pull the code either once, when the Docker image is built (production workflow), or at runtime (development workflow).

Pull the repo code once (bake into Docker image)#
project.set_source(source="git://github.com/mlrun/project-archive.git")

fn = project.set_function(
    name="myjob", handler="job_func.job_handler",
    image="mlrun/mlrun", kind="job", with_repo=True,
)

project.build_function(fn)
Pull the repo code at runtime#
project.set_source(source="git://github.com/mlrun/project-archive.git", pull_at_runtime=True)

fn = project.set_function(
    name="nuclio", handler="nuclio_func:nuclio_handler",
    image="mlrun/mlrun", kind="nuclio", with_repo=True,
)
CI/CD integration#
Overview#

Docs: CI/CD automation with Git, Run pipelines with GitHub Actions, GitLab

Best practice for working with CI/CD is using MLRun Projects with a combination of the following:

  • Git: Single source of truth for source code and deployments via infrastructure as code. Allows for collaboration between multiple developers. An MLRun project can (and should) be tied to a Git repo. One project maps to one Git repo.

  • CI/CD: Main tool for orchestrating production deployments. The CI/CD system should be responsible for deploying latest code changes from Git onto the remote cluster via MLRun Python SDK or CLI.

  • Iguazio/MLRun: Kubernetes-based compute environment for running data analytics, model training, or model deployment tasks. Additionally, the cluster is where all experiment tracking, job information, logs, and more, is located.

See MLRun Projects for more information on Git and CI/CD integration. In practice, this may look something like the following:

Example (GitHub Actions)#

Full example: MLRun project-demo

name: mlrun-project-workflow
on: [issue_comment]

jobs:
  submit-project:
    if: github.event.issue.pull_request != null && startsWith(github.event.comment.body, '/run')
    runs-on: ubuntu-latest

    steps:
    - uses: actions/checkout@v3
    - name: Set up Python 3.9
      uses: actions/setup-python@v4
      with:
        python-version: '3.9'
        architecture: 'x64'
    
    - name: Install mlrun
      run: python -m pip install mlrun
    - name: Submit project
      run: python -m mlrun project ./ --watch --run main ${CMD:5}
      env:
        V3IO_USERNAME: ${{ secrets.V3IO_USERNAME }}
        V3IO_API: ${{ secrets.V3IO_API }}
        V3IO_ACCESS_KEY: ${{ secrets.V3IO_ACCESS_KEY }}
        MLRUN_DBPATH: ${{ secrets.MLRUN_DBPATH }}
        GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }} 
        SLACK_WEBHOOK: ${{ secrets.SLACK_WEBHOOK }}
        CMD: ${{ github.event.comment.body}}
Secrets#

Docs: Working with secrets

# Add secrets to the project
project.set_secrets(secrets={'AWS_KEY': '111222333'}, provider="kubernetes")

# Run the job with all secrets (automatically injects all project secrets for non-local runtimes)
project.run_function(fn)

# Retrieve the secret within the job
context.get_secret("AWS_KEY")

MLRun functions#

Essential runtimes#

Docs: Kinds of functions (runtimes)

Job#
# Job - run once to completion
job = project.set_function(name="my-job", func="my_job.py", kind="job", image="mlrun/mlrun", handler="handler")
project.run_function(job)
Nuclio#
# Nuclio - generic real-time function to do something when triggered
nuclio = project.set_function(name="my-nuclio", func="my_nuclio.py", kind="nuclio", image="mlrun/mlrun", handler="handler")
project.deploy_function(nuclio)
Serving#
# Serving - specialized Nuclio function specifically for model serving
serving = project.set_function(name="my-serving", func="my_serving.py", kind="serving", image="mlrun/mlrun", handler="handler")
serving.add_model(key="iris", model_path="https://s3.wasabisys.com/iguazio/models/iris/model.pkl", model_class="ClassifierModel")
project.deploy_function(serving)
Distributed runtimes#

Docs: Kinds of functions (runtimes)

MPIJob (Horovod)#
mpijob = mlrun.code_to_function(name="my-mpijob", filename="my_mpijob.py", kind="mpijob", image="mlrun/mlrun", handler="handler")
mpijob.spec.replicas = 3
mpijob.run()
Dask#
dask = mlrun.new_function(name="my-dask", kind="dask", image="mlrun/ml-base")
dask.spec.remote = True
dask.spec.replicas = 5
dask.spec.service_type = 'NodePort'
dask.with_worker_limits(mem="6G")
dask.with_scheduler_limits(mem="1G")
dask.spec.nthreads = 5
dask.apply(mlrun.mount_v3io())
dask.client
Spark Operator#
import os
read_csv_filepath = os.path.join(os.path.abspath('.'), 'spark_read_csv.py')

spark = mlrun.new_function(kind='spark', command=read_csv_filepath, name='sparkreadcsv') 
spark.with_driver_limits(cpu="1300m")
spark.with_driver_requests(cpu=1, mem="512m") 
spark.with_executor_limits(cpu="1400m")
spark.with_executor_requests(cpu=1, mem="512m")
spark.with_igz_spark() 
spark.spec.replicas = 2 

spark.deploy() # build image
spark.run(artifact_path='/User') # run spark job
Resource management#

Docs: Managing job resources

Requests/limits (MEM/CPU/GPU)#
# Requests - lower bound
fn.with_requests(mem="1G", cpu=1)

# Limits - upper bound
fn.with_limits(mem="2G", cpu=2, gpus=1)
Scaling and auto-scaling#
# Nuclio/serving scaling
fn.spec.replicas = 2
fn.spec.min_replicas = 1
fn.spec.max_replicas = 4
Scale to zero#
# Nuclio/serving scaling
fn.spec.min_replicas = 0    # zero value is mandatory for scale to zero
fn.spec.max_replicas = 2

# Scale to zero after 30 minutes of idle time (windowSize)
fn.set_config(key="spec.scaleToZero.scaleResources",
              value=[{"metricName":"nuclio_processor_handled_events_total",
                      "windowSize" : "30m",     # default values are 1m, 2m, 5m, 10m, 30m
                      "threshold" : 0}])
Mount persistent storage#
# Mount Iguazio V3IO
fn.apply(mlrun.mount_v3io())

# Mount PVC
fn.apply(mlrun.platforms.mount_pvc(pvc_name="data-claim", volume_name="data", volume_mount_path="/data"))
Pod priority#
fn.with_priority_class(name="igz-workload-medium")
Node selection#
fn.with_node_selection(node_selector={"app.iguazio.com/lifecycle" : "non-preemptible"})
Serving/Nuclio triggers#

Docs: Nuclio Triggers

By default, Nuclio deploys a default HTTP trigger if the function doesn't have one. This is because users typically want to invoke functions through HTTP. However, we provide a way to disable the default HTTP trigger using: function.disable_default_http_trigger()

Also, you can explicitly enable the default HTTP trigger creation with: function.enable_default_http_trigger()

If you didn't set this parameter explicitly, the value is taken from Nuclio platform configuration. Therefore, if you haven't disabled the default HTTP trigger, don't have a custom one, and are unable to invoke the function, we recommend checking the Nuclio platform configuration.
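
For example (using fn as the function object, consistent with the rest of this cheat sheet):

# Explicitly control creation of the default HTTP trigger
fn.disable_default_http_trigger()
fn.enable_default_http_trigger()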

import nuclio
serve = mlrun.import_function('hub://v2_model_server')

# Set amount of workers 
serve.with_http(workers=8, worker_timeout=10)

# V3IO stream trigger
serve.add_v3io_stream_trigger(stream_path='v3io:///projects/myproj/stream1', name='stream', group='serving', seek_to='earliest', shards=1)

# Kafka stream trigger
serve.add_trigger(
    name="kafka",
    spec=nuclio.KafkaTrigger(brokers=["192.168.1.123:39092"], topics=["TOPIC"], partitions=4, consumer_group="serving", initial_offset="earliest")
)

# Cron trigger
serve.add_trigger("cron_interval", spec=nuclio.CronTrigger(interval="10s"))
serve.add_trigger("cron_schedule", spec=nuclio.CronTrigger(schedule="0 9 * * *"))

Note

Each worker has a separate scope. This means that each worker has its own copy of the variables, and all changes are kept within that worker (changes made by worker x do not affect worker y).

Building Docker images#

Docs: Build function image, Images and their usage in MLRun

Manually build image#
project.set_function(
   "train_code.py", name="trainer", kind="job",
   image="mlrun/mlrun", handler="train_func", requirements=["pandas==1.3.5"]
)

project.build_function(
    "trainer",
    # Specify base image
    base_image="myrepo/base_image:latest",
    # Run arbitrary commands
    commands= [
        "pip install git+https://github.com/myusername/myrepo.git@mybranch",
        "mkdir -p /some/path && chmod 0777 /some/path",    
    ]
)
Automatically build image#
project.set_function(
   "train_code.py", name="trainer", kind="job",
   image="mlrun/mlrun", handler="train_func", requirements=["pandas==1.3.5"]
)

# auto_build will trigger building the image before running, 
# due to the additional requirements.
project.run_function("trainer", auto_build=True)

Multi-stage workflows (batch pipelines)#

Docs: Running a multi-stage workflow

Write a workflow#
# pipeline.py
from kfp import dsl
import mlrun
import nuclio

# Create a Kubeflow Pipelines pipeline
@dsl.pipeline(
    name="batch-pipeline",
    description="Example of batch pipeline for heart disease dataset"
)
def pipeline(source_url, label_column):
    
    # Get current project
    project = mlrun.get_current_project()
    
    # Ingest the data set
    ingest = mlrun.run_function(
        'get-data',
        handler='prep_data',
        inputs={'source_url': source_url},
        params={'label_column': label_column},
        outputs=["cleaned_data"]
    )
    
    # Train a model   
    train = mlrun.run_function(
        "train",
        handler="train_model",
        inputs={"dataset": ingest.outputs["cleaned_data"]},
        params={"label_column": label_column},
        outputs=['model']
    )
Add workflow to project#
# Functions within the workflow
project.set_function(name='get-data', func='get_data.py', kind='job', image='mlrun/mlrun')
project.set_function(name='train', func='train.py', kind='job', image='mlrun/mlrun')

# Workflow
project.set_workflow(name='main', workflow_path='pipeline.py')

project.save()
Run workflow#

Python SDK

run_id = project.run(
    name="main",
    arguments={
        "source_url" : "store://feature-vectors/heart-disease-classifier/heart-disease-vec:latest",
        "label_column" : "target"
    }
)

CLI

mlrun project --run main \
    --arguments source_url=store://feature-vectors/heart-disease-classifier/heart-disease-vec:latest \
    --arguments label_column=target
Schedule workflow#
run_id = project.run(
    name="main",
    arguments={
        "source_url" : "store://feature-vectors/heart-disease-classifier/heart-disease-vec:latest",
        "label_column" : "target"
    },
    schedule="0 * * * *"
)

Logging#

Docs: MLRun execution context

context.logger.debug(message="Debugging info")              # logging all (debug, info, warning, error)
context.logger.info(message="Something happened")           # logging info, warning and error
context.logger.warning(message="Something might go wrong")  # logging warning and error
context.logger.error(message="Something went wrong")        # logging only error

Note

The real-time (nuclio) function uses default logger level debug (logging all)

Experiment tracking#

Docs: MLRun execution context, Automated experiment tracking, Decorators and auto-logging

Manual logging#
context.log_result(key="accuracy", value=0.934)
context.log_model(key="model", model_file="model.pkl")
context.log_dataset(key="model", df=df, format="csv", index=False)
Track returning values using hints and returns#
  • Pass type hints into the inputs parameter of the run method. Inputs are automatically parsed to their hinted type. If type hints are not in code, they can be passed in the input keys. Hints use the structure: key : type_hint

  • Pass log hints: how to log the returning values from a handler. The log hints are passed via the returns parameter in the run method. A log hint can be passed as a string or a dictionary.

  • Use the returns argument to specify how to log a function's returned values.

def my_handler(df):
    ...
    return processed_df, result
    
log_with_returns_run = my_func.run(
    handler="my_handler",
    inputs={"df: pandas.DataFrame": DATA_PATH},
    returns=["processed_data", "sum"],
    local=True,
)
Automatic logging#
# Auto logging for ML frameworks
from mlrun.frameworks.sklearn import apply_mlrun

apply_mlrun(model=model, model_name="my_model", x_test=X_test, y_test=y_test)
model.fit(X_train, y_train)

# MLRun decorator for input/output parsing
@mlrun.handler(labels={'framework':'scikit-learn'},
               outputs=['prediction:dataset'],
               inputs={"train_data": pd.DataFrame,
                       "predict_input": pd.DataFrame})
def train_and_predict(train_data,
                      predict_input,
                      label_column='label'):

    x = train_data.drop(label_column, axis=1)
    y = train_data[label_column]

    clf = SVC()
    clf.fit(x, y)

    return list(clf.predict(predict_input))

Model inferencing and serving#

Docs: Deploy models and applications

Real-time serving#

Docs: Using built-in model serving classes, Build your own model serving class, Model serving API

serve = mlrun.import_function('hub://v2_model_server')
serve.add_model(key="iris", model_path="https://s3.wasabisys.com/iguazio/models/iris/model.pkl")

# Deploy to local mock server (Development testing)
mock_server = serve.to_mock_server()

# Deploy to serverless function (Production K8s deployment)
addr = serve.deploy()
Batch inferencing#

Docs: Batch inference

batch_inference = mlrun.import_function("hub://batch_inference")
batch_run = project.run_function(
    batch_inference,
    inputs={"dataset": prediction_set_path},
    params={"model": model_artifact.uri},
)

Model monitoring and drift detection#

Docs: Model monitoring overview, Batch inference

Real-time drift detection#
# Log the model with training set
context.log_model("model", model_file="model.pkl", training_set=X_train)

# Enable tracking for the model server
serving_fn = import_function('hub://v2_model_server', project=project_name).apply(auto_mount())
serving_fn.add_model("model", model_path="store://models/project-name/model:latest") # Model path comes from experiment tracking DB
serving_fn.set_tracking()

# Deploy the model server
serving_fn.deploy()
Batch drift detection#
batch_inference = mlrun.import_function("hub://batch_inference")
batch_run = project.run_function(
    batch_inference,
    inputs={
        "dataset": prediction_set_path,
        "sample_set": training_set_path
    },
    params={
        "model": model_artifact.uri,
        "label_columns": "label",
        "perform_drift_analysis" : True
    }
)

Sources and targets#

Abstract underlying storage to easily retrieve and store data from various sources

Docs: Ingest data using the feature store

Sources#

Docs: Sources

from mlrun.datastore.sources import CSVSource, ParquetSource, BigQuerySource, KafkaSource, SnowflakeSource

# CSV
csv_source = CSVSource(name="read", path="/User/getting_started/examples/demo.csv")
csv_df = csv_source.to_dataframe()

# Parquet
from pyspark.sql import SparkSession

session = SparkSession.builder.master("local").getOrCreate()
parquet_source = ParquetSource(name="read", path="v3io://users/admin/getting_started/examples/userdata1.parquet")
spark_df = parquet_source.to_spark_df(session=session)

# BigQuery
bq_source = BigQuerySource(name="read", table="the-psf.pypi.downloads20210328", gcp_project="my_project")
bq_df = bq_source.to_dataframe()

# Kafka
kafka_source = KafkaSource(
    name="read",
    brokers='localhost:9092',
    topics='topic',
    group='serving',
    initial_offset='earliest'
)
kafka_source.add_nuclio_trigger(function=fn)

# Snowflake
snowflake_source = SnowflakeSource(
    name="read",
    query="select * from customer limit 100000",
    url="<url>",
    user="<user>",
    password="<password>",
    database="SNOWFLAKE_SAMPLE_DATA",
    schema="TPCH_SF1",
    warehouse="compute_wh",
)
snowflake_df = snowflake_source.to_dataframe()
Targets#

Docs: Targets, Partitioning on Parquet target

from mlrun.datastore.targets import CSVTarget, ParquetTarget, RedisNoSqlTarget, KafkaTarget

# CSV
csv_target = CSVTarget(name="write", path="/User/test.csv")
csv_target.write_dataframe(df=csv_df, key_column="id")

# Parquet
pq_target = ParquetTarget(
    name="write",
    path="/User/test.parquet",
    partitioned=True,
    partition_cols=["country"]
)
pq_target.write_dataframe(df=pq_df, key_column="id")

# Redis (see docs for writing online features)
redis_target = RedisNoSqlTarget(name="write", path="redis://1.2.3.4:6379")
redis_target.write_dataframe(df=redis_df)

# Kafka (see docs for writing online features)
kafka_target = KafkaTarget(
    name="write",
    bootstrap_servers='localhost:9092',
    topic='topic',
)
kafka_target.write_dataframe(df=kafka_df)

Feature store#

Docs: Feature Store, Feature sets, Feature set transformations, Creating and using feature vectors, Feature store end-to-end demo

Definitions#

Docs: Feature store overview

  • Feature Set: A group of features that can be ingested together and stored in logical group (usually one-to-one with a dataset, stream, table, etc.)

  • Feature Vector: A group of features from different Feature Sets

Engines#

Docs: Feature store overview, Ingest features with Spark

  • storey engine (default) is designed for real-time data (e.g. individual records) that will be transformed using Python functions and classes.

  • pandas engine is designed for batch data that fits in memory and is transformed using pandas DataFrames. The pandas engine is useful for testing, and is not recommended for production deployments.

  • spark engine is designed for batch data.

Feature sets#

Docs: Feature sets

Basic ingestion#

Docs: Ingest data using the feature store

import mlrun.feature_store as fstore
from mlrun.datastore.sources import ParquetSource

categorical_fset = fstore.FeatureSet(
    name="heart-disease-categorical",
    entities=[fstore.Entity("patient_id")],
    description="Categorical columns for heart disease dataset"
)

categorical_fset.ingest(
    source=ParquetSource(path="./data/heart_disease_categorical.parquet")
)
Feature set per engine#
from mlrun.datastore.sources import CSVSource, DataFrameSource

# Storey engine
storey_set = fstore.FeatureSet(
    name="heart-disease-storey",
    entities=[fstore.Entity("patient_id")],
    description="Heart disease data via storey engine",
    engine="storey"
)
storey_set.ingest(source=DataFrameSource(df=data))

# Pandas engine
pandas_set = fstore.FeatureSet(
    name="heart-disease-pandas",
    entities=[fstore.Entity("patient_id")],
    description="Heart disease data via pandas engine",
    engine="pandas"
)
pandas_set.ingest(source=DataFrameSource(df=data))

# Spark engine
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Spark function").getOrCreate()

spark_set = fstore.FeatureSet(
    name="heart-disease-spark",
    entities=[fstore.Entity("patient_id")],
    description="Heart disease data via spark engine",
    engine="spark"
)
spark_set.ingest(source=CSVSource(path=v3io_data_path), spark_context=spark)
Ingestion methods#

Docs: Ingest data locally, Ingest data using an MLRun job, Real-time ingestion, Incremental ingestion, Feature store end-to-end demo

# Local
from mlrun.datastore.sources import CSVSource

fs = fstore.FeatureSet("stocks", entities=[fstore.Entity("ticker")])
df = fs.ingest(
    source=CSVSource("mycsv", path="stocks.csv")
)

# Job
from mlrun.datastore.sources import ParquetSource

fs = fstore.FeatureSet("stocks", entities=[fstore.Entity("ticker")])
df = fs.ingest(
    source=ParquetSource("mypq", path="stocks.parquet"),
    run_config=fstore.RunConfig(image='mlrun/mlrun')
)

# Real-Time
from mlrun.datastore.sources import HttpSource

fs = fstore.FeatureSet("stocks", entities=[fstore.Entity("ticker")])
url, _ = fs.deploy_ingestion_service(
    source=HttpSource(key_field="ticker"),
    run_config=fstore.RunConfig(image='mlrun/mlrun', kind="serving")
)

# Incremental
cron_trigger = "0 */1 * * *"  # run every hour
fs = fstore.FeatureSet("stocks", entities=[fstore.Entity("ticker")])
fs.ingest(
    source=ParquetSource("mypq", path="stocks.parquet", time_field="time", schedule=cron_trigger),
    run_config=fstore.RunConfig(image='mlrun/mlrun')
)
Aggregations#

Docs: add_aggregation(), Aggregations

quotes_set = fstore.FeatureSet("stock-quotes", entities=[fstore.Entity("ticker")])
quotes_set.add_aggregation("bid", ["min", "max"], ["1h"], "10m")
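Assuming the default MLRun naming convention of <column>_<operation>_<window>, the resulting aggregation features can later be referenced by name. A minimal sketch (the vector name is arbitrary):

# Hedged sketch: reference the generated aggregation features in a feature vector
import mlrun.feature_store as fstore

agg_vector = fstore.FeatureVector(
    name="quotes-vector",
    features=["stock-quotes.bid_min_1h", "stock-quotes.bid_max_1h"],
)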
Built-in transformations#

Docs: storey.transformations, Built-in transformations

quotes_set.graph.to("storey.Filter", "filter", _fn="(event['bid'] > 50)")
Custom transformations#

Docs: Custom transformations

Define transformation

# Storey
from storey import MapClass

class MyMapStorey(MapClass):
    def __init__(self, multiplier=1, **kwargs):
        super().__init__(**kwargs)
        self._multiplier = multiplier

    def do(self, event):
        event["multi"] = event["bid"] * self._multiplier
        return event

# Pandas
class MyMapPandas:
    def __init__(self, multiplier=1, **kwargs):
        self._multiplier = multiplier

    def do(self, df):
        df["multi"] = df["bid"] * self._multiplier
        return df

# Spark
class MyMapSpark:
    def __init__(self, multiplier=1, **kwargs):
        self._multiplier = multiplier

    def do(self, df):
        df = df.withColumn("multi", df["bid"] * self._multiplier)
        return df

Use in graph

quotes_set.graph.add_step("MyMapStorey", "multi", after="filter", multiplier=3)
Feature vectors#

Docs: Feature vectors

Basic retrieval#
import mlrun.feature_store as fstore
from mlrun.datastore.targets import ParquetTarget

fvec = fstore.FeatureVector(
    name="heart-disease-vector",
    features=["heart-disease-categorical.*", "heart-disease-continuous.*"],
    description="Heart disease dataset",
)
fvec.save()

# Offline features for training
df = fstore.get_offline_features("iguazio-academy/heart-disease-vector").to_dataframe()

# Materialize offline features to parquet
fstore.get_offline_features("iguazio-academy/heart-disease-vector", target=ParquetTarget())

# Online features for serving
feature_service = fstore.get_online_feature_service(feature_vector="iguazio-academy/heart-disease-vector")
feature_service.get(
    [
        {"patient_id" : "e443544b-8d9e-4f6c-9623-e24b6139aae0"},
        {"patient_id" : "8227d3df-16ab-4452-8ea5-99472362d982"}
    ]
)
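When serving is done, the online feature service should be closed to release its resources; a minimal sketch:

# Hedged sketch: close the online feature service session
feature_service.close()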

Real-time pipelines#

Docs: Real-time serving pipelines, Real-time pipeline use cases, Graph concepts and state machine, Model serving graph, Writing custom steps

Definitions#

Graphs are composed of the following:

  • Step: A step runs a function or class handler or a REST API call

  • Router: A special type of step with routing logic and multiple child routes/models

  • Queue: A queue or stream that accepts data from one or more source steps and publishes to one or more output steps

Graphs have two modes (topologies):

  • Router topology (default): A minimal configuration with a single router and child tasks/routes

  • Flow topology: A full graph/DAG

Simple graph#

Docs: Real-time serving pipelines getting started

Define Python file(s) to orchestrate

# graph.py
def inc(x):
    return x + 1

def mul(x):
    return x * 2

class WithState:
    def __init__(self, name, context, init_val=0):
        self.name = name
        self.context = context
        self.counter = init_val

    def do(self, x):
        self.counter += 1
        print(f"Echo: {self.name}, x: {x}, counter: {self.counter}")
        return x + self.counter

Define MLRun function and graph

import mlrun
fn = project.set_function(
    name="simple-graph", func="graph.py",
    kind="serving", image="mlrun/mlrun"
)
graph = fn.set_topology("flow")

# inc, mul, and WithState are all defined in graph.py
graph.to(name="+1", handler='inc')\
     .to(name="*2", handler='mul')\
     .to(name="(X+counter)", class_name='WithState').respond()

# Local testing
server = fn.to_mock_server()
server.test(body=5)

# K8s deployment
project.deploy_function(fn)

Simple model serving router#

Docs: Example of a simple model serving router

# load the sklearn model serving function and add models to it  
fn = mlrun.import_function('hub://v2_model_server')
fn.add_model("model1", model_path="s3://...")
fn.add_model("model2", model_path="store://...")

# deploy the function to the cluster
project.deploy_function(fn)

# test the live model endpoint
fn.invoke('/v2/models/model1/infer', body={"inputs": [5]})

Custom model serving class#

Docs: Model serving graph

from cloudpickle import load
from typing import List
import numpy as np

import mlrun

class ClassifierModel(mlrun.serving.V2ModelServer):
    def load(self):
        """load and initialize the model and/or other elements"""
        model_file, extra_data = self.get_model(".pkl")
        self.model = load(open(model_file, "rb"))

    def predict(self, body: dict) -> List:
        """Generate model predictions from sample."""
        feats = np.asarray(body["inputs"])
        result: np.ndarray = self.model.predict(feats)
        return result.tolist()
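A minimal sketch of wiring the custom class into a serving function (the file name "serving.py" and the model path are placeholders):

import mlrun

# Hedged sketch: register the custom class as a model route and deploy
serving_fn = project.set_function(
    name="serving", func="serving.py", kind="serving", image="mlrun/mlrun"
)
serving_fn.add_model(
    "my-model",
    model_path="store://models/project-name/model:latest",
    class_name="ClassifierModel",
)

# local test before deploying to the cluster
server = serving_fn.to_mock_server()
# server.test("/v2/models/my-model/infer", body={"inputs": [[5.1, 3.5, 1.4, 0.2]]})
project.deploy_function(serving_fn)
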
Advanced data processing and serving ensemble#

Docs: Advanced model serving graph - notebook example

fn = project.set_function(
    name="advanced", func="demo.py", 
    kind="serving", image="mlrun/mlrun"
)
graph = fn.set_topology("flow", engine="async")

# use built-in storey class or our custom Echo class to create and link Task steps. Add an error 
# handling step that runs only if the "Echo" step fails
graph.to("storey.Extend", name="enrich", _fn='({"tag": "something"})') \
     .to(class_name="Echo", name="pre-process", some_arg='abc').error_handler(name='catcher', handler='handle_error', full_event=True)

# add an Ensemble router with two child models (routes), the "*" prefix marks it as router class
router = graph.add_step("*mlrun.serving.VotingEnsemble", name="ensemble", after="pre-process")
router.add_route("m1", class_name="ClassifierModel", model_path=path1)
router.add_route("m2", class_name="ClassifierModel", model_path=path2)

# add the final step (after the router), which handles post-processing and response to the client
graph.add_step(class_name="Echo", name="final", after="ensemble").respond()

Hyperparameter tuning#

Docs: Hyperparameter tuning optimization

The following hyperparameter examples use this function:

# hp.py
def hyper_func(context, p1, p2):
    print(f"p1={p1}, p2={p2}, result={p1 * p2}")
    context.log_result("multiplier", p1 * p2)

# MLRun function in project
fn = project.set_function(
    name="hp",
    func="hp.py",
    image="mlrun/mlrun",
    kind="job",
    handler="hyper_func"
)

Note

The selector can refer to any value that is logged by the function, in this case multiplier.

Grid search (default)#

Docs: Grid Search

Runs all parameter combinations

hp_tuning_run = project.run_function(
    "hp", 
    hyperparams={"p1": [2,4,1], "p2": [10,20]}, 
    selector="max.multiplier"
)
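Grid search is the default strategy; other strategies, such as random search, can be selected via hyper_param_options. A minimal sketch (the max_iterations value is an arbitrary assumption):

# Hedged sketch: random search over the same parameter space
hp_random_run = project.run_function(
    "hp",
    hyperparams={"p1": [2, 4, 1], "p2": [10, 20]},
    selector="max.multiplier",
    hyper_param_options=mlrun.model.HyperParamOptions(
        strategy="random",
        max_iterations=5,  # number of random trials (assumed value)
    ),
)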
Parallel executors#

Docs: Parallel execution over containers

Dask#

Docs: Running the workers using Dask

# Create Dask cluster
dask_cluster = mlrun.new_function("dask-cluster", kind="dask", image="mlrun/ml-base")
dask_cluster.apply(mlrun.mount_v3io())  # add volume mounts
dask_cluster.spec.service_type = "NodePort"  # open interface to the dask UI dashboard
dask_cluster.spec.replicas = 2  # define two containers
uri = dask_cluster.save()

# Run parallel hyperparameter trials
hp_tuning_run_dask = project.run_function(
    "hp",
    hyperparams={"p1": [2, 4, 1], "p2": [10, 20, 30]},
    selector="max.multiplier",
    hyper_param_options=mlrun.model.HyperParamOptions(
        strategy="grid",
        parallel_runs=4,
        dask_cluster_uri=uri,
        teardown_dask=True,
    ),
)
Nuclio#

Docs: Running the workers using Nuclio

# Create nuclio:mlrun function
fn = project.set_function(
    name='hyper-tst2',
    func="hp.py",
    kind='nuclio:mlrun',
    image='mlrun/mlrun'
)
# (replicas * workers) must be equal to or greater than parallel_runs
fn.spec.replicas = 2
fn.with_http(workers=2)
fn.deploy()

# Run the parallel tasks over the function
hp_tuning_run_nuclio = project.run_function(
    "hyper-tst2",
    hyperparams={"p1": [2, 4, 1], "p2": [10, 20, 30]},
    selector="max.multiplier",
    hyper_param_options=mlrun.model.HyperParamOptions(
        strategy="grid",
        parallel_runs=4,
        max_errors=3
    ),
    handler="hyper_func"
)

Targeted tutorials#

Each of the following tutorials is a dedicated Jupyter notebook. You can download them by clicking the download icon at the top of each page.

Train, compare, and register Models

Demonstrates training ML models with hyperparameters, tracking and comparing experiments, and registering and using the models.

Serving pre-trained ML/DL models

How to deploy real-time serving pipelines with MLRun Serving and different types of pre-trained ML/DL models.

Projects & automated ML pipeline

How to work with projects, source control (git), and CI/CD to easily build and deploy multi-stage ML pipelines.

Real-time monitoring & drift detection

Demonstrates MLRun Serving pipelines, MLRun model monitoring, and automated drift detection.

Add MLOps to existing code

Turn a Kaggle research notebook into a production ML micro-service with minimal code changes using MLRun.

Basic feature store example (stocks)

Understand MLRun feature store with a simple example: build, transform, and serve features in batch and in real-time.

Batch inference and drift detection

Use MLRun batch inference function (from MLRun Function Hub), run it as a batch job, and generate drift reports.

Advanced real-time pipeline

Demonstrates a multi-step online pipeline with data prep, ensemble, model serving, and post processing.

Feature store end-to-end demo

Use the feature store with data ingestion, model training, model serving, and an automated pipeline.

End to end demos#

You can find the different end-to-end demos in the MLRun demos repository: github.com/mlrun/demos.

Cheat sheet#

If you already know the basics, use the cheat sheet as a guide to typical use cases and their flows/SDK.

Running the demos in Open Source MLRun#

By default, these demos work with the online feature store, which is currently not part of the Open Source MLRun default deployment:

  • fraud-prevention-feature-store

  • network-operations

  • azureml_demo

Installation and setup guide#

This guide outlines the steps for installing and running MLRun.

MLRun has two main components, the service and the client (SDK and UI):

  • MLRun service runs over Kubernetes (can also be deployed using local Docker for demo and test purposes). It can orchestrate and integrate with other open source frameworks, as shown in the following diagram.

  • MLRun client SDK is installed in your development environment and interacts with the service using REST API calls.

This release of MLRun supports only Python 3.9 for both the server and the client.

(Diagram: MLRun flow)



Deployment options#

There are several deployment options:

  • Local deployment: Deploy MLRun using Docker on your laptop or on a single server. This option is good for testing the waters or when working in a small-scale environment. It is limited in terms of computing resources and scale, but simpler to deploy.

  • Kubernetes cluster: Deploy an MLRun server on Kubernetes. This option deploys MLRun on a Kubernetes cluster, which supports elastic scaling. Yet, it is more complex to install as it requires you to install Kubernetes on your own.

  • Amazon Web Services (AWS): Deploy an MLRun server on AWS. This option is the easiest way to install an MLRun cluster and use cloud-based services. The MLRun software is free of charge; however, there is a cost for the AWS infrastructure services.

  • Iguazio's Managed Service: A commercial offering by Iguazio. This is the fastest way to explore the full set of MLRun functionalities.
    Note that Iguazio provides a 14-day free trial.

Set up your client#

You can work with your favorite IDE (e.g. PyCharm, VSCode, Jupyter, Colab). Read how to configure your client against the deployed MLRun server in Set up your environment.

Once you have installed and configured MLRun, follow the Quick Start tutorial and additional Tutorials and Examples to learn how to use MLRun to develop and deploy machine learning applications to production.

MLRun client backward compatibility#

Starting from MLRun v1.3.0, the MLRun server is compatible with the client and images of the previous two minor MLRun releases. When you upgrade to v1.3.0, for example, you can continue to use your v1.1- and v1.2-based images, but v1.0-based images are not compatible.

Important

  • Images from 0.9.0 are not compatible with 0.10.0. Backward compatibility starts from 0.10.0.

  • When you upgrade the MLRun major version, for example 0.10.x to 1.0.x, there is no backward compatibility.

  • The feature store is not backward compatible.

  • When you upgrade the platform, for example from 3.2 to 3.3, the clients should be upgraded. There is no guaranteed compatibility with an older MLRun client after a platform upgrade.

See also Images and their usage in MLRun.

Security#

Non-root user support#

By default, MLRun assigns the root user to MLRun runtimes and pods. You can improve the security context by changing the security mode, which is implemented by Iguazio during installation and applied system-wide:

  • Override: Use the user id of the user that triggered the current run or use the nogroupid for group id. Requires Iguazio v3.5.1.

  • Disabled (default): The security context is not applied automatically (the system applies the root user).

Security context#

If your system is configured in disabled mode, you can apply a security context to individual runtimes/pods by using function.with_security_context; the job is then assigned to the user, or to the user's group, that ran the job.
(You cannot override the user of individual jobs if the system is configured in override mode.) For example:

from kubernetes import client as k8s_client

security_context = k8s_client.V1SecurityContext(
    run_as_user=1000,
    run_as_group=3000,
)
function.with_security_context(security_context)

See the full definition of the V1SecurityContext object.

Some services do not support security context yet:

  • Infrastructure services

    • Kubeflow pipelines core services

  • Services created by MLRun

    • Kaniko, used for building images. (To avoid using Kaniko, use prebuilt images that contain all the requirements.)

    • Spark services

Install MLRun locally using Docker#

You can install and use MLRun and Nuclio locally on your computer. This does not include all the services and elastic scaling capabilities, which you can get with the Kubernetes based deployment, but it is much simpler to start with.

Note

Using Docker is limited to local, Nuclio, serving runtimes, and local pipelines.

Prerequisites#
  • Memory: 8GB

  • Storage: 7GB

Overview#

Use docker compose to install MLRun. It deploys the MLRun service, the MLRun UI, the Nuclio serverless engine, and, optionally, the Jupyter server. The MLRun service, MLRun UI, Nuclio, and Jupyter do not have default resources, meaning they are set with the default cluster/namespace resource limits. These can be modified.

There are two installation options: use MLRun with your own client (IDE or notebook), or use the MLRun Jupyter image (see the two sections below).

In both cases you need to set the SHARED_DIR environment variable to point to a host path for storing MLRun artifacts and the DB, for example export SHARED_DIR=~/mlrun-data (or set SHARED_DIR=c:\mlrun-data in Windows). Make sure the directory exists.

You also need to set the HOST_IP variable with your computer IP address (required for Nuclio dashboard). You can select a specific MLRun version with the TAG variable and Nuclio version with the NUCLIO_TAG variable.

Note

Support for running as a non-root user was added in 1.0.5, hence the underlying exposed port was changed. If you want to use previous mlrun versions, modify the mlrun-ui port from 8090 back to 80.

If you are running more than one instance of MLRun, change the exposed port.


Use MLRun with your own client#

The following commands install MLRun and Nuclio for work with your own IDE or notebook.

[Download here] the compose.yaml file, save it to the working directory, and run:

The compose.yaml file:
services:
  init_nuclio:
    image: alpine:3.18
    command:
      - "/bin/sh"
      - "-c"
      - |
        mkdir -p /etc/nuclio/config/platform; \
        cat << EOF | tee /etc/nuclio/config/platform/platform.yaml
        runtime:
          common:
            env:
              MLRUN_DBPATH: http://${HOST_IP:?err}:8080
        local:
          defaultFunctionContainerNetworkName: mlrun
          defaultFunctionRestartPolicy:
            name: always
            maxRetryCount: 0
          defaultFunctionVolumes:
            - volume:
                name: mlrun-stuff
                hostPath:
                  path: ${SHARED_DIR:?err}
              volumeMount:
                name: mlrun-stuff
                mountPath: /home/jovyan/data/
        logger:
          sinks:
            myStdoutLoggerSink:
              kind: stdout
          system:
            - level: debug
              sink: myStdoutLoggerSink
          functions:
            - level: debug
              sink: myStdoutLoggerSink
        EOF
    volumes:
      - nuclio-platform-config:/etc/nuclio/config

  mlrun-api:
    image: "mlrun/mlrun-api:${TAG:-1.5.1}"
    ports:
      - "8080:8080"
    environment:
      MLRUN_ARTIFACT_PATH: "${SHARED_DIR}/{{project}}"
      # using local storage, meaning files / artifacts are stored locally, so we want to allow access to them
      MLRUN_HTTPDB__REAL_PATH: /data
      MLRUN_HTTPDB__DATA_VOLUME: "${SHARED_DIR}"
      MLRUN_LOG_LEVEL: DEBUG
      MLRUN_NUCLIO_DASHBOARD_URL: http://nuclio:8070
      MLRUN_HTTPDB__DSN: "sqlite:////data/mlrun.db?check_same_thread=false"
      MLRUN_UI__URL: http://localhost:8060
      # not running on k8s meaning no need to store secrets
      MLRUN_SECRET_STORES__KUBERNETES__AUTO_ADD_PROJECT_SECRETS: "false"
      # let mlrun control nuclio resources
      MLRUN_HTTPDB__PROJECTS__FOLLOWERS: "nuclio"
    volumes:
      - "${SHARED_DIR:?err}:/data"
    networks:
      - mlrun

  mlrun-ui:
    image: "mlrun/mlrun-ui:${TAG:-1.5.1}"
    ports:
      - "8060:8090"
    environment:
      MLRUN_API_PROXY_URL: http://mlrun-api:8080
      MLRUN_NUCLIO_MODE: enable
      MLRUN_NUCLIO_API_URL: http://nuclio:8070
      MLRUN_NUCLIO_UI_URL: http://localhost:8070
    networks:
      - mlrun

  nuclio:
    image: "quay.io/nuclio/dashboard:${NUCLIO_TAG:-stable-amd64}"
    ports:
      - "8070:8070"
    environment:
      NUCLIO_DASHBOARD_EXTERNAL_IP_ADDRESSES: "${HOST_IP:?err}"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - nuclio-platform-config:/etc/nuclio/config
    depends_on:
      - init_nuclio
    networks:
      - mlrun

volumes:
  nuclio-platform-config: {}

networks:
  mlrun:
    name: mlrun
export HOST_IP=<your host IP address>
export SHARED_DIR=~/mlrun-data
mkdir -p $SHARED_DIR
docker-compose -f compose.yaml up -d

Your HOST_IP address can be found using the ip addr or ifconfig commands (do not use localhost or 127.0.0.1). It is recommended to select an address that does not change dynamically (for example the IP of the bridge interface).

set HOST_IP=<your host IP address>
set SHARED_DIR=c:\mlrun-data
mkdir %SHARED_DIR%
docker-compose -f compose.yaml up -d

Your HOST_IP address can be found using the ipconfig shell command (do not use localhost or 127.0.0.1). It is recommended to select an address that does not change dynamically (for example the IP of the vEthernet interface).

$Env:HOST_IP=<your host IP address>
$Env:SHARED_DIR="~/mlrun-data"
mkdir $Env:SHARED_DIR
docker-compose -f compose.yaml up -d

Your HOST_IP address can be found using the Get-NetIPConfiguration cmdlet (do not use localhost or 127.0.0.1). It is recommended to select an address that does not change dynamically (for example the IP of the vEthernet interface).

This creates 3 services: the MLRun API (http://localhost:8080), the MLRun UI (http://localhost:8060), and the Nuclio dashboard (http://localhost:8070).

After installing the MLRun service, set your client environment to work with the service by setting the MLRUN_DBPATH environment variable to http://localhost:8080, or by using .env files (see setting the client environment).
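For example, a minimal client-side setup (a sketch; the mlrun.env file name and location are your choice):

# Hedged sketch: point the local client at the docker-compose MLRun service.
# Either export MLRUN_DBPATH=http://localhost:8080 in the shell, or load an
# env file that contains the line MLRUN_DBPATH=http://localhost:8080:
import mlrun

mlrun.set_env_from_file("mlrun.env")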

Use MLRun with MLRun Jupyter image#

For the quickest experience with MLRun you can deploy MLRun with a pre-integrated Jupyter server loaded with various ready-to-use MLRun examples.

[Download here] the compose.with-jupyter.yaml file, save it to the working directory, and run:

services:
  init_nuclio:
    image: alpine:3.18
    command:
      - "/bin/sh"
      - "-c"
      - |
        mkdir -p /etc/nuclio/config/platform; \
        cat << EOF | tee /etc/nuclio/config/platform/platform.yaml
        runtime:
          common:
            env:
              MLRUN_DBPATH: http://${HOST_IP:?err}:8080
        local:
          defaultFunctionContainerNetworkName: mlrun
          defaultFunctionRestartPolicy:
            name: always
            maxRetryCount: 0
          defaultFunctionVolumes:
            - volume:
                name: mlrun-stuff
                hostPath:
                  path: ${SHARED_DIR:?err}
              volumeMount:
                name: mlrun-stuff
                mountPath: /home/jovyan/data/
        logger:
          sinks:
            myStdoutLoggerSink:
              kind: stdout
          system:
            - level: debug
              sink: myStdoutLoggerSink
          functions:
            - level: debug
              sink: myStdoutLoggerSink
        EOF
    volumes:
      - nuclio-platform-config:/etc/nuclio/config

  jupyter:
    image: "mlrun/jupyter:${TAG:-1.5.1}"
    ports:
      - "8080:8080"
      - "8888:8888"
    environment:
      MLRUN_ARTIFACT_PATH: "/home/jovyan/data/{{project}}"
      MLRUN_LOG_LEVEL: DEBUG
      MLRUN_NUCLIO_DASHBOARD_URL: http://nuclio:8070
      MLRUN_HTTPDB__DSN: "sqlite:////home/jovyan/data/mlrun.db?check_same_thread=false"
      MLRUN_UI__URL: http://localhost:8060
      # using local storage, meaning files / artifacts are stored locally, so we want to allow access to them
      MLRUN_HTTPDB__REAL_PATH: "/home/jovyan/data"
      # not running on k8s meaning no need to store secrets
      MLRUN_SECRET_STORES__KUBERNETES__AUTO_ADD_PROJECT_SECRETS: "false"
      # let mlrun control nuclio resources
      MLRUN_HTTPDB__PROJECTS__FOLLOWERS: "nuclio"
    volumes:
      - "${SHARED_DIR:?err}:/home/jovyan/data"
    networks:
      - mlrun

  mlrun-ui:
    image: "mlrun/mlrun-ui:${TAG:-1.5.1}"
    ports:
      - "8060:8090"
    environment:
      MLRUN_API_PROXY_URL: http://jupyter:8080
      MLRUN_NUCLIO_MODE: enable
      MLRUN_NUCLIO_API_URL: http://nuclio:8070
      MLRUN_NUCLIO_UI_URL: http://localhost:8070
    networks:
      - mlrun

  nuclio:
    image: "quay.io/nuclio/dashboard:${NUCLIO_TAG:-stable-amd64}"
    ports:
      - "8070:8070"
    environment:
      NUCLIO_DASHBOARD_EXTERNAL_IP_ADDRESSES: "${HOST_IP:?err}"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - nuclio-platform-config:/etc/nuclio/config
    depends_on:
      - init_nuclio
    networks:
      - mlrun

volumes:
  nuclio-platform-config: {}

networks:
  mlrun:
    name: mlrun
export HOST_IP=<your host IP address>
export SHARED_DIR=~/mlrun-data
mkdir -p $SHARED_DIR
docker-compose -f compose.with-jupyter.yaml up -d

Your HOST_IP address can be found using the ip addr or ifconfig commands (do not use localhost or 127.0.0.1). It is recommended to select an address that does not change dynamically (for example the IP of the bridge interface).

set HOST_IP=<your host IP address>
set SHARED_DIR=c:\mlrun-data
mkdir %SHARED_DIR%
docker-compose -f compose.with-jupyter.yaml up -d

Your HOST_IP address can be found using the ipconfig shell command (do not use localhost or 127.0.0.1). It is recommended to select an address that does not change dynamically (for example the IP of the vEthernet interface).

$Env:HOST_IP=<your host IP address>
$Env:SHARED_DIR="~/mlrun-data"
mkdir $Env:SHARED_DIR
docker-compose -f compose.with-jupyter.yaml up -d

Your HOST_IP address can be found using the Get-NetIPConfiguration cmdlet (do not use localhost or 127.0.0.1). It is recommended to select an address that does not change dynamically (for example the IP of the vEthernet interface).

This creates 4 services: the Jupyter server (http://localhost:8888), the MLRun API (http://localhost:8080, served from the Jupyter container), the MLRun UI (http://localhost:8060), and the Nuclio dashboard (http://localhost:8070).

After the installation, access the Jupyter server (at http://localhost:8888) and run through the quick-start tutorial and demos. You can see the projects, tasks, and artifacts in the MLRun UI (at http://localhost:8060).

The Jupyter environment is pre-configured to work with the local MLRun and Nuclio services. You can switch to a remote or managed MLRun cluster by editing the mlrun.env file in the Jupyter files tree.

The artifacts and DB are stored under /home/jovyan/data (/data in Jupyter tree).

Install MLRun on Kubernetes#

Note

These instructions install the community edition, which currently includes MLRun v1.4.0. See the release documentation.


Prerequisites#
  • Access to a Kubernetes cluster. To install MLRun on your cluster, you must have administrator permissions. MLRun fully supports k8s releases 1.22, 1.23, and 1.26. For local installation on Windows or Mac, Docker Desktop is recommended.

  • The Kubernetes command-line tool (kubectl) compatible with your Kubernetes cluster is installed. Refer to the kubectl installation instructions for more information.

  • Helm 3.6 CLI is installed. Refer to the Helm installation instructions for more information.

  • An accessible docker-registry (such as Docker Hub). The registry's URL and credentials are consumed by the applications via a pre-created secret.

  • Storage:

    • 8Gi

    • Set a default storage class for the kubernetes cluster, in order for the pods to have persistent storage. See the Kubernetes documentation for more information.

  • RAM: A minimum of 8Gi is required for running all the initial MLRun components. The amount of RAM required for running MLRun jobs depends on the job's requirements.

Note

The MLRun Community Edition resources are configured initially with the default cluster/namespace resource limits. You can modify these resources if needed.

Community Edition flavors#

The MLRun CE (Community Edition) includes the following components:

Installing the chart#

Note

These instructions use mlrun as the namespace (-n parameter). You can choose a different namespace in your kubernetes cluster.

Create a namespace for the deployed components:

kubectl create namespace mlrun

Add the Community Edition helm chart repo:

helm repo add mlrun-ce https://mlrun.github.io/ce

Run the following command to ensure that the repo is installed and available:

helm repo list

It should output something like:

NAME        URL
mlrun-ce    https://mlrun.github.io/ce

Update the repo to make sure you're getting the latest chart:

helm repo update

Create a secret with your docker-registry named registry-credentials:

kubectl --namespace mlrun create secret docker-registry registry-credentials \
    --docker-server <your-registry-server> \
    --docker-username <your-username> \
    --docker-password <your-password> \
    --docker-email <your-email>

Note: If using docker hub, the registry server is https://registry.hub.docker.com/. Refer to the Docker ID documentation for creating a user with login to configure in the secret.

Where:

  • <your-registry-server> is your Private Docker Registry FQDN. (https://index.docker.io/v1/ for Docker Hub).

  • <your-username> is your Docker username.

  • <your-password> is your Docker password.

  • <your-email> is your Docker email.

Note

First-time MLRun users experience a relatively longer installation time because all required images are pulled locally for the first time (it takes an average of 10-15 minutes, mostly depending on your internet speed).

To install the chart with the release name mlrun-ce use the following command. Note the reference to the pre-created registry-credentials secret in global.registry.secretName:

helm --namespace mlrun \
    install mlrun-ce \
    --wait \
    --timeout 960s \
    --set global.registry.url=<registry-url> \
    --set global.registry.secretName=registry-credentials \
    --set global.externalHostAddress=<host-machine-address> \
    mlrun-ce/mlrun-ce

Where:

  • <registry-url> is the registry URL that can be authenticated by the registry-credentials secret (e.g., index.docker.io/<your-username> for Docker Hub).

  • <host-machine-address> is the IP address of the host machine (or $(minikube ip) if using minikube).

When the installation is complete, the helm command prints the URLs and ports of all the MLRun CE services.

Note: There is currently a known issue with installing the chart on Macs using Apple silicon (M1/M2): the pipelines MySQL database fails to start. The workaround for now is to opt out of pipelines by installing the chart with the --set pipelines.enabled=false flag.

Configuring the online feature store#

The MLRun Community Edition now supports the online feature store. To enable it, you need to first deploy a Redis service that is accessible to your MLRun CE cluster. To deploy a Redis service, refer to the Redis documentation.

When you have a Redis service deployed, you can configure MLRun CE to use it by adding the following helm value configuration to your helm install command:

--set mlrun.api.extraEnvKeyValue.MLRUN_REDIS__URL=<redis-address>
Usage#

Your applications are now available in your local browser:

  • Jupyter Notebook - http://<host-machine-address>:30040

  • Nuclio - http://<host-machine-address>:30050

  • MLRun UI - http://<host-machine-address>:30060

  • MLRun API (external) - http://<host-machine-address>:30070

  • MinIO API - http://<host-machine-address>:30080

  • MinIO UI - http://<host-machine-address>:30090

  • Pipeline UI - http://<host-machine-address>:30100

  • Grafana UI - http://<host-machine-address>:30110

Check state

You can check the current state of the installation via the command kubectl -n mlrun get pods, where the main information is in the READY and STATUS columns. If all images have already been pulled locally, it typically takes a minute for all services to start.

Note

You can change the ports by providing values to the helm install command. You can add and configure a Kubernetes ingress-controller for better security and control over external access.

Start working#

Open Jupyter from the jupyter-notebook UI and run the code in the examples/mlrun_basics.ipynb notebook.

Important

Make sure to save your changes in the data folder within the Jupyter Lab. The root folder and any other folders do not retain the changes when you restart the Jupyter Lab.

Configuring the remote environment#

You can use your code on a local machine while running your functions on a remote cluster. Refer to Set up your environment for more information.

Advanced chart configuration#

Configurable values are documented in the values.yaml file, and in the values.yaml of each sub-chart. Override them using the standard Helm mechanisms (for example, --set flags or a custom values file).

Opt out of components#

The chart installs many components. You may not need them all in your deployment depending on your use cases. To opt out of some of the components, use the following helm values:

...
--set pipelines.enabled=false \
--set kube-prometheus-stack.enabled=false \
--set sparkOperator.enabled=false \
...
Installing on Docker Desktop#

If you are using Docker Desktop, you can install MLRun CE on your local machine. Docker Desktop is available for Mac and Windows. For download information, system requirements, and installation instructions, see the Docker Desktop documentation.

Configuring Docker Desktop#

Docker Desktop includes a standalone Kubernetes server and client, as well as Docker CLI integration that runs on your machine. The Kubernetes server runs locally within your Docker instance. To enable Kubernetes support and install a standalone instance of Kubernetes running as a Docker container, go to Preferences > Kubernetes and then press Enable Kubernetes. Press Apply & Restart to save the settings and then press Install to confirm. This instantiates the images that are required to run the Kubernetes server as containers, and installs the /usr/local/bin/kubectl command on your machine. For more information, see the Kubernetes documentation.

It's recommended to limit the amount of memory allocated to Kubernetes. If you're using Windows and WSL 2, you can configure global WSL options by placing a .wslconfig file into the root directory of your users folder: C:\Users\<yourUserName>\.wslconfig. Keep in mind that you might need to run wsl --shutdown to shut down the WSL 2 VM and then restart your WSL instance for these changes to take effect.

[wsl2]
memory=8GB # Limits VM memory in WSL 2 to 8 GB

To learn about the various UI options and their usage, see the Docker Desktop user manual.

Storage resources#

When installing the MLRun Community Edition, several storage resources are created:

  • PVs via the default configured storage class: Hold the file system of the stack's pods, including the MySQL database of MLRun, MinIO for artifact and pipeline storage, and more. These are not deleted when the stack is uninstalled, which allows upgrading without losing data.

  • Container Images in the configured docker-registry: When building and deploying MLRun and Nuclio functions via the MLRun Community Edition, the function images are stored in the given configured docker registry. These images persist in the docker registry and are not deleted.

Uninstalling the chart#

The following command deletes the pods, deployments, config maps, services and roles+role bindings associated with the chart and release.

helm --namespace mlrun uninstall mlrun-ce
Notes on dangling resources#
  • The created CRDs are not deleted by default and should be manually cleaned up.

  • The created PVs and PVCs are not deleted by default and should be manually cleaned up.

  • As stated above, the images in the docker registry are not deleted either and should be cleaned up manually.

  • If you installed the chart in its own namespace, it's also possible to delete the entire namespace to clean up all resources (apart from the docker registry images).

Note on terminating pods and hanging resources#

This chart creates several persistent volume claims (PVCs) that provide persistency out of the box. Upon uninstallation, any hanging or terminating pods keep holding the PVCs and PVs, preventing their safe removal. Because pods stuck in a terminating state are common in Kubernetes, remember to clean up the remaining PVs and PVCs afterwards.

Handling stuck-at-terminating pods:#
kubectl --namespace mlrun delete pod --force --grace-period=0 <pod-name>
Reclaim dangling persistency resources:#

WARNING

This will result in data loss!

# To list PVCs
$ kubectl --namespace mlrun get pvc
...

# To remove a PVC
$ kubectl --namespace mlrun delete pvc <pvc-name>
...

# To list PVs
$ kubectl --namespace mlrun get pv
...

# To remove a PV
$ kubectl --namespace mlrun delete pv <pv-name>
...
Upgrading the chart#

To upgrade to the latest version of the chart, first make sure you have the latest helm repo

helm repo update

Then try to upgrade the chart:

helm upgrade --install --reuse-values mlrun-ce --namespace mlrun mlrun-ce/mlrun-ce

If it fails, you should reinstall the chart:

  1. Remove the current mlrun-ce release, keeping its values:

mkdir ~/tmp
helm get values -n mlrun mlrun-ce > ~/tmp/mlrun-ce-values.yaml
helm uninstall -n mlrun mlrun-ce
  2. Reinstall mlrun-ce, reusing the saved values:

helm install -n mlrun --values ~/tmp/mlrun-ce-values.yaml mlrun-ce mlrun-ce/mlrun-ce --devel

Note

If your values file pins the MLRun service version (e.g. mlrun:1.3.0), you might want to remove it from the values file to allow newer chart defaults to kick in.

Storing artifacts in AWS S3 storage#

MLRun CE uses a Minio service as shared storage for artifacts, and accesses it using S3 protocol. This means that any path that begins with s3:// is automatically directed by MLRun to the Minio service. The default artifact path is also configured as s3://mlrun/projects/{{run.project}}/artifacts which is a path on the mlrun bucket in the Minio service.

To store artifacts in AWS S3 buckets instead of the local Minio service, these configurations need to be overridden to make s3:// paths lead to AWS buckets instead.

Note

These configurations are only required for AWS S3 storage, due to the usage of the same S3 protocol in Minio. For other storage options (such as GCS, Azure blobs etc.) only the artifact path needs to be modified, and credentials need to be provided.

Setting up S3 credentials and endpoint#

Set up the following project secrets (refer to Data stores and Project secrets) for any project that you use; a sketch of setting them via the SDK follows the list:

  • AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY — S3 credentials

  • S3_ENDPOINT_URL — the AWS S3 endpoint to use, depending on the region. For example:

    S3_ENDPOINT_URL = https://s3.us-east-2.amazonaws.com/
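A minimal sketch of setting these secrets via the SDK, assuming a loaded MLRun project object named project (the values shown are placeholders):

# Hedged sketch: store the S3 credentials and endpoint as project secrets
project.set_secrets(
    secrets={
        "AWS_ACCESS_KEY_ID": "<access-key-id>",
        "AWS_SECRET_ACCESS_KEY": "<secret-access-key>",
        "S3_ENDPOINT_URL": "https://s3.us-east-2.amazonaws.com/",
    }
)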
    
Disabling auto-mount#

Before running any MLRun job that writes to an S3 bucket, make sure auto-mount is disabled for it, since by default auto-mount adds S3 configurations that point at the Minio service (refer to Function storage for more details on auto-mount). This can be done in one of the following ways:

  • Set the client-side MLRun configuration to disable auto-mount. This disables auto-mount for any function run after this command:

    from mlrun.config import config as mlconf
    
    mlconf.storage.auto_mount_type = "none"
    
  • If running MLRun from an IDE, the configuration can be overridden using an environment variable. Set the following environment variable for your IDE environment:

    MLRUN_STORAGE__AUTO_MOUNT_TYPE = "none"
    
  • Disable auto-mount for a specific function. This must be done before running the function for the first time:

    function.spec.disable_auto_mount = True
    
Changing the artifact path#

The artifact path needs to be modified since the bucket name is set to mlrun by default. It is recommended to keep the same path structure as the default, while modifying the bucket name. For example:

s3://<bucket name>/projects/{{run.project}}/artifacts

The artifact path can be set in several ways, refer to Artifact path for more details.
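For example, a minimal sketch using the client-side environment setup (the bucket name is a placeholder; project- and function-level settings are alternatives described in the Artifact path docs):

# Hedged sketch: set the default artifact path for the client session
import mlrun

mlrun.set_environment(
    artifact_path="s3://<bucket name>/projects/{{run.project}}/artifacts"
)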


Install MLRun on AWS#

For AWS users, the easiest way to install MLRun is to use a native AWS deployment. This option deploys MLRun on an AWS EKS service using a CloudFormation stack.

Note

These instructions install the community edition, which currently includes MLRun v1.4.0. See the release documentation.


Prerequisites#
  1. An AWS account with permissions that include the ability to:

    • Run a CloudFormation stack

    • Create an EKS cluster

    • Create EC2 instances

    • Create VPC

    • Create S3 buckets

    • Deploy and pull images from ECR

    For the full set of required permissions, download the IAM policy or expand & copy the IAM policy below:

    The IAM policy:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "BasicServices",
                "Effect": "Allow",
                "Action": [
                    "autoscaling:*",
                    "cloudwatch:*",
                    "elasticloadbalancing:*",
                    "sns:*",
                    "ec2:*",
                    "s3:*",
                    "s3-object-lambda:*",
                    "eks:*",
                    "elasticfilesystem:*",
                    "cloudformation:*",
                    "acm:*",
                    "route53:*"
                ],
                "Resource": "*"
            },
            {
                "Sid": "ServiceLinkedRoles",
                "Effect": "Allow",
                "Action": "iam:CreateServiceLinkedRole",
                "Resource": "*",
                "Condition": {
                    "StringEquals": {
                        "iam:AWSServiceName": [
                            "autoscaling.amazonaws.com",
                            "ec2scheduled.amazonaws.com",
                            "elasticloadbalancing.amazonaws.com",
                            "spot.amazonaws.com",
                            "spotfleet.amazonaws.com",
                            "transitgateway.amazonaws.com"
                        ]
                    }
                }
            },
            {
                "Sid": "IAMPermissions",
                "Effect": "Allow",
                "Action": [
                    "iam:AddRoleToInstanceProfile",
                    "iam:AttachRolePolicy",
                    "iam:TagOpenIDConnectProvider",
                    "iam:CreateInstanceProfile",
                    "iam:CreateOpenIDConnectProvider",
                    "iam:CreateRole",
                    "iam:CreateServiceLinkedRole",
                    "iam:DeleteInstanceProfile",
                    "iam:DeleteOpenIDConnectProvider",
                    "iam:DeleteRole",
                    "iam:DeleteRolePolicy",
                    "iam:DetachRolePolicy",
                    "iam:GenerateServiceLastAccessedDetails",
                    "iam:GetAccessKeyLastUsed",
                    "iam:GetAccountPasswordPolicy",
                    "iam:GetAccountSummary",
                    "iam:GetGroup",
                    "iam:GetInstanceProfile",
                    "iam:GetLoginProfile",
                    "iam:GetOpenIDConnectProvider",
                    "iam:GetPolicy",
                    "iam:GetPolicyVersion",
                    "iam:GetRole",
                    "iam:GetRolePolicy",
                    "iam:GetServiceLastAccessedDetails",
                    "iam:GetUser",
                    "iam:ListAccessKeys",
                    "iam:ListAccountAliases",
                    "iam:ListAttachedGroupPolicies",
                    "iam:ListAttachedRolePolicies",
                    "iam:ListAttachedUserPolicies",
                    "iam:ListGroupPolicies",
                    "iam:ListGroups",
                    "iam:ListGroupsForUser",
                    "iam:ListInstanceProfilesForRole",
                    "iam:ListMFADevices",
                    "iam:ListOpenIDConnectProviders",
                    "iam:ListPolicies",
                    "iam:ListPoliciesGrantingServiceAccess",
                    "iam:ListRolePolicies",
                    "iam:ListRoles",
                    "iam:ListRoleTags",
                    "iam:ListSAMLProviders",
                    "iam:ListSigningCertificates",
                    "iam:ListUserPolicies",
                    "iam:ListUsers",
                    "iam:ListUserTags",
                    "iam:PassRole",
                    "iam:PutRolePolicy",
                    "iam:RemoveRoleFromInstanceProfile",
                    "kms:CreateGrant",
                    "kms:CreateKey",
                    "kms:Decrypt",
                    "kms:DescribeKey",
                    "kms:Encrypt",
                    "kms:GenerateDataKeyWithoutPlaintext",
                    "kms:GetKeyPolicy",
                    "kms:GetKeyRotationStatus",
                    "kms:ListResourceTags",
                    "kms:PutKeyPolicy",
                    "kms:ScheduleKeyDeletion",
                    "kms:TagResource"
                ],
                "Resource": "*"
            },
            {
                "Sid": "AllowLanbda",
                "Effect": "Allow",
                "Action": [
                    "lambda:CreateAlias",
                    "lambda:CreateCodeSigningConfig",
                    "lambda:CreateEventSourceMapping",
                    "lambda:CreateFunction",
                    "lambda:CreateFunctionUrlConfig",
                    "lambda:Delete*",
                    "lambda:Get*",
                    "lambda:InvokeAsync",
                    "lambda:InvokeFunction",
                    "lambda:InvokeFunctionUrl",
                    "lambda:List*",
                    "lambda:PublishLayerVersion",
                    "lambda:PublishVersion",
                    "lambda:PutFunctionCodeSigningConfig",
                    "lambda:PutFunctionConcurrency",
                    "lambda:PutFunctionEventInvokeConfig",
                    "lambda:PutProvisionedConcurrencyConfig",
                    "lambda:TagResource",
                    "lambda:UntagResource",
                    "lambda:UpdateAlias",
                    "lambda:UpdateCodeSigningConfig",
                    "lambda:UpdateEventSourceMapping",
                    "lambda:UpdateFunctionCode",
                    "lambda:UpdateFunctionCodeSigningConfig",
                    "lambda:UpdateFunctionConfiguration",
                    "lambda:UpdateFunctionEventInvokeConfig",
                    "lambda:UpdateFunctionUrlConfig"
                ],
                "Resource": "*"
            },
            {
                "Sid": "CertificateService",
                "Effect": "Allow",
                "Action": "iam:CreateServiceLinkedRole",
                "Resource": "arn:aws:iam::*:role/aws-service-role/acm.amazonaws.com/AWSServiceRoleForCertificateManager*",
                "Condition": {
                    "StringEquals": {
                        "iam:AWSServiceName": "acm.amazonaws.com"
                    }
                }
            },
            {
                "Sid": "DeleteRole",
                "Effect": "Allow",
                "Action": [
                    "iam:DeleteServiceLinkedRole",
                    "iam:GetServiceLinkedRoleDeletionStatus",
                    "iam:GetRole"
                ],
                "Resource": "arn:aws:iam::*:role/aws-service-role/acm.amazonaws.com/AWSServiceRoleForCertificateManager*"
            },
            {
                "Sid": "SSM",
                "Effect": "Allow",
                "Action": [
                    "logs:*",
                    "ssm:AddTagsToResource",
                    "ssm:GetParameter",
                    "ssm:DeleteParameter",
                    "ssm:PutParameter",
                    "cloudtrail:GetTrail",
                    "cloudtrail:ListTrails"
                ],
                "Resource": "*"
            }
        ]
    }
    

    For more information, see how to create a new AWS account and policies and permissions in IAM.

  2. A Route53 domain configured in the same AWS account, and with the full domain name specified in Route 53 hosted DNS domain configuration (See Step 11 below). External domain registration is currently not supported. For more information see What is Amazon Route 53?.

Notes

The MLRun software is free of charge, however, there is a cost for the AWS infrastructure services such as EKS, EC2, S3 and ECR. The actual pricing depends on a large set of factors including, for example, the region, the number of EC2 instances, the amount of storage consumed, and the data transfer costs. Other factors include, for example, reserved instance configuration, saving plan, and AWS credits you have associated with your account. It is recommended to use the AWS pricing calculator to calculate the expected cost, as well as the AWS Cost Explorer to manage the cost, monitor, and set-up alerts.

Post deployment expectations#

The key components deployed on your EKS cluster are:

  • MLRun server (including the feature store and the MLRun graph)

  • MLRun UI

  • Kubeflow pipeline

  • Real time serverless framework (Nuclio)

  • Spark operator

  • Jupyter lab

  • Grafana

Configuration settings#

Make sure you are logged in to the correct AWS account.

Click the button below to deploy MLRun.

(AWS Launch Stack button)

After clicking the icon, the browser directs you to the CloudFormation stack page in your AWS account, or redirects you to the AWS login page if you are not currently logged in.

Note

You must fill in fields marked as mandatory (m) for the configuration to complete. Fields marked as optional (o) can be left blank.

  1. Stack name (m) — the name of the stack. You cannot continue if left blank. This field becomes the logical id of the stack. Stack name can include letters (A-Z and a-z), numbers (0-9), and dashes (-). For example: "John-1".

Parameters

  2. EKS cluster name (m) — the name of the EKS cluster that is created. The EKS cluster is used to run the MLRun services. For example: "John-1".

VPC network Configuration

  3. Number of Availability Zones (m) — The default is set to 3. Choose from the dropdown to change the number. The minimum is 2.

  4. Availability zones (m) — select a zone from the dropdown. The list is based on the region of the instance. The number of zones must match the value of Number of Availability Zones.

  5. Allowed external access CIDR (m) — range of IP addresses allowed to access the cluster. Addresses that are not in this range are not able to access the cluster. Contact your IT manager/network administrator if you are not sure what to fill in here.

Amazon EKS configuration

  6. Additional EKS admin ARN (IAM user) (o) — add an additional admin user to the instance. Users can be added after the stack has been created. For more information see Create a kubeconfig for Amazon EKS.

  7. Instance type (m) — select from the dropdown list. The default is m5.4xlarge. For size considerations see Amazon EC2 Instance Types.

  8. Maximum Number of Nodes (m) — maximum number of nodes in the cluster. The number of nodes combined with the Instance type determines the AWS infrastructure cost.

Amazon EC2 configuration

  9. SSH key name (o) — To access the EC2 instance via SSH, enter an existing key. If left empty, it is possible to access the EC2 instance using the AWS Systems Manager Session Manager. For more information about SSH Keys see Amazon EC2 key pairs and Linux instances.

  10. Provision bastion host (m) — create a bastion host for SSH access to the Kubernetes nodes. The default is enabled. This allows SSH access to your EKS EC2 instances through a public IP.

Iguazio MLRun configuration

  11. Route 53 hosted DNS domain (m) — Enter the name of your registered Route53 domain. Only route53 domains are acceptable.

  12. The URL of your REDIS database (o) — This is only required if you're using Redis with the online feature store. See how to configure the online feature store for more details.

Other parameters

  13. MLRun CE Helm Chart version (m) — the MLRun Community Edition version to install. Leave the default value for the latest CE release.

Capabilities

  14. Check all the capabilities boxes (m).

Press Create Stack to continue the deployment. The stack creates a VPC with an EKS cluster and deploys all the services on top of it.

Note

It could take up to 2 hours for your stack to be created.

Getting started#

When the stack is complete, go to the output tab for the stack you created. There are links for the MLRun UI, Jupyter, and the Kubeconfig command.

It's recommended to go through the quick-start and the other tutorials in the documentation. These tutorials and demos come built-in with Jupyter under the root folder of Jupyter.

Storage resources#

When installing the MLRun Community Edition via Cloud Formation, several storage resources are created:

  • PVs via AWS storage provider: Used to hold the file system of the stacks pods, including the MySQL database of MLRun. These are deleted when the stack is uninstalled.

  • S3 Bucket: A bucket named <EKS cluster name>-<Random string> is created in the AWS account that installs the stack (where <EKS cluster name> is the name of the EKS cluster you chose and <Random string> is part of the CloudFormation stack ID). You can see the bucket name in the output tab of the stack. The bucket is used for MLRun’s artifact storage, and is not deleted when uninstalling the stack. The user must empty the bucket and delete it.

  • Container Images in ECR: When building and deploying MLRun and Nuclio functions via the MLRun Community Edition, the function images are stored in an ECR belonging to the AWS account that installs the stack. These images persist in the account’s ECR and are not deleted either.

Configuring the online feature store#

The feature store can store data on a fast key-value database table for quick serving. This online feature store capability requires an external key-value database.

Currently the MLRun feature store supports the following options:

  • Redis

  • Iguazio key-value database

To use Redis, you must install Redis separately and provide the Redis URL when configuring the AWS CloudFormation stack. Refer to the Redis getting-started page for information about Redis installation.
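
The following is a minimal sketch of pointing a feature set at a Redis online target, assuming MLRun's RedisNoSqlTarget and a reachable Redis endpoint; the feature set definition and the URL are placeholders:

import mlrun.feature_store as fstore
from mlrun.datastore.targets import RedisNoSqlTarget

# placeholder feature set keyed by "ticker"; replace with your own definition
stocks_set = fstore.FeatureSet("stocks", entities=[fstore.Entity("ticker")])

# write the online copy of the data to the external Redis database (placeholder URL)
stocks_set.set_targets(
    targets=[RedisNoSqlTarget(path="redis://<redis-host>:6379")],
    with_defaults=False,
)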

Streaming support#

For online serving, it is often convenient to use MLRun graph with a streaming engine. This allows managing queues between steps and functions. MLRun supports Kafka streams as well as Iguazio V3IO streams. See the examples on how to configure the MLRun serving graph with Kafka and V3IO.

Cleanup#

To free up the resources used by MLRun, delete the CloudFormation stack (this removes the EKS cluster and the resources created with it).

You may also need to clean up any external storage that you used, such as the S3 bucket and the ECR images described above.

Set up your environment #

You can write your code on a local machine while running your functions on a remote cluster. This tutorial explains how to set this up.

This release of MLRun supports only Python 3.9 for both the server and the client.

In this section

Prerequisites#

Before you begin, ensure that the following prerequisites are met:

Applications:

  • Python 3.9

  • Recommended pip 22.x+

The MLRun server is based on a Python 3.9 environment. It's recommended to move the client to a Python 3.9 environment as well.

For a Python 3.7 environment for platform versions up to and including v3.5.2, see Set up a Python 3.7 client environment.

MLRun client supported OS#

The MLRun client supports:

  • Linux

  • Mac

  • Windows via WSL

Set up a Python 3.9 client environment#
  1. Basic
    Run pip install mlrun
    This installs MLRun locally with the default requirements (as listed in MLRun's requirements.txt).

Note

To install a specific version, use the command: pip install mlrun==<version>. Replace the <version> placeholder with the MLRun version number.

  1. Advanced

    • If you expect to connect to, or work with, cloud providers (Azure/Google Cloud/S3), you can install additional packages. This is not part of the regular requirements since not all users work with those platforms. Using this option reduces the dependencies and the size of the installation. The additional packages include:

      • pip install mlrun[s3] Install requirements for S3

      • pip install mlrun[azure-blob-storage] Install requirements for Azure blob storage

      • pip install mlrun[google-cloud-storage] Install requirements for Google cloud storage

    • To install all extras, run: pip install mlrun[complete] See the full list here.

  2. Alternatively, if you already installed a previous version of MLRun, upgrade it by running:

    pip install -U mlrun==<version>
    
  3. Ensure that you have remote access to your MLRun service (i.e., to the service URL on the remote Kubernetes cluster).

  4. When installing other python packages on top of MLRun, make sure to install them with mlrun in the same command/requirement file to avoid version conflicts. For example:

    pip install mlrun <other-package>
    

    or

    pip install -r requirements.txt
    

    where requirements.txt contains:

    mlrun
    <other-package>
    

    Do so even if you already have MLRun installed so that pip will take MLRun requirements into consideration when installing the other package.

Configure remote environment#

You have a few options to configure your remote environment:

Using mlrun config set command in MLRun CLI#

Example 1
Run this command in MLRun CLI:

mlrun config set -a http://localhost:8080

It creates the following environment file:

# this is an env file
MLRUN_DBPATH=http://localhost:8080

MLRUN_DBPATH stores the URL of the MLRun API service. Since it points to localhost, a username and access key are not required (unlike Example 2).

Example 2
Note: Only relevant if your remote service is on an instance of the Iguazio MLOps Platform (not MLRun CE).
Run this command in MLRun CLI:

mlrun config set -a https://mlrun-api.default-tenant.app.xxx.iguazio-cd1.com -u joe -k mykey -e 

It creates the following environment file:

# this is another env file
V3IO_USERNAME=joe
V3IO_ACCESS_KEY=mykey
MLRUN_DBPATH=https://mlrun-api.default-tenant.app.xxx.iguazio-cd1.com

V3IO_USERNAME saves the username of a platform user with access to the MLRun service. V3IO_ACCESS_KEY saves the platform access key.

You can get the platform access key from the platform dashboard: select the user-profile picture or icon from the top right corner of any page, and select Access Keys from the menu. In the Access Keys window, either copy an existing access key or create a new key and copy it. Alternatively, you can get the access key by checking the value of the V3IO_ACCESS_KEY environment variable in a web shell or Jupyter Notebook service.

Note

If the MLRUN_DBPATH points to a remote Iguazio cluster and the V3IO_API and/or V3IO_FRAMESD variables are not set, they are inferred from the DBPATH.

Explanation:

The mlrun config set command sets configuration parameters in the MLRun default environment file, or in a specified environment file. By default, it stores all of the configuration in the default environment file, which is created at ~/.mlrun.env, so your own environment file does not need editing.

The set command can work with the following parameters:

  • --env-file or -f to set the url path to the mlrun environment file

  • --api or -a to set the url (local or remote) for MLRun API

  • --username or -u to set the username

  • --access-key or -k to set the access key

  • --artifact-path or -p to set the artifact path

  • --env-vars or -e to set additional environment variables, e.g. -e ENV_NAME=<value>

Using mlrun.set_environment command in MLRun SDK#

You can set the environment using the mlrun.set_environment command in the MLRun SDK. Either use the env_file parameter, which holds the path/URL of the .env file (containing the MLRun config and other environment variables), or pass the configuration directly as arguments (without loading an environment file), for example:

# Use local service
mlrun.set_environment("http://localhost:8080", artifact_path="./")
# Use remote service
mlrun.set_environment("<remote-service-url>", access_key="xyz", username="joe")

For more explanations read the documentation mlrun.set_environment.

Using your IDE (e.g. PyCharm or VSCode)#

Use these procedures to access MLRun remotely from your IDE. These instructions are for PyCharm and VSCode.

Create environment file#

Create an environment file called mlrun.env in your workspace folder. Copy-paste the configuration below:

# Remote URL to mlrun service
MLRUN_DBPATH=<API endpoint of the MLRun APIs service endpoint; e.g., "https://mlrun-api.default-tenant.app.mycluster.iguazio.com">
# Iguazio platform username
V3IO_USERNAME=<username of a platform user with access to the MLRun service>
# Iguazio V3IO data layer credentials (copy from your user settings)
V3IO_ACCESS_KEY=<platform access key>

Note

If your remote service is on an instance of the Iguazio MLOps Platform, you can get all these parameters from the platform dashboard: select the user-profile picture or icon from the top right corner of any page, and select Remote settings. The parameters are copied to the clipboard.

Note

Make sure that you add the environment file (for example, mlrun.env) to your .gitignore file. The environment file contains sensitive information that you should not store in your source control.

Remote environment from PyCharm#

You can use PyCharm with MLRun remote by changing the environment variables configuration.

  1. From the main menu, choose Run | Edit Configurations.

    Edit configurations

  2. To set-up default values for all Python configurations, on the left-hand pane of the run/debug configuration dialog, expand the Templates node and select the Python node. The corresponding configuration template appears in the right-hand pane. Alternatively, you can edit a specific file configuration by choosing the corresponding file on the left-hand pane. Choose the Environment Variables edit box and expand it to edit the environment variables.

    Edit configuration screen

  3. Add the environment variable and value of MLRUN_DBPATH.

    Environment variables

    If the remote service is on an instance of the Iguazio MLOps Platform, also set the environment variables and values of V3IO_USERNAME, and V3IO_ACCESS_KEY.

Remote environment from VSCode#

Create a debug configuration in VSCode. Configurations are defined in a launch.json file that's stored in a .vscode folder in your workspace.

To initialize debug configurations, first select the Run view in the sidebar:

run-icon

If you don't yet have any configurations defined, you'll see a button to Run and Debug, as well as a link to create a configuration (launch.json) file:

debug-toolbar

To generate a launch.json file with Python configurations:

  1. Click the create a launch.json file link (circled in the image above) or use the Run > Open configurations menu command.

  2. A configuration menu opens from the Command Palette. Select the type of debug configuration you want for the opened file: in the Select a debug configuration menu that appears, select Python File.

    Debug configurations menu

Note

Starting a debugging session through the Debug Panel, F5 or Run > Start Debugging, when no configuration exists also brings up the debug configuration menu, but does not create a launch.json file.

  3. The Python extension then creates and opens a launch.json file that contains a pre-defined configuration based on what you previously selected, in this case Python File. You can modify configurations (to add arguments, for example), and also add custom configurations.

    Configuration json

Set environment file in debug configuration#

Add an envFile setting to your configuration, with the value ${workspaceFolder}/mlrun.env.

If you created a new configuration in the previous step, your launch.json would look as follows:

{
    // Use IntelliSense to learn about possible attributes.
    // Hover to view descriptions of existing attributes.
    // For more information, visit: https://go.microsoft.com/fwlink/?linkid=830387
    "version": "0.2.0",
    "configurations": [
        {
            "name": "Python: Current File",
            "type": "python",
            "request": "launch",
            "program": "${file}",
            "console": "integratedTerminal",
            "envFile": "${workspaceFolder}/mlrun.env"
        }
    ]
}
Setting up a dark site#

Use this procedure for the MLRun package, and any other packages you want to install on a dark site.

To install a package in a dark (air-gapped) site:

  1. Download the packages: conda==23.1.0, pip.

  2. Pack the conda package file and upload it to your dark system.

  3. Install the tar.gz by running:

    conda install -y <package-filename>.tar.gz 
    

Projects and automation#

MLRun Project is a container for all your work on a particular ML application. Projects host functions, workflows, artifacts (datasets, models, etc.), features (sets, vectors), and configuration (parameters, secrets, source, etc.). Projects have owners and members with role-based access control.

mlrun-project


Projects are stored in a GIT or archive and map to IDE projects (in PyCharm, VSCode, etc.), which enables versioning, collaboration, and CI/CD. Projects simplify how you process data, submit jobs, run multi-stage workflows, and deploy real-time pipelines in continuous development or production environments.

project-lifecycle


In this section

Create, save, and use projects#

A project is a container for all the assets, configuration, and code of a particular application. It is the starting point for your work. Projects are stored in a versioned source repository (GIT) or archive and can map to IDE projects (in PyCharm, VSCode, etc.).

mlrun-project


In this section

Creating a project#

Project files (code, configuration, etc.) are stored in a directory (the project context path) and can be pushed to, or loaded from, the source repository. See the following project directory example:

my-project           # Parent directory of the project (context)
├── data             # Project data for local tests or outputs (not tracked by version control)
├── docs             # Project documentation
├── src              # Project source code (functions, libs, workflows)
├── tests            # Unit tests (pytest) for the different functions
├── project.yaml     # MLRun project spec file
├── README.md        # Project README
└── requirements.txt # Default Python requirements file (can have function specific requirements as well)

To define a new project from scratch, use new_project(). You must specify a name. The context dir holds the configuration, code, and workflow files. Its default value is "./", which is the directory the MLRun client runs from. File paths in the project are relative to the context root. There are additional, optional parameters. The user_project flag indicates that the project name is unique per user, and the init_git flag is used to initialize git in the context dir.

import mlrun
project = mlrun.new_project("myproj", "./", user_project=True, 
                            init_git=True, description="my new project")

Projects can also be created from a template (yaml file, zip file, or git repo), allowing users to create reusable skeletons. The content of the zip/tar/git archive is copied into the context dir. The remote attribute can be used to register a remote git repository URL.

Example of creating a new project from a zip template:

# create a project from zip, initialize a local git, and register the git remote path
project = mlrun.new_project("myproj", "./", init_git=True, user_project=True,
                            remote="git://github.com/myorg/some-project.git",
                            from_template="http://mysite/proj.zip")

Adding functions, artifacts, workflow, and config#

Projects host functions, workflows, artifacts (files, datasets, models, etc.), features, and configuration (parameters, secrets, source, etc.). This section explains how to add or register different project elements. For details on the feature store and its elements (sets, vectors) see the feature store documentation.

Adding and registering functions:

Functions with basic attributes such as code, requirements, image, etc. can be registered using the set_function() method. Functions can be created from a single code/notebook file or have access to the entire project context directory. (By adding the with_repo=True flag, the project context is cloned into the function runtime environment.) See the examples:

# register a (single) python file as a function
project.set_function('src/data_prep.py', 'data-prep', image='mlrun/mlrun', handler='prep', kind="job")

# register a notebook file as a function, specify custom image and extra requirements 
project.set_function('src/mynb.ipynb', name='test-function', image="my-org/my-image",
                      handler="run_test", requirements="requirements.txt", kind="job")

# register a module.handler as a function (requires defining the default sources/work dir, if it's not root)
project.spec.workdir = "src"
project.set_function(name="train", handler="training.train",  image="mlrun/mlrun", kind="job", with_repo=True)

See details and examples on how to create and register functions, how to annotate notebooks (to be used as functions), how to run, build, or deploy functions, and how to use them in workflows.

Register artifacts:

Artifacts are used by functions and workflows and are referenced by a key (name) and optional tag (version). Users can define artifact files or objects in the project spec, which are registered during project load or when calling project.register_artifacts(). To register artifacts use the set_artifact() method. See the examples:

# register a simple file artifact in the project (point to remote object)  
data_url = 'https://s3.wasabisys.com/iguazio/data/iris/iris.data.raw.csv'
project.set_artifact('data', target_path=data_url)

# register a model artifact
project.set_artifact('model', ModelArtifact(model_file="model.pkl"), target_path=model_dir_url)

# register local or remote artifact object (yaml or zip), will be imported on project load
# to generate such a package use `artifact.export(zip_path)`
project.set_artifact('model', 'https://mystuff.com/models/mymodel.zip')

Note

Local file paths are relative to the context dir.

Registering workflows:

Projects contain one or more workflows (pipelines). The workflows can be registered using the set_workflow() method. Project workflows are executed using the run() method. See building and running workflows for details.

# Add a multi-stage workflow (./myflow.py) to the project with the name 'main' and save the project 
project.set_workflow('main', "./src/workflow.py")

Set project wide parameters and secrets:

You can define global project parameters and secrets and use them in your functions enabling simple configuration and templates. See the examples:

# Read env vars from dict or file and set as project secrets
project.set_secrets({"SECRET1": "value"})
project.set_secrets(file_path="secrets.env")

project.spec.params = {"x": 5}

Note

Secrets are not loaded automatically (not part of the project.yaml); you need to apply set_secrets() methods manually or use the UI.

Project parameters, secrets, and configuration can also be set in the UI: in the relevant project, click the settings button at the bottom-left corner.

Example, secrets configuration screen:

project-secrets

Save the project:

Use the save() method to store all the definitions (functions, artifacts, workflows, parameters, etc.) in the MLRun DB and in the project.yaml file (for automated loading and CI/CD).

project.save()

The generated project.yaml for the above project looks like:

kind: project
metadata:
  name: myproj
spec:
  description: my new project
  params:
    x: 5
  functions:
  - url: src/data_prep.py
    name: data-prep
    image: mlrun/mlrun
    handler: prep
  - url: src/mynb.ipynb
    name: test-function
    kind: job
    image: my-org/my-image
    handler: run_test
    requirements: requirements.txt
  - name: train
    kind: job
    image: mlrun/mlrun
    handler: training.train
    with_repo: true
  workflows:
  - path: ./src/workflow.py
    name: main
  artifacts:
  - kind: artifact
    metadata:
      project: myproj
      key: data
    spec:
      target_path: https://s3.wasabisys.com/iguazio/data/iris/iris.data.raw.csv
  source: ''
  workdir: src

Pushing the project content into git or an archive#

Project code, metadata, and configuration are stored and versioned in source control systems like GIT or archives (zip, tar). This allows loading an entire project (with a specific version) into a development or production environment, or seamlessly integrating with CI/CD frameworks.

project-lifecycle


Note

You must push the updates before you build functions or run workflows that use code from git, since the builder or containers pull the code from the git repo.

Use standard Git commands to push the current project tree into a git archive. Make sure you .save() the project before pushing it.

git remote add origin <server>
git commit -m "Commit message"
git push origin master

Alternatively, you can use MLRun SDK calls:

  • create_remote() - to register the remote Git path

  • push() - save project spec (project.yaml) and commit/push updates to remote repo
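
For example, a minimal sketch using these SDK calls (the repository URL and branch are placeholders):

# register the remote Git repository and push the current project state
project.create_remote("git://github.com/<org>/<repo>.git", name="origin")
project.push(branch="main", message="update project spec")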

Note

If you are using containerized Jupyter you might need to first set your Git parameters, e.g. using the following commands and run git push from the terminal once to store your credentials:

git config --global user.email "<my@email.com>"
git config --global user.name "<name>"
git config --global credential.helper store

You can also save the project content and metadata into a local or remote .zip archive, for example:

project.export("../archive1.zip")
project.export("s3://my-bucket/archive1.zip")
project.export(f"v3io://projects/{project.name}/archive1.zip")

Get a project from DB or create it#

If you already have a project saved in the DB and you need to access/use it (for example, from a different notebook or file), use the get_or_create_project() method. It first tries to read the project from the DB, and only if it doesn't exist in the DB does it load/create it.

Note

If you update the project object from different files/notebooks/users, make sure you .save() your project after a change, and run get_or_create_project to load changes made by others.

Example:

    # load project from the DB (if exists) or the source repo
    project = mlrun.get_or_create_project("myproj", "./", "git://github.com/mlrun/demo-xgb-project.git")
    project.pull("development")  # pull the latest code from git
    project.run("main", arguments={'data': data_url})  # run the workflow "main"

Deleting a project#

See delete_project.

Git best practices#

This section provides an overview of developing and deploying ML applications using MLRun and Git. It covers the following:

Note

This section assumes basic familiarity with version control software such as GitHub, GitLab, etc. If you're new to Git and version control, see the GitHub Hello World documentation.

See also

MLRun and Git Overview#

As a best practice, your MLRun project should be backed by a Git repo. This allows you to keep track of your code in source control as well as utilize your entire code library within your MLRun functions.

The typical lifecycle of a project is as follows:

Many people like to develop locally on their laptops, Jupyter environments, or local IDE before submitting the code to Git and running on the larger cluster. See Set up your client environment for more details.

Loading the code from container vs. loading the code at runtime#

MLRun supports two approaches to loading the code from Git:

  • Loading the code from container (default behavior)
    Before using this option, you must build the function with the build_function method. The image for the MLRun function is built once, and consumes the code in the repo. This is the preferred approach for production workloads. For example:

project.set_source(source="git://github.com/mlrun/project-archive.git")

fn = project.set_function(
    name="myjob", handler="job_func.job_handler",
    image="mlrun/mlrun", kind="job", with_repo=True,
)

project.build_function(fn)
  • Loading the code at runtime
    The MLRun function pulls the source code directly from Git at runtime. This is a simpler approach during development that allows for making code changes without re-building the image each time. For example:

project.set_source(source="git://github.com/mlrun/project-archive.git", pull_at_runtime=True)

fn = project.set_function(
    name="nuclio", handler="nuclio_func:nuclio_handler",
    image="mlrun/mlrun", kind="nuclio", with_repo=True,
)

Common tasks#

Setting up a new MLRun project repo#
  1. Initialize your repo using the command line as per this guide or using your version control software of choice (e.g. GitHub, GitLab, etc.).

git init ...
git add ...
git commit -m ...
git remote add origin ...
git branch -M <BRANCH>
git push -u origin <BRANCH>
  2. Clone the repo to the local environment where the MLRun client is installed (e.g. Jupyter, VSCode, etc.) and navigate to the repo.

Note

It is assumed that your local environment has the required access to pull a private repo.

git clone <MY_REPO>
cd <MY_REPO>
  3. Initialize a new MLRun project with the context pointing to your newly cloned repo.

import mlrun

project = mlrun.get_or_create_project(name="my-super-cool-project", context="./")
  4. Set the MLRun project source with the desired pull_at_runtime behavior (see Loading the code from container vs. loading the code at runtime for more info). Also set GIT_TOKEN in MLRun project secrets for working with private repos.

# Notice the prefix has been changed to git://
project.set_source(source="git://github.com/mlrun/project-archive.git", pull_at_runtime=True)
project.set_secrets(secrets={"GIT_TOKEN" : "XXXXXXXXXXXXXXX"}, provider="kubernetes")
  5. Register any MLRun functions or workflows and save. Make sure with_repo is True in order to add source code to the function.

project.set_function(name='train_model', func='train_model.py', kind='job', image='mlrun/mlrun', with_repo=True)
project.set_workflow(name='training_pipeline', workflow_path='training_pipeline.py')
project.save()
  6. Push additions to Git.

git add ...
git commit -m ...
git push ...
  7. Run the MLRun function/workflow. The source code is added to the function and is available via imports as expected.

project.run_function(function="train_model")
project.run(name="training_pipeline")
Running an existing MLRun project repo#
  1. Clone an existing MLRun project repo to your local environment where the MLRun client is installed (e.g. Jupyter, VSCode, etc.) and navigate to the repo.

git clone <MY_REPO>
cd <MY_REPO>
  2. Load the MLRun project with the context pointing to your newly cloned repo. MLRun looks for a project.yaml file in the root of the repo.

project = mlrun.load_project(context="./")
  3. Optionally enable pull_at_runtime for easier development. Also set GIT_TOKEN in the MLRun Project secrets for working with private repos.

# source=None will use current Git source
project.set_source(source=None, pull_at_runtime=True)
project.set_secrets(secrets={"GIT_TOKEN" : "XXXXXXXXXXXXXXX"}, provider="kubernetes")
  4. Run the MLRun function/workflow. The source code is added to the function and is available via imports as expected.

project.run_function(function="train_model")
project.run(name="training_pipeline")

Note

If another user previously ran the project in your MLRun environment, ensure that your user has project permissions (otherwise you may not be able to view or run the project).

Pushing changes to the MLRun project repo#
  1. Edit the source code/functions/workflows in some way.

  2. Check-in changes to Git.

git add ...
git commit -m ...
git push ...
  3. If pull_at_runtime=False, re-build the Docker image. If pull_at_runtime=True, skip this step.

import mlrun

project = mlrun.load_project(context="./")
project.build_function("my_updated_function")
  4. Run the MLRun function/workflow. The source code with changes is added to the function and is available via imports as expected.

project.run_function(function="train_model")
project.run(name="training_pipeline")
Utilizing different branches#
  1. Check out the desired branch in the local environment.

git checkout <BRANCH>
  2. Update the desired branch in the MLRun project. Optionally, save the project if the branch should be used for future runs.

project.set_source(
    source="git://github.com/igz-us-sales/mlrun-git-example.git#spanish",
    pull_at_runtime=True
)
project.save()
  3. Run the MLRun function/workflow. The source code from the desired branch is added to the function and is available via imports as expected.

project.run_function("greetings")

Load projects#

Project code, metadata, and configuration are stored and versioned in source control systems like Git or archives (zip, tar) and can be loaded into your work environment or CI system with a single SDK or CLI command.

project-lifecycle


The project root (context) directory contains the project.yaml file with the required metadata and links to various project files/objects, and is read during the load process.

In this section

See also details on loading and using projects with CI/CD frameworks.

Load projects using the SDK#

When a project is already created and stored in a local dir, git, or archive, you can quickly load and use it with the load_project() method. load_project uses a local context directory (with initialized git) or clones a remote repo into the local dir and returns a project object.

You need to provide the git/zip/tar archive url. The context dir, by default, is "./", which is the directory the MLRun client runs from. The name can be specified or taken from the project object. The project can also specify secrets (dict with repo credentials), init_git flag (initializes Git in the context dir), clone flag (project is cloned into the context dir, and the local copy is ignored/deleted), and user_project flag (indicates the project name is unique to the user).

Example of loading a project from git, using the default context dir, and running the main workflow:

# load the project and run the 'main' workflow
project = load_project(name="myproj", url="git://github.com/mlrun/project-archive.git")
project.run("main", arguments={'data': data_url})
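
When the repository is private, or you want to control the clone behavior, you can combine the parameters described above. A minimal sketch (the repository URL and token are placeholders):

# clone a private repo into ./work and pass the Git credentials as project secrets
project = mlrun.load_project(
    context="./work",
    url="git://github.com/<org>/<private-repo>.git",
    name="myproj",
    clone=True,                                        # overwrite/clone into the context dir
    secrets={"GIT_TOKEN": "<personal-access-token>"},  # placeholder credentials
)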

Note

If the url parameter is not specified, load_project searches for a Git repo inside the context dir and uses its metadata; or, if the flag init_git=True is set, it initializes a Git repo in the target context directory.

Note

When working with a private Git, set the project secrets. See MLRun-managed secrets.

After the project object is loaded use the run() method to execute workflows. See details on building and running workflows, and how to run, build, or deploy individual functions.

You can edit or add project elements like functions, workflows, artifacts, etc. (See create and use projects.) Once you make changes use GIT or MLRun commands to push those changes to the archive (See save into git or an archive.)

Load projects using the CLI#

Loading a project from git into ./ :

mlrun project -n myproj --url "git://github.com/mlrun/project-demo.git" .

Running a specific workflow (main) from the project stored in . (current dir):

mlrun project --run main --watch .

CLI usage details

Usage: mlrun project [OPTIONS] [CONTEXT]

Options:
  -n, --name TEXT           project name
  -u, --url TEXT            remote git or archive url
  -r, --run TEXT            run workflow name of .py file
  -a, --arguments TEXT      Kubeflow pipeline arguments name and value tuples
                            (with -r flag), e.g. -a x=6
  -p, --artifact-path TEXT  output artifacts path
  -x, --param TEXT          mlrun project parameter name and value tuples,
                            e.g. -p x=37 -p y='text'
  -s, --secrets TEXT        secrets file=<filename> or env=ENV_KEY1,..
  --db TEXT                 api and db service path/url
  --init-git                for new projects init git context
  -c, --clone               force override/clone into the context dir
  --sync                    sync functions into db
  -w, --watch               wait for pipeline completion (with -r flag)
  -d, --dirty               allow run with uncommitted git changes
  --handler TEXT            workflow function handler name
  --engine TEXT             workflow engine (kfp/local/remote)
  --local                   try to run workflow functions locally
  --timeout INTEGER         timeout in seconds to wait for pipeline completion
                            (used when watch=True)
  --env-file TEXT           path to .env file to load config/variables from
  --ensure-project          ensure the project exists, if not, create project
  --schedule TEXT           To create a schedule define a standard crontab
                            expression string. For using the
                            pre-defined workflow's schedule, set --schedule 'true'

Run, build, and deploy functions#

In this section

Overview#

There is a set of methods used to deploy and run project functions. They can be used interactively or inside a pipeline (e.g. Kubeflow). When used inside a pipeline, each method is automatically mapped to the relevant pipeline engine command.

  • run_function() — Run a function as a local or remote batch or scheduled task

  • build_function() — deploy an ML function, build a container with its dependencies for use in runs

  • deploy_function() — deploy real-time/online (nuclio or serving based) functions

Use these methods as project methods. For example:

# run the "train" function in myproject
run = myproject.run_function("train", inputs={"data": data_url})  

The first parameter in all three methods is either the function name (in the project), or a function object, used if you want to specify functions that you imported/created ad hoc, or to modify a function spec. For example:

# import a serving function from the Function Hub and deploy a trained model over it
serving = import_function("hub://v2_model_server", new_name="serving")
serving.spec.replicas = 2
deploy = deploy_function(
  serving,
  models=[{"key": "mymodel", "model_path": train.outputs["model"]}],
)

You can use the get_function() method to get the function object and manipulate it, for example:

trainer = project.get_function("train")
trainer.with_limits(mem="2G", cpu=2, gpus=1)
run = project.run_function("train", inputs={"data": data_url}) 

run_function#

Use the run_function() method to run a local or remote batch/scheduled task. The run_function method accepts various parameters such as name, handler, params, inputs, schedule, etc. Alternatively, you can pass a Task object (see: new_task()) that holds all of the parameters and the advanced options.

Functions can host multiple methods (handlers). You can set the default handler per function. You need to specify which handler you intend to call in the run command. You can pass parameters (arguments) or data inputs (such as datasets, feature-vectors, models, or files) to the functions through the run_function method.

The run_function() command returns an MLRun RunObject object that you can use to track the job and its results. If you pass the parameter watch=True (default), the command blocks until the job completes.
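
For example, a minimal sketch that starts the job without blocking and then inspects the returned run object (assuming a hypothetical "train" function with a matching handler is registered in the project):

# start the job without blocking, then track it through the returned RunObject
run = project.run_function("train", handler="train", params={"epochs": 3}, watch=False)
run.wait_for_completion()   # block here until the job finishes
print(run.state())          # e.g. "completed"
print(run.outputs)          # results and artifact URIs logged by the job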

MLRun also supports iterative jobs that can run and track multiple child jobs (for hyperparameter tasks, AutoML, etc.). See Hyperparameter tuning optimization for details and examples.
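
A minimal sketch of such an iterative run, assuming the same hypothetical "train" function logs an accuracy result:

# run "train" once per parameter combination and select the best child run by the "accuracy" result
run = project.run_function(
    "train",
    hyperparams={"lr": [0.1, 0.01], "batch_size": [32, 64]},
    selector="max.accuracy",
)
print(run.outputs)   # includes the best iteration and its results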

Read further details on running tasks and getting their results.

Run/simulate functions locally:

Functions can also run and be debugged locally by using the local runtime or by setting the local=True parameter in the run() method (for batch functions).

Usage examples:

# create a project with two functions (local and from Function Hub)
project = mlrun.new_project(project_name, "./proj")
project.set_function("mycode.py", "prep", image="mlrun/mlrun")
project.set_function("hub://auto_trainer", "train")

# run functions (refer to them by name)
run1 = project.run_function("prep", params={"x": 7}, inputs={'data': data_url})
run2 = project.run_function("train", inputs={"dataset": run1.outputs["data"]})
run2.artifact('confusion-matrix').show()

Example with new_task:

import mlrun
from mlrun import new_task, mlconf
from os import path

project = mlrun.get_or_create_project('example-project', context='./')
artifact_path = path.join(mlconf.artifact_path, '{{run.uid}}')

# handler that logs one artifact per model name
def handler(context, param, model_names):
    context.logger.info("Running handler")
    context.set_label('category', 'tests')
    for model_name, file_name in model_names:
        context.log_artifact(model_name, body=param.encode(), local_path=file_name)

func = project.set_function("my-func", kind="job", image="mlrun/mlrun")
func.save()

# pack the run options (name, handler, parameters, artifact path) into a task object
task = new_task(name='mytask', project='example-project', handler=handler, artifact_path=artifact_path,
                params={'param': 'abc', 'model_names': [('model_a', 'model_a.txt')]})
run_object = project.run_function("my-func", local=True, base_task=task)

See mlrun.model.new_task() for a description of the new_task parameters.

build_function#

The build_function() method is used to deploy an ML function and build a container with its dependencies for use in runs.

Example:

# build the "trainer" function image (based on the specified requirements and code repo)
project.build_function("trainer")

The build_function() method accepts different parameters that can add to, or override, the function build spec. You can specify the target or base image, extra docker commands, the builder environment, and source credentials (builder_env), etc.
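
For example, a minimal sketch that overrides parts of the build spec; the image names, requirement, and token below are placeholders:

project.build_function(
    "trainer",
    image=".trainer-image",                   # target image name
    base_image="mlrun/mlrun",
    requirements=["plotly~=5.0"],
    commands=["python -m pip install --upgrade pip"],
    builder_env={"GIT_TOKEN": "<token>"},     # placeholder source credentials
)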

See further details and examples in Build function image.

deploy_function#

The deploy_function() method is used to deploy real-time/online (nuclio or serving) functions and pipelines. Read more about Real-time serving pipelines.

Basic example:

# Deploy a real-time nuclio function ("myapi")
deployment = project.deploy_function("myapi")

# invoke the deployed function (using HTTP request) 
resp = deployment.function.invoke("/do")

You can provide an env dict with extra environment variables, a models list that specifies models and their attributes (in the case of serving functions), the builder environment, and source credentials (builder_env).
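
For example, a minimal sketch that passes extra environment variables and model attributes to a hypothetical "serving" function (the model URI and token are placeholders):

deployment = project.deploy_function(
    "serving",
    models=[{"key": "mymodel", "model_path": "store://models/myproj/mymodel:latest"}],
    env={"LOG_LEVEL": "debug"},               # extra environment variables for the function pod
    builder_env={"GIT_TOKEN": "<token>"},     # placeholder source credentials for the image build
)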

Example of using deploy_function inside a pipeline, after the train step, to generate a model:

# Deploy the trained model (from the "train" step) as a serverless serving function
serving_fn = mlrun.new_function("serving", image="mlrun/mlrun", kind="serving")
mlrun.deploy_function(
  serving_fn,
  models=[
      {
          "key": model_name,
          "model_path": train.outputs["model"],
          "class_name": 'mlrun.frameworks.sklearn.SklearnModelServer',
      }
  ],
)

Note

If you want to create a simulated (mock) function instead of a real Kubernetes service, set the mock flag to True. See the deploy_function API.

Default image#

You can set a default image for the project. This image will be used for deploying and running any function that does not have an explicit image assigned, and replaces MLRun's default image of mlrun/mlrun. To set the default image use the set_default_image() method with the name of the image.

The default image is applied to the functions in the process of enriching the function prior to running or deploying. Functions will therefore use the default image set in the project at the time of their execution, not the image that was set when the function was added to the project.

For example:

 project = mlrun.new_project(project_name, "./proj")
 # use v1 of a pre-built image as default
 project.set_default_image("myrepo/my-prebuilt-image:v1")
 # set function without an image, will use the project's default image
 project.set_function("mycode.py", "prep")

 # function will run with the "myrepo/my-prebuilt-image:v1" image
 run1 = project.run_function("prep", params={"x": 7}, inputs={'data': data_url})

 ...

 # replace the default image with a newer v2
 project.set_default_image("myrepo/my-prebuilt-image:v2")
 # function will now run using the v2 version of the image 
 run2 = project.run_function("prep", params={"x": 7}, inputs={'data': data_url})

Read more about Images and their usage in MLRun.

Image build configuration#

Use the set_default_image() function to configure a project to use an existing image. The configuration for building this default image can be contained within the project, by using the build_config() and build_image() functions.

The project build configuration is maintained in the project object. When saving, exporting, or importing the project, these configurations are carried over with it. This makes it simple to transport a project between systems while ensuring that the needed runtime images are built and ready for execution.

When using build_config(), build configurations can be passed along with the resulting image name, and these are used to build the image. The image name is assigned following these rules, based on the project configuration and provided parameters:

  1. If provided, the name passed in the image parameter of build_config().

  2. The project's default image name, if configured using set_default_image().

  3. The value set in MLRun's default_project_image_name config parameter - by default this value is .mlrun-project-image-{name} with the project name as template parameter.

For example:

 # Set image config for current project object, using base mlrun image with additional requirements. 
 image_name = ".my-project-image"
 project.build_config(
     image=image_name,
     set_as_default=True,
     with_mlrun=False,
     base_image="mlrun/mlrun",
     requirements=["vaderSentiment"],
 )

 # Export the project configuration. The yaml file will contain the build configuration
 proj_file_path = "~/mlrun/my-project/project.yaml"
 project.export(proj_file_path)

This project can then be imported and the default image can be built:

 # Import the project as a new project with a different name
 new_project = mlrun.load_project("~/mlrun/my-project", name="my-other-project")
 # Build the default image for the project, based on project build config
 new_project.build_image()

 # Set a new function and run it (new function uses the my-project-image image built previously)
 new_project.set_function("sentiment.py", name="scores", kind="job", handler="handler")
 new_project.run_function("scores")

build_image#

The build_image() function builds an image using the existing build configuration. This method can also be used to set the build configuration and build the image based on it - in a single step.

If you set a source for the project (for example, git source) and set pull_at_runtime = False, then the generated image contains the project source in it. For example, this code builds .some-project-image image with the source in it.

project = mlrun.get_or_create_project(
    name="project-name", context="./"
)

project.set_source(
    "git://some/repo",
    pull_at_runtime=False
)

project.build_image(image=".some-project-image")

And now you can run a function based on the project code without having to specify an image:

func = project.set_function(handler="package.function", name="func", kind="job")
func.save()
project.run_function("func", params={...})

When using set_as_default=False any build config provided is still kept in the project object but the generated image name is not set as the default image for this project. For example:

image_name = ".temporary-image"
project.build_image(image=image_name, set_as_default=False)

# Create a function using the temp image name
project.set_function("sentiment.py", name="scores", kind="job", handler="handler", image=image_name)

Build and run workflows/pipelines#

This section shows how to write a batch pipeline so that it can be executed via an MLRun Project. With a batch pipeline, you can use the MLRun Project to execute several Functions in a DAG using the Python SDK or CLI.

This example creates a project with three MLRun functions and a single pipeline that orchestrates them. The pipeline steps are:

  • get-data — Get iris data from sklearn

  • train-model — Train model via sklearn

  • deploy-model — Deploy model to HTTP endpoint

import mlrun
project = mlrun.get_or_create_project("iguazio-academy", context="./")

Add functions to a project#

Add the functions to a project:

project.set_function(name='get-data', func='functions/get_data.py', kind='job', image='mlrun/mlrun')
project.set_function(name='train-model', func='functions/train.py', kind='job', image='mlrun/mlrun')
project.set_function(name='deploy-model', func='hub://v2_model_server')

Write a pipeline#

Next, define the pipeline that orchestrates the three components. This pipeline is simple; however, you can create very complex pipelines with branches, conditions, and more.

Tip

To pass parameters between steps, use the outputs parameter.

%%writefile pipelines/training_pipeline.py
from kfp import dsl
import mlrun

@dsl.pipeline(
    name="batch-pipeline-academy",
    description="Example of batch pipeline for Iguazio Academy"
)
def pipeline(label_column: str, test_size=0.2):

    # Ingest the data set
    ingest = mlrun.run_function(
        'get-data',
        handler='prep_data',
        params={'label_column': label_column},
        outputs=["iris_dataset"]
    )

    # Train a model
    train = mlrun.run_function(
        "train-model",
        handler="train_model",
        inputs={"dataset": ingest.outputs["iris_dataset"]},
        params={
            "label_column": label_column,
            "test_size": test_size
        },
        outputs=['model']
    )

    # Deploy the model as a serverless function
    deploy = mlrun.deploy_function(
        "deploy-model",
        models=[{"key": "model", "model_path": train.outputs["model"]}]
    )

Add a pipeline to a project#

Add the pipeline to your project:

project.set_workflow(name='train', workflow_path="pipelines/training_pipeline.py")
project.save()

Working with secrets#

When executing jobs through MLRun, the code might need access to specific secrets, for example to access data residing on a data-store that requires credentials (such as a private S3 bucket), or many other similar needs.

MLRun provides some facilities that allow handling secrets and passing those secrets to execution jobs. It's important to understand how these facilities work, as this has implications on the level of security they provide and how much exposure they create for your secrets.

In this section

Overview#

There are two main use-cases for providing secrets to an MLRun job. These are:

  • Use MLRun-managed secrets. This is a flow that enables the MLRun user (for example a data scientist or engineer) to create and use secrets through interfaces that MLRun implements and manages.

  • Create secrets externally to MLRun using a Kubernetes secret or some other secret management framework (such as Azure vault), and utilize these secrets from within MLRun to enrich execution jobs. For example, the secrets are created and managed by an IT admin, and the data-scientist only accesses them.

The following sections cover the details of those two use-cases.

MLRun-managed secrets#

The easiest way to pass secrets to MLRun jobs is through the MLRun project secrets mechanism. MLRun jobs automatically gain access to all project secrets defined for the same project. More details are available later in this page.

The following is an example of using project secrets:

# Create project secrets for the myproj project
project = mlrun.get_or_create_project("myproj", "./")
secrets = {'AWS_KEY': '111222333'}
project.set_secrets(secrets=secrets, provider="kubernetes")

# Create and run the MLRun job
function = mlrun.code_to_function(
    name="secret_func",
    filename="my_code.py",
    handler="test_function",
    kind="job",
    image="mlrun/mlrun"
)
function.run()

The handler defined in my_code.py accesses the AWS_KEY secret by using the get_secret() API:

def test_function(context):
    context.logger.info("running function")
    aws_key = context.get_secret("AWS_KEY")
    # Use aws_key to perform processing.
    ...

To create GIT_TOKEN secrets, use this command:

project.set_secrets({"GIT_TOKEN": "<git token>"})
Using tasks with secrets#

MLRun uses the concept of tasks to encapsulate runtime parameters. Tasks are used to specify execution context such as hyper-parameters. They can also be used to pass details about secrets that are going to be used in the runtime. This allows for control over specific secrets passed to runtimes, and support for the various MLRun secret providers.

To pass secret parameters, use the Task's with_secrets() function. For example, the following command passes specific project-secrets to the execution context:

function = mlrun.code_to_function(
    name="secret_func",
    filename="my_code.py",
    handler="test_function",
    kind="job",
    image="mlrun/mlrun"
)
task = mlrun.new_task().with_secrets("kubernetes", ["AWS_KEY", "DB_PASSWORD"])
run = function.run(task, ...)

The with_secrets() function tells MLRun what secrets the executed code needs to access. The MLRun framework prepares the needed infrastructure to make these secrets available to the runtime, and passes information about them to the execution framework by specifying those secrets in the spec of the runtime. For example, if running a kubernetes job, the secret keys are noted in the generated pod's spec.

The actual details of MLRun's handling of the secrets differ per the secret provider used. The following sections provide more details on these providers and how they handle secrets and their values.

Regardless of the type of secret provider used, the executed code uses the get_secret() API to gain access to the value of the secrets passed to it, as shown in the above example.

Secret providers#

MLRun provides several secret providers. Each of these providers functions differently and has different traits with respect to what secrets can be passed and how they're handled. It's important to understand these parameters to make sure secrets are not compromised and that their secrecy is maintained.

Warning

The Inline, environment and file providers do not guarantee confidentiality of the secret values handled by them, and should only be used for development and demo purposes. The Kubernetes and Azure Vault providers are secure and should be used for any other use-case.

Kubernetes project secrets#

MLRun can use Kubernetes (k8s) secrets to store and retrieve secret values on a per-project basis. This method is supported for all runtimes that generate k8s pods. MLRun creates a k8s secret per project, and stores multiple secret keys within this secret. Project secrets can be created through the MLRun SDK as well as through the MLRun UI.

By default, all jobs in a project automatically get access to all the associated project secrets. There is no need to use with_secrets to provide access to project secrets.

Creating project secrets#

To populate the MLRun k8s project secret with secret values, use the project object's set_secrets() function, which accepts a dictionary of secret values or a file containing a list of secrets. For example:

# Create project secrets for the myproj project.
project = mlrun.get_or_create_project("myproj", "./")
secrets = {'password': 'myPassw0rd', 'AWS_KEY': '111222333'}
project.set_secrets(secrets=secrets, provider="kubernetes")

Warning

This action should not be part of the code committed to git or part of ongoing execution - it is only a setup action, which normally should only be executed once. After the secrets are populated, this code should be removed to protect the confidentiality of the secret values.

The MLRun API does not allow the user to see project secrets values, but it does allow seeing the keys that belong to a given project, assuming the user has permissions on that specific project. See the HTTPRunDB class documentation for additional details.
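
For example, a sketch of listing the secret keys (not the values) through the SDK client; the exact return object may differ between MLRun versions:

import mlrun

# returns only the key names; secret values are never exposed through the API
keys = mlrun.get_run_db().list_project_secret_keys("myproj")
print(keys)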

When MLRun is executed in the Iguazio platform, the secret management APIs are protected by the platform such that only users with permissions to access and modify a specific project can alter its secrets.

Creating secrets in the Projects UI page#

The Settings dialog in the Projects page, accessed with the Settings icon, has a Secrets tab where you can add secrets as key-value pairs. The secrets are automatically available to all jobs belonging to this project. Users with the Editor or Admin role can add, modify, and delete secrets, and assign new secret values. Viewers can only view the secret keys. The values themselves are not visible to any users.

Accessing the secrets#

By default, any runtime not executed locally (local=False) automatically gains access to all the secrets of the project it belongs to, so no configuration is required to enable that. Jobs that are executed locally (local=True) do not have access to the project secrets. It is possible to limit access of an executing job to a subset of these secrets by calling the following function with a list of the secrets to be accessed:

task.with_secrets('kubernetes', ['password', 'AWS_KEY'])

When the job is executed, the MLRun framework adds environment variables to the pod spec whose value is retrieved through the k8s valueFrom option, with secretKeyRef pointing at the secret maintained by MLRun. As a result, this method does not expose the secret values at all, except inside the pod executing the code where the secret value is exposed through an environment variable. This means that even a user with kubectl looking at the pod spec cannot see the secret values.

Users, however, can view the secrets using the following methods:

  • Run kubectl to view the actual contents of the k8s secret.

  • Perform kubectl exec into the running pod, and examine the environment variables.

To maintain the confidentiality of secret values, these operations must be strictly limited across the system by using k8s RBAC and by ensuring that elevated permissions are granted to a very limited number of users.

Accessing secrets in nuclio functions#

Nuclio functions do not have the MLRun context available to retrieve secret values. Secret values need to be retrieved from the environment variable of the same name. For example, to access the AWS_KEY secret in a nuclio function use:

aws_key = os.environ.get("AWS_KEY")
Azure Vault#

MLRun can serve secrets from an Azure key Vault.

Note

Azure Key Vault supports three types of entities: keys, secrets, and certificates. MLRun supports accessing only secret entities.

Setting up access to Azure key vault#

To enable this functionality, a secret containing the Azure Key Vault credentials must first be created in the k8s cluster. This secret should include credentials providing access to your specific Azure Key Vault. To configure this, the following steps are needed:

  1. Set up a key vault in your Azure subscription.

  2. Create a service principal in Azure that will be granted access to the key vault. For creating a service principal through the Azure portal follow the steps listed in this page.

  3. Assign a key vault access policy to the service principal, as described in this page.

  4. Create a secret access key for the service principal, following the steps listed in this page. Make sure you have access to the following three identifiers:

    • Directory (tenant) id

    • Application (client) id

    • Secret key

  5. Generate a k8s secret with those details. Use the following command:

    kubectl -n <namespace> create secret generic <azure_key_vault_k8s_secret> \
       --from-literal=secret=<secret key> \
       --from-literal=tenant_id=<tenant id> \
       --from-literal=client_id=<client id>
    

Note

The names of the secret keys must be as shown in the above example, as MLRun queries them by these exact names.

Accessing Azure key vault secrets#

Once these steps are done, use with_secrets in the following manner:

task.with_secrets(
    "azure_vault",
    {
        "name": <azure_key_vault_name>,
        "k8s_secret": <azure_key_vault_k8s_secret>,
        "secrets": [],
    },
)

The name parameter should point at your Azure key Vault name. The secrets parameter is a list of the secret keys to be accessed from that specific vault. If it's empty (as in the example above) then all secrets in the vault can be accessed by their key name.

For example, if the Azure Vault has a secret whose name is MY_AZURE_SECRET and using the above example for with_secrets(), the executed code can use the following statement to access this secret:

azure_secret = context.get_secret("MY_AZURE_SECRET")

In terms of confidentiality, the executed pod has the Azure secret provided by the user mounted to it. This means that the access-keys to the vault are visible to a user that execs into the pod in question. The same security rules should be followed as described in the Kubernetes section above.

Demo/Development secret providers#

The rest of the MLRun secret providers are not secure by design, and should only be used for demonstration or development purposes.

Inline#

The inline secrets provider is a very basic framework that should mostly be used for testing and demos. The secrets passed by this framework are exposed in the source code creating the MLRun function, as well as in the function spec, and in the generated pod specs. To add inline secrets to a job, perform the following:

task.with_secrets("inline", {"MY_SECRET": "12345"})

As can be seen, even the client code exposes the secret value. If this is used to pass secrets to a job running in a kubernetes pod, the secret is also visible in the pod spec. This means that any user that can run kubectl and is permitted to view pod specs can also see the secret keys and their values.

Environment#

Environment variables are similar to the inline secrets, but their client-side value is not specified directly in code but rather is extracted from a client-side environment variable. For example, if running MLRun on a Jupyter notebook and there are environment variables named MY_SECRET and ANOTHER_SECRET on Jupyter, the following code
passes those secrets to the executed runtime:

task.with_secrets("env", "MY_SECRET, ANOTHER_SECRET")

When generating the runtime execution environment (for example, pod for the job runtime), MLRun retrieves the value of the environment variable and places it in the pod spec. This means that a user with kubectl capabilities who can see pod specs can still see the secret values passed in this manner.

File#

The file provider is used to pass secret values that are stored in a local file. The file must consist of lines, each containing a secret name and its value separated by =. For example:

# secrets.txt
SECRET1=123456
SECRET2=abcdef

Use the following command to add these secrets:

task.with_secrets("file", "/path/to/file/secrets.txt")

Externally managed secrets#

MLRun provides facilities for mapping externally created k8s secrets to executed jobs. To enable this, modify the spec of the runtime by mounting secrets to it, either as files or as environment variables containing specific keys from the secret.

In the following examples, assume a k8s secret called my-secret was created in the same k8s namespace where MLRun is running, with two keys in it - secret1 and secret2.

Mapping secrets to environment#

The following example adds these two secret keys as environment variables to an MLRun job:

function = mlrun.code_to_function(
    name="secret_func",
    handler="test_function",
    ...
)

function.set_env_from_secret(
    "SECRET_ENV_VAR_1", secret="my-secret", secret_key="secret1"
)
function.set_env_from_secret(
    "SECRET_ENV_VAR_2", secret="my-secret", secret_key="secret2"
)

This only takes effect for functions executed remotely, as the secret value is injected to the function pod, which does not exist for functions executed locally. Within the function code, the secret values will be exposed as regular environment variables, for example:

import os

# Function handler
def test_function(context):
    # Get the value stored in the secret2 key
    my_secret_value = os.environ.get("SECRET_ENV_VAR_2")
    ...
Mapping secrets as files#

A k8s secret can be mapped as a filesystem folder to the function pod using the mount_secret() function:

# Mount all keys in the secret as files under /mnt/secrets
function.apply(mlrun.platforms.mount_secret("my-secret", "/mnt/secrets/"))

In this example, the two keys in my-secret are created as two files in the function pod, called /mnt/secrets/secret1 and /mnt/secrets/secret2. Reading these files provides the values. It is possible to limit the keys mounted to the function - see the documentation of mount_secret() for more details.
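
For example, a minimal sketch that mounts only the secret1 key (this assumes mount_secret() accepts a Kubernetes-style items list; check the mount_secret() reference for the exact parameter):

# Mount only the secret1 key of my-secret, under /mnt/secrets/secret1
function.apply(
    mlrun.platforms.mount_secret(
        "my-secret", "/mnt/secrets/", items=[{"key": "secret1", "path": "secret1"}]
    )
)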

MLRun project bootstrapping with project_setup.py#

Overview#

The project_setup.py script in MLRun automates project initialization and configuration, facilitating seamless setup of MLRun projects by registering functions, workflows, Git sources, Docker images, and more. It ensures consistency by registering and updating all functions and workflows within the project.

Upon loading an MLRun project via get_or_create_project() or load_project(), the system automatically invokes the project_setup.py script.

Note: Ensure the script resides in the root of the project context.

import mlrun

# Load or create an MLRun project
project = mlrun.get_or_create_project("my-project") # project_setup.py called while loading project

Format#

The project_setup.py script returns the updated MLRun project after applying the specified configurations. It should have a setup function which receives an MlrunProject and returns an MlrunProject.

def setup(project: mlrun.projects.MlrunProject) -> mlrun.projects.MlrunProject:
    # ... (setup configurations)

    # Save and return the project:
    project.save()
    return project

Example Usage#

Here's an example directory structure of a project utilizing the project_setup.py script:

.
├── .env
└── src
    ├── functions
    │   ├── data.py
    │   └── train.py
    ├── project_setup.py
    └── workflows
        └── main_workflow.py

The project_setup.py script looks like the following:

import os

import mlrun


def setup(project: mlrun.projects.MlrunProject) -> mlrun.projects.MlrunProject:
    source = project.get_param("source")
    secrets_file = project.get_param("secrets_file")
    default_image = project.get_param("default_image")

    # Set project git/archive source and enable pulling latest code at runtime
    if source:
        print(f"Project Source: {source}")
        project.set_source(project.get_param("source"), pull_at_runtime=True)

    # Create project secrets and also load secrets in local environment
    if secrets_file and os.path.exists(secrets_file):
        project.set_secrets(file_path=secrets_file)
        mlrun.set_env_from_file(secrets_file)

    # Set default project docker image - functions that do not specify image will use this
    if default_image:
        project.set_default_image(default_image)

    # MLRun Functions - note that paths are relative to the project context (./src)
    project.set_function(
        name="get-data",
        func="functions/data.py",
        kind="job",
        handler="get_data",
    )

    project.set_function(
        name="train",
        func="functions/train.py",
        kind="job",
        handler="train_model",
    )

    # MLRun Workflows - note that paths are relative to the project context (./src)
    project.set_workflow("main", "workflows/main_workflow.py")

    # Save and return the project:
    project.save()
    return project

The project can then be loaded using the following code snippet:

project = mlrun.get_or_create_project(
    name="my-project",
    context="./src", # project_setup.py should be in this directory
    parameters={
        "source" : "https://github.com/mlrun/my-repo#main",
        "secrets_file" : ".env",
        "default_image" : "mlrun/mlrun"
    }
)

Common Operations#

Some common operations that can be added to the project_setup.py script include:

Set Project Source#

Set the project source and enable pulling at runtime if specified. See set_source() for more info.

source = project.get_param("source") # https://github.com/mlrun/my-repo#main

project.set_source(source, pull_at_runtime=True)
Export Project to Zip File Archive#

Export the local project directory contents to a zip file archive. Use this in conjunction with setting the project source for rapid iteration without requiring a Git commit for each change. See set_source() and export() for more info.

Note: This requires using the Iguazio v3io data layer or an S3-compliant object storage such as MinIO.

source = project.get_param("source") # v3io:///bigdata/my_project.zip

project.set_source(source, pull_at_runtime=True)
if ".zip" in source:
    print(f"Exporting project as zip archive to {source}...")
    project.export(source)
Set Existing Default Project Image#

Define the default Docker image for the project. It will be used for functions without a specified image. See set_default_image() for more info.

default_image = project.get_param("default_image") # mlrun/mlrun

if default_image:
    project.set_default_image(default_image)
Build a Docker Image#

Build a Docker image and optionally set it as the project default. See build_image() for more info.

base_image = project.get_param("base_image") # mlrun/mlrun
requirements_file = project.get_param("requirements_file") # requirements.txt

project.build_image(
    base_image=base_image,
    requirements_file=requirements_file,
    set_as_default=True
)
Register Functions#

Register MLRun functions within the project, specifying their names, associated files, kind (e.g., job), and handlers. See set_function() for more info.

project.set_function(
    name="get-data",
    func="data.py",
    kind="job",
    handler="get_data",
)
Define Workflows#

Define MLRun workflows within the project, associating them with specific files. See set_workflow() for more info.

project.set_workflow("main", "main_workflow.py")
Manage Secrets#

Create project secrets by setting them from a specified file path and load them as environment variables in the local environment. See set_secrets() and set_env_from_file() for more info.

secrets_file = project.get_param("secrets_file") # .env

if secrets_file and os.path.exists(secrets_file):
    project.set_secrets(file_path=secrets_file)
    mlrun.set_env_from_file(secrets_file)
Register Project Artifacts#

Register artifacts like models or datasets in the project. Useful for version control and transferring artifacts between environments (e.g. dev, staging, prod) via CI/CD. See set_artifact() and register_artifacts() for more info.

project.set_artifact(
    key="model",
    artifact="artifacts/model:challenger.yaml", # YAML file in project directory
    tag="challenger"
)
project.register_artifacts()
Define K8s Resource Requirements for Functions#

Add Kubernetes resources by setting requests/limits for a given MLRun function. See CPU, GPU, and memory limits for user jobs for more info.

gpus = project.get_param("num_gpus_per_replica") or 4
cpu = project.get_param("num_cpus_per_replica") or 48
mem = project.get_param("memory_per_replica") or "192Gi"

train_function = project.set_function(
    "trainer.py",
    name="training",
    kind="job",
)
train_function.with_limits(gpus=gpus, cpu=cpu, mem=mem)
train_function.save()
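
Requests can be added in the same way; a minimal sketch that reserves a baseline of CPU and memory for the same function (with_requests() is the request-side counterpart of with_limits()):

# Reserve baseline resources; the scheduler guarantees these to the function pod
train_function.with_requests(cpu=8, mem="64Gi")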


Functions#

All the executions in MLRun are based on serverless functions. Functions are essentially Python code that can be executed locally or on a Kubernetes cluster. MLRun functions are used to run jobs, deploy models, create pipelines, and more.

There are various kinds of MLRun functions with different capabilities; however, there are commonalities across all of them. In general, an MLRun function looks like the following:

mlrun-architecture

MLRun supports numerous real-time and batch runtimes, described in Kinds of functions (runtimes). The different function runtimes take care of automatically transforming the code and spec to fully managed and elastic services over Kubernetes, which saves significant operational overhead, addresses scalability, and reduces infrastructure costs. The function parameters and capabilities are explained in more detail in Create and use functions.

Function objects are all-inclusive, containing the code and all the operational aspects: image, required packages, pod resource configuration (replicas, CPU/GPU/memory limits, volumes, Spot vs. On-demand nodes, pod priority, node affinity), storage, environment, metadata definitions, and so on. Each function is versioned and stored in the MLRun database with a unique hash code, and gets a new hash code upon changes.

You can use the MLRun auto-logging to log results and artifacts, and to automatically and seamlessly track machine-learning processes while they execute, such as training a model. See Decorators and auto-logging.

Functions and projects#

Functions are members of an MLRun project, a container for all your work on a particular ML application. Once you register a function within a project, you can execute it in your local environment or at scale on a Kubernetes cluster.
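
For example, a minimal sketch of the two execution modes, assuming a function named "trainer" is already registered in the project:

# Run the registered function in the local environment (current process/notebook)
project.run_function("trainer", local=True)

# Run the same function as a job on the Kubernetes cluster
project.run_function("trainer", local=False)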

The relationship between functions, workflows, and projects is as follows:

MLRun Function

After the MLRun functions and workflows are created and registered into the project, they are invoked using the project object. This workflow pairs especially well with CI/CD automation with Git.

Function hub#

Since function objects are all-inclusive (code, spec, API, and metadata definitions), they can be stored in, and reused from, a shared and versioned function hub. This means that multiple users can share the same MLRun project and get access to objects associated with it.

MLRun has an open public function hub that stores many pre-developed functions for use in your projects. Read more in Function hub.
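
For example, a hub function can be registered in a project by its hub:// URI and then run like any other project function (the describe function is one of the hub functions used later in this document; the data URL is a placeholder):

# Import a pre-built function from the public function hub into the project
project.set_function("hub://describe", name="describe")

# Run it like any other project function
project.run_function("describe", inputs={"table": "<data-url>"})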

Distributed functions#

Many of the runtimes support horizontal scaling across multiple containers. You can specify the number of replicas or a min-max value range (for auto-scaling in Dask or Nuclio). When scaling functions, MLRun uses a high-speed messaging protocol and shared storage (volumes, objects, databases, or streams). MLRun runtimes handle the orchestration and monitoring of the distributed task.

runtime-scaling
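
For example, a minimal sketch of the two options (the function objects here are illustrative; the Dask attributes are described later in this document):

# Fixed number of replicas on a job-style function
fn.spec.replicas = 2

# Auto-scaling range on a Dask cluster function
dask_fn.spec.min_replicas = 1
dask_fn.spec.max_replicas = 4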

Hyperparameters#

MLRun also supports iterative tasks for automatic and distributed execution of many tasks with variable parameters (hyperparams). See Hyperparameter tuning optimization.
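
For example, a minimal sketch of a grid search over two illustrative parameters, selecting the iteration with the highest accuracy result:

project.run_function(
    "trainer",  # illustrative function name
    hyperparams={"lr": [0.01, 0.1], "batch_size": [32, 64]},
    selector="max.accuracy",  # pick the best iteration by the accuracy result
)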


Kinds of functions (runtimes)#

When you create an MLRun function you need to specify a runtime kind (e.g. kind='job'). Each runtime supports its own specific attributes (e.g. Jars for Spark, Triggers for Nuclio, Auto-scaling for Dask, etc.).

MLRun supports real-time and batch runtimes.

Real-time runtimes:

  • nuclio - real-time serverless functions over Nuclio

  • serving - deploy models and higher-level real-time Graph (DAG) over one or more Nuclio functions

Batch runtimes:

  • handler - execute python handler (used automatically in notebooks or for debug)

  • local - execute a Python or shell program

  • job - run the code in a Kubernetes Pod

  • dask - run the code as a Dask Distributed job (over Kubernetes)

  • databricks - run code on Databricks cluster (python scripts, Spark etc)

  • mpijob - run distributed jobs and Horovod over the MPI job operator, used mainly for deep learning jobs

  • spark - run the job as a Spark job (using Spark Kubernetes Operator)

  • remote-spark - run the job on a remote Spark service/cluster (e.g. Iguazio Spark service)

Common attributes for Kubernetes-based functions

All the Kubernetes-based runtimes (Job, Dask, Spark, Nuclio, MPIJob, Serving) support a common set of spec attributes and methods for setting the Pods:

function.spec attributes (similar to k8s pod spec attributes):

  • volumes

  • volume_mounts

  • env

  • resources

  • replicas

  • image_pull_policy

  • service_account

  • image_pull_secret

common function methods:

  • set_env(name, value)

  • set_envs(env_vars)

  • gpus(gpus, gpu_type)

  • set_env_from_secret(name, secret, secret_key)

The limits methods are different for Spark and Dask:

  • Spark

    • with_driver_limits(mem, cpu, gpu_type)

    • with_executor_limits(mem, cpu, gpu_type)

  • Dask

    • with_scheduler_limits(mem, cpu, gpu_type)

    • with_worker_limits(mem, cpu, gpu_type)
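
For example, a minimal sketch that combines a few of these methods on a job function (the file name, handler, environment variable names, and secret are illustrative):

fn = mlrun.code_to_function(
    "my-job", filename="job.py", kind="job", image="mlrun/mlrun", handler="handler"
)

# environment variables and secret-backed env vars on the function pod
fn.set_env("MODEL_DIR", "/models")
fn.set_env_from_secret("AWS_KEY", secret="my-secret", secret_key="secret1")

# request a single GPU for the pod
fn.gpus(1)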


Function of type job#

You can deploy a model using a job type function, which runs the code in a Kubernetes Pod.

You can create (register) a job function with basic attributes such as code, requirements, image, etc. using the set_function() method. You can also import an existing job function/template from the Function hub .

Functions can be created from a single code file or notebook file, or can have access to the entire project context directory. (By adding the with_repo=True flag, the project context is cloned into the function runtime environment.)

Examples:

# register a (single) python file as a function
project.set_function('src/data_prep.py', name='data-prep', image='mlrun/mlrun', handler='prep', kind="job")

# register a notebook file as a function, specify custom image and extra requirements 
project.set_function('src/mynb.ipynb', name='test-function', image="my-org/my-image",
                      handler="run_test", requirements=["scikit-learn"], kind="job")

# register a module.handler as a function (requires defining the default sources/work dir, if it's not root)
project.spec.workdir = "src"
project.set_function(name="train", handler="training.train",  image="mlrun/mlrun", kind="job", with_repo=True)

To run the job:

project.run_function("train")


Function of type serving#

Deploying models in MLRun uses the function type serving. You can create a serving function using the set_function() call from a notebook. You can also import an existing serving function/template from the Function hub .

Creating a basic serving model using Scikit-learn#

The following code shows how to create a basic serving model using Scikit-learn.

import os
import urllib.request
import mlrun

model_path = os.path.abspath('sklearn.pkl')

# Download the model file locally
urllib.request.urlretrieve(mlrun.get_sample_path('models/serving/sklearn.pkl'), model_path)

# Set the base project name
project_name_base = 'serving-project'

# Initialize the MLRun project object
project = mlrun.get_or_create_project(project_name_base, context="./", user_project=True)

serving_function_image = "mlrun/mlrun"
serving_model_class_name = "mlrun.frameworks.sklearn.SklearnModelServer"

# Create a serving function
serving_fn = mlrun.new_function("serving", project=project.name, kind="serving", image=serving_function_image)

# Add a model, the model key can be anything we choose. The class will be the built-in scikit-learn model server class
model_key = "scikit-learn"
serving_fn.add_model(key=model_key,
                    model_path=model_path,
                    class_name=serving_model_class_name)

After the serving function is created, you can test it:

# Test data to send
my_data = {"inputs":[[5.1, 3.5, 1.4, 0.2],[7.7, 3.8, 6.7, 2.2]]}

# Create a mock server in order to test the model
mock_server = serving_fn.to_mock_server()

# Test the serving function
mock_server.test(f"/v2/models/{model_key}/infer", body=my_data)

Similarly, you can deploy the serving function and test it with some data:

# Deploy the serving function
serving_fn.apply(mlrun.auto_mount()).deploy()

# Check the result using the deployed serving function
serving_fn.invoke(path=f'/v2/models/{model_key}/infer',body=my_data)
Using GIT with a serving function#

This example illustrates how to use Git with a serving function:

project = mlrun.get_or_create_project("serving-git", "./")
project.set_source(source="git://github.com/<username>/<repo>.git#main", pull_at_runtime=True)
function = project.set_function(name="serving", kind="serving", with_repo=True, func=<python-file>, image="mlrun/mlrun")
function.add_model("serve", <model_path>, class_name="MyClass")
project.deploy_function(function="serving")


Dask distributed runtime#


Dask overview#

Source: Dask docs
Dask is a flexible library for parallel computing in Python.

Dask is composed of two parts:

  1. Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.

  2. “Big Data” collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of dynamic task schedulers.

Dask emphasizes the following virtues:

  • Familiar: Provides parallelized NumPy array and Pandas DataFrame objects

  • Flexible: Provides a task scheduling interface for more custom workloads and integration with other projects.

  • Native: Enables distributed computing in pure Python with access to the PyData stack.

  • Fast: Operates with low overhead, low latency, and minimal serialization necessary for fast numerical algorithms

  • Scales up: Runs resiliently on clusters with 1000s of cores

  • Scales down: Trivial to set up and run on a laptop in a single process

  • Responsive: Designed with interactive computing in mind, it provides rapid feedback and diagnostics to aid humans
Dask collections and schedulers

Dask DataFrame mimics Pandas#
# Pandas
import pandas as pd
df = pd.read_csv('2015-01-01.csv')
df.groupby(df.user_id).value.mean()

# Dask
import dask.dataframe as dd
df = dd.read_csv('2015-*-*.csv')
df.groupby(df.user_id).value.mean().compute()

Dask Array mimics NumPy - documentation

# NumPy
import numpy as np
f = h5py.File('myfile.hdf5')
x = np.array(f['/small-data'])
x - x.mean(axis=1)

# Dask
import dask.array as da
x = da.from_array(f['/big-data'], chunks=(1000, 1000))
x - x.mean(axis=1).compute()

Dask Bag mimics iterators, Toolz, and PySpark - documentation

import dask.bag as db
b = db.read_text('2015-*-*.json.gz').map(json.loads)
b.pluck('name').frequencies().topk(10, lambda pair: pair[1]).compute()

Dask Delayed mimics for loops and wraps custom code - documentation

from dask import delayed
L = []
for fn in filenames:                  # Use for loops to build up computation
    data = delayed(load)(fn)          # Delay execution of function
    L.append(delayed(process)(data))  # Build connections between variables

result = delayed(summarize)(L)
result.compute()

The concurrent.futures interface provides general submission of custom tasks - documentation

from dask.distributed import Client
client = Client('scheduler:port')

futures = []
for fn in filenames:
    future = client.submit(load, fn)
    futures.append(future)

summary = client.submit(summarize, futures)
summary.result()
Dask.distributed#

Dask.distributed is a lightweight library for distributed computing in Python. It extends both the concurrent.futures and dask APIs to moderate sized clusters.

_images/dask_dist.png

Motivation#

Distributed serves to complement the existing PyData analysis stack. In particular it meets the following needs:

  • Low latency: Each task suffers about 1ms of overhead. A small computation and network roundtrip can complete in less than 10ms.

  • Peer-to-peer data sharing: Workers communicate with each other to share data. This removes central bottlenecks for data transfer.

  • Complex Scheduling: Supports complex workflows (not just map/filter/reduce) which are necessary for sophisticated algorithms used in nd-arrays, machine learning, image processing, and statistics.

  • Pure Python: Built in Python using well-known technologies. This eases installation, improves efficiency (for Python users), and simplifies debugging.

  • Data Locality: Scheduling algorithms cleverly execute computations where data lives. This minimizes network traffic and improves efficiency.

  • Familiar APIs: Compatible with the concurrent.futures API in the Python standard library. Compatible with dask API for parallel algorithms

  • Easy Setup: As a Pure Python package distributed is pip installable and easy to set up on your own cluster.

Architecture#

Dask.distributed is a centrally managed, distributed, dynamic task scheduler. The central dask-scheduler process coordinates the actions of several dask-worker processes spread across multiple machines and the concurrent requests of several clients.

The scheduler is asynchronous and event driven, simultaneously responding to requests for computation from multiple clients and tracking the progress of multiple workers. The event-driven and asynchronous nature makes it flexible to concurrently handle a variety of workloads coming from multiple users at the same time while also handling a fluid worker population with failures and additions. Workers communicate amongst each other for bulk data transfer over TCP.

Internally the scheduler tracks all work as a constantly changing directed acyclic graph of tasks. A task is a Python function operating on Python objects, which can be the results of other tasks. This graph of tasks grows as users submit more computations, fills out as workers complete tasks, and shrinks as users leave or become disinterested in previous results.

Users interact by connecting a local Python session to the scheduler and submitting work, either by individual calls to the simple interface client.submit(function, *args, **kwargs) or by using the large data collections and parallel algorithms of the parent dask library. The collections in the dask library like dask.array and dask.dataframe provide easy access to sophisticated algorithms and familiar APIs like NumPy and Pandas, while the simple client.submit interface provides users with custom control when they want to break out of canned “big data” abstractions and submit fully custom workloads.

~5X Faster with Dask#

This short example demonstrates the power of Dask. In this notebook you will perform the following:

  • Generate random text files

  • Process the files by sorting and counting their content

  • Compare run times

Generate random text files#
import random
import string
import os

from collections import Counter
from dask.distributed import Client

import warnings

warnings.filterwarnings("ignore")
def generate_big_random_letters(filename, size):
    """
    generate big random letters/alphabets to a file
    :param filename: the filename
    :param size: the size in bytes
    :return: void
    """
    chars = "".join([random.choice(string.ascii_letters) for i in range(size)])  # 1

    with open(filename, "w") as f:
        f.write(chars)
    pass
PATH = "/User/howto/dask/random_files"
SIZE = 10000000

for i in range(100):
    generate_big_random_letters(filename=PATH + "/file_" + str(i) + ".txt", size=SIZE)
Set function for benchmark#
def count_letters(path):
    """
    count letters in text file
    :param path:  path to file
    """
    # open file in read mode
    file = open(path, "r")

    # read the content of file
    data = file.read()

    # sort file
    sorted_file = sorted(data)

    # count file
    number_of_characters = len(sorted_file)

    return number_of_characters
def process_files(path):
    """
    list file and count letters
    :param path: path to folder with files
    """
    num_list = []
    files = os.listdir(path)

    for file in files:
        cnt = count_letters(os.path.join(path, file))
        num_list.append(cnt)

    print("done!")
    return num_list
Sort & count number of letters with Python#
%%time
PATH = "/User/howto/dask/random_files/"
process_files(PATH)
done!
CPU times: user 2min 19s, sys: 9.31 s, total: 2min 29s
Wall time: 2min 32s
Sort & count number of letters with Dask#
# create a dask client (connects to a cluster, or starts a local one by default)
client = Client()
# list all files in folder
files = [PATH + x for x in os.listdir(PATH)]
%%time
# run the count_letter function on a list of files while using multiple workers
a = client.map(count_letters, files)
CPU times: user 13.2 ms, sys: 983 µs, total: 14.2 ms
Wall time: 12.2 ms
%%time
# gather results
l = client.gather(a)
CPU times: user 3.39 s, sys: 533 ms, total: 3.92 s
Wall time: 40 s
Additional topics#
Running Dask on the cluster with MLRun#

Note

Dask is supported at the Tech Preview level only.

The Dask framework enables you to parallelize your Python code and run it as a distributed process on an Iguazio cluster, dramatically accelerating performance.
In this notebook you'll learn how to create a Dask cluster and then an MLRun function running as a Dask client.
It also demonstrates how to parallelize a custom algorithm using the Dask Delayed option.

For more information on Dask over Kubernetes: https://kubernetes.dask.org/en/latest/.

Set up the environment#
# set mlrun api path and artifact path for logging
import mlrun

project = mlrun.get_or_create_project("dask-demo", "./")
> 2023-02-19 07:48:52,191 [info] Created and saved project dask-demo: {'from_template': None, 'overwrite': False, 'context': './', 'save': True}
> 2023-02-19 07:48:52,194 [info] created project dask-demo and saved in MLRun DB
Create and start Dask cluster#

Dask functions can be local (local workers) or remote (containers in the cluster). In the remote case you can specify the number of replicas (optional) or leave it blank for auto-scale.
Use new_function() to define the Dask cluster and set the desired configuration of that clustered function.

If the Dask workers need to access the shared file system, apply a shared volume mount (e.g. via v3io mount).

The Dask function spec has several unique attributes (in addition to the standard job attributes):

  • .remote — bool, use local or clustered dask

  • .replicas — number of desired replicas, keep 0 for auto-scale

  • .min_replicas, .max_replicas — set replicas range for auto-scale

  • .scheduler_timeout — cluster is killed after timeout (inactivity), default is '60 minutes'

  • .nthreads — number of worker threads

If you want to access the Dask dashboard or scheduler from remote, you need to use the NodePort service type (set .service_type to 'NodePort'), and the external IP needs to be specified in the MLRun configuration (mlconf.remote_host). This is set automatically if you are running on an Iguazio cluster.
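
For example, a minimal sketch of pointing the MLRun client at the cluster's external address (the address is illustrative; on an Iguazio cluster this is configured automatically):

from mlrun import mlconf

mlconf.remote_host = "<cluster-external-ip>"  # illustrative; needed only for remote dashboard/scheduler access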

Specify the kind (dask) and the container image:

# create an mlrun function that will init the dask cluster
dask_cluster_name = "dask-cluster"
dask_cluster = mlrun.new_function(dask_cluster_name, kind="dask", image="mlrun/mlrun")
dask_cluster.apply(mlrun.mount_v3io())
<mlrun.runtimes.daskjob.DaskCluster at 0x7f0dabf52460>
# set range for # of replicas with min_replicas and max_replicas
dask_cluster.spec.min_replicas = 1
dask_cluster.spec.max_replicas = 4

# set the use of dask remote cluster (distributed)
dask_cluster.spec.remote = True
dask_cluster.spec.service_type = "NodePort"

# set dask worker memory and cpu requests
dask_cluster.with_worker_requests(mem="2G", cpu="2")
Initialize the Dask Cluster#

When you request the dask cluster client attribute, it verifies that the cluster is up and running:

# init dask client and use the scheduler address as param in the following cell
dask_cluster.client
> 2023-02-19 07:49:07,462 [info] trying dask client at: tcp://mlrun-dask-cluster-bae5cf76-0.default-tenant:8786
> 2023-02-19 07:49:07,516 [info] using remote dask scheduler (mlrun-dask-cluster-bae5cf76-0) at: tcp://mlrun-dask-cluster-bae5cf76-0.default-tenant:8786

Client: Client-e3759c00-b029-11ed-86b2-6684fa230d0c
Connection method: Direct
Dashboard: http://mlrun-dask-cluster-bae5cf76-0.default-tenant:8787/status
Scheduler: Scheduler-7641adf7-a399-4465-869c-d479318d6835 (comm: tcp://10.200.196.73:8786, dashboard: http://10.200.196.73:8787/status, workers: 0, total threads: 0, total memory: 0 B)

Creating a function that runs over Dask#
# mlrun: start-code

Import mlrun and dask. Nuclio is only used to convert the code into an MLRun function.

import mlrun
from dask.distributed import Client
from dask import delayed
from dask import dataframe as dd

import warnings
import numpy as np
import os

warnings.filterwarnings("ignore")
Python function code#

This simple function reads a .csv file using dask dataframe. It runs the groupby and describe functions on the dataset, and stores the results as a dataset artifact.

def test_dask(
    context, dataset: mlrun.DataItem, client=None, dask_function: str = None
) -> None:
    # setup dask client from the MLRun dask cluster function
    if dask_function:
        client = mlrun.import_function(dask_function).client
    elif not client:
        client = Client()

    # load the dataitem as dask dataframe (dd)
    df = dataset.as_df(df_module=dd)

    # run describe (get statistics for the dataframe) with dask
    df_describe = df.describe().compute()

    # run groupby and count using dask
    df_grpby = df.groupby("VendorID").count().compute()

    context.log_dataset("describe", df=df_grpby, format="csv", index=True)
    return
# mlrun: end-code
Test the function over Dask#
Load sample data#
DATA_URL = "/User/examples/ytrip.csv"
!mkdir -p /User/examples/
!curl -L "https://s3.wasabisys.com/iguazio/data/Taxi/yellow_tripdata_2019-01_subset.csv" > {DATA_URL}
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 84.9M  100 84.9M    0     0  7136k      0  0:00:12  0:00:12 --:--:-- 6371k
Convert the code to MLRun function#

Use code_to_function to convert the code to MLRun and specify the configuration for the Dask process (e.g. replicas, memory etc.).
Note that the resource configurations are per worker.

# mlrun transforms the code above (up to the mlrun: end-code cell) into a serverless function
# that runs in k8s pods
fn = mlrun.code_to_function("test_dask", kind="job", handler="test_dask").apply(
    mlrun.mount_v3io()
)
Run the function#
# function URI is db://<project>/<name>
dask_uri = f"db://{project.name}/{dask_cluster_name}"
r = fn.run(
    handler=test_dask,
    inputs={"dataset": DATA_URL},
    params={"dask_function": dask_uri},
    auto_build=True,
)
> 2023-02-19 07:49:27,208 [info] starting run test-dask-test_dask uid=a30942af70f347488daf4f653afd6c63 DB=http://mlrun-api:8080
> 2023-02-19 07:49:27,361 [info] Job is running in the background, pod: test-dask-test-dask-dqdln
Names with underscore '_' are about to be deprecated, use dashes '-' instead. Replacing underscores with dashes.
> 2023-02-19 07:49:35,137 [info] trying dask client at: tcp://mlrun-dask-cluster-bae5cf76-0.default-tenant:8786
> 2023-02-19 07:49:35,163 [info] using remote dask scheduler (mlrun-dask-cluster-bae5cf76-0) at: tcp://mlrun-dask-cluster-bae5cf76-0.default-tenant:8786
remote dashboard: default-tenant.app.vmdev94.lab.iguazeng.com:31886
> 2023-02-19 07:49:45,383 [info] To track results use the CLI: {'info_cmd': 'mlrun get run a30942af70f347488daf4f653afd6c63 -p dask-demo', 'logs_cmd': 'mlrun logs a30942af70f347488daf4f653afd6c63 -p dask-demo'}
> 2023-02-19 07:49:45,384 [info] Or click for UI: {'ui_url': 'https://dashboard.default-tenant.app.vmdev94.lab.iguazeng.com/mlprojects/dask-demo/jobs/monitor/a30942af70f347488daf4f653afd6c63/overview'}
> 2023-02-19 07:49:45,384 [info] run executed, status=completed
final state: completed
project: dask-demo | iter: 0 | start: Feb 19 07:49:35 | state: completed | name: test-dask-test_dask
labels: v3io_user=dani, kind=job, owner=dani, mlrun/client_version=1.3.0-rc23, mlrun/client_python_version=3.9.16, host=test-dask-test-dask-dqdln
inputs: dataset | parameters: dask_function=db://dask-demo/dask-cluster | artifacts: describe

> to track results use the .show() or .logs() methods or click here to open in UI
> 2023-02-19 07:49:45,730 [info] run executed, status=completed
Track the progress in the UI#

You can view the progress and detailed information in the MLRun UI by clicking on the uid above.
To track the dask progress: in the Dask UI click the "dashboard link" above the "client" section.

Pipelines using Dask, Kubeflow and MLRun#
Create a project to host functions, jobs and artifacts#

Projects are used to package multiple functions, workflows, and artifacts. Project code and definitions are usually stored in a Git archive.

The following code creates a new project in a local dir and initializes git tracking on it.

import os
import mlrun
import warnings

warnings.filterwarnings("ignore")

# set project name and dir
project_name = "sk-project-dask"
project_dir = "./"

# set project
sk_dask_proj = mlrun.get_or_create_project(project_name, project_dir, init_git=True)
> 2022-09-27 17:26:14,808 [info] loaded project sk-project-dask from MLRun DB
Init Dask cluster#
import mlrun

# set up function from local file
dsf = mlrun.new_function(name="mydask", kind="dask", image="mlrun/mlrun")

# set up function specs for dask
dsf.spec.remote = True
dsf.spec.replicas = 5
dsf.spec.service_type = "NodePort"
dsf.with_limits(mem="6G")
dsf.spec.nthreads = 5
# apply mount_v3io over the function so that the k8s pod that runs the function
# can access the data (shared data access)
dsf.apply(mlrun.mount_v3io())
<mlrun.runtimes.daskjob.DaskCluster at 0x7f47fce9c850>
dsf.save()
'db://sk-project-dask/mydask'
# init dask cluster
dsf.client
> 2022-09-27 17:26:25,134 [info] trying dask client at: tcp://mlrun-mydask-d7df9301-d.default-tenant:8786
> 2022-09-27 17:26:25,162 [info] using remote dask scheduler (mlrun-mydask-d7df9301-d) at: tcp://mlrun-mydask-d7df9301-d.default-tenant:8786

Client: Client-83392da2-3e89-11ed-b7e8-82a5d7054c46
Connection method: Direct
Dashboard: http://mlrun-mydask-d7df9301-d.default-tenant:8787/status
Scheduler: Scheduler-b8468d53-b900-4041-9982-5e14d5e5eb81 (comm: tcp://10.200.152.178:8786, dashboard: http://10.200.152.178:8787/status, workers: 0, total threads: 0, total memory: 0 B)

Load and run functions#

Load the function objects from a .py or .yaml file, or from the Function Hub (marketplace).

# load function from the Function Hub
sk_dask_proj.set_function("hub://describe", name="describe")
sk_dask_proj.set_function("hub://sklearn_classifier_dask", name="dask_classifier")
<mlrun.runtimes.kubejob.KubejobRuntime at 0x7f48353d5130>
Create a fully automated ML pipeline#
Add more functions to the project to be used in the pipeline (from the Function Hub)#

Describe the data, then train and evaluate the model with Dask.

Define and save a pipeline#

The following workflow definition is written into a file. It describes a Kubeflow execution graph (DAG) and how functions and data are connected to form an end-to-end pipeline.

  • Describe data.

  • Train, test and evaluate with dask.

Check the code below to see how function objects are initialized and used (by name) inside the workflow.
The workflow.py file has two parts: initializing the function objects and defining the pipeline DSL (connecting the function inputs and outputs).

Note: The pipeline can include CI steps like building container images and deploying models as illustrated in the following example.

%%writefile workflow.py
import os
from kfp import dsl
import mlrun

# params
funcs = {}
LABELS = "label"
DROP = "congestion_surcharge"
DATA_URL = mlrun.get_sample_path("data/iris/iris_dataset.csv")
DASK_CLIENT = "db://sk-project-dask/mydask"


# init functions are used to configure function resources and local settings
def init_functions(functions: dict, project=None, secrets=None):
    for f in functions.values():
        f.apply(mlrun.mount_v3io())
        pass


@dsl.pipeline(name="Demo training pipeline", description="Shows how to use mlrun")
def kfpipeline():
    # Describe the data
    describe = funcs["describe"].as_step(
        inputs={"table": DATA_URL},
        params={"dask_function": DASK_CLIENT},
    )

    # Train, test and evaluate:
    train = funcs["dask_classifier"].as_step(
        name="train",
        handler="train_model",
        inputs={"dataset": DATA_URL},
        params={
            "label_column": LABELS,
            "dask_function": DASK_CLIENT,
            "test_size": 0.10,
            "model_pkg_class": "sklearn.ensemble.RandomForestClassifier",
            "drop_cols": DROP,
        },
        outputs=["model", "test_set"],
    )
    train.after(describe)
Overwriting workflow.py
# register the workflow file as "main", embed the workflow code into the project YAML
sk_dask_proj.set_workflow("main", "workflow.py", embed=False)

Save the project definitions to a file (project.yaml). It is recommended to commit all changes to a Git repo.

sk_dask_proj.save()
<mlrun.projects.project.MlrunProject at 0x7f48342e4880>

Run a pipeline workflow#

Use the run method to execute a workflow. You can provide alternative arguments and specify the default target for workflow artifacts.
The workflow ID is returned and can be used to track the progress or you can use the hyperlinks.

Note: The same command can be issued through CLI commands:
mlrun project my-proj/ -r main -p "v3io:///users/admin/mlrun/kfp/{{workflow.uid}}/"

The dirty flag lets you run a project with uncommitted changes (when the notebook is in the same git dir it is always dirty).
The watch flag waits for the pipeline to complete and print results.

artifact_path = os.path.abspath("./pipe/{{workflow.uid}}")
run_id = sk_dask_proj.run(
    "main", arguments={}, artifact_path=artifact_path, dirty=False, watch=True
)
Pipeline running (id=631ad0a3-19f1-4df0-bfa7-6c38c60275e0), click here to view the details in MLRun UI

Run Results

Workflow 631ad0a3-19f1-4df0-bfa7-6c38c60275e0 finished, state=Succeeded

train (completed, Sep 27 17:27:09)
  parameters: label_column=label, dask_function=db://sk-project-dask/mydask, test_size=0.1, model_pkg_class=sklearn.ensemble.RandomForestClassifier, drop_cols=congestion_surcharge
  results: micro=0.9944598337950138, macro=0.9945823158323159, precision-0=1.0, precision-1=0.9166666666666666, precision-2=0.8, recall-0=1.0, recall-1=0.7857142857142857, recall-2=0.9230769230769231, f1-0=1.0, f1-1=0.8461538461538461, f1-2=0.8571428571428571

describe (completed, Sep 27 17:26:42)
  parameters: dask_function=db://sk-project-dask/mydask


Databricks runtime#

The databricks runtime runs on a Databricks cluster (and not in the Iguazio cluster). The function starts a pod in the MLRun cluster, which communicates with the Databricks cluster. The requests originate in MLRun and all computing runs in the Databricks cluster.

With the databricks runtime, you can send your local file/code as a string to the job, and use a handler as an endpoint for user code. You can optionally send keyword arguments (kwargs) to this job.

You can run the function on:

  • An existing cluster, by including DATABRICKS_CLUSTER_ID

  • A job compute cluster, created and dedicated for this function only.

Params that are not related to a new cluster or an existing cluster:

  • timeout_minutes

  • token_key

  • artifact_json_dir (location where the json file that contains all logged mlrun artifacts is saved, and which is deleted after the run)

Params that are related to a new cluster:

  • spark_version

  • node_type_id

  • num_workers

Example of a job compute cluster#

To create a job compute cluster, omit DATABRICKS_CLUSTER_ID, and set the cluster specs by using the task parameters when running the function. For example:

params['task_parameters'] = {'new_cluster_spec': {'node_type_id': 'm5d.large'}, 'number_of_workers': 2, 'timeout_minutes': 15, 'token_key': '<non-default-value>'}

Do not send variables named task_parameters or context since these are utilized by the internal processes of the runtime.

Example of running a Databricks job from a local file#

This example uses an existing cluster: DATABRICKS_CLUSTER_ID.

import os
import mlrun
from mlrun.runtimes.function_reference import FunctionReference
# If using a Databricks data store, for example, set the credentials:
os.environ["DATABRICKS_HOST"] = "DATABRICKS_HOST"
os.environ["DATABRICKS_TOKEN"] = "DATABRICKS_TOKEN"
os.environ["DATABRICKS_CLUSTER_ID"] = "DATABRICKS_CLUSTER_ID"
def add_databricks_env(function):
    job_env = {
        "DATABRICKS_HOST": os.environ["DATABRICKS_HOST"],
        "DATABRICKS_CLUSTER_ID": os.environ.get("DATABRICKS_CLUSTER_ID"),
    }

    for name, val in job_env.items():
        function.spec.env.append({"name": name, "value": val})
project_name = "databricks-runtime-project"
project = mlrun.get_or_create_project(project_name, context="./", user_project=False)

secrets = {"DATABRICKS_TOKEN": os.environ["DATABRICKS_TOKEN"]}

project.set_secrets(secrets)

code = """
def print_kwargs(**kwargs):
    print(f"kwargs: {kwargs}")
"""

function_ref = FunctionReference(
    kind="databricks",
    code=code,
    image="mlrun/mlrun",
    name="databricks-function",
)

function = function_ref.to_function()

add_databricks_env(function=function)

run = function.run(
    handler="print_kwargs",
    project=project_name,
    params={
        "param1": "value1",
        "param2": "value2",
        "task_parameters": {"timeout_minutes": 15},
    },
)
Logging a Databricks response as an artifact#
import os

import numpy as np
import pandas as pd
from pyspark.sql import SparkSession


def main():
    df = pd.DataFrame({"A": np.random.randint(1, 100, 5), "B": np.random.rand(5)})
    path = "/dbfs/path/folder"
    parquet_df_path = f"{path}/df.parquet"
    csv_df_path = f"{path}/df.csv"

    if not os.path.exists(path):
        os.makedirs(path)

    # save df
    df.to_parquet(parquet_df_path)
    df.to_csv(csv_df_path, index=False)

    # log artifact
    mlrun_log_artifact("parquet_artifact", parquet_df_path)
    mlrun_log_artifact("csv_artifact", csv_df_path)

    # spark
    spark = SparkSession.builder.appName("example").getOrCreate()
    spark_df = spark.createDataFrame(df)

    # spark path format:
    spark_parquet_path = "dbfs:///path/folder/spark_df.parquet"
    spark_df.write.mode("overwrite").parquet(spark_parquet_path)
    mlrun_log_artifact("spark_artifact", spark_parquet_path)

    # an illegal artifact does not raise an error, it logs an error log instead, for example:
    # mlrun_log_artifact("illegal_artifact", "/not_exists_path/illegal_df.parquet")
function = mlrun.code_to_function(
    name="databricks-log_artifact",
    kind="databricks",
    project=project_name,
    filename="./databricks_job.py",
    image="mlrun/mlrun",
)
add_databricks_env(function=function)
run = function.run(
    handler="main",
    project=project_name,
)
project.list_artifacts()
MPIJob and Horovod runtime#
Running distributed workloads#

Training a Deep Neural Network is a hard task. With growing datasets and wider and deeper networks, training a Neural Network can require a lot of resources (CPUs / GPUs / memory and time).

There are two main reasons why we would like to distribute our Deep Learning workloads:

  1. Model Parallelism — The Model is too big to fit a single GPU.
    In this case the model contains too many parameters to hold within a single GPU.
    To negate this we can use strategies like Parameter Server or slicing the model into slices of consecutive layers which we can fit in a single GPU.
    Both strategies require Synchronization between the layers held on different GPUs / Parameter Server shards.

  2. Data Parallelism — The Dataset is too big to fit a single GPU.
    Using methods like Stochastic Gradient Descent we can send batches of data to our models for gradient estimation. This comes at the cost of longer time to converge since the estimated gradient may not fully represent the actual gradient.
    To increase the likelihood of estimating the actual gradient we could use bigger batches, by sending small batches to different GPUs running the same Neural Network, calculating the batch gradient and then running a Synchronization Step to calculate the average gradient over the batches and update the Neural Networks running on the different GPUs.

It is important to understand that the act of distribution adds extra Synchronization Costs which may vary according to your cluster's configuration.

As the gradients and NN need to be propagated to each GPU in the cluster every epoch (or a number of steps), networking can become a bottleneck, and sometimes different configurations need to be used for optimal performance.

Scaling efficiency is the metric used to show how much each additional GPU benefits the training process, with Horovod showing up to 90% (when running well-written code with good parameters).

Horovod scaling

How can we distribute our training?#

There are two different cluster configurations (which can be combined) we need to take into account.

  • Multi Node — GPUs are distributed over multiple nodes in the cluster.

  • Multi GPU — GPUs are within a single Node.

In this demo we show Multi Node, Multi GPU, Data Parallel enabled training using Horovod.
However, you should always try to use the best distribution strategy for your use case (due to the added costs of the distribution itself, the ability to run in an optimized way on specific hardware, or other considerations that may arise).

How does Horovod work?#

Horovod's primary motivation is to make it easy to take a single-GPU training script and successfully scale it to train across many GPUs in parallel. This has two aspects:

  • How much modification does one have to make to a program to make it distributed, and how easy is it to run it?

  • How much faster would it run in distributed mode?

Horovod Supports TensorFlow, Keras, PyTorch, and Apache MXNet.

In MLRun, Horovod is used with MPI in order to create cluster resources and allow for optimized networking.
Note: Horovod and MPI may use NCCL when applicable, which may require some specific configuration arguments to run optimally.

Horovod uses these MPI and NCCL concepts for distributed computation and messaging to quickly and easily synchronize between the different nodes or GPUs.

Ring Allreduce Strategy

Horovod will run your code on all the given nodes (Specific node can be addressed via hvd.rank()) while using an hvd.DistributedOptimizer wrapper to run the synchronization cycles between the copies of your Neural Network running at each node.

Note: Since all the copies of your Neural Network must be the same, your workers adjust themselves to the rate of the slowest worker (simply by waiting for it to finish the epoch and receive its updates). Therefore, try not to make a specific worker do a lot of additional work on each epoch (such as a lot of saving or extra calculations), since this can affect the overall training time.

How do we integrate TF2 with Horovod?#

As it's one of the main motivations, integration is fairly easy and requires only a few steps: (You can read the full instructions for all the different frameworks on Horovod's documentation website).

  1. Run hvd.init().

  2. Pin each GPU to a single process. With the typical setup of one GPU per process, set this to local rank. The first process on the server will be allocated the first GPU, the second process will be allocated the second GPU, and so forth.

gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')
  3. Scale the learning rate by the number of workers.
    Effective batch size in synchronous distributed training is scaled by the number of workers. An increase in learning rate compensates for the increased batch size.

  4. Wrap the optimizer in hvd.DistributedOptimizer.
    The distributed optimizer delegates gradient computation to the original optimizer, averages gradients using allreduce or allgather, and then applies those averaged gradients.
    For TensorFlow v2, when using a tf.GradientTape, wrap the tape in hvd.DistributedGradientTape instead of wrapping the optimizer.

  5. Broadcast the initial variable states from rank 0 to all other processes.
    This is necessary to ensure consistent initialization of all workers when training is started with random weights or restored from a checkpoint.
    For TensorFlow v2, use hvd.broadcast_variables after models and optimizers have been initialized.

  6. Modify your code to save checkpoints only on worker 0 to prevent other workers from corrupting them.
    For TensorFlow v2, construct a tf.train.Checkpoint and only call checkpoint.save() when hvd.rank() == 0.
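
A minimal TF2/Keras sketch that combines the steps above (based on the standard Horovod Keras example; the model, optimizer, and checkpoint path are placeholders):

import tensorflow as tf
import horovod.tensorflow.keras as hvd

# 1. Initialize Horovod
hvd.init()

# 2. Pin each process to a single GPU (by local rank)
gpus = tf.config.experimental.list_physical_devices("GPU")
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], "GPU")

# 3. Scale the learning rate by the number of workers
opt = tf.keras.optimizers.SGD(learning_rate=0.001 * hvd.size())

# 4. Wrap the optimizer with the Horovod distributed optimizer
opt = hvd.DistributedOptimizer(opt)

model = tf.keras.Sequential([tf.keras.layers.Dense(10)])  # placeholder model
model.compile(optimizer=opt, loss="mse")

callbacks = [
    # 5. Broadcast initial variable states from rank 0 to all other workers
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),
]

# 6. Save checkpoints only on worker 0 to avoid corruption by other workers
if hvd.rank() == 0:
    callbacks.append(tf.keras.callbacks.ModelCheckpoint("./checkpoint-{epoch}.h5"))

# model.fit(train_data, callbacks=callbacks, ...)  # training data omitted in this sketch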

You can go to Horovod's Documentation to read more about horovod.

Image classification use case#

See the end to end Image Classification with Distributed Training Demo

Spark Operator runtime#

Note

The Spark runtimes spark and remote-spark do not support dbfs, http, or memory data stores.

The spark-on-k8s-operator allows Spark applications to be defined in a declarative manner and supports one-time Spark applications with SparkApplication and cron-scheduled applications with ScheduledSparkApplication.

When sending a request with MLRun to the Spark operator, the request contains your full application configuration including the code and dependencies to run (packaged as a docker image or specified via URIs), the infrastructure parameters, (e.g. the memory, CPU, and storage volume specs to allocate to each Spark executor), and the Spark configuration.

Kubernetes takes this request and starts the Spark driver in a Kubernetes pod (a k8s abstraction, just a docker container in this case). The Spark driver then communicates directly with the Kubernetes master to request executor pods, scaling them up and down at runtime according to the load if dynamic allocation is enabled. Kubernetes takes care of the bin-packing of the pods onto Kubernetes nodes (the physical VMs), and dynamically scales the various node pools to meet the requirements.

When using the Spark operator the resources are allocated per task, meaning that it scales down to zero when the task is done.

Memory limit

The Spark memory limit is calculated inside Spark based on the requests and memory overhead, and uses the spark.kubernetes.memoryOverheadFactor, which is set, by default, to 0.4. (See Running Spark on Kubernetes.) This results in higher memory than what you configure. To control the memory overhead, use:

func.spec.spark_conf["spark.driver.memoryOverhead"] = 0
func.spec.spark_conf["spark.executor.memoryOverhead"] = 0

where: 0 means no additional memory (in addition to what you configure); 100 means 100MiB of additional memory; "1g" means 1GiB of additional memory.

V3IO access

If your runtime should access V3IO, use with_igz_spark(). When calling func.with_igz_spark() the default spec and dependencies are defined.

WARNING

To avoid unexpected behavior, do not override these defaults.

The default spec is:

sj.spec.spark_conf
{'spark.eventLog.enabled': 'true',
'spark.eventLog.dir': 'file:///v3io/users/spark_history_server_logs'} #only added if there is a spark history server configured

And the default dependencies are:

{'jars': ['local:///spark/v3io-libs/v3io-hcfs_2.12.jar',
 'local:///spark/v3io-libs/v3io-spark3-streaming_2.12.jar',
 'local:///spark/v3io-libs/v3io-spark3-object-dataframe_2.12.jar',
 'local:///igz/java/libs/scala-library-2.12.14.jar',
 'local:///spark/jars/jmx_prometheus_javaagent-0.16.1.jar'],
'files': ['local:///igz/java/libs/v3io-pyspark.zip']}
Example of Spark function with Spark operator#
import mlrun
import os

# set up new spark function with spark operator
# command will use our spark code which needs to be located on our file system
# the name param can contain only lowercase letters (k8s convention)
read_csv_filepath = os.path.join(os.path.abspath("."), "spark_read_csv.py")
sj = mlrun.new_function(kind="spark", command=read_csv_filepath, name="sparkreadcsv")

# set spark driver config (gpu_type & gpus=<number_of_gpus>  supported too)
sj.with_driver_limits(cpu="1300m")
sj.with_driver_requests(cpu=1, mem="512m")

# set spark executor config (gpu_type & gpus=<number_of_gpus> are supported too)
sj.with_executor_limits(cpu="1400m")
sj.with_executor_requests(cpu=1, mem="512m")

# adds fuse, daemon & iguazio's jars support
sj.with_igz_spark()

# Alternately, move volume_mounts to driver and executor-specific fields and leave
# v3io mounts out of executor mounts if mount_v3io_to_executor=False
# sj.with_igz_spark(mount_v3io_to_executor=False)

# set spark driver volume mount
# sj.function.with_driver_host_path_volume("/host/path", "/mount/path")

# set spark executor volume mount
# sj.function.with_executor_host_path_volume("/host/path", "/mount/path")

# add python module
sj.with_requirements(["matplotlib"])

# Number of executors
sj.spec.replicas = 2

# add jars
# sj.spec.deps["jars"] += ["local:///<path to jar>"]
# Rebuilds the image with MLRun - needed in order to support logging artifacts etc.
sj.deploy()
# Run task while setting the artifact path on which the run artifacts (if any) will be saved
sj.run(artifact_path="/User")
Spark Code (spark_read_csv.py)#
from pyspark.sql import SparkSession
from mlrun import get_or_create_ctx

context = get_or_create_ctx("spark-function")

# build spark session
spark = SparkSession.builder.appName("Spark job").getOrCreate()

# read csv
df = spark.read.load('iris.csv', format="csv",
                     sep=",", header="true")

# sample for logging
df_to_log = df.describe().toPandas()

# log final report
context.log_dataset("df_sample",
                     df=df_to_log,
                     format="csv")
spark.stop()
Nuclio real-time functions#

Nuclio is a high-performance "serverless" framework focused on data, I/O, and compute intensive workloads. It is well integrated with popular data science tools, such as Jupyter and Kubeflow; supports a variety of data and streaming sources; and supports execution over CPUs and GPUs.

You can use Nuclio through a fully managed application service (in the cloud or on-prem) in the Iguazio MLOps Platform. MLRun serving utilizes serverless Nuclio functions to create multi-stage real-time pipelines.

The underlying Nuclio serverless engine uses a high-performance parallel processing engine that maximizes the utilization of CPUs and GPUs, supports 13 protocols and invocation methods (for example, HTTP, Cron, Kafka, Kinesis), and includes dynamic auto-scaling for HTTP and streaming. Nuclio and MLRun support the full life cycle, including auto-generation of micro-services, APIs, load-balancing, logging, monitoring, and configuration management—such that developers can focus on code, and deploy to production faster with minimal work.

Nuclio is extremely fast: a single function instance can process hundreds of thousands of HTTP requests or data records per second. To learn more about how Nuclio works, see the Nuclio architecture documentation.

Nuclio is secure: Nuclio is integrated with Kaniko to allow a secure and production-ready way of building Docker images at run time.

Read more in the Nuclio documentation and the open-source MLRun library.

Example of Nuclio function#

You can create your own Nuclio function, for example a data processing function. For every Nuclio function, by default, there is one worker. See Number of GPUs.

The following code illustrates an example of an MLRun function, of kind 'nuclio', that can be deployed to the cluster.

Create a file func.py with the code of the function:

def handler(context, event):
    return "Hello"

Create the project and the Nuclio function:

import mlrun
# Create the project
project = mlrun.get_or_create_project("nuclio-project", "./")
# Create a Nuclio function
project.set_function(
    func="func.py",
    image="mlrun/mlrun",
    kind="nuclio",
    name="nuclio-func",
    handler="handler",
)
# Save the function within the project
project.save()
# Deploy the function in the cluster
project.deploy_function("nuclio-func")
Nuclio API gateway#

This example demonstrates making an HTTP request to an HTTPS API Gateway of a Nuclio function using basic/access key authentication.

import mlrun
import nuclio
# Create a project
project = mlrun.get_or_create_project(
    "nuclio-api-gateway-example", context="./", user_project=True
)
# mlrun: start-code
def handler(context, event):
    return "test"
# mlrun: end-code
# Create a simple Nuclio function that gets basic authentication
basic_auth = project.set_function(
    name="basic-auth", handler="handler", image="mlrun/mlrun", kind="nuclio"
)
# Create a simple Nuclio function that gets access key authentication
access_key_auth = project.set_function(
    name="access-key", handler="handler", image="mlrun/mlrun", kind="nuclio"
)
project.save()
# Deploy the function
basic_auth.deploy()
access_key_auth.deploy()
Making an HTTP request using basic authentication#
  1. Create an API Gateway in the UI, with authentication basic. Set your desired username and password and choose the basic-auth nuclio function.

  2. Give it a name and copy the endpoint.

  3. Paste the endpoint after the https://.

  4. Change the username and password in the code below.

import requests
from base64 import b64encode


# Authorization token: Encode to Base64 format
# and then decode it to ASCII since python 3 stores it as a byte string
def basic_auth(username, password):
    token = b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"


# Enter your username and password here
username = "username"
password = "password"

# Enter your API Gateway endpoint here
basic_auth_api_gateway_path = "https://<API GATEWAY ENDPOINT>"

headers = {"Authorization": basic_auth(username, password)}
res = requests.get(url=basic_auth_api_gateway_path, headers=headers, verify=False)
print(res.text)
Making an HTTP request using access key authentication#
  1. Create an API Gateway in the UI, with authentication access key and choose the access-key Nuclio function.

  2. Give it a name and copy the endpoint.

  3. Paste the endpoint after the https://.

  4. In the UI, click the user's top right icon, then copy the access key from there.

  5. Change the access key in the code below.

# Enter your access key here
access_key = "some-access-key"

# Enter your API Gateway endpoint here
access_key_auth_api_gateway_path = "https://<API GATEWAY ENDPOINT>"

headers = {"Cookie": 'session=j:{"sid": "' + access_key + '"}'}
res = requests.get(url=access_key_auth_api_gateway_path, headers=headers, verify=False)
print(res.text)

Create and use functions#

Functions are the basic building blocks of MLRun. They are essentially Python objects that know how to run locally or on a Kubernetes cluster. This section covers how to create and customize an MLRun function, as well as common parameters across all functions.

Naming functions - best practice#

When you create a function, you specify its name. When you deploy a function using the SDK, MLRun appends the project name to the function name. Project names are limited to 63 characters, and you must ensure that the combined function-project name does not exceed 63 characters. A function created in the UI has a default limit of 56 characters, since MLRun prefixes Nuclio function names with "nuclio-" by default, bringing the total to 63 characters.
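
For example, a minimal sketch of a helper (not part of the MLRun API) that validates the combined length before you create a function, assuming the combined name is roughly <project>-<function> with an optional "nuclio-" prefix:

# Hypothetical helper: verify the combined name stays within the 63-character Kubernetes limit
def check_combined_name(project_name: str, function_name: str, nuclio: bool = False) -> str:
    prefix = "nuclio-" if nuclio else ""
    combined = f"{prefix}{project_name}-{function_name}"
    if len(combined) > 63:
        raise ValueError(
            f"'{combined}' is {len(combined)} characters long; "
            "the combined project/function name must not exceed 63 characters"
        )
    return combined

check_combined_name("my-project", "train-model")               # OK
check_combined_name("my-project", "train-model", nuclio=True)  # OK, still under 63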

Creating functions#

The recommended way to create an MLRun function is by using an MLRun project (see create and use projects). The general flow looks like the following:

project = mlrun.get_or_create_project(...)

fn = project.set_function(...)

When creating a function, there are three main scenarios, each described in its own section below: a single source file, multiple source files, or importing an existing function.

Note

Using the set_function method of an MLRun project allows for each of these scenarios in a transparent way. Depending on the source passed in, the project registers the function using the appropriate lower-level function. For specific use cases, you also have direct access to the lower-level functions new_function(), code_to_function(), and import_function().

Using set_function#

The MLRun project object has a method called set_function(), which is a one-size-fits-all way of creating an MLRun function. This method accepts a variety of sources including Python files, Jupyter Notebooks, Git repos, and more.

Note

The return value of set_function is your MLRun function. You can immediately run it or apply additional configurations like resources, scaling, etc. See Customizing functions for more details.

When using set_function there are a number of common parameters across all function types and creation scenarios. Consider the following example:

fn = project.set_function(
    name="my-function", tag="latest", func="my_function.py",
    image="mlrun/mlrun", kind="job", handler="train_model",
    requirements=["pandas==1.3.5"], with_repo=True
)
  • name: Name of your MLRun function within the given project. This is displayed in the MLRun UI together with the Kubernetes pod name.

  • tag: Tag for your function (much like a Docker image). Omitting this parameter defaults to latest.

  • func: What to run with the MLRun function. This can be a number of things including files (.py, .ipynb, .yaml, etc.), URIs (hub:// prefixed Function Hub URI, db:// prefixed MLRun DB URI), existing MLRun function objects, or None (for current .ipynb file).

  • image: Docker image to use when containerizing the piece of code. If you also specify the requirements parameter to build a new Docker image, the image parameter is used as the base image.

  • kind: Runtime the MLRun function uses. See Kinds of functions (runtimes) for the list of supported batch and real-time runtimes.

  • handler: Default function handler to invoke (e.g. a Python function within your code). This handler can also be overridden when executing the function. For example, project.run_function("func1", handler="f2") executes the f2 Python function, even if the default handler is f1.

  • requirements: Additional Python dependencies needed for the function to run. Using this parameter results in a new Docker image (using the image parameter as a base image). This can be a list of Python dependencies or a path to a requirements.txt file.

  • with_repo: Set to True if the function requires additional files or dependencies within a Git repo or archive file. This Git repo or archive file is specified on a project level via project.set_source(...), which the function consumes. If this parameter is omitted, the default is False.

Building images#

If your MLRun function requires additional libraries or files, you might need to build a new Docker image. You can do this by specifying a base image to use as the image, your requirements via requirements, and (optionally) your source code via with_repo=True (where the source is specified by project.set_source(...)). See more details about images in MLRun images and more information on when a build is required in Build function image.

Note

When using with_repo, the contents of the Git repo or archive are available in the current working directory of your MLRun function during runtime.

A good place to start is one of the default MLRun images:

  • mlrun/mlrun: An MLRun image includes preinstalled OpenMPI and other ML packages. Useful as a base image for simple jobs.

  • mlrun/mlrun-gpu: The same as mlrun/mlrun but for GPUs, including Open MPI.

Dockerfiles for the MLRun images can be found here.

Single source file#

The simplest way to create a function is to use a single file as the source. The code itself is embedded into the MLRun function object. This makes the function quite portable since it does not depend on any external files. You can use any source file supported by MLRun such as Python or Jupyter notebook.

Note

MLRun is not limited to Python. Files of type Bash, Go, etc. are also supported.

Python#

This is the simplest way to create a function out of a given piece of code. Simply pass in the path to the Python file relative to your project context directory.

fn = project.set_function(
    name="python", func="job.py",  kind="job",
    image="mlrun/mlrun", handler="handler"
)
Jupyter Notebook#

This is a great way to create a function out of a Jupyter Notebook. Just pass in the path to the Jupyter Notebook relative to your project context directory. You can use MLRun cell tags to specify which parts of the notebook should be included in the function.

Note

To ensure that the latest changes are included, make sure you save your notebook before creating/updating the function.

You can also create an MLRun function out of the current Jupyter Notebook you are running in. To do this, simply omit the func parameter in set_function.
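
For example, a minimal sketch (the name and handler are placeholders) that creates a function from the notebook you are currently working in:

# Omitting the func parameter uses the current .ipynb file as the function source
fn = project.set_function(
    name="my-notebook-func", kind="job",
    image="mlrun/mlrun", handler="handler",
)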

Multiple source files#

If your code requires additional files or external libraries, you need to use a source that supports multiple files such as Git, an archive (zip/tar), or V3IO file share. This approach (especially using a Git repo) pairs well with MLRun projects.

To do this, you must:

  • Provide with_repo=True when creating your function via project.set_function(...)

  • Set project source via project.set_source(source=...)

This instructs MLRun to load source code from the git repo/archive/file share associated with the project. There are two ways to load these additional files:

Load code from container#

The function is built once. This is the preferred approach for production workloads. For example:

project.set_source(source="git://github.com/mlrun/project-archive.git")

fn = project.set_function(
    name="myjob", handler="job_func.job_handler",
    image="mlrun/mlrun", kind="job", with_repo=True,
)

project.build_function(fn)
Load code at runtime#

The function pulls the source code at runtime. This is a simpler approach during development that allows for making code changes without re-building the image each time. For example:

archive_url = "https://s3.us-east-1.wasabisys.com/iguazio/project-archive/project-archive.zip"
project.set_source(source=archive_url, pull_at_runtime=True)

fn = project.set_function(
    name="nuclio", handler="nuclio_func:nuclio_handler",
    image="mlrun/mlrun", kind="nuclio", with_repo=True,
)
Load code at runtime using a non-default source#

If your project already has a default source, you can still use a different source at runtime or at build time for a specific function. Based on the previous example:

archive_url = "https://s3.us-east-1.wasabisys.com/iguazio/project-archive/project-archive.zip"
project.set_source(source=archive_url, pull_at_runtime=True)

fn = project.set_function(
    name="nuclio", handler="nuclio_func:nuclio_handler",
    image="mlrun/mlrun", kind="nuclio", with_repo=True,
)
fn.with_source_archive("https://s3.us-east-1.wasabisys.com/some/other/archive.zip")

See LocalRuntime.with_source_archive, KubejobRuntime.with_source_archive, RemoteRuntime.with_source_archive.

Import or use an existing function#

If you already have an MLRun function that you want to import, you can do so from multiple locations such as YAML, Function Hub, and MLRun DB.

Note

In the UI, running a batch job from an existing function executes the generated spec merged with the function spec. Therefore, if you remove an item from the function spec, for example env vars, it may reappear in the final job spec.

YAML#

MLRun functions can be exported to YAML files via fn.export(). These YAML files can then be imported via the following:

fn = project.set_function(name="import", func="function.yaml")
Function Hub#

Functions can also be imported from the MLRun Function Hub: simply import using the name of the function and the hub:// prefix:

Note

By default, the hub:// prefix points to the MLRun Function Hub. You can substitute your own repo. See Using a Git repo as a function hub.

fn = project.set_function(name="describe", func="hub://describe")
MLRun DB#

You can also import functions directly from the MLRun DB. These could be functions that have not been pushed to a git repo, archive, or Function Hub. Import via the name of the function and the db:// prefix:

fn = project.set_function(name="db", func="db://import")
MLRun function#

You can also directly use an existing MLRun function object. This is usually used when more granular control over function parameters is required (e.g. advanced parameters that are not supported by set_function()).

This example uses a real-time serving pipeline (graph).

fn = mlrun.new_function("serving", kind="serving", image="mlrun/mlrun")
graph = fn.set_topology("flow")
graph.to(name="double", handler="mylib.double") \
     .to(name="add3", handler="mylib.add3") \
     .to(name="echo", handler="mylib.echo").respond()

project.set_function(name="serving", func=fn, with_repo=True)
Customizing functions#

Once you have created your MLRun function, there are many customizations you can add, including memory/CPU/GPU resources, preemption mode, pod priority, node selection, and more. See full details in Configuring runs and functions.
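
For example, a brief sketch of common customizations on an existing function object fn (the resource values, node label, and priority class shown here are illustrative):

# Minimum and maximum resources for the function pods
fn.with_requests(mem="1G", cpu=1)
fn.with_limits(mem="2G", cpu=2)

# Node selection, pod priority, and preemption behavior
fn.with_node_selection(node_selector={"app.iguazio.com/node-group": "workers"})
fn.with_priority_class("igz-workload-medium")
fn.with_preemption_mode("prevent")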

Converting notebooks to function#

MLRun annotations are used to identify the code that needs to be converted into an MLRun function. They provide non-intrusive hints that indicate which parts of your notebook should be considered as the code of the function.

A code block starts with the # mlrun: start-code annotation and ends with the # mlrun: end-code annotation. Use the # mlrun: ignore annotation to exclude cells from the annotated code. Make sure that the annotated code includes everything required for the function to run.

# mlrun: start-code


def sub_handler():
    return "hello world"

The # mlrun: ignore annotation enables you to exclude the cell from the function code.

# mlrun: ignore

# the handler in the code section below will not call this sub_handler
def sub_handler():
    return "I will be ignored!"
def handler(context, event):
    return sub_handler()


# mlrun: end-code

Convert the function with mlrun.code_to_function and run the handler. Notice the returned value under results.

Note

Make sure to save the notebook before running mlrun.code_to_function so that the latest changes are reflected in the function.

from mlrun import code_to_function

some_function = code_to_function("some-function-name", kind="job", code_output=".")
some_function.run(name="some-function-name", handler="handler", local=True)
> 2021-11-01 07:42:44,930 [info] starting run some-function-name uid=742e7d6e930c48f3a2f1d6175e971455 DB=http://mlrun-api:8080

project: default | iter: 0 | start: Nov 01 07:42:45 | state: completed | name: some-function-name
labels: v3io_user=admin, kind=, owner=admin, host=jupyter-8459699595-z544v
results: return=hello world

> to track results use the .show() or .logs() methods or click here to open in UI
> 2021-11-01 07:42:45,214 [info] run executed, status=completed
<mlrun.model.RunObject at 0x7f3fc9ed81d0>

Named annotations#

The # mlrun: start-code and # mlrun: end-code annotations can be used to convert different code sections into different MLRun functions in the same notebook. To do so, add the name of the MLRun function to the end of the annotation, as shown in the example below.

# mlrun: start-code my-function-name


def handler(context, event):
    return "hello from my-function"


# mlrun: end-code my-function-name

Convert the function and run the handler. Notice which handler is used and the change in the returned value under results.

my_function = code_to_function("my-function-name", kind="job")
my_function.run(name="my-function-name", handler="handler", local=True)
> 2021-11-01 07:42:53,892 [info] starting run my-function-name uid=e4bbc3cae21042439cc1c3cb9631751c DB=http://mlrun-api:8080

project: default | iter: 0 | start: Nov 01 07:42:54 | state: completed | name: my-function-name
labels: v3io_user=admin, kind=, owner=admin, host=jupyter-8459699595-z544v
results: return=hello from my-function

> to track results use the .show() or .logs() methods or click here to open in UI
> 2021-11-01 07:42:54,137 [info] run executed, status=completed
<mlrun.model.RunObject at 0x7f3fc9ac71d0>

Note

Make sure to use the name given to the code_to_function parameter (name='my-function-name' in the example above) so that all relevant start-code and end-code annotations are included. If none of the annotations are marked with the function's name, all annotations without any name are used.

Multi section function#

You can use the # mlrun: start-code and # mlrun: end-code annotations multiple times in a notebook since the whole notebook is scanned. The annotations can be named, as in the following example, or they can be nameless. If you choose nameless annotations, remember that all nameless annotations in the notebook are used.

# mlrun: start-code multi-section-function-name

function_name = "multi-section-function-name"

# mlrun: end-code multi-section-function-name

Any code between those sections is not included:

function_name = "I will be ignored!"
# mlrun: start-code multi-section-function-name
def handler(context, event):
    return f"hello from {function_name}"
# mlrun: end-code multi-section-function-name
my_multi_section_function = code_to_function("multi-section-function-name", kind="job")
my_multi_section_function.run(
    name="multi-section-function-name", handler="handler", local=True
)
> 2021-11-01 07:43:05,587 [info] starting run multi-section-function-name uid=9ac6a0e977a54980b657bae067c2242a DB=http://mlrun-api:8080

project: default | iter: 0 | start: Nov 01 07:43:05 | state: completed | name: multi-section-function-name
labels: v3io_user=admin, kind=, owner=admin, host=jupyter-8459699595-z544v
results: return=hello from multi-section-function-name

> to track results use the .show() or .logs() methods or click here to open in UI
> 2021-11-01 07:43:05,834 [info] run executed, status=completed
<mlrun.model.RunObject at 0x7f3fc9a24e10>
Annotation's position in code cell#

# mlrun: start-code and # mlrun: end-code annotations are relative to their positions inside the code block. Notice how the assignments to function_name below # mlrun: end-code don't override the assignment between the annotations in the function's context.

# mlrun: start-code part-cell-function


def handler(context, event):
    return f"hello from {function_name}"


function_name = "part-cell-function"

# mlrun: end-code part-cell-function

function_name = "I will be ignored"
my_multi_section_function = code_to_function("part-cell-function", kind="job")
my_multi_section_function.run(name="part-cell-function", handler="handler", local=True)
> 2021-11-01 07:43:14,347 [info] starting run part-cell-function uid=5426e665c7bc4ba492e0a704c5555fb6 DB=http://mlrun-api:8080

project: default | iter: 0 | start: Nov 01 07:43:14 | state: completed | name: part-cell-function
labels: v3io_user=admin, kind=, owner=admin, host=jupyter-8459699595-z544v
results: return=hello from part-cell-function

> to track results use the .show() or .logs() methods or click here to open in UI
> 2021-11-01 07:43:14,628 [info] run executed, status=completed
<mlrun.model.RunObject at 0x7f3fc9a2bf50>
Guidelines#
  • Make sure that every # mlrun: start-code has a corresponding # mlrun: end-code before the next # mlrun: start-code in the notebook.

  • Only one MLRun function can have a nameless annotation per notebook.

  • Do not use multiple # mlrun: start-code or # mlrun: end-code annotations in a single code cell. Only the first appearance of each is used.

  • Using single annotations:

    • Use a # mlrun: start-code alone, and all code blocks from the annotation to the end of the notebook are included.

    • Use a # mlrun: end-code alone, and all code blocks from the beginning of the notebook to the annotation are included.

Attach storage to functions#

In the vast majority of cases, an MLRun function requires access to storage. This storage might be used to provide inputs to the function, including data-sets to process or data-streams that contain input events. Typically, storage is also used to store function outputs and result artifacts, for example, trained models or processed data-sets.

Since MLRun functions can be distributed and executed in Kubernetes pods, the storage used would typically be shared, and execution pods would need some added configuration options applied to them so that the function code is able to access the designated storage. These configurations might be k8s volume mounts, specific environment variables that contain configuration and credentials, and other configuration of security settings. These storage configurations are not applicable to functions running locally in the development environment, since they are executed in the local context.

The common types of shared storage are:

  1. v3io storage through API — When running as part of the Iguazio system, MLRun has access to the system's v3io storage through paths such as v3io:///projects/my_projects/file.csv. To enable this type of access, several environment variables need to be configured in the pod that provide the v3io API URL and access keys.

  2. v3io storage through FUSE mount — Some tools cannot utilize the v3io API to access it and need basic filesystem semantics. For that purpose, v3io provides a FUSE (Filesystem in user-space) driver that can be used to mount v3io containers as specific paths in the pod itself. For example /User. To enable this, several specific volume mount configurations need to be applied to the pod spec.

  3. NFS storage access — When MLRun is deployed as open-source, independent of Iguazio, the deployment automatically adds a pod running NFS storage. To access this NFS storage through pods, a kubernetes pvc mount is needed.

  4. Others — As use-cases evolve, other cases of storage access may be needed. This will require various configurations to be applied to function execution pods.

MLRun attempts to offload this storage configuration task from the user by automatically applying the most common storage configuration to functions. As a result, most cases do not require any additional storage configurations before executing a function as a Kubernetes pod. The configurations applied by MLRun are:

  • In an Iguazio system, apply configurations for v3io access through the API.

  • In an open-source deployment where NFS is configured, apply configurations for pvc access to NFS storage.

This MLRun logic is referred to as auto-mount.

Disabling auto-mount#

In cases where the default storage configuration does not fit the function needs, MLRun allows function spec modifiers to be manually applied to functions. These modifiers can add various configurations to the function spec, such as environment variables and volume mounts. MLRun also provides a set of common modifiers that can be used to apply storage configurations. These modifiers can be applied by using the .apply() method on the function and passing the modifier to apply, as shown in the sketch below; additional examples appear later in this page.

When a different storage configuration is manually applied to a function, MLRun's auto-mount logic is disabled. This prevents conflicts between configurations. The auto-mount logic can also be disabled by setting func.spec.disable_auto_mount = True on any MLRun function.
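
For example, a sketch of manually applying a storage modifier to an existing function object fn (the PVC parameters are illustrative):

from mlrun.platforms import mount_pvc, mount_v3io

# Applying a modifier manually disables MLRun's auto-mount logic for this function
fn.apply(mount_v3io())  # v3io FUSE mount
# or, for an NFS/PVC mount:
# fn.apply(mount_pvc(pvc_name="nfs-pvc", volume_name="nfs", volume_mount_path="/data"))

# Alternatively, disable auto-mount explicitly without applying any modifier
fn.spec.disable_auto_mount = True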

Modifying the auto-mount default configuration#

The default auto-mount behavior applied by MLRun is controlled by setting MLRun configuration parameters. For example, the logic can be set to automatically mount the v3io FUSE driver on all functions, or perform pvc mount for NFS storage on all functions. The following code demonstrates how to apply the v3io FUSE driver by default:

# Change MLRun auto-mount configuration
import mlrun

mlrun.mlconf.storage.auto_mount_type = "v3io_fuse"

Each of the auto-mount supported methods applies a specific modifier function. The supported methods are:

  • v3io_credentials — apply v3io credentials needed for v3io API usage. Applies the v3io_cred() modifier.

  • v3io_fuse — create Fuse driver mount. Applies the mount_v3io() modifier.

  • pvc — create a pvc mount. Applies the mount_pvc() modifier.

  • auto — the default auto-mount logic as described above (either v3io_credentials or pvc).

  • none — perform no auto-mount (same as using disable_auto_mount = True).

The modifier functions executed by auto-mount can be further configured by specifying their parameters. These can be provided in the storage.auto_mount_params configuration parameters. Parameters can be passed as a string made of key=value pairs separated by commas. For example, the following code runs a pvc mount with specific parameters:

mlrun.mlconf.storage.auto_mount_type = "pvc"
pvc_params = {
    "pvc_name": "my_pvc_mount",
    "volume_name": "pvc_volume",
    "volume_mount_path": "/mnt/storage/nfs",
}
mlrun.mlconf.storage.auto_mount_params = ",".join(
    [f"{key}={value}" for key, value in pvc_params.items()]
)

Alternatively, the parameters can be provided as a base64-encoded JSON object, which can be useful when passing complex parameters or strings that contain special characters:

import base64
import json

pvc_params_str = base64.b64encode(json.dumps(pvc_params).encode())
mlrun.mlconf.storage.auto_mount_params = pvc_params_str

Images and their usage in MLRun#

Every release of MLRun includes several images for different usages. The build and the infrastructure images are described, and located, in the README. They are also published to dockerhub and quay.io.

This release of MLRun supports only Python 3.9.

Using images#

See Build function image.

MLRun runtime images#

All images are published to Docker Hub and quay.io (https://quay.io/organization/mlrun).

The images are:

  • mlrun/mlrun: An MLRun image includes preinstalled OpenMPI and other ML packages. Useful as a base image for simple jobs.

  • mlrun/mlrun-gpu: The same as mlrun/mlrun but for GPUs, including Open MPI.

  • mlrun/ml-base: Image for file acquisition, compression, dask jobs, simple training jobs and other utilities.

  • mlrun/jupyter: An image with Jupyter giving a playground to use MLRun in the open source. Built on top of jupyter/scipy-notebook, with the addition of MLRun and several demos and examples.

Note

When using the mlrun or mlrun-gpu image, use PyTorch versions up to and including 2.0.1, but not higher. You can build your own images with newer CUDA for later releases of PyTorch.

Building MLRun images#

To build all images, run this command from the root directory of the mlrun repository:

MLRUN_VERSION=X MLRUN_DOCKER_REPO=X MLRUN_DOCKER_REGISTRY=X make docker-images

Where:

  • MLRUN_VERSION is used as the tag of the image and also as the version injected into the code (e.g. latest or 0.7.0 or 0.6.5-rc6, defaults to unstable)

  • MLRUN_DOCKER_REPO is the docker repository (defaults to mlrun)

  • MLRUN_DOCKER_REGISTRY is the docker registry (e.g. quay.io/, gcr.io/, defaults to empty (docker hub))

For example, running MLRUN_VERSION=x.y.z make docker-images generates these images:

  • mlrun/mlrun-api:x.y.z

  • mlrun/mlrun:x.y.z

  • mlrun/mlrun-gpu:x.y.z

  • mlrun/jupyter:x.y.z

  • mlrun/ml-base:x.y.z

You can also build only a specific image, for example, make mlrun (builds only the mlrun image).

The possible commands are:

  • mlrun

  • mlrun-gpu

To run an image locally and explore its contents, use docker run -it <image-name>:<image-tag> /bin/bash. To load Python (or run a script), use docker run -it <image-name>:<image-tag> python.

Building a docker image using a dockerfile and using it#

This flow describes how to build the image externally, put it in your private repo, and use it in MLRun.

  1. Build an image using a Dockerfile:

    1. Create a Dockerfile:

    FROM mlrun/mlrun:X.X
    RUN pip install package1
    RUN pip install package2

    2. Build the image (the trailing dot sets the build context to the current directory):

    docker build -t your_docker_registry/your_image_name:tag .

    3. Push the image:

    docker push your_docker_registry/your_image_name:tag

  2. Create a secret on K8s level for accessing your registry:

    kubectl --namespace default-tenant create secret docker-registry registry-credentials \
        --docker-server your-docker-registry \
        --docker-username <    > \
        --docker-password <    > \
        --docker-email <    >
    
  3. In the code, use the image you created and provide the secret for pulling it:

    func = prj.set_function(name="func",...)
    func.set_image_pull_configuration(image_pull_secret_name="registry-credentials")
    

    Now when you run the function, the image is used.

MLRun images and external docker images#

There is no difference in the usage between the MLRun images and external docker images. However:

  • MLRun images resolve auto tags: If you specify image="mlrun/mlrun", the API fills in the tag based on the client version, e.g. changes it to mlrun/mlrun:1.5.1. So, if the client gets upgraded, you automatically get a new image tag.

  • Where the data node registry exists, MLRun appends the registry prefix, so the image loads from the datanode registry. This pulls the image more quickly, and also supports air-gapped sites. When you specify an MLRun image, for example mlrun/mlrun:1.5.1, the actual image used is similar to datanode-registry.iguazio-platform.app.vm/mlrun/mlrun:1.5.1.

These characteristics are great when you’re working in a POC or development environment. But MLRun typically upgrades packages as part of the image, and therefore the default MLRun images can break your product flow.

Working with images in production#

Warning

For production, create your own images to ensure that the image is fixed.

  • Pin the image tag, e.g. image="mlrun/mlrun:1.5.1". This maintains the image tag at the version you specified, even when the client is upgraded. Otherwise, an upgrade of the client would also upgrade the image. (If you specify an external (not MLRun images) docker image, like python, the result is the docker/k8s default behavior, which defaults to latest when the tag is not provided.)

  • Pin the versions of requirements, again to avoid breakages, e.g. pandas==1.4.0. (If you only specify the package name, e.g. pandas, then pip/conda (python's package managers) just pick up the latest version.)
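
Putting both recommendations together, a pinned configuration might look like the following sketch (the versions shown are illustrative):

fn = project.set_function(
    name="prod-job", func="job.py", kind="job", handler="handler",
    image="mlrun/mlrun:1.5.1",       # pinned image tag
    requirements=["pandas==1.4.0"],  # pinned package version
)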

Build function image#

As discussed in Images and their usage in MLRun, MLRun provides pre-built images which contain the components necessary to execute an MLRun runtime. In some cases, however, custom images need to be created. This page details this process and the available options.

When is a build required?#

In many cases an MLRun runtime can be executed without having to build an image. This is true when the basic MLRun images fulfill all the requirements for the code to execute. You need to build an image if any of the following is true:

  • The code uses additional Python packages, OS packages, scripts or other configurations that need to be applied

  • The code uses different base-images or different versions of MLRun images than provided by default

  • Executed source code has changed, and the image has the code packaged in it - see here for more details on source code, and using with_code() to avoid re-building the image when the code has changed

  • The code runs nuclio functions, which are packaged as images (the build is triggered by MLRun and executed by nuclio)

The build process in MLRun is based on Kaniko and automated by MLRun - MLRun generates the dockerfile for the build process, and configures Kaniko with parameters needed for the build.

Building images is done through functions provided by the MlrunProject class. By using project functions, the same process is used to build and deploy a stand-alone function or functions serving as steps in a pipeline.

Automatically building images#

MLRun can auto-detect when a function image needs to be built first. The following example requires building the image:

project = mlrun.new_project(project_name, "./proj")

project.set_function(
   "train_code.py", 
   name="trainer",
   kind="job",
   image="mlrun/mlrun",
   handler="train_func",
   requirements=["pandas"]
)

# auto_build will trigger building the image before running, 
# due to the additional requirements.
project.run_function("trainer", auto_build=True)

Using the auto_build option is only suitable when the build configuration does not change between runs of the runtime. For example, if new requirements were added during development, do not rely on the auto_build parameter; a manual build is needed to re-trigger building the image.

In the example above, the requirements parameter was used to specify a list of additional Python packages required by the code. This option directly affects the image build process - each requirement is installed using pip as part of the docker-build process. The requirements parameter can also contain a path to a requirements file, making it easier to reuse an existing configuration rather than specify a list of packages.

Manually building an image#

To manually build an image, use the build_function() function, which provides multiple options that control and configure the build process.

Specifying base image#

To use an existing image as the base image for building the image, set the image name in the base_image parameter. Note that this image serves as the base (dockerfile FROM property), and should not be confused with the resulting image name, as specified in the image parameter.

project.build_function(
   "trainer",
   base_image="myrepo/my_base_image:latest",
)
Running commands#

To run arbitrary commands during the image build, pass them in the commands parameter of build_function(). For example:

github_repo = "myusername/myrepo.git@mybranch"

project.build_function(
   "trainer",
   base_image="myrepo/base_image:latest",
   commands= [
        "pip install git+https://github.com/" + github_repo,
        "mkdir -p /some/path && chmod 0777 /some/path",    
   ]
)

These commands are added as RUN operations to the dockerfile generating the image.

MLRun package deployment#

The with_mlrun and mlrun_version_specifier parameters allow control over the inclusion of the MLRun package in the build process. Depending on the base-image used for the build, the MLRun package may already be available in which case use with_mlrun=False. If not specified, MLRun will attempt to detect this situation - if the image used is one of the default MLRun images released with MLRun, with_mlrun is automatically set to False. If the code execution requires a different version of MLRun than the one used to deploy the function, set the mlrun_version_specifier to point at the specific version needed. This uses the published MLRun images of the specified version instead. For example:

project.build_function(
   "trainer",
   with_mlrun=True,
   mlrun_version_specifier="1.0.0"
)
Working with code repository#

As the code matures and evolves, the code will usually be stored in a git code repository. When the MLRun project is associated with a git repo (see Create, save, and use projects for details), functions can be added by calling set_function() and setting with_repo=True. This indicates that the code of the function should be retrieved from the project code repository.

In this case, the entire code repository will be retrieved from git as part of the image-building process, and cloned into the built image. This is recommended when the function relies on code spread across multiple files and also is usually preferred for production code, since it means that the code of the function is stable, and further modifications to the code will not cause instability in deployed images.

During the development phase, you may prefer to retrieve the code at runtime rather than re-build the function image every time the code changes. To enable this, use set_source(), which accepts a path to the source (a git repository, or a tar or zip file), and set pull_at_runtime=True.
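
The following sketch contrasts the two options (the repository URL and handler are placeholders):

# Option 1: clone the repo into the image at build time (stable, preferred for production)
project.set_source("git://github.com/<user>/<repo>.git#main")
fn = project.set_function(
    name="my-func", handler="pkg.module.handler",
    image="mlrun/mlrun", kind="job", with_repo=True,
)
project.build_function(fn)

# Option 2: pull the source at runtime instead (convenient during development)
project.set_source("git://github.com/<user>/<repo>.git#main", pull_at_runtime=True)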

Using a private Docker registry#

By default, images are pushed to the registry configured during MLRun deployment, using the configured registry credentials.

To push resulting images to a different registry, specify the registry URL in the image parameter. If the registry requires credentials, create a k8s secret containing these credentials, and pass its name in the secret_name parameter.

When using ECR as registry, MLRun uses Kaniko's ECR credentials helper, in which case the secret provided should contain AWS credentials needed to create ECR repositories, as described here. MLRun detects automatically that the registry is an ECR registry based on its URL and configures Kaniko to use the ECR helper. For example:

# AWS credentials stored in a k8s secret -
# kubectl create secret generic ecr-credentials --from-file=<path to .aws/credentials>

project.build_function(
    "trainer",
    image="<aws_account_id>.dkr.ecr.us-east-2.amazonaws.com/myrepo/image:v1",
    secret_name="ecr-credentials",
)

When using an ECR registry and not providing a secret name, MLRun assumes that an EC2 instance role is used to authorize access to ECR. In this case MLRun clears out AWS credentials provided by project-secrets or environment variables (AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY) from the Kaniko pod used for building the image. Otherwise Kaniko would attempt to use these credentials for ECR access instead of using the instance role. This means it's not possible to build an image with both ECR access via instance role and S3 access using a different set of credentials. To build this image, the instance role that has access to ECR must have the permissions required to access S3.

Using self-signed registry#

If you need to build your function and push the resulting container image to an external Docker registry that uses a self-signed SSL certificate, you can use Kaniko with the --skip-tls-verify flag. When using this flag, Kaniko ignores the SSL certificate verification while pulling base images and/or pushing the final built image to the registry over HTTPS.

Caution: Using the --skip-tls-verify flag poses security risks since it bypasses SSL certificate validation. Only use this flag in trusted environments or with private registries where you are confident in the security of the network connections.

To use this flag, pass it in the extra_args parameter, for example:

project.build_function(
    ...
    extra_args="--skip-tls-verify",
)
Build environment variables#

It is possible to pass environment variables that will be set in the Kaniko pod that executes the build. This may be useful to pass important information needed for the build process. The variables are passed as a dictionary in the builder_env parameter, for example:

project.build_function(
   ...
   builder_env={"GIT_TOKEN": token},
)
Extra arguments#

It is also possible to pass custom arguments and flags to Kaniko. The extra_args parameter can be utilized in build_image(), build_function(), or during the deployment of the function. It provides a way to fine-tune the Kaniko build process according to your specific needs.

You can provide the extra_args as a string in the format of a CLI command line, just as you would when using Kaniko directly, for example:

project.build_function(
    ...
    extra_args="--build arg GIT_TOKEN=token --skip-tls-verify",
)

Note that when building an image in MLRun, project secrets are automatically passed to the builder pod as environment variables whose name is the secret key.

Deploying nuclio functions#

When using nuclio functions, the image build process is done by nuclio as part of the deployment of the function. Most of the configurations mentioned in this page are available for nuclio functions as well. To deploy a nuclio function, use deploy_function() instead of using build_function() and run_function().
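
For example, a minimal sketch (the function names are placeholders):

# Nuclio: the image is built by nuclio as part of the deployment
project.deploy_function("my-nuclio-func")

# Job runtimes, for comparison: build explicitly and then run
project.build_function("my-job-func")
project.run_function("my-job-func")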

Creating default Spark runtime images#

When using Spark to execute code, either using a Spark service (remote-spark) or the Spark operator, an image is required that contains both Spark binaries and dependencies, and MLRun code and dependencies. This image is used in the following scenarios:

  1. For remote-spark, the image is used to run the initial MLRun code which will submit the Spark job using the remote Spark service

  2. For Spark operator, the image is used for both the driver and the executor pods used to execute the Spark job

This image needs to be created any time a new version of Spark or MLRun is being used, to ensure that jobs are executed with the correct versions of both products.

To prepare this image, MLRun provides the following facilities:

# For remote Spark
from mlrun.runtimes import RemoteSparkRuntime
RemoteSparkRuntime.deploy_default_image()

# For Spark operator
from mlrun.runtimes import Spark3Runtime
Spark3Runtime.deploy_default_image()

Node affinity#

You can assign a node or a node group for services or for jobs executed by a service. When specified, the service or the pods of a function can only run on nodes whose labels match the node selector entries configured for the specific service. If node selection for the service is not specified, the selection criteria defaults to the Kubernetes default behavior, and jobs run on a random node.

For MLRun and Nuclio, you can also specify node selectors on a per-job basis. The default node selectors (defined at the service level) are applied to all jobs unless you specifically override them for an individual job.

You can configure node affinity for:

  • Jupyter

  • Presto (The node selection also affects any additional services that are directly affected by Presto, for example hive and mariadb, which are created if Enable hive is checked in the Presto service.)

  • Grafana

  • Shell

  • MLRun (default value applied to all jobs that can be overwritten for individual jobs)

  • Nuclio (default value applied to all jobs that can be overwritten for individual jobs)

See more about Kubernetes nodeSelector.

UI configuration#

Configure node selection on the service level in the service's Custom Parameters tab, under Resources, by adding or removing Key:Value pairs. For MLRun and Nuclio, this is the default node selection for all MLRun jobs and Nuclio functions.

You can also configure the node selection for individual MLRun jobs by going to Platform dashboard | Projects | New Job | Resources | Node selector, and adding or removing Key:Value pairs. Configure the node selection for individual Nuclio functions when creating a function, in the Configuration tab, under Resources, by adding Key:Value pairs.

SDK configuration#

Configure node selection in the SDK by passing the key:value pairs as a Python dictionary.
For example:

import mlrun

train_fn = mlrun.code_to_function('training',
                                  kind='job',
                                  handler='my_training_function')

# Add node selection (a dictionary of node-label key:value pairs)
train_fn.with_node_selection(node_selector={"<label-key>": "<label-value>"})

# Optionally, prevent the job from running on preemptible (spot) nodes
train_fn.with_preemption_mode(mode="prevent")

train_fn.run(inputs={"dataset": my_data})

See with_node_selection.

Function hub #

This section demonstrates how to import a function from the hub into your project, and provides some basic instructions on how to run the function and view the results.

MLRun function hub#

The MLRun function hub has a wide range of functions that can be used for a variety of use cases. There are functions for ETL, data preparation, training (ML & Deep learning), serving, alerts and notifications and more. Each function has a docstring that explains how to use it. In addition, the functions are associated with categories to make it easier for you to find the relevant one.

Functions can be easily imported into your project and therefore help you to speed up your development cycle by reusing built-in code.

The function hub is located here.
You can search and filter the categories and kinds to find a function that meets your needs.

Custom function hub#

You can create your own function hub, and connect it to MLRun. Then you can import functions (with their tags) from your custom hub.

Create a custom hub#

You can either fork the MLRun function hub repo into your own Git repo, or create a hub from scratch. Read CONTRIBUTING.md to learn how to create a function.

Note

Make sure your hub source is accessible via GitHub (a private repo is also possible).

To create a function hub from scratch, the hub structure must be the same as the MLRun hub.

The hierarchy must be:

  • functions directory

    • channels directories

      • some-function-1

      • some-function-2

      • some-function-n

        • version-1

        • version-n

        • latest

          • src

          • static (optional)

            • html files

Add a custom hub to the MLRun database#

When you add a hub, specify order=-1 to add it to the top of the list. The list order is relevant when loading a function: if you don't specify a hub name, MLRun searches for the function starting from the most recently added hub. If you want to add a hub but not at the top of the list, view the current list using list_hub_sources(). The MLRun function hub is always last in the list (and cannot be modified).

To add a hub, run:

import mlrun
import mlrun.common.schemas

# Add a custom hub to the top of the list
private_source = mlrun.common.schemas.IndexedHubSource(
    order=-1,
    source=mlrun.common.schemas.HubSource(
        metadata=mlrun.common.schemas.HubObjectMetadata(
            name="private", description="a private hub"
        ),
        spec=mlrun.common.schemas.HubSourceSpec(
            path="https://mlrun.github.io/marketplace", channel="development"
        ),
    ),
)

# Store the new hub source in the MLRun DB
db = mlrun.get_run_db()
db.create_hub_source(private_source)
Setting the project configuration#

The first step for each project is to set the project name and path:

from os import path, getenv
from mlrun import new_project

project_name = 'load-func'
project_path = path.abspath('conf')
project = new_project(project_name, project_path, init_git=True)

print(f'Project path: {project_path}\nProject name: {project_name}')
Set the artifacts path #

The artifact path is the default path for saving all the artifacts that the functions generate:

from mlrun import mlconf

# Target location for storing pipeline artifacts
artifact_path = path.abspath('jobs')
# MLRun DB path or API service URL
mlconf.dbpath = mlconf.dbpath or 'http://mlrun-api:8080'

print(f'Artifacts path: {artifact_path}\nMLRun DB path: {mlconf.dbpath}')
Loading functions from the hub#

Run project.set_function to add or update a function object to the project.

set_function(func, name='', kind='', image=None, with_repo=None)

Partial list of parameters:

  • func — function object or spec/code url.

  • name — name of the function (under the project).

  • kind — runtime kind e.g. job, nuclio, spark, dask, mpijob. Default: job.

  • image — docker image to be used, can also be specified in the function object/yaml.

  • with_repo — add (clone) the current repo to the build source.

See all the parameters in set_function() API documentation.

Load function example #

The describe function analyzes a csv or parquet file and generates data-analysis reports. To load the describe function from the MLRun function hub:

project.set_function('hub://describe', 'describe')

To load the same function from your custom hub:

project.set_function('hub://<hub-name>/describe', 'describe')

Caution

If you don't specify a hub name at all, the algorithm searches for the function in all the hubs, giving preference to newly defined hubs. Therefore, if you have multiple hubs, best practice is to explicitly mention the hub name.

After loading the function, create a function object named, for example, my_describe:

my_describe = project.func('describe')
View the function params#

To view the parameters, run the function with .doc():

my_describe.doc()
    function: describe
    describe and visualizes dataset stats
    default handler: summarize
    entry points:
      summarize: Summarize a table
        context(MLClientCtx)  - the function context, default=
        table(DataItem)  - MLRun input pointing to pandas dataframe (csv/parquet file path), default=
        label_column(str)  - ground truth column label, default=None
        class_labels(List[str])  - label for each class in tables and plots, default=[]
        plot_hist(bool)  - (True) set this to False for large tables, default=True
        plots_dest(str)  - destination folder of summary plots (relative to artifact_path), default=plots
        update_dataset  - when the table is a registered dataset update the charts in-place, default=False
Running the function#

Use the run method to run the function.

When working with functions, pay attention to the following:

  • Input vs. params — to send data items to a function, send them via "inputs" and not as params.

  • Working with artifacts — Artifacts from each run are stored in the artifact_path, which can be set globally with the environment variable (MLRUN_ARTIFACT_PATH) or with the config. If it's not already set you can create a directory and use it in the runs. Using {{run.uid}} in the path creates a unique directory per run. When using pipelines you can use the {{workflow.uid}} template option.

This example runs the describe function. This function analyzes a dataset (in this case it's a csv file) and generates HTML files (e.g. correlation, histogram) and saves them under the artifact path.

DATA_URL = 'https://s3.wasabisys.com/iguazio/data/iris/iris_dataset.csv'

my_describe.run(name='describe',
                inputs={'table': DATA_URL},
                artifact_path=artifact_path)
Saving the artifacts in a unique folder for each run #
out = mlconf.artifact_path or path.abspath('./data')
my_describe.run(name='describe',
                inputs={'table': DATA_URL},
                artifact_path=path.join(out, '{{run.uid}}'))
Viewing the jobs & the artifacts #

There are a few options for viewing the outputs of the jobs you ran:

  • In Jupyter, the result of the job is displayed in the notebook. When you click an artifact, its content is displayed in Jupyter.

  • In the MLRun UI, under the project name, you can view the job that was running as well as the artifacts it generated.

Using a Git repo as a function hub#

You can save functions in a Git repo, and use this repo as your own function hub. This repo structure must conform with:

  • The function YAML file must be named function.yaml. You can use the export method to create the function YAML file.

  • The .yaml file must be stored in a path like /function-name/function.yaml (e.g. /func/function.yaml), for example: https://raw.githubusercontent.com/user-name/repo-name/function-name/function.yaml.

  • If you have additional files, for example a source file or a notebook example, they can be stored in the same folder as the function.yaml.

Tip

You can use Git tags for function versioning in Git. For example, to import a function named func that has a v1 tag:

import_func_1 = mlrun.import_function('hub://func:v1')

Create and export an MLRun function from a file#

You can use the function tag to tag the function in MLRun. It is not related to the Git tag. For example, this function has a 'version1' tag in MLRun and a 'v1' tag in Git.

function = project.set_function(
    name="func-hub",
    tag="version1",
    handler="func",
    image="mlrun/mlrun",
    func="./my-hub/func/func.py",
    kind="job",
)

Export the function to a YAML file:

function.export("./my-hub/func/function.yaml")
Import and run the function from your repo#

You can import a function from your "Git repo function hub" by pointing to it with its full URL, for example: https://raw.githubusercontent.com/user-name/repo-name/tag/name/function.yaml.

Working with tags

Assume there are multiple versions in Git: v1, v2, etc. You specify which version you want by appending :tag# to the hub path. The path must be to a folder that contains the function.yaml file in the func directory.

Private repo

If working from a private repo, set:
project.set_secrets({"HTTPS_AUTH_TOKEN": "<HTTP auth token, e.g. a Git token>"})

# Import the v1 tag from git:
import_func_1 = project.set_function(
    "https://raw.githubusercontent.com/user-name/repo-name/v1/func/function.yaml",
    name="<function-name>",
)

# print the results
print(import_func_1.to_yaml())

# Run the function:
import_func_1.run()

Data and artifacts#

One of the biggest challenges in distributed systems is handling data, given the different access methods, APIs, and authentication mechanisms across data types and providers.

Working with these abstractions enables you to securely access different data sources through a single API, with many convenience methods (e.g. to/from DataFrame, get, download, list, …), automated data movement, and versioning.

MLRun provides these main abstractions to access structured and unstructured data:

Data stores#

A data store defines a storage provider (e.g. file system, S3, Azure blob, Iguazio v3io, etc.).

Shared data stores#

MLRun supports multiple data stores. (More can easily be added by extending the DataStore class.) Data stores are referenced using a schema prefix (e.g. s3://my-bucket/path). The currently supported schemas and their URL formats are listed below, followed by a short usage sketch:

  • files — local/shared file paths, format: /file-dir/path/to/file (Unix) or C:/dir/file (Windows)

  • http, https — read data from HTTP sources (read-only), format: https://host/path/to/file (Not supported by runtimes spark and remote-spark)

  • s3 — S3 objects (AWS or other endpoints), format: s3://<bucket>/path/to/file

  • v3io, v3ios — Iguazio v3io data fabric, format: v3io://[<remote-host>]/<data-container>/path/to/file

  • az — Azure Blob storage, format: az://<container>/path/to/file

  • dbfs — Databricks storage, format: dbfs://path/to/file (Not supported by runtimes spark and remote-spark)

  • gs, gcs — Google Cloud Storage objects, format: gs://<bucket>/path/to/file

  • store — MLRun versioned artifacts (see Artifacts), format: store://artifacts/<project>/<artifact-name>[:tag]

  • memory — in memory data registry for passing data within the same process, format memory://key, use mlrun.datastore.set_in_memory_item(key, value) to register in memory data items (byte buffers or DataFrames). (Not supported by all Spark runtimes)
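
The following sketch shows how the same API is used regardless of the underlying store (the bucket, container, and key names are placeholders):

import mlrun

# Read a CSV from object storage into a pandas DataFrame
df = mlrun.get_dataitem("s3://my-bucket/path/to/file.csv").as_df()

# The same call works for other stores, e.g. v3io
df_v3io = mlrun.get_dataitem("v3io://projects/my-project/data/file.csv").as_df()

# Register an in-memory item and read it back through the memory:// schema
mlrun.datastore.set_in_memory_item("my-key", df)
df_again = mlrun.get_dataitem("memory://my-key").as_df()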

Storage credentials and parameters#

Data stores might require connection credentials. These can be provided through environment variables or project/job context secrets. The exact credentials depend on the type of the data store. They are listed in the following sections. Each parameter specified can be provided as an environment variable, or as a project-secret that has the same key as the name of the parameter.

MLRun jobs that are executed remotely run in independent pods, with their own environment. When setting an environment variable in the development environment (for example Jupyter), this has no effect on the executing pods. Therefore, before executing jobs that require access to storage credentials, these need to be provided by assigning environment variables to the MLRun runtime itself, assigning secrets to it, or placing the variables in project-secrets.

You can also use data store profiles to provide credentials.

Warning

Passing secrets as environment variables to runtimes is discouraged, as they are exposed in the pod spec. Refer to Working with secrets for details on secret handling in MLRun.

For example, running a function locally:

import os

# Access an object in AWS S3, in the "input-data" bucket
source_url = "s3://input-data/input_data.csv"

os.environ["AWS_ACCESS_KEY_ID"] = "<access key ID>"
os.environ["AWS_SECRET_ACCESS_KEY"] = "<access key>"

# Execute a function that reads from the object pointed at by source_url.
# When running locally, the function can use the local environment variables.
local_run = func.run(name='aws_func', inputs={'source_url': source_url}, local=True)

Running the same function remotely:

import mlrun

# Executing the function remotely using env variables (not recommended!)
func.set_env("AWS_ACCESS_KEY_ID", "<access key ID>").set_env("AWS_SECRET_ACCESS_KEY", "<access key>")
remote_run = func.run(name='aws_func', inputs={'source_url': source_url})

# Using project-secrets (recommended) - project secrets are automatically mounted to project functions
secrets = {"AWS_ACCESS_KEY_ID": "<access key ID>", "AWS_SECRET_ACCESS_KEY": "<access key>"}
db = mlrun.get_run_db()
db.create_project_secrets(project=project_name, provider="kubernetes", secrets=secrets)

remote_run = func.run(name='aws_func', inputs={'source_url': source_url})

The following sections list the credentials and configuration parameters applicable to each storage type.

v3io#

When running in an Iguazio system, MLRun automatically configures the executed functions to use v3io storage, and passes the needed parameters (such as access-key) for authentication. Refer to the auto-mount section for more details on this process.

In some cases, the v3io configuration needs to be overridden. The following parameters can be configured (a short example follows this list):

  • V3IO_API — URL pointing to the v3io web-API service.

  • V3IO_ACCESS_KEY — access key used to authenticate with the web API.

  • V3IO_USERNAME — the user-name authenticating with v3io. While not strictly required when using an access-key to authenticate, it is used in several use-cases, such as resolving paths to the home-directory.
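As a hedged example (the values are placeholders), these parameters can be set as environment variables for a local run, or provided to remote functions through project secrets as shown in the examples above:

import os

# Override the v3io configuration for a local run
os.environ["V3IO_API"] = "https://webapi.default-tenant.example.com"  # hypothetical web-API URL
os.environ["V3IO_ACCESS_KEY"] = "<access-key>"
os.environ["V3IO_USERNAME"] = "<user-name>"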

Azure Blob storage#

The Azure Blob storage can utilize several methods of authentication. Each requires a different set of parameters as listed here:

  • Connection string — AZURE_STORAGE_CONNECTION_STRING

  • SAS token — AZURE_STORAGE_ACCOUNT_NAME, AZURE_STORAGE_SAS_TOKEN

  • Account key — AZURE_STORAGE_ACCOUNT_NAME, AZURE_STORAGE_KEY

  • Service principal with a client secret — AZURE_STORAGE_ACCOUNT_NAME, AZURE_STORAGE_CLIENT_ID, AZURE_STORAGE_CLIENT_SECRET, AZURE_STORAGE_TENANT_ID

Note

The AZURE_STORAGE_CONNECTION_STRING configuration uses the BlobServiceClient to access objects. This has limited functionality and cannot be used to access Azure Datalake storage objects. In this case use one of the other authentication methods that use the fsspec mechanism.
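For example, a hedged sketch of providing SAS-token credentials through project secrets (the account name and token are placeholders, and project_name is assumed to hold your project name, as in the S3 example above):

import mlrun

secrets = {
    "AZURE_STORAGE_ACCOUNT_NAME": "<account-name>",
    "AZURE_STORAGE_SAS_TOKEN": "<sas-token>",
}
db = mlrun.get_run_db()
db.create_project_secrets(project=project_name, provider="kubernetes", secrets=secrets)

# Project functions can now read az://<container>/path/to/file inputs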

Google cloud storage#
  • GOOGLE_APPLICATION_CREDENTIALS — path to the application credentials to use (in the form of a JSON file). This can be used if the file is located on shared storage that is accessible to the pods executing MLRun jobs.

  • GCP_CREDENTIALS — when the credentials file cannot be mounted to the pod, this secret or environment variable can contain the contents of this file (see the sketch after this list). If configured in the function pod, MLRun dumps its contents to a temporary file and points GOOGLE_APPLICATION_CREDENTIALS at it. An exception is BigQuerySource, which passes the contents of GCP_CREDENTIALS directly to the query engine.
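A hedged sketch of the GCP_CREDENTIALS option (the file path is hypothetical, and project_name is assumed to hold your project name):

import mlrun

# Read the service-account JSON and store its contents as a project secret
with open("/local/path/to/gcs_credentials.json") as f:
    gcp_credentials = f.read()

db = mlrun.get_run_db()
db.create_project_secrets(
    project=project_name,
    provider="kubernetes",
    secrets={"GCP_CREDENTIALS": gcp_credentials},
)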

Databricks file system#

Note

Not supported by the spark and remote-spark runtimes.

  • DATABRICKS_HOST — the endpoint URL of the Databricks (DBFS) service.

  • DATABRICKS_TOKEN — the token used for authentication to the DBFS service.

S3#
  • AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY — access key parameters

  • S3_ENDPOINT_URL — the S3 endpoint to use. If not specified, it defaults to AWS. For example, to access a storage bucket in Wasabi storage, use S3_ENDPOINT_URL = "https://s3.wasabisys.com"

  • MLRUN_AWS_ROLE_ARN — IAM role to assume. Connect to AWS using the secret key and access key, and assume the role whose ARN is provided. The ARN must be of the format arn:aws:iam::<account-of-role-to-assume>:role/<name-of-role>

  • AWS_PROFILE — name of credentials profile from a local AWS credentials file. When using a profile, the authentication secrets (if defined) are ignored, and credentials are retrieved from the file. This option should be used for local development where AWS credentials already exist (created by aws CLI, for example)

Using data store profiles#

Notes

  • Datastore profile does not support: v3io (datastore, or source/target), snowflake source, DBFS for spark runtimes, Dask runtime.

  • Datastore profiles are not part of a project export/import.

You can use a data store profile to manage datastore credentials. A data store profile holds all the information required to address an external data source, including credentials. You can create multiple profiles for one datasource, for example, two different Redis data stores with different credentials. Targets, sources, and artifacts can all use the data store profile by using the ds://<profile-name> convention. After you create a profile object, make it available on remote pods by calling project.register_datastore_profile.

Create a data store profile in the context of a project. Example of creating a Redis datastore profile:

  1. Create the profile, for example:
    profile = DatastoreProfileRedis(name="profile-name", endpoint_url="redis://11.22.33.44:6379", username="user", password="password")
    The username and password parameters are optional.

  2. Register it within the project:
    project.register_datastore_profile(profile)

  3. Use the profile by specifying the 'ds' URI scheme. For example:
    RedisNoSqlTarget(path="ds://profile-name/a/b")
    If you want to use a profile from a different project, you can specify it explicitly in the URI using the format:
    RedisNoSqlTarget(path="ds://another_project@profile-name")

To access a profile from the client/sdk, register the profile locally by calling register_temporary_client_datastore_profile() with a profile object. You can also choose to retrieve the public information of an already registered profile by calling project.get_datastore_profile() and then adding the private credentials before registering it locally. For example, using Redis:

from mlrun.datastore.datastore_profile import (
    DatastoreProfileRedis,
    register_temporary_client_datastore_profile,
)

redis_profile = project.get_datastore_profile("my_profile")
local_redis_profile = DatastoreProfileRedis(redis_profile.name, redis_profile.endpoint_url, username="mylocaluser", password="mylocalpassword")
register_temporary_client_datastore_profile(local_redis_profile)
Azure data store profile#
profile = DatastoreProfileAzureBlob(name="profile-name",connection_string=connection_string)
ParquetTarget(path="ds://profile-name/az_blob/path/to/parquet.pq")

DatastoreProfileAzureBlob init parameters:

  • name — Name of the profile.

  • connection_string — The Azure connection string that points at a storage account. For privacy reasons, it's tagged as a private attribute, and its default value is None. The equivalent to this parameter in environment authentication is "AZURE_STORAGE_CONNECTION_STRING". For example:
    DefaultEndpointsProtocol=https;AccountName=myAcct;AccountKey=XXXX;EndpointSuffix=core.windows.net

The following variables allow alternative methods of authentication. All of these variables require account_name.

  • account_name — This parameter represents the name of the Azure Storage account. Each Azure Storage account has a unique name, and it serves as a globally-unique identifier for the storage account within the Azure cloud. The equivalent to this parameter in environment authentication is "AZURE_STORAGE_ACCOUNT_NAME".

  • account_key — The storage account key is a security credential associated with an Azure Storage account. It is a primary access key used for authentication and authorization purposes. This key is sensitive information and is kept confidential. The equivalent to this parameter in environment authentication is "AZURE_STORAGE_ACCOUNT_KEY".

  • sas_token — Shared Access Signature (SAS) token for time-bound access. This token is sensitive information. Equivalent to "AZURE_STORAGE_SAS_TOKEN" in environment authentication.

Authentication against Azure services using a Service Principal:

  • client_id — This variable holds the client ID associated with an Azure Active Directory (AAD) application, which represents the Service Principal. In Azure, a Service Principal is used for non-interactive authentication, allowing applications to access Azure resources. The equivalent to this parameter in environment authentication is "AZURE_STORAGE_CLIENT_ID".

  • client_secret — This variable stores the client secret associated with the Azure AD application (Service Principal). The client secret is a credential that proves the identity of the application when it requests access to Azure resources. This key is sensitive information and is kept confidential. The equivalent to this parameter in environment authentication is "AZURE_STORAGE_CLIENT_SECRET".

  • tenant_id — This variable holds the Azure AD tenant ID, which uniquely identifies the organization or directory in Azure Active Directory. The equivalent to this parameter in environment authentication is "AZURE_STORAGE_TENANT_ID".

Credential authentication:

  • credential — TokenCredential or SAS token. The credentials with which to authenticate. This variable is sensitive information and is kept confidential.

DBFS data store profile#
profile = DatastoreProfileDBFS(name="profile-name", endpoint_url="abc-d1e2345f-a6b2.cloud.databricks.com", token=token)
ParquetTarget(path="ds://profile-name/path/to/parquet.pq")

DatastoreProfileDBFS init parameters:

  • name — Name of the profile.

  • endpoint_url — A string representing the endpoint URL of the DBFS service. The equivalent to this parameter in environment authentication is "DATABRICKS_HOST".

  • token — A string representing the secret key used for authentication to the DBFS service. For privacy reasons, it's tagged as a private attribute, and its default value is None. The equivalent to this parameter in environment authentication is "DATABRICKS_TOKEN".

GCS data store profile#
profile = DatastoreProfileGCS(name="profile-name",credentials_path="/local_path/to/gcs_credentials.json")
ParquetTarget(path="ds://profile-name/gcs_bucket/path/to/parquet.pq")

DatastoreProfileGCS init parameters:

  • name — Name of the profile.

  • credentials_path — A string representing the local JSON file path that contains the authentication parameters required by the GCS API. The equivalent to this parameter in environment authentication is "GOOGLE_APPLICATION_CREDENTIALS."

  • gcp_credentials — A JSON in a string format representing the authentication parameters required by GCS API. For privacy reasons, it's tagged as a private attribute, and its default value is None. The equivalent to this parameter in environment authentication is "GCP_CREDENTIALS".

The code prioritizes gcp_credentials over credentials_path.
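Alternatively, a hedged sketch of passing the credentials content directly instead of a file path (the JSON content is a placeholder, and the import path is assumed to be mlrun.datastore.datastore_profile, as in the Redis example above):

import json

from mlrun.datastore.datastore_profile import DatastoreProfileGCS

gcp_credentials = json.dumps({"type": "service_account", "project_id": "<project-id>"})  # placeholder content
profile = DatastoreProfileGCS(name="profile-name", gcp_credentials=gcp_credentials)
project.register_datastore_profile(profile)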

Kafka data store profile#
profile = DatastoreProfileKafkaTarget(name="profile-name",bootstrap_servers="localhost", topic="topic_name")
target = KafkaTarget(path="ds://profile-name")

DatastoreProfileKafkaTarget class parameters:

  • name — Name of the profile

  • bootstrap_servers — A string representing the 'bootstrap servers' for Kafka. These are the initial contact points you use to discover the full set of servers in the Kafka cluster, typically provided in the format host1:port1,host2:port2,....

  • topic — A string that denotes the Kafka topic to which data will be sent or from which data will be received.

  • kwargs_public — This is a dictionary (Dict) meant to hold a collection of key-value pairs that represent settings or configurations deemed public. These pairs are passed as parameters to the underlying kafka.KafkaProducer() constructor. The default value is None.

  • kwargs_private — This dictionary (Dict) is designed to store key-value pairs, typically representing configurations that are of a private or sensitive nature. These pairs are also passed as parameters to the underlying kafka.KafkaProducer() constructor. It defaults to None.

profile = DatastoreProfileKafkaSource(name="profile-name",bootstrap_servers="localhost", topic="topic_name")
source = KafkaSource(path="ds://profile-name")

DatastoreProfileKafkaSource class parameters:

  • name — Name of the profile

  • brokers — This parameter can either be a single string or a list of strings representing the Kafka brokers. Brokers serve as the contact points for clients to connect to the Kafka cluster.

  • topics — A string or list of strings that denote the Kafka topics from which data will be sourced or read.

  • group — A string representing the consumer group name. Consumer groups are used in Kafka to allow multiple consumers to coordinate and consume messages from topics. The default consumer group is set to "serving".

  • initial_offset — A string that defines the starting point for the Kafka consumer. It can be set to "earliest" to start consuming from the beginning of the topic, or "latest" to start consuming new messages only. The default is "earliest".

  • partitions — This can either be a single string or a list of strings representing the specific partitions from which the consumer should read. If not specified, the consumer can read from all partitions.

  • sasl_user — A string representing the username for SASL authentication, if required by the Kafka cluster. It's tagged as private for security reasons.

  • sasl_pass — A string representing the password for SASL authentication, correlating with the sasl_user. It's tagged as private for security considerations.

  • kwargs_public — This is a dictionary (Dict) that holds a collection of key-value pairs representing settings or configurations deemed public. These pairs are passed as parameters to the underlying kafka.KafkaConsumer() constructor. It defaults to None.

  • kwargs_private — This dictionary (Dict) is used to store key-value pairs, typically representing configurations that are of a private or sensitive nature. These pairs are also passed as parameters to the underlying kafka.KafkaConsumer() constructor. It defaults to None.

Redis data store profile#
profile = DatastoreProfileRedis(name="profile-name", endpoint_url="redis://11.22.33.44:6379", username="user", password="password")
RedisNoSqlTarget(path="ds://profile-name/a/b")
S3 data store profile#
profile = DatastoreProfileS3(name="profile-name")
ParquetTarget(path="ds://profile-name/aws_bucket/path/to/parquet.pq")

DatastoreProfileS3 init parameters:

  • name — Name of the profile

  • endpoint_url — A string representing the endpoint URL of the S3 service. It's typically required for non-AWS S3-compatible services. If not provided, the default is None. The equivalent to this parameter in environment authentication is env["S3_ENDPOINT_URL"].

  • force_non_anonymous — A string that determines whether to force non-anonymous access to the S3 bucket. The default value is None, meaning the behavior is not explicitly set. The equivalent to this parameter in environment authentication is env["S3_NON_ANONYMOUS"].

  • profile_name — A string representing the name of the profile. This might be used to refer to specific named configurations for connecting to S3. The default value is None. The equivalent to this parameter in environment authentication is env["AWS_PROFILE"].

  • assume_role_arn — A string representing the Amazon Resource Name (ARN) of the role to assume when interacting with the S3 service. This can be useful for granting temporary permissions. By default, it is set to None. The equivalent to this parameter in environment authentication is env["MLRUN_AWS_ROLE_ARN"]

  • access_key_id — A string representing the access key used for authentication to the S3 service. It's one of the credentials parts when you're not using anonymous access or IAM roles. For privacy reasons, it's tagged as a private attribute, and its default value is None. The equivalent to this parameter in environment authentication is env["AWS_ACCESS_KEY_ID"].

  • secret_key — A string representing the secret key, which pairs with the access key, used for authentication to the S3 service. It's the second part of the credentials when not using anonymous access or IAM roles. It's also tagged as private for privacy and security reasons. The default value is None. The equivalent to this parameter in environment authentication is env["AWS_SECRET_ACCESS_KEY"].

See also#

The methods get_datastore_profile() and list_datastore_profiles() only return public information about the profiles. Access to private attributes is restricted to applications running in Kubernetes pods.

Data items#

A data item can be one item or a collection of items (file, dir, table, etc.).

When running jobs or pipelines, data is passed using DataItem objects. DataItem objects abstract away the data backend implementation, provide a set of convenience methods (.as_df, .get, .show, …), and enable automatic logging/versioning of data and metadata.

Example function:

# Save this code as a .py file:
import mlrun

def prep_data(context, source_url: mlrun.DataItem, label_column='label'):
    # Convert the DataItem to a Pandas DataFrame
    df = source_url.as_df()
    df = df.drop(label_column, axis=1).dropna()
    context.log_dataset('cleaned_data', df=df, index=False, format='csv')

Creating a project, setting the function into it, defining the URL with the data and running the function:

source_url = mlrun.get_sample_path('data/batch-predict/training_set.parquet')
project = mlrun.get_or_create_project("data-items", "./", user_project=True)
data_prep_func = project.set_function("data-prep.py", name="data-prep", kind="job", image="mlrun/mlrun", handler="prep_data")
prep_data_run = data_prep_func.run(name='prep_data',
                                   handler='prep_data',
                                   inputs={'source_url': source_url},
                                   params={'label_column': 'label'})

To pass a data input to the function, use the inputs dictionary argument. To pass a simple parameter, use the params dictionary argument. The input value is the specific item URI (per data store schema), as explained in Shared data stores.

From v1.3, DataItem objects are automatically parsed to the hinted type when a type hint is available.

Reading the data results from the run, you can easily get a run output artifact as a DataItem (so that you can view/use the artifact) using:

# read the data locally as a Dataframe
prep_data_run.artifact('cleaned_data').as_df()

The DataItem object supports multiple convenience methods (a short usage sketch follows the examples below), such as:

  • get(), put() - to read/write data

  • download(), upload() - to download/upload files

  • as_df() - to convert the data to a DataFrame object

  • local() - to get a local file path to the data (downloaded locally if needed)

  • listdir(), stat() - file-system-like methods

  • meta - access to the artifact metadata (in case of an artifact uri)

  • show() - visualizes the data in Jupyter (as image, html, etc.)

See the DataItem class documentation for details.

To get a DataItem object from a URL, use get_dataitem(); to get the object content directly, use get_object() (equivalent to DataItem.get()).

For example:

import mlrun

df = mlrun.get_dataitem('s3://demo-data/mydata.csv').as_df()
print(mlrun.get_object('https://my-site/data.json'))
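A hedged sketch of a few more of the convenience methods listed above (the URL is a placeholder and assumes S3 credentials are configured):

item = mlrun.get_dataitem("s3://demo-data/mydata.csv")
df = item.as_df()                   # read the object into a DataFrame
item.download("/tmp/mydata.csv")    # download a local copy to an explicit path
local_path = item.local()           # or let MLRun manage a temporary local copy
raw_bytes = item.get()              # read the raw object content
item.show()                         # visualize the data in Jupyter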

Artifacts#

An artifact is any data that is produced and/or consumed by functions, jobs, or pipelines.

There are several types of Artifacts. The type of the Artifact is reflected in the kind attribute of each Artifact. These types are also used for grouping the artifacts in the UI. The main kinds of artifacts are:

  • Files — Files, directories, images, figures, and plotlines

  • Datasets — Any data, such as tables and DataFrames

  • Models — All trained models

  • Feature Store Objects — Feature sets and feature vectors

Artifacts metadata is stored in the MLRun database.


Viewing artifacts#

Artifacts that are stored in certain paths (see Artifact path) can be viewed and managed in the UI. In the Project page, select the type of artifact you want to view from the left-hand menu: Feature Store (for feature-sets, feature-vectors, and features), Datasets, Models, and Artifacts (holds everything not in the other categories).

Example dataset artifact screen:

projects-artifacts

Artifacts that were generated by an MLRun job can also be viewed from the Jobs > Artifacts tab.

You can search the artifacts based on time and labels, and you can filter the artifacts by tag type. For each artifact, you can view its content, its location, the artifact type, labels, the producer of the artifact, the artifact owner, last update date, and type-specific information. You can download the artifact. You can also tag and remove tags from artifacts using the UI.

Artifact path#

Any path that is supported by MLRun can be used to store artifacts. However, only artifacts that are stored in paths that are system-configured as "allowed" in the MLRun service are visible in the UI. These are:

  • MLRun < 1.2: The allowed paths include only v3io paths

  • MLRun 1.2 and higher: Allows cloud storage paths — v3io://, s3://, az://, gcs://, gs://.
    http:// paths are not visible due to security reasons.

  • MLRun 1.5 adds support for dbfs (Databricks file system): dbfs://

Jobs use the default or job specific artifact_path parameter to determine where the artifacts are stored. The default artifact_path can be specified at the cluster level, client level, project level, or job level (at that precedence order), or can be specified as a parameter in the specific log operation.

You can set the default artifact_path for your environment using the set_environment() function.

You can override the default artifact_path configuration by setting the artifact_path parameter of the MlrunProject object, setting the artifact path for objects belonging to that project. You can use variables in the artifacts path, such as {{project}} for the name of the running project or {{run.uid}} for the current job/pipeline run UID. (The default artifacts path uses {{project}}.) The following example configures the artifacts path to an artifacts directory in the current active directory (./artifacts)

project.artifact_path='./artifacts'

For Iguazio MLOps Platform users

In the platform, the default artifacts path is the /artifacts directory in the predefined “projects” data container: /v3io/projects/<project name>/artifacts (for example, /v3io/projects/myproject/artifacts for a “myproject” project).

Saving artifacts in run-specific paths#

When you specify {{run.uid}}, the artifacts for each job are stored in a dedicated directory for each executed job. Under the artifact path, you should see the source-data file in a new directory whose name is derived from the unique run ID. Otherwise, the same artifacts directory is used in all runs, and the artifacts for newer runs override those from the previous runs.
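For example, a minimal sketch of a project-level artifact path that gives each run its own directory (the base directory is arbitrary):

# Each run writes its artifacts under ./artifacts/<run-uid>
project.artifact_path = "./artifacts/{{run.uid}}"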

As previously explained, set_environment returns a tuple with the project name and artifacts path. You can optionally save your environment's artifacts path to a variable, as demonstrated in the previous steps. You can then use the artifacts-path variable to extract paths to task-specific artifact subdirectories. For example, the following code extracts the path to the artifacts directory of a training task, and saves the path to a training_artifacts variable:

from os import path
training_artifacts = path.join(artifact_path, 'training')

Note

The artifacts path uses data store URLs, which are not necessarily local file paths (for example, s3://bucket/path). Be careful not to use such paths with general file utilities.

Artifact URIs, versioning, and metadata#

Artifacts have unique URIs in the form store://<type>/<project>/<key/path>[:tag]. The URI is automatically generated by log_artifact and can be used as input to jobs, functions, pipelines, etc.

Artifacts are versioned. Each unique version has a unique ID (uid) and can have a tag label.
When the tag is not specified, it uses the latest version.

Artifact metadata and objects can be accessed through the SDK or downloaded from the UI (as YAML files). They host common and object specific metadata such as:

  • Common metadata: name, project, updated, version info

  • How they were produced (user, job, pipeline, etc.)

  • Lineage data (sources used to produce that artifact)

  • Information about formats, schema, sample data

  • Links to other artifacts (e.g. a model can point to a chart)

  • Type-specific attributes

Artifacts can be obtained via the SDK through type-specific APIs or through generic artifact APIs.

Example artifact URLs:

store://artifacts/default/my-table
store://artifacts/sk-project/train-model:e95f757e-7959-4d66-b500-9f6cdb1f0bc7
store://feature-sets/stocks/quotes:v2
store://feature-vectors/stocks/enriched-ticker
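A hedged sketch of consuming such URIs, assuming func is an MLRun function object as in the earlier examples:

import mlrun

# Read a versioned artifact directly as a DataItem
df = mlrun.get_dataitem("store://artifacts/default/my-table").as_df()

# Or pass the URI as a job input; MLRun resolves it to the artifact data
run = func.run(name="train", inputs={"dataset": "store://artifacts/default/my-table"})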


Model Artifacts#

An essential piece of artifact management and versioning is storing a model version. This allows the users to experiment with different models and compare their performance, without having to worry about losing their previous results.

The simplest way to store a model named my_model is with the following code:

from pickle import dumps
model_data = dumps(model)
context.log_model(key='my_model', body=model_data, model_file='my_model.pkl')

You can also store any related metrics by providing a dictionary in the metrics parameter, such as metrics={'accuracy': 0.9}. Furthermore, any additional data that you would like to store along with the model can be specified in the extra_data parameter. For example extra_data={'confusion': confusion.target_path}

A convenient utility method, eval_model_v2, which calculates model metrics, is available in mlrun.mlutils.

See the example below for a simple model trained using scikit-learn (normally, you would send the data as input to the function). The last two calls in the handler evaluate the model and log it.

from sklearn import linear_model
from sklearn import datasets
from sklearn.model_selection import train_test_split
from pickle import dumps

from mlrun.execution import MLClientCtx
from mlrun.mlutils import eval_model_v2

def train_iris(context: MLClientCtx):

    # Basic scikit-learn iris SVM model
    X, y = datasets.load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)
    model = linear_model.LogisticRegression(max_iter=10000)
    model.fit(X_train, y_train)
    
    # Evaluate model results and get the evaluation metrics
    eval_metrics = eval_model_v2(context, X_test, y_test, model)
    
    # Log model
    context.log_model("model",
                      body=dumps(model),
                      artifact_path=context.artifact_subpath("models"),
                      extra_data=eval_metrics, 
                      model_file="model.pkl",
                      metrics=context.results,
                      labels={"class": "sklearn.linear_model.LogisticRegression"})

Save the code above to train_iris.py. The following code loads the function and runs it as a job. See the quick-start page to learn how to create the project and set the artifact path.

from mlrun import code_to_function
from mlrun.platforms import auto_mount

gen_func = code_to_function(name='train_iris',
                            filename='train_iris.py',
                            handler='train_iris',
                            kind='job',
                            image='mlrun/mlrun')

train_iris_func = project.set_function(gen_func).apply(auto_mount())

train_iris = train_iris_func.run(name='train_iris',
                                 handler='train_iris',
                                 artifact_path=artifact_path)

You can now use get_model to read the model and run it. This function gets the model file, metadata, and extra data. The input can be either the path of the model or the directory where the model resides. If you provide a directory, the function searches for the model file (by default it searches for .pkl files).

The following example gets the model from models_path and test data in test_set with the expected label provided as a column of the test data. The name of the column containing the expected label is provided in label_column. The example then retrieves the models, runs the model with the test data and updates the model with the metrics and results of the test data.

from pickle import load

from mlrun.execution import MLClientCtx
from mlrun.datastore import DataItem
from mlrun.artifacts import get_model, update_model
from mlrun.mlutils import eval_model_v2

def test_model(context: MLClientCtx,
               models_path: DataItem,
               test_set: DataItem,
               label_column: str):

    if models_path is None:
        models_path = context.artifact_subpath("models")
    xtest = test_set.as_df()
    ytest = xtest.pop(label_column)

    model_file, model_obj, _ = get_model(models_path)
    model = load(open(model_file, 'rb'))

    extra_data = eval_model_v2(context, xtest, ytest.values, model)
    update_model(model_artifact=model_obj, extra_data=extra_data, 
                 metrics=context.results, key_prefix='validation-')

To run the code, place the code above in test_model.py and use the following snippet. The model from the previous step is provided as the models_path:

from mlrun.platforms import auto_mount
gen_func = code_to_function(name='test_model',
                            filename='test_model.py',
                            handler='test_model',
                            kind='job',
                            image='mlrun/mlrun')

func = project.set_function(gen_func).apply(auto_mount())

run = func.run(name='test_model',
               handler='test_model',
               params={'label_column': 'label'},
               inputs={'models_path': train_iris.outputs['model'],
                       'test_set': 'https://s3.wasabisys.com/iguazio/data/iris/iris_dataset.csv'},
               artifact_path=artifact_path)

Feature store#

A feature store provides a single pane of glass for sharing all available features across the organization along with their metadata. The MLRun feature store supports security, versioning, and data snapshots, enabling better data lineage, compliance, and manageability.

As illustrated in the diagram below, feature stores provide a mechanism (Feature Sets) to read data from various online or offline sources, conduct a set of data transformations, and persist the data in online and offline storage. Features are stored and cataloged along with all their metadata (schema, labels, statistics, etc.), allowing users to compose Feature Vectors and use them for training or serving. The feature vectors are generated when needed, taking into account data versioning and time correctness (time traveling). Different function kinds (Nuclio, Spark, Dask) are used for feature retrieval, real-time engines for serving, and batch for training.


feature-store


Feature store overview#

In machine-learning scenarios, generating a new feature, called feature engineering, takes a tremendous amount of work. The same features must be used both for training, based on historical data, and for the model prediction based on the online or real-time data. This creates a significant additional engineering effort, and leads to model inaccuracy when the online and offline features do not match. Furthermore, monitoring solutions must be built to track features and results, and to send alerts upon data or model drift.

Consider a scenario in which you train a model and one of its features is a comparison of the current amount to the average amount spent during the last 3 months by the same person. Creating such a feature is easy when you have the full dataset in training, but for serving this feature must be calculated in an online manner. The "brute-force" way to address this is to have an ML engineer create an online pipeline that re-implements all the feature calculations that comprise the offline process. This is not just time-consuming and error-prone, but very difficult to maintain over time, and results in a lengthy deployment time. This is exacerbated when having to deal with thousands of features, and an increasing number of data engineers and data scientists that are creating and using the features.

Challenges managing features

With MLRun's feature store you can define features once, during training, and deploy them to serving without writing all the "glue" code. You simply create the building blocks that define the features and their integration with the offline and online storage systems used to access them.

Feature store diagram

The feature store is comprised of the following:

  • Feature — In machine-learning, a feature is an individual measurable property or characteristic of a phenomenon being observed. This can be raw data (e.g., transaction amount, image pixel, etc.) or a calculation derived from one or more other features (e.g., deviation from average, pattern on image, etc.).

  • Feature sets — A grouping of features that are ingested together and stored in a logical group. Feature sets take data from offline or online sources, build a list of features through a set of transformations, and store the resulting features, along with the associated metadata and statistics. For example, transactions could be grouped by the ID of the person performing the transfer or by the device identifier used to perform the transaction. You can also define the timestamp source in the feature set, and ingest data into the feature set.

  • Execution — A set of operations performed on the data while it is ingested. The transformation graph contains steps that represent data sources and targets, and can also include steps that transform and enrich the data that is passed through the feature set. For a deeper dive, see Feature set transformations.

  • Feature vectors — A set of features, taken from one or more feature sets. The feature vector is defined prior to model training and serves as the input to the model training process. During model serving, the feature values in the vector are obtained from an online service.

How the feature store works#

How feature store works

The common flow when working with the feature store is to first define the feature set with its source, transformation graph, and targets. (See the supported Sources and targets.) MLRun's robust transformation engine performs complex operations with just a few lines of Python code. To test the execution process, call the preview method with a sample DataFrame. This runs all operations in memory without storing the results.

Once the graph is defined, it's time to ingest the data. You can ingest data directly from a DataFrame, by calling the feature set ingest method. You can also define an ingestion process that runs as a Kubernetes job. This is useful if there is a large ingestion process, or if there is a recurrent ingestion and you want to schedule the job.

MLRun can also leverage Nuclio to perform real-time ingestion by calling the deploy_ingestion_service function. This means that during serving you can update feature values, and not just read them. For example, you can update a sliding window aggregation as part of a model serving process.

The next step is to define the feature vector. Call the get_offline_features() function to join together features across different feature sets.
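A condensed, hedged sketch of this flow (the feature set, column names, and sample data are hypothetical):

import pandas as pd
import mlrun.feature_store as fstore
from mlrun.feature_store import Entity

# A small sample DataFrame used for the preview and ingestion below
stocks_df = pd.DataFrame({"ticker": ["AAPL"], "price": [100.0], "time": [pd.Timestamp.now()]})

# Define the feature set with its entity, timestamp, and transformation graph
stocks_set = fstore.FeatureSet("stocks", entities=[Entity("ticker")], timestamp_key="time")
stocks_set.add_aggregation("price", ["min", "max"], ["1h"])

# Simulate the graph on the sample data without persisting results
fstore.preview(stocks_set, stocks_df)

# Ingest the data (in-process here; it can also run as a Kubernetes job)
stocks_set.ingest(stocks_df)

# Join features from one or more feature sets into a feature vector
vector = fstore.FeatureVector("stock-vector", features=["stocks.*"])
offline_df = vector.get_offline_features().to_dataframe()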

Ingestion engines#

MLRun supports several ingestion engines:

  • storey engine (default) is designed for real-time data (e.g. individual records) that will be transformed using Python functions and classes

  • pandas engine is designed for batch data that can fit into memory that will be transformed using Pandas dataframes. Pandas is used for testing, and is not recommended for production deployments

  • spark engine is designed for batch data.

See also transformation — engine support.

Training and serving using the feature store#


feature-store-training

Next, extract a versioned offline static dataset for training, based on the parquet target defined in the feature sets. You can train a model with the feature vector data by providing the input in the form of 'store://feature-vectors/{project}/{feature_vector_name}'.

Training functions generate models and various model statistics. Use MLRun's auto logging capabilities to store the models along with all the relevant data, metadata, and measurements.

MLRun can apply all the MLOps functionality by using the framework specific apply_mlrun() method, which manages the training process and automatically logs all the framework specific model details, data, metadata, and metrics.

The training job automatically generates a set of results and versioned artifacts (run train_run.outputs to view the job outputs).

After you validate the feature vector, use the online feature service, based on the nosql target defined in the feature set, for real-time serving. For serving, you define a serving class derived from mlrun.serving.V2ModelServer. In the class load method, call the get_online_feature_service() function with the vector name, which returns a feature service object. In the class preprocess method, call the feature service get method to get the values of those features.
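A hedged sketch of such a serving class (the vector URI, entity key, and model file suffix are placeholders):

from cloudpickle import load

import mlrun.feature_store as fstore
from mlrun.serving import V2ModelServer


class EnrichingModelServer(V2ModelServer):
    def load(self):
        # Load the model file that was logged with the model artifact
        model_file, _ = self.get_model(".pkl")
        self.model = load(open(model_file, "rb"))
        # Open an online feature service backed by the vector's nosql target
        self.feature_service = fstore.get_online_feature_service(
            "store://feature-vectors/my-project/my-vector"
        )

    def preprocess(self, request: dict, operation) -> dict:
        # Replace the entity keys in the request (e.g. [{"ticker": "AAPL"}])
        # with their latest feature values from the online feature service
        request["inputs"] = self.feature_service.get(request["inputs"])
        return request

    def predict(self, request: dict) -> list:
        return self.model.predict(request["inputs"])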

This feature store centric process, using one computation graph definition for a feature set, gives you an automatic online and offline implementation for the feature vectors with data versioning, both in terms of the actual graph that was used to calculate each data point, and the offline datasets that were created to train each model.

See more information in training with the feature store and Serving with the feature store.

Feature sets#

In MLRun, a group of features can be ingested together and stored in a logical group called a feature set. Feature sets take data from offline or online sources, build a list of features through a set of transformations, and store the resulting features along with the associated metadata and statistics.
A feature set can be viewed as a database table with multiple material implementations for batch and real-time access, along with the data pipeline definitions used to produce the features.

The feature set object contains the following information:

  • Metadata — General information which is helpful for search and organization. Examples are project, name, owner, last update, description, labels, etc.

  • Key attributes — Entity, timestamp key (optional), label column.

  • Features — The list of features along with their schema, metadata, validation policies and statistics.

  • Source — The online or offline data source definitions and ingestion policy (file, database, stream, http endpoint, etc.). See the source descriptions.

  • Transformation — The data transformation pipeline (e.g. aggregation, enrichment etc.).

  • Target stores — The type (i.e. parquet/csv or key value), location and status for the feature set materialized data. See the target descriptions.

  • Function — The type (storey, pandas, spark) and attributes of the data pipeline serverless functions.


Create a feature set#

Create a FeatureSet with the base definitions:

  • name — The feature set name is a unique name within a project.

  • entities — Each feature set must be associated with one or more index columns. When joining feature sets, the key columns are determined by the relations field if it exists, and otherwise by the entities.

Caution

Avoid using timestamps or bool as entities.

  • timestamp_key — (optional) Used for specifying the time field when joining by time.

  • engine — The processing engine type:

    • spark — Good for simple batch transformations

    • pandas — Good for simple batch transformations

    • storey — Default. Stream processing engine that can handle complex workflows and real-time sources. (Some advanced functionalities are in the Beta state.)
      See more about transformations.

  • label_column — Name of the label column (the one holding the target (y) values).

  • relations — (optional) Dictionary that defines the relations between the current feature set and other feature sets, in the form {"<my_column_name>": Entity, ...}. If the feature set's relations is None, the join is done based on the feature set entities. Relevant only for the Dask and storey (local) engines. See more about joins in Using joins in an offline feature vector.

Example:

from mlrun.feature_store import FeatureSet, Entity

# Create a basic feature set example
stocks_set = FeatureSet("stocks", entities=[Entity("ticker")])
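A slightly fuller hedged sketch that uses the other base definitions listed above (the column names are hypothetical):

measurements_set = FeatureSet(
    "measurements",
    entities=[Entity("patient_id")],
    timestamp_key="timestamp",   # used when joining by time
    engine="storey",             # the default engine
    label_column="label",        # the target (y) column
)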
Create a feature set in the UI#
  1. Select a project and press Feature store, then press Create Set.

  2. After completing the form, press Save and Ingest to start the process, or Save to save the set for later ingestion.

Create a feature set without ingesting its data#

You can define and register a feature set (and use it in a feature vector) without ingesting its data into MLRun offline targets. This supports all batch sources.

The use-case for this is when you have a large amount of data in a remote storage that is ready to be consumed by a model-training pipeline. When this feature is enabled on a feature set, data is not saved to the offline target during ingestion. Instead, when get_offline_features is called on a vector containing that feature set, that data is read directly from the source. Online targets are still ingested, and their value represents a timeslice of the offline source. Transformations are not allowed when this feature is enabled: no computation graph, no aggregations, etc. Enable this feature by including passthrough=True in the feature set definition. All three ingestion engines (Storey, Spark, Pandas) are supported, as well as the retrieval engines "local" and "spark".

Typical code, from defining the feature set through ingesting its data:

import mlrun.feature_store as fstore
from mlrun.feature_store import Entity
from mlrun.datastore.sources import CSVSource

# Flag the feature set as passthrough
my_fset = fstore.FeatureSet("my_fset", entities=[Entity("patient_id")], timestamp_key="timestamp", passthrough=True)
csv_source = CSVSource("my_csv", path="data.csv", time_field="timestamp")
# Ingest the source data, but only to the online/nosql target
my_fset.ingest(csv_source)
vector = fstore.FeatureVector("myvector", features=["my_fset.*"])
# Read the offline data directly from the csv source
resp = vector.get_offline_features(entity_timestamp_column="timestamp", with_indexes=True)

Add transformations#

Define the data processing steps using a transformations graph (DAG).

A feature set data pipeline takes raw data from online or offline sources and transforms it to meaningful features. The MLRun feature store supports three processing engines (storey, pandas, spark) that can run in the client (e.g. Notebook) for interactive development or in elastic serverless functions for production and scale.

The data pipeline is defined using MLRun graph (DAG) language. Graph steps can be pre-defined operators (such as aggregate, filter, encode, map, join, impute, etc.) or custom python classes/functions. Read more about the graph in Real-time serving pipelines (graphs).

The results from the transformation pipeline are stored in one or more material targets. Data for offline access, such as training, is usually stored in Parquet files. Data for online access such as serving is stored in the Iguazio NoSQL DB (NoSqlTarget). You can use the default targets or add/replace with additional custom targets.

Graph example (storey engine):

import mlrun.feature_store as fstore
from mlrun.feature_store import Entity

feature_set = fstore.FeatureSet("measurements", entities=[Entity(key)], timestamp_key="timestamp")
# Define the computational graph, including the custom steps
# (DropColumns and RenameColumns are user-defined step classes)
feature_set.graph.to(DropColumns(drop_columns))\
                 .to(RenameColumns(mapping={'bad': 'bed'}))
feature_set.add_aggregation('hr', ['avg'], ["1h"])
feature_set.plot()
feature_set.ingest(data_df)

Graph example (pandas engine):

import mlrun.feature_store as fstore
from mlrun.feature_store import Entity

def myfunc1(df, context=None):
    df = df.drop(columns=["exchange"])
    return df

stocks_set = fstore.FeatureSet("stocks", entities=[Entity("ticker")], engine="pandas")
stocks_set.graph.to(name="s1", handler="myfunc1")
df = stocks_set.ingest(stocks_df)

The graph steps can use built-in transformation classes, simple python classes, or function handlers.

See more details in Feature set transformations.

Simulate and debug the data pipeline with a small dataset#

During the development phase it's common to check the feature set definition and to simulate the creation of the feature set before ingesting the entire dataset, since ingesting the entire dataset can take time.
This gives you a preview of the results (in the returned dataframe). The simulation method is called preview. It previews the source data schema and processes the graph logic (assuming there is one) on a small subset of data. The preview operation also learns the feature set schema and performs statistical analysis on the result by default.

df = fstore.preview(quotes_set, quotes)

# print the feature statistics
print(quotes_set.get_stats_table())

Feature set transformations#

A feature set contains an execution graph of operations that are performed when data is ingested, or when simulating data flow for inferring its metadata. This graph utilizes MLRun's Real-time serving pipelines (graphs).

The graph contains steps that represent data sources and targets, and may also contain steps whose purpose is transformations and enrichment of the data passed through the feature set. These transformations can be provided in one of three ways:

  • Aggregations — MLRun supports adding aggregate features to a feature set through the add_aggregation() function.

  • Built-in transformations — MLRun is equipped with a set of transformations provided through the storey.transformations package. These transformations can be added to the execution graph to perform common operations and transformations.

  • Custom transformations — You can extend the built-in functionality by adding new classes that perform any custom operation and use them in the serving graph.

Once a feature-set is created, its internal execution graph can be observed by calling the feature-set's plot() function, which generates a graphviz plot based on the internal graph. This is very useful when running within a Jupyter notebook, and produces a graph such as the following example:


feature-store-graph

This plot shows various transformations and aggregations being used as part of the feature-set processing, as well as the targets where results are saved to (in this case two targets). Feature-sets can also be observed in the MLRun UI, where the full graph can be seen and specific step properties can be observed:


ui-feature-set-graph

For a full end-to-end example of feature-store and usage of the functionality described in this page, refer to the feature store example.


Aggregations#

Aggregations, being a common tool in data preparation and ML feature engineering, are available directly through the MLRun FeatureSet class. These transformations add a new feature to the feature-set, which is created by performing an aggregate function over the feature's values.

If the name parameter is not specified, features are generated in the format {column_name}_{operation}_{window}.
If you supply the optional name parameter, features are generated in the format {name}_{operation}_{window}.

Feature names, which are generated internally, must match this regex pattern to be treated as aggregations: .*_[a-z]+_[0-9]+[smhd]$,
where [a-z]+ is the name of an aggregation.

Warning

You must ensure that your features will not conflict with the automatically generated feature names. For example, when using add_aggregation() on a feature X, you may get a generated feature name of X_count_1h. But if your dataset already contains X_count_1h, this would result in either unreliable aggregations or errors.

If either the pattern or the condition is not met, the feature is treated as a static (or "regular") feature.

These features can be fed into predictive models or can be used for additional processing and feature generation.

Notes

  • Internally, the graph step that is created to perform these aggregations is named "Aggregates". If more than one aggregation step is needed, a unique name must be provided for each, using the step_name parameter.

  • The timestamp column must be part of the feature set definition (for aggregation).

Aggregations that are supported using this function are:

  • count

  • sum

  • sqr (sum of squares)

  • max

  • min

  • first

  • last

  • avg

  • stdvar (variance)

  • stddev (standard deviation)

For full description of this function, see the add_aggregation() documentation.

Windows#

You can use aggregation for time-based sliding windows and fixed windows. In general, sliding windows are used for real time data, while fixed windows are used for historical aggregations.

A window can be measured in years, days, hours, seconds, minutes. A window can be a single window, e.g. ‘1h’, ‘1d’, or a list of same unit windows e.g. [‘1h’, ‘6h’]. If you define the time period (in addition to the window), then you have a sliding window. If you don't define the time period, then the time period and the window are the same. All time windows are aligned to the epoch (1970-01-01T00:00:00Z).

  • Sliding window

    Sliding windows are fixed-size, overlapping, windows (defined by windows) that are evaluated at a sliding interval (defined by period).
    The period size must be an integral divisor of the window size.

    The following figure illustrates sliding windows of size 20 seconds, and periods of 10 seconds. Since the period is less than the window size, the windows contain overlapping data. In this example, events E4-E6 are in Windows 1 and 2. When Window 2 is evaluated at time t = 30 seconds, events E4-E6 are dropped from the event queue.

    _images/sliding-window.png

    The following code illustrates a feature-set that contains stock trading data including the specific bid price for each bid at any given time. You can add aggregate features that show the minimal and maximal bidding price over all the bids in the last 60 minutes, evaluated (sliding) at a 10 minute interval, per stock ticker (which is the entity in question).

    import mlrun.feature_store as fstore
    # create a new feature set
    quotes_set = fstore.FeatureSet("stock-quotes", entities=[fstore.Entity("ticker")])
    quotes_set.add_aggregation("bid", ["min", "max"], ["1h"], "10m", name="price")
    

    This code generates two new features: bid_min_1h and bid_max_1h every 10 minutes.

  • Fixed window

    A fixed window has a fixed-size, is non-overlapping, and gapless. A fixed time window is used for aggregating over a time period, (or day of the week). For example, how busy is a restaurant between 1 and 2 pm.
    When using a fixed window, each record in an in-application stream belongs to a specific window. The record is processed only once (when the query processes the window to which the record belongs).

    _images/fixed-window.png

    To define a fixed window, omit the time period. Using the above example, but for a fixed window:

    import mlrun.feature_store as fstore
    # create a new feature set
    quotes_set = fstore.FeatureSet("stock-quotes", entities=[fstore.Entity("ticker")])
    quotes_set.add_aggregation("bid", ["min", "max"], ["1h"] name="price")
    

    This code generates two new features: bid_min_1h and bid_max_1h once per hour.

Built-in transformations#

MLRun, and the associated storey package, have a built-in library of transformation functions that can be applied as steps in the feature-set's internal execution graph. To add steps to the graph, reference them from the FeatureSet object by using the graph property. Then, new steps can be added to the graph using the functions in storey.transformations (follow the link to browse the documentation and the list of existing functions). The transformations are also accessible directly from the storey module.

See the built-in steps.

Note

Internally, MLRun makes use of functions defined in the storey package for various purposes. When creating a feature-set and configuring it with sources and targets, what MLRun does behind the scenes is to add steps to the execution graph that wraps methods and classes that perform the actions. When defining an async execution graph, storey classes are used. For example, when defining a Parquet data-target in MLRun, a graph step is created that wraps storey's ParquetTarget function.

To use a function:

  1. Access the graph from the feature-set object, using the graph property.

  2. Add steps to the graph using the various graph functions, such as to(). The function object passed to the step should point at the transformation function being used.

The following is an example of adding a simple filter to the graph that drops any bid lower than 50 USD:

quotes_set.graph.to("storey.Filter", "filter", _fn="(event['bid'] > 50)")

In the example above, the parameter _fn denotes a callable expression that is passed to the storey.Filter class as the parameter fn. The callable parameter can also be a Python function, in which case there's no need for parentheses around it. This call generates a step in the graph called filter that calls the expression provided with the event being propagated through the graph as the data is fed to the feature-set.

Custom transformations#

When a transformation is needed that is not provided by the built-in functions, new classes that implement transformations can be created and added to the execution graph. Such classes should extend the MapClass class, and the actual transformation should be implemented within their do() function, which receives an event and returns the event after performing transformations and manipulations on it. For example, consider the following code:

from storey import MapClass

class MyMap(MapClass):
    def __init__(self, multiplier=1, **kwargs):
        super().__init__(**kwargs)
        self._multiplier = multiplier

    def do(self, event):
        event["multi"] = event["bid"] * self._multiplier
        return event

The MyMap class can then be used to construct graph steps, in the same way as shown above for built-in functions:

quotes_set.graph.add_step("MyMap", "multi", after="filter", multiplier=3)

This uses the add_step function of the graph to add a step called multi utilizing MyMap after the filter step that was added previously. The class is initialized with a multiplier of 3.

Supporting multiple engines#

MLRun supports multiple processing engines for executing graphs. These engines differ in the way they invoke graph steps. When implementing custom transformations, the code has to support all engines that are expected to run it.

Note

The vast majority of MLRun's built-in transformations support all engines. The support matrix is available here.

The following are the main differences between transformation steps executing on different engines:

  • storey - the step receives a single event (either as a dictionary or as an Event object, depending on whether full_event is configured for the step). The step is expected to process the event and return the modified event.

  • spark - the step receives a Spark dataframe object. Steps are expected to add their processing and calculations to the dataframe (either in-place or not) and return the resulting dataframe without materializing the data.

  • pandas - the step receives a Pandas dataframe, processes it, and returns the dataframe.

To support multiple engines, extend the MLRunStep class with a custom transformation. This class allows implementing engine-specific code by overriding the following methods: _do_storey(), _do_pandas() and _do_spark(). To add support for a given engine, the relevant do method needs to be implemented.

When a graph is executed, each step is a single instance of the relevant class that gets invoked as events flow through the graph. For spark and pandas engines, this only happens once per ingestion, since the entire data-frame is fed to the graph. For the storey engine the same instance's _do_storey() function will be invoked per input row. As the graph is initialized, this class instance can receive global parameters in its __init__ method that determines its behavior.

The following example class multiplies a feature by a value and adds it to the event. (For simplicity, data type checks and validations were omitted as well as needed imports.) Note that the class also extends StepToDict - this class implements generic serialization of graph steps to a python dictionary. This functionality allows passing instances of this class to graph.to() and graph.add_step():

class MultiplyFeature(StepToDict, MLRunStep):
    def __init__(self, feature: str, value: int, **kwargs):
        super().__init__(**kwargs)
        self._feature = feature
        self._value = value
        self._new_feature = f"{feature}_times_{value}"

    def _do_storey(self, event):
        # event is a single row represented by a dictionary
        event[self._new_feature] = event[self._feature] * self._value  
        return event

    def _do_pandas(self, event):
        # event is a pandas.DataFrame
        event[self._new_feature] = event[self._feature].multiply(self._value)
        return event

    def _do_spark(self, event):
        # event is a pyspark.sql.DataFrame
        return event.withColumn(self._new_feature, 
                                col(self._feature) * lit(self._value)
                                )

The following example uses this step in a feature-set graph with the pandas engine. This example adds a feature called number1_times_4 with the value of the number1 feature multiplied by 4. Note how the global parameters are passed when creating the graph step:

import mlrun.feature_store as fstore

feature_set = fstore.FeatureSet("fs-new", 
                                entities=[fstore.Entity("id")], 
                                engine="pandas",
                                )
# Adding multiply step, with specific parameters
feature_set.graph.to(MultiplyFeature(feature="number1", value=4))
df_pandas = feature_set.ingest(data)

Data transformation steps#

The following table lists the available data-transformation steps and details which ingestion engines support each of them.

| Class name | Description | Storey | Spark | Pandas |
| --- | --- | --- | --- | --- |
| mlrun.feature_store.FeatureSet.add_aggregation() | Aggregates the data into the table object provided for later persistence, and outputs an event enriched with the requested aggregation features. | Y (not supported with the online target SQLTarget) | Y | N |
| mlrun.feature_store.steps.DateExtractor() | Extract a date-time component. | Y | N (supports part extraction, e.g. day_of_week, but not boolean parts, e.g. is_leap_year) | Y |
| mlrun.feature_store.steps.DropFeatures() | Drop features from the feature list. | Y | Y | Y |
| mlrun.feature_store.steps.Imputer() | Replace None values with default values. | Y | Y | Y |
| mlrun.feature_store.steps.MapValues() | Map column values to new values. | Y | Y | Y |
| mlrun.feature_store.steps.OneHotEncoder() | Create new binary fields, one per category (one hot encoded). | Y | Y | Y |
| mlrun.feature_store.steps.SetEventMetadata() | Set the event metadata (id, key, timestamp) from the event body. | Y | N | N |
| mlrun.feature_store.steps.FeaturesetValidator() | Validate feature values according to the feature set validation policy. | Y | N | Y |
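For example, a minimal sketch of chaining two of the built-in steps above onto a feature set graph (the feature set and column names are illustrative, and the DropFeatures parameter name is assumed):

from mlrun.feature_store.steps import MapValues, DropFeatures

# Map unknown age values to "0", then drop the raw device column
feature_set.graph.to(
    MapValues(mapping={"age": {"U": "0"}}, with_original_features=True)
).to(DropFeatures(features=["device"]))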

Creating and using feature vectors#

You can define a group of features from different feature sets as a FeatureVector.
Feature vectors are used as an input for models, allowing you to define the feature vector once, and in turn create and track the datasets created from it or the online manifestation of the vector for real-time prediction needs.

The feature vector handles all the merging logic for you, using an as-of merge that accounts for both the time and the entity. It ensures that all the latest relevant data is fetched, without concerns about "seeing the future" or other common time-related errors.

After a feature vector is saved, it can be used to create both offline (static) datasets and online (real-time) instances to supply as input to a machine learning model.


Creating a feature vector#

The feature vector object minimally holds the following information:

  • Name — the feature vector's name, as it will later be addressed in the store reference store://feature_vectors/<project>/<feature-vector-name> and shown in the UI (after saving the vector).

  • Description — a string description of the feature vector.

  • Features — a list of features that comprise the feature vector.
    The feature list is defined by specifying the <feature-set>.<feature-name> for specific features or <feature-set>.* for all of the feature set's features.

  • Label feature — the feature that is the label for this specific feature vector, as a <feature-set>.<feature-name> string specification. In classification tasks, the label_feature may contain the expected label of each record, and be compared with the model output when training or evaluating a model.

Example of creating a feature vector:

import mlrun.feature_store as fstore

# Feature vector definitions
feature_vector_name = 'example-fv'
feature_vector_description = 'Example feature vector'
features = ['data_source_1.*', 
            'data_source_2.feature_1', 
            'data_source_2.feature_2',
            'data_source_3.*']
label_feature = 'label_source_1.label_feature'

# Feature vector creation
fv = fstore.FeatureVector(name=feature_vector_name,
                          features=features,
                          label_feature=label_feature,
                          description=feature_vector_description)

# Save the feature vector in the MLRun DB
# so it can be referenced by the `store://`
# and show in the UI
fv.save()

After saving the feature vector, it appears in the UI:

[Image: feature-store-vector-line]

You can also view some metadata about the feature vector, including all the features, their types, a preview, and statistics:

[Image: feature-store-vector-screen]
Feature vectors with different entities and complex joins#

Note

Tech Preview

You can define a feature vector that joins feature sets that do not share the same entity, using "complex" join types. The join types can differ for different feature-set combinations. This configuration is supported for both online and offline feature vectors.

You can define relations within a feature set in two ways:

  • Explicitly defining relations within the feature set itself.

  • Specifying relations in the context of a feature vector by passing them through the relations parameter (FeatureVector). This is a dictionary specifying the relations between feature sets in the feature vector. The keys of the dictionary are feature set names, and each value is itself a dictionary whose keys are column names of that feature set and whose values are the target entities to join with. These relations take precedence over the relations that were specified on the feature sets themselves. If a specific feature set is not mentioned as a key in relations, the function falls back to the default relations defined in that feature set.

You can define a graph using the join_graph parameter (FeatureVector()), which defines the join type. You can use the graph to define complex joins and pass on the relations to the vector. Currently, only one branch (DAG) is supported. This means that operations involving brackets are not available.

You can merge two feature sets where the left one has more entities, but only if all the entities of the right feature set exist among the left feature set's entities.

When using a left join, you must explicitly specify whether you want to perform an as_of join or not. The left join type is the only one that implements the "as_of" join.

An example, assuming three feature sets fs_1, fs_2, and fs_3:

join_graph = JoinGraph(first_feature_set=fs_1).inner(fs_2).outer(fs_3)
vector = FeatureVector("myvector", features,
                       join_graph=join_graph,
                       # the relations between fs_1 -> fs_3 and fs_2 -> fs_3 are either already
                       # defined on the feature sets, or the sets share the same entity
                       relations={fs_1: {'col_1': 'entity_2'}})
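A minimal sketch of a left join in the same style (the asof_join parameter name is assumed here; it makes the required as-of choice explicit, as described above):

join_graph = JoinGraph(first_feature_set=fs_1).left(fs_2, asof_join=True).inner(fs_3)
vector = FeatureVector("myvector", features, join_graph=join_graph)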

Using an offline feature vector#

Use the feature store's get_offline_features() function to produce a dataset from the feature vector. It creates the dataset (asynchronously if possible), saves it to the requested target, and returns an OfflineVectorResponse.
Due to the async nature of this action, the response object contains an fv_response.status indicator; once the status is completed, the response can be turned directly into a dataframe, Parquet file, or CSV.

get_offline_features supports Storey, Dask, Spark Operator, and Remote Spark.

See get_offline_features() for the list of parameters it expects to receive.

You can create a feature vector that comprises different feature sets, while joining the data based on specific fields and not the entity. For example:

  • Feature set A is a transaction feature set and one of the fields is email.

  • Feature set B is a feature set with the fields email and count distinct. You can build a feature vector that comprises fields from feature set A and gets the count distinct for the email from feature set B. The join in this case is based on the email column.

Here's an example of a new dataset from a Parquet target:

# Import the Parquet Target, so you can build your dataset from a parquet file
from mlrun.datastore.targets import ParquetTarget

# Get offline feature vector based on vector and parquet target
offline_fv = fstore.get_offline_features(feature_vector_name, target=ParquetTarget())

# Return dataset
dataset = offline_fv.to_dataframe()

After you create an offline feature vector with a static target (such as ParquetTarget()) the reference to this dataset is saved as part of the feature vector's metadata and can be referenced directly through the store as a function input using store://feature-vectors/{project}/{feature_vector_name}.

For example:

fn = mlrun.import_function('hub://sklearn-classifier').apply(mlrun.auto_mount())

# Define the training task, including the feature vector and label
task = mlrun.new_task('training', 
                      inputs={'dataset': f'store://feature-vectors/{project}/{feature_vector_name}'},
                      params={'label_column': 'label'}
                     )

# Run the function
run = fn.run(task)

See a full example of using the offline feature vector to create an ML model in part 2 of the end-to-end demo.

You can use get_offline_features for a feature vector whose data is not ingested. See Create a feature set without ingesting its data.

Using joins in an offline feature vector#

You can create a join for:

  • Feature sets that have a common entity

  • Feature sets that do not have a common entity

Feature sets that have a common entity

In this case, the join is performed on the common entity.

employees_set_entity = fs.Entity("id")
employees_set = fs.FeatureSet(
    "employees",
    entities=[employees_set_entity],
)
employees_set.set_targets(targets=["parquet"], with_defaults=False)
fs.ingest(employees_set, employees)

mini_employees_set = fs.FeatureSet(
    "mini-employees",
    entities=[employees_set_entity],
)
mini_employees_set.set_targets(targets=["parquet"], with_defaults=False)
fs.ingest(mini_employees_set, employees_mini)

features = ["employees.name as n", "mini-employees.name as mini_name"]

vector = fs.FeatureVector(
    "mini-emp-vec", features, description="Employees feature vector"
)
vector.save()

resp = fs.get_offline_features(
    vector,
    engine_args=engine_args,
    with_indexes=True,
)

Feature sets that do not have a common entity

In this case, you define the relations between the feature sets with the argument relations={column_name(str): Entity},
and you include this dictionary when initializing the feature set.

departments_set_entity = fs.Entity("d_id")
departments_set = fs.FeatureSet(
    "departments",
    entities=[departments_set_entity],
)

departments_set.set_targets(targets=["parquet"], with_defaults=False)
fs.ingest(departments_set, departments)

employees_set_entity = fs.Entity("id")
employees_set = fs.FeatureSet(
    "employees",
    entities=[employees_set_entity],
    relations={"department_id": departments_set_entity},  # dictionary where the key is str identifying a column/feature on this feature-set, and the dictionary value is an Entity object on another feature-set
)
employees_set.set_targets(targets=["parquet"], with_defaults=False)
fs.ingest(employees_set, employees)
features = ["employees.name as emp_name", "departments.name as dep_name"]

vector = fs.FeatureVector(
    "employees-vec", features, description="Employees feature vector"
)

resp = fs.get_offline_features(
    vector,
    engine_args=engine_args,
    with_indexes=False,
)

Using an online feature vector#

The online feature vector provides real-time feature vectors to the model using the latest data available.

First create an Online Feature Service using get_online_feature_service(). Then feed the Entity of the feature vector to the service and receive the latest feature vector.

To create the OnlineVectorService you only need to pass it the feature vector's store reference.

import mlrun.feature_store as fstore

# Create the Feature Vector Online Service
feature_vector = f'store://feature-vectors/{project}/{feature_vector_name}'
svc = fstore.get_online_feature_service(feature_vector)

The online feature service supports value imputing (substituting NaN/Inf values with a statistical or constant value). You can set the impute_policy parameter with the imputing policy, specifying which constant or statistical value is used instead of NaN/Inf values. This can be defined per column or for all columns ("*"). The replaced value can be a fixed number for constants, or $mean, $max, $min, $std, $count for statistical values. "*" specifies the default for all features, for example:

svc = fstore.get_online_feature_service(feature_vector, impute_policy={"*": "$mean", "age": 33})

To use the online feature service you need to supply a list of entities you want to get the feature vectors for. The service returns the feature vectors as a dictionary of {<feature-name>: <feature-value>} or simply a list of values as numpy arrays.

For example:

# Define the wanted entities
entities = [{<feature-vector-entity-column-name>: <entity>}]

# Get the feature vectors from the service
svc.get(entities)

The entities can be a list of dictionaries as shown in the example, or a list of lists where the values in the internal list correspond to the entity values (e.g. entities = [["Joe"], ["Mike"]]). The .get() method returns a dict by default. If you want to return an ordered list of values, set the as_list parameter to True. The list input is required by many ML frameworks and this eliminates additional glue logic.
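For example, a minimal sketch that requests the response as an ordered list of values, reusing the list-of-lists entity format shown above:

# Return the feature values as an ordered list instead of a dictionary
resp = svc.get([["Joe"], ["Mike"]], as_list=True)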

When defining a graph using the join_graph parameter (FeatureVector()), the get_online_feature_service uses QueryByKey on the kv store: all join types in the graph turn into left joins. Consequently, the function performs joins using the latest events for each required entity within each feature set.

You can use the parameter entity_keys to join features by relations, instead of common entities. You define the relations, and the starting place. See get_online_feature_service().

See a full example of using the online feature service inside a serving function in part 3 of the end-to-end demo.

Sources and targets#

Sources#

For batch ingestion the feature store supports dataframes and files (i.e. CSV and Parquet).
The files can reside on S3, NFS, SQL (for example, MySQL), Azure blob storage, or the Iguazio platform. MLRun also supports Google BigQuery as a data source.

For real-time ingestion the source can be HTTP, Kafka, MySQL, a V3IO stream, etc. When you define a source, it is mapped to a Nuclio event trigger.

You can also create a custom source to access various databases or data sources.
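For example, a minimal sketch of batch ingestion with one of the built-in sources listed below (the file path and names are hypothetical):

import mlrun.feature_store as fstore
from mlrun.datastore.sources import CSVSource

# Read a CSV file from S3 as the batch source for ingestion
source = CSVSource("my_csv", path="s3://my-bucket/data/transactions.csv")
feature_set = fstore.FeatureSet("my_fs", entities=[fstore.Entity("id")])
df = fstore.ingest(feature_set, source=source)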

| Class name | Description | storey | spark | pandas |
| --- | --- | --- | --- | --- |
| BigQuerySource | Batch. Reads Google BigQuery query results as the input source for a flow. | N | Y | Y |
| SnowFlakeSource | Batch. Reads Snowflake query results as the input source for a flow. | N | Y | N |
| SQLSource | Batch. Reads SQL query results as the input source for a flow. | Y | N | Y |
| CSVSource | Batch. Reads a CSV file as the input source for a flow. | Y | Y | Y |
| DataframeSource | Batch. Reads a data frame as the input source for a flow. | Y | N | N |
| ParquetSource | Batch. Reads a Parquet file/dir as the input source for a flow. | Y | Y | Y |
| HttpSource | Event-based. Sets the HTTP-endpoint source for the flow. | Y | N | N |
| Apache Kafka source and Confluent Kafka source | Event-based. Sets the Kafka source for the flow. | Y | N | N |
| StreamSource | Event-based. Sets the stream source for the flow. If the stream doesn't exist, it creates it. | Y | N | N |

S3/Azure source#

When working with S3/Azure, there are additional installation requirements. Install the relevant extras with: pip install mlrun[s3], pip install mlrun[azure-blob-storage], or pip install mlrun[google-cloud-storage].

  • Azure: define the environment variable AZURE_STORAGE_CONNECTION_STRING

  • S3: define AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY and AWS_BUCKET
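For example, a minimal sketch of providing these credentials from Python before defining the source (the values are placeholders):

import os

# Illustrative only: set the environment variables expected by the relevant data store
os.environ["AZURE_STORAGE_CONNECTION_STRING"] = "<connection-string>"  # Azure
os.environ["AWS_ACCESS_KEY_ID"] = "<access-key-id>"                    # S3
os.environ["AWS_SECRET_ACCESS_KEY"] = "<secret-access-key>"            # S3
os.environ["AWS_BUCKET"] = "<bucket-name>"                             # S3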

SQL source#

Note

Tech Preview

Limitation

Do not use SQL reserved words as entity names. See more details in Keywords and Reserved Words.

SQLSource can be used for both batch ingestion and real-time ingestion. It supports storey but does not support Spark. To configure it, pass the db_uri or overwrite the MLRUN_SQL__URL env var, in this format:
mysql+pymysql://<username>:<password>@<host>:<port>/<db_name>, for example:

source = SQLSource(table_name='my_table',
                   db_path="mysql+pymysql://abc:abc@localhost:3306/my_db",
                   key_field='key',
                   parse_dates=['timestamp'])

feature_set = fs.FeatureSet("my_fs", entities=[fs.Entity('key')])
feature_set.set_targets([])
df = fs.ingest(feature_set, source=source)

Apache Kafka source#

Example:

from mlrun.datastore.sources import KafkaSource

with open('/v3io/bigdata/name.crt') as x:
    caCert = x.read()

kafka_source = KafkaSource(
    brokers=['default-tenant.app.vmdev76.lab.iguazeng.com:9092'],
    topics="stocks-topic",
    initial_offset="earliest",
    group="my_group",
)

run_config = fstore.RunConfig(local=False).apply(mlrun.auto_mount())

stocks_set_endpoint = stocks_set.deploy_ingestion_service(source=kafka_source, run_config=run_config)

Confluent Kafka source#

Note

Tech Preview

Example:

from mlrun.datastore.sources import KafkaSource

with open('/v3io/bigdata/name.crt') as x:
    caCert = x.read()

kafka_source = KafkaSource(
    brokers=['server-1:9092',
             'server-2:9092',
             'server-3:9092',
             'server-4:9092',
             'server-5:9092'],
    topics=["topic-name"],
    initial_offset="earliest",
    group="test",
    attributes={"sasl": {
                    "enable": True,
                    "password": "pword",
                    "user": "user",
                    "handshake": True,
                    "mechanism": "SCRAM-SHA-256"},
                "tls": {
                    "enable": True,
                    "insecureSkipVerify": False},
                "caCert": caCert},
)

run_config = fstore.RunConfig(local=False).apply(mlrun.auto_mount())

stocks_set_endpoint = stocks_set.deploy_ingestion_service(source=kafka_source, run_config=run_config)

Targets#

By default, the feature sets are saved in Parquet and the Iguazio NoSQL DB (NoSqlTarget).
The Parquet file is ideal for fetching large sets of data for training, while the key-value store is ideal for an online application since it supports low-latency data retrieval based on key access.

Note

When working with the Iguazio MLOps platform the default feature set storage location is under the "Projects" container: <project name>/fs/.. folder. The default location can be modified in mlrun config or specified per ingest operation. The parquet/csv files can be stored in NFS, S3, Azure blob storage, Redis, SQL, and on Iguazio DB/FS.

| Class name | Description | storey | spark | pandas |
| --- | --- | --- | --- | --- |
| CSVTarget | Offline. Writes events to a CSV file. | Y | Y | Y |
| KafkaTarget | Offline. Writes all incoming events into a Kafka stream. | Y | N | N |
| ParquetTarget | Offline. The Parquet target storage driver, used to materialize feature set/vector data into parquet files. | Y | Y | Y |
| StreamTarget | Offline. Writes all incoming events into a V3IO stream. | Y | N | N |
| NoSqlTarget | Online. Persists the data in a V3IO table to its associated storage by key. | Y | Y | Y |
| RedisNoSqlTarget | Online. Persists the data in a Redis table to its associated storage by key. | Y | Y | N |
| SqlTarget | Online. Persists the data in an SQL table to its associated storage by key. | Y | N | Y |

ParquetTarget#

ParquetTarget() is the default target for offline data. The Parquet file is ideal for fetching large sets of data for training.

Partitioning#

When writing data to a ParquetTarget, you can use partitioning. Partitioning organizes data in Parquet files by dividing large data sets into smaller and more manageable pieces. The data is divided into separate files according to specific criteria, for example: date, time, or specific values in a column. Partitioning, when configured correctly, improves read performance by reducing the amount of data that needs to be processed for any given function, for example, when reading back a limited time range with get_offline_features().

When using the pandas engine for ingestion, pandas incurs a maximum limit of 1024 partitions on each ingestion. If the data being ingested spans over more than 1024 partitions, the ingestion fails. Decrease the number of partitions by filtering the time (for example, using start_filter/end_filter of the ParquetSource()), and/or increasing the time_partitioning_granularity.

Storey processes the data row by row (as a streaming engine, it doesn't get all the data up front, so it needs to process row by row). These rows are batched together according to the partitions defined, and they are written to each partition separately. (Therefore, storey does not have the 1024 partitions limitation.)

Spark does not have the partitions limitation, either.

Configure partitioning with:

  • partitioned — Optional. Whether to partition the file. False by default. If True without passing any other partition fields, the data is partitioned by /year/month/day/hour.

  • key_bucketing_number — Optional. None by default: does not partition by key. 0 partitions by the key as is. Any other number "X" creates X partitions and hashes the keys to one of them.

  • partition_cols — Optional. Name of columns from the data to partition by.

  • time_partitioning_granularity — Optional. The smallest time unit to partition the data by, in the format /year/month/day/hour (default). For example “hour” yields the smallest possible partitions.

For example:

  • ParquetTarget() partitions by year/month/day/hour/

  • ParquetTarget(partition_cols=[]) writes to a directory without partitioning

  • ParquetTarget(partition_cols=["col1", "col2"]) partitions by col1/col2/

  • ParquetTarget(time_partitioning_granularity="day") partitions by year/month/day/

  • ParquetTarget(partition_cols=["col1", "col2"], time_partitioning_granularity="day") partitions by col1/col2/year/month/day/

Disable partitioning with:

  • ParquetTarget(partitioned=False)
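For example, a minimal sketch of attaching a partitioned Parquet target to an existing feature set (the feature set and column name are assumed):

from mlrun.datastore.targets import ParquetTarget

# Partition the offline output by a data column and then by year/month/day
target = ParquetTarget(partition_cols=["category"], time_partitioning_granularity="day")
feature_set.set_targets(targets=[target], with_defaults=False)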

NoSql target#

The NoSqlTarget() is a V3IO key-value based target. It is the default target for online (real-time) data. It supports low latency data retrieval based on key access, making it ideal for online applications.

The combination of a NoSQL target with the storey engine does not support features of type string with a value containing both quote (') and double-quote (").

Redis target#

Note

Tech Preview

The Redis online target is called, in MLRun, RedisNoSqlTarget. The functionality of the RedisNoSqlTarget is identical to the NoSqlTarget except for:

  • The RedisNoSqlTarget accepts the path parameter in the form: <redis|rediss>://<host>[:port] For example: rediss://localhost:6379 creates a redis target, where:

    • The client/server protocol (rediss) is TLS protected (vs. "redis" if no TLS is established)

    • The server location is localhost port 6379.

  • If the path parameter is not set, it tries to fetch it from the MLRUN_REDIS__URL environment variable.

  • You cannot pass the username/password as part of the URL. If you want to provide the username/password, use secrets as: <prefix_>REDIS_USER <prefix_>REDIS_PASSWORD where <prefix> is the optional RedisNoSqlTarget credentials_prefix parameter.

  • Two types of Redis servers are supported: StandAlone and Cluster (no need to specify the server type in the config).

  • A feature set supports one online target only. Therefore RedisNoSqlTarget and NoSqlTarget cannot be used as two targets of the same feature set.

The K8s secrets are not available when executing locally (from the SDK). Therefore, if a RedisNoSqlTarget with a secret is used, you must add the secret as an environment variable.

To use the Redis online target store, you can either change the default to be parquet and Redis, or you can specify the Redis target explicitly each time with the path parameter, for example:
RedisNoSqlTarget(path ="redis://1.2.3.4:6379")
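For example, a minimal sketch of setting Redis (together with a Parquet offline target) as the targets of an existing feature set (the Redis address is hypothetical):

from mlrun.datastore.targets import ParquetTarget, RedisNoSqlTarget

feature_set.set_targets(
    targets=[ParquetTarget(), RedisNoSqlTarget(path="redis://1.2.3.4:6379")],
    with_defaults=False,
)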

SQL target#

Note

Tech Preview

Limitation

Do not use SQL reserved words as entity names. See more details in Keywords and Reserved Words.

The SQLTarget online target supports storey but does not support Spark. Aggregations are not supported.
To configure, pass the db_uri or overwrite the MLRUN_SQL__URL env var, in this format:
mysql+pymysql://<username>:<password>@<host>:<port>/<db_name>

You can pass the schema and the name of the table you want to create or the name of an existing table, for example:

target = SQLTarget(
    table_name='my_table',
    schema={'id': str, 'age': int, 'time': pd.Timestamp, ...},
    create_table=True,
    primary_key_column='id',
    parse_dates=["time"],
)
feature_set = fs.FeatureSet("my_fs", entities=[fs.Entity('id')])
fs.ingest(feature_set, source=df, targets=[target])

Feature store end-to-end demo#

This demo shows the usage of MLRun and the feature store:

Fraud prevention, specifically, is a challenge since it requires processing raw transactions and events in real-time and being able to quickly respond and block transactions before they occur. Consider, for example, a case where you would like to evaluate the average transaction amount. When training the model, it is common to take a DataFrame and just calculate the average. However, when dealing with real-time/online scenarios, this average has to be calculated incrementally.

This demo illustrates how to ingest different data sources to the Feature Store. Specifically, it covers two types of data:

  • Transactions: Monetary activity between two parties to transfer funds.

  • Events: Activity performed by a party, such as login or password change.

[Image: feature store demo diagram]

The demo walks through creation of an ingestion pipeline for each data source with all the needed preprocessing and validation. It runs the pipeline locally within the notebook and then launches a real-time function to ingest live data or schedule a cron to run the task when needed.

Following the ingestion, you create a feature vector, select the most relevant features and create a final model. Then you deploy the model and showcase the feature vector and model serving.

Part 1: Data ingestion#

Note

This demo works with the online feature store, which is currently not part of the Open Source default deployment.

This demo showcases financial fraud prevention using the MLRun feature store to define complex features that help identify fraud. Fraud prevention is a particular challenge because it requires processing raw transactions and events in real time, and being able to quickly respond and block transactions before they occur.

To address this, you create a development pipeline and a production pipeline. Both pipelines share the same feature engineering and model code, but serve data very differently. Furthermore, you automate the data and model monitoring process, identify drift and trigger retraining in a CI/CD pipeline. This process is described in the diagram below:

[Image: Feature store demo diagram - fraud prevention]

By the end of this tutorial you’ll learn how to:

  • Create an ingestion pipeline for each data source.

  • Define preprocessing, aggregation and validation of the pipeline.

  • Run the pipeline locally within the notebook.

  • Launch a real-time function to ingest live data.

  • Schedule a cron to run the task when needed.

The raw data is described as follows:

TRANSACTIONS

| Field | Description |
| --- | --- |
| age | Age group value 0-6. Some values are marked as U for unknown. |
| gender | A character to define the gender. |
| zipcodeOri | ZIP code of the person originating the transaction. |
| zipMerchant | ZIP code of the merchant receiving the transaction. |
| category | Category of the transaction (e.g., transportation, food, etc.). |
| amount | The total amount of the transaction. |
| fraud | Whether the transaction is fraudulent. |
| timestamp | The date and time in which the transaction took place. |
| source | The ID of the party/entity performing the transaction. |
| target | The ID of the party/entity receiving the transaction. |
| device | The device ID used to perform the transaction. |

USER EVENTS

| Field | Description |
| --- | --- |
| source | The party/entity related to the event. |
| event | The event type, such as login or password change. |
| timestamp | The date and time of the event. |

This notebook introduces how to ingest different data sources to the Feature Store.

The following FeatureSets are created:

  • Transactions: Monetary transactions between a source and a target.

  • Events: Account events such as account login or a password change.

  • Label: Fraud label for the data.

!/User/align_mlrun.sh
Both server & client are aligned (1.3.0rc23).
project_name = "fraud-demo"
import mlrun

# Initialize the MLRun project object
project = mlrun.get_or_create_project(project_name, context="./", user_project=True)
> 2023-02-15 14:40:30,932 [info] loaded project fraud-demo from ./ and saved in MLRun DB
Step 1 - Fetch, process and ingest the datasets#
1.1 - Transactions#
Transactions#
# Helper functions to adjust the timestamps of our data
# while keeping the order of the selected events and
# the relative distance from one event to the other


def date_adjustment(sample, data_max, new_max, old_data_period, new_data_period):
    """
    Adjust a specific sample's date according to the original and new time periods
    """
    sample_dates_scale = (data_max - sample) / old_data_period
    sample_delta = new_data_period * sample_dates_scale
    new_sample_ts = new_max - sample_delta
    return new_sample_ts


def adjust_data_timespan(
    dataframe, timestamp_col="timestamp", new_period="2d", new_max_date_str="now"
):
    """
    Adjust the dataframe timestamps to the new time period
    """
    # Calculate old time period
    data_min = dataframe.timestamp.min()
    data_max = dataframe.timestamp.max()
    old_data_period = data_max - data_min

    # Set new time period
    new_time_period = pd.Timedelta(new_period)
    new_max = pd.Timestamp(new_max_date_str)
    new_min = new_max - new_time_period
    new_data_period = new_max - new_min

    # Apply the timestamp change
    df = dataframe.copy()
    df[timestamp_col] = df[timestamp_col].apply(
        lambda x: date_adjustment(
            x, data_max, new_max, old_data_period, new_data_period
        )
    )
    return df
import pandas as pd

# Fetch the transactions dataset from the server
transactions_data = pd.read_csv(
    "https://s3.wasabisys.com/iguazio/data/fraud-demo-mlrun-fs-docs/data.csv",
    parse_dates=["timestamp"],
)

# Use only the first 10,000 rows (sorted by source)
transactions_data = transactions_data.sort_values(by="source", axis=0)[:10000]

# Adjust the samples timestamp for the past 2 days
transactions_data = adjust_data_timespan(transactions_data, new_period="2d")

# Sorting after adjusting timestamps
transactions_data = transactions_data.sort_values(by="timestamp", axis=0)

# Preview
transactions_data.head(3)
step age gender zipcodeOri zipMerchant category amount fraud timestamp source target device
274633 91 5 F 28007 28007 es_transportation 26.92 0 2023-02-13 14:41:37.388791000 C1022153336 M1823072687 33832bb8607545df97632a7ab02d69c4
286902 94 2 M 28007 28007 es_transportation 48.22 0 2023-02-13 14:41:55.682416913 C1006176917 M348934600 fadd829c49e74ffa86c8da3be75ada53
416998 131 3 M 28007 28007 es_transportation 17.56 0 2023-02-13 14:42:00.789586939 C1010936270 M348934600 58d0422a50bc40c89d2b4977b2f1beea
Transactions - create a feature set and preprocessing pipeline#

Create the feature set (data pipeline) definition for the credit transaction processing that describes the offline/online data transformations and aggregations.
The feature store automatically adds an offline parquet target and an online NoSQL target by using set_targets().

The data pipeline consists of:

  • Extracting the data components (hour, day of week)

  • Mapping the age values

  • One hot encoding for the transaction category and the gender

  • Aggregating the amount (avg, sum, count, max over 2/12/24 hour time windows)

  • Aggregating the transactions per category (over a 14-day time window)

  • Writing the results to offline (Parquet) and online (NoSQL) targets

# Import MLRun's Feature Store
import mlrun.feature_store as fstore
from mlrun.feature_store.steps import OneHotEncoder, MapValues, DateExtractor
# Define the transactions FeatureSet
transaction_set = fstore.FeatureSet(
    "transactions",
    entities=[fstore.Entity("source")],
    timestamp_key="timestamp",
    description="transactions feature set",
)
# Define and add value mapping
main_categories = [
    "es_transportation",
    "es_health",
    "es_otherservices",
    "es_food",
    "es_hotelservices",
    "es_barsandrestaurants",
    "es_tech",
    "es_sportsandtoys",
    "es_wellnessandbeauty",
    "es_hyper",
    "es_fashion",
    "es_home",
    "es_contents",
    "es_travel",
    "es_leisure",
]

# One Hot Encode the newly defined mappings
one_hot_encoder_mapping = {
    "category": main_categories,
    "gender": list(transactions_data.gender.unique()),
}

# Define the graph steps
transaction_set.graph.to(
    DateExtractor(parts=["hour", "day_of_week"], timestamp_col="timestamp")
).to(MapValues(mapping={"age": {"U": "0"}}, with_original_features=True)).to(
    OneHotEncoder(mapping=one_hot_encoder_mapping)
)


# Add aggregations for 2, 12, and 24 hour time windows
transaction_set.add_aggregation(
    name="amount",
    column="amount",
    operations=["avg", "sum", "count", "max"],
    windows=["2h", "12h", "24h"],
    period="1h",
)


# Add the category aggregations over a 14 day window
for category in main_categories:
    transaction_set.add_aggregation(
        name=category,
        column=f"category_{category}",
        operations=["sum"],
        windows=["14d"],
        period="1d",
    )

# Add default (offline-parquet & online-nosql) targets
transaction_set.set_targets()

# Plot the pipeline so you can see the different steps
transaction_set.plot(rankdir="LR", with_targets=True)
Transactions - ingestion#
# Ingest your transactions dataset through your defined pipeline
transactions_df = fstore.ingest(
    transaction_set, transactions_data, infer_options=fstore.InferOptions.default()
)

transactions_df.head(3)
amount_sum_2h amount_sum_12h amount_sum_24h amount_max_2h amount_max_12h amount_max_24h amount_count_2h amount_count_12h amount_count_24h amount_avg_2h ... category_es_contents category_es_travel category_es_leisure amount fraud timestamp target device timestamp_hour timestamp_day_of_week
source
C1022153336 26.92 26.92 26.92 26.92 26.92 26.92 1.0 1.0 1.0 26.92 ... 0 0 0 26.92 0 2023-02-13 14:41:37.388791000 M1823072687 33832bb8607545df97632a7ab02d69c4 14 0
C1006176917 48.22 48.22 48.22 48.22 48.22 48.22 1.0 1.0 1.0 48.22 ... 0 0 0 48.22 0 2023-02-13 14:41:55.682416913 M348934600 fadd829c49e74ffa86c8da3be75ada53 14 0
C1010936270 17.56 17.56 17.56 17.56 17.56 17.56 1.0 1.0 1.0 17.56 ... 0 0 0 17.56 0 2023-02-13 14:42:00.789586939 M348934600 58d0422a50bc40c89d2b4977b2f1beea 14 0

3 rows × 56 columns

After performing the ingestion process, you can see all of the different features that were created with the help of the UI, as shown in the image below.

[Image: Features Catalog - fraud prevention]

1.2 - User events#
User events - fetching#
# Fetch the user_events dataset from the server
user_events_data = pd.read_csv(
    "https://s3.wasabisys.com/iguazio/data/fraud-demo-mlrun-fs-docs/events.csv",
    index_col=0,
    quotechar="'",
    parse_dates=["timestamp"],
)

# Adjust to the last 2 days to see the latest aggregations in the online feature vectors
user_events_data = adjust_data_timespan(user_events_data, new_period="2d")

# Preview
user_events_data.head(3)
source event timestamp
0 C1974668487 details_change 2023-02-14 23:49:22.487035086
1 C1973547259 login 2023-02-15 02:50:56.744175508
2 C515668508 login 2023-02-14 23:24:03.025352302
User events - create a feature set and preprocessing pipeline#

Now define the events feature set. This is a pretty straightforward pipeline in which you only "one hot encode" the event categories and save the data to the default targets.

user_events_set = fstore.FeatureSet(
    "events",
    entities=[fstore.Entity("source")],
    timestamp_key="timestamp",
    description="user events feature set",
)
# Define and add value mapping
events_mapping = {"event": list(user_events_data.event.unique())}

# One Hot Encode
user_events_set.graph.to(OneHotEncoder(mapping=events_mapping))

# Add default (offline-parquet & online-nosql) targets
user_events_set.set_targets()

# Plot the pipeline so you can see the different steps
user_events_set.plot(rankdir="LR", with_targets=True)
User Events - Ingestion#
# Ingestion of your newly created events feature set
events_df = fstore.ingest(user_events_set, user_events_data)
events_df.head(3)
event_details_change event_login event_password_change timestamp
source
C1974668487 1 0 0 2023-02-14 23:49:22.487035086
C1973547259 0 1 0 2023-02-15 02:50:56.744175508
C515668508 0 1 0 2023-02-14 23:24:03.025352302
Step 2 - Create a labels data set for model training#
Label set - create a feature set#

This feature set contains the label for the fraud demo. It is ingested directly to the default targets without any changes.

def create_labels(df):
    labels = df[["fraud", "timestamp"]].copy()
    labels = labels.rename(columns={"fraud": "label"})
    labels["timestamp"] = labels["timestamp"].astype("datetime64[ms]")
    labels["label"] = labels["label"].astype(int)
    return labels
from mlrun.datastore import ParquetTarget
import os

# Define the "labels" feature set
labels_set = fstore.FeatureSet(
    "labels",
    entities=[fstore.Entity("source")],
    timestamp_key="timestamp",
    description="training labels",
    engine="pandas",
)

labels_set.graph.to(name="create_labels", handler=create_labels)


# Specify only the Parquet (offline) target since it's not used for real-time
target = ParquetTarget(
    name="labels", path=f"v3io:///projects/{project.name}/target.parquet"
)
labels_set.set_targets([target], with_defaults=False)
labels_set.plot(with_targets=True)
Label set - ingestion#
# Ingest the labels feature set
labels_df = fstore.ingest(labels_set, transactions_data)
labels_df.head(3)
label timestamp
source
C1022153336 0 2023-02-13 14:41:37.388
C1006176917 0 2023-02-13 14:41:55.682
C1010936270 0 2023-02-13 14:42:00.789
Step 3 - Deploy a real-time pipeline#

When dealing with real-time aggregation, it's important to be able to update these aggregations in real-time. For this purpose, you create live serving functions that update the online feature store of the transactions FeatureSet and Events FeatureSet.

Using MLRun's serving runtime, create a nuclio function loaded with your feature set's computational graph definition and an HttpSource to define the HTTP trigger.

Notice that the implementation below does not require any rewrite of the pipeline logic.

3.1 - Transactions#
Transactions - deploy the feature set live endpoint#
# Create iguazio v3io stream and transactions push API endpoint
transaction_stream = f"v3io:///projects/{project.name}/streams/transaction"
transaction_pusher = mlrun.datastore.get_stream_pusher(transaction_stream)
# Define the source stream trigger (use v3io streams)
# define the `key` and `time` fields (extracted from the Json message).
source = mlrun.datastore.sources.StreamSource(
    path=transaction_stream, key_field="source", time_field="timestamp"
)

# Deploy the transactions feature set's ingestion service over a real-time (Nuclio) serverless function
# you can use the run_config parameter to pass function/service specific configuration
transaction_set_endpoint, function = transaction_set.deploy_ingestion_service(
    source=source
)
> 2023-02-15 14:43:00,894 [info] Starting remote function deploy
2023-02-15 14:43:01  (info) Deploying function
2023-02-15 14:43:01  (info) Building
2023-02-15 14:43:01  (info) Staging files and preparing base images
2023-02-15 14:43:01  (info) Building processor image
2023-02-15 14:44:06  (info) Build complete
> 2023-02-15 14:45:06,185 [info] successfully deployed function: {'internal_invocation_urls': ['nuclio-fraud-demo-dani-transactions-ingest.default-tenant.svc.cluster.local:8080'], 'external_invocation_urls': ['fraud-demo-dani-transactions-ingest-fraud-demo-dani.default-tenant.app.vmdev94.lab.iguazeng.com/']}
Transactions - test the feature set HTTP endpoint#

By defining your transactions feature set you can now use MLRun and Storey to deploy it as a live endpoint, ready to ingest new data!

Using MLRun's serving runtime, create a nuclio function loaded with your feature set's computational graph definition and an HttpSource to define the HTTP trigger.

import requests
import json

# Select a sample from the dataset and serialize it to JSON
transaction_sample = json.loads(transactions_data.sample(1).to_json(orient="records"))[
    0
]
transaction_sample["timestamp"] = str(pd.Timestamp.now())
transaction_sample
{'step': 7,
 'age': '4',
 'gender': 'M',
 'zipcodeOri': 28007,
 'zipMerchant': 28007,
 'category': 'es_transportation',
 'amount': 3.13,
 'fraud': 0,
 'timestamp': '2023-02-15 14:45:06.241183',
 'source': 'C1039390058',
 'target': 'M348934600',
 'device': '6c08480bd1234bac9e6a4b57310ba9ab'}
# Post the sample to the ingestion endpoint
requests.post(transaction_set_endpoint, json=transaction_sample).text
'{"id": "753953e5-a7df-4fe0-b728-822fadb92ceb"}'
3.2 - User events#
User events - deploy the feature set live endpoint#

Deploy the events feature set's ingestion service using the feature set and all the previously defined resources.

# Create iguazio v3io stream and events push API endpoint
events_stream = f"v3io:///projects/{project.name}/streams/events"
events_pusher = mlrun.datastore.get_stream_pusher(events_stream)
# Define the source stream trigger (use v3io streams)
# define the `key` and `time` fields (extracted from the Json message).
source = mlrun.datastore.sources.StreamSource(
    path=events_stream, key_field="source", time_field="timestamp"
)

# Deploy the events feature set's ingestion service over a real-time (Nuclio) serverless function
# you can use the run_config parameter to pass function/service specific configuration
events_set_endpoint, function = user_events_set.deploy_ingestion_service(source=source)
> 2023-02-15 14:45:06,443 [info] Starting remote function deploy
2023-02-15 14:45:06  (info) Deploying function
2023-02-15 14:45:06  (info) Building
2023-02-15 14:45:06  (info) Staging files and preparing base images
2023-02-15 14:45:06  (info) Building processor image
User Events - Test the feature set HTTP endpoint#
# Select a sample from the events dataset and serialize it to JSON
user_events_sample = json.loads(user_events_data.sample(1).to_json(orient="records"))[0]
user_events_sample["timestamp"] = str(pd.Timestamp.now())
user_events_sample
# Post the sample to the ingestion endpoint
requests.post(events_set_endpoint, json=user_events_sample).text
Done!#

You've completed Part 1 of the data-ingestion with the feature store. Proceed to Part 2 to learn how to train an ML model using the feature store data.

Part 2: Training#

In this part you learn how to use MLRun's Feature Store to easily define a Feature Vector and create the dataset you need to run the training process.
By the end of this tutorial you’ll learn how to:

  • Combine multiple data sources to a single feature vector

  • Create training dataset

  • Create a model using an MLRun hub function

project_name = "fraud-demo"
import mlrun

# Initialize the MLRun project object
project = mlrun.get_or_create_project(project_name, context="./", user_project=True)
> 2023-02-15 14:43:21,980 [info] loaded project fraud-demo from MLRun DB
Step 1 - Create a feature vector#

In this section you create a feature vector.
The feature vector has a name so you can reference it later via the URI or from your serving function, and it has a list of features from the available feature sets. You can add a feature from a feature set by adding <FeatureSet>.<Feature> to the list, or add <FeatureSet>.* to add all of the feature set's available features.

By default, the first FeatureSet in the feature list acts as the spine, meaning that all the other features are joined to it.
For example, in this case the events feature set acts as the spine, so each event produces a row in the resulting feature vector.

# Define the list of features to use
features = [
    "events.*",
    "transactions.amount_max_2h",
    "transactions.amount_sum_2h",
    "transactions.amount_count_2h",
    "transactions.amount_avg_2h",
    "transactions.amount_max_12h",
    "transactions.amount_sum_12h",
    "transactions.amount_count_12h",
    "transactions.amount_avg_12h",
    "transactions.amount_max_24h",
    "transactions.amount_sum_24h",
    "transactions.amount_count_24h",
    "transactions.amount_avg_24h",
    "transactions.es_transportation_sum_14d",
    "transactions.es_health_sum_14d",
    "transactions.es_otherservices_sum_14d",
    "transactions.es_food_sum_14d",
    "transactions.es_hotelservices_sum_14d",
    "transactions.es_barsandrestaurants_sum_14d",
    "transactions.es_tech_sum_14d",
    "transactions.es_sportsandtoys_sum_14d",
    "transactions.es_wellnessandbeauty_sum_14d",
    "transactions.es_hyper_sum_14d",
    "transactions.es_fashion_sum_14d",
    "transactions.es_home_sum_14d",
    "transactions.es_travel_sum_14d",
    "transactions.es_leisure_sum_14d",
    "transactions.gender_F",
    "transactions.gender_M",
    "transactions.step",
    "transactions.amount",
    "transactions.timestamp_hour",
    "transactions.timestamp_day_of_week",
]
# Import MLRun's Feature Store
import mlrun.feature_store as fstore

# Define the feature vector name for future reference
fv_name = "transactions-fraud"

# Define the feature vector using the feature store (fstore)
transactions_fv = fstore.FeatureVector(
    fv_name,
    features,
    label_feature="labels.label",
    description="Predicting a fraudulent transaction",
)

# Save the feature vector in the feature store
transactions_fv.save()
Step 2 - Preview the feature vector data#

Obtain the values of the features in the feature vector, to ensure the data appears as expected.

# Import the Parquet Target so you can directly save your dataset as a file
from mlrun.datastore.targets import ParquetTarget

# Get offline feature vector as dataframe and save the dataset to parquet
train_dataset = fstore.get_offline_features(fv_name, target=ParquetTarget())
> 2023-02-15 14:43:23,376 [info] wrote target: {'name': 'parquet', 'kind': 'parquet', 'path': 'v3io:///projects/fraud-demo-dani/FeatureStore/transactions-fraud/parquet/vectors/transactions-fraud-latest.parquet', 'status': 'ready', 'updated': '2023-02-15T14:43:23.375968+00:00', 'size': 140838, 'partitioned': True}
# Preview your dataset
train_dataset.to_dataframe().tail(5)
event_details_change event_login event_password_change amount_max_2h amount_sum_2h amount_count_2h amount_avg_2h amount_max_12h amount_sum_12h amount_count_12h ... es_home_sum_14d es_travel_sum_14d es_leisure_sum_14d gender_F gender_M step amount timestamp_hour timestamp_day_of_week label
1763 0 0 1 45.28 144.56 5.0 28.9120 161.75 1017.80 33.0 ... 0.0 1.0 0.0 1.0 0.0 96.0 24.02 14.0 2.0 0.0
1764 1 0 0 26.81 47.75 2.0 23.8750 68.16 653.02 24.0 ... 0.0 0.0 0.0 0.0 1.0 134.0 26.81 14.0 2.0 0.0
1765 0 1 0 33.10 91.11 4.0 22.7775 121.96 1001.32 32.0 ... 2.0 0.0 0.0 1.0 0.0 141.0 14.95 14.0 2.0 0.0
1766 0 0 1 22.35 37.68 3.0 12.5600 71.63 1052.44 37.0 ... 0.0 0.0 0.0 0.0 1.0 101.0 13.62 14.0 2.0 0.0
1767 0 0 1 44.37 76.87 4.0 19.2175 159.32 1189.73 39.0 ... 0.0 0.0 0.0 0.0 1.0 40.0 12.82 14.0 2.0 0.0

5 rows × 36 columns

Step 3 - Train models and choose the highest accuracy#

With MLRun, you can easily train different models and compare the results. In the code below, you train three different models. Each one uses a different algorithm (random forest, XGBoost, AdaBoost), and you choose the model with the highest accuracy.

# Import the Sklearn classifier function from the functions hub
classifier_fn = mlrun.import_function("hub://auto_trainer")
# Prepare the parameters list for the training function
# you use 3 different models
training_params = {
    "model_name": [
        "transaction_fraud_rf",
        "transaction_fraud_xgboost",
        "transaction_fraud_adaboost",
    ],
    "model_class": [
        "sklearn.ensemble.RandomForestClassifier",
        "sklearn.ensemble.GradientBoostingClassifier",
        "sklearn.ensemble.AdaBoostClassifier",
    ],
}

# Define the training task, including your feature vector, label and hyperparams definitions
train_task = mlrun.new_task(
    "training",
    inputs={"dataset": transactions_fv.uri},
    params={"label_columns": "label"},
)

train_task.with_hyper_params(training_params, strategy="list", selector="max.accuracy")

# Specify your cluster image
classifier_fn.spec.image = "mlrun/mlrun"

# Run training
classifier_fn.run(train_task, local=False)
> 2023-02-15 14:43:23,870 [info] starting run training uid=946725e1c01f4e0ba9d7eb62f7f24142 DB=http://mlrun-api:8080
> 2023-02-15 14:43:24,069 [info] Job is running in the background, pod: training-68dct
> 2023-02-15 14:43:58,472 [info] test_set or train_test_split_size are not provided, setting train_test_split_size to 0.2
> 2023-02-15 14:44:00,031 [info] label columns: label
> 2023-02-15 14:44:00,031 [info] Sample set not given, using the whole training set as the sample set
> 2023-02-15 14:44:00,278 [info] training 'transaction_fraud_rf'
/usr/local/lib/python3.9/site-packages/sklearn/calibration.py:1000: FutureWarning:

The normalize argument is deprecated in v1.1 and will be removed in v1.3. Explicitly normalizing y_prob will reproduce this behavior, but it is recommended that a proper probability is used (i.e. a classifier's `predict_proba` positive class or `decision_function` output calibrated with `CalibratedClassifierCV`).

> 2023-02-15 14:44:03,298 [info] test_set or train_test_split_size are not provided, setting train_test_split_size to 0.2
> 2023-02-15 14:44:04,277 [info] label columns: label
> 2023-02-15 14:44:04,277 [info] Sample set not given, using the whole training set as the sample set
> 2023-02-15 14:44:04,281 [info] training 'transaction_fraud_xgboost'
/usr/local/lib/python3.9/site-packages/sklearn/calibration.py:1000: FutureWarning:

The normalize argument is deprecated in v1.1 and will be removed in v1.3. Explicitly normalizing y_prob will reproduce this behavior, but it is recommended that a proper probability is used (i.e. a classifier's `predict_proba` positive class or `decision_function` output calibrated with `CalibratedClassifierCV`).

> 2023-02-15 14:44:07,773 [info] test_set or train_test_split_size are not provided, setting train_test_split_size to 0.2
> 2023-02-15 14:44:09,037 [info] label columns: label
> 2023-02-15 14:44:09,037 [info] Sample set not given, using the whole training set as the sample set
> 2023-02-15 14:44:09,040 [info] training 'transaction_fraud_adaboost'
/usr/local/lib/python3.9/site-packages/sklearn/calibration.py:1000: FutureWarning:

The normalize argument is deprecated in v1.1 and will be removed in v1.3. Explicitly normalizing y_prob will reproduce this behavior, but it is recommended that a proper probability is used (i.e. a classifier's `predict_proba` positive class or `decision_function` output calibrated with `CalibratedClassifierCV`).

> 2023-02-15 14:44:11,957 [info] best iteration=1, used criteria max.accuracy
> 2023-02-15 14:44:12,668 [info] To track results use the CLI: {'info_cmd': 'mlrun get run 946725e1c01f4e0ba9d7eb62f7f24142 -p fraud-demo-dani', 'logs_cmd': 'mlrun logs 946725e1c01f4e0ba9d7eb62f7f24142 -p fraud-demo-dani'}
> 2023-02-15 14:44:12,668 [info] Or click for UI: {'ui_url': 'https://dashboard.default-tenant.app.vmdev94.lab.iguazeng.com/mlprojects/fraud-demo-dani/jobs/monitor/946725e1c01f4e0ba9d7eb62f7f24142/overview'}
> 2023-02-15 14:44:12,669 [info] run executed, status=completed
final state: completed
project uid iter start state name labels inputs parameters results artifacts
fraud-demo-dani 0 Feb 15 14:43:57 completed training
v3io_user=dani
kind=job
owner=dani
mlrun/client_version=1.3.0-rc23
mlrun/client_python_version=3.9.16
dataset
label_columns=label
best_iteration=1
accuracy=1.0
f1_score=1.0
precision_score=1.0
recall_score=1.0
feature-importance
test_set
confusion-matrix
roc-curves
calibration-curve
model
iteration_results
parallel_coordinates

> to track results use the .show() or .logs() methods or click here to open in UI
> 2023-02-15 14:44:15,576 [info] run executed, status=completed
<mlrun.model.RunObject at 0x7f3288543e20>
Step 4 - Perform feature selection#

As part of the data science process, try to reduce the training dataset's size to remove bad or uninformative features and save computation time.

Use the ready-made feature-selection function from MLRun's hub (hub://feature_selection) to select the best features to keep, and run it on a sample of your dataset.

feature_selection_fn = mlrun.import_function("hub://feature_selection")

feature_selection_run = feature_selection_fn.run(
    params={
        "k": 18,
        "min_votes": 2,
        "label_column": "label",
        "output_vector_name": fv_name + "-short",
        "ignore_type_errors": True,
    },
    inputs={"df_artifact": transactions_fv.uri},
    name="feature_extraction",
    handler="feature_selection",
    local=False,
)
> 2023-02-15 14:44:16,098 [info] starting run feature_extraction uid=da55327c222f4a9389232f25fc6b9739 DB=http://mlrun-api:8080
> 2023-02-15 14:44:16,262 [info] Job is running in the background, pod: feature-extraction-pv66m
final state: completed
project uid iter start state name labels inputs parameters results artifacts
fraud-demo-dani 0 Feb 15 14:45:50 completed feature_extraction
v3io_user=dani
kind=job
owner=dani
mlrun/client_version=1.3.0-rc23
mlrun/client_python_version=3.9.16
host=feature-extraction-pv66m
df_artifact
k=18
min_votes=2
label_column=label
output_vector_name=transactions-fraud-short
ignore_type_errors=True
top_features_vector=store://feature-vectors/fraud-demo-dani/transactions-fraud-short
f_classif
mutual_info_classif
chi2
f_regression
LinearSVC
LogisticRegression
ExtraTreesClassifier
feature_scores
max_scaled_scores_feature_scores
selected_features_count
selected_features

> to track results use the .show() or .logs() methods or click here to open in UI
> 2023-02-15 14:46:05,989 [info] run executed, status=completed
mlrun.get_dataitem(feature_selection_run.outputs["top_features_vector"]).as_df().tail(5)
amount_max_2h amount_sum_2h amount_count_2h amount_avg_2h amount_max_12h amount_sum_12h amount_count_12h amount_avg_12h amount_max_24h amount_sum_24h amount_count_24h amount_avg_24h es_transportation_sum_14d es_health_sum_14d es_otherservices_sum_14d label
9995 54.55 118.62 4.0 29.655 70.47 805.10 27.0 29.818519 85.97 1730.23 58.0 29.831552 120.0 0.0 0.0 0
9996 31.14 31.14 1.0 31.140 119.50 150.64 2.0 75.320000 119.50 330.61 5.0 66.122000 0.0 7.0 0.0 0
9997 218.48 365.30 5.0 73.060 218.48 1076.37 25.0 43.054800 218.48 1968.00 59.0 33.355932 107.0 5.0 1.0 0
9998 34.93 118.22 5.0 23.644 79.16 935.26 31.0 30.169677 89.85 2062.69 68.0 30.333676 116.0 0.0 0.0 0
9999 77.76 237.95 5.0 47.590 95.71 1259.07 37.0 34.028919 95.71 2451.98 72.0 34.055278 122.0 0.0 0.0 0
Step 5 - Train your models with top features#

Following the feature selection, you train new models using the selected features. You can observe that the accuracy and other results remain high, meaning you get a model that needs fewer features to be accurate, and is therefore less error-prone.

# Define your training task, including your feature vector, label and hyperparams definitions
ensemble_train_task = mlrun.new_task(
    "training",
    inputs={"dataset": feature_selection_run.outputs["top_features_vector"]},
    params={"label_columns": "label"},
)
ensemble_train_task.with_hyper_params(
    training_params, strategy="list", selector="max.accuracy"
)

classifier_fn.run(ensemble_train_task)
> 2023-02-15 14:46:06,131 [info] starting run training uid=4ac3afbfb6a1409daa1e834f8f153295 DB=http://mlrun-api:8080
> 2023-02-15 14:46:07,756 [info] Job is running in the background, pod: training-hgz6t
> 2023-02-15 14:46:17,141 [info] test_set or train_test_split_size are not provided, setting train_test_split_size to 0.2
> 2023-02-15 14:46:17,731 [info] label columns: label
> 2023-02-15 14:46:17,732 [info] Sample set not given, using the whole training set as the sample set
> 2023-02-15 14:46:18,031 [info] training 'transaction_fraud_rf'
/usr/local/lib/python3.9/site-packages/sklearn/calibration.py:1000: FutureWarning:

The normalize argument is deprecated in v1.1 and will be removed in v1.3. Explicitly normalizing y_prob will reproduce this behavior, but it is recommended that a proper probability is used (i.e. a classifier's `predict_proba` positive class or `decision_function` output calibrated with `CalibratedClassifierCV`).

> 2023-02-15 14:46:21,793 [info] test_set or train_test_split_size are not provided, setting train_test_split_size to 0.2
> 2023-02-15 14:46:22,767 [info] label columns: label
> 2023-02-15 14:46:22,767 [info] Sample set not given, using the whole training set as the sample set
> 2023-02-15 14:46:22,770 [info] training 'transaction_fraud_xgboost'
/usr/local/lib/python3.9/site-packages/sklearn/calibration.py:1000: FutureWarning:

The normalize argument is deprecated in v1.1 and will be removed in v1.3. Explicitly normalizing y_prob will reproduce this behavior, but it is recommended that a proper probability is used (i.e. a classifier's `predict_proba` positive class or `decision_function` output calibrated with `CalibratedClassifierCV`).

> 2023-02-15 14:46:28,944 [info] test_set or train_test_split_size are not provided, setting train_test_split_size to 0.2
> 2023-02-15 14:46:29,507 [info] label columns: label
> 2023-02-15 14:46:29,507 [info] Sample set not given, using the whole training set as the sample set
> 2023-02-15 14:46:29,511 [info] training 'transaction_fraud_adaboost'
/usr/local/lib/python3.9/site-packages/sklearn/calibration.py:1000: FutureWarning:

The normalize argument is deprecated in v1.1 and will be removed in v1.3. Explicitly normalizing y_prob will reproduce this behavior, but it is recommended that a proper probability is used (i.e. a classifier's `predict_proba` positive class or `decision_function` output calibrated with `CalibratedClassifierCV`).

> 2023-02-15 14:46:31,968 [info] best iteration=2, used criteria max.accuracy
> 2023-02-15 14:46:32,376 [info] To track results use the CLI: {'info_cmd': 'mlrun get run 4ac3afbfb6a1409daa1e834f8f153295 -p fraud-demo-dani', 'logs_cmd': 'mlrun logs 4ac3afbfb6a1409daa1e834f8f153295 -p fraud-demo-dani'}
> 2023-02-15 14:46:32,376 [info] Or click for UI: {'ui_url': 'https://dashboard.default-tenant.app.vmdev94.lab.iguazeng.com/mlprojects/fraud-demo-dani/jobs/monitor/4ac3afbfb6a1409daa1e834f8f153295/overview'}
> 2023-02-15 14:46:32,377 [info] run executed, status=completed
final state: completed
project: fraud-demo-dani | iter: 0 | start: Feb 15 14:46:16 | state: completed | name: training
labels: v3io_user=dani, kind=job, owner=dani, mlrun/client_version=1.3.0-rc23, mlrun/client_python_version=3.9.16
inputs: dataset
parameters: label_columns=label
results: best_iteration=2, accuracy=0.992503748125937, f1_score=0.4827586206896552, precision_score=0.5833333333333334, recall_score=0.4117647058823529
artifacts: feature-importance, test_set, confusion-matrix, roc-curves, calibration-curve, model, iteration_results, parallel_coordinates

> to track results use the .show() or .logs() methods or click here to open in UI
> 2023-02-15 14:46:33,094 [info] run executed, status=completed
<mlrun.model.RunObject at 0x7f324160af40>
Done!#

You've completed Part 2 of the model training with the feature store. Proceed to Part 3 to learn how to deploy and monitor the model.

Part 3: Serving#

In this part you use MLRun's serving runtime to deploy the models trained in the previous stage as a voting ensemble that uses max-vote logic. You also use MLRun's feature store to retrieve the latest tag of the online feature vector you defined in the previous stage.

By the end of this tutorial you’ll learn how to:

  • Define a model class to load your models, run preprocessing, and predict on the data

  • Define a Voting Ensemble function on top of your models

  • Test the serving function locally using your mock server

  • Deploy the function to the cluster and test it live

Environment setup#

First, make sure scikit-learn is installed at the correct version:

!pip install -U scikit-learn
Requirement already satisfied: scikit-learn in /conda/envs/mlrun-extended/lib/python3.9/site-packages (1.2.1)
Requirement already satisfied: numpy>=1.17.3 in /conda/envs/mlrun-extended/lib/python3.9/site-packages (from scikit-learn) (1.22.4)
Requirement already satisfied: joblib>=1.1.1 in /conda/envs/mlrun-extended/lib/python3.9/site-packages (from scikit-learn) (1.2.0)
Requirement already satisfied: threadpoolctl>=2.0.0 in /conda/envs/mlrun-extended/lib/python3.9/site-packages (from scikit-learn) (3.1.0)
Requirement already satisfied: scipy>=1.3.2 in /conda/envs/mlrun-extended/lib/python3.9/site-packages (from scikit-learn) (1.10.0)

Restart your kernel after installing. Since your work is done within this project scope, define the project itself for all of your MLRun work in this notebook.

project_name = "fraud-demo"
import mlrun

# Initialize the MLRun project object
project = mlrun.get_or_create_project(project_name, context="./", user_project=True)
> 2023-02-15 14:48:31,777 [info] loaded project fraud-demo from MLRun DB
Define model class#
  • Load models

  • Predict from the feature store online service via the source key

# mlrun: start-code
import numpy as np
from cloudpickle import load
from mlrun.serving.v2_serving import V2ModelServer


class ClassifierModel(V2ModelServer):
    def load(self):
        """load and initialize the model and/or other elements"""
        model_file, extra_data = self.get_model(".pkl")
        self.model = load(open(model_file, "rb"))

    def predict(self, body: dict) -> list:
        """Generate model predictions from sample"""
        print(f"Input -> {body['inputs']}")
        feats = np.asarray(body["inputs"])
        result: np.ndarray = self.model.predict(feats)
        return result.tolist()
# mlrun: end-code
Define a serving function#

MLRun serving can produce managed real-time serverless pipelines from various tasks, including MLRun models or standard model files. The pipelines use the Nuclio real-time serverless engine, which can be deployed anywhere. Nuclio is a high-performance open-source serverless framework that's focused on data, I/O, and compute-intensive workloads.

The EnrichmentVotingEnsemble and the EnrichmentModelRouter router classes auto-enrich the request with data from the feature store. The router input accepts a list of inference requests (each request can be a dict or list of incoming features/keys). It enriches the request with data from the specified feature vector (feature_vector_uri).

In many cases the features can have null values (None, NaN, Inf, …). The enrichment routers can substitute the null value with a fixed or statistical value per feature. This is done through the impute_policy parameter, which accepts the impute policy per feature (where * is used to specify the default). The value can be a fixed number for constants, or $mean, $max, $min, $std, or $count for statistical values, to substitute the value with the equivalent feature statistic (taken from the feature store).
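For illustration, a minimal sketch of an impute policy (the "age" feature name is hypothetical): it uses the stored mean for every feature by default, and a fixed constant for "age".

# hypothetical impute policy: statistical default plus a per-feature constant
impute_policy = {"*": "$mean", "age": 33}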

The following code:

  • Gathers the ClassifierModel code from this notebook

  • Defines an EnrichmentVotingEnsemble - a max-vote ensemble with feature enrichment and imputing

  • Adds the previously trained models to the ensemble

# Create the serving function from your code above
serving_fn = mlrun.code_to_function(
    "transaction-fraud", kind="serving", image="mlrun/mlrun"
).apply(mlrun.auto_mount())

serving_fn.set_topology(
    "router",
    "mlrun.serving.routers.EnrichmentVotingEnsemble",
    name="VotingEnsemble",
    feature_vector_uri="transactions-fraud-short",
    impute_policy={"*": "$mean"},
)

model_names = [
    "transaction_fraud_rf",
    "transaction_fraud_xgboost",
    "transaction_fraud_adaboost",
]

for i, name in enumerate(model_names, start=1):
    serving_fn.add_model(
        name,
        class_name="ClassifierModel",
        model_path=project.get_artifact_uri(f"{name}#{i}:latest"),
    )

# Plot the ensemble configuration
serving_fn.spec.graph.plot()
Test the server locally#

Before deploying the serving function, you can test it in the current notebook and check the model output.

# Create a mock server from the serving function
local_server = serving_fn.to_mock_server()
> 2023-02-15 14:48:36,438 [info] model transaction_fraud_rf was loaded
> 2023-02-15 14:48:36,482 [info] model transaction_fraud_xgboost was loaded
> 2023-02-15 14:48:36,520 [info] model transaction_fraud_adaboost was loaded
# Choose an id for your test
sample_id = "C1000148617"

model_inference_path = "/v2/models/infer"

# Send your sample ID for prediction
local_server.test(path=model_inference_path, body={"inputs": [[sample_id]]})

# notice the input vector is printed 3 times (once per child model) and is enriched with data from the feature store
Input -> [[60.98, 73.78999999999999, 2.0, 36.894999999999996, 134.16, 1037.48, 32.0, 32.42125, 143.87, 1861.8400000000001, 59.0, 31.556610169491528, 90.0, 1.0, 2.0]]
Input -> [[60.98, 73.78999999999999, 2.0, 36.894999999999996, 134.16, 1037.48, 32.0, 32.42125, 143.87, 1861.8400000000001, 59.0, 31.556610169491528, 90.0, 1.0, 2.0]]Input -> [[60.98, 73.78999999999999, 2.0, 36.894999999999996, 134.16, 1037.48, 32.0, 32.42125, 143.87, 1861.8400000000001, 59.0, 31.556610169491528, 90.0, 1.0, 2.0]]
X does not have valid feature names, but RandomForestClassifier was fitted with feature names
X does not have valid feature names, but AdaBoostClassifier was fitted with feature names
X does not have valid feature names, but GradientBoostingClassifier was fitted with feature names
{'id': '5237524f362a47b78828d9d7f7f87d9a',
 'model_name': 'VotingEnsemble',
 'outputs': [0],
 'model_version': 'v1'}
Accessing the real-time feature vector directly#

You can also directly query the feature store values using the get_online_feature_service method. This method is used internally in the EnrichmentVotingEnsemble router class.

Note

The timestamp of the last event is not returned with get_online_feature_service / svc.get.

import mlrun.feature_store as fstore

# Create the online feature service
svc = fstore.get_online_feature_service(
    "transactions-fraud-short:latest", impute_policy={"*": "$mean"}
)

# Get sample feature vector
sample_fv = svc.get([{"source": sample_id}])
sample_fv
[{'amount_max_2h': 60.98,
  'amount_max_12h': 134.16,
  'amount_max_24h': 143.87,
  'amount_sum_2h': 73.78999999999999,
  'amount_sum_12h': 1037.48,
  'amount_sum_24h': 1861.8400000000001,
  'amount_count_2h': 2.0,
  'amount_count_12h': 32.0,
  'amount_count_24h': 59.0,
  'es_transportation_sum_14d': 90.0,
  'es_health_sum_14d': 1.0,
  'es_otherservices_sum_14d': 2.0,
  'amount_avg_2h': 36.894999999999996,
  'amount_avg_12h': 32.42125,
  'amount_avg_24h': 31.556610169491528}]
Deploying the function on the Kubernetes cluster#

You can now deploy the function. Once deployed, you get a function with an HTTP trigger that can be called from other locations.

Model activities can be tracked into a real-time stream and time-series DB. The monitoring data is used to create real-time dashboards, detect drift, and analyze performance.
To monitor a deployed model, apply set_tracking().

import os

# Enable model monitoring
serving_fn.set_tracking()
project.set_model_monitoring_credentials(os.getenv("V3IO_ACCESS_KEY"))

# Deploy the serving function
serving_fn.deploy()
> 2023-02-15 14:48:36,931 [info] Starting remote function deploy
2023-02-15 14:48:39  (info) Deploying function
2023-02-15 14:48:39  (info) Building
2023-02-15 14:48:39  (info) Staging files and preparing base images
2023-02-15 14:48:39  (info) Building processor image
2023-02-15 14:50:15  (info) Build complete
2023-02-15 14:51:05  (info) Function deploy complete
> 2023-02-15 14:51:05,648 [info] successfully deployed function: {'internal_invocation_urls': ['nuclio-fraud-demo-dani-transaction-fraud.default-tenant.svc.cluster.local:8080'], 'external_invocation_urls': ['fraud-demo-dani-transaction-fraud-fraud-demo-dani.default-tenant.app.vmdev94.lab.iguazeng.com/']}
'http://fraud-demo-dani-transaction-fraud-fraud-demo-dani.default-tenant.app.vmdev94.lab.iguazeng.com/'
Test the server#

You can test the serving function and examine the model output.

# Choose an id for your test
sample_id = "C1000148617"

model_inference_path = "/v2/models/infer"

# Send your sample ID for prediction
serving_fn.invoke(path=model_inference_path, body={"inputs": [[sample_id]]})
> 2023-02-15 14:51:05,714 [info] invoking function: {'method': 'POST', 'path': 'http://nuclio-fraud-demo-dani-transaction-fraud.default-tenant.svc.cluster.local:8080/v2/models/infer'}
{'id': 'c34706e4-f1c8-4aff-b226-020c2cad7e4a',
 'model_name': 'VotingEnsemble',
 'outputs': [0],
 'model_version': 'v1'}

You can also directly query the feature store values, which are used in the enrichment.

Simulate incoming data#
# Load the dataset
data = mlrun.get_dataitem(
    "https://s3.wasabisys.com/iguazio/data/fraud-demo-mlrun-fs-docs/data.csv"
).as_df()

# use only first 10k
data = data.sort_values(by="source", axis=0)[:10000]

# keys
sample_ids = data["source"].to_list()
from random import choice, uniform
from time import sleep

# Sending random requests
for _ in range(10):
    data_point = choice(sample_ids)
    try:
        resp = serving_fn.invoke(
            path=model_inference_path, body={"inputs": [[data_point]]}
        )
        print(resp)
        sleep(uniform(0.2, 1.7))
    except OSError:
        pass
> 2023-02-15 14:51:47,845 [info] invoking function: {'method': 'POST', 'path': 'http://nuclio-fraud-demo-dani-transaction-fraud.default-tenant.svc.cluster.local:8080/v2/models/infer'}
{'id': 'f09841c5-4427-4ea1-95a9-723bb09349bb', 'model_name': 'VotingEnsemble', 'outputs': [0], 'model_version': 'v1'}
> 2023-02-15 14:51:49,373 [info] invoking function: {'method': 'POST', 'path': 'http://nuclio-fraud-demo-dani-transaction-fraud.default-tenant.svc.cluster.local:8080/v2/models/infer'}
{'id': 'd8dd6ca2-d448-4953-aa84-1414f6274f91', 'model_name': 'VotingEnsemble', 'outputs': [0], 'model_version': 'v1'}
> 2023-02-15 14:51:49,725 [info] invoking function: {'method': 'POST', 'path': 'http://nuclio-fraud-demo-dani-transaction-fraud.default-tenant.svc.cluster.local:8080/v2/models/infer'}
{'id': '8aa2c1cb-5fdf-49e7-9b30-15c4b606bbe2', 'model_name': 'VotingEnsemble', 'outputs': [0], 'model_version': 'v1'}
> 2023-02-15 14:51:50,581 [info] invoking function: {'method': 'POST', 'path': 'http://nuclio-fraud-demo-dani-transaction-fraud.default-tenant.svc.cluster.local:8080/v2/models/infer'}
{'id': '4357ee2a-c0ca-476d-a04c-add47487391a', 'model_name': 'VotingEnsemble', 'outputs': [0], 'model_version': 'v1'}
> 2023-02-15 14:51:51,542 [info] invoking function: {'method': 'POST', 'path': 'http://nuclio-fraud-demo-dani-transaction-fraud.default-tenant.svc.cluster.local:8080/v2/models/infer'}
{'id': '324c5938-82b5-4a68-b61b-204530e4b8c9', 'model_name': 'VotingEnsemble', 'outputs': [0], 'model_version': 'v1'}
> 2023-02-15 14:51:52,476 [info] invoking function: {'method': 'POST', 'path': 'http://nuclio-fraud-demo-dani-transaction-fraud.default-tenant.svc.cluster.local:8080/v2/models/infer'}
{'id': '523c8e5c-ab91-4c8b-83d1-3d57cfa7a5cd', 'model_name': 'VotingEnsemble', 'outputs': [0], 'model_version': 'v1'}
> 2023-02-15 14:51:53,067 [info] invoking function: {'method': 'POST', 'path': 'http://nuclio-fraud-demo-dani-transaction-fraud.default-tenant.svc.cluster.local:8080/v2/models/infer'}
{'id': '3a03000a-9223-4304-948b-66b3651a38de', 'model_name': 'VotingEnsemble', 'outputs': [0], 'model_version': 'v1'}
> 2023-02-15 14:51:53,662 [info] invoking function: {'method': 'POST', 'path': 'http://nuclio-fraud-demo-dani-transaction-fraud.default-tenant.svc.cluster.local:8080/v2/models/infer'}
{'id': 'b65943ac-ffbe-4ab9-b209-36611ca2c6cb', 'model_name': 'VotingEnsemble', 'outputs': [0], 'model_version': 'v1'}
> 2023-02-15 14:51:54,543 [info] invoking function: {'method': 'POST', 'path': 'http://nuclio-fraud-demo-dani-transaction-fraud.default-tenant.svc.cluster.local:8080/v2/models/infer'}
{'id': '85791d18-e959-46e6-ae5f-cdc901c2dce3', 'model_name': 'VotingEnsemble', 'outputs': [0], 'model_version': 'v1'}
> 2023-02-15 14:51:54,972 [info] invoking function: {'method': 'POST', 'path': 'http://nuclio-fraud-demo-dani-transaction-fraud.default-tenant.svc.cluster.local:8080/v2/models/infer'}
{'id': '73d49f78-0f0d-4a4f-a905-61d4fed44cba', 'model_name': 'VotingEnsemble', 'outputs': [0], 'model_version': 'v1'}
Done!#

You've completed Part 3, deploying the serving function. Proceed to Part 4 to learn how to automate the ML pipeline.

Part 4: Automated ML pipeline#

An MLRun project is a container for all your work on a particular activity: all of the associated code, functions, jobs/workflows, and artifacts. Projects can be mapped to git repositories, which enables versioning, collaboration, and CI/CD. Users can create project definitions using the SDK or a YAML file and store them in the MLRun DB, a file, or an archive. Once the project is loaded you can run jobs/workflows that refer to any project element by name, allowing separation between configuration and code.

Projects contain workflows that execute the registered functions in a sequence/graph (DAG), and can reference project parameters, secrets, and artifacts by name. This notebook demonstrates how to build an automated workflow with feature selection, training, testing, and deployment.

Step 1: Setting up your project#

To run a pipeline, you first need to get or create a project object and define/import the required functions for its execution. See Create, save, and use projects for details.

The following code gets or creates a user project named "fraud-demo".

# Set the base project name
project_name = "fraud-demo"
import mlrun

# Initialize the MLRun project object
project = mlrun.get_or_create_project(project_name, context="./", user_project=True)
> 2023-02-15 14:52:09,517 [info] loaded project fraud-demo from MLRun DB

Step 2: Updating project and function definitions#

You need to save the definitions for the functions you use in the project. This enables automatic conversion of code to functions, or importing external functions, whenever you load new versions of your code or run automated CI/CD workflows. In addition, you may want to set other project attributes such as global parameters, secrets, and data.

Your code can be stored in Python files, notebooks, external repositories, packaged containers, etc. You use the project.set_function() method to register your code in the project. The definitions are saved to the project object, as well as to a YAML file in the root of your project. Functions can also be imported from the MLRun marketplace (using the hub:// schema).

This tutorial uses these functions:

  • feature_selection — the first function, which determines the top features to be used for training.

  • train — the model-training function

  • evaluate — the model-testing function

  • mlrun-model — the model-serving function

Note

set_function uses the code_to_function and import_function methods under the hood (used in the previous notebooks), but in addition it saves the function configurations in the project spec for use in automated workflows and CI/CD.

Add the function definitions to the project along with parameters and data artifacts and save the project.

project.set_function("hub://feature_selection", "feature_selection")
project.set_function("hub://auto_trainer", "train")
project.set_function("hub://v2_model_server", "serving")
Names with underscore '_' are about to be deprecated, use dashes '-' instead. Replacing underscores with dashes.
<mlrun.runtimes.serving.ServingRuntime at 0x7f5701e79520>
# set project level parameters and save
project.spec.params = {"label_column": "label"}
project.save()
<mlrun.projects.project.MlrunProject at 0x7f5720229280>


When you save the project it stores the project definitions in the project.yaml. This allows you to load the project from the source control (GIT) and run it with a single command or API call.

The project YAML for this project can be printed using:

print(project.to_yaml())
kind: project
metadata:
  name: fraud-demo-dani
  created: '2023-02-15T14:40:29.807000'
spec:
  params:
    label_column: label
  functions:
  - url: hub://feature_selection
    name: feature_selection
  - url: hub://auto_trainer
    name: train
  - url: hub://v2_model_server
    name: serving
  workflows: []
  artifacts: []
  source: ''
  desired_state: online
  owner: dani
status:
  state: online
Saving and loading projects from GIT#

After you save your project and its elements (functions, workflows, artifacts, etc.), you can commit all of your changes to a Git repository. This can be done using standard Git tools, or using MLRun project methods such as pull, push, and remote, which call the Git API for you.

Projects can then be loaded from Git using the MLRun load_project method, for example:

project = mlrun.load_project("./myproj", "git://github.com/mlrun/project-demo.git", name=project_name)

or using MLRun CLI:

mlrun project -n myproj -u "git://github.com/mlrun/project-demo.git" ./myproj

Read CI/CD integration for more details.
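For example, a minimal sketch of committing and pushing project changes using the project methods mentioned above (the branch name and commit message are placeholders, and the project is assumed to have been created from a cloned Git context):

# commit and push the current project context to its remote Git repository
project.push(branch="main", message="update workflow and function definitions")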

Using Kubeflow pipelines#

You're now ready to create a full ML pipeline. This is done by using Kubeflow Pipelines — an open-source framework for building and deploying portable, scalable machine-learning workflows based on Docker containers. MLRun leverages this framework to take your existing code and deploy it as steps in the pipeline.

Step 3: Defining and saving a pipeline workflow#

A pipeline is created by running an MLRun "workflow". The following code defines a workflow and writes it to a file in your local directory. (The file name is workflow.py.) The workflow describes a directed acyclic graph (DAG) for execution using Kubeflow Pipelines, and depicts the connections between the functions and the data as part of an end-to-end pipeline. The workflow file has a definition of a pipeline DSL for connecting the function inputs and outputs.

The defined pipeline includes the following steps:

  • Perform feature selection (feature_selection).

  • Train the model (train).

  • Test the model with its test data set (evaluate).

  • Deploy the model as a real-time serverless function (deploy).

Note

A pipeline can also include continuous build integration and deployment (CI/CD) steps, such as building container images and deploying models.

%%writefile workflow.py
import mlrun
from kfp import dsl
from mlrun.model import HyperParamOptions

from mlrun import (
    build_function,
    deploy_function,
    import_function,
    run_function,
)

    
@dsl.pipeline(
    name="Fraud Detection Pipeline",
    description="Detecting fraud from a transactions dataset"
)

def kfpipeline(vector_name='transactions-fraud'):
    
    project = mlrun.get_current_project()
    
    # Feature selection   
    feature_selection = run_function(
        "feature_selection",
        name="feature_selection",
        params={'output_vector_name': "short",
                "label_column": project.get_param('label_column', 'label'),
                "k": 18,
                "min_votes": 2,
                'ignore_type_errors': True
               },
        inputs={'df_artifact': project.get_artifact_uri(vector_name, 'feature-vector')},
        outputs=['feature_scores', 'selected_features_count', 'top_features_vector', 'selected_features'])
    
    
    # train with hyper-parameters
    train = run_function(
        "train",
        name="train",
        handler="train",
        params={"sample": -1, 
                "label_column": project.get_param('label_column', 'label'),
                "test_size": 0.10},
        hyperparams={"model_name": ['transaction_fraud_rf', 
                                    'transaction_fraud_xgboost', 
                                    'transaction_fraud_adaboost'],
                     'model_class': ["sklearn.ensemble.RandomForestClassifier", 
                                     "sklearn.linear_model.LogisticRegression",
                                     "sklearn.ensemble.AdaBoostClassifier"]},
        hyper_param_options=HyperParamOptions(selector="max.accuracy"),
        inputs={"dataset": feature_selection.outputs['top_features_vector']},
        outputs=['model', 'test_set'])
    
            
    # test and visualize your model
    test = run_function(
        "train",
        name="evaluate",
        handler='evaluate',
        params={"label_columns": project.get_param('label_column', 'label'),
                "model": train.outputs["model"], 
                "drop_columns": project.get_param('label_column', 'label')},
        inputs={"dataset": train.outputs["test_set"]})
    
    # route your serving model to use enrichment
    project.get_function("serving").set_topology(
        "router",
        "mlrun.serving.routers.EnrichmentModelRouter",
        name="EnrichmentModelRouter",
        feature_vector_uri="transactions-fraud-short",
        impute_policy={"*": "$mean"},
        exist_ok=True,
    )

    
    # deploy your model as a serverless function, you can pass a list of models to serve 
    deploy = deploy_function("serving", models=[{"key": 'fraud', "model_path": train.outputs["model"]}])
Writing workflow.py

Step 4: Registering the workflow#

Use the set_workflow MLRun project method to register your workflow with MLRun. The following code sets the name parameter to the selected workflow name ("main") and points it at the workflow file in your project directory (workflow.py).

# Register the workflow file as "main"
project.set_workflow("main", "workflow.py")

Step 5: Running a pipeline#

First run the following code to save your project:

project.save()
<mlrun.projects.project.MlrunProject at 0x7f5720229280>

Use the run MLRun project method to execute your workflow pipeline with Kubeflow Pipelines.

You can pass arguments or set the artifact_path to specify a unique path for storing the workflow artifacts.

run_id = project.run("main", arguments={}, dirty=True, watch=True)
Pipeline running (id=2e7556a2-c398-4134-8229-163bd7ee3ec3), click here to view the details in MLRun UI

Run Results

[info] Workflow 2e7556a2-c398-4134-8229-163bd7ee3ec3 finished, state=Succeeded

evaluate (started Feb 15 14:53:55, completed)
  parameters: label_columns=label, model=store://artifacts/fraud-demo-dani/transaction_fraud_adaboost:2e7556a2-c398-4134-8229-163bd7ee3ec3, drop_columns=label
  results: evaluation_accuracy=0.991504247876062, evaluation_f1_score=0.4137931034482759, evaluation_precision_score=0.42857142857142855, evaluation_recall_score=0.4

train (started Feb 15 14:53:00, completed)
  parameters: sample=-1, label_column=label, test_size=0.1
  results: best_iteration=9, accuracy=0.991504247876062, f1_score=0.4137931034482759, precision_score=0.42857142857142855, recall_score=0.4

feature_selection (started Feb 15 14:52:23, completed)
  parameters: output_vector_name=short, label_column=label, k=18, min_votes=2, ignore_type_errors=True
  results: top_features_vector=store://feature-vectors/fraud-demo-dani/short

Step 6: Test the model endpoint#

Now that your model is deployed using the pipeline, you can invoke it as usual:

# Define your serving function
serving_fn = project.get_function("serving")

# Choose an id for your test
sample_id = "C1000148617"
model_inference_path = "/v2/models/fraud/infer"

# Send your sample ID for prediction
serving_fn.invoke(path=model_inference_path, body={"inputs": [[sample_id]]})
> 2023-02-15 14:56:50,310 [info] invoking function: {'method': 'POST', 'path': 'http://nuclio-fraud-demo-dani-serving.default-tenant.svc.cluster.local:8080/v2/models/fraud/infer'}
{'id': 'dbc3b94e-367d-4970-8825-f99ebf76320b',
 'model_name': 'fraud',
 'outputs': [0]}

Done!#

Batch runs and workflows#


MLRun execution context#

After running a job, you need to be able to track it. To gain the maximum value, MLRun uses the job context object inside the code. This provides access to job metadata, parameters, inputs, and secrets, as well as an API for logging and monitoring the results and for logging text, files, artifacts, and labels.

Inside the function you can access the parameters/inputs by simply adding them as parameters to the function, or you can get them from the context object (using get_param() and get_input()).

  • If context is specified as the first parameter in the function signature, MLRun injects the current job context into it.

  • Alternatively, if the code does not run inside a function handler (e.g. in a Python main program or a notebook), you can obtain the context object from the environment using the get_or_create_ctx() function.

Common context methods:

  • get_secret(key: str) — get the value of a secret

  • logger.info("started experiment..") — textual logs

  • log_result(key: str, value) — log simple values

  • set_label(key, value) — set a label tag for that task

  • log_artifact(key, body=None, local_path=None, ...) — log an artifact (body or local file)

  • log_dataset(key, df, ...) — log a dataframe object

  • log_model(key, ...) — log a model object

Example function and usage of the context object:

import numpy as np
import pandas as pd
import plotly.graph_objects as go

from mlrun.artifacts import PlotlyArtifact

def my_job(context, p1=1, p2="x"):
    # load MLRUN runtime context (will be set by the runtime framework)

    # get parameters from the runtime context (or use defaults)

    # access input metadata, values, files, and secrets (passwords)
    print(f"Run: {context.name} (uid={context.uid})")
    print(f"Params: p1={p1}, p2={p2}")
    print("accesskey = {}".format(context.get_secret("ACCESS_KEY")))
    print("file\n{}\n".format(context.get_input("infile.txt", "infile.txt").get()))

    # Run some useful code e.g. ML training, data prep, etc.

    # log scalar result values (job result metrics)
    context.log_result("accuracy", p1 * 2)
    context.log_result("loss", p1 * 3)
    context.set_label("framework", "sklearn")

    # log various types of artifacts (file, web page, table), will be versioned and visible in the UI
    context.log_artifact(
        "model",
        body=b"abc is 123",
        local_path="model.txt",
        labels={"framework": "xgboost"},
    )
    context.log_artifact(
        "html_result", body=b"<b> Some HTML <b>", local_path="result.html"
    )

    # create a plotly output (will show in the pipelines UI)
    x = np.arange(10)
    fig = go.Figure(data=go.Scatter(x=x, y=x**2))

    # Create a PlotlyArtifact using the figure and log it
    plotly_artifact = PlotlyArtifact(figure=fig, key="plotly")
    context.log_artifact(plotly_artifact)
    
    raw_data = {
        "first_name": ["Jason", "Molly", "Tina", "Jake", "Amy"],
        "last_name": ["Miller", "Jacobson", "Ali", "Milner", "Cooze"],
        "age": [42, 52, 36, 24, 73],
        "testScore": [25, 94, 57, 62, 70],
    }
    df = pd.DataFrame(raw_data, columns=["first_name", "last_name", "age", "testScore"])
    context.log_dataset("mydf", df=df, stats=True)

Example of creating the context object from the environment:

import mlrun

if __name__ == "__main__":
    context = mlrun.get_or_create_ctx('train')
    p1 = context.get_param('p1', 1)
    p2 = context.get_param('p2', 'a-string')
    # do something
    context.log_result("accuracy", p1 * 2)
    # commit the tracking results to the DB (and mark as completed)
    context.commit(completed=True)

Note that the MLRun context is also a Python context manager and can be used in a with statement (eliminating the need for commit).

import mlrun

if __name__ == "__main__":
    with mlrun.get_or_create_ctx('train') as context:
        p1 = context.get_param('p1', 1)
        p2 = context.get_param('p2', 'a-string')
        # do something
        context.log_result("accuracy", p1 * 2)

Decorators and auto-logging#

While it is possible to log results and artifacts using the MLRun execution context, it is often more convenient to use the mlrun.handler() decorator.

Basic example#

Assume you have the following code in train.py

import pandas as pd
from sklearn.svm import SVC

def train_and_predict(train_data,
                      predict_input,
                      label_column='label'):

    x = train_data.drop(label_column, axis=1)
    y = train_data[label_column]

    clf = SVC()
    clf.fit(x, y)

    return list(clf.predict(predict_input))

With the mlrun.handler decorator, the Python function itself does not change, and logging of the inputs and outputs is automatic. The resulting code is as follows:

import pandas as pd
from sklearn.svm import SVC
import mlrun

@mlrun.handler(labels={'framework':'scikit-learn'},
               outputs=['prediction:dataset'],
               inputs={"train_data": pd.DataFrame,
                       "predict_input": pd.DataFrame})
def train_and_predict(train_data,
                      predict_input,
                      label_column='label'):

    x = train_data.drop(label_column, axis=1)
    y = train_data[label_column]

    clf = SVC()
    clf.fit(x, y)

    return list(clf.predict(predict_input))

To run the code, use the following example:

import mlrun
project = mlrun.get_or_create_project("mlrun-example", context="./", user_project=True)

trainer = project.set_function("train.py", name="train_and_predict", kind="job", image="mlrun/mlrun", handler="train_and_predict")

trainer_run = project.run_function(
    "train_and_predict", 
    inputs={"train_data": mlrun.get_sample_path('data/iris/iris_dataset.csv'),
            "predict_input": mlrun.get_sample_path('data/iris/iris_to_predict.csv')
           }
)

The outcome is a run with:

  1. A label with key "framework" and value "scikit-learn".

  2. Two inputs, "train_data" and "predict_input", parsed into pandas DataFrames.

  3. An artifact called "prediction" of type "dataset". The contents of the dataset will be the return value (in this case the prediction result).

Labels#

The decorator gives you the option to set labels for the run. The labels parameter is a dictionary with keys and values to set for the labels.
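For example, a minimal sketch (the label values here are arbitrary):

@mlrun.handler(labels={"framework": "scikit-learn", "stage": "dev"})
def train_and_predict(train_data, predict_input, label_column="label"):
    ...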

Input type parsing#

The mlrun.handler decorator can also parse the input types, if they are specified. An equivalent definition is as follows:

@mlrun.handler(labels={'framework':'scikit-learn'},
               outputs=['prediction:dataset'])
def train_and_predict(train_data: pd.DataFrame,
                      predict_input: pd.DataFrame,
                      label_column='label'):

...

Notice: Type hints from the typing module (e.g. typing.Optional, typing.Union, typing.List etc.) are currently not supported but will be in the future.

Note: If an input does not have a type hint, the decorator assumes the parameter type is mlrun.datastore.DataItem. If you specify inputs=False, all the run inputs are assumed to be of type mlrun.datastore.DataItem. You also have the option to specify a dictionary where each key is the name of an input and its value is the type.

Logging return values as artifacts#

If you specify the outputs parameter, the return values will be logged as the run artifacts. outputs expects a list; the length of the list must match the number of returned values.

The simplest option is to specify a list of strings. Each string contains the name of the artifact. You can also specify the artifact type by adding a colon after the artifact name followed by the type ('name:artifact_type'). The following are valid artifact types:

  • dataset

  • directory

  • file

  • object

  • plot

  • result

If you use only the name without the type, the following mapping from Python type to artifact type is used:

  • pandas.DataFrame → Dataset

  • pandas.Series → Dataset

  • numpy.ndarray → Dataset

  • dict → Result

  • list → Result

  • tuple → Result

  • str → Result

  • int → Result

  • float → Result

  • bytes → Object

  • bytearray → Object

  • matplotlib.pyplot.Figure → Plot

  • plotly.graph_objs.Figure → Plot
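As an illustration of this mapping, a minimal sketch (the handler and artifact names are hypothetical): the DataFrame is logged as a dataset artifact, the float as a result, and the bytes as an object.

import pandas as pd
import mlrun

@mlrun.handler(outputs=["scores:dataset", "accuracy", "model_bytes"])
def my_handler():
    scores = pd.DataFrame({"id": [1, 2], "score": [0.9, 0.1]})  # "scores:dataset" sets the type explicitly
    accuracy = 0.95                      # float -> Result (default mapping)
    model_bytes = b"model-placeholder"   # bytes -> Object (default mapping)
    return scores, accuracy, model_bytes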

Refer to mlrun.handler() for more details.

Running a task (job)#


Submit tasks (jobs) using run_function#

Use the run_function() method for invoking a job over MLRun batch functions. The run_function method accepts various parameters such as name, handler, params, inputs, schedule, etc. Alternatively, you can pass a Task object that holds all of the parameters plus the advanced options. See: new_task(), and the example in run_function.

Functions can host multiple methods (handlers). You can set the default handler per function. You need to specify which handler you intend to call in the run command.

You can pass parameters (arguments) or data inputs (such as datasets, feature-vectors, models, or files) to the functions through the run method.

  • Parameters (params) are meant for basic Python objects that can be parsed from text without special handling. So passing int, float, str, dict, and list values is all possible using params. MLRun takes the parameter and assigns it to the relevant handler parameter by name.

Important

Parameters that are passed to a workflow are limited to 10000 chars.

  • Inputs are used for passing various local or remote data objects (files, tables, models, etc.) to the function as DataItem objects. You can pass data objects using the inputs dictionary argument, where the dictionary keys match the function's handler argument names and the MLRun data URLs are provided as the values. DataItems have many methods, such as local (download the data item's file to a local temp directory) and as_df (parse the data to a pd.DataFrame). The DataItem objects handle data movement, tracking, and security in an optimal way. Read more about data items.

When a type hint is available for an argument, MLRun automatically parses the DataItem to the hinted type (when the hinted type is supported).
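For example, a minimal sketch based on this behavior (the handler and input names are hypothetical); because of the pd.DataFrame type hint, the DataItem passed as "data" is parsed into a DataFrame before the handler is called:

import pandas as pd

def describe(data: pd.DataFrame, label_column: str = "label"):
    # "data" was passed as an MLRun input (DataItem) and parsed to a DataFrame via the type hint
    print(data[label_column].value_counts())
    return len(data)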

Use run_function as a project method. For example:

# run the "train" function in myproject
run_results = myproject.run_function("train", inputs={"data": data_url})  

The first parameter in run_function is either the function name (in the project), or a function object if you want to use a function that you imported/created ad hoc or whose spec you modified. For example:

run_results = project.run_function(fn, params={"label_column": "label"}, inputs={'data': data_url})

Run/simulate functions locally:

Functions can also run and be debugged locally by using the local runtime or by setting the local=True parameter in the run() method (for batch functions).

MLRun also supports iterative jobs that can run and track multiple child jobs (for hyperparameter tasks, AutoML, etc.). See Hyperparameter tuning optimization for details and examples.
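For illustration, a minimal sketch of an iterative (hyperparameter) run submitted with run_function, assuming a registered "train" function; the parameter names, values, and data_url are placeholders:

from mlrun.model import HyperParamOptions

hp_run = project.run_function(
    "train",
    inputs={"dataset": data_url},                    # placeholder data URL
    hyperparams={"p1": [2, 4, 8], "p2": [10, 20]},   # hypothetical parameter grid
    hyper_param_options=HyperParamOptions(strategy="grid", selector="max.accuracy"),
)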

Run result object and UI#

The run_function() command returns an MLRun RunObject object that you can use to track the job and its results. If you pass the parameter watch=True (default) the command blocks until the job completes.

Run object has the following methods/properties:

  • uid() — returns the unique ID.

  • state() — returns the last known state.

  • show() — shows the latest job state and data in a visual widget (with hyperlinks and hints).

  • outputs — returns a dictionary of the run results and artifact paths.

  • logs(watch=True) — returns the latest logs. Use watch=False to disable the interactive mode in running jobs.

  • artifact(key) — returns an artifact for the provided key (as DataItem object).

  • output(key) — returns a specific result or an artifact path for the provided key.

  • wait_for_completion() — wait for async run to complete

  • refresh() — refresh run state from the db/service

  • to_dict(), to_yaml(), to_json() — converts the run object to a dictionary, YAML, or JSON format (respectively).
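For example, a short sketch of inspecting a run object (the function name, data_url, and artifact keys are placeholders):

run = project.run_function("train", inputs={"dataset": data_url}, watch=True)

print(run.state())                          # e.g. "completed"
print(run.outputs)                          # dict of results and artifact paths
accuracy = run.output("accuracy")           # a single result value or artifact path
test_df = run.artifact("test_set").as_df()  # load a logged dataset artifact as a DataFrame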


You can view the job details, logs, and artifacts in the UI. When you first open the Monitor Jobs tab it displays the last jobs that ran and their data. Click a job name to view its run history, and click a run to view more of the run's data.



See full details and examples in Functions.

Running a multi-stage workflow#

A workflow is a definition of execution of functions. It defines the order of execution of multiple dependent steps in a directed acyclic graph (DAG). A workflow can reference the project’s params, secrets, artifacts, etc. It can also use a function execution output as a function execution input (which, of course, defines the order of execution).

MLRun supports running workflows on a local or Kubeflow Pipelines engine. The local engine runs the workflow as a local process, which is simpler for debugging and for running simple/sequential tasks. The Kubeflow ("kfp") engine runs as a task over the cluster and supports more advanced operations (conditions, branches, etc.). You can select the engine at runtime. Kubeflow-specific directives like conditions and branches are not supported by the local engine.

Workflows are saved/registered in the project using the set_workflow() method.
Workflows are executed using the run() method or the CLI command mlrun project.

Refer to the Tutorials and Examples for complete examples.


Composing workflows#

Workflows are written as python functions that make use of function operations (run, build, deploy) and can access project parameters, secrets, and artifacts using get_param(), get_secret() and get_artifact_uri().

For workflows to work in Kubeflow you need to add a decorator (@dsl.pipeline(..)) as shown below.

Example workflow:

from kfp import dsl
import mlrun
from mlrun.model import HyperParamOptions

funcs = {}
DATASET = "iris_dataset"

in_kfp = True


@dsl.pipeline(name="Demo training pipeline", description="Shows how to use mlrun.")
def newpipe():

    project = mlrun.get_current_project()

    # build our ingestion function (container image)
    builder = mlrun.build_function("gen-iris")

    # run the ingestion function with the new image and params
    ingest = mlrun.run_function(
        "gen-iris",
        name="get-data",
        params={"format": "pq"},
        outputs=[DATASET],
    ).after(builder)

    # train with hyper-parameters
    train = mlrun.run_function(
        "train",
        name="train",
        params={"sample": -1, "label_column": project.get_param("label", "label"), "test_size": 0.10},
        hyperparams={
            "model_pkg_class": [
                "sklearn.ensemble.RandomForestClassifier",
                "sklearn.linear_model.LogisticRegression",
                "sklearn.ensemble.AdaBoostClassifier",
            ]
        },
        hyper_param_options=HyperParamOptions(selector="max.accuracy"),
        inputs={"dataset": ingest.outputs[DATASET]},
        outputs=["model", "test_set"],
    )
    print(train.outputs)

    # test and visualize our model
    mlrun.run_function(
        "test",
        name="test",
        params={"label_column": project.get_param("label", "label")},
        inputs={
            "models_path": train.outputs["model"],
            "test_set": train.outputs["test_set"],
        },
    )

    # deploy our model as a serverless function, we can pass a list of models to serve
    serving = mlrun.import_function("hub://v2_model_server", new_name="serving")
    deploy = mlrun.deploy_function(
        serving,
        models=[{"key": f"{DATASET}:v1", "model_path": train.outputs["model"]}],
    )

    # test out new model server (via REST API calls), use imported function
    tester = mlrun.import_function("hub://v2_model_tester", new_name="live_tester")
    mlrun.run_function(
        tester,
        name="model-tester",
        params={"addr": deploy.outputs["endpoint"], "model": f"{DATASET}:v1"},
        inputs={"table": train.outputs["test_set"]},
    )

Note

To define the step order, you can either use step outputs (as written above), or use the .after(step_1, step_2, ...) method, which lets you define the order of the workflow steps without needing to forward the outputs from the previous steps.
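For example, a one-line sketch (the step and function names are hypothetical):

# inside the pipeline function: run "validate" only after both earlier steps complete,
# without consuming their outputs
validate = mlrun.run_function("validate", name="validate").after(ingest, train)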

Saving workflows#

If you want to use workflows as part of an automated flow, save them and register them in the project. Use the set_workflow() method to register a workflow: specify a workflow name, the path to the workflow file, and the function handler name (if omitted, MLRun looks for a handler named "pipeline"). You can also set the default engine (local or kfp).

When setting the embed flag to True, the workflow code is embedded in the project file (can be used if you want to describe the entire project using a single YAML file).

You can define the schema for workflow arguments (data type, default, doc, etc.) by setting the args_schema with a list of EntrypointParam objects.

Example:

    # define an argument for the workflow
    arg = mlrun.model.EntrypointParam(
        "model_pkg_class",
        type="str",
        default="sklearn.linear_model.LogisticRegression",
        doc="model package/algorithm",
    )
    
    # register the workflow in the project and save the project
    project.set_workflow("main", "./myflow.py", handler="newpipe", args_schema=[arg])
    project.save()
    
    # run the workflow
    project.run("main", arguments={"model_pkg_class": "sklearn.ensemble.RandomForestClassifier"})

Running workflows#

Use the run() method to execute workflows. Specify the workflow using its name or workflow_path (path to the workflow file) or workflow_handler (the workflow function handler). You can specify the input arguments for the workflow and can override the system default artifact_path.

Workflows are asynchronous by default. You can set the watch flag to True and the run operation blocks until completion and prints out the workflow progress. Alternatively, you can use .wait_for_completion() on the run object.

The default workflow engine is kfp. You can override it by specifying the engine in the run() or set_workflow() methods. Using the local engine executes the workflow state machine locally (its functions still run as cluster jobs). If you set the local flag to True, the workflow uses the local engine AND the functions run as local processes. This mode is used for local debugging of workflows. The remote engine runs the workflow from a remote pod. From the project source you can have the remote pod run the workflow with the local engine by setting the engine to remote:local.

When running workflows from a Git-enabled context, MLRun first verifies that there are no uncommitted Git changes (to guarantee that workflows that load from Git do not use old code versions). You can suppress that check by setting the dirty flag to True.

Examples:

# simple run of workflow 'main' with arguments, block until it completes (watch=True)
run = project.run("main", arguments={"param1": 6}, watch=True)

# run workflow specified with a function handler (my_pipe)
run = project.run(workflow_handler=my_pipe)
# wait for pipeline completion
run.wait_for_completion()

# run workflow in local debug mode
run = project.run(workflow_handler=my_pipe, local=True, arguments={"param1": 6})
Notification#

Instead of waiting for completion, you can set up a Slack notification with a results summary.

Use one of:

project.notifiers.add_notification(notification_type="slack",params={"webhook":"<user-slack-webhook>"})

or in a Jupyter notebook with the %env magic command:

%env SLACK_WEBHOOK=<slack webhook url>

Configuring runs and functions#

MLRun orchestrates serverless functions over Kubernetes. You can specify the resource requirements (CPU, memory, GPUs), preferences, and pod priorities in the logical function object. You can also configure how MLRun prevents stuck pods. All of these are used during the function deployment.

Configuring runs and functions is relevant for all supported cloud platforms.


Environment variables#

Environment variables can be added individually, from a Python dictionary, or a file:

# Single variable
fn.set_env(name="MY_ENV", value="MY_VAL")

# Multiple variables
fn.set_envs(env_vars={"MY_ENV" : "MY_VAL", "SECOND_ENV" : "SECOND_VAL"})

# Multiple variables from file
fn.set_envs(file_path="env.txt")

Replicas#

Some runtimes can scale horizontally. Scaling is configured either as a number of replicas:

spec.replicas

or as a range (for auto scaling in Dask or Nuclio):

spec.min_replicas = 1
spec.max_replicas = 4

Note

Scaling (replication) algorithm: if a target utilization (Target CPU%) value is set, the replication controller calculates the utilization as a percentage of the equivalent resource request (CPU request) on the replicas, and provides horizontal scaling based on that. See also Kubernetes horizontal autoscaling.

See more details in Dask, MPIJob and Horovod, Spark, Nuclio.

CPU, GPU, and memory limits for user jobs#

When you create a pod in an MLRun job or Nuclio function, the pod has default CPU and memory limits. When the job runs, it can consume resources up to the limits defined. The default limits are set at the service level. You can change the default limit for the service, and also overwrite the default when creating a job or a function. Adding requests and limits to your function specifies what compute resources are required. It is best practice to define this for each MLRun function.

# Requests - lower bound
fn.with_requests(mem="1G", cpu=1)

# Limits - upper bound
fn.with_limits(mem="2G", cpu=2, gpus=1)

See more about Kubernetes Resource Management for Pods and Containers.

UI configuration#

When creating a service, set the Memory and CPU in the Common Parameters tab, under User jobs defaults. When creating a job or a function, overwrite the default Memory, CPU, or GPU in the Configuration tab, under Resources.

SDK configuration#

Configure the limits assigned to a function by using with_limits. For example:

training_function = mlrun.code_to_function("training.py", name="training", handler="train", 
                                           kind="mpijob", image="mlrun/mlrun-gpu")
training_function.spec.replicas = 2
training_function.with_requests(cpu=2)
training_function.with_limits(gpus=1)

Note

When specifying GPUs, MLRun uses nvidia.com/gpu as default GPU type. To use a different type of GPU, specify it using the optional gpu_type parameter.

Number of workers and GPUs#

For each Nuclio or serving function, MLRun creates an HTTP trigger with the default of 1 worker. When using GPU in remote functions you must ensure that the number of GPUs is equal to the number of workers (or manage the GPU consumption within your code). You can set the number of GPUs for each pod using the MLRun SDK.

You can change the number of workers after you create the trigger (function object); you then need to redeploy the function. Examples of changing the number of workers:

with_http:
serve.with_http(workers=8, worker_timeout=10)

add_v3io_stream_trigger:
serve.add_v3io_stream_trigger(stream_path='v3io:///projects/myproj/stream1', maxWorkers=3,name='stream', group='serving', seek_to='earliest', shards=1)

Volumes#

When you create a pod in an MLRun job or Nuclio function, the pod by default has access to a file-system which is ephemeral, and gets deleted when the pod completes its execution. In many cases, a job requires access to files residing on external storage, or to files containing configurations and secrets exposed through Kubernetes config-maps or secrets. Pods can be configured to consume the following types of volumes, and to mount them as local files in the local pod file-system:

  • V3IO containers: when running on the Iguazio system, pods have access to the underlying V3IO shared storage. This option mounts a V3IO container or a subpath within it to the pod through the V3IO FUSE driver.

  • PVC: Mount a Kubernetes persistent volume claim (PVC) to the pod. The persistent volume and the claim need to be configured beforehand.

  • Config Map: Mount a Kubernetes Config Map as local files to the pod.

  • Secret: Mount a Kubernetes secret as local files to the pod.

For each of the options, a name needs to be assigned to the volume, as well as a local path at which to mount the volume (using a Kubernetes Volume Mount). Depending on the type of the volume, other configuration options may be needed, such as an access key for a V3IO volume.

See more about Kubernetes Volumes.

MLRun supports the concept of volume auto-mount which automatically mounts the most commonly used type of volume to all pods, unless disabled. See more about MLRun auto mount.
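For example, a minimal sketch of relying on auto-mount (the function name, file, and handler are placeholders):

import mlrun

fn = mlrun.code_to_function("my-job", filename="job.py", kind="job", image="mlrun/mlrun", handler="handler")
# attach the platform's most commonly used storage type automatically (e.g. V3IO on Iguazio)
fn.apply(mlrun.auto_mount())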

UI configuration#

You can configure volumes when creating a job, rerunning an existing job, and creating an ML function. Modify the volumes for an ML function by pressing ML functions, then the function's action menu (⋮), then the Edit | Resources | Volumes drop-down list.

Select the volume mount type: either Auto (using auto-mount), Manual or None. If selecting Manual, fill in the details in the volumes list for each volume to mount to the pod. Multiple volumes can be configured for a single pod.

SDK configuration#

Configure volumes attached to a function by using the apply function modifier on the function.

For example, using v3io storage:

import mlrun

# Import the function from the Function Hub (hub://)
open_archive_function = mlrun.import_function("hub://open_archive")

# use mount_v3io() for Iguazio (V3IO) volumes
open_archive_function.apply(mlrun.mount_v3io())

You can specify a list of the V3IO paths to use and how they map inside the container (using volume_mounts). For example:

mlrun.mount_v3io(name='data',access_key='XYZ123..',volume_mounts=[mlrun.VolumeMount("/data", "projects/proj1/data")])

See full details in mount_v3io.

Alternatively, using a PVC volume:

mount_pvc(pvc_name="data-claim", volume_name="data", volume_mount_path="/data")

See full details in mount_pvc.

Preemption mode: Spot vs. On-demand nodes#

Node selector is supported for all cloud platforms. It is relevant for MLRun and Nuclio only.

When running ML functions you might want to control whether to run on spot nodes or on-demand nodes. Preemption mode controls whether pods can be scheduled on preemptible (spot) nodes. Preemption mode is supported for all functions.

Preemption mode uses Kubernetes Taints and Tolerations to enforce the selected mode. Read more in Kubernetes Taints and Tolerations.

Why preemption mode#

On-demand instances provide full control over the instance lifecycle. You decide when to launch, stop, hibernate, start, reboot, or terminate it. With Spot instances, you request capacity from specific availability zones, though it is susceptible to spot capacity availability. This is a good choice if you can be flexible about when your applications run and if your applications can be interrupted.

Here are some questions to consider when choosing the type of node:

  • Is the function mission critical and must be operational at all times?

  • Is the function a stateful function or stateless function?

  • Can the function recover from unexpected failure?

  • Is this a job that should run only when there are available inexpensive resources?

Important

When an MLRun job is running on a spot node and it fails, it won't get back up again. However, if Nuclio goes down due to a spot issue, it is brought up by Kubernetes.

Kubernetes has a few methods for configuring which nodes to run on. To get a deeper understanding, see Pod Priority and Preemption. Also, you must understand the configuration of the spot nodes as specified by the cloud provider.

Stateless and Stateful Applications#

When deploying your MLRun jobs to specific nodes, take into consideration that on-demand nodes are designed to run stateful applications while spot nodes are designed for stateless applications. MLRun jobs are more stateful by nature. An MLRun job that is assigned to run on a spot node might be subject to interruption; it would have to be designed so that the job/function state will be saved when scaling to zero.

Supported preemption modes#

Preemption mode has three values:

  • Allow: The function pod can run on a spot node if one is available.

  • Constrain: The function pod only runs on spot nodes, and does not run if none is available.

  • Prevent: Default. The function pod cannot run on a spot node.

To change the default function preemption mode, override the MLRun API configuration: set the "MLRUN_FUNCTION_DEFAULTS__PREEMPTION_MODE" environment variable to one of the above modes.

UI configuration#

Note

Relevant when MLRun is executed in the Iguazio platform.

You can configure Spot node support when creating a job, rerunning an existing job, and creating an ML function. The Run on Spot nodes drop-down list is in the Resources section of jobs. Configure the Spot node support for individual Nuclio functions when creating a function in the Configuration tab, under Resources.

SDK configuration#

Configure preemption mode by calling the with_preemption_mode method on the function, specifying a mode from the list of values above.
This example illustrates a function that cannot be scheduled on preemptible nodes:

import mlrun

train_fn = mlrun.code_to_function('training',
                                  kind='job',
                                  handler='my_training_function')
train_fn.with_preemption_mode(mode="prevent")
train_fn.run(inputs={"dataset": my_data})
   

See with_preemption_mode.

Alternatively, you can specify the preemption using the with_priority_class and with_node_selection methods. This example specifies that the pod/function runs only on non-preemptible nodes:

import mlrun

train_fn = mlrun.code_to_function('training',
                                  kind='job',
                                  handler='my_training_function')
train_fn.with_priority_class(name="default-priority")
train_fn.with_node_selection(node_selector={"app.iguazio.com/lifecycle": "non-preemptible"})
train_fn.run(inputs={"dataset": my_data})

See with_priority_class. See with_node_selection.

Pod priority for user jobs#

Pods (services, or jobs created by those services) can have priorities, which indicate the relative importance of one pod to the other pods on the node. The priority is used for scheduling: a lower priority pod can be evicted to allow scheduling of a higher priority pod. Pod priority is relevant for all pods created by the service. For MLRun, it applies to the jobs created by MLRun. For Nuclio it applies to the pods of the Nuclio-created functions.

Eviction uses these values in conjunction with pod priority to determine what to evict. See Pod Priority and Preemption.

Pod priority is specified through Priority classes, which map to a priority value. The priority values are: High, Medium, Low. The default is Medium. Pod priority is supported for:

  • MLRun jobs: the default priority class for the jobs that MLRun creates.

  • Nuclio functions: the default priority class for the user-created functions.

  • Jupyter

  • Presto (The pod priority also affects any additional services that are directly affected by Presto, for example hive and mariadb, which are created if Enable hive is checked in the Presto service.)

  • Grafana

  • Shell

UI configuration#

Note

Relevant when MLRun is executed in the Iguazio platform.

Configure the default priority for a service, which is applied to the service itself or to all subsequently created user-jobs in the service's Common Parameters tab, User jobs defaults section, Priority class drop-down list.

Modify the priority for an ML function by pressing ML functions, then the action menu (⋮) of the function, then the Edit | Resources | Pods Priority drop-down list.

SDK configuration#

Configure pod priority by calling the with_priority_class method on the function in your Jupyter notebook.
For example:

import mlrun

train_fn = mlrun.code_to_function('training',
                                  kind='job',
                                  handler='my_training_function')
train_fn.with_priority_class(name="<priority class name>")
train_fn.run(inputs={"dataset": my_data})
 

See with_priority_class.

Node selection#

Node selection can be used to specify where to run workloads (e.g. specific node groups, instance types, etc.). This is a more advanced parameter mainly used in production deployments to isolate platform services from workloads. See Node affinity for more information on how to configure node selection.

# Only run on non-spot instances
fn.with_node_selection(node_selector={"app.iguazio.com/lifecycle" : "non-preemptible"})

Scaling and auto-scaling#

Scaling behavior can be added to real-time and distributed runtimes including nuclio, serving, spark, dask, and mpijob. See Replicas to see how to configure scaling behavior per runtime. This example demonstrates setting replicas for nuclio/serving runtimes:

# Nuclio/serving scaling
fn.spec.replicas = 2
fn.spec.min_replicas = 1
fn.spec.max_replicas = 4

Mount persistent storage#

In some instances, you might need to mount a file system to your container to persist data. This can be done with native K8s PVCs or the V3IO data layer for Iguazio clusters. See Attach storage to functions for more information on the storage options.

# Mount persistent storage - V3IO
fn.apply(mlrun.mount_v3io())

# Mount persistent storage - PVC
fn.apply(mlrun.platforms.mount_pvc(pvc_name="data-claim", volume_name="data", volume_mount_path="/data"))

Preventing stuck pods#

The runtimes spec has four "state_threshold" attributes that can determine when to abort a run. Once a threshold is passed and the run is in the matching state, the API monitoring aborts the run, deletes its resources, sets the run state to aborted, and issues a "status_text" message.

The four states and their default thresholds are:

'pending_scheduled': '1h',        # Scheduled and pending, and therefore consuming resources
'pending_not_scheduled': '-1',    # Pending but not yet scheduled, so it can continue to wait for resources
'image_pull_backoff': '1h',       # The container in the pod fails to pull the required image from the registry
'running': '24h'                  # The job is running

The thresholds are time strings composed of value and scale pairs (e.g. "30 minutes 5h 1day"). To set no threshold (wait indefinitely), use -1.

To change the state thresholds, use:

func.set_state_thresholds({"pending_not_scheduled": "1 min"}) 

To set a threshold for a single run only, use:

func.run(state_thresholds={"running": "1 min", "image_pull_backoff": "1 minute and 30s"}) 


Note

State thresholds are not supported for Nuclio/serving runtimes (since they have their own monitoring) or for the Dask runtime (which can be monitored by the client).

Scheduled jobs and workflows#

Oftentimes you may want to run a job on a regular schedule. For example, fetching from a datasource every morning, compiling an analytics report every month, or detecting model drift every hour.

Creating a job and scheduling it#

MLRun makes it very simple to add a schedule to a given job. To showcase this, the following job runs the code below, which resides in a file titled schedule.py:

def hello(context):
    print("You just ran a scheduled job!")

To create the job, use the code_to_function syntax and specify the kind like below:

import mlrun

job = mlrun.code_to_function(
    name="my-scheduled-job",      # Name of the job (displayed in console and UI)
    filename="schedule.py",       # Python file or Jupyter notebook to run
    kind="job",                   # Run as a job
    image="mlrun/mlrun",          # Use this Docker image
    handler="hello"               # Execute the function hello() within code.py
)

Running the job using a schedule#

To add a schedule, run the job and specify the schedule parameter using Cron syntax like so:

job.run(schedule="0 * * * *")

This runs the job every hour. An excellent resource for generating Cron schedules is Crontab.guru.
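
As another illustration (any valid Cron expression works here), the same job could be scheduled to run once a day at 06:00:

job.run(schedule="0 6 * * *")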

Scheduling a workflow#

After loading the project (load_project), run the project with the scheduled workflow:

project.run("main", schedule='0 * * * *')

Note

  1. Remote workflows can only be performed by a project with a remote source (git://github.com/mlrun/something.git, http://some/url/file.zip or http://some/url/file.tar.gz). So you need to either put your code in Git or archive it and then set a source to it.

    • To set the project source, use the project.set_source method (see the sketch after this note).

    • To set the workflow, use the project.set_workflow method.

  2. Example for a remote GitHub project - mlrun/project-demo
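
The following sketch shows how these pieces fit together before scheduling; the repository URL and the workflow file name are placeholders:

# point the project at a remote source and register the workflow
project.set_source("git://github.com/mlrun/project-demo.git", pull_at_runtime=True)
project.set_workflow("main", "workflow.py")

# run the workflow on a schedule (every hour)
project.run("main", schedule='0 * * * *')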

You can delete a scheduled workflow in the MLRun UI. To update a scheduled workflow, re-define the schedule in the workflow, for example:

project.run("main", schedule='0 * * * *')

Notifications#

MLRun supports configuring notifications on jobs and scheduled jobs. This section describes the SDK for notifications.

The Notification Object#

The notification object's schema is:

  • kind: str - notification kind (slack, git, etc…)

  • when: list[str] - run states on which to send the notification (completed, error, aborted)

  • name: str - notification name

  • message: str - notification message

  • severity: str - notification severity (info, warning, error, debug)

  • params: dict - notification parameters (See definitions in Notification Kinds)

  • secret_params: dict - secret data notification parameters (See definitions in Notification Params and Secrets)

  • condition: str - jinja template for a condition that determines whether the notification is sent or not (See Notification Conditions)

Local vs Remote#

Notifications can be sent either locally from the SDK, or remotely from the MLRun API. Usually, a local run sends locally, and a remote run sends remotely. However, there are several special cases where the notification is sent locally either way. These cases are:

  • Local or KFP Engine Pipelines: To preserve backwards compatibility, the SDK sends the notifications as it did before adding the run notifications mechanism. This means you need to watch the pipeline in order for its notifications to be sent. (Remote pipelines act differently; see Configuring Notifications For Pipelines for more details.)

  • Dask: Dask runs are always local (against a remote dask cluster), so the notifications are sent locally as well.

Disclaimer: Notifications of local runs aren't persisted.

Notification Params and Secrets#

The notification parameters often contain sensitive information, such as Slack webhooks, Git tokens, etc. To ensure the safety of this sensitive data, the parameters are split into two objects: params and secret_params. Either can be used to store any notification parameter; however, secret_params are protected by project secrets. When a notification is created, its secret_params are automatically masked and stored in an MLRun project secret. The name of the secret is built from a hash of the params themselves, so if multiple notifications use the same secret, it doesn't waste space in the project secrets. Once masked, the notification's secret_params contain a reference to the secret under the secret key. For non-sensitive notification parameters, simply use the params field, which doesn't go through this masking process. Use secret_params exclusively for handling sensitive information, ensuring secure data management.

Notification Kinds#

Currently, the supported notification kinds and their params are as follows:

  • slack:

    • webhook: The slack webhook to which to send the notification.

  • git:

    • token: The git token to use for the git notification.

    • repo: The git repo to which to send the notification.

    • issue: The git issue to which to send the notification.

    • merge_request: In gitlab (as opposed to github), merge requests and issues are separate entities. If using merge request, the issue will be ignored, and vice versa.

    • server: The git server to which to send the notification.

    • gitlab: (bool) Whether the git server is gitlab or not.

  • webhook:

    • url: The webhook url to which to send the notification.

    • method: The http method to use when sending the notification (GET, POST, PUT, etc…).

    • headers: (dict) The http headers to send with the notification.

    • override_body: (dict) The body to send with the notification. If not specified, the body will be a dict with the name, message, severity, and the runs list of the completed runs.

    • verify_ssl: (bool) Whether SSL certificates are validated during HTTP requests. The default is True.

  • console (no params, local only)

  • ipython (no params, local only)

Configuring Notifications For Runs#

In any run method you can configure the notifications via their model. For example:

notification = mlrun.model.Notification(
    kind="webhook",
    when=["completed","error"],
    name="notification-1",
    message="completed",
    severity="info",
    secret_params={"url": "<webhook url>"},
    params={"method": "GET", "verify_ssl": True},
)
function.run(handler=handler, notifications=[notification])

Configuring Notifications For Pipelines#

To set notifications on pipelines, supply the notifications in the run method of either the project or the pipeline. For example:

notification = mlrun.model.Notification(
    kind="webhook",
    when=["completed","error"],
    name="notification-1",
    message="completed",
    severity="info",
    secret_params={"url": "<webhook url>"},
    params={"method": "GET", "verify_ssl": True},
)
project.run(..., notifications=[notification])

Remote Pipeline Notifications#

In remote pipelines, the pipeline end notifications are sent from the MLRun API. This means you don't need to watch the pipeline in order for its notifications to be sent. The pipeline start notification is still sent from the SDK when triggering the pipeline.

Local and KFP Engine Pipeline Notifications#

In these engines, the notifications are sent locally from the SDK. This means you need to watch the pipeline in order for its notifications to be sent. This is a fallback to the old notification behavior; therefore, not all of the new notification features are supported. Only the notification kind and params are taken into account. In these engines the old way of setting project notifiers is still supported:

project.notifiers.add_notification(notification_type="slack",params={"webhook":"<slack webhook url>"})
project.notifiers.add_notification(notification_type="git", params={"repo": "<repo>", "issue": "<issue>", "token": "<token>"})

Instead of passing the webhook in the notification params, it is also possible in a Jupyter notebook to use the %env magic command:

%env SLACK_WEBHOOK=<slack webhook url>

Editing and removing notifications is done similarly with the following methods:

project.notifiers.edit_notification(notification_type="slack",params={"webhook":"<new slack webhook url>"})
project.notifiers.remove_notification(notification_type="slack")

Setting Notifications on Live Runs#

You can set notifications on live runs via the set_run_notifications method. For example:

import mlrun

mlrun.get_run_db().set_run_notifications("<project-name>", "<run-uid>", [notification1, notification2])

Using the set_run_notifications method overrides any existing notifications on the run. To delete all notifications, pass an empty list.
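
For example, clearing all notifications from a live run (the project name and run UID are placeholders):

mlrun.get_run_db().set_run_notifications("<project-name>", "<run-uid>", [])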

Setting Notifications on Scheduled Runs#

You can set notifications on scheduled runs via the set_schedule_notifications method. For example:

import mlrun

mlrun.get_run_db().set_schedule_notifications("<project-name>", "<schedule-name>", [notification1, notification2])

Using the set_schedule_notifications method overrides any existing notifications on the schedule. To delete all notifications, pass an empty list.

Notification Conditions#

You can configure the notification to be sent only if the run meets certain conditions. This is done using the condition parameter in the notification object. The condition is a string that is evaluated using a Jinja template with the run object in its context. The Jinja template should return a boolean value that determines whether the notification is sent or not. If any other value is returned or if the template is malformed, the condition is ignored and the notification is sent as normal.

Take the case of a run that calculates and outputs model drift. This example code sets a notification to fire only if the drift is above a certain threshold:

notification = mlrun.model.Notification(
    kind="slack",
    when=["completed","error"],
    name="notification-1",
    message="completed",
    severity="info",
    secret_params={"webhook": "<slack webhook url>"},
    condition='{{ run["status"]["results"]["drift"] > 0.1 }}'
)

Real-time serving pipelines (graphs)#

MLRun graphs enable building and running DAGs (directed acyclic graphs).

MLRun graph capabilities include:

  • Easy to build and deploy distributed real-time computation graphs

  • Use the real-time serverless engine (Nuclio) for auto-scaling and optimized resource utilization

  • Built-in operators to handle data manipulation, IO, machine learning, deep-learning, NLP, etc.

  • Built-in monitoring for performance, resources, errors, data, model behavior, and custom metrics

  • Debug in the IDE/Notebook

Graphs are composed of individual steps. The first graph element accepts an Event object, transforms/processes the event and passes the result to the next steps in the graph. The final result can be written out to some destination (file, DB, stream, etc.) or returned back to the caller (one of the graph steps can be marked with .respond()).

The serving graphs can be composed of pre-defined graph steps, block-type elements (model servers, routers, ensembles, data readers and writers, data engineering tasks, validators, etc.), custom steps, or from native python classes/functions. A graph can have data processing steps, model ensembles, model servers, post-processing, etc. (see the Advanced Model Serving Graph Notebook Example). Graphs can auto-scale and span multiple function containers (connected through streaming protocols).

serving graph high-level

Different steps can run on the same local function, or run on a remote function. You can call existing functions from the graph and reuse them from other graphs, as well as scale up and down the different components individually.

Graphs can run inside your IDE or Notebook for test and simulation. Serving graphs are built on top of Nuclio (real-time serverless engine), MLRun jobs, MLRun Storey (native Python async and stream processing engine), and other MLRun facilities.

The serving graphs are used by MLRun’s Feature Store to build real-time feature engineering pipelines.


Getting started#

This example uses a custom class and custom function. See custom steps for more details.


Steps#

The following code defines basic steps that illustrate building a graph. These steps are:

  • inc: increments the value by 1.

  • mul: multiplies the value by 2.

  • WithState: class that increments an internal counter, prints an output, and adds the input value to the current counter.

# mlrun: start-code


def inc(x):
    return x + 1


def mul(x):
    return x * 2


class WithState:
    def __init__(self, name, context, init_val=0):
        self.name = name
        self.context = context
        self.counter = init_val

    def do(self, x):
        self.counter += 1
        print(f"Echo: {self.name}, x: {x}, counter: {self.counter}")
        return x + self.counter


# mlrun: end-code

Create a function#

Now take the code above and create an MLRun function called simple-graph, of type serving.

import mlrun

fn = mlrun.code_to_function("simple-graph", kind="serving", image="mlrun/mlrun")
graph = fn.set_topology("flow")

Build the graph#

Use graph.to() to chain steps. Use .respond() to mark that the output of that step is returned to the caller (as an http response). By default the graph is async with no response.

graph.to(name="+1", handler="inc").to(name="*2", handler="mul").to(
    name="(X+counter)", class_name="WithState"
).respond()
<mlrun.serving.states.TaskStep at 0x7f821e504450>

Visualize the graph#

Using the plot method, you can visualize the graph.

graph.plot(rankdir="LR")

Test the function#

Create a mock server and test the graph locally. Since this graph accepts a numeric value as the input, that value is provided in the body parameter.

server = fn.to_mock_server()
server.test(body=5)
Echo: (X+counter), x: 12, counter: 1
13

Run the function again. This time, the counter should be 2 and the output should be 14.

server.test(body=5)
Echo: (X+counter), x: 12, counter: 2
14

Deploy the function#

Use the deploy method to deploy the function.

fn.deploy(project="basic-graph-demo")
> 2021-11-08 07:30:21,571 [info] Starting remote function deploy
2021-11-08 07:30:21  (info) Deploying function
2021-11-08 07:30:21  (info) Building
2021-11-08 07:30:21  (info) Staging files and preparing base images
2021-11-08 07:30:21  (info) Building processor image
2021-11-08 07:30:26  (info) Build complete
2021-11-08 07:30:31  (info) Function deploy complete
> 2021-11-08 07:30:31,785 [info] successfully deployed function: {'internal_invocation_urls': ['nuclio-basic-graph-demo-simple-graph.default-tenant.svc.cluster.local:8080'], 'external_invocation_urls': ['basic-graph-demo-simple-graph-basic-graph-demo.default-tenant.app.aganefaibuzg.iguazio-cd2.com/']}
'http://basic-graph-demo-simple-graph-basic-graph-demo.default-tenant.app.aganefaibuzg.iguazio-cd2.com/'

Test the deployed function#

Use the invoke method to call the function.

fn.invoke("", body=5)
> 2021-11-08 07:30:43,241 [info] invoking function: {'method': 'POST', 'path': 'http://nuclio-basic-graph-demo-simple-graph.default-tenant.svc.cluster.local:8080/'}
13
fn.invoke("", body=5)
> 2021-11-08 07:30:48,359 [info] invoking function: {'method': 'POST', 'path': 'http://nuclio-basic-graph-demo-simple-graph.default-tenant.svc.cluster.local:8080/'}
14

Use cases#


In addition to the examples in this section, see the:

  • Distributed (multi-function) pipeline example that details how to run a pipeline that consists of multiple serverless functions (connected using streams).

  • Advanced model serving graph notebook example that illustrates the flow, task, model, and ensemble router states; building tasks from custom handlers; classes and storey components; using custom error handlers; testing graphs locally; deploying a graph as a real-time serverless function.

  • MLRun demos repository for additional use cases and full end-to-end examples, including fraud prevention using the Iguazio feature store, a mask detection demo, and converting existing ML code to an MLRun project.

Data and feature engineering (using the feature store)#

You can build a feature set transformation using serving graphs.

High-level transformation logic is automatically converted to real-time serverless processing engines that can read from any online or offline source, handle any type of structures or unstructured data, run complex computation graphs and native user code. Iguazio’s solution uses a unique multi-model database, serving the computed features consistently through many different APIs and formats (like files, SQL queries, pandas, real-time REST APIs, time-series, streaming), resulting in better accuracy and simpler integration.

Read more in Feature store, and Feature set transformations.

Example of a simple model serving router#

Graphs are used for serving models with different transformations.

To deploy a serving function, you need to import or create the serving function, add models to it, and then deploy it.

import mlrun
# load the sklearn model serving function and add models to it
fn = mlrun.import_function('hub://v2_model_server')
fn.add_model("model1", model_path={model1-url})
fn.add_model("model2", model_path={model2-url})

# deploy the function to the cluster
fn.deploy()

# test the live model endpoint
fn.invoke('/v2/models/model1/infer', body={"inputs": [5]})

The serving function supports the same protocol used in KFServing V2 and the Triton serving framework. To invoke the model, use the following URL: <function-host>/v2/models/model1/infer.

See the serving protocol specification for details.

Note

Model url is either an MLRun model store object (starts with store://) or URL of a model directory (in NFS, s3, v3io, azure, for example s3://{bucket}/{model-dir}). Note that credentials might need to be added to the serving function via environment variables or MLRun secrets.

See the scikit-learn classifier example, which explains how to create/log MLRun models.

Writing your own serving class#

You can implement your own model serving or data processing classes. All you need to do is:

  1. Inherit from the base model serving class.

  2. Add your implementation for load() (download the model file(s) and load the model into memory).

  3. Add your implementation for predict() (accept the request payload and return the prediction/inference results).

You can override additional methods: preprocess, validate, postprocess, explain.
You can add custom API endpoints by adding the method op_xx(event) (which can be invoked by calling <model-url>/xx, where the operation is xx). See model class API.
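
For illustration, a hedged sketch of such a class; the class name and the op_metadata operation (served under <model-url>/metadata) are hypothetical:

import mlrun


class MyModel(mlrun.serving.V2ModelServer):
    def load(self):
        # download the model file(s) and load the model into memory
        ...

    def predict(self, body: dict) -> list:
        # accept the request payload and return the inference results
        ...

    def op_metadata(self, event):
        # custom endpoint, invoked by calling <model-url>/metadata
        return {"model": self.name}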

For an example of writing the minimal serving functions, see Minimal sklearn serving function example.

See the full V2 Model Server (SKLearn) example that tests one or more classifier models against a held-out dataset.

Example of advanced data processing and serving ensemble#

MLRun serving graphs can host advanced pipelines that handle event/data processing, ML functionality, or any custom task. The following example demonstrates an asynchronous pipeline that pre-processes data, passes the data into a model ensemble, and finishes off with post processing.

For a complete example, see the Advanced graph example notebook.

Create a new function of type serving from code and set the graph topology to async flow.

import mlrun
function = mlrun.code_to_function("advanced", filename="demo.py", 
                                  kind="serving", image="mlrun/mlrun",
                                  requirements=['storey'])
graph = function.set_topology("flow", engine="async")

Build and connect the graph (DAG) using the custom function and classes and plot the result. Add steps using the step.to() method (adds a new step after the current one), or using the graph.add_step() method.

Use the graph error_handler if you want an error from the graph or a step to be fed into a specific state (catcher). See the full description in Error handling.

Specify which step is the responder (returns the HTTP response) using the step.respond() method. If the responder is not specified, the graph is non-blocking.

# use built-in storey class or our custom Echo class to create and link Task steps. Add an error handling step that runs only if the "Echo" step fails
graph.to("storey.Extend", name="enrich", _fn='({"tag": "something"})') \
     .to(class_name="Echo", name="pre-process", some_arg='abc').error_handler(name='catcher', handler='handle_error', full_event=True)

# add an Ensemble router with two child models (routes), the "*" prefix marks it as router class
router = graph.add_step("*mlrun.serving.VotingEnsemble", name="ensemble", after="pre-process")
router.add_route("m1", class_name="ClassifierModel", model_path=path1)
router.add_route("m2", class_name="ClassifierModel", model_path=path2)

# add the final step (after the router), which handles post-processing and response to the client
graph.add_step(class_name="Echo", name="final", after="ensemble").respond()

# plot the graph (using Graphviz) and run a test
graph.plot(rankdir='LR')


graph-flow

Create a mock (test) server, and run a test. Use wait_for_completion() to wait for the async event loop to complete.

server = function.to_mock_server()
resp = server.test("/v2/models/m2/infer", body={"inputs": data})
server.wait_for_completion()

And deploy the graph as a real-time Nuclio serverless function with one command:

function.deploy()

Note

If you test a Nuclio function that has a serving graph with the async engine via the Nuclio UI, the UI might not display the logs in the output.

Example of an NLP processing pipeline with real-time streaming#

In some cases it's useful to split your processing into multiple functions and use streaming protocols to connect those functions. In this example, the data processing runs in the first function/container and the NLP processing runs in the second function, which contains the GPU.

See the full notebook example.

# define a new real-time serving function (from code) with an async graph
fn = mlrun.code_to_function("multi-func", filename="./data_prep.py", kind="serving", image='mlrun/mlrun')
graph = fn.set_topology("flow", engine="async")

# define the graph steps (DAG)
graph.to(name="load_url", handler="load_url")\
     .to(name="to_paragraphs", handler="to_paragraphs")\
     .to("storey.FlatMap", "flatten_paragraphs", _fn="(event)")\
     .to(">>", "q1", path=internal_stream)\
     .to(name="nlp", class_name="ApplyNLP", function="enrich")\
     .to(name="extract_entities", handler="extract_entities", function="enrich")\
     .to(name="enrich_entities", handler="enrich_entities", function="enrich")\
     .to("storey.FlatMap", "flatten_entities", _fn="(event)", function="enrich")\
     .to(name="printer", handler="myprint", function="enrich")\
     .to(">>", "output_stream", path=out_stream)

# specify the "enrich" child function, add extra package requirements
child = fn.add_child_function('enrich', './nlp.py', 'mlrun/mlrun')
child.spec.build.commands = ["python -m pip install spacy",
                             "python -m spacy download en_core_web_sm"]
graph.plot()

Currently, queues support Iguazio V3IO and Kafka streams.

Graph concepts and state machine#

A graph is composed of the following:

  • Step: A step runs a function or class handler or a REST API call. MLRun comes with a list of pre-built steps that include data manipulation, readers, writers, and model serving. You can also write your own steps using standard Python functions or custom functions/classes, or a step can call an external REST API (using the special $remote class).

  • Router: A special type of step is a router with routing logic and multiple child routes/models. The basic routing logic is to route to the child routes based on the event.path. More advanced or custom routing can be used, for example, the ensemble router sends the event to all child routes in parallel, aggregates the result and responds.

  • Queue: A queue or stream that accepts data from one or more source steps and publishes to one or more output steps. Queues are best used to connect independent functions/containers. Queues can run in-memory or be implemented using a stream, which allows them to span processes/containers.

The graph server has two modes of operation (topologies):

  • Router topology (default): A minimal configuration with a single router and child tasks/routes. This can be used for simple model serving or single hop configurations.

  • Flow topology: A full graph/DAG. The flow topology is implemented using two engines: async (the default) is based on Storey and asynchronous event loop; and sync, which supports a simple sequence of steps.


The Event object#

The graph state machine accepts an event object (similar to a Nuclio event) and passes it along the pipeline. An event object hosts the event body along with other attributes such as path (http request path), method (GET, POST, …), and id (unique event ID).

In some cases the events represent a record with a unique key, which can be read/set through the event.key.

The task steps are called with the event.body by default. If a task step needs to read or set other event elements (key, path, time, …) you should set the task full_event argument to True.

Task steps support optional input_path and result_path attributes that allow controlling which portion of the event is sent as input to the step, and where to update the returned result.

For example, for an event body {"req": {"body": "x"}}, input_path="req.body" and result_path="resp", the step gets "x" as the input. The output after the step is {"req": {"body": "x"}, "resp": <step output>}. Note that input_path and result_path do not work together with full_event=True.
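
A hedged sketch of setting these attributes on task steps (the step names and handlers below are placeholders):

# receive only event.body["req"]["body"] and write the result back under "resp"
step = graph.to(name="process", handler="process_body",
                input_path="req.body", result_path="resp")

# a step that needs the whole event object (path, key, headers, ...)
step.to(name="audit", handler="audit_event", full_event=True)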

The context object#

The step classes are initialized with a context object (when they have context in their __init__ args). The context is used to pass data and for interfacing with system services. The context object has the following attributes and methods.

Attributes:

  • logger: Central logger (Nuclio logger when running in Nuclio).

  • verbose: True if in verbose/debug mode.

  • root: The graph object.

  • current_function: When running in a distributed graph, the current child function name.

Methods:

  • get_param(key, default=None): Get the graph parameter by key. Parameters are set at the serving function (e.g. function.spec.parameters = {"param1": "x"}).

  • get_secret(key): Get the value of a project/user secret.

  • get_store_resource(uri, use_cache=True): Get the mlrun store object (data item, artifact, model, feature set, feature vector).

  • get_remote_endpoint(name, external=False): Return the remote nuclio/serving function http(s) endpoint given its [project/]function-name[:tag].

  • Response(headers=None, body=None, content_type=None, status_code=200): Create a nuclio response object, for returning detailed http responses.

Example, using the context:

if self.context.verbose:
    self.context.logger.info("my message", some_arg="text")
x = self.context.get_param("x", 0)

Topology#

Router#

Once you have a serving function, you need to choose the graph topology. The default is router topology. With the router topology you can specify different machine learning models. Each model has a logical name. This name is used to route to the correct model when calling the serving function.

from sklearn.datasets import load_iris

# set the topology/router
graph = fn.set_topology("router")

# Add the model
fn.add_model(
    "model1",
    class_name="ClassifierModel",
    model_path="https://s3.wasabisys.com/iguazio/models/iris/model.pkl",
)

# Add additional models
# fn.add_model("model2", class_name="ClassifierModel", model_path="<path2>")

# create and use the graph simulator
server = fn.to_mock_server()
x = load_iris()["data"].tolist()
result = server.test("/v2/models/model1/infer", {"inputs": x})

print(result)
> 2021-11-02 04:18:36,925 [info] model model1 was loaded
> 2021-11-02 04:18:36,926 [info] Initializing endpoint records
> 2021-11-02 04:18:36,965 [info] Loaded ['model1']
{'id': '6bd11e864805484ea888f58e478d1f91', 'model_name': 'model1', 'outputs': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]}
Flow#

Using the flow topology, you can specify tasks, which typically manipulate the data. The most common scenario is pre-processing of data prior to the model execution.

Note

Once the topology is set, you cannot change an existing function topology.

In this topology, you build and connect the graph (DAG) by adding steps using the step.to() method, or by using the graph.add_step() method.

The step.to() is typically used to chain steps together. graph.add_step can add steps anywhere on the graph and has before and after parameters to specify the location of the step.

fn2 = mlrun.code_to_function(
    "serving_example_flow", kind="serving", image="mlrun/mlrun"
)

graph2 = fn2.set_topology("flow")

graph2_enrich = graph2.to("storey.Extend", name="enrich", _fn='({"tag": "something"})')

# add an Ensemble router with two child models (routes)
router = graph2.add_step(mlrun.serving.ModelRouter(), name="router", after="enrich")
router.add_route(
    "m1",
    class_name="ClassifierModel",
    model_path="https://s3.wasabisys.com/iguazio/models/iris/model.pkl",
)
router.respond()

# Add additional models
# router.add_route("m2", class_name="ClassifierModel", model_path=path2)

# plot the graph (using Graphviz)
graph2.plot(rankdir="LR")
fn2_server = fn2.to_mock_server()

result = fn2_server.test("/v2/models/m1/infer", {"inputs": x})

print(result)
> 2021-11-02 04:18:42,142 [info] model m1 was loaded
> 2021-11-02 04:18:42,142 [info] Initializing endpoint records
> 2021-11-02 04:18:42,183 [info] Loaded ['m1']
{'id': 'f713fd7eedeb431eba101b13c53a15b5'}

Building distributed graphs#

Graphs can be hosted by a single function (using zero to n containers), or span multiple functions, where each function can have its own container image and resources (replicas, GPUs/CPUs, volumes, etc.). A multi-function graph has a root function, which is where you configure triggers (http, incoming stream, cron, …), and optional downstream child functions.

You can specify the function attribute in task or router steps. This indicates where this step should run. When the function attribute is not specified it runs on the root function. function="*" means the step can run in any of the child functions.

Steps on different functions should be connected using a queue step (a stream).

Adding a child function:

fn.add_child_function(
    "enrich",
    "./entity_extraction.ipynb",
    image="mlrun/mlrun",
    requirements=["storey", "sklearn"],
)

See a full example with child functions.

A distributed graph looks like this:

distributed graph

Model serving graph#


Serving Functions#

To start using a serving graph, you first need a serving function. A serving function contains the serving class code to run the model and all the code necessary to run the tasks. MLRun comes with a wide library of tasks. If you use just those, you don't have to add any special code to the serving function, you just have to provide the code that runs the model. For more information about serving classes see Build your own model serving class.

For example, the following code is a basic model serving class:

# mlrun: start-code
from cloudpickle import load
from typing import List
import numpy as np

import mlrun


class ClassifierModel(mlrun.serving.V2ModelServer):
    def load(self):
        """load and initialize the model and/or other elements"""
        model_file, extra_data = self.get_model(".pkl")
        self.model = load(open(model_file, "rb"))

    def predict(self, body: dict) -> List:
        """Generate model predictions from sample."""
        feats = np.asarray(body["inputs"])
        result: np.ndarray = self.model.predict(feats)
        return result.tolist()
# mlrun: end-code

To obtain the serving function, use the code_to_function and specify kind to be serving.

fn = mlrun.code_to_function("serving_example", kind="serving", image="mlrun/mlrun")

Topology#

Router#

Once you have a serving function, you need to choose the graph topology. The default is router topology. With the router topology you can specify different machine learning models. Each model has a logical name. This name is used to route to the correct model when calling the serving function.

from sklearn.datasets import load_iris

# set the topology/router
graph = fn.set_topology("router")

# Add the model
fn.add_model(
    "model1",
    class_name="ClassifierModel",
    model_path="https://s3.wasabisys.com/iguazio/models/iris/model.pkl",
)

# Add additional models
# fn.add_model("model2", class_name="ClassifierModel", model_path="<path2>")

# create and use the graph simulator
server = fn.to_mock_server()
x = load_iris()["data"].tolist()
result = server.test("/v2/models/model1/infer", {"inputs": x})

print(result)
> 2021-11-02 04:18:36,925 [info] model model1 was loaded
> 2021-11-02 04:18:36,926 [info] Initializing endpoint records
> 2021-11-02 04:18:36,965 [info] Loaded ['model1']
{'id': '6bd11e864805484ea888f58e478d1f91', 'model_name': 'model1', 'outputs': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]}
Flow#

You can use the flow topology to specify tasks, which typically manipulate the data. The most common scenario is pre-processing of data prior to the model execution.

Note

Once the topology is set, you cannot change an existing function topology.

In this topology, you build and connect the graph (DAG) by adding steps using the step.to() method, or by using the graph.add_step() method.

The step.to() is typically used to chain steps together. graph.add_step can add steps anywhere on the graph and has before and after parameters to specify the location of the step.

fn2 = mlrun.code_to_function(
    "serving_example_flow", kind="serving", image="mlrun/mlrun"
)

graph2 = fn2.set_topology("flow")

graph2_enrich = graph2.to("storey.Extend", name="enrich", _fn='({"tag": "something"})')

# add an Ensemble router with two child models (routes)
router = graph2.add_step(mlrun.serving.ModelRouter(), name="router", after="enrich")
router.add_route(
    "m1",
    class_name="ClassifierModel",
    model_path="https://s3.wasabisys.com/iguazio/models/iris/model.pkl",
)
router.respond()

# add an error handling step, run only when/if the "pre-process" step fails
graph2.to(name="pre-process", handler="raising_step").error_handler(
    name="catcher", handler="handle_error", full_event=True
)

# Add additional models
# router.add_route("m2", class_name="ClassifierModel", model_path=path2)

# plot the graph (using Graphviz)
graph2.plot(rankdir="LR")
fn2_server = fn2.to_mock_server()

result = fn2_server.test("/v2/models/m1/infer", {"inputs": x})

print(result)
> 2021-11-02 04:18:42,142 [info] model m1 was loaded
> 2021-11-02 04:18:42,142 [info] Initializing endpoint records
> 2021-11-02 04:18:42,183 [info] Loaded ['m1']
{'id': 'f713fd7eedeb431eba101b13c53a15b5'}

Remote execution#

You can chain functions together with remote execution. This allows you to:

  • Call existing functions from the graph and reuse them from other graphs.

  • Scale up and down different components individually.

Calling a remote function can be done either over HTTP or via a queue (streaming).

HTTP#

Calling a function over HTTP uses the special $remote class. First deploy the remote function:

remote_func_name = "serving-example-flow"
project_name = "graph-basic-concepts"
fn_remote = mlrun.code_to_function(
    remote_func_name, project=project_name, kind="serving", image="mlrun/mlrun"
)

fn_remote.add_model(
    "model1",
    class_name="ClassifierModel",
    model_path="https://s3.wasabisys.com/iguazio/models/iris/model.pkl",
)

remote_addr = fn_remote.deploy()
> 2022-03-17 08:20:40,674 [info] Starting remote function deploy
2022-03-17 08:20:40  (info) Deploying function
2022-03-17 08:20:40  (info) Building
2022-03-17 08:20:40  (info) Staging files and preparing base images
2022-03-17 08:20:40  (info) Building processor image
2022-03-17 08:20:42  (info) Build complete
2022-03-17 08:20:47  (info) Function deploy complete
> 2022-03-17 08:20:48,289 [info] successfully deployed function: {'internal_invocation_urls': ['nuclio-graph-basic-concepts-serving-example-flow.default-tenant.svc.cluster.local:8080'], 'external_invocation_urls': ['graph-basic-concepts-serving-example-flow-graph-basic-concepts.default-tenant.app.maor-gcp2.iguazio-cd0.com/']}

Create a new function with a graph and call the remote function above:

fn_preprocess = mlrun.new_function("preprocess", kind="serving")
graph_preprocessing = fn_preprocess.set_topology("flow")

graph_preprocessing.to("storey.Extend", name="enrich", _fn='({"tag": "something"})').to(
    "$remote", "remote_func", url=f"{remote_addr}v2/models/model1/infer", method="put"
).respond()

graph_preprocessing.plot(rankdir="LR")
fn3_server = fn_preprocess.to_mock_server()
my_data = """{"inputs":[[5.1, 3.5, 1.4, 0.2],[7.7, 3.8, 6.7, 2.2]]}"""
result = fn3_server.test("/v2/models/my_model/infer", body=my_data)
print(result)
> 2022-03-17 08:20:48,374 [warning] run command, file or code were not specified
{'id': '3a1dd36c-e7de-45af-a0c4-72e3163ba92a', 'model_name': 'model1', 'outputs': [0, 2]}
Queue (streaming)#

You can use queues to send events from one part of the graph to another and to decouple the processing of those parts. Queues are better suited to deal with bursts of events, since all the events are stored in the queue until they are processed.

V3IO stream example#

The example below uses a V3IO stream, which is a fast real-time implementation of a stream that allows processing of events at very low latency.

%%writefile echo.py
def echo_handler(x):
    print(x)
    return x
Overwriting echo.py

Configure the streams

import os

streams_prefix = (
    f"v3io:///users/{os.getenv('V3IO_USERNAME')}/examples/graph-basic-concepts"
)

input_stream = streams_prefix + "/in-stream"
out_stream = streams_prefix + "/out-stream"
err_stream = streams_prefix + "/err-stream"

Alternatively, use Kafka to configure the streams:

kafka_prefix = f"kafka://{broker}/"
internal_topic = kafka_prefix + "in-topic"
out_topic = kafka_prefix + "out-topic"
err_topic = kafka_prefix + "err-topic"

Create the graph. In the to method the class name is one of >> or $queue to specify that this is a queue. To configure a consumer group for the step, include the group in the to method.

fn_preprocess2 = mlrun.new_function("preprocess", kind="serving")
fn_preprocess2.add_child_function("echo_func", "./echo.py", "mlrun/mlrun")

graph_preprocess2 = fn_preprocess2.set_topology("flow")

graph_preprocess2.to("storey.Extend", name="enrich", _fn='({"tag": "something"})').to(
    ">>", "input_stream", path=input_stream, group="mygroup"
).to(name="echo", handler="echo_handler", function="echo_func").to(
    ">>", "output_stream", path=out_stream, sharding_func="partition"
)

graph_preprocess2.plot(rankdir="LR")
from echo import *

fn4_server = fn_preprocess2.to_mock_server(current_function="*")

my_data = """{"inputs": [[5.1, 3.5, 1.4, 0.2], [7.7, 3.8, 6.7, 2.2]], "partition": 0}"""

result = fn4_server.test("/v2/models/my_model/infer", body=my_data)

print(result)
> 2022-03-17 08:20:55,182 [warning] run command, file or code were not specified
{'id': 'a6efe8217b024ec7a7e02cf0b7850b91'}
{'inputs': [[5.1, 3.5, 1.4, 0.2], [7.7, 3.8, 6.7, 2.2]], 'tag': 'something'}
Kafka stream example#
%%writefile echo.py
def echo_handler(x):
    print(x)
    return x
Overwriting echo.py

Configure the streams

import os

input_topic = "in-topic"
out_topic = "out-topic"
err_topic = "err-topic"

# replace this
brokers = "<broker IP>"

Create the graph. In the to method the class name is one of >> or $queue to specify that this is a queue. To configure a consumer group for the step, include the group in the to method.

import mlrun

fn_preprocess2 = mlrun.new_function("preprocess", kind="serving")
fn_preprocess2.add_child_function("echo_func", "./echo.py", "mlrun/mlrun")

graph_preprocess2 = fn_preprocess2.set_topology("flow")

graph_preprocess2.to("storey.Extend", name="enrich", _fn='({"tag": "something"})').to(
    ">>",
    "input_stream",
    path=input_topic,
    group="mygroup",
    kafka_bootstrap_servers=brokers,
).to(name="echo", handler="echo_handler", function="echo_func").to(
    ">>", "output_stream", path=out_topic, kafka_bootstrap_servers=brokers
)

graph_preprocess2.plot(rankdir="LR")

from echo import *

fn4_server = fn_preprocess2.to_mock_server(current_function="*")

fn4_server.set_error_stream(f"kafka://{brokers}/{err_topic}")

my_data = """{"inputs":[[5.1, 3.5, 1.4, 0.2],[7.7, 3.8, 6.7, 2.2]]}"""

result = fn4_server.test("/v2/models/my_model/infer", body=my_data)

print(result)

Examples of graph functionality#

NLP processing pipeline with real-time streaming#

In some cases it's useful to split your processing to multiple functions and use streaming protocols to connect those functions.

See the full notebook example, where the data processing is in the first function/container and the NLP processing is in the second function, which contains the GPU.

Currently queues support Iguazio v3io and Kafka streams.

Graph that splits and rejoins#

You can define a graph that splits into two parallel steps, and the output of both steps join back together.

In this basic example, all input goes into both stepB and stepC, and then both stepB and stepC forward the input to stepD. This means that a dataset of 5 rows generates an output of 10 rows (barring any filtering or other processing that would change the number of rows).

Note

Use this configuration to join the graph branches and not to join the events into a single large one.

Example:

graph.to("stepB")
graph.to("stepC")
graph.add_step(name="stepD", after=["stepB", "stepC"])


graph = fn.set_topology("flow", exist_ok=True)
dbl = graph.to(name="double", handler="double")
dbl.to(name="add3", class_name="Adder", add=3)
dbl.to(name="add2", class_name="Adder", add=2)
graph.add_step("Gather").after("add2", "add3")

Graphs that split and rejoin can also be used for these types of scenarios:

  • Steps B and C are filter steps that complement each other. For example, B passes events where key < X, and C passes events where key >= X. The resulting DF contains the exact events ingested, since each event was handled once on one of the branches. (A sketch of this scenario follows the list.)

  • Steps B and C modify the content of the event in different ways. B adds a column col1 with value X, and C adds a column col2 with value X. The resulting DF contains both col1 and col2. Each key is represented twice: once with col1 == X, col2 == null and once with col1 == null, col2 == X.
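
A hedged sketch of the first scenario above, using the split/rejoin pattern with the built-in storey.Filter step (the key field and the threshold X=10 are illustrative):

graph = fn.set_topology("flow", exist_ok=True)
graph.to("storey.Filter", "stepB", _fn="(event['key'] < 10)")
graph.to("storey.Filter", "stepC", _fn="(event['key'] >= 10)")
graph.add_step(name="stepD", after=["stepB", "stepC"])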

Writing custom steps#

The Graph executes built-in task classes, or task classes and functions that you implement. The task parameters include the following:

  • class_name (str): the relative or absolute class name.

  • handler (str): the function handler (if class_name is not specified it is the function handler).

  • **class_args: a set of class __init__ arguments.

For example, see the following simple echo class:

import mlrun
# mlrun: start
# echo class, custom class example
class Echo:
    def __init__(self, context, name=None, **kw):
        self.context = context
        self.name = name
        self.kw = kw

    def do(self, x):
        print("Echo:", self.name, x)
        return x
# mlrun: end

Test the graph: first convert the code to function, and then add the step to the graph:

fn_echo = mlrun.code_to_function("echo_function", kind="serving", image="mlrun/mlrun")

graph_echo = fn_echo.set_topology("flow")

graph_echo.to(class_name="Echo", name="pre-process", some_arg="abc")

graph_echo.plot(rankdir="LR")

Create a mock server to test this locally:

echo_server = fn_echo.to_mock_server(current_function="*")

result = echo_server.test("", {"inputs": 123})

print(result)
{'id': '97397ea412334afdb5e4cb7d7c2e6dd3'}
Echo: pre-process {'inputs': 123}

For more information, see the Advanced model serving graph notebook example

You can use any Python function by specifying the handler name (e.g. handler=json.dumps). The function is triggered with the event.body as the first argument, and its result is passed to the next step.

Alternatively, you can use classes that can also store some step/configuration and separate the one time init logic from the per event logic. The classes are initialized with the class_args. If the class init args contain context or name, they are initialized with the graph context and the step name.

By default, the class_name and handler specify a class/function name in the globals() (i.e. this module). Alternatively, those can be full paths to the class (module.submodule.class), e.g. storey.WriteToParquet. You can also pass the module as an argument to functions such as function.to_mock_server(namespace=module). In this case the class or handler names are also searched in the provided module.

When using classes, the class event handler is invoked on every event with the event.body. If the Task step full_event parameter is set to True, the handler is invoked with (and returns) the full event object. If the class event handler is not specified, the class do() method is invoked.

If you need to implement async behavior, then subclass storey.MapClass.

Create a single step#

When creating a serving function with a step that is part of a class, the graph context is created in the class init, and you can use this context to get the function's parameters.

When creating a single step (no class), the context is not created, and therefore the get_param does not work. The following example illustrates how to create the context and then use the parameters.

%%writefile serving-handler-func.py
import pandas as pd
import mlrun
import os
import json
def test(event):
    server_context = mlrun.serving.GraphServer().from_dict(json.loads(os.environ["SERVING_SPEC_ENV"]))
    context = mlrun.serving.GraphContext(server=server_context)
    param = context.get_param("Test")
    return param
Overwriting serving-handler-func.py
serving_func_handler = project.set_function(
    name="serving-handler-func",
    func="serving-handler-func.py",
    image="mlrun/mlrun",
    kind="serving",
)
serving_func_handler.spec.parameters = {"Test": "test"}
graph = serving_func_handler.set_topology("flow")

graph.to(name="test", handler="test").respond()
serving_func_deploy = project.deploy_function("serving-handler-func")
> 2023-05-09 14:24:55,287 [info] Starting remote function deploy
2023-05-09 14:24:56  (info) Deploying function
2023-05-09 14:24:56  (info) Building
2023-05-09 14:24:56  (info) Staging files and preparing base images
2023-05-09 14:24:57  (info) Building processor image
2023-05-09 14:26:02  (info) Build complete
> 2023-05-09 14:26:08,232 [info] successfully deployed function: {'internal_invocation_urls': ['nuclio-serving-context-shapira-serving-handler-func.default-tenant.svc.cluster.local:8080'], 'external_invocation_urls': ['serving-context-shapira-serving-handler-func-serving-c-6v7nqbg6.default-tenant.app.cust-cs-il-3-5-2.iguazio-cd2.com/']}
serving_func_deploy.function.invoke(
    "/",
)

Built-in steps#

MLRun provides many built-in steps that you can use when building your graph. All steps are supported by the storey engine. Support by any other engines is included in the step description, as relevant.

Click on the step names in the following sections to see the full usage.

See also Data transformations.

Base Operators#

storey.transformations.Batch: Batches events. This step emits a batch every max_events events, or when timeout seconds have passed since the first event in the batch was received.

storey.transformations.Choice: Redirects each input element to one of multiple downstream steps.

storey.Extend: Adds fields to each incoming event.

storey.transformations.Filter: Filters events based on a user-provided function.

storey.transformations.FlatMap: Maps, or transforms, each incoming event into any number of events.

storey.steps.Flatten: Flattens the event; equivalent to FlatMap(lambda x: x).

storey.transformations.ForEach: Applies the given function to each event in the stream, and passes the original event downstream.

storey.transformations.MapClass: Similar to Map, but instead of a function argument, this class should be extended and its do() method overridden.

storey.transformations.MapWithState: Maps, or transforms, incoming events using a stateful user-provided function and an initial state, which can be a database table.

storey.transformations.Partition: Partitions events by calling a predicate function on each event. Each processed event results in a Partitioned namedtuple of (left=Optional[Event], right=Optional[Event]).

storey.Reduce: Reduces incoming events into a single value that is returned upon the successful termination of the flow.

storey.transformations.SampleWindow: Emits a single event in a window of window_size events, in accordance with emit_period and emit_before_termination.
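
For example, a minimal sketch of extending MapClass and using it as a step (the MultiplyByTwo class is illustrative):

import storey

class MultiplyByTwo(storey.MapClass):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)

    def do(self, event):
        # transform the event body and pass the result downstream
        return event * 2

# add the class as a graph step, referenced by name
graph.to(class_name="MultiplyByTwo", name="double")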

External IO and data enrichment#

BatchHttpRequests: A class for calling remote endpoints in parallel.

mlrun.datastore.DataItem: Data input/output class abstracting access to various local/remote data sources.

storey.transformations.JoinWithTable: Joins each event with data from the given table.

JoinWithV3IOTable: Joins each event with a V3IO table. Used for event augmentation.

QueryByKey: Similar to AggregateByKey, but this step is for serving only and does not aggregate the event.

RemoteStep: Class for calling remote endpoints.

storey.transformations.SendToHttp: Joins each event with data from any HTTP source. Used for event augmentation.
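
For example, a hedged sketch of calling an external HTTP endpoint from a graph with RemoteStep (the endpoint URL and parameter values are illustrative; see the RemoteStep reference for the full signature):

graph.to(
    class_name="mlrun.serving.remote.RemoteStep",
    name="enrich_from_api",
    url="https://my-api.example.com/enrich",  # hypothetical endpoint
    method="POST",
    max_in_flight=2,
    timeout=100,
    retries=10,
)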

Models#

mlrun.frameworks.onnx.ONNXModelServer: A model serving class for serving ONNX models. A sub-class of the V2ModelServer class.

mlrun.frameworks.pytorch.PyTorchModelServer: A model serving class for serving PyTorch models. A sub-class of the V2ModelServer class.

mlrun.frameworks.sklearn.SklearnModelServer: A model serving class for serving Sklearn models. A sub-class of the V2ModelServer class.

mlrun.frameworks.tf_keras.TFKerasModelServer: A model serving class for serving TFKeras models. A sub-class of the V2ModelServer class.

mlrun.frameworks.xgboost.XGBModelServer: A model serving class for serving XGBoost models. A sub-class of the V2ModelServer class.
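
For example, a hedged sketch of serving a pre-trained scikit-learn model with the built-in SklearnModelServer class (the function name and model URI are illustrative):

serving_fn = mlrun.new_function("model-server", kind="serving", image="mlrun/mlrun")
serving_fn.add_model(
    "my-model",
    class_name="mlrun.frameworks.sklearn.SklearnModelServer",
    model_path="store://models/my-project/my-model:latest",  # hypothetical model URI
)
# test locally with a mock server before deploying
server = serving_fn.to_mock_server()
server.test("/v2/models/my-model/infer", body={"inputs": [[5.1, 3.5, 1.4, 0.2]]})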

Routers#

mlrun.serving.EnrichmentModelRouter: Auto-enriches the request with data from the feature store. The router input accepts a list of inference requests (each request can be a dict or a list of incoming features/keys). It enriches the request with data from the specified feature vector (feature_vector_uri).

mlrun.serving.EnrichmentVotingEnsemble: Auto-enriches the request with data from the feature store, then applies the voting ensemble over the child models. The router input accepts a list of inference requests (each request can be a dict or a list of incoming features/keys). It enriches the request with data from the specified feature vector (feature_vector_uri).

mlrun.serving.ModelRouter: Basic model router that routes each request to a different model based on the model path.

mlrun.serving.VotingEnsemble: An ensemble machine learning model that combines the predictions of several models.
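
For example, a hedged sketch of a router topology that uses a voting ensemble over two routes (ClassifierModel refers to a custom V2ModelServer sub-class such as the one in the advanced example below, and the model paths are illustrative):

fn = mlrun.new_function("ensemble", kind="serving", image="mlrun/mlrun")
fn.set_topology("router", "mlrun.serving.VotingEnsemble")
fn.add_model("m1", class_name="ClassifierModel", model_path=path1)
fn.add_model("m2", class_name="ClassifierModel", model_path=path2)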

Other#

mlrun.feature_store.FeaturesetValidator: Validates feature values according to the feature set validation policy. Also supported by the pandas engine.

ReduceToDataFrame: Builds a pandas DataFrame from events and returns that DataFrame on flow termination.
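
For example, a hedged sketch of running the validator as an ingestion graph step (the feature set and entity are illustrative, and the validation policy itself is assumed to be defined on the features):

import mlrun.feature_store as fstore

stocks_set = fstore.FeatureSet("stocks", entities=[fstore.Entity("ticker")])
# run the feature set's validation policy as part of the ingestion graph
stocks_set.graph.to(class_name="mlrun.feature_store.FeaturesetValidator", name="validate")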

Demos and tutorials#

Read these tutorials to get an even better understanding of serving graphs.

Distributed (multi-function) pipeline example#

This example demonstrates how to run a pipeline that consists of multiple serverless functions (connected using streams).

In the pipeline example, the request contains a URL of a file. The first function loads the content of the file, breaks it into paragraphs (using the FlatMap class), and pushes the results to a queue/stream. The second function picks up the paragraphs, runs the NLP flow to extract the entities, and pushes the results to the output stream.

Set the stream URLs for the internal queue, the final output, and the error/exceptions stream:

streams_prefix = "v3io:///users/admin/"
internal_stream = streams_prefix + "in-stream"
out_stream = streams_prefix + "out-stream"
err_stream = streams_prefix + "err-stream"

Alternatively, using Kafka:

kafka_prefix = f"kafka://{broker}/"
internal_topic = kafka_prefix + "in-topic"
out_topic = kafka_prefix + "out-topic"
err_topic = kafka_prefix + "err-topic"

In either case, continue with:

# set up the environment
import mlrun

project = mlrun.get_or_create_project("pipe")
> 2021-05-03 14:28:39,987 [warning] Failed resolving version info. Ignoring and using defaults
> 2021-05-03 14:28:43,801 [warning] Unable to parse server or client version. Assuming compatible: {'server_version': '0.6.3-rc4', 'client_version': 'unstable'}
('pipe', '/v3io/projects/{{run.project}}/artifacts')
# uncomment to install spacy requirements locally
# !pip install spacy
# !python -m spacy download en_core_web_sm


Create the pipeline#

The pipeline consists of two functions: data-prep and NLP. Each one has different package dependencies.

Create a file with data-prep graph steps:

Note

The model, version and operation can also be specified in the message body to support streaming protocols (e.g. Kafka).

%%writefile data_prep.py
import mlrun
import json

# load struct from a json file (event points to the url)
def load_url(event):
    url = event["url"]
    data = mlrun.get_object(url).decode("utf-8")
    return {"url": url, "doc": json.loads(data)}

def to_paragraphs(event):
    paragraphs = []
    url = event["url"]
    for i, paragraph in enumerate(event["doc"]):
        paragraphs.append(
            {"url": url, "paragraph_id": i, "paragraph": paragraph}
        )
    return paragraphs
Overwriting data_prep.py

Create a file with NLP graph steps (use spacy):

%%writefile nlp.py
import json
import spacy

def myprint(x):
    print(x)
    return x

class ApplyNLP:
    def __init__(self, context=None, spacy_dict="en_core_web_sm"):

        self.nlp = spacy.load(spacy_dict)

    def do(self, paragraph: dict):
        tokenized_paragraphs = []
        if isinstance(paragraph, (str, bytes)):
            paragraph = json.loads(paragraph)
        tokenized = {
            "url": paragraph["url"],
            "paragraph_id": paragraph["paragraph_id"],
            "tokens": self.nlp(paragraph["paragraph"]),
        }
        tokenized_paragraphs.append(tokenized)

        return tokenized_paragraphs

def extract_entities(tokens):
    paragraph_entities = []
    for token in tokens:
        entities = token["tokens"].ents
        for entity in entities:
            paragraph_entities.append(
                {
                    "url": token["url"],
                    "paragraph_id": token["paragraph_id"],
                    "entity": entity.ents,
                }
            )
    return paragraph_entities

def enrich_entities(entities):
    enriched_entities = []
    for entity in entities:
        enriched_entities.append(
            {
                "url": entity["url"],
                "paragraph_id": entity["paragraph_id"],
                "entity_text": entity["entity"][0].text,
                "entity_start_char": entity["entity"][0].start_char,
                "entity_end_char": entity["entity"][0].end_char,
                "entity_label": entity["entity"][0].label_,
            }
        )
    return enriched_entities
Overwriting nlp.py

Build and show the graph:

Create the master function ("multi-func") with the data_prep.py source and an async graph topology. Add a pipeline of steps made of custom Python handlers, classes, and built-in classes (like storey.FlatMap).

The pipeline runs across two functions, which are connected by a queue/stream (q1). Use the function= parameter to specify which function runs a given step. End the flow by writing to the output stream.

# define a new real-time serving function (from code) with an async graph
fn = mlrun.code_to_function(
    "multi-func", filename="./data_prep.py", kind="serving", image="mlrun/mlrun"
)
graph = fn.set_topology("flow", engine="async")

# define the graph steps (DAG)
graph.to(name="load_url", handler="load_url").to(
    name="to_paragraphs", handler="to_paragraphs"
).to("storey.FlatMap", "flatten_paragraphs", _fn="(event)").to(
    ">>", "q1", path=internal_stream
).to(name="nlp", class_name="ApplyNLP", function="enrich").to(
    name="extract_entities", handler="extract_entities", function="enrich"
).to(name="enrich_entities", handler="enrich_entities", function="enrich").to(
    "storey.FlatMap", "flatten_entities", _fn="(event)", function="enrich"
).to(name="printer", handler="myprint", function="enrich").to(
    ">>", "output_stream", path=out_stream
)
<mlrun.serving.states.QueueState at 0x7f9e618f9910>
# specify the "enrich" child function, add extra package requirements
child = fn.add_child_function("enrich", "./nlp.py", "mlrun/mlrun")
child.spec.build.commands = [
    "python -m pip install spacy",
    "python -m spacy download en_core_web_sm",
]
graph.plot(rankdir="LR")
Test the pipeline locally#

Create an input file:

%%writefile in.json
["Born and raised in Queens, New York City, Trump attended Fordham University for two years and received a bachelor's degree in economics from the Wharton School of the University of Pennsylvania. He became president of his father Fred Trump's real estate business in 1971, renamed it The Trump Organization, and expanded its operations to building or renovating skyscrapers, hotels, casinos, and golf courses. Trump later started various side ventures, mostly by licensing his name. Trump and his businesses have been involved in more than 4,000 state and federal legal actions, including six bankruptcies. He owned the Miss Universe brand of beauty pageants from 1996 to 2015, and produced and hosted the reality television series The Apprentice from 2004 to 2015.", 
 "Trump's political positions have been described as populist, protectionist, isolationist, and nationalist. He entered the 2016 presidential race as a Republican and was elected in a surprise electoral college victory over Democratic nominee Hillary Clinton while losing the popular vote.[a] He became the oldest first-term U.S. president[b] and the first without prior military or government service. His election and policies have sparked numerous protests. Trump has made many false or misleading statements during his campaign and presidency. The statements have been documented by fact-checkers, and the media have widely described the phenomenon as unprecedented in American politics. Many of his comments and actions have been characterized as racially charged or racist."]
Overwriting in.json

Create a mock server (simulator) and test:

# toggle verbosity if needed
fn.verbose = False
# create a mock server (simulator), specify to simulate all the functions in the pipeline ("*")
server = fn.to_mock_server(current_function="*")
# push a sample request into the pipeline and see the results print out (by the printer step)
resp = server.test(body={"url": "in.json"})
{'url': 'in.json', 'paragraph_id': 0, 'entity_text': 'Queens', 'entity_start_char': 19, 'entity_end_char': 25, 'entity_label': 'GPE'}
{'url': 'in.json', 'paragraph_id': 0, 'entity_text': 'New York City', 'entity_start_char': 27, 'entity_end_char': 40, 'entity_label': 'GPE'}
{'url': 'in.json', 'paragraph_id': 0, 'entity_text': 'Trump', 'entity_start_char': 42, 'entity_end_char': 47, 'entity_label': 'ORG'}
{'url': 'in.json', 'paragraph_id': 0, 'entity_text': 'Fordham University', 'entity_start_char': 57, 'entity_end_char': 75, 'entity_label': 'ORG'}
{'url': 'in.json', 'paragraph_id': 0, 'entity_text': 'two years', 'entity_start_char': 80, 'entity_end_char': 89, 'entity_label': 'DATE'}
{'url': 'in.json', 'paragraph_id': 0, 'entity_text': 'the Wharton School of the University of Pennsylvania', 'entity_start_char': 141, 'entity_end_char': 193, 'entity_label': 'ORG'}
{'url': 'in.json', 'paragraph_id': 0, 'entity_text': 'Fred Trump', 'entity_start_char': 229, 'entity_end_char': 239, 'entity_label': 'PERSON'}
{'url': 'in.json', 'paragraph_id': 0, 'entity_text': '1971', 'entity_start_char': 266, 'entity_end_char': 270, 'entity_label': 'DATE'}
{'url': 'in.json', 'paragraph_id': 0, 'entity_text': 'The Trump Organization', 'entity_start_char': 283, 'entity_end_char': 305, 'entity_label': 'ORG'}
{'url': 'in.json', 'paragraph_id': 0, 'entity_text': 'more than 4,000', 'entity_start_char': 529, 'entity_end_char': 544, 'entity_label': 'CARDINAL'}
{'url': 'in.json', 'paragraph_id': 0, 'entity_text': 'six', 'entity_start_char': 588, 'entity_end_char': 591, 'entity_label': 'CARDINAL'}
{'url': 'in.json', 'paragraph_id': 0, 'entity_text': 'Universe', 'entity_start_char': 624, 'entity_end_char': 632, 'entity_label': 'PERSON'}
{'url': 'in.json', 'paragraph_id': 0, 'entity_text': '1996 to 2015', 'entity_start_char': 663, 'entity_end_char': 675, 'entity_label': 'DATE'}
{'url': 'in.json', 'paragraph_id': 0, 'entity_text': 'The Apprentice', 'entity_start_char': 731, 'entity_end_char': 745, 'entity_label': 'WORK_OF_ART'}
{'url': 'in.json', 'paragraph_id': 0, 'entity_text': '2004 to 2015', 'entity_start_char': 751, 'entity_end_char': 763, 'entity_label': 'DATE'}
{'url': 'in.json', 'paragraph_id': 1, 'entity_text': 'Trump', 'entity_start_char': 0, 'entity_end_char': 5, 'entity_label': 'ORG'}
{'url': 'in.json', 'paragraph_id': 1, 'entity_text': '2016', 'entity_start_char': 122, 'entity_end_char': 126, 'entity_label': 'DATE'}
{'url': 'in.json', 'paragraph_id': 1, 'entity_text': 'Republican', 'entity_start_char': 150, 'entity_end_char': 160, 'entity_label': 'NORP'}
{'url': 'in.json', 'paragraph_id': 1, 'entity_text': 'Democratic', 'entity_start_char': 222, 'entity_end_char': 232, 'entity_label': 'NORP'}
{'url': 'in.json', 'paragraph_id': 1, 'entity_text': 'Hillary Clinton', 'entity_start_char': 241, 'entity_end_char': 256, 'entity_label': 'PERSON'}
{'url': 'in.json', 'paragraph_id': 1, 'entity_text': 'first', 'entity_start_char': 312, 'entity_end_char': 317, 'entity_label': 'ORDINAL'}
{'url': 'in.json', 'paragraph_id': 1, 'entity_text': 'U.S.', 'entity_start_char': 323, 'entity_end_char': 327, 'entity_label': 'GPE'}
{'url': 'in.json', 'paragraph_id': 1, 'entity_text': 'first', 'entity_start_char': 349, 'entity_end_char': 354, 'entity_label': 'ORDINAL'}
{'url': 'in.json', 'paragraph_id': 1, 'entity_text': 'American', 'entity_start_char': 671, 'entity_end_char': 679, 'entity_label': 'NORP'}
server.wait_for_completion()
Deploy to the cluster#
# add credentials to the data/streams
fn.apply(mlrun.platforms.v3io_cred())
child.apply(mlrun.platforms.v3io_cred())

# specify the error stream (to store exceptions from the functions)
fn.spec.error_stream = err_stream

# deploy as a set of serverless functions
fn.deploy()
> 2021-05-03 14:33:55,400 [info] deploy child function enrich ...
> 2021-05-03 14:33:55,427 [info] Starting remote function deploy
2021-05-03 14:33:55  (info) Deploying function
2021-05-03 14:33:55  (info) Building
2021-05-03 14:33:55  (info) Staging files and preparing base images
2021-05-03 14:33:55  (info) Building processor image
2021-05-03 14:34:02  (info) Build complete
2021-05-03 14:34:08  (info) Function deploy complete
> 2021-05-03 14:34:09,232 [info] function deployed, address=default-tenant.app.yh30.iguazio-c0.com:32356
> 2021-05-03 14:34:09,233 [info] deploy root function multi-func ...
> 2021-05-03 14:34:09,234 [info] Starting remote function deploy
2021-05-03 14:34:09  (info) Deploying function
2021-05-03 14:34:09  (info) Building
2021-05-03 14:34:09  (info) Staging files and preparing base images
2021-05-03 14:34:09  (info) Building processor image
2021-05-03 14:34:16  (info) Build complete
2021-05-03 14:34:22  (info) Function deploy complete
> 2021-05-03 14:34:22,891 [info] function deployed, address=default-tenant.app.yh30.iguazio-c0.com:32046
'http://default-tenant.app.yh30.iguazio-c0.com:32046'

Listen on the output stream

You can use the SDK or CLI to listen on the output stream. Listening should be done in a separate console/notebook. Run:

mlrun watch-stream v3io:///users/admin/out-stream -j

or use the SDK:

from mlrun.platforms import watch_stream
watch_stream("v3io:///users/admin/out-stream", is_json=True)

Test the live function:

Note

The url must be a valid path to the input file.

fn.invoke("", body={"url": "v3io:///users/admin/pipe/in.json"})
{'id': '79354e45-a158-405f-811c-976e9cf4ab5e'}

Advanced model serving graph - notebook example#

This example demonstrates how to use MLRun serving graphs and their advanced functionality including:

  • Use of flow, task, model, and ensemble router states

  • Build tasks from custom handlers, classes and storey components

  • Use custom error handlers

  • Test graphs locally

  • Deploy the graph as a real-time serverless function


Define functions and classes used in the graph#
import mlrun
from cloudpickle import load
from typing import List
from sklearn.datasets import load_iris
import numpy as np


# model serving class example
class ClassifierModel(mlrun.serving.V2ModelServer):
    def load(self):
        """load and initialize the model and/or other elements"""
        model_file, extra_data = self.get_model(".pkl")
        self.model = load(open(model_file, "rb"))

    def predict(self, body: dict) -> List:
        """Generate model predictions from sample."""
        feats = np.asarray(body["inputs"])
        result: np.ndarray = self.model.predict(feats)
        return result.tolist()


# echo class, custom class example
class Echo:
    def __init__(self, context, name=None, **kw):
        self.context = context
        self.name = name
        self.kw = kw

    def do(self, x):
        print("Echo:", self.name, x)
        return x


# error echo function, demo catching error and using custom function
def error_catcher(x):
    x.body = {"body": x.body, "origin_state": x.origin_state, "error": x.error}
    print("EchoError:", x)
    return None
# mark the end of the code section, DO NOT REMOVE !
# mlrun: end-code
Create a new serving function and graph#

Use code_to_function to convert the above code into a serving function object and initialize a graph with an async flow topology.

function = mlrun.code_to_function(
    "advanced", kind="serving", image="mlrun/mlrun", requirements=["storey"]
)
graph = function.set_topology("flow", engine="async")
# function.verbose = True

Specify the sklearn models that are used in the ensemble.

models_path = "https://s3.wasabisys.com/iguazio/models/iris/model.pkl"
path1 = models_path
path2 = models_path

Build and connect the graph (DAG) using the custom functions and classes, and plot the result. Add states using the state.to() method (which adds a new state after the current one), or using the graph.add_step() method.

Use the graph error_handler if you want an error from the graph or from a step to be fed into a specific state (catcher). See the full description in Error handling.

You can specify which state is the responder (i.e. returns the HTTP response) using the state.respond() method. If you don't specify a responder, the graph is non-blocking.

# use a built-in storey class or our custom Echo class to create and link Task steps. Add an error-handling step that runs if the graph step fails
graph.to("storey.Extend", name="enrich", _fn='({"tag": "something"})').to(
    class_name="Echo", name="pre-process", some_arg="abc"
).error_handler(name="catcher", handler="handle_error", full_event=True)

# add an Ensemble router with two child models (routes). The "*" prefix marks it as a router class
router = graph.add_step(
    "*mlrun.serving.VotingEnsemble", name="ensemble", after="pre-process"
)
router.add_route("m1", class_name="ClassifierModel", model_path=path1)
router.add_route("m2", class_name="ClassifierModel", model_path=path2)

# add the final step (after the router) that handles post processing and responds to the client
graph.add_step(class_name="Echo", name="final", after="ensemble").respond()

# plot the graph (using Graphviz) and run a test
graph.plot(rankdir="LR")
Test the function locally#

Create a test set.

import random

iris = load_iris()
x = random.sample(iris["data"].tolist(), 5)

Create a mock server (simulator) and test the graph with the test data.

Note: The model and router objects support a common serving protocol API, see the protocol and API section.

server = function.to_mock_server()
resp = server.test("/v2/models/infer", body={"inputs": x})
server.wait_for_completion()
resp
> 2021-01-09 22:49:26,365 [info] model m1 was loaded
> 2021-01-09 22:49:26,493 [info] model m2 was loaded
> 2021-01-09 22:49:26,494 [info] Loaded ['m1', 'm2']
Echo: pre-process {'inputs': [[6.9, 3.2, 5.7, 2.3], [6.4, 2.7, 5.3, 1.9], [4.9, 3.1, 1.5, 0.1], [7.3, 2.9, 6.3, 1.8], [5.4, 3.7, 1.5, 0.2]], 'tag': 'something'}
Echo: final {'model_name': 'ensemble', 'outputs': [2, 2, 0, 2, 0], 'id': '0ebcc5f6f4c24d4d83eb36391eaefb98'}
{'model_name': 'ensemble',
 'outputs': [2, 2, 0, 2, 0],
 'id': '0ebcc5f6f4c24d4d83eb36391eaefb98'}
Deploy the graph as a real-time serverless function#
function.deploy()
> 2021-01-09 22:49:40,088 [info] Starting remote function deploy
2021-01-09 22:49:40  (info) Deploying function
2021-01-09 22:49:40  (info) Building
2021-01-09 22:49:40  (info) Staging files and preparing base images
2021-01-09 22:49:40  (info) Building processor image
2021-01-09 22:49:41  (info) Build complete
2021-01-09 22:49:47  (info) Function deploy complete
> 2021-01-09 22:49:48,422 [info] function deployed, address=default-tenant.app.yh55.iguazio-cd0.com:32222
'http://default-tenant.app.yh55.iguazio-cd0.com:32222'

Invoke the remote function using the test data

function.invoke("/v2/models/infer", body={"inputs": x})
{'model_name': 'ensemble',
 'outputs': [1, 2, 0, 0, 0],
 'id': '0ebcc5f6f4c24d4d83eb36391eaefb98'}

See the MLRun demos repository for additional use cases and full end-to-end examples, including Fraud Prevention using the Iguazio feature store, a mask detection demo, and converting existing ML code to an MLRun project.

Serving graph high availability configuration#

This figure illustrates a simplistic flow of an MLRun serving graph with remote invocation (figure: graph-flow).

As explained in Real-time serving pipelines (graphs), the serving graph is based on Nuclio functions.


Using Nuclio with stream triggers#

Nuclio can use different trigger types. When used with stream triggers, such as Kafka and V3IO, it uses a consumer group to continue reading from the last processed offset on function restart. This provides the "at least once" semantics for stateless functions. However, if the function does have state, such as persisting a batch of events to storage (e.g. parquet files, database) or if the function performs additional processing of events after the function handler returns, then the flow can get into situations where events seem to be lost. The mechanism of Window ACK provides a solution for such stateful event processing.

Note

For stateful functions, each worker has its own state. See, for example, MapWithState().

With Window ACK, the consumer group's committed offset is delayed by one window, committing the offset at (processed event num – window). When the function restarts (for any reason including scale-up or scale-down), it starts consuming from this last committed point.

The size of the required Window ACK is based on the number of events that could be in processing when the function terminates. You can define a window ACK per trigger (Kafka, V3IO stream, etc.). When used with a serving graph, the appropriate Window ACK size depends on the graph structure and should be calculated accordingly. The following sections explain the relevant considerations.

Consumer function configuration#

A consumer function is essentially a Nuclio function with a stream trigger. As part of the trigger, you can set a consumer group.

The number of replicas per function depends on the source:

  • StreamSource: The number of replicas is derived from the number of shards and is therefore nonconfigurable. Furthermore, the number of workers in each replica is set to 1 and also is not configurable.

  • KafkaSource: For Nuclio earlier than 1.12.10, it is 1 and non-configurable. For 1.12.10 and later, the number of replicas is set with, for example:

    • function.spec.min_replicas = 2. Default = 1

    • function.spec.max_replicas = 3. Default = 4

    and the number of workers is set with:

    • KafkaSource(attributes={"max_workers": 1}). Default = 1

The consumer function has one buffer per worker, measured in number of messages, holding the incoming events that were received by the worker and are waiting to be processed. Once this buffer is full, events need to be processed so that the function is able to receive more events. The buffer size is configurable and is key to the overall configuration.

The buffer should be as small as possible. There is a trade-off between the buffer size and the latency. A larger buffer has lower latency but increases the recovery time after a failure, due to the high number of records that need to be reprocessed.
To set the buffer size:

function.spec.parameters["source_args"] = {"buffer_size": 1}

The default buffer_size is 8 (messages).
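
A hedged sketch that consolidates the settings above for a Kafka-based consumer function (the function, broker, and topic names are illustrative):

from mlrun.datastore.sources import KafkaSource

fn = project.get_function("consumer")                     # hypothetical consumer function
fn.spec.min_replicas = 2                                  # KafkaSource, Nuclio 1.12.10 and later
fn.spec.max_replicas = 3
fn.spec.parameters["source_args"] = {"buffer_size": 4}    # keep the buffer small
source = KafkaSource(
    brokers=["broker:9092"],                              # hypothetical broker
    topics=["in-topic"],
    attributes={"max_workers": 1},                        # workers per replica
)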

Remote function retry mechanism#

The required processing time of a remote function varies, depending on the function. The system assumes a processing time in the order of seconds, which affects the default configurations. However, some functions require a longer processing time. You can configure the timeout on both the caller and on the remote, as appropriate for your functions.

When an event is sent to the remote function, and no response is received by the configured (or default) timeout, or an error 500 (the remote function failed), or error 502, 503, or 504 (the remote function is too busy to handle the request at this time) is received, the caller retries the request, using the platform's exponential retry backoff mechanism. If the number of caller retries reaches the configured maximum number of retries, the event is pushed to the exception stream, indicating that this event did not complete successfully. You can look at the exception stream to see the functions that did not complete successfully.

Remote-function caller configuration#

In a simplistic flow these are the consumer function defaults:

  • Maximum retries: The default is 6, which is equivalent to about 3-4 minutes if all of the related parameters are at their default values. If you expect that some cases will require more time, for example when a new node needs to be scaled up (which depends on your cloud vendor, the instance type, and the zone you are running in), you might want to increase the number of retries.

  • Remote step http timeout: The time interval the caller waits for a response from the remote before retrying the request. This value is affected by the remote function processing time.

  • Max in flight: The maximum number of requests that each caller worker can send in parallel to the remote function. If the caller has more than one worker, each worker has its own Max in flight.

To set Max in flight, timeout, and retries:

RemoteStep(name="remote_scale", ..., max_in_flight=2, timeout=100, retries=10)

Remote-function configuration#

For the remote function, you can configure the following:

  • Worker timeout: The maximum time interval, in seconds, an incoming request waits for an available worker. The worker timeout must be shorter than the gateway timeout. The default is 10.

  • Gateway timeout: The maximum time interval, in seconds, the gateway waits for a response to a request. This determines when the ingress times out on a request. It must be slightly longer than the expected function processing time. The default is 60.

To set the gateway timeout and worker timeout:

my_serving_func.with_http(gateway_timeout=125, worker_timeout=60)

Configuration considerations#

The following figure zooms in on a single consumer and its workers and illustrates the various concepts and parameters that provide high availability, using a non-default configuration (figure: graph-ha-params).

  • Assume the processing time of the remote function is Pt, in seconds.

  • timeout: Between <Pt+epsilon> and <Pt+worker_timeout>.

  • Serving function

    • gateway_timeout: Pt+1 second (usually sufficient).

    • worker_timeout: The general rule is the greater of Pt/10 or 60 seconds. However, you should adjust the value according to your needs.

  • max_in_flight: If the processing time is very high then max_in_flight should be low. Otherwise, there will be many retries.

  • ack_window_size:

    • With 1 worker: The consumer buffer_size+max_in_flight, since the window is per shard and there is a single worker.

    • With >1 worker: The consumer (#workers x buffer_size)+max_in_flight

Make sure you thoroughly understand your serving graph and its functions before defining the ack_window_size. Its value depends on the entire graph flow. You need to understand which steps are parallel (branching) vs. sequential invocation. Another key aspect is that the number of workers affects the window size.

See the add_v3io_stream_trigger API reference.

For example:

  • If a graph includes: consumer -> remote r1 -> remote r2:

    • The window should be the sum of: consumer’s buffer_size + max_in_flight to r1 + max_in_flight to r2.

  • If a graph includes: calling to remote r1 and r2 in parallel:

    • The window should be set to: consumer’s buffer_size + max (max_in_flight to r1, max_in_flight to r2).
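
As a worked example of the first (sequential) topology: with the default buffer_size of 8 and max_in_flight=2 toward each of r1 and r2, the window is 8 + 2 + 2 = 12. A hedged sketch of applying it on the consumer's stream trigger, assuming the ack_window_size argument of add_v3io_stream_trigger (see the API reference above):

fn.add_v3io_stream_trigger(
    stream_path="v3io:///users/admin/in-stream",  # illustrative stream path
    name="stream",
    ack_window_size=12,  # buffer_size + max_in_flight(r1) + max_in_flight(r2)
)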

Error handling#

Graph steps might raise an exception. You can define exception handling (an error handling flow) that is triggered on error. The exception can be on a:

  • step: The error handler is appended to the step that, if it fails, triggers the error handling. If you want the graph to continue after an error handler execution, specify the next step in the before parameter. If you want the graph to complete after an error handler execution, omit the before parameter.

  • graph: When set on the graph object, the graph completes after the error handler execution.

Example of exception handling on a step; the error handler runs only when/if the "pre-process" step fails:

graph = function.set_topology('flow', engine='async')
graph.to(name='pre-process', handler='raising_step').error_handler(name='catcher', handler='handle_error', full_event=True, before='echo')

# Add another step after pre-process step or the error handling
graph.add_step(name="echo", handler='echo', after="pre-process").respond()
graph

Example of exception handling on a graph:

graph = function.set_topology('flow', engine='async')
graph.error_handler(name='error_catcher', handler='handle_error', full_event=True, before='echo')
graph.to(name='raise', handler='raising_step').to(name="echo", handler='echo', after="raise").respond()

See full parameter description in error_handler.

Exception stream#

The graph errors/exceptions can be pushed into a special error stream. This is very convenient in the case of distributed and production graphs.

To set the exception stream address (using v3io streams uri):

fn_preprocess2.spec.error_stream = err_stream

Model monitoring#

By definition, ML models in production make inferences on constantly changing data. Even models that have been trained on massive data sets, with the most meticulously labelled data, start to degrade over time, due to concept drift. Changes in the live environment due to changing behavioral patterns, seasonal shifts, new regulatory environments, market volatility, etc., can have a big impact on a trained model’s ability to make accurate predictions.

Model performance monitoring is a basic operational task that is implemented after an AI model has been deployed. Model monitoring includes:

  • Built-in model monitoring: Machine learning model monitoring is natively built in to the Iguazio MLOps Platform, along with a wide range of model management features and ML monitoring reports. It monitors all of your models in a single, simple dashboard.

  • Automated drift detection: Automatically detects concept drift, anomalies, data skew, and model drift in real-time. Even if you are running hundreds of models simultaneously, you can be sure to spot and remediate the one that has drifted.

  • Automated retraining: When drift is detected, Iguazio automatically starts the entire training pipeline to retrain the model, including all relevant steps in the pipeline. The output is a production-ready challenger model, ready to be deployed. This keeps your models up to date, automatically.

  • Native feature store integration: Feature vectors and labels are stored and analyzed in the Iguazio feature store and are easily compared to the trained features and labels running as part of the model development phase, making it easier for data science teams to collaborate and maintain consistency between AI projects.

See full details and examples in Model monitoring.