Quick start tutorial#

Introduction to MLRun - Use serverless functions to train and deploy models

This notebook provides a quick overview of developing and deploying machine learning applications using the MLRun MLOps orchestration framework.

Tutorial steps:

- Install MLRun
- Define an MLRun project and ML functions
- Run the data processing function and log artifacts
- Train a model using an MLRun built-in hub function

Install MLRun#

MLRun has a backend service that can run locally or over Kubernetes (preferred). See the instructions for installing it locally using Docker or over a Kubernetes cluster. Alternatively, you can use Iguazio’s managed MLRun service.

Before you start, make sure the MLRun client package is installed and configured properly!

This notebook uses scikit-learn. If it is not installed in your environment, run !pip install scikit-learn~=1.0.

# install MLRun, run this only once (restart the notebook after the install !!!)
%pip install mlrun

Restart the notebook kernel after the pip installation!

import mlrun

Using a remote MLRun service/cluster#

Skip this section when using a local or pre-configured setup!

When using a remote MLRun service (over Kubernetes or Iguazio’s managed service), the remote URL and credentials must be set.
Create the mlrun.env file by running the cell below and editing it with your service address, username, and access key:

#@title run this cell to create MLRun env file, edit with your own credentials
%%writefile mlrun.env
MLRUN_DBPATH=https://<service-address>
V3IO_USERNAME=<user>
V3IO_ACCESS_KEY=<access-key>

After you create the mlrun.env file, run mlrun.set_env_from_file("mlrun.env") to connect to the remote service.
Alternatively, you can add the variable MLRUN_SET_ENV_FILE=mlrun.env to your environment to load it automatically.
See further details on how to set up the MLRun client environment.

# run this **only** when using remote MLRun service
mlrun.set_env_from_file("mlrun.env")
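For intuition, the env file is a plain KEY=VALUE file; mlrun.set_env_from_file essentially loads those pairs into the process environment (and re-initializes the client configuration). A minimal plain-Python sketch of just the loading step, using placeholder values rather than real credentials:

```python
import os
import tempfile

# Sketch only: mimic loading KEY=VALUE pairs the way an env file is consumed.
# The values below are placeholders, not real credentials.
env_text = "MLRUN_DBPATH=https://example.com\nV3IO_USERNAME=someuser\n"

with tempfile.NamedTemporaryFile("w", suffix=".env", delete=False) as f:
    f.write(env_text)
    path = f.name

with open(path) as f:
    for line in f:
        line = line.strip()
        # skip blank lines and comments, then split on the first "="
        if line and not line.startswith("#"):
            key, _, value = line.partition("=")
            os.environ[key] = value

print(os.environ["MLRUN_DBPATH"])  # https://example.com
```

In the real client, prefer mlrun.set_env_from_file over hand-rolled parsing, since it also refreshes the MLRun client configuration.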

Define MLRun project and ML functions#

An MLRun Project is a container for all your work on a particular activity or application. Projects host functions, workflows, artifacts, secrets, and more. Projects have access control and can be accessed by one or more users; they are usually associated with a Git repository and interact with CI/CD frameworks for automation. See the MLRun Projects documentation.

Create a new project:

project = mlrun.new_project("quick-tutorial", "./", user_project=True)

MLRun Serverless Functions specify the source code, base image, extra package requirements, runtime engine kind (batch job, real-time serving, spark, dask, etc.), and desired resources (cpu, gpu, mem, storage, …). The runtime engines (local, job, Nuclio, Spark, etc.) automatically transform the function code and spec into fully managed and elastic services that run over Kubernetes. Function source code can come from a single file (.py, .ipynb, etc.) or a full archive (git, zip, tar). MLRun can execute an entire file/notebook or specific function classes/handlers.

When you specify a context argument in the function, it can be used to access run metadata and secrets, and to log results or artifacts.

Function code:

Run the following cell to generate the data prep file (or copy it manually):

%%writefile data-prep.py

import pandas as pd
from sklearn.datasets import load_breast_cancer

import mlrun


def breast_cancer_generator(context, format="csv"):
    """a function which generates the breast cancer dataset"""
    breast_cancer = load_breast_cancer()
    breast_cancer_dataset = pd.DataFrame(
        data=breast_cancer.data, columns=breast_cancer.feature_names
    )
    breast_cancer_labels = pd.DataFrame(data=breast_cancer.target, columns=["label"])
    breast_cancer_dataset = pd.concat(
        [breast_cancer_dataset, breast_cancer_labels], axis=1
    )

    context.logger.info("saving breast cancer dataframe")
    context.log_result("label_column", "label")
    context.log_dataset("dataset", df=breast_cancer_dataset, format=format, index=False)
Writing data-prep.py
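To see what the handler produces without running it through MLRun, you can build the same dataframe with plain pandas and scikit-learn (this only requires the packages installed above):

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

# Same dataframe the handler logs: 30 feature columns plus a "label" column
breast_cancer = load_breast_cancer()
df = pd.DataFrame(data=breast_cancer.data, columns=breast_cancer.feature_names)
df["label"] = breast_cancer.target

print(df.shape)  # (569, 31)
```

The handler then logs this dataframe as an MLRun dataset artifact instead of printing it.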

Create a serverless function object from the code above, and register it in the project:

data_gen_fn = project.set_function(
    "data-prep.py",
    name="data-prep",
    kind="job",
    image="mlrun/mlrun",
    handler="breast_cancer_generator",
)
project.save()  # save the project with the latest config

Run your data processing function and log artifacts#

Functions are executed (using the CLI or SDK run command) with an optional handler, various params, inputs, and resource requirements. This generates a run object that can be tracked through the CLI, UI, and SDK. Multiple functions can be executed and tracked as part of a multi-stage pipeline (workflow).

When a function has additional package requirements or needs to include the content of a source archive, you must first build the function using the project.build_function() method.

The local flag indicates whether the function is executed locally or “teleported” and executed in the Kubernetes cluster. The execution progress and results can be viewed in the UI (see hyperlinks below).


Run using the SDK:

gen_data_run = project.run_function("data-prep", params={"format": "csv"}, local=True)
> 2022-06-02 18:39:17,753 [info] starting run gen-cancer-data-breast_cancer_generator uid=ed6d11cebe4647c78101757bf5c52b82 DB=http://mlrun-api:8080
> 2022-06-02 18:39:18,392 [info] saving breast cancer dataframe
project: quick-tutorial-yaron | iter: 0 | start: Jun 02 18:39:17 | state: completed
name: gen-cancer-data-breast_cancer_generator
labels: v3io_user=yaron, kind=, owner=yaron, host=jupyter-56f755bb9-j2nh2
parameters: format=csv
results: label_column=label
artifacts: dataset

> to track results use the .show() or .logs() methods or click here to open in UI
> 2022-06-02 18:39:18,679 [info] run executed, status=completed

Print the run state and outputs:

gen_data_run.state()
'completed'
gen_data_run.outputs
{'label_column': 'label',
 'dataset': 'store://artifacts/quick-tutorial-yaron/gen-cancer-data-breast_cancer_generator_dataset:ed6d11cebe4647c78101757bf5c52b82'}
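The dataset value above is a store URI of the form store://artifacts/&lt;project&gt;/&lt;key&gt;:&lt;producer-uid&gt;. A bit of string handling shows its anatomy (illustration only; in practice you resolve it with run.artifact(), as shown next):

```python
# The store URI from the run outputs above
uri = (
    "store://artifacts/quick-tutorial-yaron/"
    "gen-cancer-data-breast_cancer_generator_dataset:ed6d11cebe4647c78101757bf5c52b82"
)

# Strip the scheme and category, then split into project / key / producer uid
path = uri.removeprefix("store://artifacts/")
project_name, rest = path.split("/", 1)
key, uid = rest.rsplit(":", 1)

print(project_name)  # quick-tutorial-yaron
print(uid)           # ed6d11cebe4647c78101757bf5c52b82
```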

Print the output dataset artifact (a DataItem object) as a dataframe:

gen_data_run.artifact("dataset").as_df().head()
   mean radius  mean texture  mean perimeter  mean area  ...  worst symmetry  worst fractal dimension  label
0        17.99         10.38          122.80     1001.0  ...          0.4601                  0.11890      0
1        20.57         17.77          132.90     1326.0  ...          0.2750                  0.08902      0
2        19.69         21.25          130.00     1203.0  ...          0.3613                  0.08758      0
3        11.42         20.38           77.58      386.1  ...          0.6638                  0.17300      0
4        20.29         14.34          135.10     1297.0  ...          0.2364                  0.07678      0

5 rows × 31 columns

Train a model using MLRun built-in hub function#

MLRun provides a public function hub that hosts a set of pre-implemented and validated ML, DL, and data processing functions.

You can import the auto-trainer hub function, which can train an ML model using a variety of ML frameworks, generate various metrics and charts, and log the model along with its metadata into the MLRun model registry.

# import the function
trainer = mlrun.import_function('hub://auto_trainer')

See the auto_trainer function usage instructions in the marketplace or by typing trainer.doc().
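For intuition, the train handler is roughly equivalent to the following plain scikit-learn steps (a simplified sketch, not the hub function itself; auto_trainer additionally logs the metrics, charts, and model into MLRun):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Rough equivalent of the auto_trainer "train" handler:
# split the dataset 80/20, fit the requested model class, score the test set
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print("accuracy:", accuracy_score(y_test, y_pred))
print("f1_score:", f1_score(y_test, y_pred))
```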

Run the function on the cluster (if one exists):

trainer_run = project.run_function(
    trainer,
    inputs={"dataset": gen_data_run.outputs["dataset"]},
    params={
        "model_class": "sklearn.ensemble.RandomForestClassifier",
        "train_test_split_size": 0.2,
        "label_columns": "label",
        "model_name": "cancer",
    },
    handler="train",
)
> 2022-06-02 18:39:24,030 [info] starting run auto-trainer-train uid=69a14cd1fed34cd48b51ffa3bb9fc5f0 DB=http://mlrun-api:8080
> 2022-06-02 18:39:24,245 [info] Job is running in the background, pod: auto-trainer-train-gn7rr
> 2022-06-02 18:39:29,239 [info] Sample set not given, using the whole training set as the sample set
> 2022-06-02 18:39:29,458 [info] training 'cancer'
> 2022-06-02 18:39:31,255 [info] run executed, status=completed
final state: completed
project: quick-tutorial-yaron | iter: 0 | start: Jun 02 18:39:28 | state: completed
name: auto-trainer-train
labels: v3io_user=yaron, kind=job, owner=yaron, mlrun/client_version=1.1.0-rc3, host=auto-trainer-train-gn7rr
inputs: dataset
parameters: model_class=sklearn.ensemble.RandomForestClassifier, train_test_split_size=0.2, label_columns=label, model_name=cancer
results: accuracy=0.9736842105263158, f1_score=0.979591836734694, precision_score=0.9863013698630136, recall_score=0.972972972972973
artifacts: feature-importance, test_set, confusion-matrix, roc-curves, calibration-curve, model

> to track results use the .show() or .logs() methods or click here to open in UI
> 2022-06-02 18:39:33,606 [info] run executed, status=completed

View the job progress results and the selected run in the MLRun UI:

(screenshot: the train job in the MLRun UI)


Results (metrics) and artifacts are generated and tracked automatically by MLRun:

trainer_run.outputs
{'accuracy': 0.9736842105263158,
 'f1_score': 0.979591836734694,
 'precision_score': 0.9863013698630136,
 'recall_score': 0.972972972972973,
 'feature-importance': 'v3io:///projects/quick-tutorial-yaron/artifacts/feature-importance.html',
 'test_set': 'store://artifacts/quick-tutorial-yaron/auto-trainer-train_test_set:69a14cd1fed34cd48b51ffa3bb9fc5f0',
 'confusion-matrix': 'v3io:///projects/quick-tutorial-yaron/artifacts/confusion-matrix.html',
 'roc-curves': 'v3io:///projects/quick-tutorial-yaron/artifacts/roc-curves.html',
 'calibration-curve': 'v3io:///projects/quick-tutorial-yaron/artifacts/calibration-curve.html',
 'model': 'store://artifacts/quick-tutorial-yaron/cancer:69a14cd1fed34cd48b51ffa3bb9fc5f0'}
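As a quick sanity check, the reported f1_score is the harmonic mean of the precision and recall above:

```python
# Values taken from the trainer_run outputs above
precision = 0.9863013698630136
recall = 0.972972972972973

# f1 = 2 * P * R / (P + R) — harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
print(f1)  # ≈ 0.979591836734694, matching the reported f1_score
```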
# Display HTML output artifacts
trainer_run.artifact('confusion-matrix').show()