Train, compare, and register models

This notebook provides a quick overview of training ML models with the MLRun MLOps orchestration framework.

Make sure you have reviewed the basics in the MLRun Quick Start Tutorial.

Tutorial steps:

MLRun installation and configuration

Before running this notebook, make sure the mlrun and scikit-learn packages are installed (pip install mlrun scikit-learn~=1.0) and that you have configured access to the MLRun service.

# Install MLRun and scikit-learn if they are not already installed.
# Run this only once, then restart the notebook.
%pip install mlrun scikit-learn~=1.0
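
Access to the MLRun service is usually configured by pointing the client at the API endpoint. The following is a minimal sketch, assuming the endpoint address that appears in the run log later in this tutorial (http://mlrun-api:8080) and the MLRUN_DBPATH environment variable read by the MLRun client. The address is an assumption for illustration; your deployment's URL, and any credentials it requires, will differ.

```python
import os

# Assumption: the MLRun API address; replace with your own deployment's URL.
MLRUN_API_URL = "http://mlrun-api:8080"

# Set the variable only if it is not already defined, so an existing
# configuration (for example, one preset by a managed Jupyter service) wins.
os.environ.setdefault("MLRUN_DBPATH", MLRUN_API_URL)

print("MLRun API endpoint:", os.environ["MLRUN_DBPATH"])
```

Setting the variable before importing mlrun is the safest order, since the client reads its configuration when it loads.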

Define an MLRun project and a training function

You should create, load, or use (get) an MLRun Project that holds all your functions and assets.

Get or create a new project:

The get_or_create_project() method tries to load the project from the MLRun DB. If the project does not exist, it creates a new one.

import mlrun
project = mlrun.get_or_create_project("tutorial", context="./", user_project=True)
> 2022-06-12 22:22:44,591 [info] loaded project tutorial from MLRun DB
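
The load-if-exists, else-create behavior can be pictured with a small stand-alone sketch. This is not MLRun's implementation: the in-memory dictionary stands in for the MLRun DB, and the get_or_create helper and its username argument are hypothetical. The name suffixing under user_project=True is an assumption based on the "tutorial-yaron" project name that appears in this tutorial's run output.

```python
import getpass

# Hypothetical in-memory stand-in for the MLRun DB, for illustration only.
_project_db = {}


def get_or_create(name, user_project=False, username=None):
    """Return (project, created) using load-if-exists, else-create semantics."""
    if user_project:
        # Assumption: the user name is appended to make the project per-user,
        # which matches the "tutorial-yaron" name in this tutorial's output.
        username = username or getpass.getuser()
        name = f"{name}-{username.lower()}"
    if name in _project_db:
        return _project_db[name], False  # loaded an existing project
    _project_db[name] = {"name": name, "functions": {}}
    return _project_db[name], True  # created a new project


first, created_first = get_or_create("tutorial", user_project=True, username="yaron")
second, created_second = get_or_create("tutorial", user_project=True, username="yaron")
print(first["name"], created_first, created_second)
```

The second call returns the same project object without creating anything, which is why rerunning the notebook cell is safe.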

Add (auto) MLOps to your training function:

Training functions generate models and various model statistics. You’ll want to store the models along with all the relevant data, metadata, and measurements. MLRun can apply all the MLOps functionality automatically (“Auto-MLOps”) when you call the framework-specific apply_mlrun() method.

In the training function below, note the single custom line you need to add to your code:

apply_mlrun(model=model, model_name="my_model", x_test=x_test, y_test=y_test)

apply_mlrun() manages the training process and automatically logs the framework-specific model object along with its details, data, metadata, and metrics. It accepts the model object and various optional parameters. When you specify the x_test and y_test data, it generates various plots and calculations to evaluate the model. Metadata and parameters are recorded automatically (from the MLRun context object) and don’t need to be specified.
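
To see why a single call made before model.fit() can be enough, consider the conceptual sketch below. It is not how MLRun implements apply_mlrun(); the toy MajorityClassifier, the apply_autolog helper, and the run_log dictionary are all hypothetical names used only to illustrate the idea of wrapping fit() so that evaluation and logging happen automatically once training finishes.

```python
class MajorityClassifier:
    """Toy model: always predicts the most common training label."""

    def fit(self, X, y):
        self.majority_ = max(set(y), key=list(y).count)
        return self

    def predict(self, X):
        return [self.majority_ for _ in X]


def apply_autolog(model, x_test, y_test, log):
    """Hypothetical helper: wrap model.fit so evaluation runs after training."""
    original_fit = model.fit  # bound method captured before patching

    def fit_and_log(X, y):
        result = original_fit(X, y)
        predictions = model.predict(x_test)
        correct = sum(p == t for p, t in zip(predictions, y_test))
        log["accuracy"] = correct / len(y_test)
        log["model"] = type(model).__name__
        return result

    model.fit = fit_and_log  # instance attribute shadows the class method


run_log = {}
model = MajorityClassifier()
apply_autolog(model, x_test=[[0], [1], [2], [3]], y_test=[1, 1, 0, 1], log=run_log)
model.fit([[0], [1], [2]], [1, 1, 0])  # training code stays unchanged
print(run_log)
```

The training call itself is untouched; the wrapper installed beforehand is what records the results, which mirrors the ordering used in the trainer function of this tutorial.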

Function code:

Run the following cell to generate the trainer.py file (or copy it manually):

%%writefile trainer.py

from sklearn import ensemble
from sklearn.model_selection import train_test_split

import mlrun
from mlrun.frameworks.sklearn import apply_mlrun


def train(
    dataset: mlrun.DataItem,  # data inputs are of type DataItem (which abstracts the data source)
    label_column: str = "label",
    n_estimators: int = 100,
    learning_rate: float = 0.1,
    max_depth: int = 3,
    model_name: str = "cancer_classifier",
):
    # Get the input dataframe (Use DataItem.as_df() to access any data source)
    df = dataset.as_df()

    # Initialize the x & y data
    X = df.drop(label_column, axis=1)
    y = df[label_column]

    # Train/Test split the dataset
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    # Pick an ideal ML model
    model = ensemble.GradientBoostingClassifier(
        n_estimators=n_estimators, learning_rate=learning_rate, max_depth=max_depth
    )

    # -------------------- The only line you need to add for MLOps -------------------------
    # Wraps the model with MLOps (test set is provided for analysis & accuracy measurements)
    apply_mlrun(model=model, model_name=model_name, x_test=X_test, y_test=y_test)
    # --------------------------------------------------------------------------------------

    # Train the model
    model.fit(X_train, y_train)
Writing trainer.py

Create a serverless function object from the code above, and register it in the project:

trainer = project.set_function("trainer.py", name="trainer", kind="job", image="mlrun/mlrun", handler="train")

Run the training function and log the artifacts and model

Create a dataset for training:

import pandas as pd
from sklearn.datasets import load_breast_cancer

# Load the breast cancer dataset and combine features and labels in one dataframe
breast_cancer = load_breast_cancer()
breast_cancer_dataset = pd.DataFrame(data=breast_cancer.data, columns=breast_cancer.feature_names)
breast_cancer_labels = pd.DataFrame(data=breast_cancer.target, columns=["label"])
breast_cancer_dataset = pd.concat([breast_cancer_dataset, breast_cancer_labels], axis=1)

# Save the dataset as a CSV file that the training function reads as its input
breast_cancer_dataset.to_csv("cancer-dataset.csv", index=False)

Run the function (locally) using the generated dataset:

trainer_run = project.run_function(
    "trainer",
    inputs={"dataset": "cancer-dataset.csv"},
    params={"n_estimators": 100, "learning_rate": 1e-1, "max_depth": 3},
    local=True,
)
> 2022-06-12 22:23:55,871 [info] starting run trainer-train uid=01bed1db002c44748cbab1445626c1cd DB=http://mlrun-api:8080
project: tutorial-yaron
iter: 0
start: Jun 12 22:23:55
state: completed
name: trainer-train
labels: v3io_user=yaron, kind=, owner=yaron, host=jupyter-56f755bb9-j2nh2
inputs: dataset
parameters: n_estimators=100, learning_rate=0.1, max_depth=3
results: accuracy=0.956140350877193, f1_score=0.965034965034965, precision_score=0.9583333333333334, recall_score=0.971830985915493
artifacts: feature-importance, test_set, confusion-matrix, roc-curves, calibration-curve, model

> to track results use the .show() or .logs() methods
> 2022-06-12 22:23:59,749 [info] run executed, status=completed
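
The four scores in the run results all derive from the same confusion matrix. The worked example below shows the arithmetic. The counts are an assumption: they are not printed in the run output and were inferred as one set of values consistent with the reported scores and with a 20% test split of the 569-row breast cancer dataset.

```python
# Assumption: confusion-matrix counts inferred from the reported metrics;
# they are not printed in the run output.
tp, fp, fn, tn = 69, 3, 2, 40

total = tp + fp + fn + tn  # number of rows in the test split
accuracy = (tp + tn) / total  # share of all predictions that are correct
precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)  # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

The actual confusion matrix for a run is available in the confusion-matrix artifact, which is the authoritative source for these counts.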

View the auto-generated results and artifacts:

trainer_run.outputs
{'accuracy': 0.956140350877193,
 'f1_score': 0.965034965034965,
 'precision_score': 0.9583333333333334,
 'recall_score': 0.971830985915493,
 'feature-importance': 'v3io:///projects/tutorial-yaron/artifacts/feature-importance.html',
 'test_set': 'store://artifacts/tutorial-yaron/trainer-train_test_set:01bed1db002c44748cbab1445626c1cd',
 'confusion-matrix': 'v3io:///projects/tutorial-yaron/artifacts/confusion-matrix.html',
 'roc-curves': 'v3io:///projects/tutorial-yaron/artifacts/roc-curves.html',
 'calibration-curve': 'v3io:///projects/tutorial-yaron/artifacts/calibration-curve.html',
 'model': 'store://artifacts/tutorial-yaron/cancer_classifier:01bed1db002c44748cbab1445626c1cd'}

# Display the feature-importance plot artifact
trainer_run.artifact('feature-importance').show()
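
The outputs dictionary shown above mixes scalar results with artifact references. A minimal sketch of separating the two, using a literal subset of the values from this tutorial's run so that it works without an MLRun service (on a live run you would pass trainer_run.outputs instead):

```python
# Subset of the outputs dictionary from the tutorial run (values copied as-is).
outputs = {
    "accuracy": 0.956140350877193,
    "f1_score": 0.965034965034965,
    "feature-importance": "v3io:///projects/tutorial-yaron/artifacts/feature-importance.html",
    "model": "store://artifacts/tutorial-yaron/cancer_classifier:01bed1db002c44748cbab1445626c1cd",
}

# Scalar results are numbers; artifacts are referenced by URI strings.
metrics = {k: v for k, v in outputs.items() if isinstance(v, (int, float))}
artifacts = {k: v for k, v in outputs.items() if isinstance(v, str)}

print(sorted(metrics))
print(sorted(artifacts))
```

Splitting by value type is a convenience for this example only; it assumes that every scalar result is numeric and every artifact is referenced by a string URI, as is the case in the output above.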