Train, compare, and register models#
This notebook provides a quick overview of training ML models using the MLRun MLOps orchestration framework.
Make sure you have reviewed the basics in the MLRun Quick Start Tutorial.
Tutorial steps:
- MLRun installation and configuration
- Define an MLRun project and a training function
- Run the training function and log the artifacts and model
MLRun installation and configuration#
Before running this notebook, make sure the mlrun and scikit-learn packages are installed (pip install mlrun scikit-learn~=1.3) and that you have configured access to the MLRun service (see the sketch after the install cell below).
# Install MLRun if not installed, run this only once (restart the notebook after the install !!!)
%pip install mlrun
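Once mlrun is installed, point the SDK at your MLRun service before creating the project. This is a minimal sketch and not part of the original notebook: the MLRUN_DBPATH value below reuses the API endpoint shown in the run logs later in this tutorial and should be replaced with your own deployment's URL, and V3IO_ACCESS_KEY is only relevant on Iguazio-managed installations.
import os
# Placeholder endpoint -- replace with the URL of your MLRun API service
os.environ["MLRUN_DBPATH"] = "http://mlrun-api:8080"
# os.environ["V3IO_ACCESS_KEY"] = "<access-key>"  # only needed on Iguazio-managed deployments
import mlrun
print(mlrun.mlconf.dbpath)  # verify that the SDK picked up the endpoint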
Define an MLRun project and a training function#
You should create, load, or use (get) an MLRun project that holds all your functions and assets.
Get or create a new project
The get_or_create_project() method tries to load the project from the MLRun DB. If the project does not exist, it creates a new one.
import mlrun
project = mlrun.get_or_create_project("tutorial", context="./", user_project=True)
> 2022-09-20 13:55:10,543 [info] loaded project tutorial from None or context and saved in MLRun DB
Add (auto) MLOps to your training function
Training functions generate models and various model statistics. You’ll want to store the models along with all the relevant data, metadata, and measurements. MLRun can apply all the MLOps functionality automatically (“Auto-MLOps”) by simply using the framework-specific apply_mlrun() method.
This is the line to add to your code, as shown in the training function below.
apply_mlrun(model=model, model_name="my_model", x_test=x_test, y_test=y_test)
apply_mlrun() manages the training process and automatically logs the framework-specific model object along with its details, data, metadata, and metrics. It accepts the model object and various optional parameters. When you specify the x_test and y_test data, it generates various plots and calculations to evaluate the model. Metadata and parameters are automatically recorded (from the MLRun context object) and therefore don't need to be specified.
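If your training code runs outside an MLRun handler (for example, in a plain notebook cell), you can hand apply_mlrun() an explicit context instead of relying on the automatic one. The following is a hedged sketch, not part of the original notebook: it assumes the optional context parameter of the scikit-learn apply_mlrun and the mlrun.get_or_create_ctx helper, and rebuilds a small model locally just to make the example self-contained.
import mlrun
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from mlrun.frameworks.sklearn import apply_mlrun

# Prepare a small example model and data split
x, y = load_breast_cancer(return_X_y=True, as_frame=True)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
model = GradientBoostingClassifier()

# Create (or fetch) a run context manually and attach the auto-logging to it
context = mlrun.get_or_create_ctx("train")
apply_mlrun(model=model, model_name="my_model", x_test=x_test, y_test=y_test, context=context)
model.fit(x_train, y_train)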
Function code
Run the following cell to generate the trainer.py file (or copy it manually):
%%writefile trainer.py
import pandas as pd
from sklearn import ensemble
from sklearn.model_selection import train_test_split
import mlrun
from mlrun.frameworks.sklearn import apply_mlrun
@mlrun.handler()
def train(
    dataset: pd.DataFrame,
    label_column: str = "label",
    n_estimators: int = 100,
    learning_rate: float = 0.1,
    max_depth: int = 3,
    model_name: str = "cancer_classifier",
):
    # Initialize the x & y data
    x = dataset.drop(label_column, axis=1)
    y = dataset[label_column]

    # Train/Test split the dataset
    x_train, x_test, y_train, y_test = train_test_split(
        x, y, test_size=0.2, random_state=42
    )

    # Pick an ideal ML model
    model = ensemble.GradientBoostingClassifier(
        n_estimators=n_estimators, learning_rate=learning_rate, max_depth=max_depth
    )

    # -------------------- The only line you need to add for MLOps -------------------------
    # Wraps the model with MLOps (test set is provided for analysis & accuracy measurements)
    apply_mlrun(model=model, model_name=model_name, x_test=x_test, y_test=y_test)
    # --------------------------------------------------------------------------------------

    # Train the model
    model.fit(x_train, y_train)
Overwriting trainer.py
Create a serverless function object from the code above, and register it in the project
trainer = project.set_function(
"trainer.py", name="trainer", kind="job", image="mlrun/mlrun", handler="train"
)
Run the training function and log the artifacts and model#
Create a dataset for training
import pandas as pd
from sklearn.datasets import load_breast_cancer
breast_cancer = load_breast_cancer()
breast_cancer_dataset = pd.DataFrame(
data=breast_cancer.data, columns=breast_cancer.feature_names
)
breast_cancer_labels = pd.DataFrame(data=breast_cancer.target, columns=["label"])
breast_cancer_dataset = pd.concat([breast_cancer_dataset, breast_cancer_labels], axis=1)
breast_cancer_dataset.to_csv("cancer-dataset.csv", index=False)
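As an alternative to (or in addition to) writing a local CSV, you can register the dataset in the project so other runs can reference it by a store URI. This is only an illustrative sketch: it assumes your MLRun version exposes project.log_dataset with these parameters, and the artifact key cancer-dataset is an arbitrary name.
# Illustrative sketch (assumes project.log_dataset is available in your MLRun version)
dataset_artifact = project.log_dataset(
    "cancer-dataset", df=breast_cancer_dataset, format="csv"
)
print(dataset_artifact.uri)  # store://... URI that can be passed as a run input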
Run the function (locally) using the generated dataset
trainer_run = project.run_function(
"trainer",
inputs={"dataset": "cancer-dataset.csv"},
params={"n_estimators": 100, "learning_rate": 1e-1, "max_depth": 3},
local=True,
)
> 2022-09-20 13:56:57,630 [info] starting run trainer-train uid=b3f1bc3379324767bee22f44942b96e4 DB=http://mlrun-api:8080
| project | uid | iter | start | state | name | labels | inputs | parameters | results | artifacts |
|---|---|---|---|---|---|---|---|---|---|---|
| tutorial-iguazio | | 0 | Sep 20 13:56:57 | completed | trainer-train | v3io_user=iguazio kind= owner=iguazio host=jupyter-5654cb444f-c9wk2 | dataset | n_estimators=100 learning_rate=0.1 max_depth=3 | accuracy=0.956140350877193 f1_score=0.965034965034965 precision_score=0.9583333333333334 recall_score=0.971830985915493 | feature-importance test_set confusion-matrix roc-curves calibration-curve model |
> 2022-09-20 13:56:59,356 [info] run executed, status=completed
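The run above executed locally, inside the notebook's Python process (local=True). The following is a sketch, not part of the original notebook, of submitting the same training as a job on the cluster: it assumes a Kubernetes-backed MLRun service and a dataset location the cluster can read (for example, the store URI of a logged dataset artifact), so the input value below is only a placeholder.
# Sketch: submit the same training as a remote job instead of a local run
remote_run = project.run_function(
    "trainer",
    inputs={"dataset": "store://artifacts/<project>/<dataset-key>"},  # placeholder URI reachable from the cluster
    params={"n_estimators": 100, "learning_rate": 1e-1, "max_depth": 3},
    local=False,
)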
View the auto-generated results and artifacts
trainer_run.outputs
{'accuracy': 0.956140350877193,
'f1_score': 0.965034965034965,
'precision_score': 0.9583333333333334,
'recall_score': 0.971830985915493,
'feature-importance': 'v3io:///projects/tutorial-iguazio/artifacts/trainer-train/0/feature-importance.html',
'test_set': 'store://artifacts/tutorial-iguazio/trainer-train_test_set:b3f1bc3379324767bee22f44942b96e4',
'confusion-matrix': 'v3io:///projects/tutorial-iguazio/artifacts/trainer-train/0/confusion-matrix.html',
'roc-curves': 'v3io:///projects/tutorial-iguazio/artifacts/trainer-train/0/roc-curves.html',
'calibration-curve': 'v3io:///projects/tutorial-iguazio/artifacts/trainer-train/0/calibration-curve.html',
'model': 'store://artifacts/tutorial-iguazio/cancer_classifier:b3f1bc3379324767bee22f44942b96e4'}
trainer_run.artifact("feature-importance").show()
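The trained model was registered in the project as a model artifact (the store://... URI under the model key above), so it can be fetched later for comparison or deployment. A minimal sketch, assuming the get_model helper in mlrun.artifacts and the default pickle format produced by the scikit-learn auto-logging:
# Sketch: download the registered model and its metadata for later use
from mlrun.artifacts import get_model

model_path = trainer_run.outputs["model"]  # store://... URI logged by apply_mlrun
model_file, model_artifact, extra_data = get_model(model_path)
print(model_file)  # local path of the downloaded model file (e.g. a .pkl)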