Create a basic training job

In this section, you create a simple job that trains a model while MLRun's auto-logging captures the metrics, logs, and plots:

Define the training code

The training code is below. Notice that only a single line of MLRun code is needed to add all the MLOps capabilities:

%%writefile trainer.py

from sklearn import ensemble
from sklearn.model_selection import train_test_split

import mlrun
from mlrun.frameworks.sklearn import apply_mlrun


def train(
    dataset: mlrun.DataItem,  # data inputs are of type DataItem (abstract the data source)
    label_column: str = "label",
    n_estimators: int = 100,
    learning_rate: float = 0.1,
    max_depth: int = 3,
    model_name: str = "cancer_classifier",
):
    # Get the input dataframe (Use DataItem.as_df() to access any data source)
    df = dataset.as_df()

    # Initialize the x & y data
    X = df.drop(label_column, axis=1)
    y = df[label_column]

    # Train/Test split the dataset
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    # Initialize the gradient boosting classifier with the given hyperparameters
    model = ensemble.GradientBoostingClassifier(
        n_estimators=n_estimators, learning_rate=learning_rate, max_depth=max_depth
    )

    # -------------------- The only line you need to add for MLOps -------------------------
    # Wraps the model with MLOps (test set is provided for analysis & accuracy measurements)
    apply_mlrun(model=model, model_name=model_name, x_test=X_test, y_test=y_test)
    # --------------------------------------------------------------------------------------

    # Train the model
    model.fit(X_train, y_train)
Writing trainer.py
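
Because the dataset parameter is an mlrun.DataItem, the same handler works for any supported data source. As a minimal illustration (the local path below is hypothetical), a DataItem can be created directly from a path or URL and loaded as a DataFrame:

import mlrun

# Hypothetical local copy of the dataset; an S3/HTTP URL or store:// URI works the same way
item = mlrun.get_dataitem("./data/cancer-dataset.csv")
df = item.as_df()  # DataItem.as_df() loads the item into a pandas DataFrame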

Create the job

Next, use code_to_function to package the training code into an MLRun job that is ready to execute on the cluster:

import mlrun

training_job = mlrun.code_to_function(
    name="basic-training",
    filename="trainer.py",
    kind="job",
    image="mlrun/mlrun",
    handler="train",
)
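
Before submitting to the cluster, you can optionally smoke-test the function in the local environment by passing local=True to run(). A minimal sketch (the reduced n_estimators is only to keep the test fast):

# Optional local smoke test: runs the handler in this environment instead of a cluster pod
local_run = training_job.run(
    local=True,
    inputs={
        "dataset": "https://igz-demo-datasets.s3.us-east-2.amazonaws.com/cancer-dataset.csv"
    },
    params={"n_estimators": 10},  # small model for a quick check
)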

Run the job

Finally, run the job. In this example the dataset is read from S3, but in a typical pipeline it is the output of a previous step.

run = training_job.run(
    inputs={
        "dataset": "https://igz-demo-datasets.s3.us-east-2.amazonaws.com/cancer-dataset.csv"
    },
    params={"n_estimators": 100, "learning_rate": 1e-1, "max_depth": 3},
)
> 2022-07-22 22:27:15,162 [info] starting run basic-training-train uid=bc1c6ad491c340e1a3b9b91bb520454f DB=http://mlrun-api:8080
> 2022-07-22 22:27:15,349 [info] Job is running in the background, pod: basic-training-train-kkntj
> 2022-07-22 22:27:20,927 [info] run executed, status=completed
final state: completed
project:     default
uid:         bc1c6ad491c340e1a3b9b91bb520454f
iter:        0
start:       Jul 22 22:27:18
state:       completed
name:        basic-training-train
labels:      v3io_user=nick, kind=job, owner=nick, mlrun/client_version=1.0.4, host=basic-training-train-kkntj
inputs:      dataset
parameters:  n_estimators=100, learning_rate=0.1, max_depth=3
results:     accuracy=0.956140350877193, f1_score=0.965034965034965, precision_score=0.9583333333333334, recall_score=0.971830985915493
artifacts:   feature-importance, test_set, confusion-matrix, roc-curves, calibration-curve, model

> to track results use the .show() or .logs() methods or click here to open in UI
> 2022-07-22 22:27:21,640 [info] run executed, status=completed
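
The same job can also drive a hyperparameter sweep. As a sketch (assuming the same dataset input), passing lists via hyperparams runs a grid search over the parameter combinations, and the selector picks the best child run by a logged result:

# Sketch: grid search over two hyperparameters, selecting the best run by accuracy
sweep = training_job.run(
    inputs={
        "dataset": "https://igz-demo-datasets.s3.us-east-2.amazonaws.com/cancer-dataset.csv"
    },
    hyperparams={"n_estimators": [10, 100, 500], "max_depth": [2, 3, 5]},
    hyper_param_options=mlrun.model.HyperParamOptions(selector="max.accuracy"),
)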

View job results

Once the job is complete, you can view the output metrics and visualize the artifacts.

run.outputs
{'accuracy': 0.956140350877193,
 'f1_score': 0.965034965034965,
 'precision_score': 0.9583333333333334,
 'recall_score': 0.971830985915493,
 'feature-importance': 'v3io:///projects/default/artifacts/feature-importance.html',
 'test_set': 'store://artifacts/default/basic-training-train_test_set:bc1c6ad491c340e1a3b9b91bb520454f',
 'confusion-matrix': 'v3io:///projects/default/artifacts/confusion-matrix.html',
 'roc-curves': 'v3io:///projects/default/artifacts/roc-curves.html',
 'calibration-curve': 'v3io:///projects/default/artifacts/calibration-curve.html',
 'model': 'store://artifacts/default/cancer_classifier:bc1c6ad491c340e1a3b9b91bb520454f'}
run.artifact("confusion-matrix").show()
run.artifact("feature-importance").show()