Part 2: Training an ML Model

This part of the MLRun getting-started tutorial walks you through the steps for training a machine-learning (ML) model, including data exploration and model testing.

The tutorial consists of the following steps:

  1. Setup and Configuration

  2. Creating a training function

  3. Exploring the data with an MLRun marketplace function

  4. Testing your model

By the end of this tutorial you’ll learn how to

  • Create a training function, store models, and track experiments while running them.

  • Use artifacts as inputs to functions.

  • Leverage the MLRun functions marketplace.

  • View plot artifacts.

Prerequisites

The following steps are a continuation of the previous part of this getting-started tutorial and rely on the generated outputs. Therefore, make sure to first run part 1 of the tutorial.

Step 1: Setup and Configuration

Importing Libraries

Run the following code to import required libraries:

from os import path
import mlrun

Initializing Your MLRun Environment

Use the set_environment MLRun method to configure the working environment and default configuration. Set the project and user_project parameters to the same values that you used in the call to this method in the Part 1: MLRun Basics tutorial notebook.

# Set the base project name
project_name_base = 'getting-started-tutorial'
# Initialize the MLRun environment and save the project name and artifacts path
project_name, artifact_path = mlrun.set_environment(project=project_name_base,
                                                    user_project=True)
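Optionally, you can print the returned values to verify the configuration; for example (a small optional check):

# Optional: verify the active project name and artifacts path
print(f'Project name: {project_name}')
print(f'Artifacts path: {artifact_path}')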

Marking The Beginning of Your Function Code

The following code uses the # nuclio: start-code marker annotation to instruct MLRun to start processing code only from this location.

Note: You can add code to define function dependencies and perform additional configuration after the # nuclio: start-code marker and before the # nuclio: end-code marker.

# nuclio: start-code

Step 2: Creating a Training Function

An essential component of artifact management and versioning is storing a model version. This allows users to experiment with different models and compare their performance without having to worry about losing previous results.

The simplest way to store a model named my_model, for example, is with the following code:

from cloudpickle import dumps
model_data = dumps(model)
context.log_model(key='my_model', body=model_data, model_file='my_model.pkl')

You can also store any related metrics by providing a dictionary in the metrics parameter, such as metrics={'accuracy': 0.9}. Furthermore, you can use the extra_data parameter to pass any additional data that you wish to store along with the model; for example extra_data={'confusion': confusion.target_path}.
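For example, a minimal sketch that combines both parameters when logging a model (assuming a trained model object and a previously logged confusion artifact) might look like this:

from cloudpickle import dumps

# Sketch: log the pickled model together with evaluation metrics and
# extra data that references a previously logged artifact
context.log_model(key='my_model',
                  body=dumps(model),
                  model_file='my_model.pkl',
                  metrics={'accuracy': 0.9},
                  extra_data={'confusion': confusion.target_path})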

You can use the eval_model_v2 utility method (imported from mlrun.mlutils.plots in the code below) to calculate model metrics.

The following example implements a simple model that's trained using scikit-learn (in a real-world scenario, you would typically send the data as input to the function). The last two statements evaluate the model and log it.

from sklearn import linear_model
from sklearn import datasets
from sklearn.model_selection import train_test_split
from cloudpickle import dumps
import pandas as pd

from mlrun.execution import MLClientCtx
from mlrun.datastore import DataItem
from mlrun.mlutils.data import get_sample
from mlrun.mlutils.plots import eval_model_v2

def train_iris(context: MLClientCtx,
               dataset: DataItem,
               label_column: str = "labels"):

    raw, labels, header = get_sample(dataset, sample=-1, label=label_column)
    
    # Basic scikit-learn Iris data-set logistic-regression model
    X_train, X_test, y_train, y_test = train_test_split(
        raw, labels, test_size=0.2, random_state=42)
    
    context.log_dataset('train_set', 
                        df=pd.concat([X_train, y_train.to_frame()], axis=1),
                        format='csv', index=False, 
                        artifact_path=context.artifact_subpath('data'))

    context.log_dataset('test_set', 
                        df=pd.concat([X_test, y_test.to_frame()], axis=1),
                        format='csv', index=False, 
                        labels={"data-type": "held-out"},
                        artifact_path=context.artifact_subpath('data'))
    
    model = linear_model.LogisticRegression(max_iter=10000)
    model.fit(X_train, y_train)
    
    # Evaluate model results and get the evaluation metrics
    eval_metrics = eval_model_v2(context, X_test, y_test, model)
    
    # Log model
    context.log_model("model",
                      body=dumps(model),
                      artifact_path=context.artifact_subpath("models"),
                      extra_data=eval_metrics, 
                      model_file="model.pkl",
                      metrics=context.results,
                      labels={"class": "sklearn.linear_model.LogisticRegression"})

Marking The End of Your Function Code

The following code uses the # nuclio: end-code marker code annotation to mark the end of the code section that should be converted to your MLRun function (which began with the # nuclio: start-code annotation) and instruct MLRun to stop parsing the notebook at this point.

Important: Don’t remove the start-code and end-code annotation cells.

# nuclio: end-code

Converting the Code to an MLRun Function

Use the MLRun code_to_function method to convert the selected portions of your notebook code into an MLRun function in your project — a function object with embedded code, which can run on the cluster.

The following code converts your local train_iris function, which is defined between the # nuclio: start-code and # nuclio: end-code annotations that mark the notebook code to convert (see the previous code cells), into a train_iris_func MLRun function. Because the project parameter of the conversion method isn't set, the MLRun function is added to the default project that's configured for your environment (as stored in the project_name variable; see the set_environment call in the previous steps). The code sets the following code_to_function parameters:

  • name — the name of the new MLRun function (train_iris).

  • handler — the name of the function-handler method (train_iris; the default is main).

  • kind — the function’s runtime type (job for a Python process).

  • image — the name of the container image to use for running the job — “mlrun/mlrun”. This image contains the basic machine-learning Python packages (such as scikit-learn).

train_iris_func = mlrun.code_to_function(name='train_iris',
                                         handler='train_iris',
                                         kind='job',
                                         image='mlrun/mlrun')

Mounting a Persistent Volume

When running jobs in Kubernetes, to allow a job to read or write data you need to give it access to a Persistent Volume (PV). This connects your function to your environment’s shared file system and allows you to pass data from your environment to the function and get back the results (plots) directly into your notebook.

A Persistent Volume Claim (PVC) is a request to access a PV. In MLRun, you issue such requests by calling the function's apply method and passing it a mount method. mount_v3io creates a PVC for the Iguazio Data Science Platform's V3IO volume; mount_pvc is a generic mount function for other storage types (such as NFS). The tutorial example uses the auto_mount method, which automatically selects the appropriate mount method for your environment: calling auto_mount without any parameters accesses the Iguazio Data Science Platform's V3IO volume when the V3IO_ACCESS_KEY and V3IO_USERNAME environment variables are set; otherwise, the method attempts to access a Kubernetes PVC volume when an MLRUN_PVC_MOUNT=<pvc-name>:<mount-path> environment variable is set. You can also explicitly set the pvc_name and volume_mount_path parameters to a specific PVC name and volume mount path instead of relying on environment variables.

from mlrun.platforms import auto_mount
train_iris_func = train_iris_func.apply(auto_mount())
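If you don't want to rely on environment variables, you can pass the PVC details to auto_mount explicitly; for example (a sketch in which the PVC name and mount path are placeholders):

# Sketch: explicitly select a Kubernetes PVC mount instead of relying on
# environment variables ('my-pvc' and '/home/jovyan/data' are placeholders)
train_iris_func = train_iris_func.apply(
    auto_mount(pvc_name='my-pvc', volume_mount_path='/home/jovyan/data'))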

Running the Function on a Cluster

Use the following code to run your function on a cluster.

# Store URI of the cleaned data set that was generated in Part 1
dataset = f'store://{project_name}/prep_data_cleaned_data'
dataset
'store://getting-started-tutorial-iguazio/prep_data_cleaned_data'
train_run = train_iris_func.run(inputs={'dataset': dataset},
                                params={'label_column': 'label'})
> 2021-01-25 08:37:23,350 [info] starting run train-iris-train_iris uid=151f5607321244799eb665a781970a2b DB=http://mlrun-api:8080
> 2021-01-25 08:37:23,566 [info] Job is running in the background, pod: train-iris-train-iris-blw6w
> 2021-01-25 08:37:28,682 [info] run executed, status=completed
final state: completed
project: getting-started-tutorial-iguazio
iter: 0
start: Jan 25 08:37:27
state: completed
name: train-iris-train_iris
labels: v3io_user=iguazio, kind=job, owner=iguazio, host=train-iris-train-iris-blw6w
inputs: dataset
parameters: label_column=label
results: accuracy=1.0, test-error=0.0, auc-micro=1.0, auc-weighted=1.0, f1-score=1.0, precision_score=1.0, recall_score=1.0
artifacts: train_set, test_set, confusion-matrix, precision-recall-multiclass, roc-multiclass, model
to track results use .show() or .logs() or in CLI: 
!mlrun get run 151f5607321244799eb665a781970a2b --project getting-started-tutorial-iguazio , !mlrun logs 151f5607321244799eb665a781970a2b --project getting-started-tutorial-iguazio
> 2021-01-25 08:37:29,709 [info] run executed, status=completed

Reviewing the Run Output

You can view extensive run information and artifacts from Jupyter Notebook and the MLRun dashboard, as well as browse the project artifacts from the dashboard.

The following code extracts and displays the model from the training-job outputs.

print(train_run.outputs['model'])
store://artifacts/getting-started-tutorial-iguazio/train-iris-train_iris_model:151f5607321244799eb665a781970a2b

Your project’s artifacts directory contains the results for the executed training job. The plots subdirectory has HTML output artifacts for the selected run iteration; (the data subdirectory contains the artifacts for the test data set).
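If you want a quick overview of everything that the run produced, you can also iterate over the run outputs dictionary; for example (a small optional sketch):

# Optional: list all result values and artifact store URIs that the training run logged
for key, value in train_run.outputs.items():
    print(f'{key}: {value}')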

Use the following code to extract and display information from the run object: the accuracy that was achieved with the model, and the confusion-matrix and roc-multiclass HTML output artifacts for the run.

print(f'Accuracy: {train_run.outputs["accuracy"]}')
Accuracy: 1.0
# Display HTML output artifacts
from IPython.display import display, HTML
display(HTML(filename=train_run.outputs['confusion-matrix']))

Confusion Matrix - Normalized Plot

display(HTML(filename=train_run.outputs['roc-multiclass']))

Multiclass ROC Curve

Exploring the Data with pandas DataFrames

Run the following code to use pandas DataFrames to read your data set, extract some basic statistics, and display them.

# Read your data set
df = mlrun.run.get_dataitem(train_run.outputs['test_set']).as_df()
# Display a portion of the read data
df.head()
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) label
0 6.1 2.8 4.7 1.2 1
1 5.7 3.8 1.7 0.3 0
2 7.7 3.8 6.7 2.2 2
3 6.0 2.9 4.5 1.5 1
4 6.8 2.8 4.8 1.4 1
# Calculate and display the number of data-set items
print(f'Total number of rows: {len(df)}')
Total number of rows: 30
# Display statistics grouped by label
df.groupby(['label']).describe()
(The output shows, for each label, the count, mean, std, min, 25%, 50%, 75%, and max of each of the four feature columns; 3 rows × 32 columns.)

Step 3: Exploring the Data with an MLRun Marketplace Function

You can perform further data exploration by leveraging the MLRun functions marketplace (a.k.a. “the MLRun functions hub”). This marketplace is a centralized location for open-source contributions of function components that are commonly used in machine-learning development. The location of the marketplace is configured via the hub_url MLRun configuration. By default, it points to the mlrun/functions GitHub repository.

Note: hub_url points to the raw GitHub URL, and can be defined with {name} and {tag} annotations, as done in the default value — https://raw.githubusercontent.com/mlrun/functions/{tag}/{name}/function.yaml.
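If you need to check or override the marketplace location in your environment, you can do so from code; for example (a sketch in which the alternative repository URL is just a placeholder):

# Inspect the current functions-marketplace location
print(mlrun.mlconf.hub_url)

# Optionally point MLRun at a different functions repository (placeholder URL)
# mlrun.mlconf.hub_url = 'https://raw.githubusercontent.com/my-org/my-functions/{tag}/{name}/function.yaml'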

This step uses the describe marketplace function, which performs data exploration on a provided data set. The function is used to extract information from your data set, analyze it, and visualize relevant information in different ways.

Adding an Exploration Function

Use the import_function MLRun method, which adds or updates a function object in a project, to load the describe MLRun marketplace function into a new describe project function. The tutorial code sets the first import_function parameter, url, which identifies the function to load.

Note: MLRun supports multiple types of URL formats. The example uses the hub://<function name> format to point to the describe function-code directory in the MLRun functions marketplace ('hub://describe'). You can add :<tag> to this syntax to load a specific function tag — hub://<function_name>:<tag>; replace the <function name> and <tag> placeholders with the desired function name and tag.

describe = mlrun.import_function('hub://describe').apply(auto_mount())
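If you want to pin the function to a specific tag rather than use the default, the call would look like this (a sketch; 'master' is only an illustrative tag name):

# Sketch: load a specific tag of the describe marketplace function
describe = mlrun.import_function('hub://describe:master').apply(auto_mount())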

Viewing the Function Documentation

Use the doc method to view the embedded documentation of the describe function.

describe.doc()
function: describe
describe and visualizes dataset stats
default handler: summarize
entry points:
  summarize: Summarize a table
    context(MLClientCtx)  - the function context, default=
    table(DataItem)  - MLRun input pointing to pandas dataframe (csv/parquet file path), default=
    label_column(str)  - ground truth column label, default=None
    class_labels(List[str])  - label for each class in tables and plots, default=[]
    plot_hist(bool)  - (True) set this to False for large tables, default=True
    plots_dest(str)  - destination folder of summary plots (relative to artifact_path), default=plots
    update_dataset  - when the table is a registered dataset update the charts in-place, default=False

Running the Exploration Function

Run the following code to execute the describe project function as a Kubernetes job by using the MLRun run method. The returned run object is stored in a describe_run variable.

The location of the data set is the only input that you need to provide. This information is provided as a table input artifact that points to the test_set output artifact of the train_run job that you ran in the previous step.

describe_run = describe.run(params={'label_column': 'label'},
                            inputs={"table":
                                    train_run.outputs['test_set']})
> 2021-01-25 08:37:29,984 [info] starting run describe-summarize uid=aa1a8315f0fa4e70be7dc8a8071f7c80 DB=http://mlrun-api:8080
> 2021-01-25 08:37:30,129 [info] Job is running in the background, pod: describe-summarize-s2kxq
> 2021-01-25 08:37:38,744 [info] run executed, status=completed
final state: completed
project: getting-started-tutorial-iguazio
iter: 0
start: Jan 25 08:37:34
state: completed
name: describe-summarize
labels: v3io_user=iguazio, kind=job, owner=iguazio, host=describe-summarize-s2kxq
inputs: table
parameters: label_column=label
artifacts: histograms, violin, imbalance, imbalance-weights-vec, correlation-matrix, correlation
to track results use .show() or .logs() or in CLI: 
!mlrun get run aa1a8315f0fa4e70be7dc8a8071f7c80 --project getting-started-tutorial-iguazio , !mlrun logs aa1a8315f0fa4e70be7dc8a8071f7c80 --project getting-started-tutorial-iguazio
> 2021-01-25 08:37:39,317 [info] run executed, status=completed

Reviewing the Run Output

The output cell for your code execution contains a run-information table. You can also view run information in the MLRun dashboard; see the output-review information in Step 2, but this time look for the describe-summarize job and its related artifacts.

The describe function generates three HTML output artifacts, which provide visual insights for your data set — histograms, imbalance, and correlation. The artifacts are stored as HTML files in your project’s artifacts directory, under <project artifacts path>/jobs/plots/. The following code displays the artifact files in the notebook.

# Display the `histograms` artifact
display(HTML(filename=describe_run.outputs['histograms']))

histograms

# Display the `imbalance` artifact
display(HTML(filename=describe_run.outputs['imbalance']))

imbalance

# Display the `correlation` artifact
display(HTML(filename=describe_run.outputs['correlation']))

Correlation Matrix

Step 4: Testing Your Model

Now that you have a trained model, you can test it: run a task that uses the test_classifier marketplace function to run the trained model against the test data set that was generated by the training job in Step 2.

Adding a Test Function

Run the following code to add to your project a test function that uses the test_classifier marketplace function code, and create a related test function object.

test = mlrun.import_function('hub://test_classifier').apply(auto_mount())

Running a Model-Testing Job

Configure parameters for the test function (params), and provide the trained model and the test data set from the train_run job as input artifacts (inputs).

test_run = test.run(name="test",
                    params={"label_column": "label",
                            "plots_dest": path.join("plots", "test")},
                    inputs={"models_path": train_run.outputs['model'],
                            "test_set": train_run.outputs['test_set']})
> 2021-01-25 08:37:39,556 [info] starting run test uid=38900a3da1f247b8bb309529fe1593fe DB=http://mlrun-api:8080
> 2021-01-25 08:37:39,728 [info] Job is running in the background, pod: test-wkdgh
> 2021-01-25 08:37:44,922 [info] run executed, status=completed
final state: completed
project: getting-started-tutorial-iguazio
iter: 0
start: Jan 25 08:37:44
state: completed
name: test
labels: v3io_user=iguazio, kind=job, owner=iguazio, host=test-wkdgh
inputs: models_path, test_set
parameters: label_column=label, plots_dest=plots/test
results: accuracy=1.0, test-error=0.0, auc-micro=1.0, auc-weighted=1.0, f1-score=1.0, precision_score=1.0, recall_score=1.0
artifacts: confusion-matrix, precision-recall-multiclass, roc-multiclass, test_set_preds
to track results use .show() or .logs() or in CLI: 
!mlrun get run 38900a3da1f247b8bb309529fe1593fe --project getting-started-tutorial-iguazio , !mlrun logs 38900a3da1f247b8bb309529fe1593fe --project getting-started-tutorial-iguazio
> 2021-01-25 08:37:45,856 [info] run executed, status=completed

Reviewing the Run Output

Check the output information for your run in Jupyter Notebook and on the MLRun dashboard.

Use the following code to display information from the run object: the accuracy of the model, and the confusion-matrix and roc-multiclass HTML output artifacts.

# Display the model accuracy
print(f'Test Accuracy: {test_run.outputs["accuracy"]}')

# Display HTML output artifacts
display(HTML(filename=test_run.outputs['confusion-matrix']))
display(HTML(filename=test_run.outputs['roc-multiclass']))
Test Accuracy: 1.0

Confusion Matrix - Normalized Plot

Multiclass ROC Curve

Done!

Congratulations! You've completed Part 2 of the MLRun getting-started tutorial. Proceed to Part 3 to learn how to deploy and serve your model using a serverless function.