Part 2: Training an ML Model

This part of the MLRun getting-started tutorial walks you through the steps for training a machine-learning (ML) model, including data exploration and model testing.

The tutorial consists of the following steps:

  1. Setup and Configuration

  2. Creating a training function

  3. Exploring the data with an MLRun marketplace function

  4. Testing your model

By the end of this tutorial you'll know how to:

  • Create a training function, store models, and track experiments while running them.

  • Use artifacts as inputs to functions.

  • Leverage the MLRun functions marketplace.

  • View plot artifacts.

Prerequisites

The following steps are a continuation of the previous part of this getting-started tutorial and rely on the generated outputs. Therefore, make sure to first run part 1 of the tutorial.

Step 1: Setup and Configuration

Initializing Your MLRun Environment

Use the get_or_create_project MLRun method to create a new project or fetch it from the DB/repository if it already exists. Set the project and user_project parameters to the same values that you used in the call to this method in the Part 1: MLRun Basics tutorial notebook.

import mlrun

# Set the base project name
project_name_base = 'getting-started'

# Initialize the MLRun project object
project = mlrun.get_or_create_project(project_name_base, context="./", user_project=True)
> 2022-02-08 19:46:23,537 [info] loaded project getting-started from MLRun DB

Marking The Beginning of Your Function Code

The following code uses the # mlrun: start-code marker comment to instruct MLRun to start parsing the notebook code from this location.

Note: You can add code to define function dependencies and perform additional configuration after the # mlrun: start-code marker and before the # mlrun: end-code marker.

# mlrun: start-code

Step 2: Creating a Training Function

Training functions generate models and various model statistics. You'll want to store the models together with all the relevant data, metadata, and measurements, which MLRun can do automatically by using its auto-logging capabilities.

MLRun applies this MLOps functionality through the framework-specific apply_mlrun() method, which manages the training process and automatically logs all the framework-specific model details, data, metadata, and metrics.

To log the training results and store a model named my_model, we simply need to add the following lines:

from mlrun.frameworks.sklearn import apply_mlrun
apply_mlrun(model=model, model_name='my_model', x_test=X_test, y_test=y_test)

The training job will automatically generate a set of results and versioned artifacts (run train_run.outputs to view the job outputs):

{'accuracy': 1.0,
 'f1_score': 1.0,
 'precision_score': 1.0,
 'recall_score': 1.0,
 'auc-micro': 1.0,
 'auc-macro': 1.0,
 'auc-weighted': 1.0,
 'feature-importance': 'v3io:///projects/getting-started-admin/artifacts/feature-importance.html',
 'test_set': 'store://artifacts/getting-started-admin/train-iris-train_iris_test_set:86fd0a3754c34f75b8afc5c2464959fc',
 'confusion-matrix': 'v3io:///projects/getting-started-admin/artifacts/confusion-matrix.html',
 'roc-curves': 'v3io:///projects/getting-started-admin/artifacts/roc-curves.html',
 'my_model': 'store://artifacts/getting-started-admin/my_model:86fd0a3754c34f75b8afc5c2464959fc'}
The following cell defines the complete training function:

from sklearn import ensemble
from sklearn.model_selection import train_test_split
from mlrun.frameworks.sklearn import apply_mlrun
import mlrun
def train_iris(dataset: mlrun.DataItem, label_column: str):
    
    # Initialize our dataframes
    df = dataset.as_df()
    X = df.drop(label_column, axis=1)
    y = df[label_column]

    # Train/Test split Iris data-set
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
    # Pick an ML model
    model = ensemble.RandomForestClassifier()
    
    # Wrap our model with MLRun features; specify the test data set for analysis and accuracy measurements
    apply_mlrun(model=model, model_name='my_model', x_test=X_test, y_test=y_test)
    
    # Train our model
    model.fit(X_train, y_train)

Marking The End of Your Function Code

The following code uses the # mlrun: end-code marker annotation to mark the end of the code section that should be converted to your MLRun function (the section that began with the # mlrun: start-code annotation), and to instruct MLRun to stop parsing the notebook at this point.

Important: Don’t remove the start-code and end-code annotation cells.

# mlrun: end-code

Converting the Code to an MLRun Function

Use the MLRun code_to_function method to convert the selected portions of your notebook code into an MLRun function in your project — a function object with embedded code, which can run on the cluster.

The following code converts your local train_iris function code, which is defined between the # mlrun: start-code and # mlrun: end-code annotations (see the previous code cells), into a train_iris_func MLRun function. The function is stored and run under the current project (which was specified in the get_or_create_project call above).

The code sets the following code_to_function parameters:

  • name — the name of the new MLRun function (train_iris).

  • handler — the name of the function-handler method (train_iris; the default is main).

  • kind — the function’s runtime type (job for a Python process).

  • image — the name of the container image to use for running the job — “mlrun/mlrun”. This image contains the basic machine-learning Python packages (such as scikit-learn).

train_iris_func = mlrun.code_to_function(name='train_iris',
                                         handler='train_iris',
                                         kind='job',
                                         image='mlrun/mlrun')
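
Optionally, you can also register the new function on the project so that it's stored as part of the project spec. This is a minimal sketch and not part of the original tutorial flow; set_function and save are standard MlrunProject methods:

# Optional: register the function on the project spec so it can be
# retrieved later (e.g., with project.get_function('train_iris'))
project.set_function(train_iris_func)
project.save()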

Running the Function

Run the function over the data set that you prepared in Part 1. Note that local=True executes the function code as a local process in the Jupyter environment; see the cluster-run variant after the run output below.

# Our dataset location (URI)
dataset = project.get_artifact_uri('prep_data_cleaned_data')

train_run = train_iris_func.run(inputs={'dataset': dataset},
                                params={'label_column': 'label'},
                                local=True)
> 2022-02-08 19:58:17,705 [info] starting run train-iris-train_iris uid=9d6806aec3134110bd5358479152aa5d DB=http://mlrun-api:8080
project: getting-started-admin
iter: 0 | start: Feb 08 19:58:17 | state: completed | name: train-iris-train_iris
labels: v3io_user=admin, kind=, owner=admin, host=jupyter-b7945bb6c-zv48d
inputs: dataset
parameters: label_column=label
results: accuracy=1.0, f1_score=1.0, precision_score=1.0, recall_score=1.0, auc-micro=1.0, auc-macro=1.0, auc-weighted=1.0
artifacts: feature-importance, test_set, confusion-matrix, roc-curves, my_model

> to track results use the .show() or .logs() methods or click here to open in UI
> 2022-02-08 19:58:19,385 [info] run executed, status=completed
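
The local=True flag ran the function as a local process inside the notebook environment, which is convenient for quick iteration and debugging. To run the same training job as a Kubernetes job on the cluster, simply omit the flag. A minimal sketch:

# Run the training job on the cluster; MLRun schedules a Kubernetes job
# that uses the mlrun/mlrun image specified in code_to_function above
train_run = train_iris_func.run(inputs={'dataset': dataset},
                                params={'label_column': 'label'})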

Reviewing the Run Output

You can view extensive run information and artifacts from Jupyter Notebook and the MLRun dashboard, as well as browse the project artifacts from the dashboard.

The following code extracts and displays the model from the training-job outputs.

train_run.outputs['my_model']
'store://artifacts/getting-started-admin/my_model:86fd0a3754c34f75b8afc5c2464959fc'
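
If you want to load the trained model back into the notebook, one option is MLRun's get_model helper. The following is a minimal sketch that assumes the default pickle format produced by the scikit-learn auto logging:

from mlrun.artifacts import get_model
from cloudpickle import load

# Fetch the logged model artifact by its store URI (assumes a .pkl model file)
model_file, model_artifact, extra_data = get_model(train_run.outputs['my_model'], suffix='.pkl')
model = load(open(model_file, 'rb'))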

Your project’s artifacts directory contains the results of the executed training job. The plots subdirectory holds the HTML output artifacts for the selected run iteration, and the data subdirectory contains the artifacts for the test data set.

Use the following code to extract and display information from the run object — the accuracy that was achieved with the model, and the confusion-matrix and feature-importance HTML output artifacts.

print(f'Accuracy: {train_run.outputs["accuracy"]}')
Accuracy: 1.0
# Display HTML output artifacts
train_run.artifact('confusion-matrix').show()
train_run.artifact('feature-importance').show()
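
The object returned by run.artifact() is an MLRun DataItem, so you can also save a local copy of an artifact. A small sketch (the target file name is arbitrary):

# Save a local copy of the confusion-matrix plot
train_run.artifact('confusion-matrix').download('confusion-matrix.html')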

Exploring the Data with pandas DataFrames

Run the following code to use pandas DataFrames to read your data set, extract some basic statistics, and display them.

# Read your data set
df = train_run.artifact('test_set').as_df()
df.head()
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) label
0 6.1 2.8 4.7 1.2 1
1 5.7 3.8 1.7 0.3 0
2 7.7 2.6 6.9 2.3 2
3 6.0 2.9 4.5 1.5 1
4 6.8 2.8 4.8 1.4 1
# Display statistics grouped by label
df.groupby(['label']).describe()
      sepal length (cm)                                              sepal width (cm)     ... petal length (cm)  petal width (cm)
                  count      mean       std  min   25%   50%    75%  max  count      mean ...       75%  max     count      mean       std  min  25%   50%  75%  max
label                                                                                     ...
0                  10.0  5.070000  0.346570  4.7  4.80  4.95  5.325  5.7   10.0  3.330000 ...      1.60  1.7      10.0  0.250000  0.108012  0.1  0.2  0.25  0.3  0.4
1                   9.0  6.011111  0.391933  5.6  5.70  6.00  6.200  6.8    9.0  2.766667 ...      4.70  4.8       9.0  1.344444  0.166667  1.1  1.2  1.30  1.5  1.6
2                  11.0  6.781818  0.551032  6.1  6.45  6.70  6.850  7.9   11.0  3.000000 ...      5.85  6.9      11.0  2.118182  0.194001  1.8  2.0  2.20  2.3  2.3

3 rows × 32 columns

Step 3: Exploring the Data with an MLRun Marketplace Function

You can perform further data exploration by leveraging the MLRun functions marketplace (a.k.a. “the MLRun functions hub”). The marketplace is a centralized location for open-source contributions of function components that are commonly used in machine-learning development. The location of the marketplace is centrally configured; by default, it points to the mlrun/functions GitHub repository.

This step uses the describe marketplace function, which performs data exploration on a provided data set. The function is used to extract information from your data set, analyze it, and visualize relevant information in different ways.

Adding an Exploration Function

Use the import_function MLRun method, which adds or updates a function object in a project, to load the describe MLRun marketplace function into a new describe project function. The tutorial code sets the first import_function parameter — url — which identifies the function to load.

Note: MLRun supports multiple types of URL formats. The example uses the hub://<function name> format to point to the describe function-code directory in the MLRun functions marketplace ('hub://describe'). You can add :<tag> to this syntax to load a specific function tag — hub://<function_name>:<tag>; replace the <function name> and <tag> placeholders with the desired function name and tag.

describe = mlrun.import_function('hub://describe')
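
For example, a version-pinned import would look like the following (the tag shown is purely illustrative):

# Hypothetical: pin the describe function to a specific tag
describe = mlrun.import_function('hub://describe:1.1.0')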

Viewing the Function Documentation

Use the doc method to view the embedded documentation of the describe function.

describe.doc()
function: describe
describe and visualizes dataset stats
default handler: summarize
entry points:
  summarize: Summarize a table
    context(MLClientCtx)  - the function context, default=
    table(DataItem)  - MLRun input pointing to pandas dataframe (csv/parquet file path), default=
    label_column(str)  - ground truth column label, default=None
    class_labels(List[str])  - label for each class in tables and plots, default=[]
    plot_hist(bool)  - (True) set this to False for large tables, default=True
    plots_dest(str)  - destination folder of summary plots (relative to artifact_path), default=plots
    update_dataset  - when the table is a registered dataset update the charts in-place, default=False
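
The documented parameters map directly to the params dict that you pass at run time. For example, for a very large table you could disable histogram plotting (a hypothetical variant; the next step runs the function with its defaults):

# Variant: skip histogram plots for large tables (see plot_hist above)
describe_run = describe.run(params={'label_column': 'label', 'plot_hist': False},
                            inputs={"table": train_run.outputs['test_set']})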

Running the Exploration Function

Run the following code to execute the describe project function as a Kubernetes job by using the MLRun run method. The returned run object is stored in a describe_run variable.

The location of the data set is the only input that you need to provide. This information is provided as a table input artifact that points to the test_set output artifact of the train_run job that you ran in the previous step.

describe_run = describe.run(params={'label_column': 'label'},
                            inputs={"table": train_run.outputs['test_set']})
> 2022-02-08 19:51:32,555 [info] starting run describe-summarize uid=d99feda4fcaa4d70b32e5032c7a7d41b DB=http://mlrun-api:8080
> 2022-02-08 19:51:32,719 [info] Job is running in the background, pod: describe-summarize-6mxsz
> 2022-02-08 19:51:41,994 [info] run executed, status=completed
final state: completed
project: getting-started-admin
iter: 0 | start: Feb 08 19:51:36 | state: completed | name: describe-summarize
labels: v3io_user=admin, kind=job, owner=admin, mlrun/client_version=0.10.0-rc7, host=describe-summarize-6mxsz
inputs: table
parameters: label_column=label
artifacts: histograms, violin, imbalance, imbalance-weights-vec, correlation-matrix, correlation

> to track results use the .show() or .logs() methods or click here to open in UI
> 2022-02-08 19:51:52,084 [info] run executed, status=completed

Reviewing the Run Output

The output cell for your code execution contains a run-information table. You can also view run information in the MLRun dashboard; see the output-review information in Step 2, but this time look for the describe-summarize job and its related artifacts.

The describe function generates several output artifacts that provide visual insights into your data set; this step displays three of the HTML plots — histograms, imbalance, and correlation. The artifacts are stored as HTML files in your project’s artifacts directory, under <project artifacts path>/jobs/plots/. The following code displays the artifact files in the notebook.

# Display the `histograms` artifact
describe_run.artifact('histograms').show()

histograms

# Display the `imbalance` artifact
describe_run.artifact('imbalance').show()

imbalance

# Display the `correlation` artifact
describe_run.artifact('correlation').show()

Correlation Matrix

Step 4: Testing Your Model

Now that you have a trained model, you can test it: run a task that uses the test_classifier marketplace function to run the trained model against the test data set that was returned by the training task (train_run) in Step 2.

Adding a Test Function

Run the following code to add to your project a test function that uses the test_classifier marketplace function code, and create a related test function object.

test = mlrun.import_function('hub://test_classifier')
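
As with the describe function, you can view the embedded documentation of the test_classifier function to inspect its handlers and parameters:

test.doc()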

Running a Model-Testing Job

Configure parameters for the test function (params), and provide the selected trained model from the train_run job as an input artifact (inputs).

test_run = test.run(name="test",
                    params={"label_column": "label"},
                    inputs={"models_path": train_run.outputs['model'],
                            "test_set": train_run.outputs['test_set']})
> 2022-02-08 19:51:52,306 [info] starting run test uid=529d6b7a24f34f6b8071005365c254b0 DB=http://mlrun-api:8080
> 2022-02-08 19:51:52,460 [info] Job is running in the background, pod: test-b29jj
> 2022-02-08 19:51:58,625 [info] run executed, status=completed
final state: completed
project: getting-started-admin
iter: 0 | start: Feb 08 19:51:55 | state: completed | name: test
labels: v3io_user=admin, kind=job, owner=admin, mlrun/client_version=0.10.0-rc7, host=test-b29jj
inputs: models_path, test_set
parameters: label_column=label
results: accuracy=1.0, test-error=0.0, auc-micro=1.0, auc-weighted=1.0, f1-score=1.0, precision_score=1.0, recall_score=1.0
artifacts: confusion-matrix, feature-importances, precision-recall-multiclass, roc-multiclass, test_set_preds

> to track results use the .show() or .logs() methods or click here to open in UI
> 2022-02-08 19:52:01,766 [info] run executed, status=completed

Reviewing the Run Output

Check the output information for your run in Jupyter Notebook and on the MLRun dashboard.

Use the following code to display information from the run object — the accuracy of the model, and the confusion and roc HTML output artifacts.

print(f'Test Accuracy: {test_run.outputs["accuracy"]}')
Test Accuracy: 1.0
test_run.artifact('confusion-matrix').show()

Confusion Matrix - Normalized Plot

test_run.artifact('roc-multiclass').show()

Multiclass ROC Curve

Done!

Congratulations! You’ve completed Part 2 of the MLRun getting-started tutorial. Proceed to Part 3: Model Serving to learn how to deploy and serve your model using a serverless function.