Part 1: MLRun Basics

Part 1 of the getting-started tutorial introduces you to the basics of working with functions by using the MLRun open-source MLOps orchestration framework.

The tutorial takes you through the following steps:

  1. Installation and Setup

  2. Creating a basic function and running it locally

  3. Running the function on the cluster

  4. Viewing jobs on the dashboard (UI)

  5. Scheduling jobs

By the end of this tutorial you’ll know how to

  • Create a basic data-preparation MLRun function.

  • Store data artifacts to be used and managed in a central database.

  • Run your code on a distributed Kubernetes cluster without any DevOps overhead.

  • Schedule jobs to run on the cluster.

Using MLRun Remotely

This tutorial is aimed at running your project from a local Jupyter Notebook service in the same environment in which MLRun is installed and running. However, as a developer you might want to develop your project from a remote location using your own IDE (such as a local Jupyter Notebook or PyCharm), and connect to the MLRun environment remotely. To learn how to use MLRun from a remote IDE, see Setting a Remote Environment.

Introduction to MLRun

MLRun is an open-source MLOps framework that offers an integrative approach to managing your machine-learning pipelines from early development through model development to full pipeline deployment in production. MLRun offers a convenient abstraction layer to a wide variety of technology stacks while empowering data engineers and data scientists to define features and models.

MLRun provides the following key benefits:

  • Rapid deployment of code to production pipelines

  • Elastic scaling of batch and real-time workloads

  • Feature management — ingestion, preparation, and monitoring

  • Works anywhere — your local IDE, multi-cloud, or on-prem

MLRun can be installed over Kubernetes, and is also available as a managed service in the Iguazio Data Science Platform.

▶ For more information about MLRun, see the MLRun Architecture and Vision documentation.

Step 1: Installation and Setup

For information on how to install and configure MLRun over Kubernetes, see the MLRun installation guide. To install the MLRun package, run pip install mlrun with the MLRun version that matches your MLRun service.

For Iguazio Data Science Platform Users

If you are using the Iguazio Data Science Platform, MLRun is available as a default (pre-deployed) shared service.
You can run !/User/align_mlrun.sh to install the MLRun package or upgrade the version of an installed package. By default, the script attempts to download the latest version of the MLRun Python package that matches the version of the running MLRun service.

Kernel Restart

After installing or updating the MLRun package, restart the notebook kernel in your environment!

Initializing Your MLRun Environment

MLRun projects are used for namespacing and grouping multiple runs, functions, workflows, and artifacts. Projects can be created via the API or UI, or implicitly as a side effect of running a job or saving an object (such as a function or artifact) to a specific project. For more information about MLRun projects, see the MLRun projects documentation.

Use the set_environment MLRun method to configure the working environment and default configuration. This method returns a tuple with the current project name and artifacts path.

Set the method’s project parameter to your selected project name. You can also optionally set the user_project parameter to True to automatically append the username of the running user to the project name specified in the project parameter, resulting in a <project>-<username> project name; this is useful for avoiding project-name conflicts among different users.
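As a plain-Python illustration of this naming rule (the derive_project_name helper below is hypothetical; MLRun applies the equivalent logic internally when user_project=True):

```python
def derive_project_name(base: str, username: str, user_project: bool = True) -> str:
    """Illustrative only: mimic how user_project=True yields <project>-<username>."""
    return f'{base}-{username}' if user_project else base

# For a user named "iguazio", the tutorial project name becomes:
name = derive_project_name('getting-started-tutorial', 'iguazio')
```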

You can optionally pass additional parameters to set_environment, as detailed in the MLRun API reference. For example:

  • You can set the artifact_path parameter to override the default path for storing project artifacts, as explained in the MLRun artifacts documentation.

  • When using a remote MLRun or Kubernetes cluster, you can set the api_path parameter to the URL of your remote environment, and set the access_key parameter to an authentication key for this environment.

Run the following code to initialize your MLRun environment to use a “getting-started-tutorial-<username>” project and store the project artifacts in the default artifacts path:

from os import path
import mlrun

# Set the base project name
project_name_base = 'getting-started-tutorial'
# Initialize the MLRun environment and save the project name and artifacts path
project_name, artifact_path = mlrun.set_environment(project=project_name_base,
                                                    user_project=True)
                                                    
# Display the current project name and artifacts path
print(f'Project name: {project_name}')
print(f'Artifacts path: {artifact_path}')
Project name: getting-started-tutorial-iguazio
Artifacts path: /v3io/projects/{{run.project}}/artifacts

Step 2: Creating a Basic Function

This step introduces you to MLRun functions and artifacts and walks you through the process of converting a local function to an MLRun function.

Defining a Local Function

The following example code defines a data-preparation function (prep_data) that reads (ingests) a CSV file from the provided source URL into a pandas DataFrame; prepares (“cleans”) the data by changing the type of the categorical data in the specified label column; and returns the DataFrame and its length. In the next sub-step you’ll redefine this function and convert it to an MLRun function that leverages MLRun to perform the following tasks:

  • Reading the data

  • Logging the data to the MLRun database

import pandas as pd

# Ingest a data set
def prep_data(source_url, label_column):

    df = pd.read_csv(source_url)
    df[label_column] = df[label_column].astype('category').cat.codes    
    return df, df.shape[0]
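Because prep_data is plain Python, you can exercise it on a small in-memory CSV before involving MLRun; pandas read_csv accepts file-like objects, so no file on disk is needed (the three-row sample below is illustrative):

```python
import io
import pandas as pd

def prep_data(source_url, label_column):
    df = pd.read_csv(source_url)
    df[label_column] = df[label_column].astype('category').cat.codes
    return df, df.shape[0]

# A tiny sample in the same shape as the Iris CSV used later in the tutorial
sample = io.StringIO("sepal_length,label\n5.1,setosa\n4.9,setosa\n6.3,virginica\n")
df, num_rows = prep_data(sample, 'label')
# The string labels are replaced by integer category codes (setosa -> 0, virginica -> 1)
```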

Creating and Running Your First MLRun Function

MLRun Functions

MLRun jobs and pipelines run over serverless functions. A function comprises the function code and a specification (“spec”). The spec contains metadata for configuring related operational aspects, such as the image, required packages, CPU/memory/GPU resources, storage, and the environment. The different serverless runtime engines automatically transform the function code and spec into fully managed and elastic services that run over Kubernetes. Functions are versioned and can be generated from code or notebooks, or loaded from a marketplace.

To work with functions you need to be familiar with the following function components:

  • Context — a function-context object. The code can be set up to get parameters, secrets, and inputs from the context, as well as log run outputs, artifacts, tags, and metrics in the context.

  • Parameters — the parameters (arguments) that are passed to the functions.

  • Inputs — MLRun functions have a special inputs parameter for passing data objects (such as data sets, models, or files) as input to a function. Use this parameter to pass data items to a function. An MLRun data item (DataItem) represents either a single data item or a collection of data items (such as files, directories, and tables) for any type of data that is produced or consumed by functions or jobs. MLRun artifacts are versioned and contain metadata that describes one or more data items.

For more information, see the MLRun documentation on functions and data items.

MLRun Function Code

The following code demonstrates how to redefine your local data-preparation function to make it compatible with MLRun, and then convert the local notebook code into an MLRun function.

# nuclio: start-code
import mlrun
def prep_data(context, source_url: mlrun.DataItem, label_column='label'):

    # Convert the DataItem to a pandas DataFrame
    df = source_url.as_df()
    df[label_column] = df[label_column].astype('category').cat.codes    
    
    # Record the DataFrame length after the run
    context.log_result('num_rows', df.shape[0])

    # Store the data set in your artifacts database
    context.log_dataset('cleaned_data', df=df, index=False, format='csv')
# nuclio: end-code

The MLRun function has the following parameter changes compared to the original local function:

  • To effectively run your code in MLRun, you need to add a context parameter to your function (or alternatively, get the context by using get_or_create_ctx()). This allows you to log and retrieve information related to the function’s execution.

  • The tutorial example sets the source_url parameter to mlrun.DataItem to send a data item as input when the function is called (using the inputs parameter).

The example tutorial function code works as follows:

  • Obtain a pandas DataFrame from the source_url data item, by calling the as_df method.

  • Prepare (clean) the data, as done in the local-function implementation in the previous step.

  • Record the data length (number of rows) using the log_result method. This method records (logs) the values of standard function variables (such as int, float, string, and list).

  • Log the data-set artifact using the log_dataset method. This method saves and logs function data items and related metadata (i.e., logs function artifacts).
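You can smoke-test this handler logic without an MLRun service by passing lightweight stand-ins for the context and data-item objects. The stub classes below are illustrative only; they are not MLRun APIs:

```python
import io
import pandas as pd

class FakeDataItem:
    """Minimal stand-in for mlrun.DataItem: only as_df is needed here."""
    def __init__(self, csv_text):
        self._csv_text = csv_text
    def as_df(self):
        return pd.read_csv(io.StringIO(self._csv_text))

class FakeContext:
    """Records what the handler logs, mimicking the MLRun context surface."""
    def __init__(self):
        self.results = {}
        self.datasets = {}
    def log_result(self, key, value):
        self.results[key] = value
    def log_dataset(self, key, df, **kwargs):
        self.datasets[key] = df

def prep_data(context, source_url, label_column='label'):
    df = source_url.as_df()
    df[label_column] = df[label_column].astype('category').cat.codes
    context.log_result('num_rows', df.shape[0])
    context.log_dataset('cleaned_data', df=df, index=False, format='csv')

# Exercise the handler on a two-row in-memory CSV
ctx = FakeContext()
prep_data(ctx, FakeDataItem("sepal_length,label\n5.1,setosa\n6.3,virginica\n"))
```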

Converting the Notebook Code to a Function

Use the # nuclio: ... comment annotations of the nuclio-jupyter parser at the beginning of relevant code cells to identify the code that needs to be converted into an MLRun function. These annotations provide non-intrusive hints as to how you want to convert the notebook into a full function and function specification:

  • The # nuclio: ignore annotation identifies code that shouldn’t be included in the MLRun function (such as prints, plots, tests, and debug code).

  • The # nuclio: start-code and # nuclio: end-code annotations identify code to be converted to an MLRun function: everything before the start-code annotation and after the end-code annotation is ignored, and only code between these two annotations is converted. These annotations are used in the tutorial notebook instead of adding the ignore annotation to all cells that shouldn’t be converted.

    Note: You can use the nuclio: start-code and nuclio: end-code annotations only once in a notebook. If there are multiple uses, only the first use will be selected.

For more information about using the annotations and magic commands, see the nuclio-jupyter documentation.

Note

Don’t confuse the # nuclio annotations with the Nuclio serverless runtime: MLRun leverages the Nuclio Jupyter Notebook parser (nuclio-jupyter) across all runtimes.

The following code uses the code_to_function MLRun method to convert your local prep_data function code to a data_prep_func MLRun function.

The kind parameter of the code_to_function method determines the engine for running the code. MLRun allows running function code using different engines — such as Python, Spark, MPI, Nuclio, and Dask. The following example sets the kind parameter to job to run the code as a Python process (“job”).

# Convert the local prep_data function to an MLRun project function
data_prep_func = mlrun.code_to_function(name='prep_data', kind='job', image='mlrun/mlrun')

Running the MLRun Function Locally

Now you’re ready to run your MLRun function (data_prep_func). The following example uses the run MLRun method and sets its local parameter to True to run the function code locally within your Jupyter pod, meaning that the function uses the environment variables, volumes, and image that are running in this pod.

Note: When running a function locally, the function code is saved only in a temporary local directory and not in your project’s ML functions code repository. In the next step of this tutorial you’ll run the function on a cluster, which automatically saves the function object in the project.

The execution results are stored in the MLRun database. The tutorial example sets the following function parameters:

  • name — the job name

  • handler — the name of the function handler

  • input — the data-set URL

As input for the function, the example uses a CSV file hosted on the Wasabi cloud object-store service.

Note: You can also use the function to ingest data in other formats than CSV, such as Parquet, without modifying the code.

# Set the source-data URL
source_url = 'https://s3.wasabisys.com/iguazio/data/iris/iris.data.raw.csv'
# Run the `data_prep_func` MLRun function locally
prep_data_run = data_prep_func.run(name='prep_data',
                                   handler=prep_data,
                                   inputs={'source_url': source_url},
                                   local=True)
> 2021-01-25 08:29:50,130 [info] starting run prep_data uid=7bb634808a4141b58610bf0c5b7c6c70 DB=http://mlrun-api:8080
project=getting-started-tutorial-iguazio  iter=0  start=Jan 25 08:29:50  state=completed  name=prep_data
labels: v3io_user=iguazio, kind=, owner=iguazio, host=jupyter-iguazio-d676cf5f-f6g2k
inputs: source_url | results: num_rows=150 | artifacts: cleaned_data
to track results use .show() or .logs() or in CLI: 
!mlrun get run 7bb634808a4141b58610bf0c5b7c6c70 --project getting-started-tutorial-iguazio , !mlrun logs 7bb634808a4141b58610bf0c5b7c6c70 --project getting-started-tutorial-iguazio
> 2021-01-25 08:29:50,657 [info] run executed, status=completed
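Conceptually, format-agnostic ingestion works because the data item’s as_df call selects a parser based on the file format. The dispatch below is a simplified illustration of the idea, not MLRun’s actual implementation:

```python
import pandas as pd

# Illustrative only: map a file suffix to a pandas reader, the way a
# format-agnostic loader might. MLRun's DataItem.as_df does its own dispatch.
READERS = {'.csv': pd.read_csv, '.parquet': pd.read_parquet, '.json': pd.read_json}

def reader_for(path):
    """Return the pandas reader matching the file suffix, if any."""
    for suffix, reader in READERS.items():
        if path.endswith(suffix):
            return reader
    raise ValueError(f'unsupported format: {path}')
```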

Getting Information About the Run Object

Every run object that’s returned by the MLRun run method has the following methods:

  • uid — returns the unique ID.

  • state — returns the last known state.

  • show — shows the latest job state and data in a visual widget (with hyperlinks and hints).

  • outputs — returns a dictionary of the run results and artifact paths.

  • logs — returns the latest logs. Pass watch=False to disable interactive mode for running jobs.

  • artifact — returns full artifact details for the provided key.

  • output — returns a specific result or an artifact path for the provided key.

  • to_dict, to_yaml, to_json — converts the run object to a dictionary, YAML, or JSON format (respectively).

# example
prep_data_run.state()
'completed'
prep_data_run.outputs['cleaned_data']
'store://artifacts/getting-started-tutorial-iguazio/prep_data_cleaned_data:7bb634808a4141b58610bf0c5b7c6c70'
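The cleaned_data output is a store:// artifact URI of the form store://artifacts/&lt;project&gt;/&lt;key&gt;:&lt;uid&gt;. Splitting it in plain Python (an illustrative helper, not an MLRun API) makes the structure visible:

```python
def parse_store_uri(uri):
    """Split a store://artifacts/<project>/<key>:<uid> URI (illustrative only)."""
    prefix = 'store://artifacts/'
    assert uri.startswith(prefix)
    remainder = uri[len(prefix):]
    project, _, rest = remainder.partition('/')
    key, _, uid = rest.partition(':')
    return {'project': project, 'key': key, 'uid': uid}

info = parse_store_uri(
    'store://artifacts/getting-started-tutorial-iguazio/'
    'prep_data_cleaned_data:7bb634808a4141b58610bf0c5b7c6c70')
```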

Reading the Output

The data-set location is returned in the outputs field. Therefore, you can get the location by calling prep_data_run.outputs['cleaned_data'], and then retrieve the data set itself with mlrun.run.get_dataitem.

dataset = mlrun.run.get_dataitem(prep_data_run.outputs['cleaned_data'])

You can also get the data as a pandas DataFrame by calling the dataset.as_df method:

dataset.as_df()
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) label
0 5.1 3.5 1.4 0.2 0
1 4.9 3.0 1.4 0.2 0
2 4.7 3.2 1.3 0.2 0
3 4.6 3.1 1.5 0.2 0
4 5.0 3.6 1.4 0.2 0
... ... ... ... ... ...
145 6.7 3.0 5.2 2.3 2
146 6.3 2.5 5.0 1.9 2
147 6.5 3.0 5.2 2.0 2
148 6.2 3.4 5.4 2.3 2
149 5.9 3.0 5.1 1.8 2

150 rows × 5 columns

Saving the Artifacts in Run-Specific Paths

In the previous steps, each time the function was executed its artifacts were saved to the same directory, overwriting the existing artifacts in this directory. However, you can also choose to save the run artifacts to a different directory for each job execution. You do this by including the unique run-ID parameter ({{run.uid}}) in the artifacts path. After the run, under the artifacts path you should see the generated artifacts in a new directory whose name is derived from the unique run ID.

out = artifact_path

prep_data_run = data_prep_func.run(name='prep_data',
                                   handler=prep_data,
                                   inputs={'source_url': source_url},
                                   local=True,
                                   artifact_path=path.join(out, '{{run.uid}}'))
> 2021-01-25 08:29:50,739 [info] starting run prep_data uid=e545e9989ab641fdb1da6f10a308f2a9 DB=http://mlrun-api:8080
project=getting-started-tutorial-iguazio  iter=0  start=Jan 25 08:29:50  state=completed  name=prep_data
labels: v3io_user=iguazio, kind=, owner=iguazio, host=jupyter-iguazio-d676cf5f-f6g2k
inputs: source_url | results: num_rows=150 | artifacts: cleaned_data
to track results use .show() or .logs() or in CLI: 
!mlrun get run e545e9989ab641fdb1da6f10a308f2a9 --project getting-started-tutorial-iguazio , !mlrun logs e545e9989ab641fdb1da6f10a308f2a9 --project getting-started-tutorial-iguazio
> 2021-01-25 08:29:51,054 [info] run executed, status=completed
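The {{run.uid}} placeholder is a template string that MLRun resolves at execution time. A plain-Python illustration of the substitution (the UID below is taken from the run output above):

```python
from os import path

# Example artifacts path; in the tutorial this comes from set_environment
artifact_path = '/v3io/projects/getting-started-tutorial/artifacts'
templated = path.join(artifact_path, '{{run.uid}}')

# At run time MLRun substitutes the actual run UID, yielding a per-run directory:
resolved = templated.replace('{{run.uid}}', 'e545e9989ab641fdb1da6f10a308f2a9')
```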

Step 3: Running the Function on a Cluster

You can also run MLRun functions on the cluster itself, as opposed to running them locally in the Jupyter pod as done in the previous steps. Running a function on the cluster lets you leverage the cluster’s resources and run more resource-intensive workloads. MLRun helps you easily run your code without the hassle of creating configuration files and building images. To run an MLRun function on a cluster, just change the value of the local flag in the call to the run method to False.

from mlrun.platforms import auto_mount
data_prep_func.apply(auto_mount())
prep_data_run = data_prep_func.run(name='prep_data',
                                   handler='prep_data',
                                   inputs={'source_url': source_url},
                                   local=False)
> 2021-01-25 08:29:51,071 [info] starting run prep_data uid=1e6ebd36a3344230bd288224fe77e3dc DB=http://mlrun-api:8080
> 2021-01-25 08:29:51,223 [info] Job is running in the background, pod: prep-data-p7crp
> 2021-01-25 08:29:54,947 [info] run executed, status=completed
final state: completed
project=getting-started-tutorial-iguazio  iter=0  start=Jan 25 08:29:54  state=completed  name=prep_data
labels: v3io_user=iguazio, kind=job, owner=iguazio, host=prep-data-p7crp
inputs: source_url | results: num_rows=150 | artifacts: cleaned_data
to track results use .show() or .logs() or in CLI: 
!mlrun get run 1e6ebd36a3344230bd288224fe77e3dc --project getting-started-tutorial-iguazio , !mlrun logs 1e6ebd36a3344230bd288224fe77e3dc --project getting-started-tutorial-iguazio
> 2021-01-25 08:29:57,337 [info] run executed, status=completed
print(prep_data_run.outputs)
{'num_rows': 150, 'cleaned_data': 'store://artifacts/getting-started-tutorial-iguazio/prep_data_cleaned_data:1e6ebd36a3344230bd288224fe77e3dc'}

Step 4: Viewing Jobs on the Dashboard (UI)

On the Projects dashboard page, select your project and then navigate to the project’s jobs and workflows page by selecting the relevant link. After the previous runs of the prep_data function, you should see three records: two of type local (<>) and one of type job. In this view you can track all jobs running in your project and view detailed job information. Select a job name to display tabs with additional information, such as the input data set, artifacts that were generated by the job, and execution results and logs.

Jobs

Step 5: Scheduling Jobs

To schedule a job, set the schedule parameter of the run method. Schedules are specified in the standard crontab format.

You can also schedule jobs from the dashboard: on the jobs and monitoring project page, you can create a new job using the New Job wizard. At the end of the wizard flow you can set the job scheduling. In the following example, the job is set to run every 30 minutes.

data_prep_func.apply(auto_mount())
prep_data_run = data_prep_func.run(name='prep_data',
                                   handler='prep_data',
                                   inputs={'source_url': source_url},
                                   local=False,
                                   schedule='*/30 * * * *')
> 2021-01-25 08:29:57,352 [info] starting run prep_data uid=925f50a6c2694955a48eb8be3aca1b61 DB=http://mlrun-api:8080
> 2021-01-25 08:29:57,438 [info] task scheduled, {'schedule': '*/30 * * * *', 'project': 'getting-started-tutorial-iguazio', 'name': 'prep_data'}
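The schedule string follows the standard five-field crontab layout: minute, hour, day of month, month, and day of week. Splitting the string makes the fields explicit:

```python
schedule = '*/30 * * * *'  # every 30 minutes, as used in the example above

# The five crontab fields, in order
field_names = ['minute', 'hour', 'day_of_month', 'month', 'day_of_week']
fields = dict(zip(field_names, schedule.split()))
```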

List Scheduled Jobs

Use the get_run_db().list_schedules MLRun method to list your project’s scheduled jobs and display the results.

print(mlrun.get_run_db().list_schedules(project_name))
schedules=[ScheduleOutput(name='prep_data', kind=<ScheduleKinds.job: 'job'>, scheduled_object={'task': {'spec': {'inputs': {'source_url': 'https://s3.wasabisys.com/iguazio/data/iris/iris.data.raw.csv'}, 'output_path': '/v3io/projects/getting-started-tutorial-iguazio/artifacts', 'function': 'getting-started-tutorial-iguazio/prep-data@2d36aae0c7d6af2d40b7dc33f1e9de89222ed8c5', 'secret_sources': [], 'scrape_metrics': False, 'handler': 'prep_data'}, 'metadata': {'uid': '925f50a6c2694955a48eb8be3aca1b61', 'name': 'prep_data', 'project': 'getting-started-tutorial-iguazio', 'labels': {'v3io_user': 'iguazio', 'kind': 'job', 'owner': 'iguazio'}, 'iteration': 0}, 'status': {'state': 'created'}}, 'schedule': '*/30 * * * *'}, cron_trigger=ScheduleCronTrigger(year=None, month='*', day='*', week=None, day_of_week='*', hour='*', minute='*/30', second=None, start_date=None, end_date=None, timezone=None, jitter=None), desired_state=None, labels={'v3io_user': 'iguazio', 'kind': 'job', 'owner': 'iguazio'}, creation_time=datetime.datetime(2021, 1, 25, 8, 29, 57, 424175, tzinfo=datetime.timezone.utc), project='getting-started-tutorial-iguazio', last_run_uri=None, state=None, next_run_time=datetime.datetime(2021, 1, 25, 8, 30, tzinfo=datetime.timezone.utc), last_run=None)]

Get Scheduled Jobs

Use the get_run_db().get_schedule MLRun method to get the schedule of a scheduled job.

mlrun.get_run_db().get_schedule(project_name, 'prep_data')
ScheduleOutput(name='prep_data', kind=<ScheduleKinds.job: 'job'>, scheduled_object={'task': {'spec': {'inputs': {'source_url': 'https://s3.wasabisys.com/iguazio/data/iris/iris.data.raw.csv'}, 'output_path': '/v3io/projects/getting-started-tutorial-iguazio/artifacts', 'function': 'getting-started-tutorial-iguazio/prep-data@2d36aae0c7d6af2d40b7dc33f1e9de89222ed8c5', 'secret_sources': [], 'scrape_metrics': False, 'handler': 'prep_data'}, 'metadata': {'uid': '925f50a6c2694955a48eb8be3aca1b61', 'name': 'prep_data', 'project': 'getting-started-tutorial-iguazio', 'labels': {'v3io_user': 'iguazio', 'kind': 'job', 'owner': 'iguazio'}, 'iteration': 0}, 'status': {'state': 'created'}}, 'schedule': '*/30 * * * *'}, cron_trigger=ScheduleCronTrigger(year=None, month='*', day='*', week=None, day_of_week='*', hour='*', minute='*/30', second=None, start_date=None, end_date=None, timezone=None, jitter=None), desired_state=None, labels={'v3io_user': 'iguazio', 'kind': 'job', 'owner': 'iguazio'}, creation_time=datetime.datetime(2021, 1, 25, 8, 29, 57, 424175, tzinfo=datetime.timezone.utc), project='getting-started-tutorial-iguazio', last_run_uri=None, state=None, next_run_time=datetime.datetime(2021, 1, 25, 8, 30, tzinfo=datetime.timezone.utc), last_run=None)

View Scheduled Jobs on the Dashboard (UI)

You can also see your scheduled jobs on your project’s Jobs | Schedule dashboard page.

scheduled-jobs

Deleting Scheduled Jobs

When you no longer need to run the scheduled jobs, remove them by using the get_run_db().delete_schedule MLRun method to delete the job-schedule objects that you created.

mlrun.get_run_db().delete_schedule(project_name, 'prep_data')

You can verify that a scheduled job has been deleted by calling get_schedule to get the job schedule. If the delete operation was successful, this call should fail.

#mlrun.get_run_db().get_schedule(project_name,'prep_data')

Done!

Congratulations! You’ve completed Part 1 of the MLRun getting-started tutorial. Proceed to Part 2 to learn how to train an ML model.