Projects and automated ML pipeline#

This notebook demonstrates how to work with projects and source control (git), and how to automate the ML pipeline.

Make sure you have gone over the basics in the MLRun Quick Start Tutorial.

An MLRun project is a container for all your work on a particular activity: all the associated code, functions, jobs, workflows, data, models, and artifacts. Projects can be mapped to git repositories to enable versioning, collaboration, and CI/CD.

You can create project definitions using the SDK or a yaml file and store them in the MLRun DB, a file, or an archive. Once the project is loaded, you can run jobs/workflows that refer to any project element by name, which keeps configuration separate from code. See Create and load projects for details.

Projects contain workflows that execute the registered functions in a sequence/graph (DAG), and which can reference project parameters, secrets and artifacts by name. MLRun currently supports two workflow engines, local (for simple tasks) and Kubeflow Pipelines (for more complex/advanced tasks). MLRun also supports a real-time workflow engine (see online serving pipelines (graphs)).

An ML Engineer can gather the different functions created by the Data Engineer and Data Scientist and create this automated pipeline.

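Tutorial steps:

  • MLRun installation and configuration

  • Set up the project and functions

  • Working with GIT and archives

  • Build and run automated ML pipelines and CI/CD

  • Test the deployed model endpoint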

MLRun installation and configuration#

Before running this notebook, make sure the mlrun package is installed (pip install mlrun) and that you have configured access to the MLRun service.

# install MLRun if not installed, run this only once (restart the notebook after the install !!!)
%pip install mlrun
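One way to confirm that access is configured before continuing is to check for the environment variable that points the client at the MLRun API. This is a minimal sketch, assuming the service address is supplied through MLRUN_DBPATH. The helper name missing_mlrun_settings is hypothetical and the check is not part of MLRun.

```python
import os

# Assumption: the MLRun client reads the API address from MLRUN_DBPATH.
REQUIRED_VARS = ("MLRUN_DBPATH",)


def missing_mlrun_settings(env=None):
    """Return the names of required settings that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_mlrun_settings()
    if missing:
        print("Not configured:", ", ".join(missing))
    else:
        print("MLRun API address:", os.environ["MLRUN_DBPATH"])
```

Passing a plain dict instead of os.environ makes the check easy to try without changing the real environment.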

Set up the project and functions#

Get or create a project:

There are three ways to create/load MLRun projects:

  • mlrun.projects.new_project() — Create a new MLRun project and optionally load it from a yaml/zip/git template.

  • mlrun.projects.load_project() — Load a project from a context directory or remote git/zip/tar archive.

  • mlrun.projects.get_or_create_project() — Load a project from the MLRun DB if it exists, or from a specified context/archive.

Projects refer to a context directory that holds all the project code and configuration. The context dir is usually mapped to a git repository and/or to an IDE (PyCharm, VSCode, etc.) project.

import mlrun
project = mlrun.get_or_create_project("tutorial", context="./", user_project=True)
> 2022-09-20 14:59:47,322 [info] loaded project tutorial from MLRun DB
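The user_project=True flag makes the project name unique per user, which is why the serving endpoint near the end of this tutorial appears under tutorial-iguazio rather than tutorial. The sketch below approximates that naming rule in plain Python. user_project_name is a hypothetical helper, and the exact normalization MLRun applies to the username may differ.

```python
import getpass


def user_project_name(name, username=None):
    """Approximate the per-user project name: '<name>-<username>' in lowercase."""
    username = username or getpass.getuser()
    return f"{name}-{username}".lower()


print(user_project_name("tutorial", "iguazio"))  # prints tutorial-iguazio
```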

Register project functions#

To run workflows, you must save the definitions for the functions in the project so function objects will be initialized automatically when you load a project or when running a project version in automated CI/CD workflows. In addition, you might want to set/register other project attributes such as global parameters, secrets, and data.

Functions are registered using the set_function() method, where you can specify the code, requirements, image, etc. Functions can be created from a single code/notebook file, or they can have access to the entire project context directory: adding the with_repo=True flag guarantees that the project context is cloned into the function runtime environment.

Function registration examples:

    # example: register a notebook file as a function
    project.set_function('mynb.ipynb', name='test-function', image="mlrun/mlrun", handler="run_test")

    # define a job (batch) function which uses code/libs from the project repo
    project.set_function(
        name="myjob", handler="my_module.job_handler",
        image="mlrun/mlrun", kind="job", with_repo=True,
    )
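The handler argument names the function to call inside the code, either as a bare function name (run_test) or as a dotted module.function path (my_module.job_handler). The sketch below shows how such a dotted path can be resolved to a callable with importlib. It illustrates the naming convention only and is not MLRun's implementation. resolve_handler is a hypothetical helper, and json.dumps stands in for a project function.

```python
import importlib


def resolve_handler(path):
    """Resolve a 'module.function' string to the callable it names."""
    module_name, _, func_name = path.rpartition(".")
    if not module_name:
        raise ValueError(f"expected 'module.function', got {path!r}")
    module = importlib.import_module(module_name)
    return getattr(module, func_name)


handler = resolve_handler("json.dumps")
print(handler({"x": 5}))  # prints {"x": 5}
```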

Function code:

Run the following cell to generate the data prep file (or copy it manually):

%%writefile data-prep.py

import pandas as pd
from sklearn.datasets import load_breast_cancer

import mlrun


@mlrun.handler(outputs=["dataset", "label_column"])
def breast_cancer_generator():
    """
    A function which generates the breast cancer dataset
    """
    breast_cancer = load_breast_cancer()
    breast_cancer_dataset = pd.DataFrame(
        data=breast_cancer.data, columns=breast_cancer.feature_names
    )
    breast_cancer_labels = pd.DataFrame(data=breast_cancer.target, columns=["label"])
    breast_cancer_dataset = pd.concat(
        [breast_cancer_dataset, breast_cancer_labels], axis=1
    )

    return breast_cancer_dataset, "label"
Overwriting data-prep.py

Register the function above in the project:

project.set_function("data-prep.py", name="data-prep", kind="job", image="mlrun/mlrun", handler="breast_cancer_generator")
<mlrun.runtimes.kubejob.KubejobRuntime at 0x7fd96c30a0a0>

Register additional project objects and metadata:

You can define other objects (workflows, artifacts, secrets) and parameters in the project and use them in your functions, for example:

    # register a simple named artifact in the project (to be used in workflows)  
    data_url = 'https://s3.wasabisys.com/iguazio/data/iris/iris.data.raw.csv'
    project.set_artifact('data', target_path=data_url)

    # add a multi-stage workflow (./workflow.py) to the project with the name 'main' and save the project 
    project.set_workflow('main', "./workflow.py")
    
    # read env vars from dict or file and set as project secrets
    project.set_secrets({"SECRET1": "value"})
    project.set_secrets(file_path="secrets.env")
    
    project.spec.params = {"x": 5}
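The secrets.env file passed through file_path is a plain environment file. The sketch below reads such content into the same dict shape that set_secrets() accepts, assuming one KEY=value pair per line with # comments. parse_env_text is a hypothetical helper, not an MLRun function, and the file content is illustrative.

```python
def parse_env_text(text):
    """Parse KEY=value lines into a dict, skipping blanks and # comments."""
    secrets = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"not a KEY=value line: {raw_line!r}")
        secrets[key.strip()] = value.strip()
    return secrets


example = """
# secrets.env (illustrative content)
SECRET1=value
API_TOKEN=abc123
"""
print(parse_env_text(example))  # prints {'SECRET1': 'value', 'API_TOKEN': 'abc123'}
```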

Save the project:

# save the project in the db (and into the project.yaml file)
project.save()
<mlrun.projects.project.MlrunProject at 0x7fd96c2fdb50>

When you save the project, its definitions are stored in the project.yaml file. This allows you to reconstruct the project in a remote cluster or a CI/CD system.

See the generated project file: project.yaml.

Working with GIT and archives#

Push the project code/metadata into an Archive#

Use standard Git commands to push the current project tree into a git archive. Make sure you .save() the project before pushing it.

git remote add origin <server>
git commit -m "Commit message"
git push origin master

Alternatively you can use MLRun SDK calls:

  • project.create_remote(git_uri, branch=branch) - to register the remote Git path

  • project.push() - save project state and commit/push updates to remote git repo

You can also save the project content and metadata into a local or remote .zip archive, for example:

project.export("../archive1.zip")
project.export("s3://my-bucket/archive1.zip")
project.export(f"v3io://projects/{project.name}/archive1.zip")
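After exporting to a local path, the archive can be inspected with the standard library to confirm what was packaged. The sketch below builds a stand-in archive so that it runs without MLRun. The file names and the assumption that project.yaml sits at the archive root are illustrative, and archive_members is a hypothetical helper.

```python
import tempfile
import zipfile
from pathlib import Path


def archive_members(zip_path):
    """Return the sorted file names stored in a zip archive."""
    with zipfile.ZipFile(zip_path) as archive:
        return sorted(archive.namelist())


# Build a stand-in archive so the sketch runs without MLRun.
with tempfile.TemporaryDirectory() as tmp:
    zip_path = Path(tmp) / "archive1.zip"
    with zipfile.ZipFile(zip_path, "w") as archive:
        archive.writestr("project.yaml", "kind: project\n")
        archive.writestr("workflow.py", "# workflow code\n")
    members = archive_members(zip_path)

print(members)  # prints ['project.yaml', 'workflow.py']
```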

Load a project from local/remote archive#

The project metadata and context (code and configuration) can be loaded and initialized using the load_project() method. When a url (of a git/zip/tar archive) is specified, the method clones the remote repo into the local context dir.

# load the project and run the 'main' workflow
# (data_url is the dataset URL defined in the artifact example above)
project = mlrun.load_project(context="./", name="myproj", url="git://github.com/mlrun/project-archive.git")
project.run("main", arguments={'data': data_url})

Projects can also be loaded and executed using the CLI:

mlrun project -n myproj -u "git://github.com/mlrun/project-archive.git" .
mlrun project -r main -w -a data=<data-url> .

# load the project in the current context dir
project = mlrun.load_project("./")

Build and run automated ML pipelines and CI/CD#

A pipeline is created by running an MLRun “workflow”. The following code defines a workflow and writes it to a file in your local directory, with the file name workflow.py. The workflow describes a directed acyclic graph (DAG) which is executed using the local, remote, or kubeflow engines.

See running a multi-stage workflow. The defined pipeline includes the following steps:

  • Generate/prepare the data (ingest).

  • Train the model (train).

  • Deploy the model as a real-time serverless function (serving).

Note: A pipeline can also include continuous integration and deployment (CI/CD) steps, such as building container images and deploying models.

%%writefile './workflow.py'

from kfp import dsl
import mlrun

# Create a Kubeflow Pipelines pipeline
@dsl.pipeline(name="breast-cancer-demo")
def pipeline(model_name="cancer-classifier"):
    # Run the data-prep function to generate the dataset
    ingest = mlrun.run_function(
        "data-prep",
        name="get-data",
        outputs=["dataset"],
    )

    # Train a model using the auto_trainer hub function
    train = mlrun.run_function(
        "hub://auto_trainer",
        inputs={"dataset": ingest.outputs["dataset"]},
        params = {
            "model_class": "sklearn.ensemble.RandomForestClassifier",
            "train_test_split_size": 0.2,
            "label_columns": "label",
            "model_name": model_name,
        }, 
        handler='train',
        outputs=["model"],
    )

    # Deploy the trained model as a serverless function
    serving_fn = mlrun.new_function("serving", image="mlrun/mlrun", kind="serving")
    serving_fn.with_code(body=" ")
    mlrun.deploy_function(
        serving_fn,
        models=[
            {
                "key": model_name,
                "model_path": train.outputs["model"],
                "class_name": 'mlrun.frameworks.sklearn.SklearnModelServer',
            }
        ],
    )
Writing ./workflow.py
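In the workflow above, each step consumes an output of the previous one (ingest.outputs["dataset"], then train.outputs["model"]), and those references are what define the edges of the DAG. As a rough illustration of how a DAG fixes the execution order, the sketch below uses Python's standard graphlib module (Python 3.9+) rather than the MLRun or Kubeflow engine. The step names mirror the three pipeline steps.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps it depends on.
dependencies = {
    "ingest": set(),
    "train": {"ingest"},
    "serving": {"train"},
}

order = list(TopologicalSorter(dependencies).static_order())
print(order)  # prints ['ingest', 'train', 'serving']
```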

Run the workflow:

# run the workflow
run_id = project.run(
    workflow_path="./workflow.py",
    arguments={"model_name": "cancer-classifier"}, 
    watch=True)
Pipeline running (id=6907fa23-dcdc-49bd-adbd-dfb7f8d25997), click here to view the details in MLRun UI

Run Results

Workflow 6907fa23-dcdc-49bd-adbd-dfb7f8d25997 finished, state=Succeeded

  • auto-trainer-train (start: Sep 20 15:00:35, state: completed)
    parameters: model_class=sklearn.ensemble.RandomForestClassifier, train_test_split_size=0.2, label_columns=label, model_name=cancer-classifier
    results: accuracy=0.956140350877193, f1_score=0.9635036496350365, precision_score=0.9565217391304348, recall_score=0.9705882352941176

  • get-data (start: Sep 20 15:00:07, state: completed)
    results: label_column=label
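The metrics reported by the training step are mutually consistent: F1 is the harmonic mean of precision and recall. A quick check using the values from the run results above:

```python
import math

precision = 0.9565217391304348
recall = 0.9705882352941176

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(round(f1, 6))  # prints 0.963504
assert math.isclose(f1, 0.9635036496350365, rel_tol=1e-9)
```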

The pipeline can also be viewed in the MLRun UI.


Run workflows using the CLI:

With MLRun you can use a single command to load the code from local dir or remote archive (Git, zip, …) and execute a pipeline. This can be very useful for integration with CI/CD frameworks and practices. See CI/CD integration for more details.

The following command loads the project from the current directory (.) and runs the workflow with one argument. It is intended for running locally (without Kubernetes).

mlrun project -r ./workflow.py -w -a model_name=classifier2 .
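Each -a key=value option on the command line corresponds to one entry of the arguments dict passed to project.run() in the SDK. The sketch below shows that mapping for illustration only. parse_cli_arguments is a hypothetical helper, not MLRun's own parser, and it keeps all values as strings.

```python
def parse_cli_arguments(pairs):
    """Turn ['key=value', ...] into the dict form used by project.run(arguments=...)."""
    arguments = {}
    for pair in pairs:
        key, sep, value = pair.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {pair!r}")
        arguments[key] = value
    return arguments


print(parse_cli_arguments(["model_name=classifier2"]))
# prints {'model_name': 'classifier2'}
```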

Test the deployed model endpoint#

Now that your model is deployed using the pipeline, you can invoke it as usual:

serving_fn = project.get_function("serving")

# sample request body: one row holding the 30 feature values of a single sample
my_data = {
    "inputs": [
        [
            1.371e+01, 2.083e+01, 9.020e+01, 5.779e+02, 1.189e-01, 1.645e-01,
            9.366e-02, 5.985e-02, 2.196e-01, 7.451e-02, 5.835e-01, 1.377e+00,
            3.856e+00, 5.096e+01, 8.805e-03, 3.029e-02, 2.488e-02, 1.448e-02,
            1.486e-02, 5.412e-03, 1.706e+01, 2.814e+01, 1.106e+02, 8.970e+02,
            1.654e-01, 3.682e-01, 2.678e-01, 1.556e-01, 3.196e-01, 1.151e-01,
        ]
    ]
}
serving_fn.invoke("/v2/models/cancer-classifier/infer", body=my_data)
> 2022-09-20 15:09:02,664 [info] invoking function: {'method': 'POST', 'path': 'http://nuclio-tutorial-iguazio-serving.default-tenant.svc.cluster.local:8080/v2/models/cancer-classifier/infer'}
{'id': '7ecaf987-bd79-470e-b930-19959808b678',
 'model_name': 'cancer-classifier',
 'outputs': [0]}
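The request follows a simple shape: the path names the model, and the body carries a list of rows under inputs, one row per prediction. The sketch below builds and checks such a request in plain Python. build_infer_request is a hypothetical helper, and the 30-feature check assumes the breast cancer dataset used in this tutorial. In scikit-learn's copy of this dataset, label 0 is malignant and label 1 is benign.

```python
N_FEATURES = 30  # the breast cancer dataset has 30 feature columns
CLASS_NAMES = {0: "malignant", 1: "benign"}  # scikit-learn's target_names order


def build_infer_request(model_name, rows):
    """Return (path, body) for an inference call, validating the row width."""
    for row in rows:
        if len(row) != N_FEATURES:
            raise ValueError(f"expected {N_FEATURES} features, got {len(row)}")
    path = f"/v2/models/{model_name}/infer"
    return path, {"inputs": [list(row) for row in rows]}


path, body = build_infer_request("cancer-classifier", [[0.0] * N_FEATURES])
print(path)  # prints /v2/models/cancer-classifier/infer

# Interpret a response shaped like the one above.
response = {"model_name": "cancer-classifier", "outputs": [0]}
print([CLASS_NAMES[label] for label in response["outputs"]])  # prints ['malignant']
```

Validating the row width on the client side gives a clearer error than a failed prediction inside the serving function.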

Done!#

Congratulations! You’ve completed the getting started tutorial.

You might also want to explore the following demos: