Hyper-Param and Iterative jobs

MLRun supports iterative tasks for automatic and distributed execution of many tasks with variable parameters. This can be used for various scenarios, such as:

  • Parallel loading and preparation of many data objects

  • Model training with different parameter sets and/or algorithms

  • Parallel testing with many test vector options

MLRun iterations can be viewed as child runs under the main task/run. Each child run receives a set of parameters, computed/selected from the input hyperparameters based on the chosen strategy (Grid, List, Random, or Custom).

The different iterations can run in parallel over multiple containers (using the Dask or Nuclio runtimes, which manage the workers); read more in the Parallel Execution Over Containers section.

The hyperparameters and options are specified in the task or in the run() command through the hyperparams (for hyperparameter values) and hyper_param_options (for HyperParamOptions) properties; see the examples below. Hyperparameters can also be loaded directly from a CSV or JSON file (by setting the param_file hyper option).
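
For example, a minimal sketch of driving the iterations from a file; params.csv is a hypothetical file here (one column per parameter, one row per combination), and hyper_func is the base function defined in the examples below:

from mlrun.model import HyperParamOptions

# params.csv is a hypothetical file: one column per parameter (p1, p2),
# one row per combination
run = mlrun.new_function().run(
    mlrun.new_task("file-demo"),
    handler=hyper_func,
    hyper_param_options=HyperParamOptions(
        param_file="params.csv", selector="max.multiplier"
    ),
)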

Hyperparameters are specified as a struct of key: list values, for example: {"p1": [1,2,3], "p2": [10,20]}. The values can be of any type (int, string, float, etc.), and the lists are used to compute the parameter combinations using one of the following strategies:

  1. Grid Search (grid) - running all the parameter combinations

  2. Random (random) - running a sampled set from all the parameter combinations

  3. List (list) - running the first parameter from each list, followed by the second from each list, and so on. Note that all the lists must be of equal size.

MLRun also supports a fourth, custom option that allows determining the parameter combination per run programmatically (see the Custom Iterator example below).
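
To make the strategies concrete, here is a plain-Python sketch of the combinations each strategy would yield for {"p1": [1,2,3], "p2": [10,20]}; this is illustrative only, not MLRun's internal implementation:

import itertools
import random

params = {"p1": [1, 2, 3], "p2": [10, 20]}

# grid: the full cross product - 3 x 2 = 6 combinations
grid = [dict(zip(params, values)) for values in itertools.product(*params.values())]

# random: a sample of the grid (MLRun bounds the count with the max_iterations option)
sampled = random.sample(grid, k=3)

# list: positional zip - MLRun requires equal-length lists for this strategy
# (here p1 has 3 values but p2 only 2, so the list strategy would be rejected)
listed = [dict(zip(params, values)) for values in zip(*params.values())]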

You can specify selection criteria to select the best run among the different child runs by setting the selector option. This marks that result as the parent (iteration 0) result and marks the best result in the user interface.

You can also specify a stop_condition to stop the execution of child runs when some criterion based on the returned results is met (for example, stop_condition="accuracy>=0.9").
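
A minimal sketch combining both options; it assumes with_hyper_params forwards extra keyword options such as stop_condition to HyperParamOptions (as in recent MLRun versions):

task = mlrun.new_task("early-stop").with_hyper_params(
    {"p1": [1, 2, 4, 8], "p2": [10, 20]},
    selector="max.multiplier",        # mark the child with the highest multiplier as best
    stop_condition="multiplier>=80",  # stop scheduling children once a result meets this
)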

Examples

Base dummy function:

import mlrun
def hyper_func(context, p1, p2):
    print(f"p1={p1}, p2={p2}, result={p1 * p2}")
    context.log_result("multiplier", p1 * p2)

Grid Search (default)

grid_params = {"p1": [2,4,1], "p2": [10,20]}
task = mlrun.new_task("grid-demo").with_hyper_params(grid_params, selector="max.multiplier")
run = mlrun.new_function().run(task, handler=hyper_func)
> 2021-10-23 12:47:43,505 [info] starting run grid-demo uid=29c9083db6774e5096a97c9b6b6c8e93 DB=http://mlrun-api:8080
p1=2, p2=10, result=20
p1=4, p2=10, result=40
p1=1, p2=10, result=10
p1=2, p2=20, result=40
p1=4, p2=20, result=80
p1=1, p2=20, result=20
> 2021-10-23 12:47:44,851 [info] best iteration=5, used criteria max.multiplier
project: default | uid: 29c9083db6774e5096a97c9b6b6c8e93 | iter: 0 | start: Oct 23 12:47:43 | state: completed | name: grid-demo
labels: v3io_user=admin, kind=handler, owner=admin
results: best_iteration=5, multiplier=80
artifacts: iteration_results

> to track results use the .show() or .logs() methods or click here to open in UI
> 2021-10-23 12:47:45,071 [info] run executed, status=completed

UI Screenshot:

[Screenshot: hyper-param results view in the MLRun UI]

Custom Iterator

We can define a child iteration context under the parent/main run; each child run is logged independently:

def handler(context: mlrun.MLClientCtx, param_list):
    best_multiplier = total = 0
    for param in param_list:
        with context.get_child_context(**param) as child:
            hyper_func(child, **child.parameters)
            multiplier = child.results['multiplier']
            total += multiplier
            if multiplier > best_multiplier:
                child.mark_as_best()
                best_multiplier = multiplier

    # log result at the parent
    context.log_result('avg_multiplier', total / len(param_list))

param_list = [{"p1":2, "p2":10}, {"p1":3, "p2":30}, {"p1":4, "p2":7}]
run = mlrun.new_function().run(handler=handler, params={"param_list": param_list})
> 2021-10-23 12:47:47,403 [info] starting run mlrun-a79c5c-handler uid=c3eb08ebae02464ca4025c77b12e3c39 DB=http://mlrun-api:8080
p1=2, p2=10, result=20
p1=3, p2=30, result=90
p1=4, p2=7, result=28
project: default | uid: c3eb08ebae02464ca4025c77b12e3c39 | iter: 0 | start: Oct 23 12:47:47 | state: completed | name: mlrun-a79c5c-handler
labels: v3io_user=admin, kind=handler, owner=admin, host=jupyter-6476bb5f85-bjc4m
parameters: param_list=[{'p1': 2, 'p2': 10}, {'p1': 3, 'p2': 30}, {'p1': 4, 'p2': 7}]
results: best_iteration=2, multiplier=90, avg_multiplier=46.0

> to track results use the .show() or .logs() methods or click here to open in UI
> 2021-10-23 12:47:48,734 [info] run executed, status=completed

Parallel Execution Over Containers

When working with compute-intensive or long-running tasks, we would like to run our iterations over a cluster of containers. At the same time, we don't want to bring up too many containers, and would rather limit the number of parallel tasks.

MLRun supports distributing the child runs over Dask or Nuclio clusters. This is handled automatically by MLRun; the user only needs to deploy the Dask or Nuclio function used by the workers and set the level of parallelism in the task. The execution can be controlled from the client/notebook, or can be driven by a job (immediate or scheduled).

Code example (single task)

# mark the start of a code section which will be sent to the job
# mlrun: start-code
import socket
import pandas as pd
def hyper_func2(context, data, p1, p2, p3):
    print(data.as_df().head())
    context.logger.info(f"p2={p2}, p3={p3}, r1={p2 * p3} at {socket.gethostname()}")
    context.log_result("r1", p2 * p3)
    raw_data = {
        "first_name": ["Jason", "Molly", "Tina", "Jake", "Amy"],
        "age": [42, 52, 36, 24, 73],
        "testScore": [25, 94, 57, 62, 70],
    }
    df = pd.DataFrame(raw_data, columns=["first_name", "age", "testScore"])
    context.log_dataset("mydf", df=df, stats=True)
# mlrun: end-code

Running the workers using Dask

In the following example we create a new function and execute the parent/controller as an MLRun job, while the different child runs execute over a Dask cluster (an MLRun Dask function).

Define a Dask Cluster (using MLRun serverless Dask)

dask_cluster = mlrun.new_function("dask-cluster", kind='dask', image='mlrun/ml-models')
dask_cluster.apply(mlrun.mount_v3io())        # add volume mounts
dask_cluster.spec.service_type = "NodePort"   # open interface to the dask UI dashboard
dask_cluster.spec.replicas = 2                # define two containers
uri = dask_cluster.save()
uri
'db://default/dask-cluster'
# initialize the dask cluster and get its dashboard url
dask_cluster.client
> 2021-10-23 12:48:49,020 [info] trying dask client at: tcp://mlrun-dask-cluster-eea516ff-5.default-tenant:8786
> 2021-10-23 12:48:49,049 [info] using remote dask scheduler (mlrun-dask-cluster-eea516ff-5) at: tcp://mlrun-dask-cluster-eea516ff-5.default-tenant:8786
Mismatched versions found

+-------------+--------+-----------+---------+
| Package     | client | scheduler | workers |
+-------------+--------+-----------+---------+
| blosc       | 1.7.0  | 1.10.6    | None    |
| cloudpickle | 1.6.0  | 2.0.0     | None    |
| distributed | 2.30.0 | 2.30.1    | None    |
| lz4         | 3.1.0  | 3.1.3     | None    |
| msgpack     | 1.0.0  | 1.0.2     | None    |
| tornado     | 6.0.4  | 6.1       | None    |
+-------------+--------+-----------+---------+
Notes: 
-  msgpack: Variation is ok, as long as everything is above 0.6

[Dask client/cluster widget: Workers: 0, Cores: 0, Memory: 0 B]

Define the Parallel Work

We set the parallel_runs attribute to indicate how many child tasks to run in parallel, and set the dask_cluster_uri to point to our Dask cluster (if we don't set the cluster URI, a local Dask client is used). We can also set the teardown_dask flag to free up all the Dask resources after completion.

grid_params = {"p2": [2,1,4,1], "p3": [10,20]}
task = mlrun.new_task(params={"p1": 8}, inputs={'data': 'https://s3.wasabisys.com/iguazio/data/iris/iris_dataset.csv'})
task.with_hyper_params(
    grid_params, selector="r1", strategy="grid", parallel_runs=4, dask_cluster_uri=uri, teardown_dask=True
)
<mlrun.model.RunTemplate at 0x7f673d7b1910>

Define a job that will take our code (using code_to_function) and run it over the cluster

fn = mlrun.code_to_function(name='hyper-tst', kind='job', image='mlrun/ml-models')
run = fn.run(task, handler=hyper_func2)
> 2021-10-23 12:49:56,388 [info] starting run hyper-tst-hyper_func2 uid=50eb72f5b0734954b8b1c57494f325bc DB=http://mlrun-api:8080
> 2021-10-23 12:49:56,565 [info] Job is running in the background, pod: hyper-tst-hyper-func2-9g6z8
> 2021-10-23 12:49:59,813 [info] trying dask client at: tcp://mlrun-dask-cluster-eea516ff-5.default-tenant:8786
> 2021-10-23 12:50:09,828 [warning] remote scheduler at tcp://mlrun-dask-cluster-eea516ff-5.default-tenant:8786 not ready, will try to restart Timed out trying to connect to tcp://mlrun-dask-cluster-eea516ff-5.default-tenant:8786 after 10 s
> 2021-10-23 12:50:15,733 [info] using remote dask scheduler (mlrun-dask-cluster-04574796-5) at: tcp://mlrun-dask-cluster-04574796-5.default-tenant:8786
remote dashboard: default-tenant.app.yh38.iguazio-cd2.com:32577
> --------------- Iteration: (1) ---------------
   sepal length (cm)  sepal width (cm)  ...  petal width (cm)  label
0                5.1               3.5  ...               0.2      0
1                4.9               3.0  ...               0.2      0
2                4.7               3.2  ...               0.2      0
3                4.6               3.1  ...               0.2      0
4                5.0               3.6  ...               0.2      0

[5 rows x 5 columns]
> 2021-10-23 12:50:21,353 [info] p2=2, p3=10, r1=20 at mlrun-dask-cluster-04574796-5k5lhq

> --------------- Iteration: (3) ---------------
   sepal length (cm)  sepal width (cm)  ...  petal width (cm)  label
0                5.1               3.5  ...               0.2      0
1                4.9               3.0  ...               0.2      0
2                4.7               3.2  ...               0.2      0
3                4.6               3.1  ...               0.2      0
4                5.0               3.6  ...               0.2      0

[5 rows x 5 columns]
> 2021-10-23 12:50:21,459 [info] p2=4, p3=10, r1=40 at mlrun-dask-cluster-04574796-5k5lhq

> --------------- Iteration: (4) ---------------
   sepal length (cm)  sepal width (cm)  ...  petal width (cm)  label
0                5.1               3.5  ...               0.2      0
1                4.9               3.0  ...               0.2      0
2                4.7               3.2  ...               0.2      0
3                4.6               3.1  ...               0.2      0
4                5.0               3.6  ...               0.2      0

[5 rows x 5 columns]
> 2021-10-23 12:50:21,542 [info] p2=1, p3=10, r1=10 at mlrun-dask-cluster-04574796-5k5lhq

> --------------- Iteration: (6) ---------------
   sepal length (cm)  sepal width (cm)  ...  petal width (cm)  label
0                5.1               3.5  ...               0.2      0
1                4.9               3.0  ...               0.2      0
2                4.7               3.2  ...               0.2      0
3                4.6               3.1  ...               0.2      0
4                5.0               3.6  ...               0.2      0

[5 rows x 5 columns]
> 2021-10-23 12:50:21,629 [info] p2=1, p3=20, r1=20 at mlrun-dask-cluster-04574796-5k5lhq

> --------------- Iteration: (7) ---------------
   sepal length (cm)  sepal width (cm)  ...  petal width (cm)  label
0                5.1               3.5  ...               0.2      0
1                4.9               3.0  ...               0.2      0
2                4.7               3.2  ...               0.2      0
3                4.6               3.1  ...               0.2      0
4                5.0               3.6  ...               0.2      0

[5 rows x 5 columns]
> 2021-10-23 12:50:21,792 [info] p2=4, p3=20, r1=80 at mlrun-dask-cluster-04574796-5k5lhq

> --------------- Iteration: (8) ---------------
   sepal length (cm)  sepal width (cm)  ...  petal width (cm)  label
0                5.1               3.5  ...               0.2      0
1                4.9               3.0  ...               0.2      0
2                4.7               3.2  ...               0.2      0
3                4.6               3.1  ...               0.2      0
4                5.0               3.6  ...               0.2      0

[5 rows x 5 columns]
> 2021-10-23 12:50:22,052 [info] p2=1, p3=20, r1=20 at mlrun-dask-cluster-04574796-5k5lhq

> --------------- Iteration: (2) ---------------
   sepal length (cm)  sepal width (cm)  ...  petal width (cm)  label
0                5.1               3.5  ...               0.2      0
1                4.9               3.0  ...               0.2      0
2                4.7               3.2  ...               0.2      0
3                4.6               3.1  ...               0.2      0
4                5.0               3.6  ...               0.2      0

[5 rows x 5 columns]
> 2021-10-23 12:50:23,134 [info] p2=1, p3=10, r1=10 at mlrun-dask-cluster-04574796-5j6v59

> --------------- Iteration: (5) ---------------
   sepal length (cm)  sepal width (cm)  ...  petal width (cm)  label
0                5.1               3.5  ...               0.2      0
1                4.9               3.0  ...               0.2      0
2                4.7               3.2  ...               0.2      0
3                4.6               3.1  ...               0.2      0
4                5.0               3.6  ...               0.2      0

[5 rows x 5 columns]
> 2021-10-23 12:50:23,219 [info] p2=2, p3=20, r1=40 at mlrun-dask-cluster-04574796-5k5lhq

> 2021-10-23 12:50:23,261 [info] tearing down the dask cluster..
> 2021-10-23 12:50:43,363 [info] best iteration=7, used criteria r1
> 2021-10-23 12:50:43,626 [info] run executed, status=completed
final state: completed
project: default | uid: 50eb72f5b0734954b8b1c57494f325bc | iter: 0 | start: Oct 23 12:49:59 | state: completed | name: hyper-tst-hyper_func2
labels: v3io_user=admin, kind=job, owner=admin
inputs: data
parameters: p1=8
results: best_iteration=7, r1=80
artifacts: mydf, iteration_results

> to track results use the .show() or .logs() methods or click here to open in UI
> 2021-10-23 12:50:53,303 [info] run executed, status=completed

Running the workers using Nuclio

Nuclio is a high-performance serverless engine that can process many events in parallel. It can also separate initialization from execution: certain parts of the code (imports, loading data, etc.) can run once per worker rather than on every run.

By default, Nuclio processes events (HTTP, stream, etc.); there is a special Nuclio kind that runs MLRun jobs (nuclio:mlrun).

Notes:

  • Nuclio tasks are relatively short (preferably under 5 minutes); use this runtime for running many iterations where each individual run takes less than 5 minutes.

  • Use context.logger to write text outputs (instead of print())

Create a nuclio:mlrun function

fn = mlrun.code_to_function(name='hyper-tst2', kind='nuclio:mlrun', image='mlrun/mlrun')
# replicas * workers need to match or exceed parallel_runs
fn.spec.replicas = 2
fn.with_http(workers=2)
fn.deploy()
> 2021-10-23 12:51:10,152 [info] Starting remote function deploy
2021-10-23 12:51:10  (info) Deploying function
2021-10-23 12:51:10  (info) Building
2021-10-23 12:51:10  (info) Staging files and preparing base images
2021-10-23 12:51:10  (info) Building processor image
2021-10-23 12:51:11  (info) Build complete
2021-10-23 12:51:19  (info) Function deploy complete
> 2021-10-23 12:51:22,296 [info] successfully deployed function: {'internal_invocation_urls': ['nuclio-default-hyper-tst2.default-tenant.svc.cluster.local:8080'], 'external_invocation_urls': ['default-tenant.app.yh38.iguazio-cd2.com:32760']}
'http://default-tenant.app.yh38.iguazio-cd2.com:32760'

Run the parallel task over the function

# this is required to fix a Jupyter issue with asyncio (not required outside of Jupyter)
# run it only once
import nest_asyncio
nest_asyncio.apply()
grid_params = {"p2": [2,1,4,1], "p3": [10,20]}
task = mlrun.new_task(params={"p1": 8}, inputs={'data': 'https://s3.wasabisys.com/iguazio/data/iris/iris_dataset.csv'})
task.with_hyper_params(
    grid_params, selector="r1", strategy="grid", parallel_runs=4, max_errors=3
)
run = fn.run(task, handler=hyper_func2)
> 2021-10-23 12:51:31,618 [info] starting run hyper-tst2-hyper_func2 uid=97cc3e255f3c4c93822b0154d63f47f5 DB=http://mlrun-api:8080
> --------------- Iteration: (4) ---------------
2021-10-23 12:51:32.130812  info   logging run results to: http://mlrun-api:8080  worker_id=1
2021-10-23 12:51:32.401258  info   p2=1, p3=10, r1=10 at nuclio-default-hyper-tst2-5d4976b685-47dh6  worker_id=1

> --------------- Iteration: (2) ---------------
2021-10-23 12:51:32.130713  info   logging run results to: http://mlrun-api:8080  worker_id=0
2021-10-23 12:51:32.409468  info   p2=1, p3=10, r1=10 at nuclio-default-hyper-tst2-5d4976b685-47dh6  worker_id=0

> --------------- Iteration: (1) ---------------
2021-10-23 12:51:32.130765  info   logging run results to: http://mlrun-api:8080  worker_id=0
2021-10-23 12:51:32.432121  info   p2=2, p3=10, r1=20 at nuclio-default-hyper-tst2-5d4976b685-2gdtc  worker_id=0

> --------------- Iteration: (5) ---------------
2021-10-23 12:51:32.568848  info   logging run results to: http://mlrun-api:8080  worker_id=0
2021-10-23 12:51:32.716415  info   p2=2, p3=20, r1=40 at nuclio-default-hyper-tst2-5d4976b685-47dh6  worker_id=0

> --------------- Iteration: (7) ---------------
2021-10-23 12:51:32.855399  info   logging run results to: http://mlrun-api:8080  worker_id=1
2021-10-23 12:51:33.054417  info   p2=4, p3=20, r1=80 at nuclio-default-hyper-tst2-5d4976b685-2gdtc  worker_id=1

> --------------- Iteration: (6) ---------------
2021-10-23 12:51:32.970002  info   logging run results to: http://mlrun-api:8080  worker_id=0
2021-10-23 12:51:33.136621  info   p2=1, p3=20, r1=20 at nuclio-default-hyper-tst2-5d4976b685-47dh6  worker_id=0

> --------------- Iteration: (3) ---------------
2021-10-23 12:51:32.541187  info   logging run results to: http://mlrun-api:8080  worker_id=1
2021-10-23 12:51:33.301200  info   p2=4, p3=10, r1=40 at nuclio-default-hyper-tst2-5d4976b685-47dh6  worker_id=1

> --------------- Iteration: (8) ---------------
2021-10-23 12:51:33.419442  info   logging run results to: http://mlrun-api:8080  worker_id=0
2021-10-23 12:51:33.672165  info   p2=1, p3=20, r1=20 at nuclio-default-hyper-tst2-5d4976b685-47dh6  worker_id=0

> 2021-10-23 12:51:34,153 [info] best iteration=7, used criteria r1
project: default | uid: 97cc3e255f3c4c93822b0154d63f47f5 | iter: 0 | start: Oct 23 12:51:31 | state: completed | name: hyper-tst2-hyper_func2
labels: v3io_user=admin, kind=remote, owner=admin
inputs: data
parameters: p1=8
results: best_iteration=7, r1=80
artifacts: mydf, iteration_results

> to track results use the .show() or .logs() methods or click here to open in UI
> 2021-10-23 12:51:34,420 [info] run executed, status=completed