Using the built-in training function#

The MLRun [function marketplace](MLRun Functions Marketplace includes, among other things, training functions. The most commonly used function for training is auto_trainer, which includes the following handlers:

Train#

The main and default handler of any training function is called "train". In the Auto Trainer this handler performs an ML training function using SciKit-Learn’s API, meaning the function follows the structure below:

  1. Get the data: Get the dataset passed to a local path.

  2. Split the data into datasets: Split the given data into a training set and a testing set.

  3. Get the model: Initialize a model instance out of a given class or load a provided model
    The supported classes are anything based on sklearn.Estimator, xgboost.XGBModel, lightgbm.LGBMModel, including custom code as well.

  4. Train: Call the model’s fit method to train it on the training set.

  5. Test: Test the model on the testing set.

  6. Log: Calculate the metrics and produce the artifacts to log the results and plots.

MLRun orchestrates all of the above steps. The training is done with the shortcut function apply_mlrun that enables the automatic logging and additional features.

To start, run import mlrun and create a project:

import mlrun
# Set the base project name
project_name_base = 'training-test'

# Initialize the MLRun project object
project = mlrun.get_or_create_project(project_name_base, context="./", user_project=True)

Next, import the Auto Trainer from the Functions Marketplace using MLRun’s import_function function:

auto_trainer = project.set_function(mlrun.import_function("hub://auto_trainer"))

The following example trains a Random Forest model:

dataset_url = "https://s3.wasabisys.com/iguazio/data/function-marketplace-data/xgb_trainer/classifier-data.csv"

train_run = auto_trainer.run(
    handler="train",
    inputs={"dataset": dataset_url},
    params={
        # Model parameters:
        "model_class": "sklearn.ensemble.RandomForestClassifier",
        "model_kwargs": {"max_depth": 8},  # Could be also passed as "MODEL_max_depth": 8
        "model_name": "MyModel",
        # Dataset parameters:
        "drop_columns": ["feat_0", "feat_2"],
        "train_test_split_size": 0.2,
        "random_state": 42,
        "label_columns": "labels",
    }
)

Outputs#

train_run.outputs returns all the outputs. The outputs are:

  • Trained model: The trained model is logged as a ModelArtifact with all the following artifacts registered to it.

  • Test dataset: The test set used to test the model post training is logged as a DatasetArtifact.

  • Plots: Informative plots regarding the model like confusion matrix and features importance are drawn and logged as PlotArtifacts.

  • Results: List of all the calculations of metrics tested on the testing set.

For instance, calling train_run.artifact('confusion-matrix').show() shows the following confusion matrix:

confusion matrix

Parameters#

To view the parameters of train, expand the section below:

train handler parameters:

Model Parameters

Parameters to initialize a new model object or load a logged one for retraining.

  • model_class: str — The class of the model to initialize. Can be a module path like "sklearn.linear_model.LogisticRegression" or a custom model passed through the custom objects parameters below. Only one of model_class and model_path can be given.

  • model_path: str — A ModelArtifact URI to load and retrain. Only one of model_class and model_path can be given.

  • model_kwargs: dict — Additional parameters to pass onto the initialization of the model object (the model’s class __init__ method).

Data parameters

Parameters to get a dataset and prepare it for training, splitting into training and testing if required.

  • dataset: Union[str, list, dict] — The dataset to train the model on.

    • Can be passed as part of inputs to be parsed as mlrun.DataItem, meaning it supports either a URI or a FeatureVector.

    • Can be passed as part of params, meaning it can be a list or a dict.

  • drop_columns: Union[str, int, List[str], List[int]] — Columns to drop from the dataset. Can be passed as strings representing the column names or integers representing the column numbers.

  • test_set: Union[str, list, dict] — The test set to test the model with post training. Notice only one of test_set or train_test_split_size is expected.

    • Can be passed as part of inputs to be parsed as mlrun.DataItem, meaning it supports either a URI or a FeatureVector.

    • Can be passed as part of params, meaning it can be a list or a dict.

  • train_test_split_size: float = 0.2 — The proportion of the dataset to include in the test split. The size of the Training set is set to the complement of this value. Must be between 0.0 and 1.0. Defaults to 0.2

  • label_columns: Union[str, int, List[str], List[int]] — The target label(s) of the column(s) in the dataset. Can be passed as strings representing the column names or integers representing the column numbers.

  • random_state: int - Random state (seed) for train_test_split.

Train parameters

Parameters to pass to the fit method of the model object.

  • train_kwargs: dict — Additional parameters to pass onto the fit method.

Logging parameters

Parameters to control the automatic logging feature of MLRun. You can adjust the logging outputs as relevant and if not passed, a default list of artifacts and metrics is produced and calculated.

  • model_name: str = "model” — The model’s name to use for storing the model artifact, defaults to ‘model’.

  • tag: str — The model’s tag to log with.

  • sample_set: Union[str, list, dict] — A sample set of inputs for the model for logging its stats alongside the model in favor of model monitoring. If not given, the training set is used instead.

    • Can be passed as part of inputs to be parsed as mlrun.DataItem, meaning it supports either a URI or a FeatureVector.

    • Can be passed as part of params, meaning it can be a list or a dict.

  • _artifacts: Dict[str, Union[list, dict]] — Additional artifacts to produce post training. See the ArtifactsLibrary of the desired framework to see the available list of artifacts.

  • _metrics: Union[List[str], Dict[str, Union[list, dict]]] — Additional metrics to calculate post training. See how to pass metrics and custom metrics in the MetricsLibrary of the desired framework.

  • apply_mlrun_kwargs: dict — Framework specific apply_mlrun key word arguments. Refer to the framework of choice to know more (SciKit-Learn, XGBoost or LightGBM)

Custom objects parameters

Parameters to include custom objects like custom model class, metric code and artifact plan. Keep in mind that the model artifact created is logged with the custom objects, so if model_path is used, the custom objects used to train it are not required for loading it, it happens automatically.

  • custom_objects_map: Union[str, Dict[str, Union[str, List[str]]]] — A map of all the custom objects required for loading, training and testing the model. Can be passed as a dictionary or a json file path. Each key is a path to a python file and its value is the custom object name to import from it. If multiple objects needed to be imported from the same py file a list can be given. For example:

    {
        "/.../custom_model.py": "MyModel",
        "/.../custom_objects.py": ["object1", "object2"]
    }
    

    All the paths are accessed from the given ‘custom_objects_directory’, meaning each py file is read from ‘custom_objects_directory/’. If the model path given is of a store object, the custom objects map is read from the logged custom object map artifact of the model.

    Note

    The custom objects are imported in the order they came in this dictionary (or json). If a custom object is dependant on another, make sure to put it below the one it relies on.

  • custom_objects_directory: Path to the directory with all the python files required for the custom objects. Can be passed as a zip file as well (and are extracted during the start of the run).

    Note

    The parameters for additional arguments model_kwargs, train_kwargs and apply_mlrun_kwargs can be also passed in the global kwargs with the matching prefixes: "MODEL_", "TRAIN_", "MLRUN_".

Evaluate#

The "evaluate" handler is used to test the model on a given testing set and log its results. This is a common phase in every model lifecycle and should be done periodically on updated testing sets to confirm that your model is still relevant. The function uses SciKit-Learn’s API for evaluation, meaning the function follows the structure below:

  1. Get the data: Get the testing dataset passed to a local path.

  2. Get the model: Get the model object out of the ModelArtifact URI.

  3. Predict: Call the model’s predict (and predict_proba if needed) method to test it on the testing set.

  4. Log: Test the model on the testing set and log the results and artifacts.

MLRun orchestrates all of the above steps. The evaluation is done with the shortcut function apply_mlrun that enables the automatic logging and further features.

To evaluate the test-set, use the following command:

evaluate_run = auto_trainer.run(
    handler="evaluate",
    inputs={"dataset": train_run.outputs['test_set']},
    params={
        "model": train_run.outputs['model'],
        "label_columns": "labels",
    },
)

Outputs#

evaluate_run.outputs returns all the outputs. The outputs are:

  • Evaluated model: The evaluated model’s ModelArtifact is updated with all the following artifacts registered to it.

  • Test dataset: The test set used to test the model post-training is logged as a DatasetArtifact.

  • Plots: Informative plots regarding the model like confusion matrix and features importance are drawn and logged as PlotArtifacts.

  • Results: List of all the calculations of metrics tested on the testing set.

Parameters#

To view the parameters of evaluate, expand the section below:

evaluate handler parameters:

Model Parameters

Parameters to load a logged model.

  • model_path: str — A ModelArtifact URI to load.

Data parameters

Parameters to get a dataset and prepare it for training, splitting into training and testing if required.

  • dataset: Union[str, list, dict] — The dataset to train the model on.

    • Can be passed as part of inputs to be parsed as mlrun.DataItem, meaning it supports either a URI or a FeatureVector.

    • Can be passed as part of params, meaning it can be a list or a dict.

  • drop_columns: Union[str, int, List[str], List[int]] — columns to drop from the dataset. Can be passed as strings representing the column names or integers representing the column numbers.

  • label_columns: Union[str, int, List[str], List[int]] — The target label(s) of the column(s) in the dataset. Can be passed as strings representing the column names or integers representing the column numbers.

Predict parameters

Parameters to pass to the predict method of the model object.

  • predict_kwargs: dict — Additional parameters to pass onto the predict method.

Logging parameters

Parameters to control the automatic logging feature of MLRun. You can adjust the logging outputs as relervant, and if not passed, a default list of artifacts and metrics is produced and calculated.

  • _artifacts: Dict[str, Union[list, dict]] — Additional artifacts to produce post training. See the ArtifactsLibrary of the desired framework to see the available list of artifacts.

  • _metrics: Union[List[str], Dict[str, Union[list, dict]]] — Additional metrics to calculate post training. See how to pass metrics and custom metrics in the MetricsLibrary of the desired framework.

  • apply_mlrun_kwargs: dict — Framework specific apply_mlrun key word arguments. Refer to the framework of choice to know more (SciKit-Learn, XGBoost or LightGBM).

Custom objects parameters

Parameters to include custom objects for the evaluation like custom metric code and artifact plans. Keep in mind that the custom objects used to train the model are not required for loading it, it happens automatically.

  • custom_objects_map: Union[str, Dict[str, Union[str, List[str]]]] — A map of all the custom objects required for loading, training and testing the model. Can be passed as a dictionary or a json file path. Each key is a path to a python file and its value is the custom object name to import from it. If multiple objects needed to be imported from the same py file a list can be given. For example:

    {
        "/.../custom_metric.py": "MyMetric",
        "/.../custom_plans.py": ["plan1", "plan2"]
    }
    

    All the paths are accessed from the given ‘custom_objects_directory’, meaning each py file is read from the ‘custom_objects_directory/’. If the model path given is of a store object, the custom objects map is read from the logged custom object map artifact of the model.

    Note

    The custom objects are imported in the order they came in this dictionary (or json). If a custom object is depended on another, make sure to put it below the one it relies on.

  • custom_objects_directory — Path to the directory with all the python files required for the custom objects. Can be passed as a zip file as well (iti is extracted during the start of the run).

    Note

    The parameters for additional arguments predict_kwargs and apply_mlrun_kwargs can be also passed in the global kwargs with the matching prefixes: "PREDICT_", "MLRUN_".