Function of type job

Function of type job#

You can deploy a model using a KubejobRuntime() type function, which runs the code in a Kubernetes Pod.

You can create (register) a job function with basic attributes such as code, requirements, image, etc. using the set_function() method. You can also import an existing job function/template from the MLRun hub.

Functions can be created from a single code, notebook file, or have access to the entire project context directory. (By adding the with_repo=True flag, the project context is cloned into the function runtime environment.)

Examples:

# register a (single) python file as a function
project.set_function(
    "src/data_prep.py",
    name="data-prep",
    image="mlrun/mlrun",
    handler="prep",
    kind="job",
)

# register a notebook file as a function, specify custom image and extra requirements
project.set_function(
    "src/mynb.ipynb",
    name="test-function",
    image="my-org/my-image",
    handler="run_test",
    requirements=["scikit-learn"],
    kind="job",
)

# register a module.handler as a function (requires defining the default sources/work dir, if it's not root)
project.spec.workdir = "src"
project.set_function(
    name="train",
    handler="training.train",
    image="mlrun/mlrun",
    kind="job",
    with_repo=True,
)

To run the job:

project.run_function("train")

You can configure automatic retries for failed job runs by passing a retry parameter when running a function. This enables MLRun to automatically handle transient errors such as runtime failures, OOM, or pod evictions.

Example:

project.run_function(
    "train",
    retry={
        "count": 3,  # total retries allowed (0 to disable retries)
        "backoff": {
            "base_delay": 30,  # delay in seconds between retries
        },
    },
)

In the UI, you can view:

  • Retries status in Jobs and Workflows > Monitor Jobs under the Retries column.

  • Pending retries in Jobs and Workflows > Monitor Jobs.

  • Logs per attempt in the Logs tab by selecting an attempt from the drop-down list.

If notifications are configured for the run, the final notification (success or failure) is sent after the last attempt and includes the total number of retries.

See also