Packagers overview#
Learn about built-in and custom packagers, and how to configure and use them.
What are packagers?#
Writing a function locally and running it remotely should feel identical. For example, running locally, your function accepts a DataFrame and returns a cleaned dataset — objects live in memory and everything just works. But when you move that same function to a remote job, Python objects can't be sent over the wire, and whatever the function returns simply disappears into the void.
Packagers bridge this gap: they serialize inputs before they reach your code and capture outputs after it runs, handling all the MLRun-specific I/O behind the scenes so your function stays pure Python regardless of where it executes.
Every MLRun function has two I/O touch-points:
Input parsing: automatically cast DataItem inputs to the type-hinted Python type (e.g. pd.DataFrame, np.ndarray, dict)
Output logging: automatically serialize, log, and upload returned objects as artifacts or results based on log hints
The flow looks like this:
Input flow: inputs={"data": "store://..."} → DataItem → packager unpack() → typed Python object → your function
Output flow: your function return → Python object → packager pack() → Artifact / Result → artifact store
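The two flows can be pictured with a toy packager. The sketch below is not MLRun's implementation: `ToyJsonPackager`, its `pack`/`unpack` signatures, and the temporary directory standing in for the artifact store are all assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path


class ToyJsonPackager:
    """Hypothetical packager for dicts; not part of MLRun."""

    def pack(self, obj: dict, key: str, store_dir: Path) -> Path:
        # Output flow: Python object -> serialized artifact in a store.
        target = store_dir / f"{key}.json"
        target.write_text(json.dumps(obj))
        return target

    def unpack(self, artifact_path: Path) -> dict:
        # Input flow: stored artifact -> typed Python object.
        return json.loads(artifact_path.read_text())


def add_total(config: dict) -> dict:
    # The user function only ever sees plain Python objects.
    return {**config, "total": config["a"] + config["b"]}


with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp)
    packager = ToyJsonPackager()
    stored_input = packager.pack({"a": 1, "b": 2}, "config", store)
    result = add_total(packager.unpack(stored_input))
    stored_output = packager.pack(result, "result", store)
    round_trip = packager.unpack(stored_output)
```

The user function never touches files or serialization; both conversions happen around it.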
Why use packagers?#
Packagers offer several advantages over manual artifact handling and the legacy context-based API.
Better and faster learning curve#
With packagers you don't need to learn about Artifacts, DataItems, or the MLRun context object. You write standard
Python with type hints and return values, and MLRun wraps your existing code without changing it.
Before — manual artifact handling:
import mlrun
import pandas as pd

def clean_data(context: mlrun.MLClientCtx, raw_data: mlrun.DataItem):
    # Parse input manually
    df = raw_data.as_df()
    # Drop rows with missing values and duplicates
    cleaned = df.dropna().drop_duplicates()
    row_count = len(cleaned)
    # Log outputs manually
    context.log_result("row_count", row_count)
    context.log_dataset("cleaned_data", df=cleaned, format="parquet")
After — with packagers:
import pandas as pd

def clean_data(raw_data: pd.DataFrame) -> tuple[int, pd.DataFrame]:
    cleaned = raw_data.dropna().drop_duplicates()
    return len(cleaned), cleaned
The function is pure Python — no MLRun imports, no manual serialization. When you run it with:
fn.run(
    handler="clean_data",
    inputs={"raw_data": "store://my-raw-data"},
    returns=["row_count", "cleaned_data : dataset"],
)
MLRun automatically converts the DataItem to a DataFrame on input and logs the row count as a result and the
cleaned DataFrame as a dataset artifact on output.
Uniformity of artifacts between users and projects#
ML engineers can establish a standardized method for artifact serialization once, ensuring consistent enforcement across every development notebook, CI pipeline, and production project in the organization. Because packagers standardize the serialization format, artifacts become truly portable — a DataFrame logged in one project can be consumed by a function in a completely different project without conversion steps or format mismatches (assuming, of course, that access is allowed across these projects).
In a pipeline, functions don't need to agree on file formats or know about MLRun's artifact
API. The producer just returns the object and the consumer receives it as a typed
parameter:
# Producer: returns a DataFrame
def prepare_data(raw: pd.DataFrame) -> pd.DataFrame:
    return raw.dropna()

# Consumer: receives a DataFrame directly
def train_model(data: pd.DataFrame): ...
No manual data_item.as_df() calls, no format negotiation — the same artifact flows
cleanly between functions, projects, and teams.
Adaptive to user needs#
MLRun provides common built-in packagers with rich options and configurations. For example, you can control the output format with a single log hint string:
returns = ['data : dataset[format="parquet"]']
or equivalently using the LogHint class for full
control:
from mlrun import LogHint

returns = [
    LogHint(key="data", artifact_type="dataset", packing_kwargs={"format": "parquet"})
]
Either form replaces manual pandas I/O and artifact construction.
Beyond built-in packagers, MLRun supports custom packagers that you write and register in your project to handle domain-specific types. See the custom packagers tutorials.
How to use packagers#
Parsing inputs with type hints#
Type hints on function parameters tell packagers what Python type each input
should be converted to. When you pass a value via inputs={}, it arrives as a
DataItem. The packager looks at the type hint and automatically converts it
to the declared type — pd.DataFrame, dict, np.ndarray, etc.
def my_handler(data: pd.DataFrame, config: dict):
    # `data` is already a DataFrame; no .as_df() needed
    # `config` is already a dict; no .get() / json.loads() needed
    ...

fn.run(
    handler="my_handler",
    inputs={"data": "store://my-dataset", "config": "store://my-config"},
)
Packagers are enabled by default (mlrun.mlconf.packagers.enabled = True).
When enabled, the runtime automatically parses all type-hinted arguments
that are passed via inputs={}. To disable parsing for a specific run,
set mlrun.mlconf.packagers.enabled = False.
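To make the idea of hint-driven parsing concrete, here is a minimal sketch of how a wrapper could read a handler's type hints and convert raw inputs before calling it. It does not reproduce MLRun's runtime: `CONVERTERS`, `call_with_parsed_inputs`, and the `enabled` argument are hypothetical names, and raw inputs are plain strings rather than DataItems.

```python
import json
from typing import Any, Callable, get_type_hints

# Hypothetical converters from a raw string payload to a hinted type.
CONVERTERS: dict[type, Callable[[str], Any]] = {
    dict: json.loads,
    int: int,
    float: float,
}


def call_with_parsed_inputs(handler: Callable, raw_inputs: dict[str, str], enabled: bool = True):
    """Convert each raw input to the handler's hinted type, then call it."""
    hints = get_type_hints(handler) if enabled else {}
    kwargs = {}
    for name, raw in raw_inputs.items():
        # Inputs without a known hint are passed through untouched.
        convert = CONVERTERS.get(hints.get(name), lambda value: value)
        kwargs[name] = convert(raw)
    return handler(**kwargs)


def scale(config: dict, factor: float):
    return config["value"] * factor


parsed = call_with_parsed_inputs(scale, {"config": '{"value": 4}', "factor": "2.5"})
```

With `enabled=False` the wrapper skips hint lookup entirely, so the handler receives the raw values.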
Logging outputs with log hints#
A log hint tells MLRun how to log a single returned value — what key to store it
under, what artifact type to use, and any serialization options. Log hints are
passed via the returns parameter on run:
fn.run(
    handler="train",
    inputs={"dataset": "store://my-dataset"},
    returns=["accuracy", "X_test : dataset", "model : model"],
)
Each entry in the returns list is a log hint — either a LogHint object or a
string shortcut. The sections below cover artifact types, the LogHint class, and
the string shortcut format.
Artifact types#
The artifact type is a string that determines how an object is serialized and
what metadata is stored. MLRun defines common types in mlrun.package.ArtifactType,
but custom packagers can implement any artifact type string they need — these are
just conventions that built-in packagers share:
| Type | Description | Typical objects |
|---|---|---|
| result | Scalar/simple value stored in run metadata | |
| dataset | Tabular data logged as a dataset artifact | |
| file | Generic file upload | |
| model | ML model artifact | scikit-learn models, torch models |
| plot | Visualization | matplotlib figures |
| object | Pickle serialization (fallback) | Any Python object |
| path | File/directory path | |
If you don't specify an artifact type, the packager for the object's type chooses
a sensible default. Custom packagers define their own defaults via
DEFAULT_PACKING_ARTIFACT_TYPE.
Asymmetric (pack-only / unpack-only) artifact types#
Packing and unpacking artifact types are discovered independently. A DefaultPackager
subclass with pack_foo but no unpack_foo supports "foo" for packing only —
is_packable accepts it but is_unpackable rejects it. The reverse also applies:
unpack_bar without pack_bar means "bar" is unpack-only.
Common scenarios:
Pack-only: saving plots as images, logging summary metrics as plain results, rendering a model to an image (the PNG can't be deserialized back to the original object)
Unpack-only: legacy/migration support (e.g. unpack_v1 reads artifacts from an older packager version while new writes always use pack_v2); cross-format compatibility (e.g. a DataFrame packager can unpack_csv to read manually-logged CSV artifacts but always pack_parquet for new outputs)
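The independence of the two directions can be illustrated with plain introspection. The sketch below assumes artifact types are derived from method-name prefixes, which matches the `pack_foo` / `unpack_bar` naming described above; `ToyPackager` and `discover` are hypothetical and do not show how `DefaultPackager` actually performs the lookup.

```python
class ToyPackager:
    """Hypothetical stand-in for a DefaultPackager subclass."""

    def pack_parquet(self, obj): ...
    def pack_png(self, obj): ...        # pack-only: there is no unpack_png
    def unpack_parquet(self, item): ...
    def unpack_csv(self, item): ...     # unpack-only: there is no pack_csv


def discover(packager_cls: type, prefix: str) -> set[str]:
    # Collect artifact types from method names such as pack_<type>.
    return {
        name[len(prefix):]
        for name in dir(packager_cls)
        if name.startswith(prefix) and callable(getattr(packager_cls, name))
    }


packable = discover(ToyPackager, "pack_")
unpackable = discover(ToyPackager, "unpack_")
```

Here `"png"` is packable only, `"csv"` is unpackable only, and `"parquet"` works in both directions.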
The LogHint class#
A LogHint gives you full control over logging —
artifact type, labels, extra data, metrics, and more:
from mlrun import LogHint

returns = [
    LogHint(key="model", artifact_type="model", labels={"version": "1"}),
    LogHint(key="data", artifact_type="dataset", packing_kwargs={"format": "csv"}),
]
A LogHint has the following fields:
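| Field | Type | Description |
|---|---|---|
| key | str | Required. The artifact key to log the object under. |
| artifact_type | str | The artifact type (e.g. "dataset", "model"). |
| tag | | Tag for the artifact. |
| itemized | bool or int | Unbundling control. |
| packing_kwargs | dict | Extra keyword arguments passed to the packager's pack method. |
| labels | dict | Labels to add to the logged artifact. |
| extra_data | dict | Extra data to attach to the artifact. Use ... as a placeholder to link another packed output. |
| metrics | dict | Metrics to log alongside a model artifact. Use ... as a placeholder to link a result. |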
Linking artifacts#
When a function returns multiple values, you can link them together so that
related outputs are attached to a primary artifact. For example, you might want a
model artifact to carry its evaluation metrics and supporting artifacts (plots, test
data) as part of its metadata. This is done through the extra_data and metrics
fields of LogHint, using Python's ... (Ellipsis) as a placeholder meaning
"fill this in with the package that has this key."
Consider a training function that returns a model alongside its metrics, a loss plot, and a test dataset:
def train(dataset: pd.DataFrame):
    # ... training logic ...
    return my_model, some_result, loss_plot, test_dataset

fn.run(
    handler="train",
    inputs={"dataset": "store://my-dataset"},
    returns=[
        LogHint(
            key="my_model",
            artifact_type="model",
            metrics={"some_result": ...},
            extra_data={"loss_plot": ..., "test_dataset": ...},
        ),
        "some_result : result",
        "loss_plot : plot",
        "test_dataset : dataset",
    ],
)
After all four values are packed, the packager manager resolves every ...:
"some_result" is a result (scalar), so it is placed into the model's metrics
"loss_plot" and "test_dataset" are artifacts, so they are placed into the model's extra_data
The result is a model artifact with its evaluation metrics and supporting data attached directly — visible as a single unit in the MLRun UI.
Note
Linking rules:
metrics is available only on model artifacts and can link to results only (scalar values in run metadata)
extra_data works with any artifact type and can link to both artifacts and results
If a referenced key is not found among the packed outputs, the entry is removed and a warning is logged
The order of items in returns does not matter; linking is resolved after all packing is complete
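The linking rules can be sketched as a small resolution pass that runs once every output is packed. This is an illustration only: `resolve_links`, the `packed` dictionary layout, and the placeholder strings standing in for artifacts are hypothetical, and dropped keys are returned in a list where MLRun is described as logging a warning.

```python
def resolve_links(placeholders: dict, packed: dict, results_only: bool = False):
    """Replace each Ellipsis with the packed output that has the same key."""
    resolved, dropped = {}, []
    for key, value in placeholders.items():
        if value is not ...:
            resolved[key] = value  # assumption: explicit values are kept as given
            continue
        target = packed.get(key)
        if target is None or (results_only and target["kind"] != "result"):
            dropped.append(key)  # unresolved entries are removed
            continue
        resolved[key] = target["value"]
    return resolved, dropped


# Hypothetical view of the outputs after packing.
packed = {
    "some_result": {"kind": "result", "value": 0.93},
    "loss_plot": {"kind": "artifact", "value": "<loss_plot artifact>"},
    "test_dataset": {"kind": "artifact", "value": "<test_dataset artifact>"},
}

metrics, dropped_metrics = resolve_links(
    {"some_result": ..., "missing": ...}, packed, results_only=True
)
extra_data, dropped_extra = resolve_links({"loss_plot": ..., "test_dataset": ...}, packed)
```

Because resolution happens after packing, the position of the model hint in `returns` has no effect on the outcome.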
String shortcut#
The most common way to specify a log hint. A string shortcut has up to four parts:
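| Part | Syntax | Purpose | Example |
|---|---|---|---|
| Key (required) | key | The artifact name | "accuracy" |
| Artifact type | key : type | Override the default type | "X_test : dataset" |
| Packing kwargs | key : type[kwarg=value] | Pass options to the packager | 'data : dataset[format="parquet"]' |
| Itemization prefix | *key or N*key | Unbundle a collection | "*results", "2*results" |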
Examples and their LogHint equivalents:
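| String | Equivalent LogHint |
|---|---|
| "accuracy" | LogHint(key="accuracy") |
| "X_test : dataset" | LogHint(key="X_test", artifact_type="dataset") |
| 'data : dataset[format="parquet"]' | LogHint(key="data", artifact_type="dataset", packing_kwargs={"format": "parquet"}) |
| "*results" | LogHint(key="results", itemized=True) |
| "2*results" | LogHint(key="results", itemized=2) |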
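For intuition about how the four parts combine, the following sketch parses a shortcut string into a plain dictionary of LogHint-style fields. The grammar is inferred only from the examples on this page, and `PATTERN`, `parse_kwargs`, and `parse_log_hint` are hypothetical helpers, not MLRun's parser.

```python
import ast
import re

# Assumed grammar: [N]*key : type[kwarg=value, ...]
PATTERN = re.compile(
    r"^\s*(?:(?P<depth>\d*)\*)?\s*(?P<key>[^:\[\]\s]+)\s*"
    r"(?::\s*(?P<type>[^\[\]\s]+)\s*(?:\[(?P<kwargs>.*)\])?)?\s*$"
)


def parse_kwargs(text: str) -> dict:
    # Reuse Python's own parser for the keyword-argument syntax.
    call = ast.parse(f"f({text})", mode="eval").body
    return {kw.arg: ast.literal_eval(kw.value) for kw in call.keywords}


def parse_log_hint(hint: str) -> dict:
    match = PATTERN.match(hint)
    if match is None:
        raise ValueError(f"unrecognized log hint: {hint!r}")
    parsed = {"key": match["key"]}
    if match["depth"] is not None:  # a '*' prefix was present
        parsed["itemized"] = int(match["depth"]) if match["depth"] else True
    if match["type"]:
        parsed["artifact_type"] = match["type"]
    if match["kwargs"]:
        parsed["packing_kwargs"] = parse_kwargs(match["kwargs"])
    return parsed


full_hint = parse_log_hint('data : dataset[format="parquet"]')
```

Using `ast.literal_eval` keeps the kwargs restricted to literals, so a hint string cannot execute arbitrary code.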
Itemization (unbundling)#
Unbundling breaks a collection (list or dict) into separate artifacts, each logged individually. This is useful when a function returns a dictionary of DataFrames and you want each one as its own dataset artifact.
def evaluate(data: pd.DataFrame) -> dict[str, pd.DataFrame]:
    """Returns per-category evaluation results."""
    results = {}
    for category in data["category"].unique():
        subset = data[data["category"] == category]
        results[category] = compute_metrics(subset)
    return results
Without unbundling, the entire dict is logged as a single artifact. With unbundling:
fn.run(handler="evaluate", inputs={"data": "store://eval-data"}, returns=["*results"])
Each DataFrame in the dict becomes its own dataset artifact, keyed as results_<category_name>.
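A rough model of unbundling is a recursive flatten with an optional depth limit. In the sketch below, `unbundle` is a hypothetical helper; the `results_<name>` key format for dict entries follows the text above, while using the index as the suffix for list items is an assumption.

```python
def unbundle(key: str, obj, depth=None) -> dict:
    """Flatten nested dicts/lists into {artifact_key: object}.

    depth=None means fully recursive; depth=N stops after N levels.
    """
    if not isinstance(obj, (dict, list)) or depth == 0:
        return {key: obj}  # logged as a single artifact
    items = obj.items() if isinstance(obj, dict) else enumerate(obj)
    next_depth = None if depth is None else depth - 1
    flat = {}
    for name, value in items:
        flat.update(unbundle(f"{key}_{name}", value, next_depth))
    return flat


nested = {"a": [1, {"x": 2}], "b": 3}
fully_flat = unbundle("results", nested)
two_levels = unbundle("results", nested, depth=2)
```

With a limit of 2, the inner dict under `results_a_1` stays bundled as one object.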
Depth control
"*results" or itemized=True: fully recursive unbundling.
"2*results" or itemized=2: unbundle up to 2 levels deep. Nested collections beyond that depth are logged as single artifacts.
Configuration#
Packager behavior is controlled by settings under mlrun.mlconf.packagers:
| Setting | Default | Description |
|---|---|---|
| enabled | True | Master switch. When enabled, MLRun automatically wraps every function execution with the packager handler, parsing typed inputs and logging returned outputs. Set to False to disable. |
| | | When |
| | | When |
| | | The base key used in the auto-generated artifact name when |
| | | When |
| | | In multi-worker runs, only the worker with this rank packs outputs and logs results/artifacts. Other workers skip logging to avoid overriding each other. Default is |
You can change these settings globally:
import mlrun
mlrun.mlconf.packagers.auto_unpack_inputs = True
Note
You can also set these options via environment variables. Use the MLRUN_ prefix
with __ (double underscore) as the nesting separator:
MLRUN_PACKAGERS__ENABLED=true
MLRUN_PACKAGERS__AUTO_PACK_OUTPUTS=true
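The naming scheme maps each variable onto a nested setting path. The sketch below only illustrates that mapping: `env_to_config` is a hypothetical helper, and how MLRun itself parses the values is not covered here.

```python
def env_to_config(environ: dict[str, str], prefix: str = "MLRUN_") -> dict:
    """Map PREFIX_SECTION__OPTION variables onto a nested dict."""
    config: dict = {}
    for name, raw in environ.items():
        if not name.startswith(prefix):
            continue  # unrelated environment variables are ignored
        *parents, leaf = name[len(prefix):].lower().split("__")
        node = config
        for part in parents:
            node = node.setdefault(part, {})
        # Only booleans are interpreted in this sketch; other values stay strings.
        node[leaf] = {"true": True, "false": False}.get(raw.lower(), raw)
    return config


settings = env_to_config(
    {
        "MLRUN_PACKAGERS__ENABLED": "true",
        "MLRUN_PACKAGERS__AUTO_PACK_OUTPUTS": "true",
        "HOME": "/root",
    }
)
```

The double underscore marks one level of nesting, so `MLRUN_PACKAGERS__ENABLED` corresponds to `packagers.enabled`.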
Built-in packagers#
MLRun includes packagers for common Python types. All built-in packagers are available automatically — no registration needed.
Python standard library#
Handles None, int, float, bool, str, dict, list, tuple, set,
frozenset, bytes, bytearray, and pathlib.Path.
API reference: python_standard_library_packagers
NumPy#
Handles np.ndarray, np.number, and collections of arrays (list[np.ndarray],
dict[str, np.ndarray]).
API reference: numpy_packagers
Pandas#
Handles pd.DataFrame and pd.Series.
API reference: pandas_packagers
Default (fallback)#
Any unrecognized type is handled by the DefaultPackager, which
serializes objects using cloudpickle (or any pickling module configured). The default artifact type is object.
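As a rough picture of what a pickle-based fallback does, the sketch below round-trips an object that JSON could not represent directly. It uses the standard-library `pickle` module as a stand-in; the text above names cloudpickle as the serializer, and nothing here reflects the internals of `DefaultPackager`.

```python
import pickle

# An object JSON cannot represent directly: a tuple value and a set.
unknown = {"thresholds": (0.1, 0.9), "labels": {"cat", "dog"}}

payload = pickle.dumps(unknown)   # bytes a fallback packager might store
restored = pickle.loads(payload)  # what a consumer would get back
```

The restored object keeps its Python types, at the cost of a format that is opaque to humans and tied to Python.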
Creating a custom packager#
When a built-in packager doesn't handle your type (or you want human-readable serialization instead of pickle), you can write a custom packager. The custom packagers guide walks through the full process — choosing a base class, setting class variables, implementing pack/unpack methods, and registering the packager in your project.
Note
When running remotely, set the project source with pull_at_runtime=True
so the packager module can be imported on the remote worker.
See Setting a project source, either remote or archive. The git repo or the archive file needs to include the custom packager files.
The custom packager needs to be available during runtime. See how to do that in Make the packager importable on the remote worker.
See also
Automated experiment tracking: framework-specific auto-logging with apply_mlrun()
Working with data and model artifacts: manual artifact handling
mlrun.package: API reference