Using packagers to automate I/O in a gen AI agent pipeline

This tutorial demonstrates how MLRun packagers automate input parsing and output logging in a mock scenario: evaluating multiple gen-AI agents on a set of test prompts.

To avoid the need for API keys, the "agents" are simple string-formatting heuristics rather than real LLM calls. The focus is on how packagers handle mixed output types and input parsing, not on the evaluation logic itself.

You will learn how to:

- Log mixed output types (DataFrames, dicts, strings) using log hints
- Use LogHint objects for fine-grained control (labels, artifact types)
- Itemize (unbundle) a dict of responses into separate artifacts with the * prefix
- Pass a previously logged artifact as a typed input to a downstream function
- Understand the difference between params (direct values) and inputs (DataItems parsed by packagers)

In this section

- Setup
- Define the evaluation handler
- Create the MLRun function
- Run with mixed log hints
- Inspect the results
- Consuming packaged artifacts as inputs
- What the packagers did automatically
- Next steps

Setup

Import the required packages and create (or load) an MLRun project.

import mlrun
from mlrun import LogHint

project = mlrun.get_or_create_project("packagers-tutorial", "./")

Define the evaluation handler

The handler below simulates evaluating multiple gen AI agents. Each "agent" is just a string-formatting heuristic — no real LLM calls are made, so the notebook runs anywhere without API keys.

Notice that the function is pure Python: it has no MLRun imports and no context object. The log hints passed in returns (shown later) tell the packagers how to log each returned value.

%%writefile eval_agents.py

import pandas as pd


def evaluate_agents(
    agents_config: dict,
    prompts: list,
) -> tuple[pd.DataFrame, dict, dict, str]:
    """
    Evaluate simulated gen AI agents on a set of prompts.

    :param agents_config: Mapping of agent name to its configuration dict.
                          Each config has keys like 'style' and 'max_words'.
    :param prompts:       List of test prompt strings.

    :returns: A tuple of (evaluation DataFrame, best agent dict,
              all responses dict, summary string).
    """
    scores = []
    all_responses = {}

    for agent_name, config in agents_config.items():
        style = config.get("style", "neutral")
        max_words = config.get("max_words", 20)
        responses = []

        for prompt in prompts:
            # Simulated response — no real LLM call
            response = f"[{style}] Re: {prompt[:40]}... (max {max_words} words)"
            responses.append(response)

        # Simulated scoring heuristic
        relevance = len(style) * 7 % 100          # deterministic pseudo-score
        clarity = (max_words * 3 + 10) % 100
        overall = round((relevance + clarity) / 2, 1)

        scores.append({
            "agent": agent_name,
            "style": style,
            "relevance": relevance,
            "clarity": clarity,
            "overall": overall,
        })
        all_responses[agent_name] = responses

    # Build the evaluation DataFrame
    evaluation = pd.DataFrame(scores)

    # Identify the best agent; cast the score to a built-in float so the
    # logged result holds plain Python types rather than NumPy scalars
    best_idx = evaluation["overall"].idxmax()
    best_name = evaluation.loc[best_idx, "agent"]
    best_agent = {
        "name": best_name,
        "overall_score": float(evaluation.loc[best_idx, "overall"]),
        "config": agents_config[best_name],
    }

    # Human-readable summary
    summary = (
        f"Evaluated {len(agents_config)} agents on {len(prompts)} prompts. "
        f"Best agent: {best_agent['name']} (score: {best_agent['overall_score']})."
    )

    return evaluation, best_agent, all_responses, summary
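
Because the handler is plain Python, its scoring heuristic can be checked by hand before anything runs under MLRun. The sketch below repeats only the arithmetic from evaluate_agents in a hypothetical helper, `score_agent` (not part of the tutorial's code), and applies it to the three agent configurations that this tutorial passes to the run later on.

```python
def score_agent(style: str, max_words: int) -> dict:
    """Repeat the deterministic scoring arithmetic from evaluate_agents.

    `score_agent` is a hypothetical helper used only for this check.
    """
    relevance = len(style) * 7 % 100
    clarity = (max_words * 3 + 10) % 100
    overall = round((relevance + clarity) / 2, 1)
    return {"relevance": relevance, "clarity": clarity, "overall": overall}


# The same configurations that the tutorial passes via params
agents_config = {
    "concise-bot": {"style": "concise", "max_words": 15},
    "verbose-bot": {"style": "verbose", "max_words": 50},
    "formal-bot": {"style": "formal", "max_words": 30},
}

scores = {
    name: score_agent(cfg["style"], cfg["max_words"])
    for name, cfg in agents_config.items()
}
best = max(scores, key=lambda name: scores[name]["overall"])

for name, agent_scores in scores.items():
    print(name, agent_scores)
print("best:", best)
```

With these inputs the heuristic gives concise-bot 52.0, verbose-bot 54.5, and formal-bot 21.0, so verbose-bot should be reported as the best agent.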

Create the MLRun function

Register the handler file as an MLRun function. Setting kind="job" means it can run locally or on a Kubernetes cluster.

eval_agents = project.set_function(
    "eval_agents.py",
    name="eval-agents",
    kind="job",
    image="mlrun/mlrun",
    handler="evaluate_agents",
)

Run with mixed log hints

Note that agents_config and prompts are passed via params={} — they arrive as plain Python objects (a dict and a list) directly, with no packager involvement. Packagers only parse values passed via inputs={}, which flow through DataItems. The second handler below shows input parsing in action.

The returns list uses four different log-hint styles to demonstrate the full range of the packagers' output capabilities:

| Return value | Log hint | What happens |
| --- | --- | --- |
| `evaluation` (DataFrame) | `"evaluation : dataset"` | String shortcut: logged as a `DatasetArtifact` |
| `best_agent` (dict) | `LogHint(key="best_agent", labels={...})` | LogHint object: logged as a `result` (dict default) with custom labels |
| `all_responses` (dict of lists) | `"*all_responses"` | Unbundled: each agent's response list becomes a separate artifact |
| `summary` (str) | `"summary"` | Key only: artifact type inferred from the value type (`result` for `str`) |

eval_agents_run = eval_agents.run(
    local=True,
    params={
        "agents_config": {
            "concise-bot": {"style": "concise", "max_words": 15},
            "verbose-bot": {"style": "verbose", "max_words": 50},
            "formal-bot": {"style": "formal", "max_words": 30},
        },
        "prompts": [
            "Explain the benefits of retrieval-augmented generation.",
            "Compare fine-tuning vs. prompt engineering.",
            "Summarize best practices for LLM evaluation.",
        ],
    },
    returns=[
        "evaluation : dataset",  # string shortcut
        LogHint(key="best_agent", labels={"stage": "eval"}),  # LogHint with labels
        "*all_responses",  # unbundled dict
        "summary",  # key only
    ],
)
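
To make the string shortcuts concrete, here is a minimal sketch of how a hint such as "evaluation : dataset" or "*all_responses" could be split into its parts. `parse_log_hint` is a hypothetical helper written for this illustration. It is not MLRun's parser, and the real implementation may handle more cases.

```python
def parse_log_hint(hint: str) -> dict:
    """Split a string log hint into key, artifact type, and unbundle flag.

    Hypothetical helper for illustration only, not MLRun's own parser.
    """
    hint = hint.strip()
    unbundle = hint.startswith("*")
    if unbundle:
        hint = hint[1:]  # drop the unbundle marker before reading the key
    key, separator, artifact_type = hint.partition(":")
    parsed = {"key": key.strip(), "unbundle": unbundle}
    if separator and artifact_type.strip():
        parsed["artifact_type"] = artifact_type.strip()
    return parsed


for hint in ["evaluation : dataset", "*all_responses", "summary"]:
    print(hint, "->", parse_log_hint(hint))
```

When no artifact type is given, the sketch leaves it out of the parsed hint, which mirrors the "key only" case where the type is inferred from the returned value.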

Inspect the results

The run's outputs dictionary contains all logged artifacts and results. Here's a look at each one.

eval_agents_run.outputs

Evaluation DataFrame

The evaluation output was logged as a DatasetArtifact. You can retrieve it as a DataFrame directly.

eval_agents_run.artifact("evaluation").as_df()

Best agent (result)

The best_agent dict was logged as a result — a lightweight value stored directly in the run's metadata (no artifact file created).

eval_agents_run.outputs["best_agent"]

Summary (result)

The summary string was also logged as a result.

eval_agents_run.outputs["summary"]

Itemized responses

Because the log hint was "*all_responses", the packager unbundled the dict: each key (concise-bot, verbose-bot, formal-bot) became a separate artifact. You can see them in the outputs with a prefix of all_responses_.

# List all unbundled response keys
[key for key in eval_agents_run.outputs if key.startswith("all_responses")]
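
The effect of the * prefix can be pictured as flattening the dict into one entry per key. The sketch below assumes the all_responses_ key prefix described above. `unbundle` is a hypothetical helper, not an MLRun function, and the sample responses are placeholders.

```python
def unbundle(key: str, value: dict) -> dict:
    """Flatten a dict into one entry per item, prefixing each with `key`.

    Hypothetical helper; the "<key>_<item>" naming follows this tutorial's
    description of the outputs and is an assumption about the exact scheme.
    """
    return {f"{key}_{item_key}": item for item_key, item in value.items()}


# Placeholder responses, standing in for the lists built by the handler
all_responses = {
    "concise-bot": ["response 1", "response 2"],
    "verbose-bot": ["response 1", "response 2"],
}
flattened = unbundle("all_responses", all_responses)
print(sorted(flattened))
```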

Consuming packaged artifacts as inputs

This is where packager input parsing comes into play. When you pass a previously logged artifact via inputs={}, it arrives as a DataItem. The packager sees the type hint on the function parameter and automatically converts it to the declared Python type — no manual .as_df() or json.loads() needed.

This is distinct from params={}, which passes plain JSON-serializable Python values directly to the function with no packager involvement.
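
The idea of type-hint-driven parsing can be sketched without MLRun. In the example below, `FakeDataItem`, `CONVERTERS`, `call_with_parsed_inputs`, and `count_long_prompts` are all hypothetical stand-ins: the fake item stores JSON text, and the dispatcher converts it only for arguments supplied through inputs, leaving params untouched. This is a conceptual model under those assumptions, not a description of how MLRun implements packagers.

```python
import json
from typing import Callable, get_type_hints


class FakeDataItem:
    """Hypothetical stand-in for a stored artifact reference (holds JSON text)."""

    def __init__(self, body: str):
        self.body = body


# Hypothetical converters: hinted type -> function that unpacks a FakeDataItem
CONVERTERS = {
    list: lambda item: list(json.loads(item.body)),
    dict: lambda item: dict(json.loads(item.body)),
    str: lambda item: item.body,
}


def call_with_parsed_inputs(handler: Callable, inputs: dict, params: dict):
    """Convert each input to its hinted type, then call the handler."""
    hints = get_type_hints(handler)
    parsed = {}
    for name, item in inputs.items():
        hinted_type = hints.get(name)
        if hinted_type in CONVERTERS:
            parsed[name] = CONVERTERS[hinted_type](item)
        else:
            parsed[name] = item  # no usable hint: pass the item through as-is
    return handler(**parsed, **params)  # params are forwarded unchanged


def count_long_prompts(prompts: list, min_chars: int = 10) -> int:
    """Plain Python handler: it never sees the FakeDataItem wrapper."""
    return sum(len(prompt) >= min_chars for prompt in prompts)


result = call_with_parsed_inputs(
    count_long_prompts,
    inputs={"prompts": FakeDataItem('["short", "a much longer prompt"]')},
    params={"min_chars": 10},
)
print(result)
```

The handler receives a real list for prompts and a plain integer for min_chars, which is the same division of labor the tutorial describes for inputs and params.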

Here's a second handler that takes the evaluation DataFrame as input and returns a filtered version.

%%writefile filter_agents.py

import pandas as pd


def filter_top_agents(
    evaluation: pd.DataFrame,
    min_score: float = 40.0,
) -> pd.DataFrame:
    """
    Filter agents whose overall score meets the threshold.

    :param evaluation: The evaluation DataFrame produced by evaluate_agents.
    :param min_score:  Minimum overall score to keep.

    :returns: Filtered DataFrame.
    """
    return evaluation[evaluation["overall"] >= min_score]

filter_agents = project.set_function(
    "filter_agents.py",
    name="filter-agents",
    kind="job",
    image="mlrun/mlrun",
    handler="filter_top_agents",
)

Now pass the evaluation artifact from the previous run as an input. The packager sees the type hint evaluation: pd.DataFrame and automatically converts the DataItem to a DataFrame — no .as_df() call needed inside the function.

filter_agents_run = filter_agents.run(
    local=True,
    inputs={"evaluation": eval_agents_run.outputs["evaluation"]},
    params={"min_score": 40.0},
    returns=["top_agents : dataset"],
)

filter_agents_run.artifact("top_agents").as_df()
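
As a cross-check on the expected result, the same threshold can be applied by hand to the overall scores that the heuristic produces for the three agents (52.0, 54.5, and 21.0, derived from the scoring formulas in evaluate_agents). The sketch uses plain dicts instead of a DataFrame so that it runs without pandas. `rows` and `top_agents` are illustrative names.

```python
# Overall scores derived by hand from the scoring formulas in evaluate_agents
rows = [
    {"agent": "concise-bot", "overall": 52.0},
    {"agent": "verbose-bot", "overall": 54.5},
    {"agent": "formal-bot", "overall": 21.0},
]
min_score = 40.0

# Same comparison as evaluation["overall"] >= min_score, applied row by row
top_agents = [row["agent"] for row in rows if row["overall"] >= min_score]
print(top_agents)
```

Only concise-bot and verbose-bot clear the 40.0 threshold, so those are the rows to expect in the top_agents dataset.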

What the packagers did automatically

Here's a summary of what happened behind the scenes — and what you would have had to do manually without packagers:

| Step | With packagers | Without packagers |
| --- | --- | --- |
| Pass `agents_config` and `prompts` | `params={}`: direct values, no packager involved | Same: `params` always pass values directly |
| Log evaluation DataFrame | `"evaluation : dataset"` | `context.log_dataset("evaluation", df=evaluation)` |
| Log best_agent dict with labels | `LogHint(key=..., labels=...)` | `context.log_result("best_agent", best_agent)` |
| Unbundle responses per agent | `"*all_responses"` | Manual loop: `for name, resp in all_responses.items(): context.log_artifact(...)` |
| Log summary string | `"summary"` | `context.log_result("summary", summary)` |
| Parse DataFrame input in 2nd function | `inputs={}` + type hint `pd.DataFrame`: automatic | `evaluation = data_item.as_df()` |

Key distinction: params pass plain Python values — no packager processing. inputs pass DataItem references — packagers parse them into the type-hinted type. Packager output logging applies to returns regardless of how the inputs were provided.

With packagers, the handler functions contain zero MLRun-specific code — they are pure Python functions that can also be tested and debugged outside of MLRun.
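
For contrast, the sketch below shows roughly what part of the handler would look like with manual logging. `RecordingContext` and `summarize_manually` are hypothetical: the stand-in context only records calls so that the example runs without MLRun. In a real job the handler would receive MLRun's context object, and `log_result` is the call named in the table above.

```python
class RecordingContext:
    """Hypothetical stand-in for the MLRun context; it only records calls."""

    def __init__(self):
        self.results = {}

    def log_result(self, key, value):
        self.results[key] = value


def summarize_manually(context, agent_scores: dict) -> None:
    """Handler variant that logs its own outputs instead of returning them."""
    best_name = max(agent_scores, key=agent_scores.get)
    best_agent = {"name": best_name, "overall_score": agent_scores[best_name]}
    summary = (
        f"Evaluated {len(agent_scores)} agents. "
        f"Best agent: {best_name} (score: {agent_scores[best_name]})."
    )
    # Every output needs an explicit logging call tied to the context
    context.log_result("best_agent", best_agent)
    context.log_result("summary", summary)


context = RecordingContext()
summarize_manually(context, {"concise-bot": 52.0, "verbose-bot": 54.5})
print(context.results["summary"])
```

This variant depends on a context argument and returns nothing, so it cannot be called and asserted on like an ordinary function. That coupling is what the returns log hints remove.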

Next steps

- Read the full packagers guide for details on all built-in packagers, the LogHint fields, and artifact types
- See the custom packagers tutorials to learn how to write packagers for your own types