Logging datasets#

Storing datasets is important in order to keep a record of the data that was used to train models, as well as to store any processed data. MLRun comes with built-in support for the pandas DataFrame format. MLRun not only stores the DataFrame, but also provides information about the data, such as statistics.

The simplest way to store a dataset is with the following code:

context.log_dataset(key="my_data", df=df)

Here, key is the name of the artifact and df is the DataFrame. By default, MLRun stores a short preview of 20 lines. You can change the number of lines by setting the preview parameter.

MLRun also calculates statistics on all numeric fields of the DataFrame. You can enable statistics regardless of the DataFrame size by setting the stats parameter to True.
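For example, a minimal sketch that uses the preview and stats parameters described above to store a longer preview and force the statistics calculation:

context.log_dataset(
    key="my_data",
    df=df,
    preview=50,  # store a 50-line preview instead of the default 20
    stats=True,  # calculate statistics regardless of the DataFrame size
)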

Logging a dataset from a job#

The following example shows how to work with datasets from a job:

from os import path
from mlrun.execution import MLClientCtx
from mlrun.datastore import DataItem


# Ingest a data set into the platform
def get_data(context: MLClientCtx, source_url: DataItem, format: str = "csv"):
    # Load the source data into a DataFrame
    iris_dataset = source_url.as_df()

    target_path = path.join(context.artifact_path, "data")
    # Optionally print data to your logger
    context.logger.info("Saving Iris data set to {} ...".format(target_path))

    # Store the data set in your artifacts database
    context.log_dataset(
        "iris_dataset",
        df=iris_dataset,
        format=format,
        index=False,
        artifact_path=target_path,
    )

This code can be placed in a Python file or in a cell in a Jupyter notebook. You can run this function locally or as a job. For example, to run it locally:

from os import path
from mlrun import new_project, mlconf

project_name = "my-project"
project_path = path.abspath("conf")
project = new_project(project_name, project_path, init_git=True)

# Target location for storing pipeline artifacts
artifact_path = path.abspath("jobs")
# MLRun DB path or API service URL
mlconf.dbpath = mlconf.dbpath or "http://mlrun-api:8080"

source_url = "https://s3.wasabisys.com/iguazio/data/iris/iris_dataset.csv"

# Create a function from py or notebook (ipynb) file
get_data_func = project.set_function(
    "./get_data.py", name="get_data", kind="job", image="mlrun/mlrun"
)

# Run get-data function locally
get_data_run = get_data_func.run(
    handler="get_data",
    inputs={"source_url": source_url},
    artifact_path=artifact_path,
    local=True,
)
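Because the function was created with kind="job", the same call can also submit it for execution on the cluster. A minimal sketch, assuming a working MLRun API service and that the code is available to the job, which only drops the local flag:

get_data_run = get_data_func.run(
    handler="get_data",
    inputs={"source_url": source_url},
    artifact_path=artifact_path,
    local=False,  # submit as a job on the cluster instead of running in-process
)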

The dataset location is returned in the run's outputs field. To get the dataset itself, call get_data_run.artifact('iris_dataset').

# Read your data set
get_data_run.artifact("iris_dataset").as_df()

# Visualize an artifact in Jupyter (image, html, df, ..);
# "confusion-matrix" is shown here as a general example from a training run
get_data_run.artifact("confusion-matrix").show()

The dataset returned from the run result is of the DataItem type. It allows access to the data itself as a pandas DataFrame by calling dataset.as_df(). It also contains the artifact's metadata, accessed via dataset.meta. This metadata object contains the calculated statistics, the schema of the dataset, and other fields describing the dataset. For example, call dataset.meta.stats to obtain the data statistics.
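Putting this together, a minimal sketch of inspecting the returned dataset (the stats and schema fields follow the description above):

# Retrieve the dataset artifact from the run result
dataset = get_data_run.artifact("iris_dataset")

df = dataset.as_df()  # read the data as a pandas DataFrame
meta = dataset.meta  # artifact metadata object
print(meta.stats)  # the calculated statistics
print(meta.schema)  # the dataset schema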