# Logging datasets
Storing datasets is important for keeping a record of the data that was used to train models, as well as for storing any processed data. MLRun comes with built-in support for the pandas DataFrame format. MLRun not only stores the DataFrame, but also provides information about the data, such as statistics.
The simplest way to store a dataset is with the following code:
```python
context.log_dataset(key="my_data", df=df)
```
Where `key` is the name of the artifact and `df` is the DataFrame. By default, MLRun stores a short preview of 20 lines. You can change the number of lines by setting the `preview` parameter.
MLRun also calculates statistics on all the numeric fields of the DataFrame. You can enable statistics regardless of the DataFrame size by setting the `stats` parameter to `True`.
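For example, a minimal sketch using both parameters (the values here are illustrative):

```python
# Log the dataset with a 50-line preview, and force statistics
# calculation regardless of the DataFrame size
context.log_dataset(key="my_data", df=df, preview=50, stats=True)
```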
## Logging a dataset from a job
The following example shows how to work with datasets from a job:
```python
from os import path

from mlrun.execution import MLClientCtx
from mlrun.datastore import DataItem


# Ingest a data set into the platform
def get_data(context: MLClientCtx, source_url: DataItem, format: str = "csv"):
    iris_dataset = source_url.as_df()
    target_path = path.join(context.artifact_path, "data")

    # Optionally print data to your logger
    context.logger.info("Saving Iris data set to {} ...".format(target_path))

    # Store the data set in your artifacts database
    context.log_dataset(
        "iris_dataset",
        df=iris_dataset,
        format=format,
        index=False,
        artifact_path=target_path,
    )
```
This code can be placed in a Python file, or in a cell of a Jupyter notebook. You can run this function locally or as a job. For example, to run it locally:
```python
from os import path

from mlrun import new_project, mlconf

project_name = "my-project"
project_path = path.abspath("conf")
project = new_project(project_name, project_path, init_git=True)

# Target location for storing pipeline artifacts
artifact_path = path.abspath("jobs")

# MLRun DB path or API service URL
mlconf.dbpath = mlconf.dbpath or "http://mlrun-api:8080"

source_url = "https://s3.wasabisys.com/iguazio/data/iris/iris_dataset.csv"

# Create a function from a py or notebook (ipynb) file
get_data_func = project.set_function(
    "./get_data.py", name="get_data", kind="job", image="mlrun/mlrun"
)

# Run the get-data function locally
get_data_run = get_data_func.run(
    handler="get_data",
    inputs={"source_url": source_url},
    artifact_path=artifact_path,
    local=True,
)
```
The dataset location is returned in the `outputs` field of the run object, and you can fetch the dataset itself by calling `get_data_run.artifact('iris_dataset')`:
```python
# Read your data set
get_data_run.artifact("iris_dataset").as_df()

# Visualize an artifact in Jupyter (image, html, df, ..)
get_data_run.artifact("confusion-matrix").show()
```
The dataset returned from the run result is of the `DataItem` type. It allows access to the data itself as a pandas DataFrame by calling `dataset.as_df()`. It also contains the metadata of the artifact, accessed using `dataset.meta`. This artifact metadata object contains the calculated statistics, the schema of the dataset, and other fields describing the dataset. For example, call `dataset.meta.stats` to obtain the data statistics.
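Putting this together, a minimal sketch of inspecting a logged dataset and its metadata (`stats` is documented above; the `schema` attribute name is an assumption based on the fields described above):

```python
dataset = get_data_run.artifact("iris_dataset")

# The data itself, as a pandas DataFrame
df = dataset.as_df()

# Artifact metadata: statistics, schema, and other descriptive fields
print(dataset.meta.stats)   # calculated statistics for the numeric fields
print(dataset.meta.schema)  # dataset schema (attribute name assumed)
```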