Data items
A data item can be a single item or a collection of items (file, directory, table, etc.).
When running jobs or pipelines, data is passed using DataItem objects. DataItem objects abstract away the data backend implementation, provide a set of convenience methods (.as_df, .get, .show, …), and enable automatic logging and versioning of data and metadata.
Example function:
# Save this code as a .py file:
import mlrun


def prep_data(context, source_url: mlrun.DataItem, label_column='label'):
    # Convert the DataItem to a Pandas DataFrame
    df = source_url.as_df()
    df = df.drop(label_column, axis=1).dropna()
    context.log_dataset('cleaned_data', df=df, index=False, format='csv')
Create a project, set the function in it, define the URL of the data, and run the function:
import mlrun

source_url = mlrun.get_sample_path('data/batch-predict/training_set.parquet')
project = mlrun.get_or_create_project("data-items", "./", user_project=True)
data_prep_func = project.set_function("data-prep.py", name="data-prep", kind="job", image="mlrun/mlrun", handler="prep_data")
# The handler is given by name, since prep_data is defined in the .py file and not in this session
prep_data_run = data_prep_func.run(name='prep_data',
                                   handler='prep_data',
                                   inputs={'source_url': source_url},
                                   params={'label_column': 'label'})
To call the function with an input, use the inputs dictionary attribute. To pass a simple parameter, use the params dictionary attribute. The input value is the URI of the specific item (per the data store schema), as explained in Shared data stores.
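The difference between inputs and params can be sketched in a few lines of plain Python. This is an assumption-laden illustration, not MLRun's runtime: `InputItem` and `call_handler` are hypothetical names. It shows only the idea that each input URI reaches the handler wrapped in a data item object, while params are passed through as plain values.

```python
from dataclasses import dataclass


@dataclass
class InputItem:
    """Hypothetical stand-in for a data item: it only remembers its URI."""
    url: str


def call_handler(handler, inputs: dict, params: dict):
    """Hypothetical runner: wrap each input URI, pass params unchanged."""
    wrapped = {name: InputItem(uri) for name, uri in inputs.items()}
    return handler(**wrapped, **params)


def describe(source_url, label_column="label"):
    # source_url arrives as a data item object, label_column as a plain value
    return type(source_url).__name__, source_url.url, label_column


result = call_handler(
    describe,
    inputs={"source_url": "s3://demo-data/mydata.csv"},
    params={"label_column": "target"},
)
print(result)
```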
From v1.3, DataItem objects are automatically parsed to the hinted type when a type hint is available.
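The following sketch shows one way hint-based parsing could work in principle. It does not reproduce MLRun's logic or its list of supported types: `RawItem`, `CONVERTERS`, and `parse_inputs` are hypothetical, and the converters cover only `bytes`, `str`, and `dict` so that the example needs nothing beyond the standard library.

```python
import inspect
import json
from dataclasses import dataclass


@dataclass
class RawItem:
    """Hypothetical stand-in for a data item holding raw bytes."""
    body: bytes


# Hypothetical converters from a data item to a hinted type
CONVERTERS = {
    bytes: lambda item: item.body,
    str: lambda item: item.body.decode("utf-8"),
    dict: lambda item: json.loads(item.body),
}


def parse_inputs(handler, items: dict) -> dict:
    """Convert each item to its parameter's type hint, if a converter exists."""
    signature = inspect.signature(handler)
    parsed = {}
    for name, item in items.items():
        hint = signature.parameters[name].annotation
        # No hint, a data item hint, or an unknown hint: pass the item unchanged
        convert = CONVERTERS.get(hint)
        parsed[name] = convert(item) if convert else item
    return parsed


def handler(config: dict, raw: RawItem, untyped):
    return config, raw, untyped


item = RawItem(b'{"label_column": "label"}')
parsed = parse_inputs(handler, {"config": item, "raw": item, "untyped": item})
print({name: type(value).__name__ for name, value in parsed.items()})
```

In this sketch, a parameter hinted with the data item type itself, or left without a hint, still receives the data item object, so the handler can call its methods directly.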
To read the data results of a run, get a run output artifact as a DataItem (so that you can view or use the artifact):
# read the data locally as a DataFrame
prep_data_run.artifact('cleaned_data').as_df()
The DataItem supports multiple convenience methods, such as:
get(), put() - to read/write data
download(), upload() - to download/upload files
as_df() - to convert the data to a DataFrame object
local() - to get a local file path to the data (the data is downloaded locally if needed)
listdir(), stat() - file-system-like methods
meta - access to the artifact metadata (in case of an artifact uri)
show() - visualizes the data in Jupyter (as image, html, etc.)
See the DataItem class documentation for details.
To get a DataItem object from a URL, use get_dataitem(). To read the content directly, use get_object(), which returns the result of DataItem.get().
For example:
df = mlrun.get_dataitem('s3://demo-data/mydata.csv').as_df()
print(mlrun.get_object('https://my-site/data.json'))