Data items
A data item can be a single item or a collection of items (file, directory, table, etc.).
When running jobs or pipelines, data is passed using DataItem objects. DataItem objects abstract away
the data backend implementation, provide a set of convenience methods (.as_df, .get, .show, …),
and enable automatic logging and versioning of data and metadata.
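To make the "abstract away the data backend" idea concrete, here is a minimal standard-library sketch. It is not MLRun's implementation: ToyDataItem, MemoryStore, FileStore, STORES, and copy_item are hypothetical names invented for illustration, and the "memory" scheme exists only in this toy. The point is that the caller reads and writes through one interface, and the URL scheme alone selects the backend.

```python
import os
import tempfile
from urllib.parse import urlparse


class MemoryStore:
    """Toy in-memory backend: blobs are kept in a dict keyed by path."""

    def __init__(self):
        self._blobs = {}

    def get(self, path):
        return self._blobs[path]

    def put(self, path, data):
        self._blobs[path] = data


class FileStore:
    """Toy local-filesystem backend."""

    def get(self, path):
        with open(path, "rb") as f:
            return f.read()

    def put(self, path, data):
        with open(path, "wb") as f:
            f.write(data)


# Hypothetical scheme-to-backend registry (an assumption of this sketch).
# An empty scheme means a plain local path.
STORES = {"memory": MemoryStore(), "file": FileStore(), "": FileStore()}


class ToyDataItem:
    """Uniform read/write handle over whichever backend the URL names."""

    def __init__(self, url):
        parsed = urlparse(url)
        self.url = url
        self._store = STORES[parsed.scheme]
        # Keep the netloc so "memory://bucket/key" stays unique per bucket.
        self._path = parsed.netloc + parsed.path

    def get(self):
        return self._store.get(self._path)

    def put(self, data):
        self._store.put(self._path, data)


def copy_item(src_url, dst_url):
    """The calling code is identical no matter which backends are involved."""
    ToyDataItem(dst_url).put(ToyDataItem(src_url).get())


# Usage: write to the in-memory backend, then copy to a local file.
ToyDataItem("memory://demo/a.txt").put(b"hello")
local_path = os.path.join(tempfile.mkdtemp(), "a.txt")
copy_item("memory://demo/a.txt", local_path)
```

A real DataItem adds much more (DataFrame conversion, metadata, artifact logging), but the dispatch-by-URL idea is the same in spirit.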
Example function:
import mlrun

def prep_data(context, source_url: mlrun.DataItem, label_column='label'):
    # Convert the DataItem to a pandas DataFrame
    df = source_url.as_df()
    # Drop the label column, then drop rows with missing values
    df = df.drop(label_column, axis=1).dropna()
    # Log the cleaned DataFrame as a CSV dataset artifact
    context.log_dataset('cleaned_data', df=df, index=False, format='csv')
Running the function:
prep_data_run = data_prep_func.run(
    name='prep_data',
    handler=prep_data,
    inputs={'source_url': source_url},
    params={'label_column': 'userid'},
)
To call the function with a data input, use the inputs dictionary argument. To pass
a simple parameter, use the params dictionary argument. Each input value is the URI of the specific item
(per the data store schema), as explained in Shared data stores.
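The difference between inputs and params can be sketched with a toy dispatcher. This is an illustrative assumption about the general pattern, not MLRun's runtime code: call_handler and ToyDataItem are hypothetical names, and the URL is only stored, never opened. Each entry in inputs reaches the handler wrapped in a data-item object, while each entry in params reaches it as the plain value.

```python
class ToyDataItem:
    """Hypothetical stand-in for a data item: it only remembers its URL."""

    def __init__(self, url):
        self.url = url


def call_handler(handler, inputs=None, params=None):
    """Wrap each input URL in a ToyDataItem; pass params through unchanged."""
    kwargs = dict(params or {})
    for name, url in (inputs or {}).items():
        kwargs[name] = ToyDataItem(url)
    return handler(**kwargs)


def prep_data(source_url, label_column="label"):
    # The handler receives an object for the input and a plain str for the param.
    return type(source_url).__name__, source_url.url, label_column


result = call_handler(
    prep_data,
    inputs={"source_url": "s3://demo-data/mydata.csv"},
    params={"label_column": "userid"},
)
```

Here result holds the type name of the wrapped input, its URL, and the untouched parameter value, which shows that only the inputs entry was converted.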
To read the data results from the run, get a run output artifact as a DataItem
(so that you can view or use the artifact):
# Read the data locally as a DataFrame
prep_data_run.artifact('cleaned_data').as_df()
The DataItem class supports multiple convenience methods, such as:
get(), put() - to read/write data
download(), upload() - to download/upload files
as_df() - to convert the data to a DataFrame object
local() - to get a local file path to the data (the data is downloaded locally if needed)
listdir(), stat() - file-system-like methods
meta - access to the artifact metadata (when the item comes from an artifact URI)
show() - visualizes the data in Jupyter (as image, html, etc.)
See the DataItem class documentation (mlrun.datastore.DataItem) for details.
To get a DataItem object from a URL, use get_dataitem(). Alternatively, use get_object(),
which returns the result of DataItem.get() for that URL.
For example:
df = mlrun.get_dataitem('s3://demo-data/mydata.csv').as_df()
print(mlrun.get_object('https://my-site/data.json'))