Training with the feature store

Learn how to train your model using an offline dataset created by the MLRun feature store.

Creating an offline dataset

An offline dataset is a specific instance of the feature vector definition. To create it, call the feature store's get_offline_features(<feature_vector>, <target>) function, passing the feature vector (or its store://feature-vectors/<project_name>/<feature_vector> reference) and an offline target such as Parquet or CSV.

You can add a time-based filter condition when running get_offline_features with a given vector, and you can filter on any of the other features with the query argument. See get_offline_features().
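
As a rough sketch of how these filters might be assembled, the helper below builds the keyword arguments for the call. The parameter names start_time, end_time, and query are assumed to match the get_offline_features() signature in your MLRun version; build_filter_kwargs, the seven-day window, and the age > 30 condition are hypothetical and used only for illustration.

```python
from datetime import datetime, timedelta


def build_filter_kwargs(end_time, days_back, query=None):
    """Assemble time-window and query filters for get_offline_features().

    Hypothetical helper, not part of MLRun.
    """
    kwargs = {
        "start_time": end_time - timedelta(days=days_back),
        "end_time": end_time,
    }
    if query is not None:
        # A filter expression string evaluated against the feature columns
        kwargs["query"] = query
    return kwargs


filter_kwargs = build_filter_kwargs(
    end_time=datetime(2024, 1, 31), days_back=7, query="age > 30"
)

# With a feature vector object `fvec` and an offline target, the call would be:
# fvec.get_offline_features(target=ParquetTarget(), **filter_kwargs)
```

Keeping the filters in a dictionary makes it easy to reuse the same time window across several vectors.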


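import mlrun.feature_store as fstore
from mlrun.datastore.targets import ParquetTarget

feature_vector = "<feature_vector_name>"
fvec = fstore.get_feature_vector(feature_vector)
# Materialize the vector into a Parquet offline target
offline_fv = fvec.get_offline_features(target=ParquetTarget())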

Behind the scenes, get_offline_features() runs a local or Kubernetes job (specified by the run_config parameter) that retrieves all the relevant data from the feature sets, merges it, and writes the result to the specified target, which can be a local Parquet file, Azure Blob storage, or any other available storage type.
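
To make the merge step more concrete, here is a conceptual sketch of joining two feature sets on a shared entity key. It is not MLRun's implementation, which runs as a job and supports more join options; the function name, the patient_id key, and the sample rows are all hypothetical.

```python
def left_join_on_entity(base_rows, other_rows, entity_key):
    """Left-join two lists of feature rows (dicts) on a shared entity key."""
    other_by_key = {row[entity_key]: row for row in other_rows}
    merged = []
    for row in base_rows:
        match = other_by_key.get(row[entity_key], {})
        # Base values win on name clashes; unmatched rows keep only base features
        merged.append({**match, **row})
    return merged


# Hypothetical feature sets keyed by patient_id
vitals = [{"patient_id": 1, "heart_rate": 72}, {"patient_id": 2, "heart_rate": 95}]
labs = [{"patient_id": 1, "glucose": 5.4}]

dataset = left_join_on_entity(vitals, labs, "patient_id")
```

Each output row corresponds to one entity in the base feature set, with columns from the other set filled in where a match exists.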

Once instantiated with a target, the feature vector holds a reference to the instantiated dataset and references it as its current offline source.

You can also use MLRun's log_dataset() to log this particular dataset to the project as a dataset resource.
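
A minimal sketch of such a call, assuming it runs inside an MLRun handler where context is the execution context and df is the DataFrame obtained from the offline dataset. The helper name and the "training-set" key are hypothetical.

```python
def log_offline_dataset(context, df, key="training-set"):
    """Log a DataFrame to the project as a dataset.

    `context` is the MLRun execution context passed to a handler;
    the helper name and the default key are hypothetical.
    """
    return context.log_dataset(key, df=df, format="parquet")


# Inside a handler, with `offline_fv` returned by get_offline_features():
# log_offline_dataset(context, offline_fv.to_dataframe())
```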

You can use the additional_filters attribute while reading from a ParquetTarget, similar to additional_filters in ParquetSource.
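
The filters are assumed here to be a list of (column, operator, value) tuples in the style of PyArrow Parquet filters, all of which must hold for a row to be kept. The sketch below only mimics that semantics on plain dictionaries so the effect is visible; it is not how MLRun evaluates the filters, and the column names and values are hypothetical.

```python
import operator

# Comparison operators assumed for this sketch
_OPS = {
    "=": operator.eq,
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
}


def matches(row, additional_filters):
    """Return True when a row (dict) satisfies every (column, op, value) tuple."""
    return all(
        _OPS[op](row[column], value) for column, op, value in additional_filters
    )


# Hypothetical filter: patients over 30 in the ICU department
additional_filters = [("age", ">", 30), ("department", "=", "ICU")]
rows = [
    {"age": 45, "department": "ICU"},
    {"age": 25, "department": "ICU"},
    {"age": 60, "department": "ER"},
]
kept = [row for row in rows if matches(row, additional_filters)]
```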

Training

Training your model using the feature store is a fairly simple task. (You can also use the offline dataset for exploratory data analysis, EDA.)

To retrieve a feature vector's offline dataset, use MLRun's data item mechanism, referencing the feature vector and specifying to receive it as a DataFrame.

import mlrun

# `project` is the name of the project that owns the feature vector
df = mlrun.get_dataitem(
    f"store://feature-vectors/{project}/patient-deterioration"
).as_df()

To retrieve the dataset inside your training function, pass the feature vector reference as an input to the function. The function receives it as a data item, and calling as_df() on it returns the dataset as a DataFrame.

# A sample MLRun training function
def my_training_function(
    context,  # MLRun context
    dataset,  # our feature vector reference
    **kwargs,
):

    # retrieve the dataset
    df = dataset.as_df()

    # The rest of your training code...
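
As one possible way to start the training code, the sketch below checks that the expected label column is present and records the dataset size with the run. The function name, the label_column parameter, and its default value are hypothetical; context.log_result() is assumed to be available on the MLRun context.

```python
def validated_training_function(context, dataset, label_column="label", **kwargs):
    """Sample handler: load the offline dataset and validate it before training."""
    # `dataset` arrives as an MLRun data item; as_df() returns a DataFrame
    df = dataset.as_df()

    # Fail early if the expected label column is missing (hypothetical check)
    if label_column not in df.columns:
        raise ValueError(f"label column '{label_column}' not found in dataset")

    # Record the dataset size with the run
    context.log_result("training_rows", len(df))

    # The rest of your training code...
```

Failing before training starts keeps a misconfigured feature vector from producing a model trained on the wrong columns.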

You can now create the MLRun project and function, and run it locally or on the Kubernetes cluster:

project = mlrun.get_or_create_project("training")
# Creating the training MLRun function with the code
fn = project.set_function(name="training", kind="job", handler="my_training_function")

# Creating the task to run the function with its dataset
feature_vector_name = "<feature_vector_name>"
task = mlrun.new_task(
    "training",
    # Use project.name here: `project` is a project object, not a string
    inputs={"dataset": f"store://feature-vectors/{project.name}/{feature_vector_name}"},
)  # The feature vector is given as an input to the function

# Running the function on the Kubernetes cluster
fn.run(task)  # Set local=True to run locally