Part 2: Training

In this part you use MLRun’s Feature Store to define a feature vector and create the dataset you need to run the training process.
By the end of this tutorial you’ll know how to:

  • Combine multiple data sources into a single feature vector

  • Create a training dataset

  • Create a model using an MLRun hub function

import mlrun

project_name = 'fraud-demo'

# Initialize the MLRun project object
project = mlrun.get_or_create_project(project_name, context="./", user_project=True)
> 2023-02-15 14:43:21,980 [info] loaded project fraud-demo from MLRun DB

Step 1 - Create a feature vector

In this section you create a feature vector.
The feature vector has a name, so you can reference it later by its URI or from your serving function, and it holds a list of features from the available feature sets. To add a single feature, add <FeatureSet>.<Feature> to the list; to add all of a feature set’s available features, add <FeatureSet>.*.
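The reference syntax can be pictured with a small, self-contained sketch. This is not MLRun's implementation: the `CATALOG` dictionary and the `expand_features` helper are hypothetical, and the transactions entries are a shortened, illustrative subset of the feature names used in this tutorial.

```python
# Hypothetical catalog: feature set name -> feature names it exposes.
# In MLRun this information would come from the saved feature sets.
CATALOG = {
    "events": ["event_details_change", "event_login", "event_password_change"],
    "transactions": ["amount", "amount_max_2h", "gender_F", "gender_M"],
}


def expand_features(references, catalog):
    """Expand '<FeatureSet>.<Feature>' and '<FeatureSet>.*' references."""
    expanded = []
    for ref in references:
        feature_set, _, feature = ref.partition(".")
        if feature_set not in catalog:
            raise KeyError(f"unknown feature set: {feature_set!r}")
        if feature == "*":
            # Wildcard: take every feature the set exposes, in catalog order.
            expanded.extend((feature_set, name) for name in catalog[feature_set])
        elif feature in catalog[feature_set]:
            expanded.append((feature_set, feature))
        else:
            raise KeyError(f"unknown feature: {ref!r}")
    return expanded


columns = expand_features(["events.*", "transactions.amount_max_2h"], CATALOG)
print([name for _, name in columns])
```

The wildcard contributes one column per feature in the set, while an explicit reference contributes exactly one column.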

By default, the first FeatureSet in the feature list acts as the spine, meaning that all the other features are joined to it.
For example, in this tutorial the events feature set comes first in the list, so it acts as the spine: each event produces one row in the resulting feature vector.
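The spine behaviour resembles a left join, as the rough sketch below illustrates. It makes several assumptions: the entity key name `source`, the sample values, and the `join_to_spine` helper are all hypothetical, and the sketch ignores timestamps, which a real feature store join may also take into account.

```python
# Hypothetical rows keyed by an entity id ("source"); all values are made up.
events = [
    {"source": "C1", "event_login": 1},
    {"source": "C2", "event_login": 0},
    {"source": "C3", "event_login": 1},
]
transactions = {
    "C1": {"amount_max_2h": 45.28},
    "C2": {"amount_max_2h": 26.81},
    # No aggregates for C3, so its joined value stays empty.
}


def join_to_spine(spine_rows, other_by_key, key, other_columns):
    """Left join: emit one output row per spine row, matched or not."""
    joined = []
    for row in spine_rows:
        match = other_by_key.get(row[key], {})
        out = dict(row)
        for column in other_columns:
            out[column] = match.get(column)  # None when there is no match
        joined.append(out)
    return joined


rows = join_to_spine(events, transactions, "source", ["amount_max_2h"])
for row in rows:
    print(row)
```

The output always has as many rows as the spine, which is why the choice of the first feature set determines the shape of the resulting dataset.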

# Define the list of features to use
features = ['events.*',
            'transactions.amount_max_2h', 
            'transactions.amount_sum_2h', 
            'transactions.amount_count_2h',
            'transactions.amount_avg_2h', 
            'transactions.amount_max_12h', 
            'transactions.amount_sum_12h',
            'transactions.amount_count_12h', 
            'transactions.amount_avg_12h', 
            'transactions.amount_max_24h',
            'transactions.amount_sum_24h', 
            'transactions.amount_count_24h', 
            'transactions.amount_avg_24h',
            'transactions.es_transportation_sum_14d', 
            'transactions.es_health_sum_14d',
            'transactions.es_otherservices_sum_14d', 
            'transactions.es_food_sum_14d',
            'transactions.es_hotelservices_sum_14d', 
            'transactions.es_barsandrestaurants_sum_14d',
            'transactions.es_tech_sum_14d', 
            'transactions.es_sportsandtoys_sum_14d',
            'transactions.es_wellnessandbeauty_sum_14d', 
            'transactions.es_hyper_sum_14d',
            'transactions.es_fashion_sum_14d', 
            'transactions.es_home_sum_14d', 
            'transactions.es_travel_sum_14d', 
            'transactions.es_leisure_sum_14d',
            'transactions.gender_F',
            'transactions.gender_M',
            'transactions.step', 
            'transactions.amount', 
            'transactions.timestamp_hour',
            'transactions.timestamp_day_of_week']
# Import MLRun's Feature Store
import mlrun.feature_store as fstore

# Define the feature vector name for future reference
fv_name = 'transactions-fraud'

# Define the feature vector using the feature store (fstore)
transactions_fv = fstore.FeatureVector(fv_name, 
                          features, 
                          label_feature="labels.label",
                          description='Predicting a fraudulent transaction')

# Save the feature vector in the feature store
transactions_fv.save()

Step 2 - Preview the feature vector data

Obtain the values of the features in the feature vector, to ensure the data appears as expected.

# Import the Parquet Target so you can directly save your dataset as a file
from mlrun.datastore.targets import ParquetTarget

# Get offline feature vector as dataframe and save the dataset to parquet
train_dataset = fstore.get_offline_features(fv_name, target=ParquetTarget())
> 2023-02-15 14:43:23,376 [info] wrote target: {'name': 'parquet', 'kind': 'parquet', 'path': 'v3io:///projects/fraud-demo-dani/FeatureStore/transactions-fraud/parquet/vectors/transactions-fraud-latest.parquet', 'status': 'ready', 'updated': '2023-02-15T14:43:23.375968+00:00', 'size': 140838, 'partitioned': True}
# Preview your dataset
train_dataset.to_dataframe().tail(5)
event_details_change event_login event_password_change amount_max_2h amount_sum_2h amount_count_2h amount_avg_2h amount_max_12h amount_sum_12h amount_count_12h ... es_home_sum_14d es_travel_sum_14d es_leisure_sum_14d gender_F gender_M step amount timestamp_hour timestamp_day_of_week label
1763 0 0 1 45.28 144.56 5.0 28.9120 161.75 1017.80 33.0 ... 0.0 1.0 0.0 1.0 0.0 96.0 24.02 14.0 2.0 0.0
1764 1 0 0 26.81 47.75 2.0 23.8750 68.16 653.02 24.0 ... 0.0 0.0 0.0 0.0 1.0 134.0 26.81 14.0 2.0 0.0
1765 0 1 0 33.10 91.11 4.0 22.7775 121.96 1001.32 32.0 ... 2.0 0.0 0.0 1.0 0.0 141.0 14.95 14.0 2.0 0.0
1766 0 0 1 22.35 37.68 3.0 12.5600 71.63 1052.44 37.0 ... 0.0 0.0 0.0 0.0 1.0 101.0 13.62 14.0 2.0 0.0
1767 0 0 1 44.37 76.87 4.0 19.2175 159.32 1189.73 39.0 ... 0.0 0.0 0.0 0.0 1.0 40.0 12.82 14.0 2.0 0.0

5 rows × 36 columns

Step 3 - Train models and choose the highest accuracy

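With MLRun, you can easily train different models and compare the results. In the code below, you train three different models, each using a different scikit-learn algorithm (random forest, gradient boosting, and AdaBoost), and MLRun selects the model with the highest accuracy. Note that the model named transaction_fraud_xgboost uses sklearn.ensemble.GradientBoostingClassifier, not the XGBoost library.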

# Import the Sklearn classifier function from the functions hub
classifier_fn = mlrun.import_function('hub://auto_trainer')
# Prepare the parameters list for the training function
# you use 3 different models
training_params = {"model_name": ['transaction_fraud_rf', 
                                  'transaction_fraud_xgboost', 
                                  'transaction_fraud_adaboost'],
              
                  "model_class": ['sklearn.ensemble.RandomForestClassifier',
                                  'sklearn.ensemble.GradientBoostingClassifier',
                                  'sklearn.ensemble.AdaBoostClassifier']}

# Define the training task, including your feature vector, label and hyperparams definitions
train_task = mlrun.new_task('training', 
                      inputs={'dataset': transactions_fv.uri},
                      params={'label_columns': 'label'}
                     )

train_task.with_hyper_params(training_params, strategy='list', selector='max.accuracy')

# Specify your cluster image
classifier_fn.spec.image = 'mlrun/mlrun'

# Run training
classifier_fn.run(train_task, local=False)
> 2023-02-15 14:43:23,870 [info] starting run training uid=946725e1c01f4e0ba9d7eb62f7f24142 DB=http://mlrun-api:8080
> 2023-02-15 14:43:24,069 [info] Job is running in the background, pod: training-68dct
> 2023-02-15 14:43:58,472 [info] test_set or train_test_split_size are not provided, setting train_test_split_size to 0.2
> 2023-02-15 14:44:00,031 [info] label columns: label
> 2023-02-15 14:44:00,031 [info] Sample set not given, using the whole training set as the sample set
> 2023-02-15 14:44:00,278 [info] training 'transaction_fraud_rf'
/usr/local/lib/python3.9/site-packages/sklearn/calibration.py:1000: FutureWarning:

The normalize argument is deprecated in v1.1 and will be removed in v1.3. Explicitly normalizing y_prob will reproduce this behavior, but it is recommended that a proper probability is used (i.e. a classifier's `predict_proba` positive class or `decision_function` output calibrated with `CalibratedClassifierCV`).

> 2023-02-15 14:44:03,298 [info] test_set or train_test_split_size are not provided, setting train_test_split_size to 0.2
> 2023-02-15 14:44:04,277 [info] label columns: label
> 2023-02-15 14:44:04,277 [info] Sample set not given, using the whole training set as the sample set
> 2023-02-15 14:44:04,281 [info] training 'transaction_fraud_xgboost'
/usr/local/lib/python3.9/site-packages/sklearn/calibration.py:1000: FutureWarning:

The normalize argument is deprecated in v1.1 and will be removed in v1.3. Explicitly normalizing y_prob will reproduce this behavior, but it is recommended that a proper probability is used (i.e. a classifier's `predict_proba` positive class or `decision_function` output calibrated with `CalibratedClassifierCV`).

> 2023-02-15 14:44:07,773 [info] test_set or train_test_split_size are not provided, setting train_test_split_size to 0.2
> 2023-02-15 14:44:09,037 [info] label columns: label
> 2023-02-15 14:44:09,037 [info] Sample set not given, using the whole training set as the sample set
> 2023-02-15 14:44:09,040 [info] training 'transaction_fraud_adaboost'
/usr/local/lib/python3.9/site-packages/sklearn/calibration.py:1000: FutureWarning:

The normalize argument is deprecated in v1.1 and will be removed in v1.3. Explicitly normalizing y_prob will reproduce this behavior, but it is recommended that a proper probability is used (i.e. a classifier's `predict_proba` positive class or `decision_function` output calibrated with `CalibratedClassifierCV`).

> 2023-02-15 14:44:11,957 [info] best iteration=1, used criteria max.accuracy
> 2023-02-15 14:44:12,668 [info] To track results use the CLI: {'info_cmd': 'mlrun get run 946725e1c01f4e0ba9d7eb62f7f24142 -p fraud-demo-dani', 'logs_cmd': 'mlrun logs 946725e1c01f4e0ba9d7eb62f7f24142 -p fraud-demo-dani'}
> 2023-02-15 14:44:12,668 [info] Or click for UI: {'ui_url': 'https://dashboard.default-tenant.app.vmdev94.lab.iguazeng.com/mlprojects/fraud-demo-dani/jobs/monitor/946725e1c01f4e0ba9d7eb62f7f24142/overview'}
> 2023-02-15 14:44:12,669 [info] run executed, status=completed
final state: completed
project: fraud-demo-dani
iter: 0
start: Feb 15 14:43:57
state: completed
name: training
labels: v3io_user=dani, kind=job, owner=dani, mlrun/client_version=1.3.0-rc23, mlrun/client_python_version=3.9.16
inputs: dataset
parameters: label_columns=label
results: best_iteration=1, accuracy=1.0, f1_score=1.0, precision_score=1.0, recall_score=1.0
artifacts: feature-importance, test_set, confusion-matrix, roc-curves, calibration-curve, model, iteration_results, parallel_coordinates

> to track results use the .show() or .logs() methods or click here to open in UI
> 2023-02-15 14:44:15,576 [info] run executed, status=completed
<mlrun.model.RunObject at 0x7f3288543e20>
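The training task above passes strategy='list' and selector='max.accuracy' to with_hyper_params. The sketch below shows one way to picture those two settings. It is an illustration under assumptions, not MLRun's code: `list_strategy` and `select_best` are hypothetical helpers, the accuracy values are invented, and the 1-based iteration numbering is assumed from the "best iteration" log lines.

```python
def list_strategy(hyper_params):
    """Pair the i-th value of every parameter list into one iteration."""
    names = list(hyper_params)
    lengths = {len(hyper_params[name]) for name in names}
    if len(lengths) != 1:
        raise ValueError("this sketch requires parameter lists of equal length")
    value_lists = [hyper_params[name] for name in names]
    return [dict(zip(names, values)) for values in zip(*value_lists)]


def select_best(results, selector):
    """Apply a '<max|min>.<metric>' selector; return a 1-based iteration number."""
    mode, _, metric = selector.partition(".")
    if mode not in ("max", "min"):
        raise ValueError(f"unsupported selector: {selector!r}")
    pick = max if mode == "max" else min
    best_index = pick(range(len(results)), key=lambda i: results[i][metric])
    return best_index + 1


training_params = {
    "model_name": ["transaction_fraud_rf",
                   "transaction_fraud_xgboost",
                   "transaction_fraud_adaboost"],
    "model_class": ["sklearn.ensemble.RandomForestClassifier",
                    "sklearn.ensemble.GradientBoostingClassifier",
                    "sklearn.ensemble.AdaBoostClassifier"],
}

iterations = list_strategy(training_params)

# Invented accuracies for illustration only; not results from the tutorial run.
example_results = [{"accuracy": 0.97}, {"accuracy": 0.99}, {"accuracy": 0.95}]
best = select_best(example_results, "max.accuracy")
print(best, iterations[best - 1]["model_name"])
```

With the list strategy, three values per parameter yield three iterations, one per (model_name, model_class) pair, rather than the nine combinations a full grid over both lists would produce.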

Step 4 - Perform feature selection

As part of the data science process, try to reduce the number of features in the training dataset: removing bad or uninformative features saves computation time.

Use the ready-made feature selection function from MLRun’s function hub (hub://feature_selection) to select the best features to keep, based on a sample from your dataset.

feature_selection_fn = mlrun.import_function('hub://feature_selection')

feature_selection_run = feature_selection_fn.run(
    params={"k": 18,
            "min_votes": 2,
            "label_column": 'label',
            'output_vector_name': fv_name + "-short",
            'ignore_type_errors': True},
    inputs={'df_artifact': transactions_fv.uri},
    name='feature_extraction',
    handler='feature_selection',
    local=False)
> 2023-02-15 14:44:16,098 [info] starting run feature_extraction uid=da55327c222f4a9389232f25fc6b9739 DB=http://mlrun-api:8080
> 2023-02-15 14:44:16,262 [info] Job is running in the background, pod: feature-extraction-pv66m
final state: completed
project: fraud-demo-dani
iter: 0
start: Feb 15 14:45:50
state: completed
name: feature_extraction
labels: v3io_user=dani, kind=job, owner=dani, mlrun/client_version=1.3.0-rc23, mlrun/client_python_version=3.9.16, host=feature-extraction-pv66m
inputs: df_artifact
parameters: k=18, min_votes=2, label_column=label, output_vector_name=transactions-fraud-short, ignore_type_errors=True
results: top_features_vector=store://feature-vectors/fraud-demo-dani/transactions-fraud-short
artifacts: f_classif, mutual_info_classif, chi2, f_regression, LinearSVC, LogisticRegression, ExtraTreesClassifier, feature_scores, max_scaled_scores_feature_scores, selected_features_count, selected_features

> to track results use the .show() or .logs() methods or click here to open in UI
> 2023-02-15 14:46:05,989 [info] run executed, status=completed
mlrun.get_dataitem(feature_selection_run.outputs['top_features_vector']).as_df().tail(5)
amount_max_2h amount_sum_2h amount_count_2h amount_avg_2h amount_max_12h amount_sum_12h amount_count_12h amount_avg_12h amount_max_24h amount_sum_24h amount_count_24h amount_avg_24h es_transportation_sum_14d es_health_sum_14d es_otherservices_sum_14d label
9995 54.55 118.62 4.0 29.655 70.47 805.10 27.0 29.818519 85.97 1730.23 58.0 29.831552 120.0 0.0 0.0 0
9996 31.14 31.14 1.0 31.140 119.50 150.64 2.0 75.320000 119.50 330.61 5.0 66.122000 0.0 7.0 0.0 0
9997 218.48 365.30 5.0 73.060 218.48 1076.37 25.0 43.054800 218.48 1968.00 59.0 33.355932 107.0 5.0 1.0 0
9998 34.93 118.22 5.0 23.644 79.16 935.26 31.0 30.169677 89.85 2062.69 68.0 30.333676 116.0 0.0 0.0 0
9999 77.76 237.95 5.0 47.590 95.71 1259.07 37.0 34.028919 95.71 2451.98 72.0 34.055278 122.0 0.0 0.0 0
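The k and min_votes parameters suggest a voting scheme: each scoring method nominates its top k features, and a feature is kept when enough methods nominate it. The sketch below illustrates that idea under assumptions. It is not the hub function's code: `top_k` and `vote_features` are hypothetical helpers, the scores are invented, and the exact semantics of the hub function should be checked in its own documentation.

```python
from collections import Counter


def top_k(scores, k):
    """Return the k feature names with the highest scores."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]


def vote_features(scores_by_method, k, min_votes):
    """Keep features that appear in the top k of at least min_votes methods."""
    votes = Counter()
    for scores in scores_by_method.values():
        votes.update(top_k(scores, k))
    return sorted(name for name, count in votes.items() if count >= min_votes)


# Invented scores for illustration only; not output from the tutorial run.
scores_by_method = {
    "f_classif": {"amount_max_2h": 0.9, "amount_sum_2h": 0.7, "gender_F": 0.1},
    "mutual_info_classif": {"amount_max_2h": 0.8, "amount_sum_2h": 0.2, "gender_F": 0.3},
    "chi2": {"amount_max_2h": 0.6, "amount_sum_2h": 0.5, "gender_F": 0.4},
}

selected = vote_features(scores_by_method, k=2, min_votes=2)
print(selected)
```

Raising min_votes makes the selection stricter, because a feature then needs agreement from more scoring methods to survive.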

Step 5 - Train your models with top features

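Following the feature selection, you train new models using the selected features. The goal is a model that needs fewer features to be accurate and is therefore less error-prone. In the run below, accuracy remains high (about 0.99), but the F1, precision, and recall scores are considerably lower than in Step 3, so review all of these metrics before relying on the reduced feature set.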

# Define your training task, including your feature vector, label and hyperparams definitions
ensemble_train_task = mlrun.new_task('training', 
                      inputs={'dataset': feature_selection_run.outputs['top_features_vector']},
                      params={'label_columns': 'label'}
                     )
ensemble_train_task.with_hyper_params(training_params, strategy='list', selector='max.accuracy')

classifier_fn.run(ensemble_train_task)
> 2023-02-15 14:46:06,131 [info] starting run training uid=4ac3afbfb6a1409daa1e834f8f153295 DB=http://mlrun-api:8080
> 2023-02-15 14:46:07,756 [info] Job is running in the background, pod: training-hgz6t
> 2023-02-15 14:46:17,141 [info] test_set or train_test_split_size are not provided, setting train_test_split_size to 0.2
> 2023-02-15 14:46:17,731 [info] label columns: label
> 2023-02-15 14:46:17,732 [info] Sample set not given, using the whole training set as the sample set
> 2023-02-15 14:46:18,031 [info] training 'transaction_fraud_rf'
/usr/local/lib/python3.9/site-packages/sklearn/calibration.py:1000: FutureWarning:

The normalize argument is deprecated in v1.1 and will be removed in v1.3. Explicitly normalizing y_prob will reproduce this behavior, but it is recommended that a proper probability is used (i.e. a classifier's `predict_proba` positive class or `decision_function` output calibrated with `CalibratedClassifierCV`).

> 2023-02-15 14:46:21,793 [info] test_set or train_test_split_size are not provided, setting train_test_split_size to 0.2
> 2023-02-15 14:46:22,767 [info] label columns: label
> 2023-02-15 14:46:22,767 [info] Sample set not given, using the whole training set as the sample set
> 2023-02-15 14:46:22,770 [info] training 'transaction_fraud_xgboost'
/usr/local/lib/python3.9/site-packages/sklearn/calibration.py:1000: FutureWarning:

The normalize argument is deprecated in v1.1 and will be removed in v1.3. Explicitly normalizing y_prob will reproduce this behavior, but it is recommended that a proper probability is used (i.e. a classifier's `predict_proba` positive class or `decision_function` output calibrated with `CalibratedClassifierCV`).

> 2023-02-15 14:46:28,944 [info] test_set or train_test_split_size are not provided, setting train_test_split_size to 0.2
> 2023-02-15 14:46:29,507 [info] label columns: label
> 2023-02-15 14:46:29,507 [info] Sample set not given, using the whole training set as the sample set
> 2023-02-15 14:46:29,511 [info] training 'transaction_fraud_adaboost'
/usr/local/lib/python3.9/site-packages/sklearn/calibration.py:1000: FutureWarning:

The normalize argument is deprecated in v1.1 and will be removed in v1.3. Explicitly normalizing y_prob will reproduce this behavior, but it is recommended that a proper probability is used (i.e. a classifier's `predict_proba` positive class or `decision_function` output calibrated with `CalibratedClassifierCV`).

> 2023-02-15 14:46:31,968 [info] best iteration=2, used criteria max.accuracy
> 2023-02-15 14:46:32,376 [info] To track results use the CLI: {'info_cmd': 'mlrun get run 4ac3afbfb6a1409daa1e834f8f153295 -p fraud-demo-dani', 'logs_cmd': 'mlrun logs 4ac3afbfb6a1409daa1e834f8f153295 -p fraud-demo-dani'}
> 2023-02-15 14:46:32,376 [info] Or click for UI: {'ui_url': 'https://dashboard.default-tenant.app.vmdev94.lab.iguazeng.com/mlprojects/fraud-demo-dani/jobs/monitor/4ac3afbfb6a1409daa1e834f8f153295/overview'}
> 2023-02-15 14:46:32,377 [info] run executed, status=completed
final state: completed
project: fraud-demo-dani
iter: 0
start: Feb 15 14:46:16
state: completed
name: training
labels: v3io_user=dani, kind=job, owner=dani, mlrun/client_version=1.3.0-rc23, mlrun/client_python_version=3.9.16
inputs: dataset
parameters: label_columns=label
results: best_iteration=2, accuracy=0.992503748125937, f1_score=0.4827586206896552, precision_score=0.5833333333333334, recall_score=0.4117647058823529
artifacts: feature-importance, test_set, confusion-matrix, roc-curves, calibration-curve, model, iteration_results, parallel_coordinates

> to track results use the .show() or .logs() methods or click here to open in UI
> 2023-02-15 14:46:33,094 [info] run executed, status=completed
<mlrun.model.RunObject at 0x7f324160af40>
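The results above pair an accuracy of about 0.99 with an F1 score below 0.5, which is typical of an imbalanced label. The sketch below recomputes the four metrics from confusion-matrix counts. The counts used (7 true positives, 5 false positives, 10 false negatives, 1979 true negatives) are not logged by the run; they are inferred here as one set of counts that is arithmetically consistent with the reported scores, and `classification_metrics` is a hypothetical helper.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Inferred counts (an assumption, see the paragraph above), not logged values.
metrics = classification_metrics(tp=7, fp=5, fn=10, tn=1979)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Because the true negatives dominate the total, accuracy stays high even though fewer than half of the positive cases are caught, which is why recall and F1 are more informative here.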

Done!

You’ve completed Part 2 of the model training with the feature store. Proceed to Part 3 to learn how to deploy and monitor the model.