Part 2: Training
In this part you learn how to use MLRun’s Feature Store to easily define a Feature Vector and create the dataset you need to run the training process.
By the end of this tutorial you’ll learn how to:
Combine multiple data sources into a single feature vector
Create a training dataset
Create a model using an MLRun hub function
project_name = 'fraud-demo'
import mlrun
# Initialize the MLRun project object
project = mlrun.get_or_create_project(project_name, context="./", user_project=True)
> 2023-02-15 14:43:21,980 [info] loaded project fraud-demo from MLRun DB
Step 1 - Create a feature vector
In this section you create a feature vector.
The feature vector has a name, so you can reference it later via its URI or from your serving function, and a list of features drawn from the available feature sets. To add a single feature from a feature set, add <FeatureSet>.<Feature> to the list; to add all of a feature set’s available features, add <FeatureSet>.*.
By default, the first feature set in the feature list acts as the spine, meaning that all the other features are joined to it.
In this instance the events feature set is the spine (the list starts with events.*), so each event produces a row in the resulting feature vector.
# Define the list of features to use
features = [
    'events.*',
    'transactions.amount_max_2h',
    'transactions.amount_sum_2h',
    'transactions.amount_count_2h',
    'transactions.amount_avg_2h',
    'transactions.amount_max_12h',
    'transactions.amount_sum_12h',
    'transactions.amount_count_12h',
    'transactions.amount_avg_12h',
    'transactions.amount_max_24h',
    'transactions.amount_sum_24h',
    'transactions.amount_count_24h',
    'transactions.amount_avg_24h',
    'transactions.es_transportation_sum_14d',
    'transactions.es_health_sum_14d',
    'transactions.es_otherservices_sum_14d',
    'transactions.es_food_sum_14d',
    'transactions.es_hotelservices_sum_14d',
    'transactions.es_barsandrestaurants_sum_14d',
    'transactions.es_tech_sum_14d',
    'transactions.es_sportsandtoys_sum_14d',
    'transactions.es_wellnessandbeauty_sum_14d',
    'transactions.es_hyper_sum_14d',
    'transactions.es_fashion_sum_14d',
    'transactions.es_home_sum_14d',
    'transactions.es_travel_sum_14d',
    'transactions.es_leisure_sum_14d',
    'transactions.gender_F',
    'transactions.gender_M',
    'transactions.step',
    'transactions.amount',
    'transactions.timestamp_hour',
    'transactions.timestamp_day_of_week',
]
# Import MLRun's Feature Store
import mlrun.feature_store as fstore
# Define the feature vector name for future reference
fv_name = 'transactions-fraud'
# Define the feature vector using the feature store (fstore)
transactions_fv = fstore.FeatureVector(
    fv_name,
    features,
    label_feature="labels.label",
    description='Predicting a fraudulent transaction',
)
# Save the feature vector in the feature store
transactions_fv.save()
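The naming convention and the spine rule described above can be pictured with a small pure-Python sketch. It does not call MLRun: the helper names (split_feature_ref, spine_of, join_to_spine) and the customer_id key are hypothetical and exist only for illustration. MLRun's actual merge may also take timestamps into account, which this sketch ignores.

```python
# Illustration only: plain Python, no MLRun calls. All helper names are hypothetical.

def split_feature_ref(ref):
    """Split '<FeatureSet>.<Feature>' into (feature_set, feature)."""
    feature_set, _, feature = ref.partition(".")
    return feature_set, feature


def spine_of(feature_refs):
    """Return the feature set named by the first entry, which acts as the spine."""
    return split_feature_ref(feature_refs[0])[0]


def join_to_spine(spine_rows, other_rows, key):
    """Left join: keep every spine row and attach matching values when they exist."""
    lookup = {row[key]: row for row in other_rows}
    joined = []
    for row in spine_rows:
        extra = lookup.get(row[key], {})
        joined.append({**extra, **row})  # spine values win on a name clash
    return joined


features = ["events.*", "transactions.amount_max_2h", "transactions.amount"]
print(spine_of(features))
print(split_feature_ref("transactions.amount_max_2h"))

# Toy rows keyed by a made-up 'customer_id' column
events = [
    {"customer_id": "C1", "event_login": 1},
    {"customer_id": "C2", "event_login": 0},
]
transactions = [{"customer_id": "C1", "amount": 24.02}]

rows = join_to_spine(events, transactions, key="customer_id")
for row in rows:
    print(row)
```

Every spine row survives the join, so the result has one row per event; the second row simply has no amount value because no transaction matches it.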
Step 2 - Preview the feature vector data
Obtain the values of the features in the feature vector, to ensure the data appears as expected.
# Import the Parquet Target so you can directly save your dataset as a file
from mlrun.datastore.targets import ParquetTarget
# Get offline feature vector as dataframe and save the dataset to parquet
train_dataset = fstore.get_offline_features(fv_name, target=ParquetTarget())
> 2023-02-15 14:43:23,376 [info] wrote target: {'name': 'parquet', 'kind': 'parquet', 'path': 'v3io:///projects/fraud-demo-dani/FeatureStore/transactions-fraud/parquet/vectors/transactions-fraud-latest.parquet', 'status': 'ready', 'updated': '2023-02-15T14:43:23.375968+00:00', 'size': 140838, 'partitioned': True}
# Preview your dataset
train_dataset.to_dataframe().tail(5)
index | event_details_change | event_login | event_password_change | amount_max_2h | amount_sum_2h | amount_count_2h | amount_avg_2h | amount_max_12h | amount_sum_12h | amount_count_12h | ... | es_home_sum_14d | es_travel_sum_14d | es_leisure_sum_14d | gender_F | gender_M | step | amount | timestamp_hour | timestamp_day_of_week | label |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1763 | 0 | 0 | 1 | 45.28 | 144.56 | 5.0 | 28.9120 | 161.75 | 1017.80 | 33.0 | ... | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 96.0 | 24.02 | 14.0 | 2.0 | 0.0 |
1764 | 1 | 0 | 0 | 26.81 | 47.75 | 2.0 | 23.8750 | 68.16 | 653.02 | 24.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 134.0 | 26.81 | 14.0 | 2.0 | 0.0 |
1765 | 0 | 1 | 0 | 33.10 | 91.11 | 4.0 | 22.7775 | 121.96 | 1001.32 | 32.0 | ... | 2.0 | 0.0 | 0.0 | 1.0 | 0.0 | 141.0 | 14.95 | 14.0 | 2.0 | 0.0 |
1766 | 0 | 0 | 1 | 22.35 | 37.68 | 3.0 | 12.5600 | 71.63 | 1052.44 | 37.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 101.0 | 13.62 | 14.0 | 2.0 | 0.0 |
1767 | 0 | 0 | 1 | 44.37 | 76.87 | 4.0 | 19.2175 | 159.32 | 1189.73 | 39.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 40.0 | 12.82 | 14.0 | 2.0 | 0.0 |
5 rows × 36 columns
Step 3 - Train models and choose the highest accuracy
With MLRun, you can easily train different models and compare the results. In the code below, you train three different models, each using a different algorithm (random forest, gradient boosting, and AdaBoost), and choose the one with the highest accuracy. Note that the model named transaction_fraud_xgboost is trained with scikit-learn’s GradientBoostingClassifier, not with the XGBoost library.
# Import the Sklearn classifier function from the functions hub
classifier_fn = mlrun.import_function('hub://auto_trainer')
# Prepare the parameters list for the training function
# you use 3 different models
training_params = {
    "model_name": [
        'transaction_fraud_rf',
        'transaction_fraud_xgboost',
        'transaction_fraud_adaboost',
    ],
    "model_class": [
        'sklearn.ensemble.RandomForestClassifier',
        'sklearn.ensemble.GradientBoostingClassifier',
        'sklearn.ensemble.AdaBoostClassifier',
    ],
}
# Define the training task, including your feature vector, label and hyperparams definitions
train_task = mlrun.new_task(
    'training',
    inputs={'dataset': transactions_fv.uri},
    params={'label_columns': 'label'},
)
train_task.with_hyper_params(training_params, strategy='list', selector='max.accuracy')
# Specify your cluster image
classifier_fn.spec.image = 'mlrun/mlrun'
# Run training
classifier_fn.run(train_task, local=False)
> 2023-02-15 14:43:23,870 [info] starting run training uid=946725e1c01f4e0ba9d7eb62f7f24142 DB=http://mlrun-api:8080
> 2023-02-15 14:43:24,069 [info] Job is running in the background, pod: training-68dct
> 2023-02-15 14:43:58,472 [info] test_set or train_test_split_size are not provided, setting train_test_split_size to 0.2
> 2023-02-15 14:44:00,031 [info] label columns: label
> 2023-02-15 14:44:00,031 [info] Sample set not given, using the whole training set as the sample set
> 2023-02-15 14:44:00,278 [info] training 'transaction_fraud_rf'
/usr/local/lib/python3.9/site-packages/sklearn/calibration.py:1000: FutureWarning:
The normalize argument is deprecated in v1.1 and will be removed in v1.3. Explicitly normalizing y_prob will reproduce this behavior, but it is recommended that a proper probability is used (i.e. a classifier's `predict_proba` positive class or `decision_function` output calibrated with `CalibratedClassifierCV`).
> 2023-02-15 14:44:03,298 [info] test_set or train_test_split_size are not provided, setting train_test_split_size to 0.2
> 2023-02-15 14:44:04,277 [info] label columns: label
> 2023-02-15 14:44:04,277 [info] Sample set not given, using the whole training set as the sample set
> 2023-02-15 14:44:04,281 [info] training 'transaction_fraud_xgboost'
/usr/local/lib/python3.9/site-packages/sklearn/calibration.py:1000: FutureWarning:
The normalize argument is deprecated in v1.1 and will be removed in v1.3. Explicitly normalizing y_prob will reproduce this behavior, but it is recommended that a proper probability is used (i.e. a classifier's `predict_proba` positive class or `decision_function` output calibrated with `CalibratedClassifierCV`).
> 2023-02-15 14:44:07,773 [info] test_set or train_test_split_size are not provided, setting train_test_split_size to 0.2
> 2023-02-15 14:44:09,037 [info] label columns: label
> 2023-02-15 14:44:09,037 [info] Sample set not given, using the whole training set as the sample set
> 2023-02-15 14:44:09,040 [info] training 'transaction_fraud_adaboost'
/usr/local/lib/python3.9/site-packages/sklearn/calibration.py:1000: FutureWarning:
The normalize argument is deprecated in v1.1 and will be removed in v1.3. Explicitly normalizing y_prob will reproduce this behavior, but it is recommended that a proper probability is used (i.e. a classifier's `predict_proba` positive class or `decision_function` output calibrated with `CalibratedClassifierCV`).
> 2023-02-15 14:44:11,957 [info] best iteration=1, used criteria max.accuracy
> 2023-02-15 14:44:12,668 [info] To track results use the CLI: {'info_cmd': 'mlrun get run 946725e1c01f4e0ba9d7eb62f7f24142 -p fraud-demo-dani', 'logs_cmd': 'mlrun logs 946725e1c01f4e0ba9d7eb62f7f24142 -p fraud-demo-dani'}
> 2023-02-15 14:44:12,668 [info] Or click for UI: {'ui_url': 'https://dashboard.default-tenant.app.vmdev94.lab.iguazeng.com/mlprojects/fraud-demo-dani/jobs/monitor/946725e1c01f4e0ba9d7eb62f7f24142/overview'}
> 2023-02-15 14:44:12,669 [info] run executed, status=completed
final state: completed
project | uid | iter | start | state | name | labels | inputs | parameters | results | artifacts |
---|---|---|---|---|---|---|---|---|---|---|
fraud-demo-dani | | 0 | Feb 15 14:43:57 | completed | training | v3io_user=dani kind=job owner=dani mlrun/client_version=1.3.0-rc23 mlrun/client_python_version=3.9.16 | dataset | label_columns=label | best_iteration=1 accuracy=1.0 f1_score=1.0 precision_score=1.0 recall_score=1.0 | feature-importance test_set confusion-matrix roc-curves calibration-curve model iteration_results parallel_coordinates |
> 2023-02-15 14:44:15,576 [info] run executed, status=completed
<mlrun.model.RunObject at 0x7f3288543e20>
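The strategy='list' setting used above pairs the hyperparameter lists by position, so the three model names line up with the three model classes and three child runs are created, as the log shows. The sketch below contrasts that with a grid (Cartesian product) and mimics the max.accuracy selector. It is plain Python, not MLRun code: the function names, the shortened parameter values, and the accuracy figures are all made up for illustration, and numbering iterations from 1 is an assumption based on the "best iteration=1" log line.

```python
from itertools import product

# Same shape as training_params in the tutorial, shortened for readability
params = {
    "model_name": ["rf", "gb", "adaboost"],
    "model_class": [
        "RandomForestClassifier",
        "GradientBoostingClassifier",
        "AdaBoostClassifier",
    ],
}


def list_strategy(hyper_params):
    """Pair values by position: one iteration per index."""
    keys = list(hyper_params)
    return [dict(zip(keys, values)) for values in zip(*hyper_params.values())]


def grid_strategy(hyper_params):
    """Cartesian product: one iteration per combination."""
    keys = list(hyper_params)
    return [dict(zip(keys, values)) for values in product(*hyper_params.values())]


list_runs = list_strategy(params)
grid_runs = grid_strategy(params)
print(len(list_runs), len(grid_runs))  # 3 9

# Hypothetical accuracies, one per list iteration
accuracies = [0.97, 0.99, 0.98]

# Selector in the spirit of 'max.accuracy': pick the iteration with the highest value
best = max(range(len(accuracies)), key=accuracies.__getitem__)
print(best + 1, list_runs[best]["model_name"])  # 2 gb
```

With a grid, each model name would also be combined with the two non-matching classes, which is not what this tutorial wants; the list strategy keeps each name attached to its own class.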
Step 4 - Perform feature selection
As part of the data science process, try to reduce the number of features in the training dataset: dropping uninformative features saves computation time.
Use the ready-made feature selection function from the MLRun function hub (hub://feature_selection) to select the best features to keep, running it on a sample of your dataset.
feature_selection_fn = mlrun.import_function('hub://feature_selection')
feature_selection_run = feature_selection_fn.run(
    params={
        "k": 18,
        "min_votes": 2,
        "label_column": 'label',
        'output_vector_name': fv_name + "-short",
        'ignore_type_errors': True,
    },
    inputs={'df_artifact': transactions_fv.uri},
    name='feature_extraction',
    handler='feature_selection',
    local=False,
)
> 2023-02-15 14:44:16,098 [info] starting run feature_extraction uid=da55327c222f4a9389232f25fc6b9739 DB=http://mlrun-api:8080
> 2023-02-15 14:44:16,262 [info] Job is running in the background, pod: feature-extraction-pv66m
final state: completed
project | uid | iter | start | state | name | labels | inputs | parameters | results | artifacts |
---|---|---|---|---|---|---|---|---|---|---|
fraud-demo-dani | | 0 | Feb 15 14:45:50 | completed | feature_extraction | v3io_user=dani kind=job owner=dani mlrun/client_version=1.3.0-rc23 mlrun/client_python_version=3.9.16 host=feature-extraction-pv66m | df_artifact | k=18 min_votes=2 label_column=label output_vector_name=transactions-fraud-short ignore_type_errors=True | top_features_vector=store://feature-vectors/fraud-demo-dani/transactions-fraud-short | f_classif mutual_info_classif chi2 f_regression LinearSVC LogisticRegression ExtraTreesClassifier feature_scores max_scaled_scores_feature_scores selected_features_count selected_features |
> 2023-02-15 14:46:05,989 [info] run executed, status=completed
mlrun.get_dataitem(feature_selection_run.outputs['top_features_vector']).as_df().tail(5)
index | amount_max_2h | amount_sum_2h | amount_count_2h | amount_avg_2h | amount_max_12h | amount_sum_12h | amount_count_12h | amount_avg_12h | amount_max_24h | amount_sum_24h | amount_count_24h | amount_avg_24h | es_transportation_sum_14d | es_health_sum_14d | es_otherservices_sum_14d | label |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
9995 | 54.55 | 118.62 | 4.0 | 29.655 | 70.47 | 805.10 | 27.0 | 29.818519 | 85.97 | 1730.23 | 58.0 | 29.831552 | 120.0 | 0.0 | 0.0 | 0 |
9996 | 31.14 | 31.14 | 1.0 | 31.140 | 119.50 | 150.64 | 2.0 | 75.320000 | 119.50 | 330.61 | 5.0 | 66.122000 | 0.0 | 7.0 | 0.0 | 0 |
9997 | 218.48 | 365.30 | 5.0 | 73.060 | 218.48 | 1076.37 | 25.0 | 43.054800 | 218.48 | 1968.00 | 59.0 | 33.355932 | 107.0 | 5.0 | 1.0 | 0 |
9998 | 34.93 | 118.22 | 5.0 | 23.644 | 79.16 | 935.26 | 31.0 | 30.169677 | 89.85 | 2062.69 | 68.0 | 30.333676 | 116.0 | 0.0 | 0.0 | 0 |
9999 | 77.76 | 237.95 | 5.0 | 47.590 | 95.71 | 1259.07 | 37.0 | 34.028919 | 95.71 | 2451.98 | 72.0 | 34.055278 | 122.0 | 0.0 | 0.0 | 0 |
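The k and min_votes parameters, together with the per-method artifacts listed in the run output (f_classif, mutual_info_classif, chi2, and so on), suggest a voting scheme: each scoring method nominates its top k features, and a feature is kept when enough methods nominate it. The sketch below shows that general idea in plain Python. It is an assumption about the approach, not the hub function's code; the function name select_by_votes and the toy nominations are hypothetical.

```python
from collections import Counter


def select_by_votes(top_k_per_method, min_votes):
    """Keep features nominated by at least `min_votes` of the scoring methods."""
    votes = Counter()
    for selected in top_k_per_method.values():
        votes.update(set(selected))  # at most one vote per method per feature
    return sorted(name for name, count in votes.items() if count >= min_votes)


# Made-up top-2 nominations from three scoring methods
top_k = {
    "f_classif": ["amount_max_2h", "amount_avg_2h"],
    "chi2": ["amount_max_2h", "gender_F"],
    "mutual_info_classif": ["amount_avg_2h", "amount_max_2h"],
}

selected = select_by_votes(top_k, min_votes=2)
print(selected)  # ['amount_avg_2h', 'amount_max_2h']
```

Raising min_votes makes the selection stricter, because a feature then needs agreement from more methods before it is kept.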
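Step 5 - Train your models with top features
Following the feature selection, you train new models using the selected features. In the run below, accuracy remains high (about 0.99), but the F1, precision, and recall scores are much lower (F1 is about 0.48), so compare all of the reported metrics, not only accuracy, before concluding that the smaller feature set is sufficient. A model that needs fewer features is cheaper to compute and to serve.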
# Define your training task, including your feature vector, label and hyperparams definitions
ensemble_train_task = mlrun.new_task(
    'training',
    inputs={'dataset': feature_selection_run.outputs['top_features_vector']},
    params={'label_columns': 'label'},
)
ensemble_train_task.with_hyper_params(training_params, strategy='list', selector='max.accuracy')
classifier_fn.run(ensemble_train_task)
> 2023-02-15 14:46:06,131 [info] starting run training uid=4ac3afbfb6a1409daa1e834f8f153295 DB=http://mlrun-api:8080
> 2023-02-15 14:46:07,756 [info] Job is running in the background, pod: training-hgz6t
> 2023-02-15 14:46:17,141 [info] test_set or train_test_split_size are not provided, setting train_test_split_size to 0.2
> 2023-02-15 14:46:17,731 [info] label columns: label
> 2023-02-15 14:46:17,732 [info] Sample set not given, using the whole training set as the sample set
> 2023-02-15 14:46:18,031 [info] training 'transaction_fraud_rf'
/usr/local/lib/python3.9/site-packages/sklearn/calibration.py:1000: FutureWarning:
The normalize argument is deprecated in v1.1 and will be removed in v1.3. Explicitly normalizing y_prob will reproduce this behavior, but it is recommended that a proper probability is used (i.e. a classifier's `predict_proba` positive class or `decision_function` output calibrated with `CalibratedClassifierCV`).
> 2023-02-15 14:46:21,793 [info] test_set or train_test_split_size are not provided, setting train_test_split_size to 0.2
> 2023-02-15 14:46:22,767 [info] label columns: label
> 2023-02-15 14:46:22,767 [info] Sample set not given, using the whole training set as the sample set
> 2023-02-15 14:46:22,770 [info] training 'transaction_fraud_xgboost'
/usr/local/lib/python3.9/site-packages/sklearn/calibration.py:1000: FutureWarning:
The normalize argument is deprecated in v1.1 and will be removed in v1.3. Explicitly normalizing y_prob will reproduce this behavior, but it is recommended that a proper probability is used (i.e. a classifier's `predict_proba` positive class or `decision_function` output calibrated with `CalibratedClassifierCV`).
> 2023-02-15 14:46:28,944 [info] test_set or train_test_split_size are not provided, setting train_test_split_size to 0.2
> 2023-02-15 14:46:29,507 [info] label columns: label
> 2023-02-15 14:46:29,507 [info] Sample set not given, using the whole training set as the sample set
> 2023-02-15 14:46:29,511 [info] training 'transaction_fraud_adaboost'
/usr/local/lib/python3.9/site-packages/sklearn/calibration.py:1000: FutureWarning:
The normalize argument is deprecated in v1.1 and will be removed in v1.3. Explicitly normalizing y_prob will reproduce this behavior, but it is recommended that a proper probability is used (i.e. a classifier's `predict_proba` positive class or `decision_function` output calibrated with `CalibratedClassifierCV`).
> 2023-02-15 14:46:31,968 [info] best iteration=2, used criteria max.accuracy
> 2023-02-15 14:46:32,376 [info] To track results use the CLI: {'info_cmd': 'mlrun get run 4ac3afbfb6a1409daa1e834f8f153295 -p fraud-demo-dani', 'logs_cmd': 'mlrun logs 4ac3afbfb6a1409daa1e834f8f153295 -p fraud-demo-dani'}
> 2023-02-15 14:46:32,376 [info] Or click for UI: {'ui_url': 'https://dashboard.default-tenant.app.vmdev94.lab.iguazeng.com/mlprojects/fraud-demo-dani/jobs/monitor/4ac3afbfb6a1409daa1e834f8f153295/overview'}
> 2023-02-15 14:46:32,377 [info] run executed, status=completed
final state: completed
project | uid | iter | start | state | name | labels | inputs | parameters | results | artifacts |
---|---|---|---|---|---|---|---|---|---|---|
fraud-demo-dani | | 0 | Feb 15 14:46:16 | completed | training | v3io_user=dani kind=job owner=dani mlrun/client_version=1.3.0-rc23 mlrun/client_python_version=3.9.16 | dataset | label_columns=label | best_iteration=2 accuracy=0.992503748125937 f1_score=0.4827586206896552 precision_score=0.5833333333333334 recall_score=0.4117647058823529 | feature-importance test_set confusion-matrix roc-curves calibration-curve model iteration_results parallel_coordinates |
> 2023-02-15 14:46:33,094 [info] run executed, status=completed
<mlrun.model.RunObject at 0x7f324160af40>
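The second training run reports an accuracy of about 0.9925 alongside an F1 score of about 0.48. When one class is rare, as fraud usually is, accuracy can stay high even if many positive cases are missed. The sketch below recomputes the four metrics from a confusion matrix. The counts are an assumption: they were chosen because they reproduce the reported metrics, and were not read from the run's confusion-matrix artifact.

```python
# Illustrative confusion-matrix counts (an assumption, not taken from the run artifacts);
# they are chosen so that the formulas reproduce the metrics reported above.
tp, fp, fn, tn = 7, 5, 10, 1979
total = tp + fp + fn + tn

accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Accuracy of a trivial model that always predicts "not fraud"
baseline_accuracy = (fp + tn) / total

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
print(f"always-negative baseline accuracy={baseline_accuracy:.4f}")
```

Under these assumed counts, a model that never predicts fraud would already score about 0.9915 accuracy, which is why F1, precision, and recall are the more informative numbers when comparing the two training runs.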
Done!
You’ve completed Part 2 of the model training with the feature store. Proceed to Part 3 to learn how to deploy and monitor the model.