Part 2: Training

In this part you learn how to use MLRun’s Feature Store to define a feature vector and create the dataset you need to run the training process.
By the end of this tutorial you’ll know how to:

Combine multiple data sources into a single feature vector
Create a training dataset
Create a model using an MLRun hub function

project_name = 'fraud-demo'

import mlrun

# Initialize the MLRun project object
project = mlrun.get_or_create_project(project_name, context="./", user_project=True)

> 2023-02-15 14:43:21,980 [info] loaded project fraud-demo from MLRun DB

Step 1 - Create a feature vector

In this section you create a feature vector.
The feature vector has a name so you can reference it later via its URI or from your serving function, and it has a list of features from the available feature sets. You can add a feature from a feature set by adding <FeatureSet>.<Feature> to the list, or add <FeatureSet>.* to add all of the feature set’s available features.
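As a rough illustration of the reference syntax only, the sketch below expands `<FeatureSet>.<Feature>` and `<FeatureSet>.*` entries against a small in-memory catalog. The `CATALOG` dictionary and the `expand_feature_refs` helper are hypothetical names made up for this example; they are not part of MLRun, which resolves references against the feature sets stored in the project.

```python
# Hypothetical catalog of feature sets and their features (illustration only).
CATALOG = {
    "events": ["event_details_change", "event_login", "event_password_change"],
    "transactions": ["amount_max_2h", "amount_sum_2h", "gender_F", "gender_M"],
}


def expand_feature_refs(refs, catalog):
    """Expand '<FeatureSet>.<Feature>' and '<FeatureSet>.*' references.

    Returns a list of (feature_set, feature) pairs in request order.
    """
    expanded = []
    for ref in refs:
        feature_set, _, feature = ref.partition(".")
        if feature_set not in catalog:
            raise KeyError(f"unknown feature set: {feature_set!r}")
        if feature == "*":
            # The wildcard pulls in every feature of the set.
            expanded.extend((feature_set, name) for name in catalog[feature_set])
        elif feature in catalog[feature_set]:
            expanded.append((feature_set, feature))
        else:
            raise KeyError(f"unknown feature: {ref!r}")
    return expanded


resolved = expand_feature_refs(["events.*", "transactions.amount_max_2h"], CATALOG)
for feature_set, feature in resolved:
    print(f"{feature_set}.{feature}")
```

Here the wildcard contributes three event features and the explicit reference contributes one transaction feature.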

By default, the first FeatureSet in the feature list acts as the spine, meaning that all the other features are joined to it.
For example, in this instance you use the events feature set as the spine, so each event produces a row in the resulting feature vector.
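The spine idea can be pictured as a left join: every spine row is kept, and features from the other sets are attached where a match exists. The following is a minimal sketch in plain Python, not MLRun's join logic. The rows, the `source` key, and the `join_to_spine` helper are assumptions made for illustration, and the sketch ignores timestamps, which a real feature store join would normally take into account.

```python
# Hypothetical rows, keyed by a made-up "source" entity (illustration only).
spine_rows = [
    {"source": "C1", "event_login": 1},
    {"source": "C2", "event_login": 0},
    {"source": "C1", "event_login": 0},
]
transaction_features = {
    "C1": {"amount_max_2h": 45.28, "amount_count_2h": 5.0},
    # "C2" has no transaction features on purpose.
}
feature_names = ["amount_max_2h", "amount_count_2h"]


def join_to_spine(spine, lookup, names, key="source"):
    """Left-join looked-up features onto every spine row."""
    joined = []
    for row in spine:
        match = lookup.get(row[key], {})
        # Missing features become None, so the row count always equals the spine's.
        joined.append({**row, **{name: match.get(name) for name in names}})
    return joined


vector_rows = join_to_spine(spine_rows, transaction_features, feature_names)
for row in vector_rows:
    print(row)
```

The output has exactly one row per spine row, including the row for `C2`, whose joined features are empty.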

# Define the list of features to use
features = ['events.*',
            'transactions.amount_max_2h', 
            'transactions.amount_sum_2h', 
            'transactions.amount_count_2h',
            'transactions.amount_avg_2h', 
            'transactions.amount_max_12h', 
            'transactions.amount_sum_12h',
            'transactions.amount_count_12h', 
            'transactions.amount_avg_12h', 
            'transactions.amount_max_24h',
            'transactions.amount_sum_24h', 
            'transactions.amount_count_24h', 
            'transactions.amount_avg_24h',
            'transactions.es_transportation_sum_14d', 
            'transactions.es_health_sum_14d',
            'transactions.es_otherservices_sum_14d', 
            'transactions.es_food_sum_14d',
            'transactions.es_hotelservices_sum_14d', 
            'transactions.es_barsandrestaurants_sum_14d',
            'transactions.es_tech_sum_14d', 
            'transactions.es_sportsandtoys_sum_14d',
            'transactions.es_wellnessandbeauty_sum_14d', 
            'transactions.es_hyper_sum_14d',
            'transactions.es_fashion_sum_14d', 
            'transactions.es_home_sum_14d', 
            'transactions.es_travel_sum_14d', 
            'transactions.es_leisure_sum_14d',
            'transactions.gender_F',
            'transactions.gender_M',
            'transactions.step', 
            'transactions.amount', 
            'transactions.timestamp_hour',
            'transactions.timestamp_day_of_week']

# Import MLRun's Feature Store
import mlrun.feature_store as fstore

# Define the feature vector name for future reference
fv_name = 'transactions-fraud'

# Define the feature vector using the feature store (fstore)
transactions_fv = fstore.FeatureVector(fv_name, 
                          features, 
                          label_feature="labels.label",
                          description='Predicting a fraudulent transaction')

# Save the feature vector in the feature store
transactions_fv.save()

Step 2 - Preview the feature vector data

Retrieve the values of the features in the feature vector to verify that the data appears as expected.

# Import the Parquet Target so you can directly save your dataset as a file
from mlrun.datastore.targets import ParquetTarget

# Get offline feature vector as dataframe and save the dataset to parquet
train_dataset = fstore.get_offline_features(fv_name, target=ParquetTarget())

> 2023-02-15 14:43:23,376 [info] wrote target: {'name': 'parquet', 'kind': 'parquet', 'path': 'v3io:///projects/fraud-demo-dani/FeatureStore/transactions-fraud/parquet/vectors/transactions-fraud-latest.parquet', 'status': 'ready', 'updated': '2023-02-15T14:43:23.375968+00:00', 'size': 140838, 'partitioned': True}

# Preview your dataset
train_dataset.to_dataframe().tail(5)

	event_details_change	event_login	event_password_change	amount_max_2h	amount_sum_2h	amount_count_2h	amount_avg_2h	amount_max_12h	amount_sum_12h	amount_count_12h	...	es_home_sum_14d	es_travel_sum_14d	gender_F	gender_M	step	amount	timestamp_hour	timestamp_day_of_week
1763	0	0	1	45.28	144.56	5.0	28.9120	161.75	1017.80	33.0	...	0.0	1.0	1.0	0.0	96.0	24.02	14.0	2.0
1764	1	0	0	26.81	47.75	2.0	23.8750	68.16	653.02	24.0	...	0.0	0.0	0.0	1.0	134.0	26.81	14.0	2.0
1765	0	1	0	33.10	91.11	4.0	22.7775	121.96	1001.32	32.0	...	2.0	0.0	1.0	0.0	141.0	14.95	14.0	2.0
1766	0	0	1	22.35	37.68	3.0	12.5600	71.63	1052.44	37.0	...	0.0	0.0	0.0	1.0	101.0	13.62	14.0	2.0
1767	0	0	1	44.37	76.87	4.0	19.2175	159.32	1189.73	39.0	...	0.0	0.0	0.0	1.0	40.0	12.82	14.0	2.0

5 rows × 36 columns

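Step 3 - Train models and choose the highest accuracy

With MLRun, you can easily train different models and compare the results. In the code below, you train three different models. Each one uses a different algorithm (random forest, gradient boosting, and AdaBoost), and you choose the model with the highest accuracy. Note that the model named transaction_fraud_xgboost is trained with scikit-learn’s GradientBoostingClassifier, not with the XGBoost library.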
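The training task in this step uses `strategy='list'` and `selector='max.accuracy'`. As a sketch of what that combination means, assuming the list strategy pairs the i-th value of every parameter list into one iteration and the selector picks the iteration with the best value of the named metric: the helpers `list_strategy` and `select_best` and the accuracy numbers below are invented for illustration and are not MLRun code or real results.

```python
training_params = {
    "model_name": [
        "transaction_fraud_rf",
        "transaction_fraud_xgboost",
        "transaction_fraud_adaboost",
    ],
    "model_class": [
        "sklearn.ensemble.RandomForestClassifier",
        "sklearn.ensemble.GradientBoostingClassifier",
        "sklearn.ensemble.AdaBoostClassifier",
    ],
}


def list_strategy(hyper_params):
    """Pair the i-th value of every parameter list into one iteration."""
    lengths = {len(values) for values in hyper_params.values()}
    if len(lengths) != 1:
        raise ValueError("all parameter lists must have the same length")
    names = list(hyper_params)
    return [dict(zip(names, combo)) for combo in zip(*hyper_params.values())]


def select_best(results, selector):
    """Apply a '<max|min>.<metric>' selector; iterations are numbered from 1."""
    direction, _, metric = selector.partition(".")
    pick = max if direction == "max" else min
    best_index = pick(range(len(results)), key=lambda i: results[i][metric])
    return best_index + 1


iterations = list_strategy(training_params)

# Made-up accuracies, one per iteration, used only to exercise the selector.
fake_results = [{"accuracy": 0.97}, {"accuracy": 0.99}, {"accuracy": 0.98}]
best = select_best(fake_results, "max.accuracy")
print(best, iterations[best - 1]["model_name"])
```

With this pairing, three names and three classes yield three iterations rather than the nine combinations a grid search would produce.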

# Import the Sklearn classifier function from the functions hub
classifier_fn = mlrun.import_function('hub://auto_trainer')

# Prepare the parameters list for the training function
# you use 3 different models
training_params = {"model_name": ['transaction_fraud_rf', 
                                  'transaction_fraud_xgboost', 
                                  'transaction_fraud_adaboost'],
              
                  "model_class": ['sklearn.ensemble.RandomForestClassifier',
                                  'sklearn.ensemble.GradientBoostingClassifier',
                                  'sklearn.ensemble.AdaBoostClassifier']}

# Define the training task, including your feature vector, label and hyperparams definitions
train_task = mlrun.new_task('training', 
                      inputs={'dataset': transactions_fv.uri},
                      params={'label_columns': 'label'}
                     )

train_task.with_hyper_params(training_params, strategy='list', selector='max.accuracy')

# Specify your cluster image
classifier_fn.spec.image = 'mlrun/mlrun'

# Run training
classifier_fn.run(train_task, local=False)

> 2023-02-15 14:43:23,870 [info] starting run training uid=946725e1c01f4e0ba9d7eb62f7f24142 DB=http://mlrun-api:8080
> 2023-02-15 14:43:24,069 [info] Job is running in the background, pod: training-68dct
> 2023-02-15 14:43:58,472 [info] test_set or train_test_split_size are not provided, setting train_test_split_size to 0.2
> 2023-02-15 14:44:00,031 [info] label columns: label
> 2023-02-15 14:44:00,031 [info] Sample set not given, using the whole training set as the sample set
> 2023-02-15 14:44:00,278 [info] training 'transaction_fraud_rf'
/usr/local/lib/python3.9/site-packages/sklearn/calibration.py:1000: FutureWarning:

The normalize argument is deprecated in v1.1 and will be removed in v1.3. Explicitly normalizing y_prob will reproduce this behavior, but it is recommended that a proper probability is used (i.e. a classifier's `predict_proba` positive class or `decision_function` output calibrated with `CalibratedClassifierCV`).

> 2023-02-15 14:44:03,298 [info] test_set or train_test_split_size are not provided, setting train_test_split_size to 0.2
> 2023-02-15 14:44:04,277 [info] label columns: label
> 2023-02-15 14:44:04,277 [info] Sample set not given, using the whole training set as the sample set
> 2023-02-15 14:44:04,281 [info] training 'transaction_fraud_xgboost'
/usr/local/lib/python3.9/site-packages/sklearn/calibration.py:1000: FutureWarning:

The normalize argument is deprecated in v1.1 and will be removed in v1.3. Explicitly normalizing y_prob will reproduce this behavior, but it is recommended that a proper probability is used (i.e. a classifier's `predict_proba` positive class or `decision_function` output calibrated with `CalibratedClassifierCV`).

> 2023-02-15 14:44:07,773 [info] test_set or train_test_split_size are not provided, setting train_test_split_size to 0.2
> 2023-02-15 14:44:09,037 [info] label columns: label
> 2023-02-15 14:44:09,037 [info] Sample set not given, using the whole training set as the sample set
> 2023-02-15 14:44:09,040 [info] training 'transaction_fraud_adaboost'
/usr/local/lib/python3.9/site-packages/sklearn/calibration.py:1000: FutureWarning:

The normalize argument is deprecated in v1.1 and will be removed in v1.3. Explicitly normalizing y_prob will reproduce this behavior, but it is recommended that a proper probability is used (i.e. a classifier's `predict_proba` positive class or `decision_function` output calibrated with `CalibratedClassifierCV`).

> 2023-02-15 14:44:11,957 [info] best iteration=1, used criteria max.accuracy
> 2023-02-15 14:44:12,668 [info] To track results use the CLI: {'info_cmd': 'mlrun get run 946725e1c01f4e0ba9d7eb62f7f24142 -p fraud-demo-dani', 'logs_cmd': 'mlrun logs 946725e1c01f4e0ba9d7eb62f7f24142 -p fraud-demo-dani'}
> 2023-02-15 14:44:12,668 [info] Or click for UI: {'ui_url': 'https://dashboard.default-tenant.app.vmdev94.lab.iguazeng.com/mlprojects/fraud-demo-dani/jobs/monitor/946725e1c01f4e0ba9d7eb62f7f24142/overview'}
> 2023-02-15 14:44:12,669 [info] run executed, status=completed
final state: completed

project	uid	iter	start	state	name	labels	inputs	parameters	results	artifacts
fraud-demo-dani	...f7f24142	0	Feb 15 14:43:57	completed	training	v3io_user=dani kind=job owner=dani mlrun/client_version=1.3.0-rc23 mlrun/client_python_version=3.9.16	dataset	label_columns=label	best_iteration=1 accuracy=1.0 f1_score=1.0 precision_score=1.0 recall_score=1.0	feature-importance test_set confusion-matrix roc-curves calibration-curve model iteration_results parallel_coordinates

> to track results use the .show() or .logs() methods or click here to open in UI

> 2023-02-15 14:44:15,576 [info] run executed, status=completed

<mlrun.model.RunObject at 0x7f3288543e20>

Step 4 - Perform feature selection

As part of the data science process, try to reduce the training dataset’s size by removing bad or uninformative features, which also saves computation time.

Use the ready-made feature selection function from MLRun’s function hub (hub://feature_selection) to select the best features to keep, based on a sample from your dataset.

# Import the feature selection function from the functions hub
feature_selection_fn = mlrun.import_function('hub://feature_selection')

# Run feature selection on the feature vector and save the result as a new vector
feature_selection_run = feature_selection_fn.run(
    params={'k': 18,
            'min_votes': 2,
            'label_column': 'label',
            'output_vector_name': fv_name + '-short',
            'ignore_type_errors': True},
    inputs={'df_artifact': transactions_fv.uri},
    name='feature_extraction',
    handler='feature_selection',
    local=False)

> 2023-02-15 14:44:16,098 [info] starting run feature_extraction uid=da55327c222f4a9389232f25fc6b9739 DB=http://mlrun-api:8080
> 2023-02-15 14:44:16,262 [info] Job is running in the background, pod: feature-extraction-pv66m
final state: completed

project	uid	iter	start	state	name	labels	inputs	parameters	results	artifacts
fraud-demo-dani	...fc6b9739	0	Feb 15 14:45:50	completed	feature_extraction	v3io_user=dani kind=job owner=dani mlrun/client_version=1.3.0-rc23 mlrun/client_python_version=3.9.16 host=feature-extraction-pv66m	df_artifact	k=18 min_votes=2 label_column=label output_vector_name=transactions-fraud-short ignore_type_errors=True	top_features_vector=store://feature-vectors/fraud-demo-dani/transactions-fraud-short	f_classif mutual_info_classif chi2 f_regression LinearSVC LogisticRegression ExtraTreesClassifier feature_scores max_scaled_scores_feature_scores selected_features_count selected_features

> to track results use the .show() or .logs() methods or click here to open in UI

> 2023-02-15 14:46:05,989 [info] run executed, status=completed

mlrun.get_dataitem(feature_selection_run.outputs['top_features_vector']).as_df().tail(5)

	amount_max_2h	amount_sum_2h	amount_count_2h	amount_avg_2h	amount_max_12h	amount_sum_12h	amount_count_12h	amount_avg_12h	amount_max_24h	amount_sum_24h	amount_count_24h	amount_avg_24h	es_transportation_sum_14d	es_health_sum_14d	es_otherservices_sum_14d
9995	54.55	118.62	4.0	29.655	70.47	805.10	27.0	29.818519	85.97	1730.23	58.0	29.831552	120.0	0.0	0.0
9996	31.14	31.14	1.0	31.140	119.50	150.64	2.0	75.320000	119.50	330.61	5.0	66.122000	0.0	7.0	0.0
9997	218.48	365.30	5.0	73.060	218.48	1076.37	25.0	43.054800	218.48	1968.00	59.0	33.355932	107.0	5.0	1.0
9998	34.93	118.22	5.0	23.644	79.16	935.26	31.0	30.169677	89.85	2062.69	68.0	30.333676	116.0	0.0	0.0
9999	77.76	237.95	5.0	47.590	95.71	1259.07	37.0	34.028919	95.71	2451.98	72.0	34.055278	122.0	0.0	0.0
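The run above passes `k=18` and `min_votes=2`. One way to read these parameters, sketched here under the assumption that each scoring method nominates its top `k` features and a feature is kept when at least `min_votes` methods nominate it: the scores and the `vote_for_features` helper are made up for illustration, and the hub function's actual implementation may differ.

```python
from collections import Counter

# Made-up per-method scores (higher is better); names echo columns in this tutorial.
scores_by_method = {
    "f_classif": {"amount_max_2h": 0.9, "amount_sum_2h": 0.7, "gender_F": 0.1, "step": 0.4},
    "mutual_info_classif": {"amount_max_2h": 0.8, "amount_sum_2h": 0.2, "gender_F": 0.3, "step": 0.6},
    "ExtraTreesClassifier": {"amount_max_2h": 0.5, "amount_sum_2h": 0.6, "gender_F": 0.2, "step": 0.1},
}


def vote_for_features(scores, k, min_votes):
    """Keep features that rank in the top k for at least min_votes methods."""
    votes = Counter()
    for method_scores in scores.values():
        top_k = sorted(method_scores, key=method_scores.get, reverse=True)[:k]
        votes.update(top_k)
    return sorted(name for name, count in votes.items() if count >= min_votes)


selected = vote_for_features(scores_by_method, k=2, min_votes=2)
print(selected)
```

In this toy input, `amount_max_2h` gets three votes and `amount_sum_2h` gets two, so both pass the `min_votes=2` threshold, while `step` gets one vote and is dropped.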

Step 5 - Train your models with top features

Following the feature selection, you train new models using the selected features. In the run below, accuracy remains high (about 0.99), so you get a model that needs fewer features and is therefore simpler and less error-prone. Note, however, that the logged F1, precision, and recall scores are lower than in Step 3, so review those metrics as well before choosing a model.
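The results row logged for this step reports accuracy=0.9925, precision=0.5833, recall=0.4118, and F1=0.4828. To show how these numbers can coexist, the sketch below recomputes them from one set of confusion-matrix counts that is consistent with all four values. The counts are a reconstruction from the logged metrics, assuming a binary confusion matrix; they are not read from the run's confusion-matrix artifact.

```python
import math

# Counts reconstructed from the logged metrics, not read from the run's artifacts.
tp, fp, fn, tn = 7, 5, 10, 1979

total = tp + fp + fn + tn
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")

# A classifier that never predicts fraud would still score high accuracy here.
always_negative_accuracy = (tn + fp) / total
print(f"always-negative accuracy={always_negative_accuracy:.4f}")
```

Under these assumed counts, only 17 of 2001 test rows are positive, so a model that never flags fraud would already reach an accuracy of about 0.9915. This is why accuracy alone is a weak selection criterion when the positive class is rare.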

# Define your training task, including your feature vector, label and hyperparams definitions
ensemble_train_task = mlrun.new_task('training', 
                      inputs={'dataset': feature_selection_run.outputs['top_features_vector']},
                      params={'label_columns': 'label'}
                     )
ensemble_train_task.with_hyper_params(training_params, strategy='list', selector='max.accuracy')

classifier_fn.run(ensemble_train_task)

> 2023-02-15 14:46:06,131 [info] starting run training uid=4ac3afbfb6a1409daa1e834f8f153295 DB=http://mlrun-api:8080
> 2023-02-15 14:46:07,756 [info] Job is running in the background, pod: training-hgz6t
> 2023-02-15 14:46:17,141 [info] test_set or train_test_split_size are not provided, setting train_test_split_size to 0.2
> 2023-02-15 14:46:17,731 [info] label columns: label
> 2023-02-15 14:46:17,732 [info] Sample set not given, using the whole training set as the sample set
> 2023-02-15 14:46:18,031 [info] training 'transaction_fraud_rf'
/usr/local/lib/python3.9/site-packages/sklearn/calibration.py:1000: FutureWarning:

The normalize argument is deprecated in v1.1 and will be removed in v1.3. Explicitly normalizing y_prob will reproduce this behavior, but it is recommended that a proper probability is used (i.e. a classifier's `predict_proba` positive class or `decision_function` output calibrated with `CalibratedClassifierCV`).

> 2023-02-15 14:46:21,793 [info] test_set or train_test_split_size are not provided, setting train_test_split_size to 0.2
> 2023-02-15 14:46:22,767 [info] label columns: label
> 2023-02-15 14:46:22,767 [info] Sample set not given, using the whole training set as the sample set
> 2023-02-15 14:46:22,770 [info] training 'transaction_fraud_xgboost'
/usr/local/lib/python3.9/site-packages/sklearn/calibration.py:1000: FutureWarning:

The normalize argument is deprecated in v1.1 and will be removed in v1.3. Explicitly normalizing y_prob will reproduce this behavior, but it is recommended that a proper probability is used (i.e. a classifier's `predict_proba` positive class or `decision_function` output calibrated with `CalibratedClassifierCV`).

> 2023-02-15 14:46:28,944 [info] test_set or train_test_split_size are not provided, setting train_test_split_size to 0.2
> 2023-02-15 14:46:29,507 [info] label columns: label
> 2023-02-15 14:46:29,507 [info] Sample set not given, using the whole training set as the sample set
> 2023-02-15 14:46:29,511 [info] training 'transaction_fraud_adaboost'
/usr/local/lib/python3.9/site-packages/sklearn/calibration.py:1000: FutureWarning:

The normalize argument is deprecated in v1.1 and will be removed in v1.3. Explicitly normalizing y_prob will reproduce this behavior, but it is recommended that a proper probability is used (i.e. a classifier's `predict_proba` positive class or `decision_function` output calibrated with `CalibratedClassifierCV`).

> 2023-02-15 14:46:31,968 [info] best iteration=2, used criteria max.accuracy
> 2023-02-15 14:46:32,376 [info] To track results use the CLI: {'info_cmd': 'mlrun get run 4ac3afbfb6a1409daa1e834f8f153295 -p fraud-demo-dani', 'logs_cmd': 'mlrun logs 4ac3afbfb6a1409daa1e834f8f153295 -p fraud-demo-dani'}
> 2023-02-15 14:46:32,376 [info] Or click for UI: {'ui_url': 'https://dashboard.default-tenant.app.vmdev94.lab.iguazeng.com/mlprojects/fraud-demo-dani/jobs/monitor/4ac3afbfb6a1409daa1e834f8f153295/overview'}
> 2023-02-15 14:46:32,377 [info] run executed, status=completed
final state: completed

project	uid	iter	start	state	name	labels	inputs	parameters	results	artifacts
fraud-demo-dani	...8f153295	0	Feb 15 14:46:16	completed	training	v3io_user=dani kind=job owner=dani mlrun/client_version=1.3.0-rc23 mlrun/client_python_version=3.9.16	dataset	label_columns=label	best_iteration=2 accuracy=0.992503748125937 f1_score=0.4827586206896552 precision_score=0.5833333333333334 recall_score=0.4117647058823529	feature-importance test_set confusion-matrix roc-curves calibration-curve model iteration_results parallel_coordinates

> to track results use the .show() or .logs() methods or click here to open in UI

> 2023-02-15 14:46:33,094 [info] run executed, status=completed

<mlrun.model.RunObject at 0x7f324160af40>

Done!

You’ve completed Part 2 of the model training with the feature store. Proceed to Part 3 to learn how to deploy and monitor the model.
