Part 2: Training#
This part shows how to use MLRun’s Feature Store to easily define a Feature Vector and create the dataset you need to run the training process.
By the end of this tutorial you’ll learn how to:
Combine multiple data sources into a single Feature Vector
Create a training dataset
Create a model using an MLRun Hub function
project_name = 'fraud-demo'
import mlrun
# Initialize the MLRun project object
project = mlrun.get_or_create_project(project_name, context="./", user_project=True)
> 2021-09-19 17:59:27,165 [info] loaded project fraud-demo from MLRun DB
Step 1 - Create a feature vector#
In this section you create the Feature Vector.
The Feature Vector has a name, so you can refer to it later via its URI or from the serving function, and a list of features from the available FeatureSets. You can add a single feature by adding <FeatureSet>.<Feature> to the list, or add <FeatureSet>.* to include all of that FeatureSet’s available features.
By default, the first FeatureSet in the feature list acts as the spine, meaning that all the other features are joined to it.
For example, in this instance the spine is the transactions FeatureSet, so each transaction produces one row in the resulting Feature Vector.
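The spine behaviour can be pictured as a left join keyed on the entity: every row of the first feature set in the list (transactions here) yields exactly one output row, whether or not the other feature sets have a match. The sketch below is a plain-Python illustration, not MLRun's implementation; the `source` key, the helper name `join_to_spine`, and the sample rows are hypothetical, and any time-based matching is ignored.

```python
# Illustrative only: a spine-driven left join in plain Python.
# The entity key "source" and all rows are hypothetical sample data.

def join_to_spine(spine_rows, other_rows, key):
    """Return one output row per spine row, enriched with matching columns."""
    # Index the secondary feature set by entity key (the last row per key wins).
    lookup = {row[key]: row for row in other_rows}
    # Collect every non-key column of the secondary set so unmatched rows get None.
    extra_columns = sorted({col for row in other_rows for col in row if col != key})
    joined = []
    for row in spine_rows:
        match = lookup.get(row[key], {})
        merged = dict(row)
        for col in extra_columns:
            merged[col] = match.get(col)
        joined.append(merged)
    return joined

transactions = [
    {"source": "C1", "amount": 2.95},
    {"source": "C2", "amount": 37.40},
    {"source": "C3", "amount": 7.75},
]
events = [
    {"source": "C1", "event_login": 1.0},
    {"source": "C3", "event_login": 0.0},
]

rows = join_to_spine(transactions, events, key="source")
for row in rows:
    print(row)
```

The output has three rows, one per transaction; the transaction with no matching event keeps its own columns and receives `None` for the event column.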
# Define the list of features you will be using
features = ['transactions.amount_max_2h',
'transactions.amount_sum_2h',
'transactions.amount_count_2h',
'transactions.amount_avg_2h',
'transactions.amount_max_12h',
'transactions.amount_sum_12h',
'transactions.amount_count_12h',
'transactions.amount_avg_12h',
'transactions.amount_max_24h',
'transactions.amount_sum_24h',
'transactions.amount_count_24h',
'transactions.amount_avg_24h',
'transactions.es_transportation_count_14d',
'transactions.es_health_count_14d',
'transactions.es_otherservices_count_14d',
'transactions.es_food_count_14d',
'transactions.es_hotelservices_count_14d',
'transactions.es_barsandrestaurants_count_14d',
'transactions.es_tech_count_14d',
'transactions.es_sportsandtoys_count_14d',
'transactions.es_wellnessandbeauty_count_14d',
'transactions.es_hyper_count_14d',
'transactions.es_fashion_count_14d',
'transactions.es_home_count_14d',
'transactions.es_travel_count_14d',
'transactions.es_leisure_count_14d',
'transactions.gender_F',
'transactions.gender_M',
'transactions.step',
'transactions.amount',
'transactions.timestamp_hour',
'transactions.timestamp_day_of_week',
'events.*']
# Import MLRun's Feature Store
import mlrun.feature_store as fstore
# Define the feature vector name for future reference
fv_name = 'transactions-fraud'
# Define the feature vector using our Feature Store (fstore)
transactions_fv = fstore.FeatureVector(fv_name,
features,
label_feature="labels.label",
description='Predicting a fraudulent transaction')
# Save the feature vector in the Feature Store
transactions_fv.save()
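To make the reference syntax concrete, here is a minimal sketch of how entries such as `transactions.amount` and `events.*` could be expanded into individual features. It is not MLRun code: the function `expand_features` and the `catalogue` dictionary are hypothetical stand-ins for the feature sets stored in the Feature Store, and the catalogue lists only a few of the columns that appear in this tutorial.

```python
# Illustrative only: expanding "<FeatureSet>.<Feature>" and "<FeatureSet>.*".
# The catalogue is a hypothetical stand-in for feature sets kept in the Feature Store.

def expand_features(references, catalogue):
    """Resolve feature references into (feature_set, feature) pairs."""
    resolved = []
    for ref in references:
        set_name, _, feature = ref.partition(".")
        available = catalogue[set_name]
        if feature == "*":
            # Wildcard: take every feature the feature set exposes.
            resolved.extend((set_name, name) for name in available)
        elif feature in available:
            resolved.append((set_name, feature))
        else:
            raise KeyError(f"{feature!r} is not in feature set {set_name!r}")
    return resolved

catalogue = {
    "transactions": ["amount", "step", "gender_F", "gender_M"],
    "events": ["event_details_change", "event_login", "event_password_change"],
}

resolved = expand_features(["transactions.amount", "events.*"], catalogue)
for feature_set, feature in resolved:
    print(feature_set, feature)
```

The wildcard entry contributes all three event columns, which is why the preview of the vector shows `event_details_change`, `event_login`, and `event_password_change` even though none of them is listed by name.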
Step 2 - Preview the feature vector data#
Obtain the values of the features in the feature vector, to ensure the data appears as expected.
# Import the Parquet Target so you can directly save the dataset as a file
from mlrun.datastore.targets import ParquetTarget
# Get offline feature vector as dataframe and save the dataset to parquet
train_dataset = fstore.get_offline_features(fv_name, target=ParquetTarget())
> 2021-09-19 17:59:28,415 [info] wrote target: {'name': 'parquet', 'kind': 'parquet', 'path': 'v3io:///projects/fraud-demo-admin/FeatureStore/transactions-fraud/parquet/vectors/transactions-fraud-latest.parquet', 'status': 'ready', 'updated': '2021-09-19T17:59:28.415727+00:00', 'size': 1915182}
# Preview the dataset
train_dataset.to_dataframe().tail(5)
| | amount_max_2h | amount_sum_2h | amount_count_2h | amount_avg_2h | amount_max_12h | amount_sum_12h | amount_count_12h | amount_avg_12h | amount_max_24h | amount_sum_24h | ... | gender_F | gender_M | step | amount | timestamp_hour | timestamp_day_of_week | event_details_change | event_login | event_password_change | label |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
49995 | 2.95 | 2.95 | 1.0 | 2.950 | 2.95 | 2.95 | 1.0 | 2.950 | 2.95 | 2.95 | ... | 1 | 0 | 41 | 2.95 | 17 | 6 | 0.0 | 0.0 | 1.0 | 0 |
49996 | 37.40 | 37.40 | 1.0 | 37.400 | 37.40 | 37.40 | 1.0 | 37.400 | 37.40 | 37.40 | ... | 0 | 1 | 40 | 37.40 | 17 | 6 | 0.0 | 0.0 | 1.0 | 0 |
49997 | 7.75 | 7.75 | 1.0 | 7.750 | 7.75 | 12.99 | 2.0 | 6.495 | 61.23 | 112.76 | ... | 1 | 0 | 91 | 7.75 | 17 | 6 | 1.0 | 0.0 | 0.0 | 0 |
49998 | 28.89 | 28.89 | 1.0 | 28.890 | 38.35 | 107.76 | 4.0 | 26.940 | 52.97 | 249.41 | ... | 0 | 1 | 56 | 28.89 | 17 | 6 | 1.0 | 0.0 | 0.0 | 0 |
49999 | 78.18 | 105.43 | 2.0 | 52.715 | 78.18 | 153.78 | 3.0 | 51.260 | 78.18 | 220.19 | ... | 1 | 0 | 76 | 78.18 | 17 | 6 | 1.0 | 0.0 | 0.0 | 0 |
5 rows × 36 columns
Step 3 - Train models and choose highest accuracy#
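With MLRun, you can easily train different models and compare the results. The code below trains three models, each with a different algorithm (random forest, gradient boosting, and AdaBoost), and chooses the one with the highest accuracy. Note that the model named transaction_fraud_xgboost uses scikit-learn’s GradientBoostingClassifier, not the XGBoost library.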
# Import the Sklearn classifier function from the function hub
classifier_fn = mlrun.import_function('hub://sklearn-classifier')
# Prepare the parameters list for the training function
# Use 3 different models
training_params = {"model_name": ['transaction_fraud_rf',
'transaction_fraud_xgboost',
'transaction_fraud_adaboost'],
"model_pkg_class": ['sklearn.ensemble.RandomForestClassifier',
'sklearn.ensemble.GradientBoostingClassifier',
'sklearn.ensemble.AdaBoostClassifier']}
# Define the training task, including the feature vector, label and hyperparams definitions
train_task = mlrun.new_task('training',
inputs={'dataset': transactions_fv.uri},
params={'label_column': 'label'}
)
train_task.with_hyper_params(training_params, strategy='list', selector='max.accuracy')
# Specify the cluster image
classifier_fn.spec.image = 'mlrun/mlrun'
# Run training
classifier_fn.run(train_task, local=False)
> 2021-09-19 17:59:28,799 [info] starting run training uid=9349c60dd9f24a33b536c59e89978e7b DB=http://mlrun-api:8080
> 2021-09-19 17:59:29,042 [info] Job is running in the background, pod: training-2jntc
> 2021-09-19 17:59:47,926 [info] best iteration=1, used criteria max.accuracy
> 2021-09-19 17:59:48,990 [info] run executed, status=completed
final state: completed
project | uid | iter | start | state | name | labels | inputs | parameters | results | artifacts |
---|---|---|---|---|---|---|---|---|---|---|
fraud-demo-admin | 9349c60dd9f24a33b536c59e89978e7b | 0 | Sep 19 17:59:32 | completed | training | v3io_user=admin kind=job owner=admin | dataset | label_column=label | best_iteration=1 accuracy=0.9901828681424446 test-error=0.009817131857555342 rocauc=0.9556168449721417 brier_score=0.008480115495668912 f1-score=0.6666666666666667 precision_score=0.7846153846153846 recall_score=0.5795454545454546 | test_set probability-calibration confusion-matrix feature-importances precision-recall-binary roc-binary model iteration_results |
> 2021-09-19 17:59:51,574 [info] run executed, status=completed
<mlrun.model.RunObject at 0x7f464baf0c50>
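The `list` strategy pairs the parameter lists by position, so the task above runs three iterations (one per model name and class) rather than every combination of the two lists, and the `max.accuracy` selector keeps the iteration with the highest `accuracy` result. The sketch below mimics that behaviour in plain Python. It is an approximation of the idea, not MLRun's code: the helper names are made up, the accuracy values are hypothetical placeholders, and iterations are assumed to be numbered from 1 as in the log.

```python
# Illustrative only: how a "list" strategy and a "max.accuracy" selector behave.
# Helper names are hypothetical; the accuracy values below are placeholders.

def list_iterations(hyper_params):
    """Pair the i-th value of every parameter list into one iteration."""
    lengths = {len(values) for values in hyper_params.values()}
    if len(lengths) != 1:
        raise ValueError("all parameter lists must have the same length")
    names = list(hyper_params)
    return [dict(zip(names, combo)) for combo in zip(*hyper_params.values())]

def select_best(results, selector):
    """Apply a '<max|min>.<metric>' selector and return the 1-based best iteration."""
    mode, _, metric = selector.partition(".")
    pick = max if mode == "max" else min
    best_index = pick(range(len(results)), key=lambda i: results[i][metric])
    return best_index + 1

training_params = {
    "model_name": ["transaction_fraud_rf",
                   "transaction_fraud_xgboost",
                   "transaction_fraud_adaboost"],
    "model_pkg_class": ["sklearn.ensemble.RandomForestClassifier",
                        "sklearn.ensemble.GradientBoostingClassifier",
                        "sklearn.ensemble.AdaBoostClassifier"],
}

iterations = list_iterations(training_params)
# Hypothetical per-iteration results; a real run reports these per child iteration.
placeholder_results = [{"accuracy": 0.99}, {"accuracy": 0.98}, {"accuracy": 0.97}]
best = select_best(placeholder_results, "max.accuracy")
print(best, iterations[best - 1]["model_name"])
```

In the run above the log reports `best iteration=1`, which corresponds to the first entry of each list, the random forest model.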
Step 4 - Perform feature selection#
As part of the data science process, try to reduce the number of features in the training dataset: dropping uninformative features saves computation time.
Use the ready-made feature selection function from the hub, hub://feature_selection, to select the best features to keep. The function is run on a sample of the dataset (25% here, set by sample_ratio).
feature_selection_fn = mlrun.import_function('hub://feature_selection')
feature_selection_run = feature_selection_fn.run(
params={'sample_ratio':0.25,
'output_vector_name':fv_name + "-short",
'ignore_type_errors': True},
inputs={'df_artifact': transactions_fv.uri},
name='feature_extraction',
handler='feature_selection',
local=False)
> 2021-09-19 17:59:51,768 [info] starting run feature_extraction uid=3a50bd0e4175459fb53873d8f78a440a DB=http://mlrun-api:8080
> 2021-09-19 17:59:52,004 [info] Job is running in the background, pod: feature-extraction-lf46d
> 2021-09-19 17:59:59,099 [info] Couldn't calculate chi2 because of: Input X must be non-negative.
> 2021-09-19 18:00:04,008 [info] votes needed to be selected: 3
> 2021-09-19 18:00:05,329 [info] wrote target: {'name': 'parquet', 'kind': 'parquet', 'path': 'v3io:///projects/fraud-demo-admin/FeatureStore/transactions-fraud-short/parquet/vectors/transactions-fraud-short-latest.parquet', 'status': 'ready', 'updated': '2021-09-19T18:00:05.329695+00:00', 'size': 668722}
> 2021-09-19 18:00:05,677 [info] run executed, status=completed
Pass k=5 as keyword args. From version 0.25 passing these as positional arguments will result in an error
Liblinear failed to converge, increase the number of iterations.
final state: completed
project | uid | iter | start | state | name | labels | inputs | parameters | results | artifacts |
---|---|---|---|---|---|---|---|---|---|---|
fraud-demo-admin | 3a50bd0e4175459fb53873d8f78a440a | 0 | Sep 19 17:59:56 | completed | feature_extraction | v3io_user=admin kind=job owner=admin host=feature-extraction-lf46d | df_artifact | sample_ratio=0.25 output_vector_name=transactions-fraud-short ignore_type_errors=True | top_features_vector=store://feature-vectors/fraud-demo-admin/transactions-fraud-short | f_classif mutual_info_classif f_regression LinearSVC LogisticRegression ExtraTreesClassifier feature_scores max_scaled_scores_feature_scores selected_features_count selected_features |
> 2021-09-19 18:00:07,537 [info] run executed, status=completed
mlrun.get_dataitem(feature_selection_run.outputs['top_features_vector']).as_df().tail(5)
| | amount_max_2h | amount_sum_2h | amount_count_2h | amount_avg_2h | amount_max_12h | label |
---|---|---|---|---|---|---|
49996 | 37.40 | 37.40 | 1.0 | 37.400000 | 37.40 | 0 |
49997 | 7.75 | 7.75 | 1.0 | 7.750000 | 7.75 | 0 |
49998 | 28.89 | 28.89 | 1.0 | 28.890000 | 38.35 | 0 |
49999 | 78.18 | 105.43 | 2.0 | 52.715000 | 78.18 | 0 |
50000 | 19.37 | 24.61 | 3.0 | 8.203333 | 19.37 | 0 |
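The log line `votes needed to be selected: 3` and the six method artifacts (chi2 was skipped) suggest a voting scheme: each selection method nominates features, and a feature is kept when enough methods agree. The following is a minimal sketch of that general idea only. It does not reproduce the hub function's internals, and the per-method nominations are hypothetical, not the results of the run above.

```python
# Illustrative only: majority voting across several feature-selection methods.
# The nominations below are hypothetical, not the output of the run above.
from collections import Counter

def vote_features(picks_by_method, votes_needed):
    """Keep the features nominated by at least `votes_needed` methods."""
    # set() ensures that a method votes at most once per feature.
    counts = Counter(
        feature for picks in picks_by_method.values() for feature in set(picks)
    )
    return sorted(f for f, votes in counts.items() if votes >= votes_needed)

picks_by_method = {
    "f_classif": ["amount_max_2h", "amount_sum_2h", "amount_avg_2h"],
    "mutual_info_classif": ["amount_max_2h", "amount_count_2h", "step"],
    "f_regression": ["amount_max_2h", "amount_sum_2h", "amount_avg_2h"],
    "LinearSVC": ["amount_sum_2h", "gender_F"],
    "LogisticRegression": ["amount_max_2h", "gender_M"],
    "ExtraTreesClassifier": ["amount_avg_2h", "amount_max_12h"],
}

selected = vote_features(picks_by_method, votes_needed=3)
print(selected)
```

With these made-up nominations, only features backed by three or more of the six methods survive; everything nominated once is dropped.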
Step 5 - Train the models with top features#
Following the feature selection, you train new models using the selected features. Accuracy remains essentially unchanged (about 0.990 in both runs), so you get a model that needs fewer features, which makes it cheaper to compute and less error-prone.
# Defining our training task, including our feature vector, label and hyperparams definitions
ensemble_train_task = mlrun.new_task('training',
inputs={'dataset': feature_selection_run.outputs['top_features_vector']},
params={'label_column': 'label'}
)
ensemble_train_task.with_hyper_params(training_params, strategy='list', selector='max.accuracy')
classifier_fn.run(ensemble_train_task)
> 2021-09-19 18:00:07,661 [info] starting run training uid=a6d9ae72cfd3462cace205f8b363d214 DB=http://mlrun-api:8080
> 2021-09-19 18:00:08,077 [info] Job is running in the background, pod: training-v2bt4
> 2021-09-19 18:00:20,781 [info] best iteration=3, used criteria max.accuracy
> 2021-09-19 18:00:21,696 [info] run executed, status=completed
final state: completed
project | uid | iter | start | state | name | labels | inputs | parameters | results | artifacts |
---|---|---|---|---|---|---|---|---|---|---|
fraud-demo-admin | a6d9ae72cfd3462cace205f8b363d214 | 0 | Sep 19 18:00:11 | completed | training | v3io_user=admin kind=job owner=admin | dataset | label_column=label | best_iteration=3 accuracy=0.9899143672692674 test-error=0.010085632730732635 rocauc=0.9655151930226706 brier_score=0.19856508884931476 f1-score=0.6490066225165563 precision_score=0.7205882352941176 recall_score=0.5903614457831325 | test_set probability-calibration confusion-matrix feature-importances precision-recall-binary roc-binary model iteration_results |
> 2021-09-19 18:00:27,561 [info] run executed, status=completed
<mlrun.model.RunObject at 0x7f464baed490>
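To compare the two training runs side by side, the reported metrics can be copied into plain dictionaries and differenced. The short script below does that with the values printed in the two run summaries of this tutorial, and also recomputes the F1 score from precision and recall as a consistency check. The variable and function names are only for illustration.

```python
# Metrics copied from the two run summaries in this tutorial:
# "full" is the run on all features, "short" is the run on the selected features.
full = {
    "accuracy": 0.9901828681424446,
    "rocauc": 0.9556168449721417,
    "brier_score": 0.008480115495668912,
    "f1-score": 0.6666666666666667,
    "precision_score": 0.7846153846153846,
    "recall_score": 0.5795454545454546,
}
short = {
    "accuracy": 0.9899143672692674,
    "rocauc": 0.9655151930226706,
    "brier_score": 0.19856508884931476,
    "f1-score": 0.6490066225165563,
    "precision_score": 0.7205882352941176,
    "recall_score": 0.5903614457831325,
}

def f1_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Consistency check: the reported F1 should match the recomputed value.
for name, metrics in (("full", full), ("short", short)):
    recomputed = f1_from(metrics["precision_score"], metrics["recall_score"])
    print(name, round(recomputed, 4), round(metrics["f1-score"], 4))

# Difference per metric (selected features minus all features).
deltas = {metric: short[metric] - full[metric] for metric in full}
for metric, delta in deltas.items():
    print(f"{metric:16s} {delta:+.4f}")
```

Accuracy moves by less than 0.001 and ROC AUC improves slightly, while the Brier score (lower is better) rises from about 0.008 to about 0.199, so the predicted probabilities of the second best model are worth inspecting before relying on them.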
Done!#
You’ve completed Part 2 of the model training with the feature store. Proceed to Part 3 to learn how to deploy and monitor the model.