Part 2: Training

In this part we show how to use MLRun’s Feature Store to define a Feature Vector and create the dataset needed for our training process.
By the end of this tutorial you’ll learn how to:

  • Combine multiple data sources into a single Feature Vector

  • Create a training dataset

  • Create a model using an MLRun Hub function

project_name = 'fraud-demo'
import mlrun

# Initialize the MLRun project object
project = mlrun.get_or_create_project(project_name, context="./", user_project=True)
> 2021-09-19 17:59:27,165 [info] loaded project fraud-demo from MLRun DB

Step 1 - Create a Feature Vector

In this section we create our Feature Vector.
The Feature Vector has a name, so that we can reference it later by its URI or from our serving function, and a list of features from the available FeatureSets. To add a single feature, add <FeatureSet>.<Feature> to the list; to add all of a FeatureSet’s available features, add <FeatureSet>.*.
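
To make the reference syntax concrete, here is a minimal sketch of how such a list could be expanded into individual features. It is not MLRun's implementation: the CATALOG dictionary and the expand_feature_refs helper are hypothetical names introduced for illustration, and the catalog only holds a few of the column names that appear in this tutorial.

```python
# Hypothetical catalog of feature sets and their features (illustration only).
CATALOG = {
    "transactions": ["amount", "step", "gender_F", "gender_M"],
    "events": ["event_details_change", "event_login", "event_password_change"],
}


def expand_feature_refs(refs, catalog):
    """Expand '<FeatureSet>.<Feature>' and '<FeatureSet>.*' references
    into a flat list of (feature_set, feature) pairs, preserving order."""
    expanded = []
    for ref in refs:
        set_name, _, feature = ref.partition(".")
        if set_name not in catalog:
            raise KeyError(f"unknown feature set: {set_name!r}")
        if feature == "*":
            # The wildcard pulls in every feature of the set.
            expanded.extend((set_name, name) for name in catalog[set_name])
        elif feature in catalog[set_name]:
            expanded.append((set_name, feature))
        else:
            raise KeyError(f"unknown feature: {ref!r}")
    return expanded


resolved = expand_feature_refs(["transactions.amount", "events.*"], CATALOG)
print(resolved)
```

The single reference contributes one feature, while the wildcard contributes every feature the hypothetical events set defines.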

By default, the first FeatureSet in the feature list acts as the spine, meaning that all the other features are joined to it.
For example, in this instance the transactions FeatureSet is the spine, so each transactions row produces one row in the resulting Feature Vector.
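
The spine behaviour can be pictured as a left join. The sketch below is a simplified illustration in plain Python, not the Feature Store's join logic (which may also take timestamps into account). The join_to_spine helper, the "source" key column, and all row values are assumptions made up for the example.

```python
# Hypothetical rows keyed by an entity id; the values are made up.
spine_rows = [
    {"source": "C1", "amount": 2.95},
    {"source": "C2", "amount": 37.40},
    {"source": "C3", "amount": 7.75},
]
event_rows = {
    "C1": {"event_login": 1.0},
    "C3": {"event_login": 0.0},
}


def join_to_spine(spine, other, key):
    """Left join: emit exactly one output row per spine row.
    Columns with no match on the other side are filled with None."""
    other_columns = sorted({col for row in other.values() for col in row})
    joined = []
    for row in spine:
        match = other.get(row[key], {})
        merged = dict(row)
        for col in other_columns:
            merged[col] = match.get(col)
        joined.append(merged)
    return joined


result = join_to_spine(spine_rows, event_rows, key="source")
for row in result:
    print(row)
```

The output has as many rows as the spine, regardless of how many rows the joined side contains.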

# Define the list of features we will be using
features = ['transactions.amount_max_2h', 
            'transactions.amount_sum_2h', 
            'transactions.amount_count_2h',
            'transactions.amount_avg_2h', 
            'transactions.amount_max_12h', 
            'transactions.amount_sum_12h',
            'transactions.amount_count_12h', 
            'transactions.amount_avg_12h', 
            'transactions.amount_max_24h',
            'transactions.amount_sum_24h', 
            'transactions.amount_count_24h', 
            'transactions.amount_avg_24h',
            'transactions.es_transportation_count_14d', 
            'transactions.es_health_count_14d',
            'transactions.es_otherservices_count_14d', 
            'transactions.es_food_count_14d',
            'transactions.es_hotelservices_count_14d', 
            'transactions.es_barsandrestaurants_count_14d',
            'transactions.es_tech_count_14d', 
            'transactions.es_sportsandtoys_count_14d',
            'transactions.es_wellnessandbeauty_count_14d', 
            'transactions.es_hyper_count_14d',
            'transactions.es_fashion_count_14d', 
            'transactions.es_home_count_14d', 
            'transactions.es_travel_count_14d', 
            'transactions.es_leisure_count_14d',
            'transactions.gender_F',
            'transactions.gender_M',
            'transactions.step', 
            'transactions.amount', 
            'transactions.timestamp_hour',
            'transactions.timestamp_day_of_week',
            'events.*']
# Import MLRun's Feature Store
import mlrun.feature_store as fstore

# Define the feature vector name for future reference
fv_name = 'transactions-fraud'

# Define the feature vector using our Feature Store (fstore)
transactions_fv = fstore.FeatureVector(fv_name, 
                          features, 
                          label_feature="labels.label",
                          description='Predicting a fraudulent transaction')

# Save the feature vector in the Feature Store
transactions_fv.save()
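
Once saved, the vector is referenced by URI. The feature selection output later in this tutorial shows one such URI, store://feature-vectors/fraud-demo-admin/transactions-fraud-short. As a rough sketch, assuming the layout store://<kind>/<project>/<name> inferred from that single example, the parts can be separated with the standard library. The parse_store_uri helper is hypothetical and does not handle tags, versions, or other variants that real URIs may carry.

```python
from urllib.parse import urlparse


def parse_store_uri(uri):
    """Split a 'store://<kind>/<project>/<name>' URI into its parts.
    The layout is an assumption based on one URI from the run output."""
    parsed = urlparse(uri)
    if parsed.scheme != "store":
        raise ValueError(f"not a store URI: {uri!r}")
    project, _, name = parsed.path.lstrip("/").partition("/")
    return {"kind": parsed.netloc, "project": project, "name": name}


parts = parse_store_uri(
    "store://feature-vectors/fraud-demo-admin/transactions-fraud-short"
)
print(parts)
```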

Step 2 - Preview the Feature Vector Data

Obtain the values of the features in the feature vector to verify that the data appears as expected.

# Import the Parquet Target so we can directly save our dataset as a file
from mlrun.datastore.targets import ParquetTarget

# Get offline feature vector as dataframe and save the dataset to parquet
train_dataset = fstore.get_offline_features(fv_name, target=ParquetTarget())
> 2021-09-19 17:59:28,415 [info] wrote target: {'name': 'parquet', 'kind': 'parquet', 'path': 'v3io:///projects/fraud-demo-admin/FeatureStore/transactions-fraud/parquet/vectors/transactions-fraud-latest.parquet', 'status': 'ready', 'updated': '2021-09-19T17:59:28.415727+00:00', 'size': 1915182}
# Preview our dataset
train_dataset.to_dataframe().tail(5)
amount_max_2h amount_sum_2h amount_count_2h amount_avg_2h amount_max_12h amount_sum_12h amount_count_12h amount_avg_12h amount_max_24h amount_sum_24h ... gender_F gender_M step amount timestamp_hour timestamp_day_of_week event_details_change event_login event_password_change label
49995 2.95 2.95 1.0 2.950 2.95 2.95 1.0 2.950 2.95 2.95 ... 1 0 41 2.95 17 6 0.0 0.0 1.0 0
49996 37.40 37.40 1.0 37.400 37.40 37.40 1.0 37.400 37.40 37.40 ... 0 1 40 37.40 17 6 0.0 0.0 1.0 0
49997 7.75 7.75 1.0 7.750 7.75 12.99 2.0 6.495 61.23 112.76 ... 1 0 91 7.75 17 6 1.0 0.0 0.0 0
49998 28.89 28.89 1.0 28.890 38.35 107.76 4.0 26.940 52.97 249.41 ... 0 1 56 28.89 17 6 1.0 0.0 0.0 0
49999 78.18 105.43 2.0 52.715 78.18 153.78 3.0 51.260 78.18 220.19 ... 1 0 76 78.18 17 6 1.0 0.0 0.0 0

5 rows × 36 columns
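
One quick way to sanity-check a preview like this is to confirm that related columns agree with each other. The sketch below copies a few (sum, count, avg) triples from the rows shown above and checks that each average equals sum divided by count, within the displayed precision. The avg_is_consistent helper and its tolerance are choices made for this illustration.

```python
import math

# (sum, count, avg) triples copied from the preview above.
preview = [
    (2.95, 1.0, 2.950),     # row 49995, 2h window
    (12.99, 2.0, 6.495),    # row 49997, 12h window
    (107.76, 4.0, 26.940),  # row 49998, 12h window
    (105.43, 2.0, 52.715),  # row 49999, 2h window
    (153.78, 3.0, 51.260),  # row 49999, 12h window
]


def avg_is_consistent(total, count, avg, tol=1e-3):
    """True when avg matches total / count within the display precision."""
    return math.isclose(total / count, avg, abs_tol=tol)


checks = [avg_is_consistent(*row) for row in preview]
print(all(checks))
```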

Step 3 - Train Models and Choose Highest Accuracy

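With MLRun, one can easily train different models and compare the results. In the code below, we train three models, each using a different algorithm (random forest, gradient boosting, and AdaBoost), and choose the model with the highest accuracy. Note that the model named transaction_fraud_xgboost uses scikit-learn's GradientBoostingClassifier, not the XGBoost library.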

# Import the Sklearn classifier function from the functions hub
classifier_fn = mlrun.import_function('hub://sklearn-classifier')
# Prepare the parameters list for the training function
# We will be using 3 different models
training_params = {"model_name": ['transaction_fraud_rf', 
                                  'transaction_fraud_xgboost', 
                                  'transaction_fraud_adaboost'],
              
              "model_pkg_class": ['sklearn.ensemble.RandomForestClassifier',
                                  'sklearn.ensemble.GradientBoostingClassifier',
                                  'sklearn.ensemble.AdaBoostClassifier']}

# Define the training task, including our feature vector, label and hyperparams definitions
train_task = mlrun.new_task('training', 
                      inputs={'dataset': transactions_fv.uri},
                      params={'label_column': 'label'}
                     )

train_task.with_hyper_params(training_params, strategy='list', selector='max.accuracy')

# Specify our cluster image
classifier_fn.spec.image = 'mlrun/mlrun'

# Run training
classifier_fn.run(train_task, local=False)
> 2021-09-19 17:59:28,799 [info] starting run training uid=9349c60dd9f24a33b536c59e89978e7b DB=http://mlrun-api:8080
> 2021-09-19 17:59:29,042 [info] Job is running in the background, pod: training-2jntc
> 2021-09-19 17:59:47,926 [info] best iteration=1, used criteria max.accuracy
> 2021-09-19 17:59:48,990 [info] run executed, status=completed
final state: completed
project:    fraud-demo-admin
iter:       0
start:      Sep 19 17:59:32
state:      completed
name:       training
labels:     v3io_user=admin, kind=job, owner=admin
inputs:     dataset
parameters: label_column=label
results:    best_iteration=1
            accuracy=0.9901828681424446
            test-error=0.009817131857555342
            rocauc=0.9556168449721417
            brier_score=0.008480115495668912
            f1-score=0.6666666666666667
            precision_score=0.7846153846153846
            recall_score=0.5795454545454546
artifacts:  test_set, probability-calibration, confusion-matrix, feature-importances, precision-recall-binary, roc-binary, model, iteration_results

> to track results use the .show() or .logs() methods or click here to open in UI
> 2021-09-19 17:59:51,574 [info] run executed, status=completed
<mlrun.model.RunObject at 0x7f464baf0c50>
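
The run above passes strategy='list' and selector='max.accuracy' to with_hyper_params. As we understand it, the list strategy pairs the i-th value of every parameter list into one iteration instead of trying every combination, and the selector picks the iteration with the best value of the named metric. The sketch below illustrates that idea in plain Python. list_strategy and select_best are hypothetical helpers, not MLRun functions, and the accuracy values are placeholders rather than results from the run.

```python
def list_strategy(hyper_params):
    """Pair the i-th value of every parameter list into one iteration
    (no cross-product)."""
    lengths = {len(values) for values in hyper_params.values()}
    if len(lengths) != 1:
        raise ValueError("all parameter lists must have the same length")
    names = list(hyper_params)
    return [dict(zip(names, combo)) for combo in zip(*hyper_params.values())]


def select_best(results, selector):
    """Apply a '<max|min>.<metric>' selector; return the 1-based iteration."""
    mode, _, metric = selector.partition(".")
    pick = max if mode == "max" else min
    best_index = pick(range(len(results)), key=lambda i: results[i][metric])
    return best_index + 1


training_params = {
    "model_name": ["transaction_fraud_rf",
                   "transaction_fraud_xgboost",
                   "transaction_fraud_adaboost"],
    "model_pkg_class": ["sklearn.ensemble.RandomForestClassifier",
                        "sklearn.ensemble.GradientBoostingClassifier",
                        "sklearn.ensemble.AdaBoostClassifier"],
}

iterations = list_strategy(training_params)
for number, params in enumerate(iterations, start=1):
    print(number, params)

# Placeholder accuracies, one per iteration (not taken from the run above).
placeholder_results = [{"accuracy": 0.99}, {"accuracy": 0.97}, {"accuracy": 0.98}]
best = select_best(placeholder_results, "max.accuracy")
print("best iteration:", best)
```

With three values per list, this yields three iterations rather than nine, which matches the three models we intended to train.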

Step 4 - Perform Feature Selection

As part of our data science process we try to reduce the size of the training dataset, removing bad or unhelpful features and saving computation time.

We use the ready-made feature selection function from the hub, hub://feature_selection, and run it on a sample of our dataset to select the best features to keep.

feature_selection_fn = mlrun.import_function('hub://feature_selection')

feature_selection_run = feature_selection_fn.run(
    params={'sample_ratio': 0.25,
            'output_vector_name': fv_name + "-short",
            'ignore_type_errors': True},
    inputs={'df_artifact': transactions_fv.uri},
    name='feature_extraction',
    handler='feature_selection',
    local=False)
> 2021-09-19 17:59:51,768 [info] starting run feature_extraction uid=3a50bd0e4175459fb53873d8f78a440a DB=http://mlrun-api:8080
> 2021-09-19 17:59:52,004 [info] Job is running in the background, pod: feature-extraction-lf46d
> 2021-09-19 17:59:59,099 [info] Couldn't calculate chi2 because of: Input X must be non-negative.
> 2021-09-19 18:00:04,008 [info] votes needed to be selected: 3
> 2021-09-19 18:00:05,329 [info] wrote target: {'name': 'parquet', 'kind': 'parquet', 'path': 'v3io:///projects/fraud-demo-admin/FeatureStore/transactions-fraud-short/parquet/vectors/transactions-fraud-short-latest.parquet', 'status': 'ready', 'updated': '2021-09-19T18:00:05.329695+00:00', 'size': 668722}
> 2021-09-19 18:00:05,677 [info] run executed, status=completed
Pass k=5 as keyword args. From version 0.25 passing these as positional arguments will result in an error
Liblinear failed to converge, increase the number of iterations.
final state: completed
project:    fraud-demo-admin
iter:       0
start:      Sep 19 17:59:56
state:      completed
name:       feature_extraction
labels:     v3io_user=admin, kind=job, owner=admin, host=feature-extraction-lf46d
inputs:     df_artifact
parameters: sample_ratio=0.25
            output_vector_name=transactions-fraud-short
            ignore_type_errors=True
results:    top_features_vector=store://feature-vectors/fraud-demo-admin/transactions-fraud-short
artifacts:  f_classif, mutual_info_classif, f_regression, LinearSVC, LogisticRegression, ExtraTreesClassifier, feature_scores, max_scaled_scores_feature_scores, selected_features_count, selected_features

> to track results use the .show() or .logs() methods or click here to open in UI
> 2021-09-19 18:00:07,537 [info] run executed, status=completed
mlrun.get_dataitem(feature_selection_run.outputs['top_features_vector']).as_df().tail(5)
       amount_max_2h  amount_sum_2h  amount_count_2h  amount_avg_2h  amount_max_12h  label
49996          37.40          37.40              1.0      37.400000           37.40      0
49997           7.75           7.75              1.0       7.750000            7.75      0
49998          28.89          28.89              1.0      28.890000           38.35      0
49999          78.18         105.43              2.0      52.715000           78.18      0
50000          19.37          24.61              3.0       8.203333           19.37      0
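
The log line "votes needed to be selected: 3" suggests that the selection works by voting: several scoring methods each nominate features, and a feature is kept once enough methods agree. The sketch below illustrates that idea only. It is not the hub function's code, select_by_votes is a hypothetical helper, and the rankings are invented for the example rather than taken from the run.

```python
from collections import Counter


def select_by_votes(rankings, top_k, min_votes):
    """Each method votes for its top_k features; keep the features
    that collect at least min_votes votes."""
    votes = Counter()
    for ranked_features in rankings.values():
        votes.update(ranked_features[:top_k])
    return sorted(name for name, count in votes.items() if count >= min_votes)


# Hypothetical rankings (best first); not the scores from the run above.
rankings = {
    "f_classif": ["amount_max_2h", "amount_sum_2h", "step"],
    "mutual_info_classif": ["amount_max_2h", "amount_avg_2h", "amount_sum_2h"],
    "LinearSVC": ["amount_sum_2h", "amount_max_2h", "gender_F"],
    "ExtraTreesClassifier": ["amount_max_2h", "amount_sum_2h", "amount_avg_2h"],
}

selected = select_by_votes(rankings, top_k=2, min_votes=3)
print(selected)
```

Requiring agreement between methods makes the selection less sensitive to the quirks of any single scoring method.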

Step 5 - Train our models with top features

Following the feature selection, we train new models using the selected features. Accuracy stays close to that of the full feature vector (0.9899 versus 0.9902) and ROC AUC remains high (0.9655), meaning we get a model that requires fewer features to be accurate and is thus less error-prone.

# Defining our training task, including our feature vector, label and hyperparams definitions
ensemble_train_task = mlrun.new_task('training', 
                      inputs={'dataset': feature_selection_run.outputs['top_features_vector']},
                      params={'label_column': 'label'}
                     )
ensemble_train_task.with_hyper_params(training_params, strategy='list', selector='max.accuracy')

classifier_fn.run(ensemble_train_task)
> 2021-09-19 18:00:07,661 [info] starting run training uid=a6d9ae72cfd3462cace205f8b363d214 DB=http://mlrun-api:8080
> 2021-09-19 18:00:08,077 [info] Job is running in the background, pod: training-v2bt4
> 2021-09-19 18:00:20,781 [info] best iteration=3, used criteria max.accuracy
> 2021-09-19 18:00:21,696 [info] run executed, status=completed
final state: completed
project:    fraud-demo-admin
iter:       0
start:      Sep 19 18:00:11
state:      completed
name:       training
labels:     v3io_user=admin, kind=job, owner=admin
inputs:     dataset
parameters: label_column=label
results:    best_iteration=3
            accuracy=0.9899143672692674
            test-error=0.010085632730732635
            rocauc=0.9655151930226706
            brier_score=0.19856508884931476
            f1-score=0.6490066225165563
            precision_score=0.7205882352941176
            recall_score=0.5903614457831325
artifacts:  test_set, probability-calibration, confusion-matrix, feature-importances, precision-recall-binary, roc-binary, model, iteration_results

> to track results use the .show() or .logs() methods or click here to open in UI
> 2021-09-19 18:00:27,561 [info] run executed, status=completed
<mlrun.model.RunObject at 0x7f464baed490>
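
To compare the two training runs side by side, the reported metrics can be copied into a short script. The sketch below uses only values printed in the two run summaries of this tutorial. It recomputes F1 as the harmonic mean of precision and recall to confirm that the reported scores are consistent, then prints the difference in accuracy. The dictionary names and the f1_from helper are introduced here for illustration.

```python
# Metrics copied from the two run summaries in this tutorial.
full_vector = {
    "accuracy": 0.9901828681424446,
    "precision": 0.7846153846153846,
    "recall": 0.5795454545454546,
    "f1": 0.6666666666666667,
}
top_features = {
    "accuracy": 0.9899143672692674,
    "precision": 0.7205882352941176,
    "recall": 0.5903614457831325,
    "f1": 0.6490066225165563,
}


def f1_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


for name, run in (("full vector", full_vector), ("top features", top_features)):
    recomputed = f1_from(run["precision"], run["recall"])
    print(name, round(recomputed, 6), round(run["f1"], 6))

accuracy_drop = full_vector["accuracy"] - top_features["accuracy"]
print(f"accuracy drop: {accuracy_drop:.6f}")
```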

Done!

You’ve completed Part 2: model training with the feature store. Proceed to Part 3 to learn how to deploy and monitor the model.