Part 1: Data Ingestion#

This demo showcases financial fraud prevention. It uses the MLRun feature store to define complex features that help identify fraud.

Fraud prevention is a special challenge since it requires processing raw transaction and events in real-time and being able to quickly respond and block transactions before they occur.

To address this, you’ll create a development pipeline and a production pipeline. Both pipelines share the same feature engineering and model code, but serve data very differently. Furthermore, MLRun automates the data and model monitoring process, drift identification, and trigger retraining in a CI/CD pipeline. This process is described in the diagram below:

Feature store demo diagram - fraud prevention

The raw data is described as follows:

TRANSACTIONS		║	USER EVENTS
age	age group value 0-6. Some values are marked as U for unknown	║	source	The party/entity related to the event
gender	A character to define the age	║	event	event, such as login or password change
zipcodeOri	ZIP code of the person originating the transaction	║	timestamp	The date and time of the event
zipMerchant	ZIP code of the merchant receiving the transaction	║
category	category of the transaction (e.g., transportation, food, etc.)	║
amount	the total amount of the transaction	║
fraud	whether the transaction is fraudulent	║
timestamp	the date and time in which the transaction took place	║
source	the ID of the party/entity performing the transaction	║
target	the ID of the party/entity receiving the transaction	║
device	the device ID used to perform the transaction	║

This notebook introduces how to Ingest different data sources to the Feature Store.

The following FeatureSets are created:

Transactions: Monetary transactions between a source and a target.
Events: Account events such as account login or a password change.
Label: Fraud label for the data.

By the end of this tutorial you’ll know how to:

Create an ingestion pipeline for each data source.
Define preprocessing, aggregation, and validation of the pipeline.
Run the pipeline locally within the notebook.
Launch a real-time function to ingest live data.
Schedule a cron to run the task when needed.

project_name = 'fraud-demo'

import mlrun

# Initialize the MLRun project object
project = mlrun.get_or_create_project(project_name, context="./", user_project=True)

> 2022-03-16 05:45:07,703 [info] loaded project fraud-demo from MLRun DB

Step 1 - Fetch, process and ingest the datasets#

1.1 - Transactions#

Transactions#

# Helper functions to adjust the timestamps of our data
# while keeping the order of the selected events and
# the relative distance from one event to the other

def date_adjustment(sample, data_max, new_max, old_data_period, new_data_period):
    '''
        Adjust a specific sample's date according to the original and new time periods
    '''
    sample_dates_scale = ((data_max - sample) / old_data_period)
    sample_delta = new_data_period * sample_dates_scale
    new_sample_ts = new_max - sample_delta
    return new_sample_ts

def adjust_data_timespan(dataframe, timestamp_col='timestamp', new_period='2d', new_max_date_str='now'):
    '''
        Adjust the dataframe timestamps to the new time period
    '''
    # Calculate old time period
    data_min = dataframe.timestamp.min()
    data_max = dataframe.timestamp.max()
    old_data_period = data_max-data_min
    
    # Set new time period
    new_time_period = pd.Timedelta(new_period)
    new_max = pd.Timestamp(new_max_date_str)
    new_min = new_max-new_time_period
    new_data_period = new_max-new_min
    
    # Apply the timestamp change
    df = dataframe.copy()
    df[timestamp_col] = df[timestamp_col].apply(lambda x: date_adjustment(x, data_max, new_max, old_data_period, new_data_period))
    return df

import pandas as pd

# Fetch the transactions dataset from the server
transactions_data = pd.read_csv('https://s3.wasabisys.com/iguazio/data/fraud-demo-mlrun-fs-docs/data.csv', parse_dates=['timestamp'], nrows=500)

# Adjust the samples timestamp for the past 2 days
transactions_data = adjust_data_timespan(transactions_data, new_period='2d')

# Preview
transactions_data.head(3)

	age	gender	zipcodeOri	zipMerchant	category	amount	timestamp	source	target	device
0	4	M	28007	28007	es_transportation	4.55	2022-03-15 16:13:54.851486383	C1093826151	M348934600	f802e61d76564b7a89a83adcdfa573da
1	2	M	28007	28007	es_transportation	39.68	2022-03-14 09:21:09.710448366	C352968107	M348934600	38ef7fc3eb7442c8ae64579a483f1d2b
2	4	F	28007	28007	es_transportation	26.89	2022-03-15 22:41:20.666966912	C2054744914	M1823072687	7a851d0758894078b5846851ae32d5e3

Transactions - create a feature set and preprocessing pipeline#

Create the feature set (data pipeline) definition for the credit transaction processing that describes the offline/online data transformations and aggregations.
The feature store automatically adds an offline parquet target and an online NoSQL target by using set_targets().

The data pipeline consists of:

Extracting the data components (hour, day of week)
Mapping the age values
One hot encoding for the transaction category and the gender
Aggregating the amount (avg, sum, count, max over 2/12/24 hour time windows)
Aggregating the transactions per category (over 14 days time windows)
Writing the results to offline (Parquet) and online (NoSQL) targets

# Import MLRun's Feature Store
import mlrun.feature_store as fstore
from mlrun.feature_store.steps import OneHotEncoder, MapValues, DateExtractor

# Define the transactions FeatureSet
transaction_set = fstore.FeatureSet("transactions", 
                                 entities=[fstore.Entity("source")], 
                                 timestamp_key='timestamp', 
                                 description="transactions feature set")

# Define and add value mapping
main_categories = ["es_transportation", "es_health", "es_otherservices",
       "es_food", "es_hotelservices", "es_barsandrestaurants",
       "es_tech", "es_sportsandtoys", "es_wellnessandbeauty",
       "es_hyper", "es_fashion", "es_home", "es_contents",
       "es_travel", "es_leisure"]

# One Hot Encode the newly defined mappings
one_hot_encoder_mapping = {'category': main_categories,
                           'gender': list(transactions_data.gender.unique())}

# Define the graph steps
transaction_set.graph\
    .to(DateExtractor(parts = ['hour', 'day_of_week'], timestamp_col = 'timestamp'))\
    .to(MapValues(mapping={'age': {'U': '0'}}, with_original_features=True))\
    .to(OneHotEncoder(mapping=one_hot_encoder_mapping))


# Add aggregations for 2, 12, and 24 hour time windows
transaction_set.add_aggregation(name='amount',
                                column='amount',
                                operations=['avg','sum', 'count','max'],
                                windows=['2h', '12h', '24h'],
                                period='1h')


# Add the category aggregations over a 14 day window
for category in main_categories:
    transaction_set.add_aggregation(name=category,column=f'category_{category}',
                                    operations=['count'], windows=['14d'], period='1d')

# Add default (offline-parquet & online-nosql) targets
transaction_set.set_targets()

# Plot the pipeline so we can see the different steps
transaction_set.plot(rankdir="LR", with_targets=True)

../../_images/05f7d6198ca22a31d8767b4bd995c2cdd00cd7313fcb5967db952c9abd8c6cb5.svg

Transactions - ingestion#

# Ingest the transactions dataset through the defined pipeline
transactions_df = fstore.ingest(transaction_set, transactions_data, 
                 infer_options=fstore.InferOptions.default())

transactions_df.head(3)

persist count = 0
persist count = 100
persist count = 200
persist count = 300
persist count = 400
persist count = 500
persist count = 600
persist count = 700
persist count = 800
persist count = 900
persist count = 1000

	amount_count_2h	amount_count_12h	amount_count_24h	amount_max_2h	amount_max_12h	amount_max_24h	amount_sum_2h	amount_sum_12h	amount_sum_24h	amount_avg_2h	...	category_es_contents	category_es_travel	category_es_leisure	amount	fraud	timestamp	target	device	timestamp_hour	timestamp_day_of_week
source
C1093826151	1.0	1.0	1.0	4.55	4.55	4.55	4.55	4.55	4.55	4.55	...	0	0	0	4.55	0	2022-03-15 16:13:54.851486383	M348934600	f802e61d76564b7a89a83adcdfa573da	16	1
C352968107	1.0	1.0	1.0	39.68	39.68	39.68	39.68	39.68	39.68	39.68	...	0	0	0	39.68	0	2022-03-14 09:21:09.710448366	M348934600	38ef7fc3eb7442c8ae64579a483f1d2b	9	0
C2054744914	1.0	1.0	1.0	26.89	26.89	26.89	26.89	26.89	26.89	26.89	...	0	0	0	26.89	0	2022-03-15 22:41:20.666966912	M1823072687	7a851d0758894078b5846851ae32d5e3	22	1

3 rows × 57 columns

1.2 - User events#

User events - fetching#

# Fetch the user_events dataset from the server
user_events_data = pd.read_csv('https://s3.wasabisys.com/iguazio/data/fraud-demo-mlrun-fs-docs/events.csv', 
                               index_col=0, quotechar="\'", parse_dates=['timestamp'], nrows=500)

# Adjust to the last 2 days to see the latest aggregations in our online feature vectors
user_events_data = adjust_data_timespan(user_events_data, new_period='2d')

# Preview
user_events_data.head(3)

	source	event	timestamp
0	C1974668487	details_change	2022-03-15 15:03:17.518565985
1	C1973547259	login	2022-03-15 18:05:50.652706656
2	C515668508	login	2022-03-15 14:37:49.845093748

User events - create a feature set and preprocessing pipeline#

Define the events feature set. This is a fairly straightforward pipeline in which you only “one hot encode” the event categories and save the data to the default targets.

user_events_set = fstore.FeatureSet("events",
                           entities=[fstore.Entity("source")],
                           timestamp_key='timestamp', 
                           description="user events feature set")

# Define and add value mapping
events_mapping = {'event': list(user_events_data.event.unique())}

# One Hot Encode
user_events_set.graph.to(OneHotEncoder(mapping=events_mapping))

# Add default (offline-parquet & online-nosql) targets
user_events_set.set_targets()

# Plot the pipeline so we can see the different steps
user_events_set.plot(rankdir="LR", with_targets=True)

../../_images/2547dacb3eef282f2b5481cb30e3e26c7b49b53be795a6eec40f091726f2cc31.svg

User events - ingestion#

# Ingestion of the newly created events feature set
events_df = fstore.ingest(user_events_set, user_events_data)
events_df.head(3)

persist count = 0
persist count = 100
persist count = 200
persist count = 300
persist count = 400
persist count = 500

	event_details_change	event_login	event_password_change	timestamp
source
C1974668487	1	0	0	2022-03-15 15:03:17.518565985
C1973547259	0	1	0	2022-03-15 18:05:50.652706656
C515668508	0	1	0	2022-03-15 14:37:49.845093748

Step 2 - Create a labels dataset for model training#

Label set - create a feature set#

This feature set contains the label for the fraud demo, it is ingested directly to the default targets without any changes

def create_labels(df):
    labels = df[['fraud','source','timestamp']].copy()
    labels = labels.rename(columns={"fraud": "label"})
    labels['timestamp'] = labels['timestamp'].astype("datetime64[ms]")
    labels['label'] = labels['label'].astype(int)
    labels.set_index('source', inplace=True)
    return labels

# Define the "labels" feature set
labels_set = fstore.FeatureSet("labels", 
                           entities=[fstore.Entity("source")], 
                           timestamp_key='timestamp',
                           description="training labels",
                           engine="pandas")

labels_set.graph.to(name="create_labels", handler=create_labels)

# specify only Parquet (offline) target since its not used for real-time
labels_set.set_targets(['parquet'], with_defaults=False)
labels_set.plot(with_targets=True)

../../_images/e33377db8925890de06070cdf43cd1caee738cc2ef22bc2799e307247a134987.svg

Label set - ingestion#

# Ingest the labels feature set
labels_df = fstore.ingest(labels_set, transactions_data)
labels_df.head(3)

	label	timestamp
source
C1093826151	0	2022-03-15 16:13:54.851
C352968107	0	2022-03-14 09:21:09.710
C2054744914	0	2022-03-15 22:41:20.666

Step 3 - Deploy a real-time pipeline#

When dealing with real-time aggregation, it’s important to be able to update these aggregations in real-time. For this purpose, you’ll create live serving functions that update the online feature store of the transactions FeatureSet and Events FeatureSet.

Using MLRun’s serving runtime, create a nuclio function loaded with the feature set’s computational graph definition and an HttpSource to define the HTTP trigger.

Notice that the implementation below does not require any rewrite of the pipeline logic.

3.1 - Transactions#

Transactions - deploy the feature set live endpoint#

# Create iguazio v3io stream and transactions push API endpoint
transaction_stream = f'v3io:///projects/{project.name}/streams/transaction'
transaction_pusher = mlrun.datastore.get_stream_pusher(transaction_stream)

# Define the source stream trigger (use v3io streams)
# Define the `key` and `time` fields (extracted from the Json message).
source = mlrun.datastore.sources.StreamSource(path=transaction_stream , key_field='source', time_field='timestamp')

# Deploy the transactions feature set's ingestion service over a real-time (Nuclio) serverless function
# you can use the run_config parameter to pass function/service specific configuration
transaction_set_endpoint = fstore.deploy_ingestion_service(featureset=transaction_set, source=source)

> 2022-03-16 05:45:43,035 [info] Starting remote function deploy
2022-03-16 05:45:43  (info) Deploying function
2022-03-16 05:45:43  (info) Building
2022-03-16 05:45:43  (info) Staging files and preparing base images
2022-03-16 05:45:43  (warn) Python 3.6 runtime is deprecated and will soon not be supported. Please migrate your code and use Python 3.7 runtime (`python:3.7`) or higher
2022-03-16 05:45:43  (info) Building processor image
2022-03-16 05:47:03  (info) Build complete
2022-03-16 05:47:08  (info) Function deploy complete
> 2022-03-16 05:47:08,835 [info] successfully deployed function: {'internal_invocation_urls': ['nuclio-fraud-demo-admin-transactions-ingest.default-tenant.svc.cluster.local:8080'], 'external_invocation_urls': ['fraud-demo-admin-transactions-ingest-fraud-demo-admin.default-tenant.app.xtvtjecfcssi.iguazio-cd1.com/']}

Transactions - test the feature set HTTP endpoint#

By defining the transactions feature set you can now use MLRun and Storey to deploy it as a live endpoint, ready to ingest new data!