FraudFinder: From raw data to AI with Vertex AI and BigQuery.

1. Overview

In this lab, you will learn how to build an end-to-end data to AI system for real-time fraud detection on Google Cloud. The goal is to understand how to go from raw data to a production-ready ML pipeline running on Google Cloud. This lab uses Google Cloud products including BigQuery, BigQuery ML, Dataflow, Pub/Sub, and Vertex AI (Workbench, Feature Store, Pipelines, Model Registry, and Model Monitoring).

What you will learn

Building an end-to-end ML pipeline can be challenging. In this lab, you will learn how to build and scale an end-to-end ML pipeline using Google Cloud services like BigQuery and Vertex AI. We will take you on the journey from raw data to AI in production. The high-level learning objectives for this lab are:

  • Learn best practices for building data to AI systems on Google Cloud.
  • Learn how to do feature engineering with BigQuery using SQL (for batch processing) and Apache Beam on Dataflow (for real-time processing), and how to use Vertex AI Feature Store.
  • Learn how to do data analysis using BigQuery and Python libraries like Pandas and Plotly.
  • Learn how to train an ML model with BigQuery ML through SQL.
  • Learn how to leverage Vertex AI to store, deploy, and monitor your model.
  • Learn how to use Vertex AI Pipelines to formalize your data to AI workflow.

IMPORTANT: The cost to run this lab on Google Cloud is about $100.

2. From raw data to AI with Vertex AI and BigQuery

This lab covers the latest data analytics and AI products available on Google Cloud like Vertex AI and BigQuery ML. Vertex AI and BigQuery make it easier to go from raw data to AI and offer a seamless development experience to help you be more productive at bringing your models into production. If you need any support, please see the support page.

Vertex AI includes many different products to support end-to-end data to AI workflows. Below you will find an overview of all of the Vertex AI capabilities:

Vertex product overview

3. FraudFinder use case and data

FraudFinder is a series of notebooks that teach the comprehensive data to AI journey on Google Cloud, through the use case of real-time fraud detection. Throughout the notebooks, you will learn how to read historical payment transaction data stored in a data warehouse, read from a live stream of new transactions, perform exploratory data analysis (EDA), do feature engineering, ingest features into a feature store, train a model using the feature store, register your model in a model registry, evaluate your model, deploy your model to an endpoint, do real-time inference on your model with the feature store, and monitor your model.

Fraud detection spans classification and anomaly detection, two extensive domains within machine learning. This makes fraud detection a good use case: a real story that's easy to understand and a great way to showcase an end-to-end data to AI architecture on Google Cloud. You don't need to be a fraud expert to understand the end-to-end architecture, and the pattern of the architecture can be applied to other use cases.

Below you'll find an overview of the FraudFinder architecture:

FraudFinder Architecture

Dataset

The dataset is synthesized using the code from the Machine Learning for Credit Card Fraud Detection - Practical Handbook project from Kaggle. Real-time fraud detection is architecturally different from batch-based fraud detection and is characterized by the following:

  • High frequency of prediction requests (e.g., 1,000 per second)
  • Low latency (e.g., < 1 sec) from prediction request to response
  • Predictions are typically made for one sample per request, or in "micro-batches" (e.g., 1,000 transactions sent as a batch for near real-time inference)
  • Feature engineering for serving must be pre-computed or computed in real time

FraudFinder historical dataset

There are public BigQuery tables with historical payment transactions, allowing users to train their models and do feature engineering using data in BigQuery.

cymbal-fraudfinder (project)
|-`tx` (dataset)
  |-`tx` (table: transactions without labels)
  |-`txlabels` (table: transactions with fraud labels (1 or 0))
|-`demographics` (dataset)
  |-`customers` (table: profiles of customers)
  |-`terminals` (table: profiles of terminals)
  |-`customersterminals` (table: profiles of customers and terminals within their radius)
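
For example, you can preview the public transactions table directly from Cloud Shell with the bq command-line tool (a quick optional sketch; bq is preinstalled in Cloud Shell):

# Preview a few rows of the public historical transactions table
bq query --use_legacy_sql=false \
  'SELECT * FROM `cymbal-fraudfinder.tx.tx` LIMIT 5'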

Why real-time?

In this lab, you will learn how to leverage real-time data and apply real-time feature engineering and inference. Real-time features can help improve your model by leveraging signals you otherwise wouldn't be able to use at inference time.

FraudFinder live, streaming data

As part of the FraudFinder lab, there are public Pub/Sub topics with live streaming payment transactions so that users can conveniently test their model endpoints and stream features. Pub/Sub is an asynchronous and scalable messaging service. You will use these topics to stream features and perform online inference. You can also switch between topics with baseline vs. higher fraud rates to demonstrate model monitoring. The following public Pub/Sub topics are available:

  • ff-tx
  • ff-txlabels
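
Once you have created subscriptions to these topics (done in the setup step later in this lab), you can sanity-check the live stream from Cloud Shell. This is an optional sketch that pulls a few messages from the transactions subscription:

# Pull and acknowledge a few live transaction messages
gcloud pubsub subscriptions pull ff-tx-sub --limit=3 --auto-ack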

4. Set up your project and notebook instance

You will need a Google Cloud Platform project with billing enabled to run this lab. To create a project, follow the instructions.

IMPORTANT: We advise you to run this lab in a new project. This lab covers many different products, and it's easiest if you delete the entire project after you are done with the lab.

When you have a project, please continue with the following steps. The same steps can also be found in the README.md file in the repo.

Step 1: Enable the APIs

First, go to the project you just created and open Cloud Shell. This step might take a few minutes because it will provision a new Cloud Shell if you haven't used one before.

Cloud Shell

Next, execute the following code in your Cloud Shell by copying and pasting. The script will enable the necessary APIs and create Pub/Sub subscriptions to read streaming transactions from public Pub/Sub topics. Please give it some time to execute all the commands.

gcloud services enable notebooks.googleapis.com
gcloud services enable cloudresourcemanager.googleapis.com
gcloud services enable aiplatform.googleapis.com
gcloud services enable pubsub.googleapis.com
gcloud services enable run.googleapis.com
gcloud services enable cloudbuild.googleapis.com
gcloud services enable dataflow.googleapis.com
gcloud services enable bigquery.googleapis.com

gcloud pubsub subscriptions create "ff-tx-sub" --topic="ff-tx" --topic-project="cymbal-fraudfinder"
gcloud pubsub subscriptions create "ff-txlabels-sub" --topic="ff-txlabels" --topic-project="cymbal-fraudfinder"

# Grant the Compute Engine default service account access to read and write
# pipeline artifacts in Google Cloud Storage.
PROJECT_ID=$(gcloud config get-value project)
PROJECT_NUM=$(gcloud projects list --filter="$PROJECT_ID" --format="value(PROJECT_NUMBER)")
gcloud projects add-iam-policy-binding $PROJECT_ID \
      --member="serviceAccount:${PROJECT_NUM}-compute@developer.gserviceaccount.com" \
      --role='roles/storage.admin'
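
If you want to verify that the script worked, a quick optional check like the following lists the enabled APIs and the two new subscriptions:

# List enabled APIs (look for aiplatform, pubsub, dataflow, bigquery, etc.)
gcloud services list --enabled

# Confirm the two Pub/Sub subscriptions were created
gcloud pubsub subscriptions list --format="value(name)"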

Step 2: Create a Vertex AI Workbench instance

Next, navigate to the Vertex AI section of your Cloud Console, then navigate to Workbench:

Vertex AI menu

Enable the Vertex AI Workbench (Notebooks) API if it's not already enabled.

Notebook_api

Once enabled, select USER-MANAGED NOTEBOOKS:

Notebooks_UI

Then select NEW NOTEBOOK. You can choose Python 3.

new_notebook

Give your notebook a name, like fraudfinder, then click Advanced Settings.

create_notebook

Important: Make sure that you select Service Account under Permissions.

Service Account

Important: Under Security select "Enable terminal" if it is not already enabled.

enable_terminal

You can leave all of the other advanced settings as is.

Next, click Create. The instance will take a couple minutes to be provisioned.

Once the instance has been created, select Open JupyterLab.

open_jupyterlab
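
If you prefer the command line, a roughly equivalent user-managed notebook instance can be created with gcloud. This is only a sketch; the zone, machine type, and image family below are assumptions you may want to adjust:

# Create a user-managed notebook instance named "fraudfinder"
# (zone, machine type, and image family are example values)
gcloud notebooks instances create fraudfinder \
  --location=us-central1-a \
  --machine-type=n1-standard-4 \
  --vm-image-project=deeplearning-platform-release \
  --vm-image-family=common-cpu-notebooks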

Step 3: Set IAM Roles

For the sake of simplicity, let's assume you will use the Compute Engine default service account. This is not a best practice for production workloads; the best practice is to create dedicated service accounts for each application and avoid using default service accounts. You can read more on service account best practices in our documentation. The Compute Engine default service account will look something like this: 123456789123-compute@developer.gserviceaccount.com. Go to IAM Admin and click ADD. In the view, search for and select the Compute Engine default service account and then assign the following roles:

  • BigQuery Admin
  • Storage Admin
  • Storage Object Admin
  • Vertex AI Administrator
  • Pub/Sub Admin

It should look something like the following. Don't forget to save the new settings!

iam-roles.png
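
If you prefer to grant these roles from Cloud Shell instead of the console, a sketch like the following should work; the role IDs are the standard IAM equivalents of the roles listed above, and the loop is just for brevity:

# Grant the listed roles to the Compute Engine default service account
PROJECT_ID=$(gcloud config get-value project)
PROJECT_NUM=$(gcloud projects describe $PROJECT_ID --format="value(projectNumber)")
SA="serviceAccount:${PROJECT_NUM}-compute@developer.gserviceaccount.com"
for ROLE in roles/bigquery.admin roles/storage.admin roles/storage.objectAdmin \
            roles/aiplatform.admin roles/pubsub.admin; do
  gcloud projects add-iam-policy-binding $PROJECT_ID --member="$SA" --role="$ROLE"
done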

Step 4: Clone the GitHub repo

Once you've created and opened your notebook instance, it's time to set up your environment. First, open a terminal window.

Open Terminal

Copy, paste and run the following command in your notebook terminal:

git clone https://github.com/GoogleCloudPlatform/fraudfinder.git

Running this command will clone the FraudFinder repository into your notebook instance. After running the git clone, you will find the fraudfinder folder in your notebook instance on the left. Navigate to the fraudfinder folder; here you will find the notebooks needed for the lab.

From the next section onward, you are expected to follow the instructions in the notebooks. Please continue with the environment setup.

5. Environment Setup

This section will go through the steps to help set up your project environment. In this section, you will cover the following learning objectives:

  • Set up your environment, including packages.
  • Load data into BigQuery.
  • Read data from the public Pub/Sub topics.

Please continue with the following notebook and follow the instructions step by step:

  • 00_environment_setup.ipynb

6. Exploratory Data Analysis

This section will teach you how to perform exploratory data analysis (EDA) to understand the fraud data better. In this section, you will cover the following learning objectives:

  • Extract and explore data from BigQuery using SQL
  • Plot transaction data using BigQuery and Plotly
  • Apply data aggregations and create a scatter plot
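
To give a flavor of this kind of exploration from the command line, here is a hedged sketch of a daily aggregation over the public transactions table; the timestamp column name (TX_TS) is an assumption, so check the actual schema in the notebook first:

# Count transactions per day (TX_TS is an assumed column name; verify against the schema)
bq query --use_legacy_sql=false \
  'SELECT
     DATE(TX_TS) AS tx_date,
     COUNT(*)    AS n_transactions
   FROM `cymbal-fraudfinder.tx.tx`
   GROUP BY tx_date
   ORDER BY tx_date DESC
   LIMIT 7'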

Please continue with the next notebook and follow the instructions step by step:

  • 01_exploratory_data_analysis.ipynb

7. Feature Engineering Batch and Streaming

In this section, you will work on feature engineering to generate features for model training from the raw data. You will use both batch and streaming processing, as both are important for fraud detection. In this section, you will cover the following learning objectives:

  • How to create features using BigQuery and SQL
  • Create a Vertex AI Feature Store and insert data
  • How to deal with streaming data and ingest it into the Feature Store
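
As a flavor of the batch feature engineering covered in the notebooks, the sketch below computes simple per-customer aggregates with SQL; the table and column names are placeholders, not the lab's actual schema:

# Compute per-customer aggregates over a 7-day window (placeholder table and columns)
bq query --use_legacy_sql=false '
  SELECT
    CUSTOMER_ID,
    COUNT(*)       AS customer_nb_tx_7day_window,
    AVG(TX_AMOUNT) AS customer_avg_amount_7day_window
  FROM `my_project.my_dataset.transactions`  -- placeholder table
  WHERE TX_TS >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  GROUP BY CUSTOMER_ID'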

Please continue with the following two Notebooks in this order and follow the instructions in the Notebooks:

  • 02_feature_engineering_batch.ipynb
  • 03_feature_engineering_streaming.ipynb

8. Model Training, Prediction, Formalization and Monitoring

In this section, you will train and deploy your first BigQuery ML model to detect possible fraud cases. You will also learn to take your training and deployment code and formalize it into an automated pipeline, and how to do online predictions and monitor your model in production. In this section, you will cover the following learning objectives:

  • How to train a BigQuery ML model and register it on Vertex AI Model Registry
  • Deploy the model as an endpoint on Vertex AI
  • How to use the Vertex AI SDK
  • How you can take the BigQuery ML model and create an end-to-end ML pipeline
  • How to use Vertex AI Model Monitoring
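
To give a flavor of what the training notebook does, here is a minimal, hedged sketch of training a BigQuery ML classifier from the command line; the dataset, table, and column names are placeholders rather than the lab's actual schema:

# Train a simple BigQuery ML classifier (all names below are placeholders)
bq query --use_legacy_sql=false '
  CREATE OR REPLACE MODEL `my_dataset.fraudfinder_model`   -- placeholder dataset and model name
  OPTIONS (
    model_type = "LOGISTIC_REG",                           -- a simple baseline classifier
    input_label_cols = ["TX_FRAUD"]                        -- placeholder label column
  ) AS
  SELECT
    TX_AMOUNT,                                             -- placeholder feature column
    TX_FRAUD                                               -- placeholder label column
  FROM `my_dataset.training_data`                          -- placeholder training table
'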

Please continue with the following notebooks in this order and follow the instructions step by step. The notebooks can be found in the BQML folder:

  • 04_model_training_and_prediction.ipynb
  • 05_model_training_pipeline_formalization.ipynb
  • 06_model_monitoring.ipynb
  • 07_model_inference.ipynb

🎉 Congratulations! 🎉

You have learned how to build a data to AI architecture on Google Cloud!

9. Cleanup

We advise you to run this lab in a new project. This lab covers many different products, so it's easiest to delete the entire project after you are done with the lab. You can find more information on how to delete a project in our documentation.

If you want to delete individual services instead, please follow the instructions in the notebooks or delete the resources you created.