BigQuery ML for Fraud Detection in Credit Card Transactions Using the Console

1. Introduction

In this lab, you will use BigQuery to train and serve a model on tabular data using the console. This console-based workflow complements the SQL-based model training and serving. BigQuery ML enables users to create and execute machine learning models in BigQuery by using SQL queries. The goal is to democratize machine learning by enabling SQL practitioners to build models using their existing tools and to increase development speed by eliminating the need for data movement.

What you will learn

  • Explore the data available in BigQuery
  • Create a model using SQL in the BigQuery console
  • Evaluate the results of the model you created
  • Predict whether a transaction is fraudulent or not with the model you created

2. About the data

The dataset contains transactions made with credit cards in September 2013 by European cardholders. It presents transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.

It contains only numerical input variables which are the result of a PCA transformation. Unfortunately, due to confidentiality issues, we cannot provide the original features and more background information about the data.

  1. Features V1, V2, ... V28 are the principal components obtained with PCA; the only features which have not been transformed with PCA are 'Time' and 'Amount'.
  2. Feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset.
  3. Feature 'Amount' is the transaction amount; this feature can be used for example-dependent cost-sensitive learning.
  4. Feature 'Class' is the response variable and it takes value 1 in case of fraud and 0 otherwise.
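Because the class imbalance noted above drives several of the modeling choices later in this lab, it is worth verifying it yourself. The query below is a minimal sketch against the same public table used throughout this lab; it counts the fraudulent transactions and computes their share of the total (which should come out near 0.172%).

SELECT
  COUNTIF(Class = 1) AS fraud_count,
  COUNT(*) AS total_count,
  ROUND(100 * COUNTIF(Class = 1) / COUNT(*), 3) AS fraud_percentage
FROM `bigquery-public-data.ml_datasets.ulb_fraud_detection`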

The dataset has been collected and analysed during a research collaboration of Worldline and the Machine Learning Group ( http://mlg.ulb.ac.be) of ULB (Université Libre de Bruxelles) on big data mining and fraud detection.

More details on current and past projects on related topics are available at https://www.researchgate.net/project/Fraud-detection-5 and on the page of the DefeatFraud project.

Citation:

Andrea Dal Pozzolo, Olivier Caelen, Reid A. Johnson, and Gianluca Bontempi. Calibrating probability with undersampling for unbalanced classification. Symposium on Computational Intelligence and Data Mining (CIDM), IEEE, 2015.

Andrea Dal Pozzolo, Olivier Caelen, Yann-Aël Le Borgne, Serge Waterschoot, and Gianluca Bontempi. Learned lessons in credit card fraud detection from a practitioner perspective. Expert Systems with Applications, 41(10), 4915-4928, 2014, Pergamon.

Andrea Dal Pozzolo, Giacomo Boracchi, Olivier Caelen, Cesare Alippi, and Gianluca Bontempi. Credit card fraud detection: a realistic modeling and a novel learning strategy. IEEE Transactions on Neural Networks and Learning Systems, 29(8), 3784-3797, 2018, IEEE.

Andrea Dal Pozzolo. Adaptive machine learning for credit card fraud detection. ULB MLG PhD thesis (supervised by G. Bontempi).

Fabrizio Carcillo, Andrea Dal Pozzolo, Yann-Aël Le Borgne, Olivier Caelen, Yannis Mazzer, and Gianluca Bontempi. SCARFF: a scalable framework for streaming credit card fraud detection with Spark. Information Fusion, 41, 182-194, 2018, Elsevier.

Fabrizio Carcillo, Yann-Aël Le Borgne, Olivier Caelen, and Gianluca Bontempi. Streaming active learning strategies for real-life credit card fraud detection: assessment and visualization. International Journal of Data Science and Analytics, 5(4), 285-300, 2018, Springer International Publishing.

Bertrand Lebichot, Yann-Aël Le Borgne, Liyun He, Frederic Oblé, and Gianluca Bontempi. Deep-learning domain adaptation techniques for credit card fraud detection. INNSBDDL 2019: Recent Advances in Big Data and Deep Learning, pp. 78-88, 2019.

Fabrizio Carcillo, Yann-Aël Le Borgne, Olivier Caelen, Frederic Oblé, and Gianluca Bontempi. Combining unsupervised and supervised learning in credit card fraud detection. Information Sciences, 2019.

3. Set up your environment

You'll need a Google Cloud Platform project with billing enabled to run this codelab. To create a project, follow the instructions here.

  1. Enable the BigQuery API

Navigate to BigQuery and select Enable if it isn't already enabled. You'll need this to create your model.

4. Explore the data

Step 1: Navigate to BigQuery

Look for BigQuery in the Google Cloud console at https://cloud.google.com/console

Step 2: Explore the data using query

In the editor, type the SQL query below to review the data in the public dataset.

SELECT * FROM `bigquery-public-data.ml_datasets.ulb_fraud_detection` LIMIT 5

Step 3: Execute

Press the Run button to execute the query.


Results

The results appear in the Query Results panel; the exact rows returned may vary.


Explore the columns involved and the output.

You can run multiple queries to understand how the data is distributed. For example, you can count the non-fraudulent (Class = 0) and fraudulent (Class = 1) transactions:

SELECT COUNT(*) FROM `bigquery-public-data.ml_datasets.ulb_fraud_detection`
WHERE Class = 0;

SELECT COUNT(*) FROM `bigquery-public-data.ml_datasets.ulb_fraud_detection`
WHERE Class = 1;
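You can also compare the two classes side by side. The query below is one possible sketch that aggregates by Class and summarizes the transaction amounts; the column names come from the public dataset.

SELECT
  Class,
  COUNT(*) AS transaction_count,
  ROUND(AVG(Amount), 2) AS avg_amount,
  ROUND(MAX(Amount), 2) AS max_amount
FROM `bigquery-public-data.ml_datasets.ulb_fraud_detection`
GROUP BY Class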

5. Create an output dataset

Step 1: Create a dataset for model creation

  1. In the Explorer panel on the left-hand side, select the project you are currently working on and click the three dots next to it


  2. Click Create Dataset in the top right


  3. Enter the dataset details: dataset ID (this lab uses advdata), data location, and table expiration, then click Create Dataset

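If you prefer to stay in the query editor, you can also create the dataset with DDL. This is an optional alternative sketch; it assumes the advdata dataset name used in the rest of this lab and the US multi-region location (the public source table lives in the US), so adjust as needed for your project.

-- Create the dataset used by this lab, if it does not already exist
CREATE SCHEMA IF NOT EXISTS advdata
OPTIONS(location = 'US')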

6. Create Logistic Regression Model

Step 1: Create Statement

In the query window, type out the query below to create the model. Understand the key options used in this statement (they are explained in this link):

  • INPUT_LABEL_COLS indicates the column to predict (the label)
  • AUTO_CLASS_WEIGHTS balances the class weights, which helps with this imbalanced dataset
  • MODEL_TYPE indicates the algorithm used; in this case it is logistic regression
  • DATA_SPLIT_METHOD indicates how the data is split between training and evaluation; with 'seq', the split is made sequentially on the column given by DATA_SPLIT_COL (here, Time)

CREATE OR REPLACE MODEL advdata.ulb_fraud_detection
TRANSFORM(
    -- Keep every input column except the raw Amount,
    -- and feed the model its log instead (SAFE.LOG returns NULL for zero amounts)
    * EXCEPT(Amount),
    SAFE.LOG(Amount) AS log_amount
)
OPTIONS(
    INPUT_LABEL_COLS=['class'],
    AUTO_CLASS_WEIGHTS = TRUE,
    DATA_SPLIT_METHOD='seq',
    DATA_SPLIT_COL='Time',
    MODEL_TYPE='logistic_reg'
) AS
SELECT
  *
FROM `bigquery-public-data.ml_datasets.ulb_fraud_detection`

Step 2: Execute

Run the statement above. It should take a couple of minutes to complete.

After the run completes, notice the following:


  1. The Explorer panel shows the model you just created
  2. The Query Results panel shows the time it took to process the ML statement, just like any other SQL statement
  3. The Query Results panel also has a Go to Model link for you to explore

Step 3: Explore

Explore the model you created by clicking Go to Model or by selecting it in the Explorer panel. The tabs provide information on the model, its training, evaluation, and so on. Review the results.

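If you would rather inspect the training run with SQL instead of the console tabs, ML.TRAINING_INFO returns per-iteration training statistics for the model; a minimal sketch:

SELECT * FROM ML.TRAINING_INFO(MODEL advdata.ulb_fraud_detection)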

We will explore this further in the next sections.

To learn more about logistic regression, look here.

7. Exploring model evaluation metrics

In this step we'll see how our model performed.

Once your model training job has completed, click on the model you just trained and take a look at the Evaluate tab. There are many evaluation metrics here; we'll focus on just one: roc_auc. The same metrics are also available through ML.EVALUATE:

 SELECT * FROM ML.EVALUATE(MODEL advdata.ulb_fraud_detection)

The results will look something like this. On an imbalanced dataset like this one, roc_auc is generally the more informative metric.

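Precision and recall at a specific decision threshold are also worth checking on a dataset this skewed. ML.CONFUSION_MATRIX returns the model's confusion matrix; the sketch below uses the default 0.5 threshold (you can pass a STRUCT(... AS threshold) argument to change it):

SELECT * FROM ML.CONFUSION_MATRIX(MODEL advdata.ulb_fraud_detection)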

To learn more about the results, see the ML.EVALUATE docs here:

Precision and Recall

ROC Curve and AUC

8. Predict class for a certain time

Press Compose a new query and execute the SQL below. The Time value used here identifies a transaction known to be fraudulent, so we can check that the prediction works as expected.

SELECT Amount, predicted_class_probs, Class
FROM ML.PREDICT( MODEL advdata.ulb_fraud_detection,
 (SELECT * FROM `bigquery-public-data.ml_datasets.ulb_fraud_detection` WHERE Time = 85285.0)
)

Output should look something like this (numbers might vary)

676fc8accdab2f6c.png

In this instance, we display the amount along with the predicted probability for each label. The Class column shows the actual label, so you can compare the prediction to the ground truth.
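predicted_class_probs is an array of (label, prob) structs. If you only want the probability of the fraud class as a flat column, one possible sketch is the query below; it assumes the label values are stored as the integers 1 and 0, as in the public dataset.

SELECT
  Amount,
  Class,
  predicted_class,
  -- Pull out the probability assigned to the fraud label (1)
  (SELECT prob FROM UNNEST(predicted_class_probs) WHERE label = 1) AS fraud_probability
FROM ML.PREDICT(MODEL advdata.ulb_fraud_detection,
  (SELECT * FROM `bigquery-public-data.ml_datasets.ulb_fraud_detection` WHERE Time = 85285.0))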

To learn more about ML.PREDICT, look here.

9. Cleanup

If you'd like to clean up the resources created in this lab, open the dataset from the Explorer panel on the left-hand side.


Click Delete Dataset in the top right corner


Enter the dataset name again to confirm the deletion. In our case this is advdata.
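Alternatively, you can drop the dataset (and the model it contains) from the query editor. This is destructive, so double-check the name first; the sketch below assumes the advdata dataset created earlier in this lab.

-- Remove the lab dataset and everything inside it
DROP SCHEMA IF EXISTS advdata CASCADE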

10. Congratulations

Congratulations, you've successfully created, evaluated, and made predictions with your first supervised machine learning model!

You now know the key steps required to create a logistic regression model.

What's next?

Check out some of these other ways of making predictions with BigQuery ML

Reference docs