Building a financial ML model with the What-If Tool and Vertex AI

In this lab, you will use the What-if Tool to analyze an XGBoost model trained on financial data. After analyzing the model you'll deploy it to Cloud's new Vertex AI.

What you'll learn

You'll learn how to:

  • Train an XGBoost model on a public mortgage dataset in a hosted notebook
  • Analyze the model using the What-if Tool
  • Deploy the XGBoost model to Vertex AI

The total cost to run this lab on Google Cloud is about $1.

This lab uses the newest AI product offering available on Google Cloud. Vertex AI integrates the ML offerings across Google Cloud into a seamless development experience. Previously, models trained with AutoML and custom models were accessible via separate services. The new offering combines both into a single API, along with other new products. You can also migrate existing projects to Vertex AI. If you have any feedback, please see the support page.

Vertex AI includes many different products to support end-to-end ML workflows. This lab will focus on the products highlighted below: Prediction and Notebooks.

Vertex product overview

XGBoost is a machine learning framework that uses decision trees and gradient boosting to build predictive models. It works by ensembling multiple decision trees together based on the score associated with different leaf nodes in a tree.

The diagram below is a visualization of a simple decision tree model that evaluates whether a sports game should be played based on the weather forecast:

Tree model example

Why are we using XGBoost for this model? While traditional neural networks have been shown to perform best on unstructured data like images and text, decision trees often perform extremely well on structured data like the mortgage dataset we'll be using in this codelab.
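To make the idea of ensembling leaf scores concrete, here is a minimal sketch of how a boosted tree ensemble turns several trees into one prediction. The two trees, their split points, and their leaf scores are made up for illustration; a real XGBoost model learns them from data and also applies details such as a learning rate and a base score, which are omitted here.

```python
import math

# Two hand-written "trees" with hypothetical splits and leaf scores.
def tree_1(applicant):
    # Single split on applicant income (in thousands).
    return 0.8 if applicant["income_thousands"] >= 60 else -0.6

def tree_2(applicant):
    # Single split on loan amount (in thousands).
    return -0.5 if applicant["loan_amt_thousands"] >= 400 else 0.3

def predict_probability(applicant, trees):
    # Add up the leaf score each tree assigns to this applicant,
    # then squash the total into (0, 1) with the logistic function.
    raw_score = sum(tree(applicant) for tree in trees)
    return 1.0 / (1.0 + math.exp(-raw_score))

applicant = {"income_thousands": 85, "loan_amt_thousands": 250}
probability = predict_probability(applicant, [tree_1, tree_2])
print(round(probability, 3))
```

Each tree contributes a small positive or negative score, and the summed score decides how confident the ensemble is.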

You'll need a Google Cloud Platform project with billing enabled to run this codelab. To create a project, follow the instructions here.

Step 1: Enable the Compute Engine API

Navigate to Compute Engine and select Enable if it isn't already enabled. You'll need this to create your notebook instance.

Step 2: Enable the Vertex AI API

Navigate to the Vertex section of your Cloud Console and click Enable Vertex AI API.

Vertex dashboard

Step 3: Create a Notebooks instance

From the Vertex section of your Cloud Console, click on Notebooks:

Select notebooks

From there, select New Instance. Then select the TensorFlow Enterprise 2.3 instance type without GPUs:

TFE instance

Use the default options and then click Create. Once the instance has been created, select Open JupyterLab.

Step 4: Install XGBoost

Once your JupyterLab instance has opened, you'll need to add the XGBoost package.

To do this, select Terminal from the launcher:

Then run the following to install the latest version of XGBoost supported by Vertex AI:

pip3 install xgboost==1.2

After this completes, open a Python 3 Notebook instance from the launcher. You're ready to get started in your notebook!

Step 5: Import Python packages

In the first cell of your notebook, add the following imports and run the cell. You can run it by pressing the right arrow button in the top menu or pressing command-enter:

import pandas as pd
import xgboost as xgb
import numpy as np
import collections
import witwidget

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.utils import shuffle
from witwidget.notebook.visualization import WitWidget, WitConfigBuilder

We'll use a mortgage dataset from ffiec.gov to train an XGBoost model. We've done some preprocessing on the original dataset and created a smaller version for you to use to train the model. The model will predict whether or not a particular mortgage application will get approved.

Step 1: Download the pre-processed dataset

We've made a version of the dataset available for you in Google Cloud Storage. You can download it by running the following gsutil command in your Jupyter notebook:

!gsutil cp 'gs://mortgage_dataset_files/mortgage-small.csv' .

Step 2: Read the dataset with Pandas

Before we create our Pandas DataFrame we'll create a dict of each column's data type so that Pandas reads our dataset correctly:

COLUMN_NAMES = collections.OrderedDict({
 'as_of_year': np.int16,
 'agency_code': 'category',
 'loan_type': 'category',
 'property_type': 'category',
 'loan_purpose': 'category',
 'occupancy': np.int8,
 'loan_amt_thousands': np.float64,
 'preapproval': 'category',
 'county_code': np.float64,
 'applicant_income_thousands': np.float64,
 'purchaser_type': 'category',
 'hoepa_status': 'category',
 'lien_status': 'category',
 'population': np.float64,
 'ffiec_median_fam_income': np.float64,
 'tract_to_msa_income_pct': np.float64,
 'num_owner_occupied_units': np.float64,
 'num_1_to_4_family_units': np.float64,
 'approved': np.int8
})

Next we'll create a DataFrame, passing it the data types we specified above. It's important to shuffle our data in case the original dataset is ordered in a specific way. We use an sklearn utility called shuffle to do this, which we imported in the first cell:

data = pd.read_csv(
 'mortgage-small.csv',
 index_col=False,
 dtype=COLUMN_NAMES
)
data = data.dropna()
data = shuffle(data, random_state=2)
data.head()

data.head() lets us preview the first five rows of our dataset in Pandas. You should see something like this after running the cell above:

Mortgage dataset preview

These are the features we'll be using to train our model. If you scroll all the way to the end, you'll see the last column approved, which is the thing we're predicting. A value of 1 indicates a particular application was approved, and 0 indicates it was denied.

To see the distribution of approved / denied values in the dataset and create a numpy array of the labels, run the following:

# Class labels - 0: denied, 1: approved
print(data['approved'].value_counts())

labels = data['approved'].values
data = data.drop(columns=['approved'])

About 66% of the applications in the dataset were approved.

Step 3: Create dummy columns for categorical values

This dataset contains a mix of categorical and numerical values, but XGBoost requires that all features be numerical. To convert the categorical values, we'll one-hot encode them using the Pandas get_dummies function.

get_dummies takes a column with multiple possible values and converts it into a series of columns each with only 0s and 1s. For example, if we had a column "color" with possible values of "blue" and "red," get_dummies would transform this into 2 columns called "color_blue" and "color_red" with all boolean 0 and 1 values.

To create dummy columns for our categorical features, run the following code:

dummy_columns = list(data.dtypes[data.dtypes == 'category'].index)
data = pd.get_dummies(data, columns=dummy_columns)

data.head()

When you preview the data this time, you'll see single features (like purchaser_type pictured below) split into multiple columns:

Pandas dummy columns
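The "color" example described above can be reproduced on a tiny DataFrame. This is an illustrative sketch with made-up data. The dtype=int argument is an addition here so that the output shows 0s and 1s regardless of the installed Pandas version, since newer versions default to True/False.

```python
import pandas as pd

# Toy data: one numerical column and one categorical column.
toy = pd.DataFrame({
    "loan_amt_thousands": [120.0, 300.0, 85.0],
    "color": ["blue", "red", "blue"],
})

# Replace "color" with one 0/1 column per distinct value.
encoded = pd.get_dummies(toy, columns=["color"], dtype=int)
print(encoded)
```

The numerical column is left untouched, and the original "color" column is replaced by "color_blue" and "color_red".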

Step 4: Splitting data into train and test sets

An important concept in machine learning is train / test split. We'll take the majority of our data and use it to train our model, and we'll set aside the rest for testing our model on data it's never seen before.

Add the following code to your notebook, which uses the Scikit-learn function train_test_split to split our data:

x,y = data.values,labels
x_train,x_test,y_train,y_test = train_test_split(x,y)
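When called without a test_size argument, train_test_split holds out 25% of the rows for testing. The sketch below uses made-up data to show the resulting shapes. The random_state argument is optional and is added here only so that the split is reproducible between runs.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 toy examples with 2 features each, and alternating 0/1 labels.
x = np.arange(200).reshape(100, 2)
y = np.arange(100) % 2

# Default split: 75% train, 25% test. random_state fixes the shuffle.
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=42)
print(x_train.shape, x_test.shape)
```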

Now you're ready to build and train your model!

Step 1: Define and train the XGBoost model

Creating a model in XGBoost is simple. We'll use the XGBClassifier class to create the model, and just need to pass the right objective parameter for our specific classification task. In this case we use reg:logistic since we've got a binary classification problem and we want the model to output a single score between 0 and 1, where scores near 0 mean not approved and scores near 1 mean approved.

The following code will create an XGBoost model:

model = xgb.XGBClassifier(
    objective='reg:logistic'
)

You can train the model with one line of code, calling the fit() method and passing it the training data and labels.

model.fit(x_train, y_train)

Step 2: Evaluate the accuracy of your model

We can now use our trained model to generate predictions on our test data with the predict() function.

Then we'll use Scikit-learn's accuracy_score() function to calculate the accuracy of our model based on how it performs on our test data. We'll pass it the ground truth values along with the model's predicted values for each example in our test set:

y_pred = model.predict(x_test)
acc = accuracy_score(y_test, y_pred.round())
print(acc, '\n')

You should see accuracy around 87%, but yours will vary slightly since there is always an element of randomness in machine learning.
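Accuracy alone hides which kinds of mistakes the model makes. The confusion_matrix function imported in the first cell breaks the results down into true and false positives and negatives. The sketch below uses a handful of made-up labels and predictions rather than the lab's model output.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up ground truth and predictions (1 = approved, 0 = denied).
y_true = [1, 0, 1, 1, 0, 1, 0, 1]
y_hat = [1, 0, 0, 1, 1, 1, 0, 1]

toy_acc = accuracy_score(y_true, y_hat)

# For binary labels, ravel() yields the counts in this order:
# true negatives, false positives, false negatives, true positives.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()
print(toy_acc, tn, fp, fn, tp)
```

In the notebook, the same call could be made with y_test and y_pred in place of the toy lists.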

Step 3: Save your model

In order to deploy the model, run the following code to save it to a local file:

model.save_model('model.bst')

Step 1: Create the What-if Tool visualization

To connect the What-if Tool to your local model, you need to pass it a subset of your test examples along with the ground truth values for those examples. Let's create a Numpy array of 500 of our test examples along with their ground truth labels:

num_wit_examples = 500
test_examples = np.hstack((x_test[:num_wit_examples],y_test[:num_wit_examples].reshape(-1,1)))
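The hstack call above appends the label as an extra column on each example. The labels need reshape(-1, 1) first so that they form a column rather than a flat array. Here is the same pattern on a made-up array small enough to inspect by eye:

```python
import numpy as np

# Three toy examples with two features each, plus their labels.
features = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
toy_labels = np.array([0, 1, 1])

# reshape(-1, 1) turns the labels into a 3x1 column so hstack can append it.
combined = np.hstack((features, toy_labels.reshape(-1, 1)))
print(combined)
```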

Instantiating the What-if Tool is as simple as creating a WitConfigBuilder object and passing it the model we'd like to analyze.

Since the What-if Tool expects a list of scores for each class in our model (in this case 2), we'll use XGBoost's predict_proba method with the What-If Tool:

config_builder = (WitConfigBuilder(test_examples.tolist(), data.columns.tolist() + ['mortgage_status'])
  .set_custom_predict_fn(model.predict_proba)
  .set_target_feature('mortgage_status')
  .set_label_vocab(['denied', 'approved']))
WitWidget(config_builder, height=800)

Note that it'll take a minute to load the visualization. When it loads, you should see the following:

What-If Tool initial view

The y-axis shows us the model's prediction, with 1 being a high confidence approved prediction, and 0 being a high confidence denied prediction. The x-axis is just the spread of all loaded data points.
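The custom prediction function passed to the What-If Tool should return one score per class for every example. The sketch below shows that output shape using a hypothetical stand-in for model.predict_proba. The scoring rule inside it is made up and is not how the trained model works.

```python
import numpy as np

def toy_predict_proba(examples):
    """Hypothetical stand-in: returns [P(denied), P(approved)] per row."""
    examples = np.asarray(examples, dtype=float)
    # Made-up rule: logistic function of the first feature only.
    p_approved = 1.0 / (1.0 + np.exp(-examples[:, 0]))
    # One column per class, in the same order as the label vocab.
    return np.column_stack((1.0 - p_approved, p_approved))

scores = toy_predict_proba([[0.0, 5.0], [2.0, 1.0]])
print(scores)
```

The column order matters: it should match the order of the list given to set_label_vocab, here denied first and approved second.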

Step 2: Explore individual data points

The default view on the What-if Tool is the Datapoint editor tab. Here you can click on any individual data point to see its features, change feature values, and see how that change impacts the model's prediction on an individual data point.

In the example below we chose a data point close to the .5 threshold. The mortgage application associated with this particular data point originated from the CFPB. We changed that feature to 0 and also changed the value of agency_code_Department of Housing and Urban Development (HUD) to 1 to see what would happen to the model's prediction if this loan instead originated from HUD:

As we can see in the bottom left section of the What-if Tool, changing this feature significantly decreased the model's approved prediction by 32%. This could indicate that the agency a loan originated from has a strong impact on the model's output, but we'll need to do more analysis to be sure.

In the bottom left part of the UI, we can also see the ground truth value for each data point and compare it to the model's prediction:

Step 3: Counterfactual analysis

Next, click on any datapoint and move the Show nearest counterfactual datapoint slider to the right:

Selecting this will show you the data point that has the most similar feature values to the original one you selected, but the opposite prediction. You can then scroll through the feature values to see where the two data points differed (the differences are highlighted in green and bold).
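Conceptually, finding a nearest counterfactual means searching for the closest data point whose prediction differs from the selected one. The sketch below is a simplified illustration using a plain L1 distance on made-up points. The tool's own distance calculation is not reproduced here, and the function names are hypothetical.

```python
def l1_distance(a, b):
    # Sum of absolute differences across all features.
    return sum(abs(x - y) for x, y in zip(a, b))

def nearest_counterfactual(selected, selected_pred, points, preds):
    # Keep only points with a different prediction, then pick the closest.
    candidates = [
        (l1_distance(selected, point), index)
        for index, (point, pred) in enumerate(zip(points, preds))
        if pred != selected_pred
    ]
    if not candidates:
        return None
    return min(candidates)[1]

# Made-up data: [income_thousands, loan_amt_thousands] and predicted class.
points = [[50.0, 200.0], [52.0, 210.0], [90.0, 150.0], [51.0, 205.0]]
preds = [0, 1, 1, 0]

selected = [50.0, 201.0]
counterfactual_index = nearest_counterfactual(selected, 0, points, preds)
print(counterfactual_index, points[counterfactual_index])
```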

Step 4: Look at partial dependence plots

To see how each feature affects the model's predictions overall, check the Partial dependence plots box and make sure Global partial dependence plots is selected:

Here we can see that loans originating from HUD have a slightly higher likelihood of being denied. The graph is this shape because agency code is a boolean feature, so values can only be exactly 0 or 1.

applicant_income_thousands is a numerical feature, and in the partial dependence plot we can see that higher income slightly increases the likelihood of an application being approved, but only up to around $200k. After $200k, this feature doesn't impact the model's prediction.
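A partial dependence plot can be computed by overwriting one feature with a fixed value in every row, averaging the model's predictions, and repeating for a range of values. The sketch below assumes a hypothetical toy_model whose score flattens above an income of 200, mimicking the shape described above. It is not the trained XGBoost model.

```python
def toy_model(row):
    """Hypothetical model: score rises with income up to 200, then flattens."""
    income, loan = row
    return min(income, 200.0) / 400.0 + (0.25 if loan < 300.0 else 0.0)

def partial_dependence(model, rows, feature_index, grid):
    averages = []
    for value in grid:
        total = 0.0
        for row in rows:
            # Copy the row and overwrite only the feature being studied.
            modified = list(row)
            modified[feature_index] = value
            total += model(modified)
        averages.append(total / len(rows))
    return averages

# Made-up rows: [income_thousands, loan_amt_thousands].
rows = [[40.0, 150.0], [120.0, 350.0], [260.0, 500.0], [80.0, 100.0]]
grid = [50.0, 100.0, 200.0, 300.0]
pd_income = partial_dependence(toy_model, rows, 0, grid)
print(pd_income)
```

The averaged score increases across the first three grid values and stays the same between 200 and 300, which is the kind of plateau the plot makes visible.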

Step 5: Explore overall performance and fairness

Next, go to the Performance & Fairness tab. This shows overall performance statistics on the model's results on the provided dataset, including confusion matrices, PR curves, and ROC curves.

Select mortgage_status as the Ground Truth Feature to see a confusion matrix:

This confusion matrix shows our model's correct and incorrect predictions as a percentage of the total. If you add up the Actual Yes / Predicted Yes and Actual No / Predicted No squares, the sum should match your model's accuracy (in this case around 87%, though your model may vary slightly since there is an element of randomness in training ML models).

You can also experiment with the threshold slider, raising and lowering the positive classification score the model needs to return before it decides to predict approved for the loan, and see how that changes accuracy, false positives, and false negatives. In this case, accuracy is highest around a threshold of .55.
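What the threshold slider does can be sketched in a few lines: every score at or above the threshold counts as an approved prediction, and accuracy is recomputed for each setting. The scores and labels below are made up, so the best threshold here differs from the one in the tool.

```python
def accuracy_at_threshold(scores, labels, threshold):
    # A score at or above the threshold is treated as "approved" (1).
    predictions = [1 if s >= threshold else 0 for s in scores]
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)

# Made-up model scores and ground truth labels.
scores = [0.95, 0.80, 0.62, 0.58, 0.52, 0.45, 0.30, 0.10]
labels = [1, 1, 1, 0, 0, 1, 0, 0]

thresholds = [0.3, 0.4, 0.5, 0.6, 0.7]
results = {t: accuracy_at_threshold(scores, labels, t) for t in thresholds}
best_threshold = max(results, key=results.get)
print(results, best_threshold)
```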

Next, on the left Slice by dropdown, select loan_purpose_Home_purchase:

You'll now see performance on the two subsets of your data: the "0" slice shows when the loan is not for a home purchase, and the "1" slice is for when the loan is for a home purchase. Compare the accuracy, false positive rate, and false negative rate between the two slices to look for differences in performance.

If you expand the rows to look at the confusion matrices, you can see that the model predicts "approved" for ~70% of loan applications for home purchases and only 46% of loans that aren't for home purchases (exact percentages will vary on your model):

If you select Demographic parity from the radio buttons on the left, the two thresholds will be adjusted so that the model predicts approved for a similar percentage of applicants in both slices. What does this do to the accuracy, false positives and false negatives for each slice?
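Demographic parity compares the share of "approved" predictions in each slice. The sketch below uses made-up scores and hand-picked thresholds to show how a shared threshold can produce unequal rates, and how per-slice thresholds can bring them together. The tool searches for suitable thresholds itself, and that search is not reproduced here.

```python
def positive_rate(scores, threshold):
    # Fraction of examples predicted "approved" at this threshold.
    return sum(1 for s in scores if s >= threshold) / len(scores)

# Made-up model scores for the two slices.
slice_0_scores = [0.2, 0.35, 0.45, 0.55, 0.7]  # not a home purchase
slice_1_scores = [0.4, 0.6, 0.65, 0.8, 0.9]    # home purchase

# One shared threshold gives very different approval rates.
rate_0_shared = positive_rate(slice_0_scores, 0.5)
rate_1_shared = positive_rate(slice_1_scores, 0.5)

# Hand-picked per-slice thresholds equalize the approval rates.
rate_0_adjusted = positive_rate(slice_0_scores, 0.4)
rate_1_adjusted = positive_rate(slice_1_scores, 0.62)
print(rate_0_shared, rate_1_shared, rate_0_adjusted, rate_1_adjusted)
```

Equalizing the approval rates changes which individual applications are approved in each slice, which is why accuracy and error rates shift when this option is selected.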

Step 6: Explore feature distribution

Finally, navigate to the Features tab in the What-if Tool. This shows you the distribution of values for each feature in your dataset:

You can use this tab to make sure your dataset is balanced. For example, it looks like very few loans in the dataset originated from the Farm Service Agency. To improve model accuracy, we might consider adding more loans from that agency if the data is available.

We've described just a few What-if Tool exploration ideas here. Feel free to keep playing around with the tool. There are plenty more areas to explore!

We've got our model working locally, but it would be nice if we could make predictions on it from anywhere (not just this notebook!). In this step we'll deploy it to the cloud.

Step 1: Create a Cloud Storage bucket for our model

Let's first define some environment variables that we'll be using throughout the rest of the codelab. Fill in the values below with the name of your Google Cloud project, the name of the Cloud Storage bucket you'd like to create (must be globally unique), and the name of your model:

# Update the variables below to your own Google Cloud project ID and GCS bucket name. You can leave the model name we've specified below:
GCP_PROJECT = 'your-gcp-project'
MODEL_BUCKET = 'gs://storage_bucket_name'
MODEL_NAME = 'xgb_mortgage'

Now we're ready to create a storage bucket to store our XGBoost model file. We'll point Vertex AI at this file when we deploy.

Run this gsutil command from within your notebook to create a regional storage bucket:

!gsutil mb -l us-central1 $MODEL_BUCKET

Step 2: Copy the model file to Cloud Storage

Next, we'll copy our XGBoost saved model file to Cloud Storage. Run the following gsutil command:

!gsutil cp ./model.bst $MODEL_BUCKET

Head over to the storage browser in your Cloud Console to confirm the file has been copied:

Step 3: Create the model and deploy to an endpoint

We're almost ready to deploy the model to the cloud! In Vertex AI, models and endpoints are separate resources, and a model is served by deploying it to an endpoint. We'll first create a model, then create an endpoint and deploy the model to it.

First, use the gcloud CLI to create your model:

!gcloud beta ai models upload \
--display-name=$MODEL_NAME \
--artifact-uri=$MODEL_BUCKET \
--container-image-uri=us-docker.pkg.dev/cloud-aiplatform/prediction/xgboost-cpu.1-2:latest \
--region=us-central1

The artifact-uri parameter will point to the Storage location where you saved your XGBoost model. The container-image-uri parameter tells Vertex AI which pre-built container to use for serving. Once this command completes, navigate to the models section of your Vertex console to get the ID of your new model. You can find it here:

Get model ID from console

Copy that ID and save it to a variable:

MODEL_ID = "your_model_id"

Now it's time to create an endpoint that will serve this model. We can do that with this gcloud command:

!gcloud beta ai endpoints create \
--display-name=xgb_mortgage_v1 \
--region=us-central1

When that completes, you should see the location of your endpoint logged in your notebook output. Look for the line that says the endpoint was created with a path that looks like the following: projects/project_ID/locations/us-central1/endpoints/endpoint_ID. Then replace the value below with the ID of the endpoint you just created:

ENDPOINT_ID = "your_endpoint_id"
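If you'd rather not copy the ID by hand, it can be extracted from the logged resource path. The helper below is a hypothetical convenience, not part of gcloud or any Vertex AI library, and the path it parses is a made-up placeholder in the format shown above.

```python
def endpoint_id_from_path(resource_path):
    """Return the trailing endpoint ID from a full endpoint resource path."""
    parts = resource_path.strip().split("/")
    # Expected layout: projects/<project>/locations/<region>/endpoints/<id>
    looks_valid = (
        len(parts) == 6
        and parts[0] == "projects"
        and parts[2] == "locations"
        and parts[4] == "endpoints"
    )
    if not looks_valid:
        raise ValueError("Unexpected endpoint path: " + resource_path)
    return parts[5]

# Placeholder path with made-up IDs.
example_path = "projects/123456789/locations/us-central1/endpoints/987654321"
example_endpoint_id = endpoint_id_from_path(example_path)
print(example_endpoint_id)
```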

To deploy your model to the endpoint, run the gcloud command below:

!gcloud beta ai endpoints deploy-model $ENDPOINT_ID \
--region=us-central1 \
--model=$MODEL_ID \
--display-name=xgb_mortgage_v1 \
--machine-type=n1-standard-2 \
--traffic-split=0=100

The endpoint deploy will take ~5-10 minutes to complete. While your endpoint is deploying, head over to the models section of your console. Click on your model and you should see your endpoint deploying:

When the deploy completes successfully you'll see a green check mark where the loading spinner is.

Step 4: Test the deployed model

To make sure your deployed model is working, test it out using gcloud to make a prediction. First, save a JSON file with an example from our test set:

%%writefile predictions.json
{
  "instances": [
    [2016.0, 1.0, 346.0, 27.0, 211.0, 4530.0, 86700.0, 132.13, 1289.0, 1408.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0]
  ]
}
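The same request body can also be generated from Python instead of being typed by hand. The sketch below builds the "instances" structure from a list of rows using the standard json module. The build_request helper is hypothetical, and the example row is shortened for illustration, whereas the deployed model expects the full set of features shown above.

```python
import json

def build_request(rows):
    # Wrap the rows in the "instances" key and make every value a float
    # so the payload serializes cleanly to JSON.
    return {"instances": [[float(value) for value in row] for row in rows]}

# Shortened, made-up row. A real request needs one value per model feature.
example_row = [2016, 1, 346, 27.0]
request_body = json.dumps(build_request([example_row]))
print(request_body)
```

In the notebook, a row such as x_test[0].tolist() could be passed in place of the shortened example.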

Test your model by running this gcloud command:

!gcloud beta ai endpoints predict $ENDPOINT_ID \
--json-request=predictions.json \
--region=us-central1

You should see your model's prediction in the output. This particular example was approved, so you should see a value close to 1.

If you'd like to continue using this notebook, it is recommended that you turn it off when not in use. From the Notebooks UI in your Cloud Console, select the notebook and then select Stop:

If you'd like to delete all resources you've created in this lab, start by deleting the notebook instance instead of stopping it.

To delete the endpoint you deployed, navigate to the Endpoints section of your Vertex console and click the delete icon:

To delete the Storage Bucket, using the Navigation menu in your Cloud Console, browse to Storage, select your bucket, and click Delete: