Get insights from structured and unstructured data using the AI-capable BigQuery DataFrames package

About this codelab

Last updated Dec 4, 2024
Written by Tim Swena

1. Overview

In this lab, you will use BigQuery DataFrames from a Python notebook in BigQuery Studio to gain insights from data, making use of Google's generative AI to analyze and visualize unstructured text data.

You will create a Python notebook to categorize and summarize a public customer complaints database. This can be adapted to work on any unstructured text data.

Objectives

In this lab, you learn how to perform the following tasks:

  • Activate and use Python notebooks in BigQuery Studio
  • Connect to BigQuery using the BigQuery DataFrames package
  • Create embeddings from unstructured text data using BigQuery ML and a connection to a text embedding endpoint in Vertex AI
  • Cluster embeddings using BigQuery ML
  • Summarize clusters with an LLM via BigQuery ML

2. Requirements

  • A browser, such as Chrome or Firefox
  • A Google Cloud project with billing enabled

Before you begin

To follow the instructions in this codelab, you'll need a Google Cloud Project with BigQuery Studio enabled and a connected billing account.

  1. In the Google Cloud Console, on the project selector page, select or create a Google Cloud project
  2. Ensure that billing is enabled for your Google Cloud project. Learn how to check if billing is enabled on a project
  3. Follow the instructions to Enable BigQuery Studio for asset management.

Prepare BigQuery Studio

Create an empty notebook and connect it to a runtime.

  1. Go to BigQuery Studio in the Google Cloud Console.
  2. Click the arrow next to the + button.
  3. Select Python notebook.
  4. Close the template selector.
  5. Select + Code to create a new code cell.
  6. Install the latest version of the BigQuery DataFrames package from the code cell. Type the following command.
    %pip install --upgrade bigframes --quiet
    Click the 🞂 button or press Shift + Enter to run the code cell.

3. Read a public dataset

Initialize the BigQuery DataFrames package by running the following in a new code cell:

import bigframes.pandas as bpd

bpd.options.bigquery.ordering_mode = "partial"

Note: in this tutorial, we use the experimental "partial ordering mode", which allows for more efficient queries when used with pandas-like filtering. Some pandas features that require a strict ordering or index may not work.

Consumer Complaint Database

The Consumer Complaint Database is provided on BigQuery through Google Cloud's public dataset program. It is a collection of complaints about consumer financial products and services, collected by the United States Consumer Financial Protection Bureau.

To analyze the Consumer Complaint Database, query the bigquery-public-data.cfpb_complaints.complaint_database table. Use the bigframes.pandas.read_gbq() method to create a DataFrame from a query string or table ID.

Run the following in a new code cell to create a DataFrame named "feedback":

feedback = bpd.read_gbq(
    "bigquery-public-data.cfpb_complaints.complaint_database"
)

Discover basic information about a DataFrame

Use the DataFrame.peek() method to download a small sample of the data.

Run this cell:

feedback.peek()

Expected output:

  date_received                  product ... timely_response  consumer_disputed complaint_id  
0    2014-03-05  Bank account or service ...            True              False       743665  
1    2014-01-21  Bank account or service ...            True              False       678608  
2    2020-12-31          Debt collection ...            True               <NA>      4041190  
3    2014-02-12          Debt collection ...            True              False       714350  
4    2015-02-23          Debt collection ...            True              False      1251358  

Note: head() requires ordering and is generally less efficient than peek() if you want to visualize a sample of data.
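
For example, both of the following return a small pandas sample, but only head() depends on a row order (an illustrative comparison, not from the original codelab):

# peek() downloads a sample of rows without requiring an order.
feedback.peek(10)

# head() relies on a total row order, which generally makes the query slower.
# feedback.head(10)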

Just as with pandas, use the DataFrame.dtypes property to see all available columns and their corresponding data types. These are exposed in a pandas-compatible way.

Run this cell:

feedback.dtypes

Expected output:

date_received                   date32[day][pyarrow]
product                              string[pyarrow]
subproduct                           string[pyarrow]
issue                                string[pyarrow]
subissue                             string[pyarrow]
consumer_complaint_narrative         string[pyarrow]
company_public_response              string[pyarrow]
company_name                         string[pyarrow]
state                                string[pyarrow]
zip_code                             string[pyarrow]
tags                                 string[pyarrow]
consumer_consent_provided            string[pyarrow]
submitted_via                        string[pyarrow]
date_sent_to_company            date32[day][pyarrow]
company_response_to_consumer         string[pyarrow]
timely_response                              boolean
consumer_disputed                            boolean
complaint_id                         string[pyarrow]
dtype: object

The DataFrame.describe() method queries some basic statistics from the DataFrame. Since this DataFrame contains no numeric columns, it shows a summary of the non-null value count and number of distinct values.

Run this cell:

# Exclude some of the larger columns to make the query more efficient.
feedback.drop(columns=[
    "consumer_complaint_narrative",
    "company_public_response",
    "company_response_to_consumer",
]).describe()

Expected output:

           product  subproduct    issue  subissue  company_name    state ...  timely_response  consumer_disputed  complaint_id
count      3458906     3223615  3458906   2759004       3458906  3417792 ...          3458906             768399       3458906
nunique         18          76      165       221          6694       63 ...                2                  2       3458906

4. Exploring the data

Before diving in to look at the actual complaints, use the pandas-like methods on the DataFrame to visualize the data.

Visualize the DataFrame

There are several built-in visualization methods such as DataFrame.plot.hist(). Since this DataFrame mostly contains string and boolean data, we can first do some aggregation to learn more about various columns.

Count how many complaints are received from each state.

complaints_by_state = (
    feedback.groupby("state", as_index=False)
    .size()
    .rename(columns={"size": "total_complaints"})
    .sort_values(by="total_complaints", ascending=False)
)

Convert this to a pandas DataFrame using the DataFrame.to_pandas() method.

complaints_pd = complaints_by_state.head(10).to_pandas()

Use pandas visualization methods on this downloaded DataFrame.

complaints_pd.plot.bar(x="state", y="total_complaints")

bar chart showing California as state with most complaints

Join with other datasets

Previously, you looked at complaints per state, but this loses important context. Some states have larger populations than others. Join with a population dataset such as the US Census Bureau's American Community Survey and the bigquery-public-data.geo_us_boundaries.states table.

us_states = bpd.read_gbq("bigquery-public-data.geo_us_boundaries.states")
us_survey = bpd.read_gbq("bigquery-public-data.census_bureau_acs.state_2020_5yr")

# Ensure there are leading 0s on GEOIDs for consistency across tables.
us_states = us_states.assign(
    geo_id=us_states["geo_id"].str.pad(2, fillchar="0")
)

us_survey = us_survey.assign(
    geo_id=us_survey["geo_id"].str.pad(2, fillchar="0")
)

The American Community Survey identifies states by GEOID. Join with the states table to get the population by two-letter state code.

pops = us_states.set_index("geo_id")[["state"]].join(
    us_survey.set_index("geo_id")[["total_pop"]]
)

Now, join this to the complaints database to compare population with number of complaints.

complaints_and_pops = complaints_by_state.set_index("state").join(
    pops.set_index("state")
)

Create a scatter plot to compare state populations with number of complaints.

(
    complaints_and_pops
    .to_pandas()
    .plot.scatter(x="total_pop", y="total_complaints")
)

a scatter plot comparing population with complaints

A couple of states appear to be outliers when comparing population with the number of complaints. It is left as an exercise for the reader to plot with point labels to identify these. Likewise, come up with some hypotheses for why this might be the case (e.g. different demographics, different number of financial services companies, etc.) and test them.
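
For example, one way to add point labels uses matplotlib's annotate() on the pandas plot; the snippet below is an illustrative sketch, not part of the original codelab:

plot_df = complaints_and_pops.to_pandas()
ax = plot_df.plot.scatter(x="total_pop", y="total_complaints")

# Label each point with its state code (the DataFrame index after the join).
for state, row in plot_df.iterrows():
    ax.annotate(state, (row["total_pop"], row["total_complaints"]))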

5. Calculate embeddings

Often, important information is hidden in unstructured data, such as text, audio, or images. In this example, much of the useful information in the complaints database is contained in the text content of the complaint.

Traditional AI techniques, such as sentiment analysis, "bag of words", and word2vec, can extract some quantitative information from unstructured data. More recently, "vector embedding" models, which are closely related to LLMs, can create a sequence of floating point numbers representing the semantic information of text.
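
As a toy illustration of the idea (the vectors below are made up, and real models output hundreds of dimensions), texts with similar meanings map to vectors that point in similar directions:

import numpy as np

# Hypothetical 4-dimensional "embeddings" for two similar sentences.
v1 = np.array([0.10, 0.80, -0.20, 0.40])
v2 = np.array([0.05, 0.75, -0.10, 0.50])

# A cosine similarity close to 1.0 indicates similar semantic meaning.
similarity = v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2))
print(similarity)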

Select a subset of the database

Running a vector embedding model uses more resources than other operations. To reduce costs and quota issues, select a subset of the data for the rest of this tutorial.

import bigframes.pandas as bpd

bpd.options.bigquery.ordering_mode = "partial"

feedback = bpd.read_gbq(
    "bigquery-public-data.cfpb_complaints.complaint_database"
)

# Note: if not using ordering_mode = "partial", you must specify the columns
# and filters in read_gbq for them to affect query efficiency.
# feedback = bpd.read_gbq(
#     "bigquery-public-data.cfpb_complaints.complaint_database",
#     columns=["consumer_complaint_narrative"],
#     filters=[
#         ("consumer_complaint_narrative", "!=", ""),
#         ("date_received", "==", "2022-12-01")])

feedback.shape

There are about 1,000 complaints submitted on 2022-12-01 compared to nearly 3.5 million rows in the total database (check with feedback.shape).

Select just the data for 2022-12-01 and only the consumer_complaint_narrative column.

import datetime

feedback = feedback[
    # Filter rows by passing in a boolean Series.
    (feedback["date_received"] == datetime.date(2022, 12, 1))
    & ~(feedback["date_received"].isnull())
    & ~(feedback["consumer_complaint_narrative"].isnull())
    & (feedback["consumer_complaint_narrative"] != "")
    & (feedback["state"] == "CA")

    # Uncomment the following if using free credits for a workshop.
    # Billing accounts with free credits have limited Vertex AI quota.
    # & (feedback["product"] == "Mortgage")
][
    # Filter columns by passing in a list of strings.
    ["consumer_complaint_narrative"]
]

feedback.shape

The drop_duplicates method from pandas requires a total ordering of rows because it tries to select either the first or last matching row and preserve the index associated with it.

Instead, aggregate with a call to the groupby method to de-duplicate rows.

feedback = (
    feedback.groupby("consumer_complaint_narrative", as_index=False)
    .size()
)[["consumer_complaint_narrative"]]

feedback.shape

Generate embeddings

BigQuery DataFrames generates embedding vectors via the TextEmbeddingGenerator class. This is based on the ML.GENERATE_EMBEDDING method in BigQuery ML, which calls the text embedding models provided by Vertex AI.

from bigframes.ml.llm import TextEmbeddingGenerator

embedding_model = TextEmbeddingGenerator(model_name="text-embedding-004")
feedback_embeddings = embedding_model.predict(feedback)

Take a look at what these embeddings look like. These vectors represent the semantic meaning of the text as it is understood by the text embedding model.

feedback_embeddings.peek()

Expected output:

                        ml_generate_embedding_result  \
0  [ 7.36380890e-02  2.11779331e-03  2.54309829e-...  
1  [-1.10935252e-02 -5.53950183e-02  2.01338865e-...  
2  [-7.85628427e-03 -5.39347418e-02  4.51385677e-...  
3  [ 0.02013054 -0.0224789  -0.00164843  0.011354...  
4  [-1.51684484e-03 -5.02693094e-03  1.72322839e-...  

These vectors have many dimensions. Take a look at a single embedding vector:

feedback_embeddings["ml_generate_embedding_result"].peek().iloc[0]
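
To confirm the dimensionality, you can check the length of a single vector (as of this writing, text-embedding-004 returns 768-dimensional vectors; treat the exact number as model-dependent):

vector = feedback_embeddings["ml_generate_embedding_result"].peek().iloc[0]
print(len(vector))  # Expected to be 768 for text-embedding-004.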

Embedding generation operates under a "partial success" contract: some rows may have errors and not generate an embedding. Error messages are exposed in the ml_generate_embedding_status column; an empty string means no error occurred.

Filter the embeddings to only include rows where no error occurred.

mask = feedback_embeddings["ml_generate_embedding_status"] == ""
valid_embeddings = feedback_embeddings[mask]
valid_embeddings.shape

6. Cluster using text embeddings

Now, cluster the embeddings using k-means. For this demo, use an arbitrary number of groups (aka centroids). A production quality solution should tune the number of centroids using a technique such as the Silhouette method.
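
As an optional aside, such a tuning loop might look like the following sketch, which downloads a sample of vectors and evaluates candidate values of k locally with scikit-learn (an assumption; scikit-learn is not part of this codelab). The codelab itself simply uses five clusters, as shown in the next cell.

# Optional sketch: estimate a reasonable k with scikit-learn's silhouette score.
import numpy as np
from sklearn.cluster import KMeans as SkKMeans
from sklearn.metrics import silhouette_score

# Download a small sample of embedding vectors to tune k locally.
sample = valid_embeddings["ml_generate_embedding_result"].peek(500)
X = np.vstack(sample.to_list())

for k in range(2, 9):
    labels = SkKMeans(n_clusters=k, n_init=10).fit_predict(X)
    print(k, silhouette_score(X, labels))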

from bigframes.ml.cluster import KMeans

num_clusters = 5
cluster_model = KMeans(n_clusters=num_clusters)
cluster_model.fit(valid_embeddings["ml_generate_embedding_result"])
clusters = cluster_model.predict(valid_embeddings)
clusters.peek()

Remove any embedding failures.

mask = clusters["ml_generate_embedding_status"] == ""
clusters = clusters[mask]

Peek and see the distribution of comments per centroid.

clusters.groupby("CENTROID_ID").size()
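
Optionally, visualize these counts with the same to_pandas() pattern used earlier (an illustrative extra, not part of the original codelab):

cluster_sizes = clusters.groupby("CENTROID_ID").size().to_pandas()
cluster_sizes.plot.bar()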

7. Summarize the clusters

Feed Gemini some of the comments associated with each centroid and ask it to summarize the complaints. Prompt engineering is an emerging area, but there are good examples on the Internet, such as https://www.promptingguide.ai/.

from bigframes.ml.llm import GeminiTextGenerator

preamble = "What is the main concern in this list of user complaints:"
suffix = "Write the main issue using a formal tone."

# Now let's sample the raw comments and get the LLM to summarize them.
prompts = []
for centroid_id in range(1, num_clusters + 1):
    cluster = clusters[clusters["CENTROID_ID"] == centroid_id]
    comments = "\n".join(["- {0}".format(x) for x in cluster.content.peek(40)])
    prompts.append("{}:\n{}\n{}".format(preamble, comments, suffix))

prompt_df = bpd.DataFrame(prompts)
gemini = GeminiTextGenerator(model_name="gemini-1.5-flash-001")
issues = gemini.predict(X=prompt_df, temperature=0.0)
issues.peek()

Use Gemini to write a report from the summaries.

from IPython.display import display, Markdown

prompt = "Turn this list of issues into a short, concise report:\n"
for value in issues["ml_generate_text_llm_result"]:
    prompt += "- {}\n".format(value)
prompt += "Using a formal tone, write a markdown text format report."

summary_df = bpd.DataFrame([prompt])
summary = gemini.predict(X=summary_df, temperature=0.0)

report = summary["ml_generate_text_llm_result"].values[0]
display(Markdown(report))

8. Clean up

If you have created a new Google Cloud project for this tutorial, you can delete it to prevent additional charges for tables or other resources created.

9. Congratulations!

You have analyzed structured and unstructured data using BigQuery DataFrames. Along the way, you've explored Google Cloud's Public Datasets, Python notebooks in BigQuery Studio, BigQuery ML, Vertex AI, and the natural language to Python features of BigQuery Studio. Fantastic job!

Next steps