About this codelab
1. Overview
In this lab, you will use BigQuery DataFrames from a Python notebook in BigQuery Studio to gain insights from data with Python, making use of Google's generative AI to analyze and visualize unstructured text data.
You will create a Python notebook to categorize and summarize a public customer complaints database. This can be adapted to work on any unstructured text data.
Objectives
In this lab, you learn how to perform the following tasks:
- Activate and use Python notebooks in BigQuery Studio
- Connect to BigQuery using the BigQuery DataFrames package
- Create embeddings from unstructured text data using BigQuery ML and a connection to a text embedding endpoint in Vertex AI
- Cluster embeddings using BigQuery ML
- Summarize clusters with an LLM via BigQuery ML
2. Requirements
Before you begin
To follow the instructions in this codelab, you'll need a Google Cloud Project with BigQuery Studio enabled and a connected billing account.
- In the Google Cloud Console, on the project selector page, select or create a Google Cloud project
- Ensure that billing is enabled for your Google Cloud project. Learn how to check if billing is enabled on a project
- Follow the instructions to Enable BigQuery Studio for asset management.
Prepare BigQuery Studio
Create an empty notebook and connect it to a runtime.
- Go to BigQuery Studio in the Google Cloud Console.
- Click the ▼ next to the + button.
- Select Python notebook.
- Close the template selector.
- Select + Code to create a new code cell.
- Install the latest version of the BigQuery DataFrames package by typing the following command in the code cell:
%pip install --upgrade bigframes --quiet
- Click the 🞂 button or press Shift + Enter to run the code cell.
3. Read a public dataset
Initialize the BigQuery DataFrames package by running the following in a new code cell:
import bigframes.pandas as bpd
bpd.options.bigquery.ordering_mode = "partial"
Note: in this tutorial, we use the experimental "partial ordering mode", which allows for more efficient queries when used with pandas-like filtering. Some pandas features that require a strict ordering or index may not work.
Consumer Complaint Database
The Consumer Complaint Database is provided on BigQuery through Google Cloud's public dataset program. It is a collection of complaints about consumer financial products and services, collected by the United States Consumer Financial Protection Bureau.
To analyze the Consumer Complaint Database, query the bigquery-public-data.cfpb_complaints.complaint_database table in BigQuery. Use the bigframes.pandas.read_gbq() method to create a DataFrame from a query string or table ID.
Run the following in a new code cell to create a DataFrame named "feedback":
feedback = bpd.read_gbq(
"bigquery-public-data.cfpb_complaints.complaint_database"
)
Discover basic information about a DataFrame
Use the DataFrame.peek() method to download a small sample of the data.
Run this cell:
feedback.peek()
Expected output:
date_received product ... timely_response consumer_disputed complaint_id
0 2014-03-05 Bank account or service ... True False 743665
1 2014-01-21 Bank account or service ... True False 678608
2 2020-12-31 Debt collection ... True <NA> 4041190
3 2014-02-12 Debt collection ... True False 714350
4 2015-02-23 Debt collection ... True False 1251358
Note: head() requires ordering and is generally less efficient than peek() if you want to visualize a sample of data.
Just as with pandas, use the DataFrame.dtypes property to see all available columns and their corresponding data types. These are exposed in a pandas-compatible way.
Run this cell:
feedback.dtypes
Expected output:
date_received date32[day][pyarrow]
product string[pyarrow]
subproduct string[pyarrow]
issue string[pyarrow]
subissue string[pyarrow]
consumer_complaint_narrative string[pyarrow]
company_public_response string[pyarrow]
company_name string[pyarrow]
state string[pyarrow]
zip_code string[pyarrow]
tags string[pyarrow]
consumer_consent_provided string[pyarrow]
submitted_via string[pyarrow]
date_sent_to_company date32[day][pyarrow]
company_response_to_consumer string[pyarrow]
timely_response boolean
consumer_disputed boolean
complaint_id string[pyarrow]
dtype: object
The DataFrame.describe() method queries some basic statistics from the DataFrame. Since this DataFrame contains no numeric columns, it shows a summary of the non-null value count and number of distinct values.
Run this cell:
# Exclude some of the larger columns to make the query more efficient.
feedback.drop(columns=[
"consumer_complaint_narrative",
"company_public_response",
"company_response_to_consumer",
]).describe()
Expected output:
product subproduct issue subissue company_name state ... timely_response consumer_disputed complaint_id
count 3458906 3223615 3458906 2759004 3458906 3417792 ... 3458906 768399 3458906
nunique 18 76 165 221 6694 63 ... 2 2 3458906
4. Exploring the data
Before diving in to look at the actual complaints, use the pandas-like methods on the DataFrame to visualize the data.
Visualize the DataFrame
There are several built-in visualization methods such as DataFrame.plot.hist(). Since this DataFrame mostly contains string and boolean data, we can first do some aggregation to learn more about various columns.
Count how many complaints are received from each state.
complaints_by_state = (
feedback.groupby(
"state", as_index=False,
).size()
.rename(columns={"size": "total_complaints"})
.sort_values(by="total_complaints", ascending=False)
)
Convert this to a pandas DataFrame using the DataFrame.to_pandas() method.
complaints_pd = complaints_by_state.head(10).to_pandas()
Use pandas visualization methods on this downloaded DataFrame.
complaints_pd.plot.bar(x="state", y="total_complaints")
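The DataFrame.plot.hist() method mentioned earlier works on numeric data, which this table mostly lacks. As a minimal sketch (not part of the original walkthrough, and assuming you want to look at the length of each complaint narrative), you could derive a numeric column server-side and plot its distribution. The narrative_length variable is a helper introduced only for this example.
# Sketch: distribution of complaint narrative lengths.
# narrative_length is a helper Series introduced only for this example.
narrative_length = feedback["consumer_complaint_narrative"].str.len()
narrative_length.plot.hist(bins=50)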
Join with other datasets
Previously, you looked at complaints per state, but this loses important context. Some states have larger populations than others. Join with a population dataset such as the US Census Bureau's American Community Survey and the bigquery-public-data.geo_us_boundaries.states table.
us_states = bpd.read_gbq("bigquery-public-data.geo_us_boundaries.states")
us_survey = bpd.read_gbq("bigquery-public-data.census_bureau_acs.state_2020_5yr")
# Ensure there are leading 0s on GEOIDs for consistency across tables.
us_states = us_states.assign(
geo_id=us_states["geo_id"].str.pad(2, fillchar="0")
)
us_survey = us_survey.assign(
geo_id=us_survey["geo_id"].str.pad(2, fillchar="0")
)
The American Community Survey identifies states by GEOID. Join with the states table to get the population by two-letter state code.
pops = us_states.set_index("geo_id")[["state"]].join(
us_survey.set_index("geo_id")[["total_pop"]]
)
Now, join this to the complaints database to compare population with number of complaints.
complaints_and_pops = complaints_by_state.set_index("state").join(
pops.set_index("state")
)
Create a scatter plot to compare state populations with number of complaints.
(
complaints_and_pops
.to_pandas()
.plot.scatter(x="total_pop", y="total_complaints")
)
A couple of states appear to be outliers when comparing population with the number of complaints. It is left as an exercise for the reader to plot with point labels to identify these. Likewise, come up with some hypotheses for why this might be the case (e.g. different demographics, different number of financial services companies, etc.) and test them.
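One possible approach to the labeling exercise, shown here as a minimal sketch rather than a definitive solution, is to download the small joined table and annotate each point with matplotlib. The dropna() call and the axis handling are illustrative assumptions.
# Sketch: label each point with its state code to spot the outliers.
labeled = complaints_and_pops.to_pandas().reset_index().dropna()
ax = labeled.plot.scatter(x="total_pop", y="total_complaints")
for _, row in labeled.iterrows():
    ax.annotate(row["state"], (row["total_pop"], row["total_complaints"]))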
5. Calculate embeddings
Often, important information is hidden in unstructured data, such as text, audio, or images. In this example, much of the useful information in the complaints database is contained in the text content of the complaint.
AI and traditional techniques, such as sentiment analysis, "bag of words", and word2vec, can extract some quantitative information from unstructured data. More recently, "vector embedding" models, which are closely related to LLMs, can create a sequence of floating point numbers representing the semantic information of text.
Select a subset of the database
Running a vector embedding model uses more resources than other operations. To reduce costs and quota issues, select a subset of the data for the rest of this tutorial.
import bigframes.pandas as bpd
bpd.options.bigquery.ordering_mode = "partial"
feedback = bpd.read_gbq(
"bigquery-public-data.cfpb_complaints.complaint_database"
)
# Note: if not using ordering_mode = "partial", you must pass the columns and
# filters to read_gbq (as shown below) for them to affect query efficiency.
# feedback = bpd.read_gbq(
# "bigquery-public-data.cfpb_complaints.complaint_database",
# columns=["consumer_complaint_narrative"],
# filters= [
# ("consumer_complaint_narrative", "!=", ""),
# ("date_received", "==", "2022-12-01")])
feedback.shape
There are about 1,000 complaints submitted on 2022-12-01 compared to nearly 3.5 million rows in the total database (check with feedback.shape).
Select just the complaints submitted from California on 2022-12-01 and keep only the consumer_complaint_narrative column.
import datetime
feedback = feedback[
# Filter rows by passing in a boolean Series.
(feedback["date_received"] == datetime.date(2022, 12, 1))
& ~(feedback["date_received"].isnull())
& ~(feedback["consumer_complaint_narrative"].isnull())
& (feedback["consumer_complaint_narrative"] != "")
& (feedback["state"] == "CA")
# Uncomment the following if using free credits for a workshop.
# Billing accounts with free credits have limited Vertex AI quota.
# & (feedback["product"] == "Mortgage")
][
# Filter columns by passing in a list of strings.
["consumer_complaint_narrative"]
]
feedback.shape
The drop_duplicates method from pandas requires a total ordering of rows because it tries to select either the first or last matching row and preserve the index associated with it. Instead, aggregate with a call to the groupby method to de-duplicate rows.
feedback = (
feedback.groupby("consumer_complaint_narrative", as_index=False)
.size()
)[["consumer_complaint_narrative"]]
feedback.shape
Generate embeddings
BigQuery DataFrames generates embedding vectors via the TextEmbeddingGenerator class. This is based on the ML.GENERATE_EMBEDDING method in BigQuery ML, which calls the text embedding models provided by Vertex AI.
from bigframes.ml.llm import TextEmbeddingGenerator
embedding_model = TextEmbeddingGenerator(
model_name="text-embedding-004"
)
feedback_embeddings = embedding_model.predict(feedback)
Take a look at the resulting embeddings. These vectors represent the semantic meaning of the text as it is understood by the text embedding model.
feedback_embeddings.peek()
Expected output:
ml_generate_embedding_result \
0 [ 7.36380890e-02 2.11779331e-03 2.54309829e-...
1 [-1.10935252e-02 -5.53950183e-02 2.01338865e-...
2 [-7.85628427e-03 -5.39347418e-02 4.51385677e-...
3 [ 0.02013054 -0.0224789 -0.00164843 0.011354...
4 [-1.51684484e-03 -5.02693094e-03 1.72322839e-...
These vectors have many dimensions. Take a look at a single embedding vector:
feedback_embeddings["ml_generate_embedding_result"].peek().iloc[0]
Embeddings generation operates under a "partial success" contract. This means some rows may have errors and not generate an embedding. Error messages are exposed in the ml_generate_embedding_status column. Empty means no errors.
Filter the embeddings to only include rows where no error occurred.
mask = feedback_embeddings["ml_generate_embedding_status"] == ""
valid_embeddings = feedback_embeddings[mask]
valid_embeddings.shape
6. Cluster using text embeddings
Now, cluster the embeddings using k-means. For this demo, use an arbitrary number of groups (also known as centroids). A production-quality solution should tune the number of centroids using a technique such as the Silhouette method; a sketch of one way to do that appears at the end of this section.
from bigframes.ml.cluster import KMeans
num_clusters = 5
cluster_model = KMeans(n_clusters=num_clusters)
cluster_model.fit(valid_embeddings["ml_generate_embedding_result"])
clusters = cluster_model.predict(valid_embeddings)
clusters.peek()
Remove any embedding failures.
mask = clusters["ml_generate_embedding_status"] == ""
clusters = clusters[mask]
Peek and see the distribution of comments per centroid.
clusters.groupby("CENTROID_ID").size()
7. Summarize the clusters
Feed some of the comments associated with each centroid to Gemini and ask it to summarize the complaints. Prompt engineering is an emerging area, but there are good examples on the Internet, like https://www.promptingguide.ai/.
from bigframes.ml.llm import GeminiTextGenerator
preamble = "What is the main concern in this list of user complaints:"
suffix = "Write the main issue using a formal tone."
# Now let's sample the raw comments and get the LLM to summarize them.
prompts = []
for centroid_id in range(1, num_clusters + 1):
    cluster = clusters[clusters["CENTROID_ID"] == centroid_id]
    comments = "\n".join(["- {0}".format(x) for x in cluster.content.peek(40)])
    prompts.append("{}:\n{}\n{}".format(preamble, comments, suffix))
prompt_df = bpd.DataFrame(prompts)
gemini = GeminiTextGenerator(model_name="gemini-1.5-flash-001")
issues = gemini.predict(X=prompt_df, temperature=0.0)
issues.peek()
Use Gemini to write a report from the summaries.
from IPython.display import display, Markdown
prompt = "Turn this list of issues into a short, concise report:"
for value in issues["ml_generate_text_llm_result"]:
prompt += "- {}".format(value)
prompt += "Using a formal tone, write a markdown text format report."
summary_df = bpd.DataFrame([prompt])
summary = gemini.predict(X=summary_df, temperature=0.0)
report = summary["ml_generate_text_llm_result"].values[0]
display(Markdown(report))
8. Clean up
If you have created a new Google Cloud project for this tutorial, you can delete it to prevent additional charges for tables or other resources created.
9. Congratulations!
You have analyzed structured and unstructured data using BigQuery DataFrames. Along the way you've explored Google Cloud's Public Datasets, Python notebooks in BigQuery Studio, BigQuery ML, Vertex AI, and natural language to Python features of BigQuery Studio. Fantastic job!
Next steps
- Try generating Python code in your notebook. Python notebooks in BigQuery Studio are powered by Colab Enterprise. Hint: I find asking for help generating test data to be quite useful.
- Explore the sample notebooks for BigQuery DataFrames on GitHub.
- Create a schedule to run a notebook in BigQuery Studio.
- Deploy a Remote Function with BigQuery DataFrames to integrate third-party Python packages with BigQuery.